[Pitch] Regular Expression Literals

Michael_Ilseman · January 8, 2022, 11:40pm

That's right, there is active development going on right now. I've been meaning to post a version 2 of this pitch, but there hasn't been enough new information quite yet. Still, there are a lot more details that I can share right now.

This pitch discusses 3 topics:

The regex syntax supported inside the literal
The library-extensibility story via protocols
The choice of delimiter around the literal

Supported syntax

The regex parser is up here. It is written in Swift and architected as a standalone, zero-dependency library, which is used by the Swift compiler. This means it can be used by other source tools, or really anyone who wishes to parse and manipulate regexes the same way the Swift compiler does. I want to land automatic refactoring to result builders in that library as well, such that someone could conceivably write a simple command line tool to do the conversion. Note that all of this is still very much in-progress.

The parser tracks fine-grained source location information (e.g. useful for rich syntax highlighting) and the produced AST also has a nice semantic API, of which the backend engine's bytecode compiler is one obvious client.

The status of the currently supported syntax is tracked here. In short, we are supporting a syntactic superset of:

PCRE2, an "industry standard" of sorts, and a rough superset of Perl, Python, etc.
Oniguruma, an internationalization-oriented engine with some modern features
ICU, used by NSRegularExpression, a Unicode-focused engine
Our interpretation of UTS#18's guidance, which is about semantics, but we can infer syntactic feature sets.

These engines aren't strictly compatible with each other: e.g. a set operator like && would, in PCRE2, just be a redundant repetition of the set member &. But parsing the superset makes it easy to add compatibility modes in the future if we want to.
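To make the && example concrete, here is a minimal sketch of the two readings of a character-class body. Both helper functions are hypothetical and exist only for illustration; they are not part of the parser, and they assume the class body contains only literal characters (no ranges or escapes).

```swift
// Hypothetical helpers, not the real parser's API.
// Assumption: the class body holds only literal characters.

// PCRE2-style reading: every character is a set member, so "&&" just
// contributes the member "&" (stated twice).
func pcre2StyleMembers(_ body: String) -> Set<Character> {
    Set(body)
}

// Set-operator reading: "&&" separates operands that are intersected.
func intersectionStyleMembers(_ body: String) -> Set<Character> {
    let chars = Array(body)
    var operands: [Set<Character>] = [[]]
    var i = 0
    while i < chars.count {
        if chars[i] == "&", i + 1 < chars.count, chars[i + 1] == "&" {
            operands.append([])   // start a new operand after "&&"
            i += 2
        } else {
            operands[operands.count - 1].insert(chars[i])
            i += 1
        }
    }
    return operands.dropFirst().reduce(operands[0]) { $0.intersection($1) }
}

let asMembers = pcre2StyleMembers("abc&&bcd")             // a, b, c, d, &
let asIntersection = intersectionStyleMembers("abc&&bcd") // b, c
```

The same source text yields two different sets, which is why a parser that recognizes both readings can later choose between them with a compatibility mode.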

We have yet to define Swift's preferred spelling for anything that can be written multiple ways. I think it's clear we'll prefer \u{301} for Unicode scalar literals, and likely offer some automated way to convert other syntaxes into that one (e.g. a fixit). The preferred syntax for things like named groups is still an open question. It's also a question that can be decided at any point in the process, given that we successfully parse a superset of these spellings.
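For context, \u{301} is the spelling Swift string literals already use for Unicode scalars, so preferring it in regexes would keep the two literal kinds consistent. The snippet below only demonstrates the existing string-literal behavior; it says nothing about how the regex engine will treat the scalar.

```swift
// "\u{301}" is Swift's string-literal spelling for U+0301 COMBINING ACUTE ACCENT.
let decomposed = "e\u{301}"

// The string holds two Unicode scalars...
let scalarCount = decomposed.unicodeScalars.count

// ...that form one grapheme cluster, i.e. a single Character.
let characterCount = decomposed.count

// String comparison uses canonical equivalence, so this equals precomposed U+00E9.
let matchesPrecomposed = decomposed == "\u{E9}"
```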

By "supported syntax", I'm referring to what we will successfully parse. The backend engine might not yet have support for a parsed feature, and some features may arrive over multiple releases. But we will parse it, recognize the feature, and deliver a targeted diagnostic about what exactly is unsupported. Supported features in the engine are tracked here, but otherwise I've been treating this thread as geared around the literal syntax. The parser will naturally support things before the engine does, as parsing is much easier, and it's better to parse the entire language you want up front than to try to incrementally evolve a parser over time.
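The split between "parsed" and "supported by the engine" might look roughly like the sketch below. Everything here is hypothetical: the Node enum, the Unsupported error, and the compile function are toy stand-ins, not the real AST or bytecode compiler, and the choice of backreferences as the unsupported feature is only an assumption for the example.

```swift
// Hypothetical, heavily simplified AST; not the real parser's types.
enum Node {
    case literal(Character)
    case concatenation([Node])
    case backreference(Int)   // parsed fine, but assumed unsupported by this toy backend
}

// A targeted diagnostic naming exactly which feature is unsupported.
struct Unsupported: Error, Equatable {
    let feature: String
}

// Toy "compiler": lowers supported nodes, rejects recognized-but-unsupported ones.
func compile(_ node: Node) throws -> [String] {
    switch node {
    case .literal(let c):
        return ["match \(c)"]
    case .concatenation(let children):
        return try children.flatMap { try compile($0) }
    case .backreference(let n):
        throw Unsupported(feature: "backreference \\\(n)")
    }
}

var diagnostic: Unsupported? = nil
do {
    _ = try compile(.concatenation([.literal("a"), .backreference(1)]))
} catch let error as Unsupported {
    diagnostic = error   // the feature was recognized, so the error can name it
} catch {
    // No other error types are thrown in this sketch.
}
```

Because the feature already exists as a node in the tree, the backend can report precisely what it cannot handle instead of failing with a generic syntax error.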

There is also an "experimental" syntax, which adds many of the kinds of affordances you'd expect in a modern programming environment. Regex literals are sometimes more like miniature program literals than data literals. However, a common concern is that this could be a "slippery slope", where every convenience you add technically breaks compatibility with a set of traditional-syntax regexes, diluting the value of this effort. The experimental syntax explores a series of these simple syntactic features to find what's a local maximum and what's sinking into this uncanny valley. It has support for non-semantic whitespace, using " for escaping/literals, and some minor group syntax tweaks. It's otherwise not formally part of this pitch.

Library-extensibility

We're jettisoning the fine-grained protocol approach that was originally pitched. It's proving to be unnecessary for the stdlib thanks to this library-oriented architecture for the parser. Most conceivable conformers (e.g. a libPCRE2 wrapper) would just want to pass off the raw, unprocessed content of the string to someone else anyway.

I'm still very interested and excited by library-extensibility here. I think making the parser's AST available to libraries is a much better extensibility story than serializing the information through low-level builder.buildConcatenate() calls. This can be evolved over time, and though we're designing for this to happen in the future, it will likely land later than more fundamental functionality.

Choice of delimiter

No news here. The prototype uses '/regex/' for traditional-syntax regexes and '|regex|' for experimental-syntax regexes, because ' is easy for Swift's lexer to recognize.