[Pitch] Regular Expression Literals

wes1 · October 19, 2021, 12:12am

Like others, I'm excited by typed captures (out of scope here) but nervous about adopting PCRE over other languages. Let's step back a bit to appreciate a modest variation on Kyle's proposal.

Long-term there may be a large class of "string literals with implicit compilation" worth (eventually) supporting as a family.

Medium-term, compatibility with existing regexp languages could be key to interfacing with external UI and (logging) systems. I'd also like to support making breaking changes as required for fixes and language evolution. (i.e., many languages, possibly-evolving?) I'm a little worried about the confusion caused by different libraries providing different semantics for the same syntax and would try to accommodate differences explicitly.

Short-term, it would be nice to move quickly on a user-friendly subset for the vast majority of Swift developers who just need something simple for greenfield (if not unschooled) use.

To weigh in on delimiters (inspired by Kyle Sluder's suggestion #regex(..))...

# indicates something macro-like interpreted at compile-time.
backtick is quote-like but rarely used in text (i.e., rarely escaped). I'm hoping the lexer can distinguish this use from whitespace-delimited use for quoting language keywords.
I'm hoping the reader does not have to balance parentheses to find the end of a literal.

All that said, over time Swift would evolve:

#find`regexp`     // 1st simple Swift interpolation-like
#`regexp`         // perhaps canonical/short form of the above?
#match`regexp`    // perhaps the match-all-input form of find
#replace`regexp`  // find with substitution
#pcre`regexp`     // perl-compatible
#javare`regexp`   // java-compatible
#template`text`   // another compiled string literal with features beyond interpolation
...

(I hesitate to suggest the compiler could or should plug-in external parsers, but perhaps there's at least an implementation/testing benefit to supporting that.)

hamishknight · October 19, 2021, 12:33pm

Thanks everyone for the feedback given so far! I'm working on updating the pitch to incorporate the feedback gathered, in particular more discussion of the alternative options for delimiters, and nailing down the PCRE superset we want to parse (as well as discussion of alternative syntaxes proposed).

This was called out in the pitch:

Impact of using / as the delimiter

On comment syntax

Single line comments use the syntax // , which would conflict with the spelling for an empty regex literal. As such, an empty regex literal would be forbidden.

I think we probably want to consider that as future work for now. As already pointed out, we wouldn't be able to re-use the \(...) syntax, we'd need to find something that's not already valid in the superset of PCRE that we end up parsing.

If we did end up supporting it, I think it would probably be left up to the RegexLiteralProtocol conforming builder types to decide what types of interpolations they want to support, and how they want to handle the interpolated values, similar to how StringInterpolationProtocol works.
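For comparison, here is a minimal sketch of how StringInterpolationProtocol already lets a conforming type decide which interpolations it accepts and how it treats the values. PatternSource, its unescaped: label, and the choice to escape plain strings are hypothetical names and decisions made up for illustration; none of this is part of the pitch or of RegexLiteralProtocol.

```swift
import Foundation

// Hypothetical wrapper type: it only collects regex source text.
struct PatternSource: ExpressibleByStringInterpolation {
    var pattern: String

    struct StringInterpolation: StringInterpolationProtocol {
        var pattern = ""

        init(literalCapacity: Int, interpolationCount: Int) {
            pattern.reserveCapacity(literalCapacity)
        }

        // Literal segments are regex syntax and pass through unchanged.
        mutating func appendLiteral(_ literal: String) {
            pattern += literal
        }

        // Plain strings are treated as text to match, so metacharacters are escaped.
        mutating func appendInterpolation(_ value: String) {
            pattern += NSRegularExpression.escapedPattern(for: value)
        }

        // An explicit label opts in to splicing regex syntax verbatim.
        mutating func appendInterpolation(unescaped value: String) {
            pattern += value
        }
    }

    init(stringLiteral value: String) {
        pattern = value
    }

    init(stringInterpolation: StringInterpolation) {
        pattern = stringInterpolation.pattern
    }
}

let host = "hackers.com"
let localPart = #"\S+"#
let source: PatternSource = #"^\s+\#(unescaped: localPart)@\#(host)$"#

let compiledSource = try! NSRegularExpression(pattern: source.pattern)
func matchesWhole(_ text: String) -> Bool {
    let range = NSRange(text.startIndex..., in: text)
    return compiledSource.firstMatch(in: text, options: [], range: range) != nil
}
```

The conforming type, not the compiler, decides here that a bare string is escaped while the labeled form is spliced in as-is, which is the kind of freedom described above.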

breathe:

In my opinion, a single design change would illuminate a path that could correct all these flaws and lead to a vastly superior literal design in the end ...

terminals inside a regex literal must be inside of a delimiter (example delimiter: double-quote ")

This change would immediately allow for actual identifiers to be used within the literal (when not inside the terminal delimiters) -- so instead of \w, \s, and . we could have 'word', 'space', 'any'. Instead of /#\(regex1)|#\(regex2)/ interpolation we could have sigil-free interpolation ... Instead of escape characters everywhere for all regexp control sigils, we simply have quotes around the terminal usages, which makes it immediately clear which sigils are being used as terminals and which not ...
let identifier = /alpha word*/
let hexdigit = /"0x" ("A"..."F" | "a"..."f")+/
let someDumbFormatExpression = /identifier "=" identifier ("+" | "*" | "-") hexdigit newline/

While I do kind of like this design, I feel that if we're introducing a new custom regex syntax for users to learn, we'd rather they pick up the more versatile Pattern DSL. As Kyle notes, we are definitely aware of the shortcomings of using PCRE syntax. However, for simple regexes we feel that the familiarity and ubiquity of the syntax outweighs these shortcomings. And for more complex regexes, we feel users would be better served by the more general Pattern DSL.

gscarr · October 21, 2021, 3:29pm

As a former regex user in other languages I really like this idea, but I don't immediately associate / as a regex delimiter. While #/ probably solves the aforementioned parsing ambiguities, it seems visually heavy and somewhat ugly to me, and I shudder at the idea of having to put a ; on a preceding line.
Why not have something like an r immediately preceding any existing string literal to make it a regex? Since it is a regex, one would also want to automatically make the string a raw string. So a regex could be something like r"^\s+(\S+)@(\S+).com" for a simple " name@hackers.com", but you could also use r###....### or r"""...""" for delimiting more complicated expressions or multiline literals when you really need to. If people think that an r alone is too simple, one could use "re" as more of a cognitive trigger for regexes.
Since I'm here, I'd like to add that I've found named capture to be particularly useful in interpreting matches and deciphering the regex you wrote last week. I would prefer a simple "< name >" syntax (Python's P prefix is ugly and Perl's ?' or 's alone seem too light) so that the former could become something more understandable like:
NOTE: I ended up needing to insert spaces around the capture names to prevent the forum software from stripping them (which wouldn't be needed in Swift)

Regexes are never going to be pretty, but this seems suitably lightweight and Swifty (with or without the extra spaces):

re"^\s+(< name >\S+)@(< company >\S+).com"
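To make the named-capture idea concrete with something that runs today, here is a rough sketch using NSRegularExpression, which accepts the ICU spelling (?<name>…) inside a raw string. The re"…" literal above is only a proposal; the variable names below are illustrative, and the dot before com is escaped here so that it matches a literal period.

```swift
import Foundation

// Same shape as the example above, written with ICU named groups.
let namedPattern = #"^\s+(?<name>\S+)@(?<company>\S+)\.com"#
let address = " name@hackers.com"

let namedRegex = try! NSRegularExpression(pattern: namedPattern)
let fullRange = NSRange(address.startIndex..., in: address)

// Collect each capture under its group name instead of a positional index.
var captured: [String: String] = [:]
if let match = namedRegex.firstMatch(in: address, options: [], range: fullRange) {
    for key in ["name", "company"] {
        if let range = Range(match.range(withName: key), in: address) {
            captured[key] = String(address[range])
        }
    }
}
```

Looking captures up by name keeps the call site readable even if groups are later added or reordered.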

trs · October 23, 2021, 9:40am

breathe:

let identifier = #re { alpha word* }
let hexdigit = #re { "0x" ("A"..."F" | "a"..."f")+ }
let someDumbExpression = #re { identifier "=" identifier ("+" | "*" | "-") hexdigit newline }

Love it!

benrimmington · October 23, 2021, 2:25pm

Some of the syntax supported by PCRE2 may be misleading.

- \0… is always the null character in Swift; but it can also begin an octal code in PCRE2. Could the compiler suggest a \o{…} fix-it for the latter?
- \8… and \9… are always backreferences; but \1… through \7… can also begin octal codes. Could the compiler suggest \g{…} and \o{…} fix-its?
- \x… and \x{…} are hexadecimal codes. Should the braced version be preferred?
- \N… is not-a-newline; \N{U+…} is a Unicode scalar code. Could the compiler suggest a \u{…} fix-it for the latter? (This uses an optional PCRE2 syntax for JavaScript compatibility.)
- Uppercased usually means negated, but not always:
  - \r is carriage return; \R is any Unicode newline sequence.
  - \a is alarm or bell; \A is start-of-subject anchor.
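As a small illustration of the first point, this sketch shows what Swift itself does with \0 in an ordinary string literal versus a raw one, and which scalar the braced octal spelling \o{101} would denote. It only exercises existing string-literal behaviour; how a regex literal would treat these escapes is exactly the open question above.

```swift
// In an ordinary literal, Swift processes the escape: \0 is the null character.
let cooked = "\0"
let cookedScalars = cooked.unicodeScalars.map { $0.value }

// In a raw literal, the backslash and the digit pass through as two characters,
// so a regex parser would be the one deciding what \0 means.
let raw = #"\0"#
let rawScalars = raw.unicodeScalars.map { $0.value }

// The braced octal form is unambiguous: the digits 101 in base 8 are 65, i.e. "A".
let octalDigits = "101"
let octalValue = UInt32(octalDigits, radix: 8)!
let octalCharacter = Character(Unicode.Scalar(octalValue)!)
```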

johnno1962 · November 19, 2021, 12:32am

I still feel /regex/ is too flimsy a syntax on its own. I saw this tweet today and thought of this thread:

dlbuckley · January 7, 2022, 11:10pm

It’s been a while since there has been any activity on this pitch. Are there any updates or has it moved down the priority list for now?

Francois_Green · January 8, 2022, 5:22am

There's a lot of activity here: GitHub - apple/swift-experimental-string-processing: An early experimental general-purpose pattern matching engine for Swift.

Michael_Ilseman · January 8, 2022, 11:40pm

That's right, there is active development going on right now. I've been meaning to post a version 2 of this pitch, but there hasn't been enough new information quite yet. But there are a lot more details that I can share right now.

This pitch discusses 3 topics:

The regex syntax supported inside the literal
The library-extensibility story via protocols
The choice of delimiter around the literal

Supported syntax

The regex parser is up here. It is written in Swift and architected as a standalone, zero-dependency library, which is used by the Swift compiler. This means it can be used by other source tools, or really anyone who wishes to parse and manipulate regexes the same way the Swift compiler does. I want to land automatic refactoring to result builders in that library as well, such that someone could conceivably write a simple command line tool to do the conversion. Note that all of this is still very much in-progress.

The parser tracks fine-grained source location information (e.g. useful for rich syntax highlighting) and the produced AST also has a nice semantic API, of which the backend engine's bytecode compiler is one obvious client.

The current supported syntax status is tracked here. In short, we are supporting a syntactic superset of:

PCRE2, an "industry standard" of sorts, and a rough superset of Perl, Python, etc.
Oniguruma, an internationalization-oriented engine with some modern features
ICU, used by NSRegularExpression, a Unicode-focused engine
Our interpretation of UTS#18's guidance, which is about semantics, but we can infer syntactic feature sets.

These engines aren't strictly compatible with each other: e.g. a set operator like && would, in PCRE2, just be a redundant statement of set member &. But, parsing the superset makes it easy to add compatibility modes in the future if we want to.

We still have yet to define what Swift's preferred spelling is for anything that can be written multiple ways. I think it's clear we'll prefer \u{301} for Unicode scalar literals, and likely offer some automated way to convert other syntaxes into that one (e.g. a fixit). The preferred syntax for things like named groups is still an open question. It's also a question that can be decided at any point in the process, given that we successfully parse a superset of these spellings.

By "supported syntax", I'm referring to what we will successfully parse. The backend engine might not yet have support for a parsed feature and some features may arrive over multiple releases. But, we will parse it, recognize the feature, and deliver a targeted diagnostic about what exactly is unsupported. Supported features in the engine are tracked here, but otherwise I've been treating this thread as geared around the literal syntax. The parser will naturally support things before the engine does, as parsing is much easier and it's better to parse the entire language you want to up front than to try to incrementally evolve a parser over time.

There is also an "experimental" syntax, which adds many of the kinds of affordances you'd expect in a modern programming environment. Regex literals are sometimes more like miniature program literals than data literals. However, a common concern is that this could be a "slippery slope", where every convenience you add technically breaks compatibility with a set of traditional-syntax regexes, diluting the value of this effort. The experimental syntax explores a series of these simple syntactic features to find what's a local maximum and what's sinking into this uncanny valley. It has support for non-semantic whitespace, using " for escaping/literals, and some minor group syntax tweaks. It's otherwise not formally part of this pitch.

Library-extensibility

We're jettisoning the fine-grained protocol approach that's pitched. It's proving to be unnecessary for the stdlib thanks to this library-oriented architecture for the parser. Most conceivable conformers (e.g. a libPCRE2 wrapper) would just want to pass off the raw, unprocessed content of the string to someone else anyways.

I'm still very interested and excited by library-extensibility here. I think making the parser's AST available to libraries is a much better extensibility story than serializing the information through low-level builder.buildConcatenate() calls. This can be evolved over time, and though we're designing for this to happen in the future, it will likely land later than more fundamental functionality.

Choice of delimiter

No news here. The prototype uses '/regex/' for traditional-syntax regexes and '|regex|' for experimental-syntax regexes, because ' is easy for Swift's lexer to recognize.

dlbuckley · January 9, 2022, 8:18pm

Thanks for the update! I didn't realise it was an altogether different repo. I just got quite excited by the pitch and then noticed the thread went a bit quiet.

Looks like there is a lot of work ahead, but a ton has already been done. Looking forward to this landing in its final form.

Saklad5 · January 12, 2022, 5:38pm

I’m actually okay with that long-term, especially if we ultimately add custom compile-time literal parsing using '. In such a scenario, existing regex literals would seamlessly start using that system without complicating the grammar.

For the traditional syntax, anyway: if something is experimental, it shouldn’t work in a stable release of Swift without special compiler flags. The proliferation of underscored attributes is proof that we can’t trust everyone not to use it anyway.

I definitely think that’s the right step forward.

I personally would prefer that parsers be distributed as Swift packages rather than core libraries, since the latter are much more complicated when it comes to platform support, but so long as they can be in the future that’s probably acceptable.

For the sake of simplicity, would it be possible to write regex literals that strictly follow ICU syntax? Given Swift’s embrace of Unicode in other respects, it seems more fitting than adding yet another regex standard to the pile.

johnno1962 · February 1, 2022, 6:15am

Seriously? We're going to squander single quote on regular expressions? I still held out hope character literals will make a comeback to bring Swift into line with other languages. What was the problem with #/regex/#? More consistent with the rest of the language, no?

sspringer · February 1, 2022, 9:50am

Yes, such an optimization is what I really, really need. I do a lot of replacements via regex in one of my applications, and it is not really fast. (Note: My application is cross-platform, so I cannot use any macOS / iOS specific optimizations.)

...Such an efficiency enhancement as discussed here would of course only apply to regex expressions known at compile time, right? Then what about making regex expressions that are not known at compile time fast? (Compare this to Java, where regex expressions "compiled" via Pattern.compile(regex) are really fast when applied many times, even though they are not known at compile time. I do not know what magic is done there.) And even if those runtime regex expressions get faster, how big would the difference then be to regex expressions known at compile time?
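For patterns only known at runtime, the closest thing available today to Java's Pattern.compile is to build an NSRegularExpression once and reuse it for every replacement. The following is a minimal sketch of that idea; RegexCache and collapseWhitespace are hypothetical helper names, and no claim is made here about how much time this saves in practice.

```swift
import Foundation

// Hypothetical helper: compiles each distinct pattern once and hands back the same object.
final class RegexCache {
    private var compiled: [String: NSRegularExpression] = [:]
    private(set) var compileCount = 0

    func regex(for pattern: String) throws -> NSRegularExpression {
        if let hit = compiled[pattern] {
            return hit
        }
        let fresh = try NSRegularExpression(pattern: pattern)
        compiled[pattern] = fresh
        compileCount += 1
        return fresh
    }
}

// Replaces every run of whitespace with a single space, reusing the cached regex.
func collapseWhitespace(_ lines: [String], using cache: RegexCache) throws -> [String] {
    try lines.map { line -> String in
        let regex = try cache.regex(for: #"\s+"#)
        let range = NSRange(line.startIndex..., in: line)
        return regex.stringByReplacingMatches(in: line, options: [], range: range, withTemplate: " ")
    }
}

let cache = RegexCache()
let cleaned = try! collapseWhitespace(["a  b", "c \t d"], using: cache)
```

The pattern string is parsed only on the first call; later calls pay just for the dictionary lookup and the matching itself.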

Avi · February 1, 2022, 10:10am

All modern Java runtimes use JIT compilation. I bet that helps.

Karl · February 1, 2022, 10:31am

It's worth noting that not all platforms support generating executable code at runtime (i.e. JIT). iOS, for example, doesn't generally allow it except for privileged processes, and neither do many other closed platforms as it can be a security risk.

So it is very important to have a fast interpreter for regexes only known at runtime, and it is certainly possible. For example, see this blog post from the V8 team about improvements to their Regexp implementation:

The new interpreter is up to 2× as fast as the old one, averaging about 1.45× as fast. We even come quite close to the performance of JITted RegExp for most benchmarks, with Regex DNA being the only exception. The reason why interpreted RegExp are that much slower than JITted RegExp on this benchmark is due to the long subject strings (~300,000 characters) used.

They achieved those improvements by lowering their internal overheads when jumping between JS and C++ (not a concern for us), implementing bytecode peephole optimisations (which we could also do), and using computed gotos rather than switches (which is how most interpreters seem to be written these days).
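To give a feel for what an interpreted regex looks like, here is a toy backtracking bytecode interpreter with switch-based dispatch. It is only a sketch: the Instruction set and the run function are invented for illustration and are not the design of V8's interpreter or of the Swift engine.

```swift
// A toy instruction set, in the style of classic backtracking regex VMs.
enum Instruction {
    case char(Character)   // consume one specific character
    case any               // consume any one character
    case split(Int, Int)   // try the first target, fall back to the second
    case jump(Int)         // unconditional jump
    case match             // succeed if the whole input was consumed
}

func run(_ program: [Instruction], on input: String) -> Bool {
    let chars = Array(input)
    // Each pending alternative is a (program counter, input position) pair.
    var stack: [(pc: Int, pos: Int)] = [(pc: 0, pos: 0)]

    while let thread = stack.popLast() {
        var pc = thread.pc
        var pos = thread.pos

        threadLoop: while true {
            // Switch-based dispatch: one branch per opcode.
            switch program[pc] {
            case .char(let expected):
                guard pos < chars.count, chars[pos] == expected else { break threadLoop }
                pc += 1
                pos += 1
            case .any:
                guard pos < chars.count else { break threadLoop }
                pc += 1
                pos += 1
            case .split(let first, let second):
                stack.append((pc: second, pos: pos))   // remember the fallback
                pc = first
            case .jump(let target):
                pc = target
            case .match:
                if pos == chars.count { return true }
                break threadLoop
            }
        }
    }
    return false
}

// Hand-assembled program for the pattern ab*c.
let program: [Instruction] = [
    .char("a"),     // 0
    .split(2, 4),   // 1: either take another b or move on to c
    .char("b"),     // 2
    .jump(1),       // 3
    .char("c"),     // 4
    .match          // 5
]
```

A peephole pass of the kind mentioned above would rewrite short instruction sequences in such a program before it runs, for example by fusing the split, char, jump loop into a single "skip while b" instruction.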

I'm definitely interested to see how the Swift implementation stacks up. I'm not expecting the initial release to challenge V8 for performance, but since we also have ABI stability to worry about, I hope we keep things abstract enough that we can implement some of those kinds of optimisations down the road.

JanWillemBrands · February 1, 2022, 11:17am

And if we're going to squander them, can we not squander them wholesale?
Why '/regex/' or '|regex|'? Is 'regex' not possible?
It would perhaps get us character literals 'a' for free.

hamishknight · February 1, 2022, 12:57pm

'/regex/' and '|regex|' are just placeholder delimiters to get the prototype up and running. They're not any indication of what the actual delimiters will end up being.

johnno1962 · February 1, 2022, 1:52pm

Happy to hear it! Making an analogue from regex literals to raw strings makes a lot of sense since raw strings were pretty much proposed and designed with regular expressions in mind. It emphasises that \ escapes are not normally processed but passed through while leaving the door open to interpolation (though it's not clear how the compiler would handle that in checking/constructing the literal).

austintatious · March 6, 2022, 9:58pm

I like this best.

stackotter · March 7, 2022, 12:40am

I agree. It also avoids breaking code bases that have already defined / as a custom operator (such as CasePaths).
