[Pitch] Regex Syntax

Michael_Ilseman · March 9, 2022, 12:08am

Something else that's interesting is mentioned in the "Introduce a novel syntax" alternative: we're developing an experimental extended syntax for Swift that goes a little further. This is not proposed here and is likely to be future work (it's a lot to bite off at the same time), but the rough and ever-evolving idea is:

1. All ASCII values outside of [A-Za-z0-9] are reserved for metacharacters and should be escaped or quoted for literal treatment.
2. All whitespace is non-semantic unless escaped; # is supported for end-of-line comments (pending delimiter, perhaps // too).
3. Quoted literal content uses double-quotes, so you can say "a.b" instead of \Qa.b\E. These would be Swift string literals, eventually supporting interpolation, raw strings, etc.
4. Clearer capture group syntax and defaults: (...) is non-capturing, (_: ...) for unnamed capture, and (name: ...) for named, etc.
5. Support Swift-syntax ranges for ranged quantification, i.e. x{3..<8} for x{3,7}.
6. Use of other now-free delimiters, e.g. <...>, as a way of naming builtins such as character classes and anchors, perhaps also an interpolation syntax or a way to refer to in-scope declarations.
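
Two of the ideas above, quoted literals and Swift-style ranged quantifiers, map directly onto traditional syntax. As a rough sketch of that correspondence (the helper names `traditionalQuantifier` and `traditionalQuote` are hypothetical and not part of the pitch or any library; they only build traditional-syntax strings):

```swift
// Hypothetical helpers for illustration only. They build strings in
// traditional regex syntax; they do not parse the extended syntax.

/// Maps a Swift half-open range to a traditional ranged quantifier,
/// e.g. 3..<8 becomes "{3,7}".
func traditionalQuantifier(_ bounds: Range<Int>) -> String {
    precondition(!bounds.isEmpty, "an empty range has no traditional equivalent")
    return "{\(bounds.lowerBound),\(bounds.upperBound - 1)}"
}

/// Wraps literal text in \Q...\E, the traditional spelling of what the
/// extended syntax would write as a double-quoted literal.
/// Assumes the text does not itself contain "\E".
func traditionalQuote(_ literal: String) -> String {
    return "\\Q" + literal + "\\E"
}

print("x" + traditionalQuantifier(3..<8))   // x{3,7}
print(traditionalQuote("a.b"))              // \Qa.b\E
```

The half-open upper bound is why x{3..<8} corresponds to x{3,7} rather than x{3,8}.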

This clearly breaks compatibility with existing regex syntax, so it would need to be clearly delineated, and it makes sense as future work. There's significant value in giving things like command-line tools and search fields access to traditional regex syntax, so this wouldn't take the place of what we're proposing.

Another practical reason to consider this future work is that the current effort is pushing the state of the art of the Swift compiler: our regex parser is written as a stand-alone pure-Swift library that gets bundled up and incorporated with the C++ Swift compiler. The Swift compiler yields lexing/parsing state to our library, which then yields back to the Swift compiler after lexing/parsing. A "year 2" of overhauling parts of the Swift lexer could include handling string literals in such a library, making it natural for the regex parser to support embedded Swift string literals. Alternatively, in the nearer term, if this is deemed high-value, we could support just basic string literals at first.

If we're debating an extended syntax by default for literals (or one of multiple literals), I'm not sure how much value Perl-style xx gives us. Result builders seem like the better way to separate components across lines with comments. The syntax above, especially points 1-3, gives us a more compelling extended syntax.
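
For comparison, here is a minimal sketch of the result-builder approach, where components sit on separate lines with ordinary Swift comments. It assumes the RegexBuilder module available in Swift 5.7 and later; the pattern and the name `version` are purely illustrative.

```swift
import RegexBuilder

// Illustrative pattern: "<digits>.<digits>", one component per line.
let version = Regex {
    // Major version number
    Capture { OneOrMore(.digit) }
    // Literal dot; no escaping needed because it is a Swift string
    "."
    // Minor version number
    Capture { OneOrMore(.digit) }
}

if let match = "5.7".wholeMatch(of: version) {
    // Captures are available positionally on the match.
    print(match.1, match.2)
}
```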