[Pitch] Regular Expression Literals

Karl · October 17, 2021, 5:27pm

I'm quite excited by this pitch, and I have some ideas for what I'd like to do with it.

I think the proposal's motivation section should be expanded, as it essentially starts out from the position that "regexes exist" and so we should add support for them. IMO, the motivation is more that:

Text processing is critically important to most of the domains Swift wants to support, including applications, servers, and especially scripts.
Our current pattern-matching is inadequate: it is based on overly verbose, generic collection methods (sometimes supplemented by Foundation).
When terseness and productivity are most important, there is already an industry-standard compact syntax for expressing patterns: regexes.

If you need to do some simple parsing of a machine-generated log file or database (such as the Unicode data tables), Swift makes it possible... but not easy, and not concise.
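To make the "possible, but not concise" point concrete, here is a rough sketch of splitting one log line using only the standard library's generic collection methods. The log format ("<time> [<severity>] <message>"), the LogEntry type, and the parseLogLine function are all made up for illustration; they are not from the pitch.

```swift
// Hypothetical log format, assumed for this sketch:
//   "<time> [<severity>] <message>"
struct LogEntry: Equatable {
    var time: Substring
    var severity: Substring
    var message: Substring
}

func parseLogLine(_ line: String) -> LogEntry? {
    // The time is everything up to the first space.
    guard let timeEnd = line.firstIndex(of: " ") else { return nil }
    let time = line[..<timeEnd]

    // The severity is wrapped in square brackets right after the time.
    let rest = line[line.index(after: timeEnd)...]
    guard rest.first == "[",
          let severityEnd = rest.firstIndex(of: "]") else { return nil }
    let severity = rest[rest.index(after: rest.startIndex)..<severityEnd]

    // The message is whatever follows the closing bracket, minus leading spaces.
    let message = rest[rest.index(after: severityEnd)...].drop(while: { $0 == " " })
    return LogEntry(time: time, severity: severity, message: message)
}
```

That is a fair amount of index bookkeeping for a three-field pattern, and the shape of the format is hard to see in the code.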

The pattern-matching DSL proposed elsewhere is great, but it involves a fair amount of ceremony and visually dominates the code around it. It's readable, but not that concise. Regexes are primarily useful for simple patterns, e.g. splitting log lines into (time, severity, message) based on a given format. They are powerful enough to scale to the moon, but as always, it's up to the developer to ensure their code stays readable.

When your regexes get too large or complex, I'd imagine the compiler's refactoring engine would be able to rewrite them using the Swift pattern DSL, extract them as functions, etc. The point is that the language scales to the complexity of the pattern, so both simple and complex patterns are convenient to use and easy to maintain.

I also want to remind people of this post from 2016(!) after Swift 3 was released:

It has taken a while, but the goal is to be better than Perl. Realistically, we can't do that unless we have a way to express simple patterns without a huge amount of ceremony. Any other regex-like pattern literal would just be confusing because regexes are so ubiquitous, and it would be subject to the same criticism that it could potentially be abused.

As for the proposed design, I think it's really excellent, and a great demonstration of what we can do with the generic builder transform (so far used only by result builders, IIRC). I really like the idea that my code will be able to get the regex AST through the builder, so we can know something about what the regex is going to do and how to incorporate it into a larger pattern.

One thing that this highlights, though, is that we need to move our other builders (e.g. ExpressibleByStringInterpolation) to the new generic builder model. Otherwise, we won't be able to compose regexes with string literals and other patterns.

For instance, picture something like the popular JavaScript library path-to-regexp for Swift. It takes a path string, potentially including regexes or other patterns, and returns a pattern object. The best approach would seem to be to use a string interpolation with regex segments, e.g.:

url.matches(path: "/books/id_\(/\d+/)")

I'm guessing that there will be a buildCharacterClass_d callback so I can build a pattern which captures and returns an Int, but since ExpressibleByStringInterpolation uses mutating appendInterpolation calls, those types cannot be reflected in the type of the pattern object or returned by the url.matches function.
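Here is a minimal sketch of that limitation using today's ExpressibleByStringInterpolation. PathPattern and its Segment enum are hypothetical types invented for this example, and I'm passing Int.self / String.self as stand-ins for regex segments, since the regex builder callbacks don't exist yet.

```swift
// Hypothetical pattern type, for illustration only; not part of the pitch.
struct PathPattern: ExpressibleByStringInterpolation {
    enum Segment: Equatable {
        case literal(String)
        case integer   // a segment intended to capture an Int
        case text      // a segment intended to capture a String
    }

    struct StringInterpolation: StringInterpolationProtocol {
        var segments: [Segment] = []

        init(literalCapacity: Int, interpolationCount: Int) {}

        mutating func appendLiteral(_ literal: String) {
            if !literal.isEmpty { segments.append(.literal(literal)) }
        }

        // Each overload returns Void and mutates shared state, so the
        // capture type (Int vs. String) survives only as a runtime value.
        mutating func appendInterpolation(_ type: Int.Type) {
            segments.append(.integer)
        }

        mutating func appendInterpolation(_ type: String.Type) {
            segments.append(.text)
        }
    }

    var segments: [Segment]

    init(stringLiteral value: String) {
        segments = [.literal(value)]
    }

    init(stringInterpolation: StringInterpolation) {
        segments = stringInterpolation.segments
    }
}

// The static type is just PathPattern, whatever the segments capture.
let pattern: PathPattern = "/books/id_\(Int.self)/info/\(String.self)"
```

The literal knows it contains an Int capture followed by a String capture, but nothing in the type of pattern says so. A builder-style transform that returns a value from each step could, in principle, thread those capture types through to the result.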

This shouldn't be a factor in whether this proposal is accepted, but I just wanted to point out that we may need to adjust other parts of the standard library for this feature to really shine.

As for the delimiter discussion, please also consider what those delimiters might look like as part of a string interpolation. For example:

"/books/id_\(/\d+/)/info/\(/.*/)"
"/books/id_\(#regex(\d+))/info/\(#regex(.*))"
"/books/id_\(#/\d+/#)/info/\(#/.*/#)"
"/books/id_\((\d+))/info/\((.*))"

Personally, I think #regex(...) and #/.../# add too much ceremony.

Also, it might be interesting if there were a way for ExpressibleByStringInterpolation to allow omitting regex delimiters within interpolation segments. String interpolation is also a kind of concise DSL which is particularly attractive for text patterns, and removing delimiters in contexts where regexes are common helps the pattern stay readable:

"/books/id_\(\d+)/info/\(.*)"

It's added complexity, and generally I don't like that, but I think the benefit is significant.