SE-0354: Regex Literals

Paul_Cantrell · May 12, 2022, 1:25pm

It strikes me that while both of these changes are desirable, and while I would hope that in future the language makes them possible and the DSL adopts them, thinking it through a bit more, I realize their later adoption would not heavily break existing code.

Adding labels to a tuple does not break existing positional member access (.0) and positional destructuring (let (a,b) = …), so existing code would continue to work just fine. While the lack of labels might drive people away from the DSL, that does seem like it could be a purely additive future improvement.
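
A minimal sketch of the point about labels being additive, using plain tuples rather than real regex output. The tuple shapes and every name below are invented for illustration.

```swift
// Hypothetical capture tuples: the same shape before and after labels are added.
let unlabeled: (Substring, Substring) = ("2022-05-12", "2022")
let labeled: (whole: Substring, year: Substring) = ("2022-05-12", "2022")

// Positional member access compiles against both shapes.
let yearBefore = unlabeled.1
let yearAfter = labeled.1

// Positional destructuring also compiles against both shapes.
let (wholeBefore, destructuredBefore) = unlabeled
let (wholeAfter, destructuredAfter) = labeled

// The label is just an additional way to reach the same element.
let yearByLabel = labeled.year
```

Code written against the unlabeled shape keeps compiling once labels appear, which is why the change could land later without breaking anyone.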

While changing from Foo?? to Foo? will break existing code, the breakage will happen at compile time, and will come with a helpful compile error. Furthermore, some idioms for unwrapping multiple levels of optionality at once (e.g. if let foo = doubleOptional as? String) will continue to work unmodified. I do wonder whether regex literals should exactly match the optionality of the corresponding regex DSL, just for consistency, but my concern fades the more I think through the implications.
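
The unwrapping idiom can be sketched as follows. The function names and sample values are hypothetical, and the compiler may warn that the cast in the single-optional version is redundant.

```swift
// Hypothetical capture values with two and one levels of optionality.
let doubleOptional: String?? = "abc"
let singleOptional: String? = "abc"

func describeDouble(_ value: String??) -> String {
    // The conditional cast flattens both levels of optionality at once.
    if let foo = value as? String { return foo }
    return "no value"
}

func describeSingle(_ value: String?) -> String {
    // The same line still compiles once the type is a single optional.
    if let foo = value as? String { return foo }
    return "no value"
}

let fromDouble = describeDouble(doubleOptional)
let fromSingle = describeSingle(singleOptional)
```

Only the parameter type differs between the two functions; the unwrapping line is identical.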

Unlike my concern above about people accidentally ignoring spaces when they insert a line break, there is no nasty runtime behavior change here that does not come with a compiler warning. So yes, I guess it really is perfectly fine to leave these as is for now and hope for a future fix.

Michael_Ilseman · May 12, 2022, 1:54pm

With respect to non-semantic whitespace, the literal proposal presents these 3 cases:

// 1
/whitespace is significant/

// 2
#/whitespace is significant/#

// 3
#/
  whitespace is **not** significant     # nor are comments
/#

Getting behavior such as in #3 is highly desirable, via some delimiter-enabled way. We couldn't find a better one than #/ followed by a newline. The alternative, #///, has some issues with comments, and /// isn't workable AFAICT, but @hamishknight can you comment further?

An argument could be made that #2 should be non-semantic as well, as "extended delimiter" could mean "extended syntax" (and we'd likely error out on a line-ending comment). The downside is that changing a /hello world/ to a #/hello world/# would change the meaning of whitespace and that would be weird (as you point out). I'd (weakly) recommend against this direction.

An argument could also be made that all regex literals, including #1, have non-semantic whitespace. That does get weird with the no-leading-space lexing rule (which IIUC we could restrict to start-of-line if we needed to). It's also surprising that /hello world/ doesn't match "hello world", without the newline that #3 has.

edit: to clarify, the below are all compilation errors. The multi-line story only happens if the #/ is immediately followed by a newline:

// Error
/
  abcd
/

// Error
/ ab
  cd
/ 

// Error
#/ ab
   cd
/#

// Ok
#/
  ab
  cd
/#
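
A small sketch of how cases 2 and 3 differ at run time, assuming a Swift 5.7 or later toolchain. The extended delimiters are used throughout so the snippet does not depend on the bare-slash syntax being enabled; the variable names are made up.

```swift
// Case 2: single-line literal, whitespace is matched literally.
let singleLine = #/hello world/#

// Case 3: the opening delimiter is immediately followed by a newline,
// so whitespace and end-of-line comments are ignored.
let multiLine = #/
  hello   # this comment is not part of the pattern
  world
/#

let spacedMatchesSingle = "hello world".contains(singleLine)
let joinedMatchesSingle = "helloworld".contains(singleLine)
let spacedMatchesMulti = "hello world".contains(multiLine)
let joinedMatchesMulti = "helloworld".contains(multiLine)
```

Under these assumptions the single-line literal only finds the spaced input, while the multi-line literal only finds the joined one, because the space and newline between hello and world are dropped from the pattern.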

hamishknight · May 12, 2022, 3:52pm

After exploring things further, we have come up with a revised parsing behavior that does not require prefix operators containing / to be banned, and fixes all the case path compatibility issues we have previously seen.

The changes are twofold:

When encountering a prefix operator containing /, we will not parse a regex literal if there is no closing / delimiter on the same line. This is the same behavior as with unapplied infix operators.
The ) heuristic has been expanded such that the entire range of the regex literal is scanned for an unbalanced ). If such a case is encountered, we will not parse a regex literal. This takes both escapes and custom character classes into consideration.

Together, these changes mean that many uses of prefix / will be unaffected by the introduction of regex literals. It also means that ambiguities can be readily disambiguated with parentheses, for example
foo(/x, y / z) can be disambiguated as foo((/x), y / z) due to the expanded ) heuristic.
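
The parenthesized form can be sketched like this. The prefix operator and foo are hypothetical stand-ins for code that already defines a prefix /, and the arithmetic is only there so the result can be checked.

```swift
// Hypothetical prefix operator containing `/`.
prefix operator /
prefix func / (value: Int) -> Int { value * 10 }

// Hypothetical function taking two arguments.
func foo(_ a: Int, _ b: Int) -> Int { a + b }

let x = 1, y = 8, z = 2

// The parentheses around `/x` put an unbalanced `)` between the two slashes,
// so under the revised rule the text is not scanned as a regex literal;
// both slashes are parsed as operators.
let result = foo((/x), y / z)
```

Because parenthesized expressions are ordinary Swift, the same line also compiles on toolchains that predate regex literals.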

hamishknight · May 12, 2022, 4:16pm

I think #/// could work, although it seems unfortunate that it isn't an extension of the /// delimiter, which IMO seems unexpected for an extended # delimiter. There is a minor parsing issue if you already have a ///# comment somewhere in your code, then any opening #/// you write will immediately turn everything up to the comment into a regex. However that would be quite straightforward to fix by writing in the closing delimiter.

/// is quite a bit more problematic I think, I would initially assume that empty literals would be invalid to preserve the following:

///
///

However even with that rule in place, we'd have to contend with cases such as:

///
/// Some interesting function
///
func foo() {}

///
/// Another interesting function
///
func bar() {}

Would this form the following regex literal?

///
func foo() {}

///

allevato · May 12, 2022, 4:22pm

This is fantastic; thanks for taking the extra time and effort to explore this and make it work so well!

This improvement removes the reluctance I had around the bare slash syntax; being able to wrap the remaining ambiguous expressions with parentheses feels like a completely natural workaround since parenthesis-disambiguation already has precedent in the language with expressions like let x: (Int, Int) -> Int = (+), and I would wager that this workaround wouldn't have to be applied that frequently.

I'm really happy to be +1 on this change now.

Avi · May 12, 2022, 4:32pm

To echo what @allevato wrote, this is the best news possible for this proposal.

I am really glad that you have all taken the extensive negative feedback so seriously, and that you’ve put in the work to make everyone satisfied, if not outright happy.

Now I can look forward wholeheartedly to this feature, and I am sure many others will as well.

johnno1962 · May 12, 2022, 4:34pm

This is a HUGE step forward! Bravo!

johnno1962:

Has anybody given any consideration to the following syntax to enable the multi-line, whitespace ignoring version of a regex literal (which was referred to as extended mode in Perl):
#///
   (foo|bar)
   (d|f|t)
   ///#
I know this is an even more ponderous syntax but it might be worth it to give an extra confirmation something special is happening to the regex. IIRC, if done right this might result in a pleasing unification of the lexer code to tokenise string and regex literals which is probably a good indicator that the mental model for them is going to be more consistent and easier to "grok" for the user.

I find myself drawn to the idea of a "grand unification" between string and regex literals. A regex literal is just a string using / instead of " which the compiler validates and extracts some summary information (groups) for the precise type. If it is multiline, whitespace and comments are stripped out. I had a quick look at it this morning and the existing lexer can be adapted for this quite easily.

ben-cohen · May 12, 2022, 5:07pm

It's worth noting that Hamish's revised implementation will also allow code to continue to compile on older toolchains, since it leverages existing behavior of parenthesis, unlike an update to the language to allow backticks around operators, which would not compile on older toolchains.

technogen · May 12, 2022, 6:06pm

Given that there will be a fully functional result builder for composing regular expressions and full tooling support for things like auto-completions, what would be the purpose of using regex literals?

I can't think of any reasonable use case other than pasting a regex into your code from somewhere without going into the details of what it's composed of.

If the compiler will provide auto-completions with an explanation for each of the arcane letters, then the whole process of writing a regex is essentially the same as writing a result builder, except less readable. If the compiler is not going to provide that help, then writing a regex becomes more tedious than writing a result builder, because it's not reasonable to assume that every user would be fluent in regex syntax (and all of its complicated sub-syntaxes) without constantly consulting documentation.

I think the entire concept of a regex literal was invented to solve the problem of languages not having a more robust, convenient and general-purpose expression-building mechanism, where leaving the regex in string form offered no compile-time help whatsoever.

What if we reframed the problem being solved as "how do we help the user import a regular expression into the code in an idiomatic manner?" We could add a warning on the Regex initializer that takes a string, triggered when the regex string is a literal (much like Selector does right now), offering a fix-it to rewrite the whole thing into a result builder expression. It's effectively a compile-time regex parser and it covers the biggest use case for dealing with a string regex.

JohnBlackburne · May 12, 2022, 6:31pm

My experience is the opposite. It takes me much longer to parse and so understand a result builder regular expression than a /.../ style literal. I would think this is down to my experience, which is very limited with result builders.

Thankfully these proposals provide both, so you can use either in your own code.
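
For readers weighing the two spellings, here is one pattern written both ways. It is a sketch that assumes Swift 5.7 or later with the RegexBuilder module available; the pattern and the input are invented.

```swift
import RegexBuilder

// Literal form: two runs of digits separated by a hyphen.
let rangeLiteral = #/(\d+)-(\d+)/#

// Builder form of the same pattern.
let rangeBuilder = Regex {
    Capture { OneOrMore(.digit) }
    "-"
    Capture { OneOrMore(.digit) }
}

let rangeInput = "10-20"
let literalOutput = rangeInput.wholeMatch(of: rangeLiteral)?.output
let builderOutput = rangeInput.wholeMatch(of: rangeBuilder)?.output
```

Both values produce a whole match plus two Substring captures, so the choice between them comes down to readability at the call site.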

technogen · May 12, 2022, 6:35pm

Of course, if you have extensive experience with writing complex regular expressions, you have it memorized to a tee, so you'll find regex literals compelling. However, in that case, you wouldn't get much benefit from minor compiler error-checking support, over writing regex string literals. My argument is that Swift has an idiomatic way of building expressions and I think it should steer people toward learning them and using them and steer people away from using archaic solutions. After all, result builders are a well-designed solution that fits in with idiomatic Swift code and it's an essential tool for development, while regex is "imported" from another solution domain and doesn't accommodate modern development principles adopted by Swift. Regex literals, if accepted into the language, would look like something that wouldn't pass code review (just like returning an unsafe pointer to internal memory). The fact that you are not too familiar with result builders and would prefer to stick to what you know is perfectly understandable, but it also proves my point.

scanon · May 12, 2022, 6:39pm

The benefits of literals go far beyond error-checking (which is probably the least important). In particular, statically typed and labeled captures are a huge improvement over regexes with string initializers. It's perfectly reasonable for you to say that you won't use them, but this thread is full of people saying that they will use them preferentially over the builder syntax, or together with it, and we shouldn't ignore that.

Panajev · May 12, 2022, 6:44pm

Thank you, this is a super super welcome change. Thanks for bearing the feedback, sometimes a bit strong I will admit it :(, but pushing through it and delivering this.

technogen · May 12, 2022, 6:45pm

Static captures in regex literals (as proposed) are limited to the number of captures and optionally their labels. They're all either Substring or Substring?. If you're in need of statically typed captures, what you really want is to be able to parse the specific data types that you'll be using. As far as the literals go, you can't have that and the only way to do that is to either do post-processing and parsing of those substrings, or use the result builder.
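
The two routes mentioned here can be sketched side by side, assuming Swift 5.7 or later with RegexBuilder. The pattern, input, and variable names are invented for illustration.

```swift
import RegexBuilder

// Literal: the capture is a Substring, so conversion is a separate step.
let digitsLiteral = #/(\d+)/#
var parsedFromLiteral: Int? = nil
if let match = "1234".wholeMatch(of: digitsLiteral) {
    parsedFromLiteral = Int(match.output.1)
}

// Builder: TryCapture runs the transform during matching,
// so the capture in the output tuple is already an Int.
let digitsTyped = Regex {
    TryCapture {
        OneOrMore(.digit)
    } transform: { Int($0) }
}
var parsedFromBuilder: Int? = nil
if let match = "1234".wholeMatch(of: digitsTyped) {
    let number: Int = match.output.1
    parsedFromBuilder = number
}
```

With the literal, a failed conversion shows up after the match; with TryCapture, a transform that returns nil makes the match itself fail.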

scanon · May 12, 2022, 6:47pm

The capture list itself is statically typed, so that you know exactly how many captures are present and errors in accessing a capture that isn't there can be checked at compile time.
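
A brief sketch of what the statically typed capture list looks like in use, assuming Swift 5.7 or later. The pattern and the capture names are made up.

```swift
// The literal's type carries the capture count and labels:
// Regex<(Substring, year: Substring, month: Substring)>
let datePattern = #/(?<year>\d{4})-(?<month>\d{2})/#

var capturedYear: Substring? = nil
var capturedMonth: Substring? = nil
if let match = "2022-05".wholeMatch(of: datePattern) {
    capturedYear = match.output.year
    capturedMonth = match.output.month
    // match.output.day  // would not compile: there is no capture named `day`
    // match.output.3    // would not compile: the tuple has only three elements
}
```

The commented-out lines are the kind of mistake that a string-initialized regex could only report at run time.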

technogen · May 12, 2022, 6:49pm

I realize that. The point that I was making is the same as arguing that [AnyHashable: Any] is type-safe, while NSDictionary isn't. It's not type-safe where it matters. It's still not a good API decision. It's still inferior to properly typed properties.

technogen · May 12, 2022, 6:53pm

If you know that a substring is there (courtesy of compile-time known capture list length) but you don't know if your data model can successfully parse it or not, it doesn't give you any more safety or convenience, compared to requesting a capture and getting an optional string. You're still going to end up with an optional data model, because even if the capture is known to be non-optional, the data model could still fail to parse it.

Avi · May 12, 2022, 7:12pm

You are ignoring all the possible use cases where type conversion isn't necessary at all, but a concise way of testing the shape of the string is useful.

Take input validation as an example. If the user is entering a PIN, you want to reject non-digits. While there are other ways of accomplishing the task, it is very readable and clear to be able to write the test as string.matches(/\d+/). (I have not kept up with the string API additions that have been proposed. Please take this as purely for illustration purposes.)
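
The PIN check above used an illustrative method name. A comparable sketch with the wholeMatch(of:) method from the Swift 5.7 string processing API might look like this; the function name isValidPIN is hypothetical.

```swift
// Hypothetical helper: accepts only inputs made up entirely of digits.
func isValidPIN(_ candidate: String) -> Bool {
    // wholeMatch requires the entire string to match the pattern,
    // so any non-digit character causes a rejection.
    candidate.wholeMatch(of: #/\d+/#) != nil
}

let acceptsDigits = isValidPIN("1234")
let rejectsLetters = isValidPIN("12a4")
let rejectsEmpty = isValidPIN("")
```

A partial-match check such as contains would accept "12a4", so the whole-match variant is the closer fit for validation.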

Your argument seems to boil down to "I don't like it and I won't use it, so no one else should have it either". There are lots of instances where the DSL would be too cumbersome in the face of a simple regex pattern.

ben-cohen · May 12, 2022, 7:26pm

The analogy I have been using for this is to that of closure syntax.

Swift has a very terse syntax for writing a function literal in-line for use with higher-order functions like map. When the function is short, it provides a very readably concise expression right there of what the mapping operation is doing. When the expression is super simple, it also can avoid the ceremony of naming the parameters by using $0, $1 etc. Once it gets longer, you should think about naming the parameters, and if the closure starts to become very long, you should definitely factor the closure out into a real function and call that.
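
The scaling described in this analogy, sketched with invented values and names:

```swift
let values = [1, 2, 3]

// Short: anonymous parameters keep the call site compact.
let doubled = values.map { $0 * 2 }

// Longer: naming the parameter makes the body easier to follow.
let labeled = values.map { value in
    value % 2 == 0 ? "\(value) is even" : "\(value) is odd"
}

// Longest: the logic is factored out into a named function.
func describeParity(of value: Int) -> String {
    value % 2 == 0 ? "\(value) is even" : "\(value) is odd"
}
let described = values.map(describeParity)
```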

Regexes are similar. As @bjhomer showed above, when the regex literal is simple it can be far more readable because of its concision than the DSL equivalent when you use it inline, e.g. in a switch statement, or embedded within the DSL itself. But as everyone knows, once the regex literal gets out of hand it's a readability disaster, and that's when switching to the regex builder syntax would definitely be appropriate.
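
One way the "embedded within the DSL" case can look, as a sketch assuming Swift 5.7 or later with RegexBuilder. The record format and all names are invented.

```swift
import RegexBuilder

// A short literal handles the simple part; the builder supplies the structure.
let entryPattern = Regex {
    "id:"
    Capture(#/\d+/#)
    ", name:"
    Capture { OneOrMore(.word) }
}

var entryID: Substring? = nil
var entryName: Substring? = nil
if let match = "id:42, name:swift".wholeMatch(of: entryPattern) {
    entryID = match.output.1
    entryName = match.output.2
}
```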

Now this is clearly subjective. Some users are just never going to be comfortable with regexes and prefer to use the DSL always. Other users might not know regexes at all, and would need to learn them to read and write them. But for those who believe regexes are a useful tool, they play an important part in allowing users to scale between concise ceremony-free code at the call site, and more verbose but clearer syntax for more complex use cases.

technogen · May 12, 2022, 7:37pm

I don't think this judgment is fair:

I'm expressing my concern about the proposed language change serving as a half-measure, carrying the burden of existing regex syntax while not providing substantial enough benefit to justify having it as part of the language.

But if the regex is so simple that it doesn't justify writing a result builder, it should by definition be simple enough to clearly see where the captures are.

This solution assumes that the validation should be strictly confined to string processing and if it strays beyond simple string processing, you'll have to do a post-processing step.

It's at least equally readable to write string.matches(Regex("\d+")), if not more so, due to the explicit mention of Regex and lack of excessive slashes in light of already punctuation-heavy regex string.
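
For comparison, a sketch of both spellings as they would compile on Swift 5.7 or later. It assumes a raw string for the pattern, since "\d+" in an ordinary string literal is rejected as an invalid escape sequence, and the string initializer throws, so it needs try. The variable names are invented.

```swift
var stringPatternMatches = false
var invalidPatternThrew = false

do {
    // Parsed at run time; the result is Regex<AnyRegexOutput>.
    let fromString = try Regex(#"\d+"#)
    stringPatternMatches = "1234".wholeMatch(of: fromString) != nil
} catch {
    // Not reached for this pattern.
}

do {
    // An unbalanced group is only reported when this line runs.
    _ = try Regex("(")
} catch {
    invalidPatternThrew = true
}

// Checked when the file is compiled; the result is Regex<Substring>.
let fromLiteral = #/\d+/#
let literalPatternMatches = "1234".wholeMatch(of: fromLiteral) != nil
```

Both forms match the same input here; the differences are when a malformed pattern is reported and how precisely the output is typed.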

At the end of the day, if there's a more complex pattern, it can be made into a static property or Regex, turning it into the ultimate convenience: string.matches(.pinNumber).

To be clear: I'm not arguing that regex syntax shouldn't exist in Swift. I'm arguing that the Regex("...") initializer seems to cover the use cases for in-line terseness.