SE-0354: Regex Literals

Michael_Ilseman · May 7, 2022, 5:03pm

My sympathies to anyone trying to parse Ruby. I tried that 15 years ago and it didn't turn out well. This was, at least at the time, a very well-known issue unique to Ruby compared with pretty much any other popular programming language. There were many high-profile discussions on the Ruby language mailing lists on this topic and on who was going to maintain the horrendous YACC file (around the time of the Ruby 2 transition). I stopped working with Ruby shortly after, so I do not know if this was ever improved (it looks unlikely).

The concern about the breakage of TCA when this flag is enabled is real and concrete. A significant amount of this thread is devoted to that topic, and the core team and project lead have even been weighing in to provide what assurances they can.

This is one of many syntactic migrations for Swift 6. The formal SE process, like all processes, is incomplete and doesn't have a vehicle for every unique consideration. It is my recommendation to the core team to not remove the migration flag until there's a good place for developers to migrate to, and that would (ideally) be language support for case paths. A recommendation is all that proposal authors can give, as a proposal itself is a recommendation to the core team.

The rest of your concerns appear as a vague unease about the changes. This proposal goes to great lengths to explore and try to address this vague unease (a fact that's somehow being used against it).

Do you have any information whatsoever that would clarify your vague unease? If you do, please share it so actual discussion can take place.

johnno1962 · May 7, 2022, 5:08pm

Thanks for the reply. I really don't think I have much more to say other than to re-iterate that IMHO, bare /regex/ syntax is not a good destination in itself even without taking into account the migration issues it creates. I feel I have already made every effort to articulate the reasons for this "vague unease".

rvsrvs · May 7, 2022, 5:11pm

Several points in this thread have changed my mind on the entire idea. I'm now -1 on the proposal.

I was originally in the "don't break my source" camp. @tkremenek, @Ben_Cohen, and others have pointed out that the major source break associated with enum case paths is under consideration for inclusion in Swift 6. If that really happens, it moves me to the "please break my source" camp: I would voluntarily change all my code to use what I consider to be a missing feature of the language anyway, and I'm sure that's true for much of the TCA community.

The bigger issue for me has become grammar extensibility. The entire proposal is about hosting an external grammar inside of Swift. This particular grammar is reasonably characterized as far and away the most widely used "small language" out there. It is also reasonably characterized as strongly resembling line noise on a 300 baud acoustic coupler (dating myself there).

The #regex( ... ) delimiter appeals to me because it does not privilege one particular small language (regex) over others that have been mentioned in both the pitch thread and upthread here. In spite of its lexical terseness, the / ... / syntax does not appeal to me precisely because it takes an element of the small space of operator characters and reserves it for regex use only. @Michael_Ilseman has pointed out (reasonably) that doing that reservation does not preclude us from doing a more general small language hosting in the future, but doing this one this way and others another way seems like imposing additional cognitive burdens on a language syntax which, let's be frank, already has a lot of them.

Summarizing some of the above, #regex( ... ) syntax has been ruled out on two bases (if I have read everything correctly):

It doesn't seem to fit with #available, #file, and #selector.
The particular foibles of regex syntax mean that it will be difficult to disambiguate the closing paren of the #regex element from the line noise of the actual regex itself.

The first seems somewhat strange to me. It's difficult to imagine anything those have in common other than "#ImASpecialCompilerThing". #regex doesn't seem to me to be at all out of place in that list, especially when you compare it with #selector. People who are new to the language are going to have no idea what that is about.

The second is what has me changing my vote on the proposal. To me it seems difficult to argue that / ... / provides needed visual decluttering while ignoring that the ... right there in the middle provides the exact opposite. And that that cluttered syntax itself is what makes it incredibly complicated to host this small language inside of Swift in a manner that can be readily extended to other small langs. That got me thinking that where we are is this:

The Swift language is not ready to host external languages in a form other than Result Builders
We have an excellent well-thought out DSL in Result Builder form that does exactly what the regex language does
It would not be difficult to provide tooling to generate a DSL implementation from a regex string. (And perhaps, though I'm well outside my depth here, the reverse as well)

I would much prefer to see a comprehensive proposal for "small lang" extensions to the language (NB this conversation and experience with things like ASP and JSX has convinced me that this may not be possible). Until that time, I would much rather have regex translation implemented at the tool level than at the language level.

Avi · May 7, 2022, 5:29pm

Also the Atom text editor.

rdemarest · May 7, 2022, 6:14pm

I would love to have a literal syntax for regexes in Swift: it's quick and easy to handle, and we get named captures. But I do not see the attachment to the bare /.../ syntax. It wouldn't be the first time Swift moves away from features of other languages that people seem to be used to.

The Swift community thought that increment and decrement operators were too confusing and gave up on them, but this proposal finds that this kind of construct isn't confusing: f(/, /)?

I think the proposal does respond to a need for a simple yet expressive syntax like regex literals. I don't think the bare syntax is worth keeping; it seems to be more confusing than anything. I would vote for #/.../# as the minimum regex literal syntax because it has the merit of being quick to type while still retaining what other languages see as a regex. With this you can just copy the code from another language, add # around it, and you're done. If having # by default is going to be confusing, I'd rather vote for #Regex() than restrict the custom operator syntaxes.

rdemarest · May 7, 2022, 6:24pm

I do not see how this is an acceptable syntax: f(/, /) how is that not confusing? Are we using a function with two arguments or a function with one taking a regex? It doesn't seem right to me to find that kind of syntax acceptable, but finding this value++ unacceptable in the language, at least if I want to I can reintroduce increment and decrement operators in my code.
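For reference, here is a minimal sketch of the "two arguments" reading in Swift as it works today. The function `f` and its arithmetic are hypothetical, invented only for this illustration. Passing each operator in parentheses or as an explicit closure expresses the same call without depending on how a bare `/` is lexed.

```swift
// Hypothetical function for illustration: takes two binary operators
// as ordinary function values.
func f(_ first: (Int, Int) -> Int, _ second: (Int, Int) -> Int) -> Int {
    // Apply the first operator to fixed inputs, then feed the result
    // into the second operator.
    second(first(64, 4), 2)
}

// `f(/, /)` is the spelling under discussion: today it passes `/` twice.
// The two calls below say the same thing unambiguously.
let viaParens = f((/), (/))                      // (64 / 4) / 2
let viaClosures = f({ $0 / $1 }, { $0 / $1 })    // same computation

print(viaParens, viaClosures)
```

Both spellings are more verbose than `f(/, /)`, which is the trade-off being debated here.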

ksluder · May 7, 2022, 6:44pm

One way to resolve this ambiguity is to make the previous suggestion of backtick escaping mandatory for using operator characters as identifiers in Swift 6 mode. Then bare operators can only ever mean application.

rdemarest · May 7, 2022, 6:59pm

To me it seems that using the / operator would come up more often than using regular expressions, so we would make using the very common / much harder and more confusing by slapping backticks everywhere for the sake of the much more uncommonly used regex literal syntax, which I believe is actually made better with explicit #/ /#.

From the other regex proposals, I was led to believe that the regex literal syntax wasn't even the preferred one, it seems that we want to use the declarative/result builder syntax over the literal syntax, and the literal syntax is here more for convenience and "compatibility with other languages" than clarity, and I don't believe that either of those justify the change. And as I pointed out before, both of those reasons weren't enough to keep the increment and decrement operators in the standard library.

I find the bare regex literal syntax to be extremely confusing on its own, especially when I see it used in Perl, and I do not see how requiring the # around a literal is such a burden on the proposal.

ksluder · May 7, 2022, 7:20pm

Backticks would only be necessary to use the / operator as an identifier. My intuition is that a programmer who uses regex literals at all will use them much more frequently than any Swift programmer uses an operator as a bare identifier.

rdemarest · May 7, 2022, 7:36pm

Given that the issue here is between allowing the syntaxes /.../ and #/.../# together versus only allowing #/.../# I still do not see the benefit of the bare syntax compared to all the other issues that it brings in, and I don't think it's just an issue of source compatibility, it removes good usages for the operators, makes illegal some usages that are seen as advantages of the Swift language, like some custom operators, use of operators as parameters to other functions, etc.

The only argument I see in favor of the bare syntax (as opposed to required # as a minimum) is "other languages do it" but given the other languages in question, Perl, Ruby, and JavaScript, that'd be an argument against this syntax, they're not bastions of clarity and readability.

benlings · May 7, 2022, 8:22pm

Fully support strongly typed captures and compile time checked literals for regexes. The inclusion of both plain /abc/ and extended #/abc/# literals mirrors string literals nicely. I'm happy with the /abc/ syntax as long as it doesn't cause problems in practice. I think this will take some real world experience (as did the introduction of multiple trailing closures), so leaving this behind a flag will hopefully allow that to be done without too much disruption.

One of the main problems with using regex literals in other languages is having to run the code before the syntax of the literal embedded in a string can be checked. Having compile time literals for regexes will make them significantly more usable in Swift.
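As a small sketch of that run-time problem in Swift today, the snippet below builds a regex from a string with Foundation's NSRegularExpression. The pattern is a made-up example with a deliberately unclosed group. The compiler accepts it because it is just a string, and the error only appears when the initializer runs.

```swift
import Foundation

// Hypothetical pattern with a deliberate mistake: the group is never closed.
let brokenPattern = "([a-z]+"

var failedAtRunTime = false
do {
    // This compiles: to the compiler the pattern is an ordinary string.
    _ = try NSRegularExpression(pattern: brokenPattern)
} catch {
    // The syntax error surfaces only when this code path executes.
    failedAtRunTime = true
    print("Invalid pattern reported at run time: \(error.localizedDescription)")
}
```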

Adding compile time support for regex syntax and captures fits well with Swift's goals of being safe, expressive and to 'present excellent diagnostics'.

I've used regexes in several other languages and have found literal support in JavaScript and Ruby useful to have.

Parsing ambiguity

It would be nice if it were possible to avoid the parsing ambiguity between regex literals and certain operators. One of the preferences expressed elsewhere in the proposal review is that Swift could only have the extended syntax. My feeling is that extended literals look too heavyweight for an ‘everyday’ syntax.

One possibility that I haven't seen expressed is adding only a leading symbol before the first / to help remove the ambiguity (cf Lisp's quoting and Ruby's symbol literals):

:/[a-z]+/ (this also looks similar to Raku’s adverb syntax). Would this be ambiguous with colons in method parameters? It seems like it might work, but I'm sure I haven't thought through all the possibilities.
'/[a-z]+/. I think this wouldn't 'burn' the use of the single quote for other literals, as long as whatever it was used for didn’t need a / at the first character.

Both of these don't look as visually heavyweight as #/[a-z]+/#.

Another question - could some of the ambiguities people have mentioned be resolved by requiring spaces to be escaped within the literal? e.g. foo(/, /) - is it two / operators or a regex? If this were foo(/,\ /), would it be unambiguous?

Syre · May 7, 2022, 9:21pm

I tend to agree with Nevin's sentiment on regex literals in this thread.

Still, if it must happen, I'm personally not so concerned about the potential source break, but I really do feel that getting "regex in the name" somehow would be so much better.

Doug mentioned that he thinks regex(...) has certain issues, namely the following:

Clearly I'm missing something, I don't necessarily think it's problematic if regex is sorta different from selector etc.

But also, I'm not sure how regex(...) "doesn't adapt to raw and multi-line literals", I'm guessing it's true, but I just haven't grasped it.

In any case, if there was any possible way to adapt something that has "regex in the name", I think that would be significantly better. It takes something that looks like a mess of symbols to the uninitiated and makes it clear, and they say "oh, I guess that's a regex".

masters3d · May 8, 2022, 5:53am

We could have our cake and eat it too if bare regex literals were only allowed within a Regex DSL scope, where they are usually added on their own separate line. This would be a compromise that should allow us to limit the blast radius.

dhoepfl · May 8, 2022, 5:52pm

I think, due to ambiguities, /…/ breaks the principle of least astonishment.

I’d love to see this evaluated by asking people who do and do not know (a lot) about SE-0354: What do developers expect let y = foo(a, /, b).reduce(1, /) to do? How confident are they? …

jberry · May 8, 2022, 8:44pm

I think that syntax highlighting can go a long way toward making clear what is happening in a case like this. If regexes are syntax colored differently from code, as strings are, for instance, this will stand out just as it would if you replaced the slashes above with quote marks in similar but not identical situations. I realize that not everybody has a syntax coloring editor, though a very high percentage do.

sveinhal · May 9, 2022, 12:07pm

I totally agree, and understand that everybody here is expressing their own opinions and views. Even when people express things as facts, it is understood that they are talking about what they themselves understand as fact.

That said, it is possible to express one's opinion in ways that are more or less constructive for discussion. And one can choose whether or not to justify one's opinions. Opinions offered in a strongly or absolutely worded tone of voice and without justification should be expected to be questioned.

I think it's ok to ask for justifications and examples of opinions expressed as fact. And I think it is especially productive when said position is under-represented in the discussion at hand.

However, I'm sorry if that point was poorly expressed, or I otherwise contributed to an unproductive conversation. I'm truly interested in examples. I'd like to understand how opponents of this proposal suggest we solve some of the use cases in alternative ways.

I'm sorry if I came off as counter-productive.

Paul_Cantrell · May 9, 2022, 6:04pm

While I'm lightly (but not passionately) in favor of respecting precedent from other languages and making /…/ parse, I'm really uncomfortable with making #/…/# the only regex literal syntax. It's just…uuuuugly.

Regexes are already a nasty symbol soup, and more noise doesn't help readability. You may not love this:

/[+-]?\d(\.\d+)/

…but you'll have a hard time convincing me that this is an improvement:

#/[+-]?\d(\.\d+)/#

I can just imagine explaining that to students: “No, no, both # and / are delimiters, whereas all those other symbols are part of the Swift syntax….” Regex literal syntax is daunting enough as it is.

(An aside: several comments mention #/…#/ instead of #/…/#. Surely you mean the latter, not the unbalanced former?! Keep in mind that the Swift extended string syntax is #"…"#. That is a syntactic precedent to strictly respect.)

The extra noise is especially bothersome given the proposal’s appealing idea that regex literals might be used to concisely express small molecules in a larger DSL-based regex. In such usage, regex literals are small, and the signal-to-noise reduction of the bigger delimiter is thus significant. The version on the right is a significant regression to my eye:

```
let regex = Regex {                    let regex = Regex {
  Capture { /[$£]/ }                     Capture { #/[$£]/# }
  TryCapture {                           TryCapture {
    /\d+/                                  #/\d+/#
    "."                                    "."
    /\d{2}/                                #/\d{2}/#
  } transform: {                         } transform: {
    Amount(twoDecimalPlaces: $0)           Amount(twoDecimalPlaces: $0)
  }                                      }
}                                      }
```

Bleah. If we're introducing that much noise, let’s at least make it expressive noise (keeping in mind Doug’s concerns upthread about #regex(…), which are compelling):

```
let regex = Regex {
    Capture { #re"[$£]" }
    TryCapture {
        #re"\d+"
        "."
        #re"\d{2}"
    } transform: {
        Amount(twoDecimalPlaces: $0)
    }
}
```

Again, I am personally in favor of allowing /…/. Parsing concerns seem more fear-based than evidence-based. Some syntactic familiarity is always nice to offer language newcomers. But if community sentiment runs squarely against it, I’d argue in favor of a new pitch thread devoted to finding a better delimiter than #/…/#. That just seems to me like syntactic salt (or maybe syntactic Bitrex) against using regex literals at all.

While syntax always steals the show, I'd like to reiterate my concerns over some other things that IMO deserve a little more discussion than they've received:

Mismatch in optionality between literals and the DSL for nested capture groups:

/(.)*|\d/

→ match type of (Substring, Substring?)

…but IIUC…
```
ChoiceOf {   // supposed to be equiv to above; don't know DSL well yet; making up details
  Capture {
    ZeroOrMore(.any)
  }
  .digit
}
```
→ match type of (Substring, Substring??)

The hidden change in the meaning of all whitespace when #/…/# contains a newline:
```
#/ (foo|bar)(d|f|t) /#
```
matches " foot "

…but IIUC…
```
#/ (foo|bar)
   (d|f|t) /#
```
does not match " foot "

…which seems to me like a footgun.

Edit: whitespace in the middle makes an even more compelling example of this problem:
```
#/hello world/#
```
matches "hello world"
```
#/
  hello world
/#
```
does not match "hello world"
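
The same kind of shift can be reproduced today with Foundation's NSRegularExpression, whose .allowCommentsAndWhitespace option ignores whitespace in the pattern. This is only an analogy, sketched under the assumption that the multi-line literal behaves like that extended mode; `matches` is a helper invented for the example.

```swift
import Foundation

// Hypothetical helper for this sketch: does `pattern` match anywhere in `text`?
func matches(_ pattern: String, in text: String,
             options: NSRegularExpression.Options = []) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return false
    }
    let range = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

// Default mode: the space in the pattern is a literal space.
let literalSpace = matches("hello world", in: "hello world")

// Extended mode: whitespace in the pattern is ignored, so the same
// pattern now means "helloworld".
let extendedSpace = matches("hello world", in: "hello world",
                            options: .allowCommentsAndWhitespace)
let extendedNoSpace = matches("hello world", in: "helloworld",
                              options: .allowCommentsAndWhitespace)

print(literalSpace, extendedSpace, extendedNoSpace)
```

The difference is that here the mode switch is spelled out as an option, whereas in the literal it would be implied by the presence of a newline.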

woolsweater · May 9, 2022, 6:24pm

It's just…uuuuugly.

I don't know that we can say there's any objective determination to make about the syntax. I'm looking at your DSL examples with the embedded literals (thanks for making them) and I personally find that the octothorpes are easier to read, because they contrast with their surroundings and let me quickly recognize what they are.

EDITED to clarify exactly which bit I was responding to.

Paul_Cantrell · May 9, 2022, 6:26pm

Well of course there isn’t! Aesthetics are subjective.

They're also necessary and important in language design. Insist on objectivity and nothing but, and you miss half of what matters.

Michael_Ilseman · May 9, 2022, 6:53pm

An argument could be made that #/ as the extended delimiter should have extended syntax enabled by default.