SE-0354: Regex Literals

Jumhyn · May 10, 2022, 1:17am

This is a good point! But also worth noting, I think, that this complexity would not necessarily extend to tools which can mostly assume correct code/syntax, and/or which don’t really care about producing diagnostics. For instance, a source hosting/viewing tool like GitHub could probably get away with a less-complex syntax highlighting algorithm were /.../ not considered valid syntax, since users are (typically) not editing code there. So I think we should not totally discount the wins from keeping parsing rules simple just because we may want to parse some invalid syntax for the purposes of diagnostics.

QuinceyMorris · May 10, 2022, 4:57am

I don't really have a horse in this race, but I've been wondering over the past few days whether there might not be a different way of "assembling" all of the constituent parts that might satisfy most of the objections on both sides.

To construct a regex out of a string, you use syntax like this:

    let regex = Regex("…guts…") // <-- says "Regex"

To construct a regex in a builder, you use syntax like this:

    let regex = Regex { // <-- says "Regex"
        … builderGuts …
    }

and you can construct a builder regex (using proposed syntax) like this:

    let regex = Regex { /…guts…/ } // <-- says "Regex"  (a)

IIUC, the last construct is functionally equivalent to:

    let regex = /…guts…/ // <--- doesn't say "Regex"  (b)
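As a rough illustration of the forms listed above, here is a minimal sketch that assumes a Swift 5.7 or later toolchain. It uses the extended `#/…/#` delimiter rather than the bare one so that it compiles without extra flags. The `\d+` pattern, the sample text, and the `firstMatchText` helper are made up for the example.

```swift
import RegexBuilder

let sample = "order 42"

// 1. From a string: the pattern is parsed at run time and the initializer can throw.
//    `try!` is used only because this pattern is known to be well-formed.
let fromString = try! Regex(#"\d+"#)

// 2. From the builder DSL.
let fromBuilder = Regex { OneOrMore(.digit) }

// 3. Form (a): a literal wrapped in a builder.
let wrappedLiteral = Regex { #/\d+/# }

// 4. Form (b): the literal on its own.
let literal = #/\d+/#

// Hypothetical helper: the text of the first match, if there is one.
func firstMatchText(_ regex: some RegexComponent, in text: String) -> String? {
    text.firstMatch(of: regex).map { String(text[$0.range]) }
}

print(firstMatchText(fromString, in: sample) ?? "no match")
print(firstMatchText(wrappedLiteral, in: sample) ?? "no match")
```

All four values find the same first match in the sample text. They differ in spelling and in when the pattern is checked (run time for the string form, compile time for the others).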

Now, a number of people want the literal regex syntax to say that it's a regex, so that the meaning is clear regardless of how messy the literal guts are, or what delimiter is chosen.

It seems to me that it might be feasible to use (a) as the real regex literal syntax, and drop (b) from the proposal.

What I have in mind is this:

Delimited regex literals would only be used inside a Regex { … } construct.
Inside that construct, the proposed "new" rules for lexical interpretation of the / symbol would apply.
Outside that construct, the existing rules for / would apply.
This is 100% not-source-breaking, because the Regex { … } construct does not currently exist.
Compiler Magic™ would allow the Regex type name to be recognized lexically as something sorta like the #regex symbol that some people have proposed.
If what's inside the braces is just a /-delimited regex literal, the whole expression is just a regex literal.
If what's inside the braces contains something other than (or as well as) a /-delimited literal, then it's a regex builder, not a regex literal. Multiple /-delimited literals would be allowed inside a builder without needing to nest them with additional Regex {…} syntax, so using lots of itty-bitty literals this way is no harder than originally proposed.

IOW, I'm suggesting something like #regex(…) literal syntax, but actually spelling it Regex {/…/}. Then, the interior of regex builders and literals would be the only lexical contexts where / is interpreted in a new way (for certain edge cases spelled out in the proposal).

The downsides here are:

The spelling of a top-level literal regex has a few more characters. However, this can be seen as an advantage — every regex is introduced by the exact same Regex symbol without exception.
The rules for / are different in different places. However, the members of the core team who've weighed in here have consistently stated that they don't expect the new rules to confuse developers.

masters3d · May 10, 2022, 5:49am

Xcode already has a special visual representation for literals like #imageLiteral(…), so for folks using Xcode the same treatment could be applied to extended regex literals, making them visually appealing when written on a single line.

On most platforms, folks can adopt code ligature fonts to mitigate unwanted noise if they choose.

For me /../ is such a foreign concept. It’s as if somebody told me that they wanted to use percentage signs %..% because historically this is the way it’s been done.

johnno1962 · May 10, 2022, 8:13am

Have I got this right? We're suggesting hundreds, perhaps thousands of lines of existing code out there in the community have to move to this unattractive syntax so we can roll out bare /regex/ syntax on "aesthetic grounds". If we are to deprecate on TCA, let's go the extra mile to a better end point where the syntax we're suggesting people have to take the trouble to move to could at least be \Authentication.authenticated, which also solves the problem, rather than this halfway house.

All this so the very, very, very small number of people who will use the proposed syntax over the DSL version don't have to hold their nose and type a couple of extra #. What kind of preemptive "active harm" justification is this?

I believe if we are to pursue the bare regex syntax as a destination it will be a multi-year process that needs to be planned out thoughtfully rather than conspicuous specific source breaks of key open source projects not even being mentioned in the proposal which I find "odd".

If you can't see why this is receiving so much attention take time out of your day to watch the 5 free videos on the TCA homepage. It will enrich your programming life.

gwendal.roue · May 10, 2022, 11:05am

I have not used Case Paths yet, but I'm pretty sure that trying them is adopting them.

Case Paths are a great example of community-driven evolution. They provide a great service for building the kind of software that Swift is used for today.

And that's where I don't get why we're debating so much about regex literals. Is Swift about to become a fashionable alternative to sed, Ruby, awk, or Perl? I'm quite versed in regex, and yet in years of Swift development, I think I can count on the fingers of one hand the number of times I needed some.

Why this fuss about regex literals??

The Core Team is again posing as a hostile group who could really improve its empathy skills, and pay due credit to the community. I just don't get what the benefits are. I'm desperately searching for enthusiasm, pride, and care.

bjhomer · May 10, 2022, 12:54pm

I don't think we have any reason to assume that this number would be "very, very, very small". There are a ton of Swift developers out there with previous experience using regexes, and I'd imagine that many of them would prefer to use the shorthand syntax, especially for short patterns. For example, I'd much prefer this:

    switch input {
      case /\d+/:
        print("It's a decimal")
      case /0x[0-9a-fA-F]+/:
        print("It's a hex")
      default:
        print("It's something else")
    }

to this:

    switch input {
      case Regex {
        OneOrMore(.digit)
      }:
        print("It's a decimal")

      case Regex {
        "0x"
        OneOrMore(.hexDigit)
      }:
        print("It's a hex")

      default:
        print("It's something else")
    }

The first option with short inline regexes is far more readable to me. That may not be true of you, and if you'd prefer to use the DSL version you'd be welcome to it, but I at least would happily use the shorthand syntax in many cases.
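For readers who want to try the two spellings side by side, here is a minimal sketch, assuming Swift 5.7 or later. It is not the code from the post. It uses the extended `#/…/#` delimiter so that no compiler flag is needed, and it calls `wholeMatch(of:)` in plain `if` statements rather than `case` patterns. The function names are made up for the example.

```swift
import RegexBuilder

// Literal spelling: short patterns written inline.
func classifyWithLiterals(_ input: String) -> String {
    if input.wholeMatch(of: #/\d+/#) != nil { return "decimal" }
    if input.wholeMatch(of: #/0x[0-9a-fA-F]+/#) != nil { return "hex" }
    return "other"
}

// Builder spelling: the same two patterns expressed with the DSL.
func classifyWithBuilder(_ input: String) -> String {
    let decimal = Regex { OneOrMore(.digit) }
    let hex = Regex {
        "0x"
        OneOrMore(.hexDigit)
    }
    if input.wholeMatch(of: decimal) != nil { return "decimal" }
    if input.wholeMatch(of: hex) != nil { return "hex" }
    return "other"
}

for input in ["123", "0x1F", "hello"] {
    print(input, classifyWithLiterals(input), classifyWithBuilder(input))
}
```

For ASCII inputs like these, both functions classify each string the same way. The difference under discussion is how the patterns read, not what they match.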

Maybe I'm one of only a very, very, very small number of people, but I don't see any evidence that that's the case.

Edit: Adjusted a regex that had a mistake

johnno1962 · May 10, 2022, 2:09pm

You're getting a lot of likes there. I actually agree with you. I shouldn't have added the extra "very," and "over the DSL syntax" at the last minute. I use raw strings myself and this library SwiftRegex5 which isn't as easy to use or type safe as what is proposed but doesn't source break anything.

dlbuckley · May 10, 2022, 2:10pm

I think this is a bold assumption to make. I think the number of people using TCA is probably much smaller than the number of people who would reach for a standard regex.

johnno1962 · May 10, 2022, 2:10pm

This I believe is the mistaken assumption. A lot of people use TCA. As most data is structured these days (JSON) it's comparatively uncommon to have to reach for a regex.

s-k · May 10, 2022, 2:49pm

I see a lot of arguments focusing on technical problems and solutions (source breakage, backticks to disambiguate, whitespace rules, ...). While I agree that it is important to look at the proposal from the compiler perspective, I feel that it is equally or even more important to see what these changes mean for people writing and reading Swift.

This thread leaves me with the impression that many people are worried that the bare /.../ syntax may lead to a worse developer experience. It also seems to me like the core team and the proposal authors have a hard time grasping the worries of this group. Maybe this is because these worries are more subjective and less technical. From a compiler perspective, it is probably not that big of a problem to make the proposed changes work. Also changing the code that is impacted by the source breakage is probably not that hard. However, how much impact this change has on how comfortable people feel with the language is hard to measure.

So let me try to make this point a little more objective: I have not tallied up all posts, but it seems clear that a large percentage of people posting in this thread feel uneasy about the proposed change. On the other hand, the only reasons that I have seen mentioned for /.../ over #/.../# are that it has precedent and it looks better. Both of these reasons assume that programmers feel more comfortable using the bare syntax. However, it seems that people voting for the bare syntax clearly are in the minority, at least in this thread.

Jumhyn · May 10, 2022, 3:03pm

Thanks for the response, Doug. I mostly agree, especially if the mitigation strategies that have been discussed end up panning out. I just wanted to make sure we were teasing out the distinction between “this source break is ‘better’/‘worse’ than other source breaks” (difficult to have an objective measure for) and “this source break is larger/smaller than other source breaks” (not possible to objectively measure directly, but we have some objective data sources that can certainly support an inference).

In particular:

I think comparisons like this are good. Though I’d be curious to know what the response is when a break like this is discovered. Is it expected that workarounds would be implemented to mitigate the break? Would patches be reverted? Or would it simply be treated as acceptable for the cost of a bug fix?

Alternatively: it would establish a higher consensus requirement for further source breaking changes. In any event, this was more of a meta-point about how we manage the Swift 6 transition. I don’t think we should be using other source breaks as a justification for additional source breaks. The way I view it is that we have a certain amount of developer goodwill to 'spend' (or go into debt against) on a given major language transition, and each additional break increases the cost. At some point, I believe we will need to start weighing the cost of additional breaks against the cost of "defer this feature until Swift 7."

In the absence of actual usage data, this is something that's simply not knowable, which is yet another reason I think it would be valuable to ship with the #/.../# syntax and then have some empirical data about how commonly Swift developers end up using regex literals versus the DSL. I don't think it can be the case that this source break would still be worth it if all but a handful of developers would reject regex literals entirely. I trust that the core team is quite confident in their evaluation that the presence of 'clean' literals will be worth it, but it's not a decision that we can (easily) take back if it ends up being mistaken. I'd rather be conservative from the outset.

This touches on another meta-point that I have been trying to put into words. Regardless of the underlying merits of the bare syntax, this thread has had emotions running high with several long-time members expressing disillusionment with the evolution process as a whole, which I've found quite disappointing. I think there's a good chance it would be healthier for the community if the bare syntax were split into a separate proposal.

jeremy · May 10, 2022, 3:24pm

I think the argument is for /.../ and #/.../# (rather than /.../ over #/.../#), and part of the motivation is that this is analogous to "..." and #"..."# for string literals.
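A small sketch of the analogy mentioned above, assuming Swift 5.7 or later. With strings, the extended delimiter removes the need to escape backslashes. With regex literals, it lets `/` appear unescaped inside the pattern. The sample path and variable names are made up for the example.

```swift
// String literals: the same three characters, written with and without the extended delimiter.
let plain = "\\d+"
let raw = #"\d+"#

// Regex literals: inside #/.../# a forward slash does not end the literal.
let modulePath = #/usr/lib/modules/([^/]+)/vmlinuz/#

// The capture group holds the path component between "modules/" and "/vmlinuz".
let version = "usr/lib/modules/5.15.0/vmlinuz".wholeMatch(of: modulePath)?.1

print(plain == raw)
print(version ?? "no match")
```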

I'm not arguing one way or the other, just trying to understand the different viewpoints.

hooman · May 10, 2022, 4:01pm

Swift inherited the " delimiter from its C-like syntax. It was never chosen independently of the fact that Swift syntax is C-like. #"..."# is a very clever Swift extension. We are now talking about introducing / as a literal delimiter. The only other C-like language that uses this delimiter (for the same purpose) is JavaScript, and it lacks the operator overloading capability of Swift.

Although many people think operator overloading is a bad idea, it is a fundamental choice in Swift language to provide a dialect-free base language where many of its features can be implemented as libraries written in Swift itself. For example, look at how logical operators are defined in the Swift standard library.
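To make the point about library-defined operators concrete, here is a minimal sketch of a short-circuiting operator written in ordinary Swift. It is declared the same way the standard library declares `&&`, with an `@autoclosure` right-hand side. The `&&&` spelling and the `shortCircuitDemo` function are hypothetical names chosen for the example.

```swift
// A user-defined operator with the same precedence as the built-in `&&`.
infix operator &&&: LogicalConjunctionPrecedence

// The right-hand side is wrapped in a closure and only evaluated when needed.
func &&& (lhs: Bool, rhs: @autoclosure () throws -> Bool) rethrows -> Bool {
    guard lhs else { return false }
    return try rhs()
}

// Counts how many times the right-hand side actually runs.
func shortCircuitDemo() -> (skipped: Bool, evaluated: Bool, evaluations: Int) {
    var evaluations = 0
    func noisyTrue() -> Bool {
        evaluations += 1
        return true
    }
    let skipped = false &&& noisyTrue()   // right-hand side is never called
    let evaluated = true &&& noisyTrue()  // right-hand side is called once
    return (skipped, evaluated, evaluations)
}

print(shortCircuitDemo())
```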

In my opinion, harming this base feature needs a very good reason behind it. I am arguing that the presented advantages of bare / delimiter do not meet that bar. I also believe that the way Swift literal syntax is defined, it does not look like an integral native part of Swift. To me, it feels like a compatibility layer, and not ready as the Swift regex literal syntax.

To better see what I mean, please read the Swift Canonical Syntax part of SE-0355.
It is also interesting that despite basically rejecting it, Modern Literal Syntax appears in Future Directions instead of Alternatives Considered. There is also this under Alternatives Considered:

I do agree that regex literals deserve Swift compiler support, but more like compiler support for some foreign language, not as a canonical part of the Swift language itself.

The classic regex literal syntax, even with the improvements and unifications of SE-0355, does not feel native to Swift to earn the full endorsement of being blessed as a fundamental part of the language syntax and harming other fundamental features to earn the lightest and most familiar delimiter.

It might even be too soon to commit to the proposed syntax as it is going to be next to impossible to meaningfully improve it without making the result more noisy than the legacy syntax. That is why I prefer this literal syntax to have some kind of prefix and keep a prefix-free representation for the real Swifty regex literal.

stephencelis · May 11, 2022, 1:24am

Maintainers of Case Paths and TCA here. Just wanted to weigh in because some of the discussion has gotten energetic around the symbol we incidentally squatted.

As far as the pitch goes, neither of us have a strong opinion as to what the right literal syntax is.

We do want to make it pretty clear, though, that we feel our library’s stake in the operator should not hold back language evolution. We built a library that introduced missing key path functionality to enums, and we did our best to emulate native key path syntax by adopting an available operator. Had that operator not been available we would have introduced a different syntax, and if the operator is retired, we will introduce different syntax.

Ideally, though, we retire the library entirely in favor of language-level support, and we’re happy to see interest from compiler engineers in this thread to make it a priority for Swift 6!

There’s been some back-and-forth in the discussion around the “impact” of such a change, especially with regard to TCA applications. We can only offer the statistics that are available to us, and they should be taken with a grain of salt. According to our GitHub repo, swift-case-paths currently gets about 10,000 clones per week. This may be a substantial number or not depending on your expectation, but at the end of the day we don’t think it’s something that should be used to make a decision. It cannot capture private forks, precompiled binaries, etc. We only share it for full transparency.

At the end of the day, if Swift 6 ships without language-level support for enum case paths, we will offer a migration strategy in our library for Swift 6 language mode. Case Paths uses the / operator for two main uses:

As key path functionality for enums, where it offers extract-embed functionality for enums with associated values that mirrors writable key path getter-setter functionality.

The best migration we envision is the following, which is to migrate uses of the / operator to an initializer (though we’re open to community feedback if folks have better ideas):
```
-.pullback(state: \Struct.property, action: /Enum.case)
+.pullback(state: \Struct.property, action: CasePath(Enum.case))
```
It’s more verbose and less symmetrical, but it gets the job done. Let’s hope for language-level syntax soon after!

As a key path literal function expression (laid out in SE-0249) equivalent for enum case paths, where case path literals are automatically promoted to extract functions.

Key path literals allow for:
```
users.map(\.name)
```
While case paths allow for:
```
results.compactMap(/Result.success)
```
Without literal syntax, we’ll likely require explicitly creating a case path and referencing its extract function instead:
```
-results.compactMap(/Result.success)
+results.compactMap(CasePath(Result.success).extract)
```
Again, more verbose and less symmetrical, but it gets the job done.
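To show the shape of the "extract function" use without depending on the library, here is a self-contained toy sketch. `ToyCasePath` is a hypothetical stand-in, not the swift-case-paths implementation. In particular, the extract closure is written by hand here, while the posts above describe the library deriving it from the case. The enum and function names are also made up.

```swift
// Toy stand-in for a case path: a pair of embed and extract functions.
struct ToyCasePath<Root, Value> {
    let embed: (Value) -> Root
    let extract: (Root) -> Value?
}

enum Download {
    case finished(Int)
    case failed(String)
}

// Collects the payloads of `.finished` cases by passing the extract function to compactMap.
func finishedSizes(in downloads: [Download]) -> [Int] {
    let finished = ToyCasePath<Download, Int>(
        embed: Download.finished,
        extract: { download in
            if case let .finished(bytes) = download { return bytes }
            return nil
        }
    )
    return downloads.compactMap(finished.extract)
}

print(finishedSizes(in: [.finished(10), .failed("timeout"), .finished(32)]))
```

The call site has the same shape as the `CasePath(Result.success).extract` spelling in the migration above: a named value whose `extract` member is passed as a function.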

TL;DR: We'd love language-level support for case paths, but we're here in the meantime!

xwu · May 11, 2022, 4:45am

Douglas_Gregor:

What about:
3.14159
vs.
3
.14159
Should we have picked a different syntax for tuple literals, array literals, dictionary literals, and floating point literals because the newline rule is weird and we can come up with confusing cases?

I'll share some other thoughts in a bit, but I think it's important here to remind ourselves that (as compared to other C-family languages) in fact we (Swift) did pick a different syntax for floating-point literals!

.14159 // error: '.14159' is not a valid floating point literal; it must be written '0.14159'

I think this is actually a pretty good example of how we deviated from a very common precedent among other languages in order to adapt a literal syntax to be a better fit for Swift.

xwu · May 11, 2022, 5:39am

Sliding in before the deadline again—sorry! Fortunately, I think many in the community have already expressed thoughts in line with mine, and I've already shared mine before in the pitch phase. In brief, I agree with the idea that top-notch support for regex literals is a major and apt next step for Swift. I share concerns that have been discussed at length in the more than 200 previous posts about the bare /.../ syntax.

It is incontrovertible that it breaks existing source and that it complicates the mental model of how we (as humans) will parse the code. That the heuristics are feasibly implementable in the lexer and parser and that the source breaks are minimizable are (for me) rather meager comfort set against the countervailing question as to whether any such effort is wise given that the #/.../# syntax is in any case being added and holds its own weight regardless of whether we have a bare syntax.

I agree with @Jumhyn that comparisons to await are inapt (since no counterpart such as #await# would bear its own weight); and in the meantime I think it is very salient what others have pointed out about the panoply of languages supporting bare /.../ that simultaneously support (or even encourage) the use of alternative delimiters.

I also find it unconvincing to think that #/.../# is difficult to teach because it is a two-character delimiter when we've always had, like many other languages, comments delimited thus: /* ... */.

Another way of phrasing the argument, but kind of belaboring the point so I'm going to hide it behind a disclosure triangle.

Put another way, thus far (and I apologize if I have unintentionally overlooked an argument made here), the arguments for the bare /.../ syntax principally bolster the case that it is a supremely elegant syntax. I agree. The arguments against the bare syntax generally boil down to the argument that it is a poor fit for Swift because it necessitates source breaks that alternative syntaxes do not. I agree. Both can be true simultaneously, and I don't think this will be the last time we encounter a scenario such as this.

For the sake of argument, let's say one recognizes that /.../ is infinitely more elegant than any other possible regex literal syntax. Now let's set against this the consideration that /.../ is also infinitely more source breaking than #/.../# (which is part of this proposal no matter what we conclude here). By that metric, not supporting bare /.../ would be an infinitely better fit for Swift's evolution if source breakage is even a minuscule factor in determining fit (since even a minuscule factor multiplied by infinity is still infinity).

Having "strong manned" both of these principal arguments, the question before us would boil down to this: Given two options A and B, where A is infinitely more elegant than B and B fits infinitely better into the existing language than A, which is the choice to adopt? Even the name of our decision-making process (Swift Evolution) speaks to the importance of path dependency, and my answer would be that even an infinitely more elegant solution must be discarded when an alternative design that is otherwise plausible has incomparably better fit.

So much for that argumentation.

If we are to pursue the addition of bare /.../ to the language, I would hope that we can undergo further focused revision before proposal acceptance on when the syntax would be usable.

To me, it is a bit uncomfortable how only fairly late in the process (as I recall) was it worked out that not only prefix /-containing operators but also infix operators with double slashes such as /*/ must be forbidden going forward to prevent ambiguity. In order that more such issues not be discovered down the road when the feature is already shipping, I would encourage exploring an approach opposite to that of making /.../ support as permissive as possible in the first go. Instead, since both the "insides" of the literal and the "outsides" surrounding it can have limits to prevent ambiguity, I wonder if additional rules could, for example:

make foo(a, /, b).reduce(1, /) parse unambiguously as it currently does, without restricting anything an average user might want to do with regex literals—can the logic for parenthesized unapplied operators, or something similar to it, be extended to take into account the closing parenthesis after b in that example because it follows the open parenthesis that precedes the operator?
parse foo(/,/) as being ambiguous rather than simply considering it to have a different arity in a future version of Swift—i.e., can we devise an explainable rule such that the compiler always requires disambiguation here with either parens—foo((/), (/)) (or backticks if that's something the authors want to add to the proposal)—or hashes—foo(#/,/#)?

johnno1962 · May 11, 2022, 5:47am

Which is, let's face it, a lot! As a parting thought I was thinking about this problem:

Paul_Cantrell:

The hidden change in the meaning of all whitespace when #/…/# contains a newline:
#/ (foo|bar)(d|f|t) /#
matches " foot "

…but IIUC…
#/ (foo|bar)
   (d|f|t) /#
does not match " foot "

…which seems to me like a footgun.

Has anybody given any consideration to the following syntax to enable the multi-line, whitespace ignoring version of a regex literal (which was referred to as extended mode in Perl):

#///
   (foo|bar)
   (d|f|t)
   ///#

I know this is an even more ponderous syntax but it might be worth it to give an extra confirmation something special is happening to the regex. IIRC, if done right this might result in a pleasing unification of the lexer code to tokenise string and regex literals which is probably a good indicator that the mental model for them is going to be more consistent and easier to "grok" for the user.

Paul_Cantrell · May 11, 2022, 2:07pm

Please note that my “foo|bar” example is incorrect, because there is not a newline immediately after the #/. My other example was a much better one:

#/hello world/#

matches "hello world"

#/
  hello world
/#

does not match "hello world"
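The behavior described above can be checked with a short sketch, assuming Swift 5.7 or later, where a multi-line `#/…/#` literal ignores whitespace in the pattern. The third literal escapes the space with a backslash so that it is matched again. The variable names and the `matchesWhole` helper are made up for the example.

```swift
// Single-line literal: the space is part of the pattern.
let singleLine = #/hello world/#

// Multi-line literal: whitespace in the pattern is ignored, so this matches "helloworld".
let multiLine = #/
  hello world
/#

// Multi-line literal with an escaped space, which is matched literally.
let escaped = #/
  hello\ world
/#

// Hypothetical helper: does the regex match the entire text?
func matchesWhole(_ regex: Regex<Substring>, _ text: String) -> Bool {
    text.wholeMatch(of: regex) != nil
}

print(matchesWhole(singleLine, "hello world"))
print(matchesWhole(multiLine, "hello world"))
print(matchesWhole(multiLine, "helloworld"))
print(matchesWhole(escaped, "hello world"))
```

Nothing in the source flags the change in meaning between the first two literals. The difference only shows up when the regex is run.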

johnno1962:

Has anybody given any consideration to the following syntax to enable the multi-line, whitespace ignoring version of a regex literal (which was referred to as extended mode in Perl):
#///
 (foo|bar)
 (d|f|t)
   ///#

Yes! Great minds think alike! (Other kinds of minds too, but we’ll ignore that.) And note that if (sorry, touching the radioactive topic again) we don't require #, then this would also work:

    ///
    (foo|bar)
    (d|f|t)
    ///

…which not only provides a tidy parallel to strings, but (1) solves the special-casing of /…/ vs #/…/# (where before the latter supported multiline but not the former), and (2) allows the compiler to intervene with guidance when a user inadvertently switches to ignored whitespace by inserting a newline. As I wrote above, as the proposal stands, that change in interpretation of whitespace in a multiline regex is a surprise, and [edited slightly]…

Paul_Cantrell:

…it's an especially insidious surprise, because there is no compile-time diagnostic to help; it’s only (possibly case-specific and easy-to-miss) runtime behavior that changes.

Once the bug is apparent, the answer is perhaps hard to discover. However, an explicit “ignore whitespace” mode could come with a helpful compiler diagnostic that both makes the behavior change visible at compile time, and makes the solution apparent:
error: regular expression literal spans multiple lines

fixit: use `///` to span multiple lines
       note: use `\ ` to match spaces inside a multiline regular expression

hamishknight · May 11, 2022, 2:08pm

Just to be clear, we are not proposing forbidding any infix operators containing /. The following example taken from the proposal will continue to parse as before:

infix operator /^/
func /^/ (lhs: Int, rhs: Int) -> Int { 0 }
let i = 0 /^/ 1

However, we did indeed discover that when an operator such as /^/ appears as an unapplied operator in a paren, tuple, or argument list, that would be turned into a regex literal. This is a particularly gnarly variant of the unapplied / case, but it can be disambiguated with a closure, e.g. { $0 /^/ $1 }.
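A minimal sketch of the closure workaround follows. The `/^/` operator is given a made-up meaning here (integer division rounded up) and a made-up precedence, purely so the example has something to compute. The operator in the proposal's example simply returns 0.

```swift
// A custom infix operator whose spelling contains `/`.
infix operator /^/: MultiplicationPrecedence

// Hypothetical meaning chosen for illustration: integer division, rounded up.
func /^/ (lhs: Int, rhs: Int) -> Int {
    (lhs + rhs - 1) / rhs
}

// Ordinary infix use.
let applied = 7 /^/ 2

// Instead of passing the operator unapplied, e.g. reduce(100, /^/),
// wrap it in a closure so that it appears in infix position.
let folded = [3, 4].reduce(100) { $0 /^/ $1 }

print(applied, folded)
```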

Michael_Ilseman · May 12, 2022, 3:08am

Paul_Cantrell:

Mismatch in optionality between literals and the DSL for nested capture groups:

/(.)*|\d/

→ match type of (Substring, Substring?)

…but IIUC…
ChoiceOf {   // supposed to be equiv to above; don't know DSL well yet; making up details
  Capture {
    ZeroOrMore(.any)
  }
  .digit
}
→ match type of (Substring, Substring??)

If there was a way to have the DSL automatically coalesce multiply nested optionals, we would go with that. Similarly for supporting tuple labels for named captures. Unfortunately, there are language limitations stopping this. We explored many workarounds, but ultimately found that there'd need to be some explicit operation on the Regex to do this. The best approach was also the most generally useful and sought after, which is to pull in mapOutput from future work into the newest DSL revision.