SE-0354: Regex Literals

Ben_Cohen · April 28, 2022, 2:22pm

Hello, Swift community.

The review of SE-0354: Regex Literals begins now and runs through May 10, 2022.

This review is part of a collection of proposals for better string processing in Swift. The proposal authors have put together a proposal overview with links to in-progress pitches and reviews. This proposal introduces a literal syntax for the Regex type to the language. It will be run simultaneously with a proposal regarding the syntax for constructing that type from a String or literal.

As with the concurrency initiative last year, the core team acknowledges that reviewing a large number of interlinked proposals can be challenging. In particular, acceptance of one of the proposals should be considered provisional on future discussions of follow-on proposals that are closely related but have not yet completed the evolution review process. Similarly, reviewers should hold back on in-depth discussion of a subject of an upcoming review. Please do your best to review each proposal on its own merits, while still understanding its relationship to the larger feature.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to the review manager. If you do email me directly, please put "SE-0354" somewhere in the subject line.

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer in your review:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/main/process.md

As always, thank you for contributing to Swift.

Ben Cohen

Review Manager

Ben_Cohen · April 28, 2022, 2:35pm

A note on terminology:

The word "regex" is somewhat overloaded in this discussion. I'd suggest the following to avoid talking at cross-purposes:

If you mean the type introduced in SE-0350, use backticks to refer to it in code voice: `Regex`
If you mean the terse string-based syntax, clarify that with "regex syntax" or "regex strings"
If you mean the literal syntax as proposed, or compared to regex literals in other languages, use the term "regex literals"
If you mean the things that can be expressed with regexes, use terms like "regular language" or "regular grammar" (though note, modern regexes don't map exactly onto these specific things).

1-877-547-7272 · April 28, 2022, 4:10pm

Small nitpick: this should be SE-0354 right?

Jumhyn · April 28, 2022, 4:13pm

Broadly, +0.5.

I remain convinced by arguments from the pitch thread that the 'bare' /.../ syntax is not obviously worth the breakage, and I see no real reason why it needs to be included in this proposal. In the interest of more conservative evolution that is informed by usage, I'd prefer to see this proposal tackle the #/.../# syntax to bring regex literals into the language, and a later proposal address the question of "do we need a more terse syntax?" once the community has had broader experience with the base feature.

As it stands, the bare syntax won't be available by default until Swift 6 which is an unknown number of months (years?) away, so I don't think there's a huge amount of functional difference between splitting this into two proposals, one of which could be evaluated closer to Swift-6-time. However, I think it does make a difference process-wise. IMO combining both syntaxes into this proposal muddies the evolution process and makes it more difficult for the community to evaluate each aspect individually.

My subjective impression from the pitch thread is that there's overwhelming support for the #/.../# syntax, but much more apprehension about the /.../ syntax. I'd hate for us to get caught up in the excitement of introducing regex literals to the language (which is very exciting!) and adopt a syntax which does not pull its weight compared to the difficult-to-quantify impact of a source break in terms of developer effort and harm to goodwill.

davedelong · April 28, 2022, 4:38pm

Overall, +0.75. Having regex literals will be great, but I am also concerned about the breakage that /.../ will cause. My vote is to eschew the slashes entirely and use #regex(...).

Karl · April 28, 2022, 5:06pm

It would be nice if the proposal included a summary of what other languages use, given that familiarity is a big motivation behind so many regex features. Do all other languages use bare /-es?

This is all I could find:

Proposed Solution:

Forward slashes are a regex term of art. They are used as the delimiters for regex literals in, e.g., Perl, JavaScript and Ruby. Perl and Ruby additionally allow for user-selected delimiters to avoid having to escape any slashes inside a regex. For that purpose, we propose the extended literal #/.../# .

[...]

Alternatives considered:

Given the fact that /.../ is an existing term of art for regular expressions, we feel it should be the preferred delimiter syntax. It should be noted that the syntax has become less popular in some communities such as Perl, however we still feel that it is a compelling choice, especially with extended delimiters #/.../# .

This seems to be suggesting that maybe / is for backwards compatibility? But then lots of communities are abandoning it for less-portable custom alternatives (like we are, with "extended literals"). But also everyone is using it?

If they're all going towards custom delimiters (or extended literals, in our case), why should we even introduce the bare / syntax and go through this?

This could all be a lot clearer. Which syntaxes are in common use?

Michael_Ilseman · April 28, 2022, 7:14pm

No, backwards compatibility for a delimiter across languages is not a thing. You might be thinking of regex syntax, i.e. the parts in between delimiters, which does aim for broad compatibility. This is under simultaneous review in SE-0355: Regex Syntax and Runtime Construction.

Karl · April 28, 2022, 7:30pm

They are related. The delimiters decide which parts of the contents need to be escaped and how.

In other words: the reason we can't go with #/.../# exclusively is that existing regexes using /.../ would need to be unescaped and possibly re-escaped for the new delimiter. Is that the argument for adopting this syntax despite it being source-breaking?

Because as you say, backwards compatibility for delimiters is not a concern other than that. That would seem to be a strong argument against adopting this syntax, "term of art" or not, because it is source-breaking.

Michael_Ilseman · April 28, 2022, 8:00pm

No, that is not the argument at all and would be quite a bizarre strawperson.

Perhaps the prose could use a little elaboration, but that seems orthogonal to your argument about compatibility.

edit: On the topic of escaping a literal / inside of an extended regex literal: #/a/b/# is equivalent to #/a\/b/#, as slashes inside regex syntax can either be escaped or not. Similarly, Regex(#"a/b"#) and Regex(#"a\/b"#) are the same.

CC @hamishknight on whether the compiler will warn for this. Warnings for unneeded escapes are most valuable for letters or other things that might be confused with known escape sequences. Terminology quickly becomes confusing as we use language to talk about a language integrated into another language...
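
As a rough illustration of the equivalence described in the edit above, the sketch below checks that all four spellings accept the same input. It assumes a Swift 5.7 or later toolchain where the extended #/.../# literal and the Regex type are available; the variable names are made up for the example.

```swift
// Extended-delimiter literals: the inner `/` may be left bare or escaped.
let unescapedLiteral = #/a/b/#
let escapedLiteral = #/a\/b/#

// Run-time construction from raw strings, mirroring the post.
// The initializer throws if the pattern fails to parse.
let unescapedRuntime = try Regex(#"a/b"#)
let escapedRuntime = try Regex(#"a\/b"#)

let input = "a/b"

// Each entry is true when the corresponding regex matches the whole input.
let allMatch = [
    input.wholeMatch(of: unescapedLiteral) != nil,
    input.wholeMatch(of: escapedLiteral) != nil,
    input.wholeMatch(of: unescapedRuntime) != nil,
    input.wholeMatch(of: escapedRuntime) != nil,
]

print(allMatch)
```

Whether the compiler warns about the redundant escape is the open question raised in the post; this sketch only compares matching behavior.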

Ben_Cohen · April 28, 2022, 9:24pm

Note, while it will be gated under a Swift 6 language mode, this proposal also includes a Swift driver flag that will let users bring forward the syntax in Swift 5 mode: -enable-bare-regex-syntax.

Some readers of the proposal might have missed this, since sometimes in these proposals, the implementation is merged to main under a flag you need to use to enable the feature before it's been through evolution (this was how the concurrency implementation was handled last year). If the proposal is accepted, that flag goes away as it's just part of the language without needing to be enabled. But in this case, -enable-bare-regex-syntax is a flag that is actually being proposed as a production flag that would ship for pre-Swift 6 mode use.

Jumhyn · April 28, 2022, 9:35pm

Right, I was (perhaps too subtly) gesturing towards that with “by default.”

My evaluation isn’t swayed much in either direction by the presence of this flag. The strongest proponents of the bare syntax will likely use it judiciously, but that’s the cohort that would be most accepting of the source break anyway. It seems to me like accepting this flag just creates a “bare regex” dialect of Swift 5, and it’s not clear to me what benefit it provides over an unreviewed -enable-experimental-bare-regex flag.

YOCKOW · April 28, 2022, 11:53pm

Sorry, but let me claim it again.

[Pitch #2] Regex Literals

General remarks about bare /.../ :

I guess the reason why some people support /.../ is because "we have seen it in other languages".
However, we have to remember that Swift is different from other languages in many senses.

First, (as mentioned repeatedly in this thread,) we can define prefix/infix/postfix operators containing / .
The authors simply think it is enough to change the syntax rules of Swift, but the fact that a certain number of projects will be broken has come to light.
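
To make the first point concrete, here is a minimal sketch of the kind of existing code it refers to: a custom prefix operator spelled / and an infix operator containing /. The operators and their meanings are invented for illustration, and the snippet assumes Swift 5 language mode with bare /.../ literals not enabled.

```swift
// Hypothetical prefix operator: `/x` means the reciprocal of x.
prefix operator /
prefix func / (value: Double) -> Double {
    1 / value
}

// Hypothetical infix operator containing `/`: division rounded up.
infix operator /^ : MultiplicationPrecedence
func /^ (lhs: Int, rhs: Int) -> Int {
    (lhs + rhs - 1) / rhs
}

let quarter = /4.0      // prefix use of `/`
let pages = 7 /^ 2      // infix use of an operator containing `/`

print(quarter, pages)
```

Code shaped like the prefix use is what a bare /.../ literal would have to be disambiguated from, which is the source-compatibility concern being raised here.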

Second, regexes in Swift may differ from the ones in other languages.
Although this is out of scope of this pitch, it is related still.

Declarative String Processing Overview

When applied with grapheme-cluster semantics (the default if applied to String ), it would match grapheme-cluster by grapheme-cluster and comparison obeys canonical equivalence. There are some features that might not be supported, e.g. generalizing some scalar properties to grapheme clusters. Resulting indices would be grapheme-cluster aligned.

Such a feature would confuse some folks, especially those coming from other languages.
Let me quote my opinion from pitch#1 thread:

[Pitch] Regular Expression Literals

My thoughts are

If Swift adopts literals similar to those of, e.g., Perl, the semantics should be similar to Perl's.

If Swift adopts semantics different from those of, e.g., Perl, the literals should be different from Perl's.

That would be called perceived affordance.

Lastly, Swift has a sublime philosophy (I hope).
I agree that /.../ is simple and easy to write.
However, to be simple is not enough to be good in Swift.
I want to quote Mr. Lattner's utterance:

`if let` shorthand

The goal of Swift language design isn't to minimize number of characters in code, it is to find a balance between "expressivity", and "readability". Readability isn't just "what does this line of code do" it is a deeper "what is the lifecycle of maintaining and evolving a codebase, particularly when it is worked on by multiple people, or the project spans years of development".

/.../ will certainly break Swift.
/.../ is not Swift's.
Does Swift have to borrow the syntax from others to break itself?
Will we get more benefit from /.../ than loss from it?
Think different.

ensan-hcl · April 29, 2022, 4:23am

+1 for adding regex literals, mainly as a way to write short expressions instead of ZeroOrMore or ChoiceOf. That would be reasonable in that such usage would not spoil readability and learnability too much but would enable quick implementation. In addition, +1 for the suggestion that we should review the delimiter proposal and the literal proposal separately.

-1 for adding named captures or other capabilities which the Regex DSL does not or cannot have. I believe developers should use the Regex DSL in principle, for readability. Regex literals should not have useful features which the DSL cannot support, because that can become a reason to choose regex literals unwillingly. Of course, when the Regex DSL finally gets such abilities, they can also be added to regex literals.

By the way, has adding regex operators been considered, like "a"+ or regex1 | regex2? I thought that could be a more general way to support a regex-literal-like syntax.

jayton · April 29, 2022, 8:51am

This seems like an astonishing departure from precedent – not doing this has heretofore been a central design principle of Swift – to the point that this should probably be the main focus of the review.

rvsrvs · April 29, 2022, 12:06pm

What is your evaluation of the proposal?

Strong +1 on the overall proposal.

-1 on / ... / delimiter
+0.5 on #/ ... /#
+1 on #regex(...)

As I stated in the pitch thread, the additional value of reserving / ... / to delimit regex literals would have made great sense prior to, say, the great renaming of Swift 3. But at this point it disrupts existing codebases for what seems to be the somewhat illusory gains of: 1) avoiding a slight amount of visual clutter and 2) allowing people who are new to the language to have a sense of familiarity derived from similar notation in other languages.

In the proposal as put forth, the #/ ... /# will have to be available anyway, so it seems we still get the visual clutter (though not always) plus the cognitive overhead of learning the rules on when we have to use the more cluttered syntax. The combination of these things seems to me to not clear the bar required for source-breaking changes, which at this point needs to be pretty high.

In another sense, I am strongly drawn to the extensibility of the #regex( ... ) syntax as it makes any similar future extensions to the language purely additive.

I am also strongly drawn to the idea of providing an external tool to take PCRE strings and convert them to the Swift Result Builder syntax so that we could dispense with PCRE literals in the language entirely. I suspect that for maintainability and debugging reasons this is the way regular grammars will be incorporated in many code bases. We always use a standard example of cutting and pasting a regex literal from Stack Overflow into Swift code. I can say for a fact that these won't be allowed in the codebases I currently supervise, and that conversion to result builder format will be required. So there's a strong case to be made for acknowledging and encouraging that pattern. But I recognize others feel differently, especially if the goal is to build Swift into a strong ETL scripting language.

Is the problem being addressed significant enough to warrant a change to Swift?

Absolutely.

Does this proposal fit well with the feel and direction of Swift?

Generally. But we should acknowledge that we are adding yet another layer of cognitive load to the language and that it's not clear there's enough progressive disclosure here to allay that concern.

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

Quite favorably. I've done a fair amount of Perl regex handling in the past and I wish I'd had this tool then.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I've been thinking about this issue for a long time because I have an abiding interest in using Swift as a scripting language. Verbosity and clumsiness of string handling in general is the largest obstacle to that. I read the pitch threads and the proposal in detail.

masters3d · April 29, 2022, 3:01pm

+1 but only support extended literal #/.../#

-1 for bare literal /.../

schutt · April 29, 2022, 6:18pm

What is your evaluation of the proposal?

+1

Is the problem being addressed significant enough to warrant a change to Swift?

Yes. A regex literal would be very useful, especially with compile-time checking.

Does this proposal fit well with the feel and direction of Swift?

Yes. Allowing named captures to drive tuple elements fits right in.

I also appreciate the proposed solution for including regex literals within more complex Regex result builders for cases when that may be clearer than a multiline regex literal.

Can these two concepts combine to get named tuple elements from using a regex literal within a result builder? If not, this seems to be a missed opportunity.
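
For readers unfamiliar with the feature mentioned above, this is roughly what named captures driving tuple elements looks like with the extended literal. It is a sketch assuming a Swift 5.7 or later toolchain; the pattern and names are invented for the example, and it does not answer the question about literals inside result builders.

```swift
// Named capture groups become labeled elements of the regex's output tuple,
// so the literal's type is Regex<(Substring, year: Substring, month: Substring)>.
let datePattern = #/(?<year>\d{4})-(?<month>\d{2})/#

// wholeMatch(of:) returns nil when the input does not match in full.
let match = "2022-04".wholeMatch(of: datePattern)

// Access captures by label rather than by position.
let year = match.map { String($0.output.year) }
let month = match.map { String($0.output.month) }

print(year ?? "no match", month ?? "no match")
```

Because the labels are part of the type, a misspelled capture name is rejected at compile time rather than failing at run time.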

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

It's good to see a common format used here.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I've been following the previous discussions and proposals.

hooman · April 29, 2022, 7:40pm

Enthusiastic +1 on the goal:

I have reservations about feature mismatch between literal and DSL, but can accept that provided we quickly move in the direction of adding language features necessary to have DSL and literal parity.

Concerning delimiters:

I clearly see that a lot of thought and effort went into making /.../ work. The huge amount of work invested in this choice proves that it is impossible to make it work without serious compromises. The compromises are twofold: You propose we both remove/restrict features from the language and simultaneously make it more complex with new special rules and exceptions. I don't like this, even if the features are rarely used and special rules rarely encountered.

I honestly think that the main reason for defending this delimiter is the big effort already invested in it. I also clearly see that there has not been enough serious work and effort on making any of the alternatives work well.

Considering how close we are to WWDC, is there any real chance/possibility that the core team will seriously consider anything other than accepting /.../ as is?

If there is, then I will have more to say.

Ben_Cohen · April 29, 2022, 7:47pm

Please review the proposal without concern to implementation timeframes. If you believe there are better alternatives, it's best to lay them out and explain why you feel they're preferable.

Please don't second-guess people's motivations like this – it's not appropriate. The proposal authors have laid out a case for why they believe that /.../ is the best choice for the language, based on factors like look and feel, familiarity with other languages etc. This should be taken at face value, and engaged with as presented in good faith rather than making accusations that the true reason is sunk cost or something else.

hooman · April 29, 2022, 8:11pm

My apologies if I came across as accusatory. That is the only plausible explanation I could come up with. I don't mean any disrespect to any of the people involved, and apologize if I did.

I have seen strong opposition from the core team when it comes to ideas that result in the removal/restriction of (otherwise harmless) features and simultaneously add complexity by introducing new exceptions and special rules.

To me, the choice of the delimiter does not look so fundamental as to warrant such an invasive change to the language. And all of the discussion so far has not convinced me otherwise. Some very common languages such as Python don't even have a dedicated delimiter for regex literals. And most of the ones that already use /.../ are already offering alternatives.

Since I can't explain the current choice any other way to myself, I thought I better bring it up. This gives you a chance to clarify further why you think this is a worthy tradeoff and why /.../ is so fundamental. I don't think what has been said so far in defense of /.../ is enough justification for the damage to the language. I don't think I am the only one who sees it this way. Maybe others are too polite to bring it up.

I will put up my suggestions in my next post.