SE-0354: Regex Literals

A note on terminology:

The word "regex" is somewhat overloaded in this discussion. I'd suggest the following to avoid talking at cross-purposes:

  • If you mean the type introduced in SE-0350, use backticks to refer to it in code voice: `Regex`
  • If you mean the terse string-based syntax, clarify that with "regex syntax" or "regex strings"
  • If you mean the literal syntax as proposed, or compared to regex literals in other languages, use the term "regex literals"
  • If you mean the things that can be expressed with regexes, use terms like "regular language" or "regular grammar" (though note, modern regexes don't map exactly onto these specific things).
10 Likes

Small nitpick: this should be SE-0354 right?

1 Like

Broadly, +0.5.

I remain convinced by arguments from the pitch thread that the 'bare' /.../ syntax is not obviously worth the breakage, and I see no real reason why it needs to be included in this proposal. In the interest of more conservative evolution that is informed by usage, I'd prefer to see this proposal tackle the #/.../# syntax to bring regex literals into the language, and a later proposal address the question of "do we need a more terse syntax?" once the community has had broader experience with the base feature.

As it stands, the bare syntax won't be available by default until Swift 6 which is an unknown number of months (years?) away, so I don't think there's a huge amount of functional difference between splitting this into two proposals, one of which could be evaluated closer to Swift-6-time. However, I think it does make a difference process-wise. IMO combining both syntaxes into this proposal muddies the evolution process and makes it more difficult for the community to evaluate each aspect individually.

My subjective impression from the pitch thread is that there's overwhelming support for the #/.../# syntax, but much more apprehension about the /.../ syntax. I'd hate for us to get caught up in the excitement of introducing regex literals to the language (which is very exciting!) and adopt a syntax which does not pull its weight compared to the difficult-to-quantify impact of a source break in terms of developer effort and harm to goodwill.

27 Likes

Overall, +0.75. Having regex literals will be great, but I am also concerned about the breakage that /.../ will cause. My vote is to eschew the slashes entirely and use #regex(...).

13 Likes

It would be nice if the proposal included a summary of what other languages use, given that familiarity is a big motivation behind so many regex features. Do all other languages use bare /-es?

This is all I could find:

Proposed Solution:

Forward slashes are a regex term of art. They are used as the delimiters for regex literals in, e.g., Perl, JavaScript and Ruby. Perl and Ruby additionally allow for user-selected delimiters to avoid having to escape any slashes inside a regex. For that purpose, we propose the extended literal #/.../# .

[...]

Alternatives considered:

Given the fact that /.../ is an existing term of art for regular expressions, we feel it should be the preferred delimiter syntax. It should be noted that the syntax has become less popular in some communities such as Perl, however we still feel that it is a compelling choice, especially with extended delimiters #/.../# .

This seems to be suggesting that maybe / is for backwards compatibility? But then lots of communities are abandoning it for less-portable custom alternatives (like we are, with "extended literals"). But also everyone is using it?

If they're all going towards custom delimiters (or extended literals, in our case), why should we even introduce the bare / syntax and go through this?

This could all be a lot clearer. Which syntaxes are in common use?

6 Likes

No, backwards compatibility for a delimiter across languages is not a thing. You might be thinking of regex syntax, i.e. the parts in between delimiters, which does aim for broad compatibility. This is under simultaneous review in SE-0355: Regex Syntax and Runtime Construction.

2 Likes

They are related. The delimiters decide which parts of the contents need to be escaped and how.

In other words: the reason we can't go with #/.../# exclusively is that existing regexes using /.../ would need to be unescaped and possibly re-escaped for the new delimiter. Is that the argument for adopting this syntax despite it being source-breaking?

Because as you say, backwards compatibility for delimiters is not a concern other than that. That would seem to be a strong argument against adopting this syntax, "term of art" or not, because it is source-breaking.

6 Likes

No, that is not the argument at all and would be quite a bizarre strawperson.

Perhaps the prose could use a little elaboration, but that seems orthogonal to your argument about compatibility.

edit: On the topic of escaping a literal / inside of an extended regex literal: #/a/b/# is equivalent to #/a\/b/#, as slashes inside regex syntax can either be escaped or not. Similarly, Regex(#"a/b"#) and Regex(#"a\/b"#) are the same.
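A minimal sketch of the equivalence described above, assuming Swift 5.7 or later. The variable names and the sample input are illustrative and not taken from the proposal:

```swift
// Extended regex literals: the inner slash may be written bare or escaped.
let bareLiteral = #/a/b/#
let escapedLiteral = #/a\/b/#

// Run-time construction from raw strings; this initializer can throw.
let bareString = try Regex(#"a/b"#)
let escapedString = try Regex(#"a\/b"#)

// All four patterns should find the same "a/b" substring.
let input = "x a/b y"
let results = [
    input.contains(bareLiteral),
    input.contains(escapedLiteral),
    input.contains(bareString),
    input.contains(escapedString),
]
print(results)
```

The extended delimiters are used here so the sketch does not depend on the bare /.../ syntax being enabled.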

CC @hamishknight on whether the compiler will warn for this. Warnings for unneeded escapes are most valuable for letters or other things that might be confused with known escape sequences. Terminology quickly becomes confusing as we use language to talk about a language integrated into another language...

2 Likes

Note, while it will be gated under a Swift 6 language mode, this proposal also includes a Swift driver flag that will let users bring forward the syntax in Swift 5 mode: -enable-bare-regex-syntax.

Some readers of the proposal might have missed this, since sometimes in these proposals, the implementation is merged to main under a flag you need to use to enable the feature before it's been through evolution (this was how the concurrency implementation was handled last year). If the proposal is accepted, that flag goes away as it's just part of the language without needing to be enabled. But in this case, -enable-bare-regex-syntax is a flag that is actually being proposed as a production flag that would ship for pre-Swift 6 mode use.

7 Likes

Right, I was (perhaps too subtly) gesturing towards that with “by default.”

My evaluation isn’t swayed much in either direction by the presence of this flag. The strongest proponents of the bare syntax will likely use it judiciously, but that’s the cohort that would be most accepting of the source break anyway. It seems to me like accepting this flag just creates a “bare regex” dialect of Swift 5, and it’s not clear to me what benefit it provides over an unreviewed -enable-experimental-bare-regex flag.

6 Likes

Sorry, but let me claim it again.

10 Likes

+1 for adding regex literals, mainly as a way to write short expressions instead of ZeroOrMore or ChoiceOf. That seems reasonable, since such usage would not hurt readability and learnability too much while enabling quick implementation. In addition, +1 for the suggestion that we should review the delimiter proposal and the literal proposal separately.
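To make the comparison concrete, here is a small sketch of the same pattern written both ways, assuming Swift 5.7 or later with the RegexBuilder module. The pattern (optional digits followed by a unit) and the sample strings are made up for illustration:

```swift
import RegexBuilder

// Terse literal form: zero or more digits, then "px" or "em".
let literal = #/\d*(?:px|em)/#

// The same pattern spelled out with the builder DSL.
let dsl = Regex {
    ZeroOrMore(.digit)
    ChoiceOf {
        "px"
        "em"
    }
}

// Both forms should accept and reject the same inputs.
let samples = ["12px", "em", "3em", "12pt"]
let literalResults = samples.map { $0.wholeMatch(of: literal) != nil }
let dslResults = samples.map { $0.wholeMatch(of: dsl) != nil }
print(literalResults, dslResults)
```

For a pattern this short the literal is arguably easier to scan; the DSL form tends to pay off as patterns grow.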

-1 for adding named captures or other capabilities which the Regex DSL does not or cannot have. I believe developers should use the Regex DSL in principle, for readability. Regex literals should not have useful features which the DSL cannot support, because that can become a reason to choose regex literals unwillingly. Of course, when the Regex DSL finally gets such abilities, they can also be added to regex literals.

By the way, has adding regex operators, like "a"+ or regex1 | regex2, been considered? I thought that could be a more general way to support regex-literal-like syntax.

4 Likes

This seems like an astonishing departure from precedent – not doing this has heretofore been a central design principle of Swift – to the point that this should probably be the main focus of the review.

12 Likes
  • What is your evaluation of the proposal?

Strong +1 on the overall proposal.

-1 on / ... / delimiter  
+ .5 on #/ ... /# 
+1 on #regex(...)

As I stated in the pitch thread, the additional value of reserving / ... / to delimit regex literals would have made great sense prior to say, the great renaming of Swift 3. But at this point it disrupts existing codebases for what seems to be the somewhat illusory gains of: 1) avoiding a slight amount of visual clutter and 2) allowing people who are new to the language to have a sense of familiarity derived from similar notation in other languages.

In the proposal as put forth, the #/ ... /# will have to be available anyway, so it seems we still get the visual clutter (though not always) plus the cognitive overhead of learning the rules on when we have to use the more cluttered syntax. The combination of these things seems to me to not clear the bar required for source-breaking changes, which at this point needs to be pretty high.
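As one illustration of the kind of existing code at stake: my reading of the proposal's source-compatibility discussion is that a call passing two bare / operators, such as combine(/, /), would start lexing as a regex literal once the bare syntax is on, and that parenthesizing the operator references avoids this. The helper below is hypothetical, and this is a sketch under that assumption rather than a statement of the final rules:

```swift
// Hypothetical helper that takes two binary operators as arguments.
func combine(_ f: (Int, Int) -> Int, _ g: (Int, Int) -> Int) -> Int {
    f(12, 3) + g(8, 2)
}

// Writing the call as combine(/, /) is the form assumed to be affected
// by the bare syntax. Parenthesizing each operator reference is valid
// with or without bare regex literals enabled.
let result = combine((/), (/))
print(result)
```

The parenthesized form compiles today, so code can be migrated ahead of any language-mode switch.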

In another sense, I am strongly drawn to the extensibility of the #regex( ... ) syntax as it makes any similar future extensions to the language purely additive.

I am also strongly drawn to the idea of providing an external tool to take PCRE strings and convert them to the Swift result builder syntax so that we could dispense with PCRE literals in the language entirely. I suspect that, for maintainability and debugging reasons, this is the way regular grammars will be incorporated in many codebases. We always use the standard example of cutting and pasting a regex literal from Stack Overflow into Swift code; I can say for a fact that these won't be allowed in the codebases I currently supervise, and that conversion to result builder format will be required. So there's a strong case to be made for acknowledging and encouraging that pattern. But I recognize others feel differently, especially if the goal is to build Swift into a strong ETL scripting language.

  • Is the problem being addressed significant enough to warrant a change to Swift?

Absolutely.

  • Does this proposal fit well with the feel and direction of Swift?

Generally. But we should acknowledge that we are adding yet another layer of cognitive load to the language and that it's not clear there's enough progressive disclosure here to allay that concern.

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

Quite favorably. I've done a fair amount of Perl regex handling in the past and I wish I'd had this tool then.

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I've been thinking about this issue for a long time because I have an abiding interest in using Swift as a scripting language. The verbosity and clumsiness of string handling in general is the largest obstacle to that. I read the pitch threads and the proposal in detail.

7 Likes

+1 but only support extended literal #/.../#

-1 for bare literal /.../

7 Likes
  • What is your evaluation of the proposal?

+ 1

  • Is the problem being addressed significant enough to warrant a change to Swift?

Yes. A regex literal would be very useful, especially with compile-time checking.

  • Does this proposal fit well with the feel and direction of Swift?

Yes. Allowing named captures to drive tuple elements fits right in.

I also appreciate the proposed solution for including regex literals within more complex Regex result builders for cases when that may be clearer than a multiline regex literal.
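A rough sketch of the two features mentioned above, shown separately and assuming Swift 5.7 or later with RegexBuilder. The patterns, inputs, and variable names are illustrative only:

```swift
import RegexBuilder

// Named captures in a literal become labeled tuple elements of the output.
let date = #/(?<year>\d{4})-(?<month>\d{2})/#
var year = ""
var month = ""
if let match = "2022-05".wholeMatch(of: date) {
    year = String(match.output.year)
    month = String(match.output.month)
}

// Regex literals used as components inside a Regex result builder.
let entry = Regex {
    "id:"
    #/\s+/#
    Capture { #/\d+/# }
}
var id = ""
if let match = "id:  42".wholeMatch(of: entry) {
    // Captures created by the builder are accessed by position.
    id = String(match.output.1)
}
print(year, month, id)
```

The second half deliberately uses an unnamed capture, so it does not take a position on how capture names behave once a literal is embedded in a builder.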

Can these two concepts combine to get named tuple elements from using a regex literal within a result builder? If not this seems to be a missed opportunity.

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

It's good to see a common format used here.

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I've been following the previous discussions and proposals.

Enthusiastic +1 on the goal:

I have reservations about feature mismatch between literal and DSL, but can accept that provided we quickly move in the direction of adding language features necessary to have DSL and literal parity.

Concerning delimiters:

I clearly see that a lot of thought and effort went into making /.../ work. The huge amount of work invested in this choice proves that it is impossible to make it work without serious compromises. The compromises are twofold: You propose we both remove/restrict features from the language and simultaneously make it more complex with new special rules and exceptions. I don't like this, even if the features are rarely used and special rules rarely encountered.

I honestly think that the main reason for defending this delimiter is the big effort already invested in it. I also clearly see that there has not been enough serious work and effort on making any of the alternatives work well.

Considering how close we are to WWDC, is there any real chance/possibility that the core team will seriously consider anything other than accepting /.../ as is?

If there is, then I will have more to say.

8 Likes

Please review the proposal without concern to implementation timeframes. If you believe there are better alternatives, it's best to lay them out and explain why you feel they're preferable.

Please don't second-guess people's motivations like this – it's not appropriate. The proposal authors have laid out a case for why they believe that /.../ is the best choice for the language, based on factors like look and feel, familiarity with other languages etc. This should be taken at face value, and engaged with as presented in good faith rather than making accusations that the true reason is sunk cost or something else.

6 Likes

My apologies if I came across as accusatory. That is the only plausible explanation I could come up with. I don't mean any disrespect to any of the people involved, and apologize if I did.

I have seen strong opposition from the core team when it comes to ideas that result in the removal/restriction of (otherwise harmless) features and simultaneously add complexity by introducing new exceptions and special rules.

To me, the choice of the delimiter does not look so fundamental as to warrant such an invasive change to the language. And all of the discussion so far has not convinced me otherwise. Some very common languages such as Python don't even have a dedicated delimiter for regex literals. And most of the ones that already use /.../ are already offering alternatives.

Since I can't explain the current choice any other way to myself, I thought I had better bring it up. This gives you a chance to clarify further why you think this is a worthy tradeoff and why /.../ is so fundamental. I don't think what has been said so far in defense of /.../ is enough justification for the damage to the language. I don't think I am the only one who sees it this way. Maybe others are too polite to bring it up.

I will put up my suggestions in my next post.

5 Likes

Part 1
Having dedicated literal: +1. I already stated my support for the goal of having regex literals in Swift source code that provide compile-time checks and typed-capture inference.
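A small sketch of what those two benefits look like in practice, assuming Swift 5.7 or later. The patterns and variable names are made up for illustration:

```swift
// Literal: the pattern is checked at compile time, and the output type
// is inferred as a tuple of the whole match plus one capture.
let literal = #/(\d+)px/#
var width = ""
if let match = "640px".wholeMatch(of: literal) {
    width = String(match.output.1)
}

// String-based construction: the pattern is parsed at run time, so a
// malformed pattern only surfaces as a thrown error.
var runtimeFailure = false
do {
    _ = try Regex(#"(\d+px"#) // missing closing parenthesis
} catch {
    runtimeFailure = true
}
print(width, runtimeFailure)
```

Written as a literal, the malformed pattern in the second half would be rejected by the compiler instead of failing when the code runs.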

Part 2
Choice of delimiter. I think the magic literal alternative is not adequately considered. The idea of replacing the internal delimiter with another character, specifically /, is not adequately considered. The only justification provided is that we have not done this before. Are there any other down sides to it besides it being novel?

The advantage of #regex/.../ is that, for the contents of the literal, it has exactly the same behavior as if the literal was /.../. It also scores on familiarity, as it is the same literal as in, say, Perl, with a prefix. The other advantage is opening the door for normalizing this type of syntax for other compile-time checked and potentially library-based new types of literals (like property wrappers). We should be able to be flexible about the specific delimiter used after the #whatever and, if we support library-defined custom literals, we could let the library specify the supported delimiters. For example, both #sql"...." and #sql'...'. We should also be able to easily extend it to support multi-line variants.

The down sides that I see are being novel and its verbosity.

I don't see the first one (bringing something new to the language) as inherently bad, unless there is a clearly better alternative within the current norms, and I don't see one. Please correct me if I am wrong on this.

The verbosity: this is more subjective and depends on people's preferences, uses and experiences. It can mostly be addressed by privileging regexes, as the most common case of such delimiters, and allowing them to use #/.../ (as opposed to a shortened magic literal) in addition to the already proposed #/.../#.

I also think we could explore, a bit deeper, using single quotes without completely sacrificing them to this use case. For example, we could ask that '/' and '//' literals be written as ''/' and ''//', which would enable using '/.../' to achieve minimal noise while keeping single quotes usable for things like ASCII literals (e.g. 'A') and preserving source compatibility. I know ''/' and ''//' would be special cases, but they would come after '/.../' is established. We could even reserve the whole starting-with-a-delimiter case to support other literal types with other delimiters (e.g. '!...!').

5 Likes