SE-0354 (Second Review): Regex Literals

Ben_Cohen · May 16, 2022, 3:16pm

Hello, Swift community.

The second review of SE-0354: Regex Literals begins now and runs through May 23, 2022.

The core team has decided to run a second review while accepting in principle the need for a regex literal and the use of /.../ as the delimiter.

The majority of discussion in the first review was regarding the choice of delimiter, and its impact on existing source – specifically due to removal of prefix / operators. During the review discussion, an alternative parsing rule was established that eliminated the need to remove these operators.

The additions to the proposal consist of two parts:

looking forward for unmatched closing parentheses within the regular expression, and only parsing the / as a regex if there are none. This resolves ambiguity such as f(x, /, y).reduce(/)
parsing / as an operator if there is no second / on the same line
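
As a rough illustration of the two rules above, here is a toy model in Swift. It is not the compiler's lexer: the function name `firstSlashCouldOpenRegex` is hypothetical, and the sketch deliberately ignores escapes such as `\/` and `\)`, character classes, and the proposal's other restrictions (for example on leading whitespace).

```swift
/// Toy model of the two added rules, applied to the first "/" on a line.
/// Assumption: no escapes or character classes; a real lexer handles both.
func firstSlashCouldOpenRegex(_ line: String) -> Bool {
    guard let open = line.firstIndex(of: "/") else { return false }
    let rest = line[line.index(after: open)...]

    // Rule 2: with no second "/" on the same line, "/" is an operator.
    guard let close = rest.firstIndex(of: "/") else { return false }

    // Rule 1: an unmatched ")" between the delimiters rules out a regex.
    var depth = 0
    for character in rest[..<close] {
        if character == "(" {
            depth += 1
        } else if character == ")" {
            if depth == 0 { return false }
            depth -= 1
        }
    }
    return true
}

print(firstSlashCouldOpenRegex("f(x, /, y).reduce(/)"))  // false
print(firstSlashCouldOpenRegex("let half = total / 2"))  // false
print(firstSlashCouldOpenRegex("let r = /ab(c)+/"))      // true
```

In the first call, the text between the two slashes is `, y).reduce(`, whose `)` has no matching `(`, so the toy model treats both slashes as operators.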

Testing by the proposal authors indicates that several open-source packages that used those operators now compile cleanly with the 5.7 release branch.

Given this, the core team has decided to open a second round of review, with the new parsing rule, for further feedback. In particular, the core team would like this review to focus on other aspects of the proposal, such as multi-line non-semantic whitespace literals, and the typed capture behavior. Feedback on any unanticipated edge cases with the new parsing rule would also be appreciated.

This review is part of a collection of proposals for better string processing in Swift. The proposal authors have put together a proposal overview with links to in-progress pitches and reviews. This proposal introduces a literal syntax for the Regex type to the language. It will be run simultaneously with a proposal regarding the syntax for constructing that type from a String or literal.

As with the concurrency initiative last year, the core team acknowledges that reviewing a large number of interlinked proposals can be challenging. In particular, acceptance of one of the proposals should be considered provisional on future discussions of follow-on proposals that are closely related but have not yet completed the evolution review process. Similarly, reviewers should hold back on in-depth discussion of a subject of an upcoming review. Please do your best to review each proposal on its own merits, while still understanding its relationship to the larger feature.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to the review manager. If you do email me directly, please put "SE-0354" somewhere in the subject line.

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/main/process.md

As always, thank you for contributing to Swift.

Ben Cohen

Review Manager

Paul_Cantrell · May 13, 2022, 3:43pm

Thanks for exploring. I expected it might end up in a mess like this, but I appreciate you at least thinking it over!

Lest it be lost in the din, my original concern was not about paralleling string syntax per se, but about this footgun:

Somebody mentioned upthread the idea of making “ignore whitespace / extended mode” a separate flag, orthogonal to the regex spanning multiple lines. Not sure that makes sense either. (Should a non-extended-mode regex match embedded newlines? What about indentation, then?). Still, perhaps worth a moment of consideration in the post-review followup.

In practice, I hope that the community will generally favor the builder DSL over multiline regex literals, making this a relatively rare footgun.

I'll ultimately trust the core team's judgement on this question. Thanks again for navigating this epic barrage of questions.

xwu · May 13, 2022, 6:46pm

@hamishknight One more thought while we're on the topic of whitespace: since the proposal already proposes rejecting bare /.../ regex literals with leading spaces, is there any reason to think it would be unduly restrictive also to reject bare /.../ regex literals with trailing spaces?

Besides restoring a symmetry which is subjectively an aesthetic improvement, since most idiomatic styles use spaces surrounding binary operators and after commas, this simple modification of an already proposed restriction would eliminate here even the need to disambiguate foo(/x, y / z) at all, as well as the need to disambiguate foo(/x, /y), or foo(/, /), etc.
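
To make the suggested symmetric rule concrete, here is a minimal sketch. It assumes the text between two candidate `/` delimiters has already been extracted; the function name `passesEdgeWhitespaceRule` is hypothetical, and empty contents are outside the scope of this sketch.

```swift
/// Hypothetical check for the suggested rule: a bare /.../ literal may
/// neither start nor end with a space or tab.
/// `contents` is the text between two candidate "/" delimiters.
func passesEdgeWhitespaceRule(_ contents: String) -> Bool {
    // Empty contents are not modeled here; treat them as passing.
    guard let first = contents.first, let last = contents.last else { return true }
    let blanks: Set<Character> = [" ", "\t"]
    return !blanks.contains(first) && !blanks.contains(last)
}

print(passesEdgeWhitespaceRule("x, y "))   // false: from foo(/x, y / z)
print(passesEdgeWhitespaceRule("x, "))     // false: from foo(/x, /y)
print(passesEdgeWhitespaceRule(", "))      // false: from foo(/, /)
print(passesEdgeWhitespaceRule("[a-z]+"))  // true
```

Each of the three ambiguous calls fails the check on its trailing space alone, which is why the extra rule would make further disambiguation unnecessary for them.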

Ben_Cohen · May 16, 2022, 3:23pm

2 posts were merged into an existing topic: SE-0354: Regex Literals

hamishknight · May 13, 2022, 7:52pm

IMO it would be nice to support the ability to use whitespace to aid readability, an example from the proposal:

let regex = #/
  # Match a line of the format e.g "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  (?<kind>    \w+)                \s\s+
  (?<date>    \S+)                \s\s+
  (?<account> (?: (?!\s\s) . )+)  \s\s+ # Note that account names may contain spaces.
  (?<amount>  .*)
  /#

That being said, I do agree that the following is surprising:

At the very least, if this is the behavior we decide to go with, it seems like we could warn when whitespace is intermixed between literal characters in a regex. For example a b c would raise a warning, but \d \s | [abc] would not. I'm not sure whether we'd want the warning to apply to a character class such as [ a b c ] though (maybe only if there's a single space?).

hamishknight · May 13, 2022, 7:58pm

Interesting idea! I think that seems pretty reasonable, assuming it won't impact many more regex patterns. I think the cases where we now currently break source are already somewhat uncommon, but that additional rule would probably bring it down to near-zero, as infix / with surrounding whitespace would effectively never be considered a regex literal ending.

My only possible concern would be that it may seem weird to type out your regex, and then have it completely change to something else if you type space as the last character (or as you're typing spaces in the middle of the pattern).

Paul_Cantrell · May 13, 2022, 8:04pm

I agree. Your example is compelling, and I like your approach better than my idea above.

xwu · May 13, 2022, 8:26pm

Yes, definitely would be janky. Two points in reply:

If typing out the line from start to end, one's not going to have a regex literal at all until the closing / delimiter is encountered, given the new rule about having two / delimiters on the same line. (Unless editors are presumptively supplying the closing / whenever a single / is typed—which seems like a bad experience if a user wants to just do division and therefore unlikely to be the case.)

Still, you describe a suboptimal experience when editing an existing regex literal. Perhaps trailing whitespace could be accepted as an error production, at least where there is no alternative valid parse, so that in most cases editing an existing literal doesn't flip back and forth between regex and not-regex.

Based on the first point about the required second / delimiter, my overall impression is that editors will need some sort of additional heuristic to start syntax highlighting, etc., for an "in progress" regex literal for the best user experience. Perhaps, for example, some rule is used to identify a likely opening delimiter based on unbalanced whitespace before and after. As part of that, it may be that editors will have to somehow accept trailing spaces while the cursor is positioned inside the literal. As for compiler support, are there any facilities connected with the placeholder syntax work that could facilitate such behaviors?

michelf · May 13, 2022, 9:01pm

I don't think it's a good idea to make subtle changes in how whitespace is handled in whitespace-ignoring mode compared to other languages. It'd make copy-pasting whitespace-ignoring regex from elsewhere error prone. If someone were to use whitespace to align elements of the regex in columns, using the same regex in Swift would have different semantics:

let delimited = #/
  \(  .*  \)  |
  \[  .*  \]  |
  \{  .*  \}  |
   <  .*   >
/#

I suppose a warning could work, but how do you disable that warning without rewriting the regex?

You can write a significant space with [ ] in Perl /x mode, so I assume it'd work similarly here. Whitespace is not ignored in a character class.

hamishknight · May 13, 2022, 9:07pm

In SE-0355: Regex Syntax and Runtime Construction, we are proposing a unified non-semantic whitespace behavior that treats whitespace as non-semantic both inside and outside custom character classes:

In both PCRE and Perl, this is enabled through the (?x) , and in later versions, (?xx) matching options. The former allows non-semantic whitespace outside of character classes, and the latter also allows non-semantic whitespace in custom character classes.

Oniguruma, Java, and ICU however enable the more broad behavior under (?x) . We therefore propose following this behavior, with (?x) and (?xx) being treated the same.

michelf · May 13, 2022, 9:13pm

Oh! This is surprising to me given I'm used to Perl and PCRE, not Java or ICU. I suppose it makes sense, but I'd be a bit baffled by [ ] not matching a space. Will check the other thread.

xwu · May 14, 2022, 12:23am

Would it be possible to unify the behavior of multiline regex literals with that of regexes initialized at runtime from multiline strings, but in the other direction from that discussed in the SE-0355 review thread? Namely, to require (?x) explicitly for non-semantic whitespace behavior regardless of the regex literal delimiter (while still eliding the first and last newline in a multiline regex literal and any indentation less than the closing delimiter's)?

It would be super if the final design could achieve the goal that all of the following ultimately mean the same thing (modulo static typing, etc.), or at least for as many expressions as possible:

let a = Regex("<some regex>")
let b = Regex("""
  <some regex>
  """)
let c = #/<some regex>/#
let d = #/
  <some regex>
  /#

Lantua · May 15, 2022, 8:23pm

What is the use of raw literal with many #, e.g., ###/.../###? The only benefit from #/.../# seems to be access to /# as part of the regex. The large bracket is already unlikely given that ##### could be contracted to #{5}. Am I missing something?

John_McCall · May 13, 2022, 7:06pm

I’d like to see discussion about whether the rule should be that all whitespace is non-semantic or just leading and trailing whitespace.

Paul_Cantrell · May 13, 2022, 7:24pm

One reasonable rule — not necessarily advocating, just musing — would be as follows:

Remove comments
For each line, remove all leading and trailing whitespace
Remove newlines
Any whitespace that remains is significant

For example:

    #/
        (
            hello        # morning
            |
            good night   # evening  (this and only this space character is preserved)
        )
        (
            ,\s+
            every
            (body|one)
        )?
   /#

…would be equivalent to:

/(hello|good night)(,\s+every(body|one))?/
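
The four steps can be sketched as a small string transformation. This is only an illustration of the rule as stated, not anything from the proposal: the function names are hypothetical, and comment detection is naive (any `#` starts a comment, so an escaped `\#` is not handled).

```swift
/// Strips leading and trailing spaces and tabs from one line.
func trimmingBlanks(_ line: Substring) -> Substring {
    var line = line
    while let first = line.first, first == " " || first == "\t" { line.removeFirst() }
    while let last = line.last, last == " " || last == "\t" { line.removeLast() }
    return line
}

/// Applies the four steps to the body of a multi-line literal.
func collapseMultilineRegexBody(_ body: String) -> String {
    body.split(separator: "\n", omittingEmptySubsequences: false)
        .map { line -> String in
            // Step 1: remove the comment (naive: any "#" starts one).
            let code = line.prefix(while: { $0 != "#" })
            // Step 2: remove leading and trailing whitespace.
            return String(trimmingBlanks(code))
        }
        // Step 3: remove newlines. Step 4: remaining whitespace is kept.
        .joined()
}

let body = #"""
    (
        hello        # morning
        |
        good night   # evening
    )
    (
        ,\s+
        every
        (body|one)
    )?
    """#

print(collapseMultilineRegexBody(body))
// prints: (hello|good night)(,\s+every(body|one))?
```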

Edit to add: We might want an additional rule that any space preceded by a backslash is not removed in step 2, so that this works:

#/
  hello\       # space after backslash is not removed, but subsequent spaces on this line are
  world
/#

Edit again: With the thread move, my reply to a reply to this post got out of order; note that I found @hamishknight’s counterargument compelling and prefer their proposed alternative.

Ben_Cohen · May 16, 2022, 3:26pm

Apologies, I moved some of these comments from the previous thread for further discussion, so they are a bit out of order (e.g. @Paul_Cantrell's post above this one then got later replies that are now above it).

fclout · May 16, 2022, 4:41pm

Regarding the new syntax: how does Swift diagnose incorrect regular expression syntax? What do you get here, for instance?

let foo = (/hello|(world))/;

hooman · May 16, 2022, 5:24pm

@Ben_Cohen, do we have a toolchain with the currently proposed behavior to check such things ourselves?

Ben_Cohen · May 16, 2022, 8:36pm

The approach taken with the regex proposals (as with Swift Concurrency) is that the work is getting integrated under a compiler flag (-enable-bare-slash-regex in this case) while under review. This means you can use the nightly toolchains from swift.org (either main or release/5.7) to try out the feature. But it looks like recent nightly toolchain builds haven't been posted yet – I'm checking on this and the latest 5.7 branch should be available shortly.

That said, looking specifically at the diagnostics currently output by the compiler when code is invalid should not be considered something that is covered by this review.

The primary reason for this is that the bar for evolution proposals is a prototype implementation that demonstrates how the feature is used. The expectation is not that this prototype is yet "shippable" or even mergeable into the main branch without additional work. Part of the work to get it to that point, which happens after proposal acceptance, is often quality-of-implementation work such as good quality diagnostics when the compiler hits invalid code.

Of course, sometimes having this kind of QoI is highly desirable at the proposal stage. Without it, reviewers need to reason about the results of using a fully productized implementation, not just the prototype provided for review. A similar example is runtime performance optimization – with some proposals, performance is a key driver and so not having the final fully optimized implementation may present challenges to reviewers who might be considering whether, say, such a proposal is a worthwhile tradeoff versus the complexity it might add to the language.

Nevertheless, having a full production-worthy implementation is felt to be too high a bar for a proposal to make it to the review stage. So we ask reviewers to bear with the proposal and try to work through these things on paper instead.

Feedback on whether that bar should be raised is welcome, but would be more appropriate on a dedicated thread, probably one in the Evolution/Discussion category. Feedback on diagnostic implementation is also welcome, but probably belongs in the Development/Compiler category.

So to bring it back to the immediate question, I guess it really needs to come back as another question: as a human looking at that code, on paper what would the ideal diagnostic be for this code?

let foo = (/hello|(world))/;

Once there's consensus amongst us humans for what the "right" diagnostic is to give for this code (bearing in mind you can have the compiler emit more than one diagnostic for two different interpretations) then we can discuss whether it's possible given the parsing rules to have the compiler emit them. If the answer might be "no", then that's very relevant to the proposal review. Such feedback might lead to re-considering deprecating the prefix / operator, for example.

It's worth noting that diagnostics on invalid code are able to use more information than is available when parsing valid code. For example, in the f(/,/) case, the diagnostic can make use of knowledge from the type checker that there isn't a unary function that would accept a Regex but there is a binary function that takes two binary functions.
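
For readers who want a concrete picture of "a binary function that takes two binary functions", here is a minimal sketch. The function `f` and its body are hypothetical. The call parenthesizes each operator reference, which, as far as I understand the proposed rules, can never be read as a regex literal because an unmatched `)` sits between the two slashes.

```swift
// Hypothetical function: takes two binary functions, like the f in f(/, /).
func f(_ outer: (Int, Int) -> Int, _ inner: (Int, Int) -> Int) -> Int {
    outer(12, inner(6, 2))
}

// Each "/" is wrapped in parentheses, so the text after the first "/"
// begins with an unmatched ")" and the slashes stay operator references.
let result = f((/), (/))
print(result)  // 4
```

The arithmetic is 12 / (6 / 2) with integer division, which is where the 4 comes from.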

hooman · May 16, 2022, 8:56pm

Thanks for the detailed response. I completely understand that we can't expect much from the diagnostics at this stage.

On the other hand, I think playing with a rough implementation of the rules and trying to see how the compiler reacts to various situations can give more insight into whether the current rules are going to be enough for a good developer experience or not.

For example, what is going to happen in a place like a playground, where the compiler is continuously trying to parse and diagnose as you type? Being in the middle of a regex literal is a totally new and weird place for the compiler to be.

For other literal types, there are good distinct indicators at least for their beginning, but / can be harder to detect at the start of a regex literal. For example, an editor can confidently insert a closing delimiter as we type the opening delimiter (which helps the compiler with partially typed code), but this is only possible with / if the compiler already expects a regex literal in that position. I want to get a better feeling of how often that context is available to the compiler, to see how the experience of typing a regex literal is going to compare to, say, a string.