SE-0354 (Second Review): Regex Literals

Michael_Ilseman · May 26, 2022, 1:26pm

Discussed earlier:

Michael_Ilseman:

An alternative could be a flag-like special parsing rule in-between the delimiter and the newline, something like:
#/x
  ...
/#

#/
  ...
/#
// error: newline in regex literal, use "x" before the newline
That would restore "explicit" via your definition, but I don't think it improves clarity. Another alternative is supporting (?x) instead of x there, allowing it to trigger literal multi-line mode as well. But that's inconsistent with syntactic options elsewhere and how they interact with literal delimiters and newlines.

To elaborate, anything on that line would be part of the delimiter itself and not part of the regex body. That is, using:

#/(?x)
  \d{2} / \d{2} / \d{4}
/#

wouldn't really be setting a syntactic option with (?x), as you couldn't have content before it, nor could you set other options inside of it, nor content on the same line after it.

xwu · May 26, 2022, 3:04pm

Michael_Ilseman:

To elaborate, anything on that line would be part of the delimiter itself and not part of the regex body. That is, using:
#/(?x)
  \d{2} / \d{2} / \d{4}
/#
wouldn't really be setting a syntactic option with (?x) , as you couldn't have content before it, nor could you set other options inside of it, nor content on the same line after it.

Yes, as envisioned it would really be just a part of the delimiter, but now that you mention it—why not make it a real syntactic option?

I don’t think (?x) is valid Swift; therefore, it would seem that the lexer can look for either a closing delimiter for a single-line regex literal, or the presence of (?x) (alone or in combination with any other options, and not subsequently unset) anywhere in the first line to permit multi-line literals. Perhaps, then, this multi-line support could be even extended to naked /…/ syntax. It is altogether reasonable for Swift to understand and take into account regex syntax options as part of its first-class support for regex literals.

The way to teach this would simply then be: multiline regex literals aren’t a separate thing; rather, newlines in regex literals are supported in Swift when the x option is toggled on and prohibited otherwise. This would fit neatly with the decision not to have #///…///# syntax or some such parallel to multiline string literal delimiters.

scanon · May 26, 2022, 3:07pm

That's right.

Paul_Cantrell · May 26, 2022, 4:26pm

I am a long-time regex user, and have happily made use in Ruby of multi-line regex literals in extended mode (mostly for the ability to add comments). As such, I'm sympathetic to this line of argument. However, I don’t find it as open-and-shut as this.

As someone who has both seen and authored multiline extended mode regexes in the wild, this feels like a leap to me too. I think it is plausible — likely, even — that general usage ends up following this pattern in practice:

…especially given the ability to mix small, single-line regex literals into a larger builder superstructure, as @AliSoftware discussed.

Extended mode regex literals are likely to end up being one of those things that is present in the language, but rarely used. A few adherents use them and love them; most teams discourage them with style guides and linters and social consensus, if they are aware of them at all.

(Please note, again, that their utility in other languages is not evidence that they will be equally useful in Swift, given the presence of features those other languages do not have.)

If this speculation about usage is correct, then in that hypothetical world, extended mode multiline regexes take on one of two roles:

1. They might be a curious side pocket, like differentiable functions or #dsohandle or @usableFromInline: valuable to a few people, occasionally vexing to those who encounter them in the wild without context, but essentially invisible and harmless to all but those few developers actively seeking the feature.
2. They might be a legacy albatross, like AnyObject dispatch: something that adds language surface so as to trip up people not even intending to use the feature, and adds bulk and bugs and corner cases that increase the maintenance burden of the language, but is impossible to remove because too much code relies on it.

Something like @xwu’s suggestion of an explicit (?x) mitigates the “accidental syntax” risk of scenario 2, but increases the “maintenance burden” part of that scenario. All the wrangling over parsing and whitespace upstream here is dancing with similar tradeoffs.

Ultimately, I trust the judgement of the core team and the other language maintainers on whether we’re risking scenario 1 or 2 here. It is worth pointing out, however, that they don’t need to make that decision stuck with nothing better than the speculation in this review thread:

The new information we could acquire is data to replace speculation about actual usage in practice. That’s not nothing.

A question for @Michael_Ilseman:

What features, precisely, are “currently inexpressible in the builders” if builders can contain single-line, non-extended-mode regex literals?

The only one I’m aware of is named captures becoming tuple labels. That is a serious builder shortcoming, and one I would certainly hope to see addressed (or at least mitigated!) regardless of the availability of extended mode.

Are there any other features that would be inexpressible without extended mode? That seems important.

Michael_Ilseman · May 27, 2022, 2:05pm

The only thing extended syntax mode gives you is extended syntax. You can always delete all the whitespace and newlines and put it on a single line if you really want to.

Again, extended syntax mode gives you extended syntax, not brand new features.

As for "expressible in the builders", I was referring to builders which contain no literals. We parse all of e.g. PCRE2, but not everything has been implemented yet. In theory, we could support more functionality in the underlying engine before there's a corresponding public builder API.

The two main shortcomings of the builders are lack of named capture support and multiply-nested optionality. With mapOutput, you can (albeit more verbosely than we would like) add tuple labels to captures and coalesce extra layers of optionality. But that would not all by itself give you more niche things like named capture backreferences (for that, use Reference).
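To make the last point concrete, here is a minimal sketch of a literal-free builder that uses `Reference` where a literal would use a named capture and a named backreference. It assumes Swift 5.7 or later with the `RegexBuilder` module available; the names `word` and `doubledWord` are invented for this example and do not come from the proposal.

```swift
import RegexBuilder

// Hypothetical example names; assumes Swift 5.7+ with RegexBuilder.
// A typed reference plays the role a named capture would play in a literal.
let word = Reference(Substring.self)

// Matches a word, some whitespace, and then the same word again.
let doubledWord = Regex {
    Capture(as: word) {
        OneOrMore(.word)
    }
    OneOrMore(.whitespace)
    word  // backreference to the capture above
}

// Look the capture up by reference rather than by tuple position.
let repeated = "hello hello".wholeMatch(of: doubledWord).map { String($0[word]) }
let different = "hello world".wholeMatch(of: doubledWord).map { String($0[word]) }

print(repeated ?? "no match")
print(different ?? "no match")
```

The reference gives the capture a stable name for lookup and for the backreference, but it does not add a tuple label to the match output, which is the shortcoming described above.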

Michael_Ilseman · May 27, 2022, 2:18pm

This alternative is to have the Swift lexer and other source tools understand the setting and un-setting behavior of syntactic options inside the interior regex's syntax, in addition to delimiters. This includes not just option-setting groups but isolated options as well. Option-setting groups require parsing as there's nesting involved. Isolated options require a lot more parsing nuance and there are syntactic disagreements to resolve. This is discussed in the syntax proposal.

If used in the branch of an alternation, an isolated group affects all the following branches of that alternation. For example, a(?i)b|c|d is treated as a(?i:b)|(?i:c)|(?i:d) .
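A small sketch for checking the quoted rule by hand, assuming Swift 5.7 or later and patterns built at run time with `Regex(_:)`. The helper `matchesWhole` and the sample inputs are invented for illustration. The sketch only compares the two spellings; it does not claim what the isolated spelling does on any particular toolchain.

```swift
// Hypothetical helper; assumes Swift 5.7+ and run-time Regex(_:) construction.
// Returns false both when the pattern fails to parse and when it does not match.
func matchesWhole(_ input: String, _ pattern: String) -> Bool {
    guard let regex = try? Regex(pattern) else { return false }
    return input.wholeMatch(of: regex) != nil
}

// The explicit expansion from the quoted rule.
let expanded = "a(?i:b)|(?i:c)|(?i:d)"
// The isolated-option spelling it is said to be equivalent to.
let isolated = "a(?i)b|c|d"

// Print both results side by side for a few inputs.
for input in ["ab", "aB", "C", "D", "Ab"] {
    print(input, matchesWhole(input, expanded), matchesWhole(input, isolated))
}
```

If the engine follows the quoted rule, the two columns should agree on every input. A source tool that wanted to honor `(?x)` set this way would need the same understanding of option scope.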

Doing this would give two advantages:

Additional visual signaling for extended syntax mode beyond multiple source lines
Ability to go to extended syntax across multiple source lines in the middle of an otherwise single-line regex

The first advantage can be achieved without this complexity by any delimiter change discussed above.

The second advantage seems dubious to me and I don't think it's worth the considerable complexity of forcing source tools to understand option setting and unsetting.

xwu · May 27, 2022, 4:27pm

I'm sympathetic to the additional complexity of such an approach, but it does seem to be the logical endpoint of having first-class support for regex literals with options that have syntactic effects:

So I'm not (at all) arguing that the second advantage you quote above is per se worthy of sprouting all the complexity. Rather, to me it seems that the need to have sophisticated built-in handling for these options at compile time is already accounted for in design considerations, such that the direction of Swift points strongly to having such complexity in the fullness of time anyway even if it's not plumbed through quite yet.

It would be worthwhile, then, to make the most of it from the start rather than committing to fundamental decisions with a wall of separation between regex syntax and the lexer that may not be there by the time the full feature set is rounded out.

hooman · May 27, 2022, 6:24pm

Based on the discussion so far, here is where I stand now:

+1 on single line literal as proposed (with possible further refinements to reduce the chance of ambiguity), including its extended #/.../# format.
-1 on using the same #/.../# delimiter for multi-line literals. It is too different to use the same delimiter.
+1 on keeping multi-line literals.

Also I would:

Provide the ability to specify global semantic and syntactic options at the beginning of multiple-line literal.
Have defaults that are sensible for multiple-line complex literals and accept those defaults being different from the single-line literals.
Provide the ability to have an end-of-line Swift comment on the first line (the opening delimiter of the literal), such as:

    let complexMatch = #///[options here]          // Swift end of line comment here
           the actual literal
    ///#

jpmhouston · May 27, 2022, 7:35pm

So obviously then it should be:

#/#/#/
hello world
/#/#/#

:^)

I think spreading the #/ /# over multiple lines is enough of a clue that such patterns are a little different. For example, would a newbie expect it to match the leading and trailing newlines too? I don't think beginners would be constructing a regex literal of increasing length and reasonably think "oh, I'll just put this over multiple lines and assume it will just work". They'd literally be adding whitespace (newlines & indentation) they want to be ignored and assuming the compiler could read their mind about which whitespace would count.

Is it the proposal itself or an earlier comment which points out that ignored whitespace is the norm for multi-line patterns in other environments, and the suggested use case is copy and pasting such patterns into Swift source? Wherever I saw that, I agree that this should be what to keep in mind when thinking about syntax.

I think of the """ syntax as an indication purely of the special indentation handling. If there were another literal that maintained whitespace but ignored matching indentation in its multi-line form, then I'd expect it to also do the triple-literal thing, but multi-line regex patterns aren't that.

So I agree with the way this discussion has gone since, about ?x or whatever: even if we wanted a special form to represent non-semantic whitespace, I don't think it should be triple-anything. But I personally don't think it's needed.

Paul_Cantrell · May 27, 2022, 7:43pm

That's not true. Or rather, maybe ignoring whitespace in multiline patterns is a stylistic norm, but it is not actually a language feature. In every language I know of that allows multiline regex literals at all, ignoring whitespace (“extended mode”) is a separate flag that must be explicitly specified, and is independent of whether the literal spans multiple lines. In Ruby and Perl, for example, this:

/foo

bar/

…matches "foo\n\nbar" but not "foobar", whereas both of these (note the x flag):

/foo

bar/x

/foo  bar/x

…match "foobar" but not "foo\n\nbar".
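The same distinction can be sketched in Swift without touching the literal syntax under review, by building the patterns at run time from strings. This assumes Swift 5.7 or later; the helper `wholeMatches` is invented for illustration, and `(?x)` stands in for the trailing `x` flag of the Ruby and Perl examples.

```swift
// Hypothetical helper; assumes Swift 5.7+ and run-time Regex(_:) construction.
func wholeMatches(_ pattern: String, _ input: String) -> Bool {
    guard let regex = try? Regex(pattern) else { return false }
    return input.wholeMatch(of: regex) != nil
}

// Without extended mode, the two newlines are ordinary pattern characters.
let plain = "foo\n\nbar"
// With (?x), whitespace inside the pattern is non-semantic.
let extended = "(?x)foo\n\nbar"

let plainOnNewlines = wholeMatches(plain, "foo\n\nbar")
let plainOnJoined = wholeMatches(plain, "foobar")
let extendedOnNewlines = wholeMatches(extended, "foo\n\nbar")
let extendedOnJoined = wholeMatches(extended, "foobar")

print(plainOnNewlines, plainOnJoined, extendedOnNewlines, extendedOnJoined)
```

Here the flag, not the line count, decides whether the newlines matter. The proposal instead ties extended syntax to the literal spanning multiple lines.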

This proposal is in territory without precedent in other languages. Thus the debate.

jpmhouston · May 27, 2022, 8:12pm

Sorry, I indeed meant to refer to those stylistic norms, how multi-line literals are used in practice, not the specific details of other languages.

Speaking of, that /foo bar/ example with newlines looks like a code smell. I'm sure it also matches the spaces used to indent the subsequent lines and I doubt competent developers would seriously rely on such a construct. If this is how regex delimiters with newlines are interpreted in other languages, I think that puts Swift on a stronger footing for not following their lead.

hooman · May 27, 2022, 8:38pm

This is not a sensible example of a multi-line literal in the context of Swift regex literals. As I noted in my response above, multi-line literals are for complex literals that would be hard to understand if they were presented as a single-line literal. Based on this, I think the defaults should be optimized for the realistic use case of complex multi-line literals. Also, that is why I think the delimiter should look distinct and we should not use #/.../# for multi-line literals.

wowbagger · June 3, 2022, 2:22pm

Apologies for this extremely late review. I haven't gotten much time to spend on Swift Evolution lately.

I think one tenet of good design is the idea that form follows function. This does not mean that the aesthetic of the form doesn't matter. It still does. It's still very important, but the function must take priority over it.

In the case of regex literals, I think their primary function is indicating (clearly) to both the compiler and the code reader that a piece of data should be treated as a regex pattern. Therefore, a good syntax for these literals should be, foremost, fully capable at performing this function.

My impression after reading this and the prior revision of the proposal is that the bare-slash syntax is designed from the opposite direction: a desired aesthetic is chosen first, and then workarounds are found to try to fit it over the necessary function. Even though there are many entries in the "alternatives considered" detailing why the bare-slash syntax is preferred, the rationales provided mostly focus on its aesthetic advantages, with not much on the functional ones. In fact, the rationales suggest some of the alternatives do fit better with their intended function.

I think this design process (if true) has led to the complex parsing rules and ambiguities as shown in the proposal. In my opinion, complexity and ambiguity aren't inherently bad by themselves. Sometimes, they're good, because computers and humans digest code differently, and the complexity and ambiguity in the compiler (e.g. contextual keywords) are there to help code stay intuitive to humans. However, in this case, some of the complexity and ambiguity is added into human comprehension. This is bad in my opinion, and I would go so far as to say that it's actively harmful.

Consider this simple example from the proposal:

baz(/, /)

Under this proposal, this will become a function call taking a regex literal. However, to many people who aren't staying up to date with all the shiny new things we add into the language (as we shouldn't expect them to be), this change is completely imperceptible until they compile/run it and are confused to find that it's not doing what they have learnt to expect. This is a departure from the philosophy of progressive disclosure of complexity that Swift prides itself on: now in order to use unapplied operators properly, a user must also learn regex literals (which more likely than not they'll never need to use directly) to understand how to handle edge cases, and vice versa.
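For reference, a minimal sketch of spellings that do not depend on how a bare `/` is lexed. The function `baz` here is a hypothetical stand-in with an invented signature, since the proposal's example does not define it. Parenthesizing the operator is the disambiguation discussed elsewhere in this thread; the closure form avoids the operator token altogether.

```swift
// Hypothetical stand-in for the proposal's `baz`: takes two binary functions.
func baz(_ f: (Int, Int) -> Int, _ g: (Int, Int) -> Int) -> Int {
    f(12, 3) + g(20, 5)
}

// `baz(/, /)` is the spelling at risk of being read as a regex literal.
// Parenthesizing each operator keeps it an operator reference.
let viaParens = baz((/), (/))

// Closures spell out the division and contain no bare operator reference.
let viaClosures = baz({ $0 / $1 }, { $0 / $1 })

print(viaParens, viaClosures)
```

Neither spelling removes the underlying concern: a reader still has to know that the unparenthesized form changes meaning.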

Although tooling (mostly syntax highlighting, I suppose) can help alleviate the problem to some extent, it's external to the language itself and so shouldn't be relied upon as the solution to a problem created by the language itself. There are also speculations upthread to the instability of syntax highlighting given the parsing rules' complexity (and potential additional complexity). We should also remember not everyone has access to good syntax highlighting. Not everyone is privileged enough to be able to afford a Mac and use Xcode.

Even those who are fully aware of and understand the new parsing rules will perhaps find Swift code becoming less predictable and reliable. Humans are bad at keeping track of all the tiny details. This is why typos happen in code, why they persist even after multiple careful proofreads, and why sometimes they persist even after the compiler points out precisely where the error is. I suspect details as small as whitespace placement are easily automatically neglected by human brains, and unless the compiler can point out precisely what's wrong and how to fix it, the user will find themselves tripped up and left directionless. All the gotchas combined with the already lengthy compilation time of Swift might lead to more frustration and poorer user experience, and push it further away from being a suitable first language.

I agree with the core team that we should try to make regex literals elegant, but realistically, how elegant can we make regex through the delimiters alone, if we're keeping almost all inelegant aspects of regex? And how much more inelegant can regex be, if the delimiters are inelegant? Is elegance primarily associated with visual simplicity, or is there more to it such as the ease of use and clear reflection of its function? Also, is elegance what many of those who reach for regex care about, and is it worth it to complicate human understanding of Swift code in general for what might amount to rare uses in code?

If the goal is to make regex easier to use in Swift, the bare-slash syntax does quite the contrary in my opinion, and it also makes some other parts of the language harder to use and reason about. At the end of the day humans write (and read while writing) code. The compiler can't read people's minds, but only reads what's written and translates it for computers. So it's important that we try to make sure what humans think they're telling the compiler matches what the compiler thinks the humans are telling it.

jpmhouston · June 4, 2022, 12:27am

a desired aesthetic is chosen first, and then workarounds are found to try to fit it over the necessary function

I think it's more than aesthetic, it's a syntax that was picked as ideal because of terms of art and familiarity. The initial language team similarly didn't (probably) weigh the function of square bracket subscript syntax against boatloads of other possibilities, it was simply chosen.

The step then was to see if the desired syntax could be made to work, whether heuristics in the parser could be arrived at which will overwhelmingly often parse what the coder intends. Apparently the parser already has many such heuristics that allow the syntax the Swift team desired. (I'm not an insider, someone please correct me if I'm wrong about any of this)

You've come across an example where it seems the parser would fail, so what to do? Deciding that this proves it will never work is certainly one option. I think there could instead be a discussion about this example (if there hasn't been already), whether it's a singular outlier or represents a large class of similar fails, maybe whether a tweak to parser heuristics could address them.

xwu · June 4, 2022, 3:51am

Just scroll up to the third post in this thread and read on from there.

jpmhouston · June 4, 2022, 4:05pm

Just scroll up to the third post in this thread and read on from there.

Thanks. Indeed I missed that and the few other posts in this second review that haven't been about multi-line patterns. I guess my gripe that this line of discussion was missing was mostly towards the previous review thread where everyone seemed to be quick to dismiss bare "/" delimiters out of hand.

I strongly agree with extending whitespace-adjacency heuristics in whatever way to disambiguate foo(/x, y / z) etc. In fact, if the balance is between:

the parser assuming regex and requiring parentheses around sub-expressions to disambiguate, vs:
the parser rejecting a possible regex requiring the escaped #/ /# form to disambiguate

then I'm in favor of going very far to ensure the latter over the former, both because the parentheses disambiguation looks weirder than the escaped regex does, and because the latter allows more existing code to compile as-is.

Do I understand correctly that @xwu's suggestion is to require # for both patterns like / foo/ and /foo /, whereas the current proposal is only for one of those two? Is there any push-back that this extension would be too onerous for regex patterns? I might personally have avoided such patterns anyway as a floating "/" delimiter would look ambiguous to the reader, let alone the parser. Would this suggestion solve most of the similar problematic examples?

I've been a "nearly +1" towards the current and previous proposals, only with reservations about exactly this sort of disambiguation balance. This extension to the heuristics might be enough to put me in the "unreserved +1" camp, however I'd first like to see if any more reasonable examples can be thought of where the heuristics would get it wrong.

xwu · June 4, 2022, 4:56pm

Yes.

Yes, read the replies for a thoughtful discussion on how it impacts the editing experience.

Yes.

Jon_Shier · June 6, 2022, 10:46pm

@Douglas_Gregor It appears the bare literal syntax is enabled by default in Xcode 14b1, but that version of Swift is missing the parser enhancements required for the / prefix operator to still work. While that will likely change (soon, I hope), does this mean the feature will be enabled by default for Swift 5.7? I thought previous discussion had stated the bare syntax would be reserved for Swift 6 mode?

Also, this default on behavior applies unconditionally to the SwiftUI preview build (lol) which is a fun bug.

hamishknight · June 7, 2022, 1:30pm

Unfortunately there is currently a bug with Xcode previews in beta 1 that prevents the prefix / compatibility rules from working as expected. The rules do however work for a regular compilation, so should work fine for e.g. running in the simulator. It also appears that syntax highlighting does not yet have the new rules.

Until fixed, one potential workaround to get previews working is to add a comment on the same line as the prefix operator use, e.g:

/x // dummy comment

This is also a known issue.

Ben_Cohen · June 7, 2022, 2:21pm

It is off by default, but the proposal includes a compiler flag to turn it on in 5.7, not wait for Swift 6. Xcode 14 has a build setting to toggle this flag on and off, and it defaults to on.