SE-0354 (Second Review): Regex Literals

If we aren't supporting Perl-like regex modifiers after the literal (i.e., /foo/i), then are there any situations where a valid regex literal and an identifier character would be juxtaposed like that? I can't think of any but I could very well be missing something. But if not, could the lexer detect that and reject it as a regex? I think we could add digits as regex non-followers as well.

I just thought of another weird one, extending the above idea to "what juxtaposed characters might we want to prevent a regex literal". I don't think it's a big deal and parentheses can be used to disambiguate this, but just to continue this line of thinking:

foo(/MyEnum.case1, /(bar))

Does this get parsed as a regex literal being called via a callAsFunction extension, or as two applications of prefix /?


I think you're right. It would be worth studying that as a possible fifth condition, as that would catch a lot of common cases, including most of the CasePath uses.

I'm pretty sure it would parse as the former.

I'm not entirely sure I'm comfortable with the use of multiline regex literals. It feels like many expect them to have a magic "only match the whitespace I want it to match" heuristic but deciding which whitespace to match is difficult and any useful system is likely to be difficult to explain. If we have to have them at all, then I think I'd prefer to make them work exactly like multiline strings (with the same rules for removing indentation) and leave the remaining whitespace as significant.

I think in practice I'd most likely use the Regex builder DSL containing a bunch of single-line regex literals rather than a multiline literal with extended syntax. I can't see a good reason to use the multiline literal syntax unless I really want newlines to be significant. (And in the odd case where I do want extended syntax, I can always force it with (?x) anyway.)
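For reference, the multiline string rule referred to above strips the indentation of the closing delimiter from every line and keeps all other whitespace. A minimal sketch (the constant name is illustrative only):

```swift
// The closing delimiter is indented four spaces, so four spaces are removed
// from the start of each line; anything beyond that stays significant.
let text = """
    first line
      indented two extra spaces
    """

print(text == "first line\n  indented two extra spaces")  // true
```

Applied to regex literals, this rule would leave the two extra spaces and the newline as part of the pattern.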


In my opinion, the growing list of parsing rules for this simple literal type is starting to become a bit symptomatic of larger problems with the associated syntax changes. Straying away from the single forward slash feels like it would simplify parsing quite a bit, meaning using the #regex(stuff) syntax or just the extended syntax.

Here’s another example that could be a bit harder to fix:

someFunc(/MyEnum.someCase, /) (the second slash is being passed as the division operator)


Very similarly to another recent example, this is another one that can be fixed either by putting the first argument in parentheses or by separating the two arguments onto separate lines.

Either like this:

someFunc((/MyEnum.someCase), /)

Or like this:

someFunc(
    /MyEnum.someCase,
    /
)
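To make the two workarounds concrete, here is a minimal self-contained sketch. `MyEnum`, `someFunc`, and the prefix `/` operator are hypothetical stand-ins (the operator only imitates the CasePaths style mentioned earlier in the thread); only the parenthesized and multi-line call shapes come from the posts above.

```swift
// Hypothetical stand-ins, not taken from any real library.
enum MyEnum { case someCase }

prefix operator /

// A CasePaths-style prefix `/`; here it simply wraps its operand in an array.
prefix func / (value: MyEnum) -> [MyEnum] { [value] }

// Takes the "path" plus a binary operator such as division.
func someFunc(_ path: [MyEnum], _ op: (Int, Int) -> Int) -> Int {
    op(12 * path.count, 4)
}

// Workaround 1: parenthesize the first argument.
let viaParens = someFunc((/MyEnum.someCase), /)

// Workaround 2: put each argument on its own line.
let viaLines = someFunc(
    /MyEnum.someCase,
    /
)

print(viaParens, viaLines)  // 3 3
```

Neither form places a second `/` later on the same line as an unparenthesized prefix `/`, which is the arrangement that lets the pair be read as regex delimiters.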

I know it can be fixed by writing the code differently, but that doesn't change that it's a breaking change. And it's a relatively easily avoidable breaking change with little gain in my opinion.

I understand that I'm probably fighting a losing battle because there are a lot of people who want Swift's regex literals to look exactly like their counterparts in other languages. However, I do have one more concern.

Almost all syntax highlighters used on the web are regex based, and I find that they have enough trouble highlighting Swift as-is; regex literals with all their associated rules are surely going to make that worse. This is in contrast to the relatively simple string literals, which can be parsed relatively easily because " isn't used for anything else (such as division in the case of the / delimiter).


Copying in my reply from the other thread, as I think it helps clarify and establish some reasoning for further discussion.


With respect to non-semantic whitespace, the literal proposal presents these 3 cases:

// 1
/whitespace is significant/

// 2
#/whitespace is significant/#

// 3
#/
  whitespace is **not** significant     # nor are comments
#/  

Getting behavior such as in #3 is highly desirable, via some delimiter-enabled way. We couldn't find a better one than #/ followed by a newline. The alternative #///'s has some issues with comments, and /// isn't workable AFAICT, but @hamishknight can you comment further?

An argument could be made that #2 should be non-semantic as well, as "extended delimiter" could mean "extended syntax" (and we'd likely error out on a line-ending comment). The downside is that changing a /hello world/ to a #/hello world/# would change meaning of whitespace and that would be weird (as you point out). I'd (weakly) recommend against this direction.

An argument could also be made that all regex literals, including #1, have non-semantic whitespace. That does get weird with the no-leading-space lexing rule (which IIUC we could restrict to start-of-line if we needed to). It's also surprising that /hello world/ doesn't match "hello world", without the newline that #3 has.

To clarify, the below are all compilation errors. The multi-line story only happens if the #/ is immediately followed by a newline:

// Error
/
  abcd
/

// Error
/ ab
  cd
/ 

// Error
#/ ab
   cd
/#

// Ok
#/
  ab
  cd
/#
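The difference between cases 2 and 3 can be sketched as follows, assuming a Swift 5.7 or later toolchain whose behavior matches the proposal as described here (the constant names are illustrative):

```swift
// Single line with the extended delimiter: the space is part of the pattern.
let single = #/hello world/#

// Opening delimiter followed by a newline: whitespace and `#` comments are ignored.
let multi = #/
  hello world   # not part of the pattern
/#

print("hello world".wholeMatch(of: single) != nil)  // true
print("helloworld".wholeMatch(of: multi) != nil)    // true
print("hello world".wholeMatch(of: multi) != nil)   // false
```

The last line is the "hello world" surprise discussed in this thread: the same text between the delimiters matches different input once the literal is split across lines.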

At this point, we have at least five credible suggestions about how to handle multiline regexes. (NB: All the below use the #/…/# syntax for any regex literals containing newlines, so this is not a delimiter question!)

It’s probably useful to summarize them, since the discussion has become quite tangled at this point:

  1. Strip all unescaped whitespace (traditional extended mode), and emit warnings about unescaped spaces that look suspicious

    Advantage: Allows traditional extended regexes, formatted for readability

    Disadvantage: May be confusing when users encounter it, requires verbose manual escaping of spaces and/or flag to disable warnings, “hello world” footgun still exists despite warning

  2. Remove leading and trailing whitespace (and comments) from each line

    Advantage: Somewhat intuitive behavior

    Disadvantage: Harms the ability to add internal whitespace for readability

  3. Don't allow regex literals to span multiple lines at all; use the regex DSL instead for formatting and commenting long regexes

    Advantage: Encourages people to use the regex builder DSL, which has numerous readability advantages and requires no confusing new rules about whitespace

    Disadvantage: Forces people to use the regex builder DSL, which is more verbose and in some cases clumsier, and (currently) discourages named captures

  4. Use the DSL for formatting long regexes, as in 3, but allow multiline regex literals and treat newlines + whitespace as significant

    Advantage: There is currently no other proposed facility for preserving literal newlines in a regex, which can be useful for matching large chunks of formatted text

    Disadvantage: Interaction with surrounding code gets messy. (How does it handle indentation, for example? Is the rule the same as multiline strings? What are the rules for a bare leading or trailing newline? Is all this really better than explicit \n? etc.)

  5. Combine 1+4: multiline regexes are literal by default (4), but some extra syntax enables extended mode where all unescaped whitespace is ignored (1)

    Advantages: Covers all the bases, more or less

    Disadvantages: Maximally confusing, may not actually carry its weight

  6. Use #///…///# as a second, separate delimiter to enable extended mode, and either (6a) disallow multiline #/…/# or (6b) allow multiline #/…/# and have it treat whitespace as significant

    Advantages: Might mitigate the “hello world” footgun, since it’s slightly less easy to accidentally enable, and the delimiter change could help signify that the meaning of whitespace changes

    Disadvantages: May be excessive and unnecessary. Option 6b poses all the problems of Option 4 above.
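The practical difference between options 1 and 2 can be sketched by applying each rule to the text of a pattern body. Both helpers below are hypothetical and exist only for illustration: they are not part of the proposal or of any Swift API, and comment stripping is left out.

```swift
import Foundation

/// Option 1 (traditional extended mode): drop every unescaped whitespace character.
func stripUnescapedWhitespace(_ pattern: String) -> String {
    var result = ""
    var escaped = false
    for character in pattern {
        if escaped {
            result.append(character)   // keep whatever follows a backslash, even a space
            escaped = false
        } else if character == "\\" {
            result.append(character)
            escaped = true
        } else if !character.isWhitespace {
            result.append(character)
        }
    }
    return result
}

/// Option 2: trim leading and trailing whitespace on each line, then join the lines.
func trimEachLine(_ pattern: String) -> String {
    pattern
        .split(separator: "\n")
        .map { $0.trimmingCharacters(in: .whitespaces) }
        .joined()
}

let body = "  hello world\n  \\d+\n"
print(stripUnescapedWhitespace(body))  // helloworld\d+
print(trimEachLine(body))              // hello world\d+
```

Under option 1 the interior space disappears (the footgun); under option 2 it survives, but spaces can no longer be added inside a line purely for readability.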


I thought I would give some more analysis of the different regions in the design space.

We have a 2-dimensional (with some wrinkles) design space and we need to fit in /.../ and #/.../#. To illustrate this space, I'll use the alternative considered re'...' syntax. This is not an attempt to re-litigate the core team's aesthetic preferences. I'm using the alternative because it was designed to treat these design dimensions as orthogonal, but I argue that desirability is not equally distributed across this plane.

Here re is normal syntax and rx is extended; ' is for single line and ''' is for multi-line.


/// Region 1
re'whitespace is significant'


/// Region 2
re'''
  whitespace is still significant
  but leading and trailing are trimmed
'''

re'''
  whitespace is still significant
  \ but leading and trailing are trimmed
'''


/// Region 3
rx'whitespace \s isn't \s significant \s (unless\ escaped)'


/// Region 4
rx'''
  whitespace \s isn't \s significant \s (unless\ escaped)
'''

Region 1: Semantic whitespace in single-line literals

This is a highly desirable region to support as it maximizes compatibility and familiarity.

This is the behavior of / as proposed. The no-leading whitespace rule makes it particularly difficult to have non-semantic whitespace with the bare / delimiter.

Region 2: Semantic whitespace in multi-line literals

Initially this point looks promising given intuition from string literals and this specific example being a long run of verbatim content. But the precise meaning is not clear: should the newlines be preserved as verbatim content like string literals?

Traditionally, a newline sequence encoded into a regex would be treated verbatim and match that exact sequence. This includes any byte sequence that would be a newline within the regex literal, e.g. a CR-LF verbatim matches a CR-LF in the input. And that's fine for run-time string content. But when it comes time to embed a literal in the host language (Swift), the host language handles this structure.

Keeping the newlines as verbatim content but trimming seems like it would be surprising. Dropping the newlines and trimming diverges from string literals but even that can be surprising. In the example shown, the space separating words has to be added/escaped because of where the line break is. If the escape was instead at the end of the line, would that restore a verbatim newline?

There's a lot of details we could work through, but this region does not seem all that desirable to land upon. It's also entirely unprecedented AFAIK, which doesn't necessarily argue against doing it, but lends some credence to the argument that this isn't a particularly pragmatic or useful region.

For actual regexes, long runs of verbatim content are fairly rare, and overall the balance tilts towards non-semantic whitespace being more helpful than confusing. Thus, we are proposing not to target this region.

Region 3: Non-semantic whitespace in multi-line literals

This is a highly desirable region and it's broadly precedented by other language's extended or multi-line literals approach. It splits a regex across multiple lines, ignores the newlines contained, and turns on non-semantic whitespace.

This is what's proposed by a #/ followed by a newline. For example, to quickly capture a couple portions of the transaction used in the overview proposal's example:

// CREDIT    03/01/2022    Payroll from employer      $200.23
let regex = #/
  (?<date>     \d{2} / \d{2} / \d{4})
  (?<middle>   \P{currencySymbol}+)
  (?<currency> \p{currencySymbol})
/#
// Regex<(Substring, date: Substring, middle: Substring, currency: Substring)>

(Note that in this use case I'm not using a strongly-typed Foundation.Date, which represents an instant in time and thus requires a priori knowledge of locale and/or timezone).
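As a usage sketch, the literal above can be applied to the sample transaction line from the comment. This assumes a Swift 5.7 or later toolchain that accepts the `currencySymbol` property name the way the proposal's example does; the `line` and `match` names are illustrative.

```swift
let regex = #/
  (?<date>     \d{2} / \d{2} / \d{4})
  (?<middle>   \P{currencySymbol}+)
  (?<currency> \p{currencySymbol})
/#

let line = "CREDIT    03/01/2022    Payroll from employer      $200.23"
let match = line.firstMatch(of: regex)

// Named captures surface as labeled tuple elements on the match.
if let match {
    print(match.date)      // 03/01/2022
    print(match.currency)  // $
}
```

The spaces used to align the three groups are ignored, which is the readability benefit this region is meant to provide.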

Some contingent of developers might shy away from this in favor of converting everything to builders, and that's totally fine. Everyone has their own conversion curve between literals and builders. But this is still a very useful and valuable region to support.

Region 4: Non-semantic whitespace in single-line literals

This is an interesting region and is commonly supported by other languages' single-line literals. It loses the instant familiarity and compatibility of region 1. This is the default for some languages like Raku and can aid in separating delimiter noise from regex content.

This is currently supported explicitly through the use of (?x), but note that syntactic options pertain to the interior regex syntax and wouldn't affect things like how delimiters are parsed. If we were to adopt a no-trailing-whitespace rule, then the following would hold:

/(?x) non semantic whitespace /      // Invalid
#/(?x) non semantic whitespace #/    // Valid

An alternative could be to enter non-semantic whitespace mode if the #/ delimiter is followed by any whitespace, not just a newline, allowing the following:

#/ non semantic whitespace #/ 

We're (weakly) arguing against this direction because that seems like a more surprising on-ramp to non-semantic whitespace than restricting it to when the regex is split across multiple lines. A regex split across multiple lines is a much stronger signal that whitespace is handled differently by Swift.

For non-semantic whitespace content that fits in a single-line, sources can still use the multi-line variant:

#/
  non semantic whitespace
#/ 
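A minimal sketch of the two spellings for region 4, written with the `#/…/#` delimiters the proposal specifies and assuming a Swift 5.7 or later toolchain (constant names are illustrative):

```swift
// Single line: non-semantic whitespace is switched on explicitly with (?x).
let inline = #/(?x) non semantic whitespace/#

// Multi-line variant: the newline after the opening delimiter switches it on.
let multiline = #/
  non semantic whitespace
/#

let input = "nonsemanticwhitespace"
print(input.wholeMatch(of: inline) != nil)     // true
print(input.wholeMatch(of: multiline) != nil)  // true
```

Both patterns ignore the spaces between the words, so they match the same input.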

I agree, which is why the proposed behavior is non-magical and directly aligns with a very common use scenario. See "Region 2" and "Region 3" above. "Region 2" would be exploring new (and uncanny) ground.

String literals give us a nice basis or a nice example of how to integrate literal content with Swift. But, it carries with it the potential for false intuitions. Regex literals are algorithm literals more than data literals, which means some intuitions are reversed and some conventions are flipped on their head.

The examples in this thread, e.g. /Hello world/ are great at illustrating the differences of semantic whitespace. But, they're regexes containing only a long run of verbatim content. This is the common case for strings, with escapes and interpolation sprinkled throughout. But for regexes it's commonly the reverse.

For example, extended delimiter #/.../# doesn't change the interpretation of a \ inside like a string literal. /\s/ === #/\s/#. Regexes make far heavier use of \ and forcing users to type #\ would be pretty bizarre and hostile. Wanting verbatim treatment of a character such as \ is much less common than in "raw" strings.
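A small sketch of that contrast, assuming a Swift 5.7 or later toolchain (constant names are illustrative): inside `#/…/#` a backslash still introduces a regex escape, whereas inside a raw string `#"…"#` it is an ordinary character.

```swift
// Extended regex delimiter: \s is still the whitespace escape.
let whitespaceRun = #/\s+/#

// Raw string delimiter: the backslash is kept verbatim.
let rawString = #"\s+"#

print("a \t b".replacing(whitespaceRun, with: "-"))  // a-b
print(rawString.count)                               // 3 (backslash, "s", "+")
```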

Multi-line literals don't treat newlines as significant, nor does any regex inside (?x) mode. (With the note mentioned above that (?x) pertains to interior syntax and wouldn't affect how Swift-the-language parses the delimiters).

@hamishknight What's the status of the no-trailing-whitespace rule? Is it desirable, or not, and would it fix this issue?

That example matches my intuition of how this should behave. @hamishknight, what do you think?

I don't think we should support the "isolated" option setting behavior if the applicable scope spans lines:

let r = #/
  (?-x)hello
  \ world
  /#
// error: isolated extended-syntax option spans multiple lines in literal

That is, newline treatment is a concern of the host language (Swift) while options affect the interior language (Regex).

As for quoting runs of verbatim content, this is something that comes up and would be nice to support. We do support \Q...\E, but we'd want to be careful about the details.

let r = #/
  \Qhello world\E
/#  

@hamishknight, do you recall whether that would preserve the whitespace character? I know that PCRE2 also supports a bare \Q that quotes the rest of the regex as verbatim content, but we'd want to make sure to reject that when the rest of the content spans multiple lines (just like (?x)).

We're reserving (a potential) interpolation syntax for regex as future work, which would be generic over RegexComponent or similar. Strings (whether literal or a variable) would be interpolated verbatim. I would argue that any contained whitespace is similarly treated verbatim. We also will issue an error for unrecognized syntactic (?...) forms, so we could (future work) support a quoting mechanism that way. E.g.

let verbatimContent = "Some verbatim content"
let r = #/
  <{ verbatimContent }>
/#  
let r = #/
  (?"Some verbatim content")
/#  

edit: fixed some typos

(Michael, a quick clarification: throughout your post, you're using #/ as a closing delimiter. The proposal specifies #/…/#, not #/…#/. I assume this is just a mistake in your post? Or is there confusion / debate about this?)


Re your regions of design:

  • I don't think there’s any debate about region 1 here.

  • In region 2, there’s a clear split in the comment: the thing you say “seems like it would be surprising” seems to match many people’s intuitions, and seems to be clearly wrong to others. There seems to be a split about option 1 and option 2 from my list of possible designs: one is obviously more intuitive and the other is obviously wrong, but there’s not consensus about which is which.

    I’m not sure what to do with this information, but an appeal to obviousness will not resolve the issue.

    My somewhat squishy opinion here is that the lack of an explicit switch to extended mode here exacerbates the problem of that hello world footgun. I’m not aware of any precedent of a language with an implicit extended mode as in this proposal. Even Ruby, which has a second %r{…} syntax especially suited to multiline regexes, still requires the x flag for extended mode.

  • In region 3, I'm questioning whether this is in fact “highly desirable.” You write: “it's broadly precedented by other language's extended or multi-line literals approach.” However, those other languages don’t have the builder DSL syntax. The possibility of mixing single-line regex literals into a multi-line regex structure that entirely obeys the syntactic rules of the host language, with no confusion about whitespace, might well mean that multiline regex literals in this “region 3” space simply wouldn't pull their weight in Swift the same way they do in other languages.

    “It was useful in Perl and Ruby” isn’t quite enough here. The question we should ask is: Is it useful in Swift, in the presence of this other multiline syntax? Would this feature pull its weight? Or would it be more Swift-like to steer people toward the builder DSL if they want to format and comment long regexes?

    As I wrote above, I’m personally on the fence about this, but I do think the answer is not obvious, and the question deserves serious consideration.

  • In your region 4, it seems to me that (?x) or similar neatly resolves the question.


To me #/ would be a better signal that whitespace is non-semantic. It would be surprising if the behavior changed when I happened to convert a single-line extended-syntax literal to multiple lines or vice versa.

It would be unfortunate if folks have to convert a single line regex to multiple line regex only to get the non-semantic ws behavior.

let r1 = #/ non semantic whitespace /#

let r2 = #/
non semantic whitespace
/#

My mistake, thanks for catching it.

I'm saying that supporting this region at all requires making calls about verbatim vs stripped newlines and whitespace, and any call that we might pick would be surprising to anyone expecting something else. There is also no precedent for partially-semantic whitespace to follow, while there is broad precedent for extended syntax mode.

What is proposed is explicit, though I agree that a flag like in other languages would be even more visually distinct. But, we're not proposing flags.

An alternative could be a flag-like special parsing rule in-between the delimiter and the newline, something like:

#/x
  ...
/#

#/
  ...
/#
// error: newline in regex literal, use "x" before the newline

That would restore "explicit" via your definition, but I don't think it improves clarity. Another alternative is supporting (?x) instead of x there, allowing it to trigger literal multi-line mode as well. But that's inconsistent with syntactic options elsewhere and how they interact with literal delimiters and newlines.

A key feature of regex syntax is broad compatibility with other engine syntaxes. (Note that literal delimiters are not motivated by a copy-paste scenario, but the regex syntax contained is). A key feature of regex literals is compile-time knowledge of the same regex syntax to drive compiler errors and type inference. We want to encourage people to use literals whenever possible instead of using run-time regex compilation from a string.

Without any story for multi-line non-semantic whitespace, a lot of the value of a literal is harmed. The direct workaround would be to represent these regexes as run-time compiled strings with explicit types provided, i.e. don't use literals. The other workarounds are to heavily re-work the regex either into a single-line literal or to convert to a builder (which could have further reaching implications).
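The trade-off between a literal and run-time compilation can be sketched as follows, assuming a Swift 5.7 or later toolchain. `parseAtRunTime` is a hypothetical helper; `Regex(_:as:)` is the throwing string initializer with an explicit output type.

```swift
// Literal: checked by the compiler, capture types inferred.
let literal = #/(\d{4})-(\d{2})/#

// Run-time compilation: the output type must be spelled out, and a bad
// pattern is only discovered when the initializer runs.
func parseAtRunTime(_ pattern: String) -> Regex<(Substring, Substring, Substring)>? {
    try? Regex(pattern, as: (Substring, Substring, Substring).self)
}

let good = parseAtRunTime(#"(\d{4})-(\d{2})"#)
let bad = parseAtRunTime(#"(\d{4}-(\d{2})"#)  // unbalanced parenthesis

print(good != nil)  // true
print(bad != nil)   // false
```

The same unbalanced pattern written as a literal would be rejected at compile time, which is the value that is lost if long regexes have to move into strings.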

Indeed, that would be a poor argument if it were made entirely in isolation. I include an illustrative example and musings about literal extensions which in particular shine in a multi-line mode.

But, it being useful and widespread in Perl and Ruby is actually a strong motivation for having some way to support it as a literal instead of run-time compilation or having them rework it.

We're not talking about shipping the "good" feature and the "bad" feature. We're talking about two good features with their own, mutually complementary, strengths and weaknesses.

If there's a "bad" feature here to nudge people off of, it's using run-time construction for statically-known regexes. If we do not have a multi-line non-semantic whitespace literal solution to offer, then we are nudging people onto this "bad" path.


As far as I see, it seems like:

  • The use cases and the behavior of extended syntax (#/.../#) are not as clear-cut and obvious as is the case for its string counterpart. It seems like their similar syntax might be more misleading than helpful.

  • The rules governing multi-line regexes are not obvious and there isn't a consensus on what is the best solution. There are additional issues around how to specify the global options.

May I suggest we focus on making sure /.../ works well enough with the proposed rules in absence of the extended syntax, and approve this subset of the proposal, and then go back to a more open discussion about:

  • The meaning of extended mode #/.../#
  • Multi-line literals
  • The place and relationship of regex global options with regex literals

I think, addressing the above issues may call for a different solution and we should not rush it. For example, we may need to also support some variant of #regex to cover multi-line case and global options specification. In those cases verbosity of the syntax might actually be useful.


Following precedent on this point is indeed a strong argument, which makes me like @hamishknight’s suggestion for using traditional extended mode, but emitting a warning for an internal space surrounded by literal characters (my option 1). I like that when Hamish proposed it, and it’s still compelling now.

Yes, understood, agreed, and well said.

The point I’m advancing for consideration here is that multiline regex literals may in fact be a “bad” feature in the presence of the builder DSL, which is not precedented in other languages that support multiline extended mode.

Swift is a language that is willing to be fairly opinionated about narrowing its syntactic options, especially in the service of readability and clarity. Consider the history of ++: the feature has some confusing behaviors around pre- / post-increment behavior (and parsing, for that matter!) that often tripped up people discovering it for the first time, but it was so long-precedented and so familiar that it seemed utterly essential. Of course Swift included the ++ operator! But over time, it turned out not to carry its weight.

Why not? Because of for x in a..<b. Language features not present in C (for-each loops and ranges) shifted the balance, and what once seemed essential became more burden than benefit. In the contentious debate over removing it, people brought up parallel points: It's useful in other languages! Copying and pasting existing code becomes harder! Why should the language get opinionated about there being a good way and a bad way when we could just provide the choice? But in the end, we removed the feature, and the language has not suffered greatly for it.

TL;DR: The presence of newer features can sometimes make longstanding, seemingly essential precedents inessential.

To be clear, I'm not 100% decided on this question myself here. I'm not sure that multiline extended mode is a situation that parallels ++. What I'm arguing for is seriously considering that it might be.


I'm very hesitant to add unsuppressable warnings for valid content. Warnings should be used when there's something that could be actively misleading or harmful and carry a strong recommendation to write it another way.

I think not supporting any multi-line literal would be a good "Alternatives Considered". I can appreciate the reasoning, though my recommendation is still to support it. Thanks for helping me flesh out the rationale and bring this consideration and alternative some more visibility.


I believe that's exactly what happened in round 1 of the review, and this round is for the more open discussion you mention.


It is indeed surprising. Thus this discussion!

The trouble with making #/ vs / be the “ignore whitespace” flag is that it mucks up the original purpose of the syntax, which is to prevent having to escape slashes. For example:

/https:\/\/forums.swift.org\/t\/(.+)\/(\d+)\/(\d+)/  😵‍💫

#/https://forums.swift.org/t/(.+)/(\d+)/(\d+)/#      🙂

But if #/ is also a flag to ignore whitespace, then this happens:

/<a href="(https?:\/\/[^"]+)">(.*)<\/a>/  😵‍💫

#/<a href="(https?://[^"]+)">(.*)</a>/#   😵 Oops! Now we're matching <ahref=…>
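As a usage sketch of the slash-friendly form, assuming a Swift 5.7 or later toolchain (the sample URL and constant names are made up for illustration):

```swift
// Extended delimiters: forward slashes in the pattern need no escaping.
let topicURL = #/https://forums.swift.org/t/(.+)/(\d+)/(\d+)/#

let url = "https://forums.swift.org/t/some-topic/12345/67"
let match = url.wholeMatch(of: topicURL)

// Captures are available positionally on the match.
if let match {
    print(match.1, match.2, match.3)  // some-topic 12345 67
}
```

Here the extended delimiter changes only where the literal ends, not how whitespace inside it is treated, which is the separation of concerns argued for above.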

Agreed 100% re unsuppressable warnings, but the proposed one isn't unsuppressable.

If somebody intended this to match "foobar":

#/
foo bar
/#

…they could use any of the following alternatives to suppress the warning:

(foo) (bar)
(?:foo) :bar
foo() bar
foo
bar

Maybe not ideal, but not a deal-killer either.

(I wonder how many extended mode regexes in the wild would run afoul of this warning?)