[Pitch] Regex Syntax

Ah thank you. Perhaps we have a preferred init that gives our warnings and another escape hatch init that doesn’t enforce any canonical spellings.

Another option might be to introduce a canonical, regex-global flag that is used at the beginning of the regex.

Yeah, I agree that what Perl calls xx is a more reasonable baseline for x behavior, so I support unifying the behavior even if it doesn't end up being the default. Having "extended" syntax be the default could also be beneficial for the choice of delimiter syntax—with extended syntax, there's less reason to begin a regex literal with whitespace, because if you want to match a space you have to write a space-matching pattern using printable characters. If we're going to reserve an existing operator character like / for introducing regex literals, we could potentially say that it must be immediately followed by non-whitespace, so that we only have to reserve the prefix operator form and don't break code that spreads a binary division expression across lines, for instance.
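To make the "must be immediately followed by non-whitespace" idea concrete, here is a rough sketch of the check a lexer might perform. The function name `couldOpenRegexLiteral` and the set of characters treated as prefix position are assumptions for illustration only; nothing here is the compiler's actual rule, and delimiters are still undecided.

```swift
/// Hypothetical helper (not part of the pitch): decides whether the `/` at
/// `index` could open a regex literal under the rule described above.
func couldOpenRegexLiteral(_ source: String, slashAt index: String.Index) -> Bool {
    precondition(source[index] == "/", "index must point at a slash")

    // The opening `/` must be immediately followed by a non-whitespace character.
    let next = source.index(after: index)
    guard next < source.endIndex, !source[next].isWhitespace else { return false }

    // Assumed approximation of "prefix operator position": the slash is at the
    // start of the source, or is preceded by whitespace or an opening delimiter.
    guard index > source.startIndex else { return true }
    let previous = source[source.index(before: index)]
    return previous.isWhitespace || "([{,;:".contains(previous)
}

/// Convenience for the examples: checks the first slash in `source`.
func firstSlashOpensRegex(_ source: String) -> Bool {
    guard let slash = source.firstIndex(of: "/") else { return false }
    return couldOpenRegexLiteral(source, slashAt: slash)
}
```

With a check like this, `foo(/abc/)` would be a candidate regex literal, while a division spread across lines such as `a` followed by `/ b` on the next line would keep its current meaning, because the slash is followed by a space.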

5 Likes

Yeah, initially, accepting unknown letter character escapes was done for compatibility with e.g. Oniguruma, but I agree it's worth breaking compatibility in this case as it's not a useful thing to write, and would block the addition of future escape sequences. Extending it to non-ASCII non-whitespace characters too also seems reasonable.

2 Likes

Probably more of a personal style thing, but I feel like if extended syntax were the default, I would be more likely to want to start the regex with whitespace, e.g. I would rather write:

foo(/ [a-z A-Z]+ \s* : \s* \d /)

than:

foo(/[a-z A-Z]+ \s* : \s* \d/)

When reading code, I would expect whitespace to be significant without a flag indicating otherwise.

5 Likes

I agree with this. It matches the behavior of string literals and will alert people to near misses (or unrecognized metacharacters from other regex engines, if they're able to find one that we missed).

3 Likes

As much as extended mode is nice, if we use it by default I worry that it will silently break a lot of regexes that a user might paste in from some other source.
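A small sketch of the kind of silent breakage in question. It uses Foundation's existing `NSRegularExpression` and its `.allowCommentsAndWhitespace` option purely as a stand-in for "extended mode on"; the pitched literal may not behave identically, and the `matches` helper is a made-up name.

```swift
import Foundation

/// Returns true if `pattern` matches anywhere in `text`.
/// `extended` toggles ICU's free-spacing mode, used here only as a stand-in
/// for an extended-by-default regex literal.
func matches(_ pattern: String, in text: String, extended: Bool) -> Bool {
    let options: NSRegularExpression.Options = extended ? [.allowCommentsAndWhitespace] : []
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return false
    }
    let range = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

// A pattern pasted from a source where the space is meant literally.
let pasted = "hello world"

let traditionalOnSpaced = matches(pasted, in: "hello world", extended: false)
let extendedOnSpaced = matches(pasted, in: "hello world", extended: true)
let extendedOnJoined = matches(pasted, in: "helloworld", extended: true)
```

Under ICU's rules the pasted pattern still compiles in extended mode, but it stops matching `hello world` and starts matching `helloworld` instead, with no diagnostic along the way.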

1 Like

Mmm, I would have the opposite expectation for code that's parsed in Swift (as regex literals would be here): whitespace is generally not significant.

Has that been the case for Raku? I would be reassured if the empirical experience there has been that it's mostly fine, and if not then certainly we should worry about the same.

3 Likes

Whitespace is significant within quotation marks even in Swift. In every other language with regular expressions that I have used, they have behaved like quoted strings rather than parenthesis-delimited expressions.

1 Like

For me at least, the most exciting part about regex literals as they're proposed for Swift is that they're not going to be in quotation marks and won't behave like quoted strings (the type of these values will reflect the parsed syntax tree, and you'll be able to do tuple destructuring for matches, etc.). And since that's the overall vibe we're going for, I'd expect whitespace not to be significant.

4 Likes

As far as I know, delimiters have not yet been decided.

[quote]and won't behave like quoted strings (the type of these values will reflect the parsed syntax tree, and you'll be able to do tuple destructuring for matches, etc.).
[/quote]

A richer representation doesn’t mean they have to have unexpected behavior.

1 Like

It's worth keeping in mind that even with a non-string regex literal, we will still want to allow initialization of regexes from strings, using a common interior syntax. If you're writing a text editor, you want your users to be able to provide a regex as a string and use it to perform search and replace operations, for example.
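As an illustration of the text-editor case, here is roughly what run-time construction from a user-supplied string looks like today. The sketch uses Foundation's `NSRegularExpression` as a stand-in, since the run-time API for the pitched regex type is not settled in this thread; `replaceAll` is a hypothetical helper name.

```swift
import Foundation

/// Hypothetical editor helper: compiles a user-typed pattern at run time and
/// replaces every match. Returns nil if the pattern fails to compile.
func replaceAll(in text: String, userPattern: String, with template: String) -> String? {
    guard let regex = try? NSRegularExpression(pattern: userPattern, options: []) else {
        return nil
    }
    let range = NSRange(text.startIndex..., in: text)
    return regex.stringByReplacingMatches(in: text, options: [], range: range, withTemplate: template)
}

// The pattern arrives as a plain string, e.g. from a search field.
let edited = replaceAll(in: "width: 10, height: 20", userPattern: #"\d+"#, with: "0")

// A malformed pattern can only be rejected at run time.
let rejected = replaceAll(in: "anything", userPattern: "(unclosed", with: "")
```

The point of the example is the failure mode: a string-initialized regex can only report a syntax error when the program runs, whereas a literal could be diagnosed at compile time, which is one reason to keep both paths over a shared interior syntax.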

6 Likes

How about 2 different literals - one for ‘regex’ that is whitespace sensitive, and another for ‘multi line regex’ that isn’t? I know that delimiters aren’t up for discussion now, but some sort of parallel to ”…” and ”””…””” might make sense? Or maybe it’s more like string vs raw string?

I can see that there are advantages to encouraging people to use non-whitespace-significant regex, maybe as the default. But I also think that it could be annoying and bug-inducing to have to rewrite any pre-existing regexes to 'escape' the whitespace. So, it would be useful to be able to write both, and possibly not just with a flag at the end.

2 Likes

Something else that's interesting is mentioned in the "Introduce a novel syntax" alternative, which is that we're developing an experimental extended syntax for Swift which goes a little further. This is not proposed here and is likely to be future work (it's a lot to bite off at the same time), but the rough and ever-evolving idea is:

  1. All ASCII values outside of [A-Za-z0-9] are reserved for metacharacters and should be escaped or quoted for literal treatment.
  2. All whitespace is non-semantic unless escaped; # is supported for end-of-line comments (pending delimiter, perhaps // too)
  3. Quoted literal content uses double-quotes, so you can say "a.b" instead of \Qa.b\E. These would be Swift string literals eventually supporting interpolation, raw strings, etc.
  4. Clearer capture group syntax and defaults: (...) is non-capturing, (_: ...) for unnamed capture, and (name: ...) for named, etc.
  5. Support Swift-syntax ranges for ranged quantification, i.e. x{3..<8} for x{3,7}
  6. Use of other now-free delimiters, e.g. <...>, as a way of naming builtins such as character classes and anchors, perhaps also an interpolation syntax or way to refer to in-scope declarations.
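To illustrate item 5, the mapping from a Swift range to a traditional counted quantifier can be sketched as below. `traditionalQuantifier(for:)` is a hypothetical name used only for this example, and how the experimental syntax would actually lower ranges is not specified here.

```swift
/// Hypothetical translation for item 5: renders a half-open Swift range as a
/// traditional counted quantifier, e.g. 3..<8 becomes "{3,7}".
/// Returns nil for ranges that have no traditional spelling (empty or negative).
func traditionalQuantifier(for range: Range<Int>) -> String? {
    guard !range.isEmpty, range.lowerBound >= 0 else { return nil }
    let last = range.upperBound - 1
    return last == range.lowerBound ? "{\(last)}" : "{\(range.lowerBound),\(last)}"
}

/// Same translation for a closed range, e.g. 3...7 becomes "{3,7}".
func traditionalQuantifier(for range: ClosedRange<Int>) -> String? {
    guard range.lowerBound >= 0 else { return nil }
    return range.lowerBound == range.upperBound
        ? "{\(range.lowerBound)}"
        : "{\(range.lowerBound),\(range.upperBound)}"
}

// x{3..<8} in the experimental syntax corresponds to x{3,7} in traditional syntax.
let halfOpen = "x" + (traditionalQuantifier(for: 3..<8) ?? "")
let closed = "x" + (traditionalQuantifier(for: 3...7) ?? "")
```

The half-open form is the one where an off-by-one is easy to make by hand, since the upper bound has to be decremented.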

This clearly breaks compatibility with existing regex syntax, so it would need to be clearly delineated and makes sense as future work. There's significant value to allowing things like command-line tools and search fields access to traditional regex syntax, so this wouldn't take the place of what we're proposing.

Another practical reason to consider this future work is that the current effort is pushing the state of the art of the Swift compiler: our regex parser is written as a stand-alone pure-Swift library that gets bundled up and incorporated with the C++ Swift compiler. The Swift compiler yields lexing/parsing state to our library, which then yields back to the Swift compiler after lexing/parsing. A "year 2" of overhauling parts of the Swift lexer could include handling string literals in such a library, making it natural for the regex parser to support embedded Swift string literals. Alternatively, in the nearer term, if this is deemed high-value, we could support just basic string literals at first.

If we're debating an extended syntax by default for literals (or one of multiple literals), I'm not sure how much value Perl-style xx gives us. Result builders seem like the better way to separate components across lines with comments. The syntax above, especially points 1-3, gives us a more compelling extended syntax.

16 Likes

I think there's an issue in the proposal: it says that HexDigit can take a-zA-Z?

HexDigit   -> [0-9a-zA-Z]

I think you meant

HexDigit   -> [0-9a-fA-F]

2 Likes
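For anyone wondering what the difference amounts to in practice, here is a quick sketch of the two productions as predicates. Both function names are made up for the example and are not taken from the proposal or its implementation.

```swift
/// Mirrors the corrected production `HexDigit -> [0-9a-fA-F]`.
func matchesHexDigit(_ c: Character) -> Bool {
    guard let ascii = c.asciiValue else { return false }
    switch ascii {
    case UInt8(ascii: "0")...UInt8(ascii: "9"),
         UInt8(ascii: "a")...UInt8(ascii: "f"),
         UInt8(ascii: "A")...UInt8(ascii: "F"):
        return true
    default:
        return false
    }
}

/// Mirrors the production as originally written, `[0-9a-zA-Z]`,
/// which also admits letters that are not hexadecimal digits.
func matchesHexDigitAsOriginallyWritten(_ c: Character) -> Bool {
    guard let ascii = c.asciiValue else { return false }
    switch ascii {
    case UInt8(ascii: "0")...UInt8(ascii: "9"),
         UInt8(ascii: "a")...UInt8(ascii: "z"),
         UInt8(ascii: "A")...UInt8(ascii: "Z"):
        return true
    default:
        return false
    }
}
```

The two predicates agree on `f` and `A` but disagree on `g`, which the original grammar would have accepted as a hex digit.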

Good catch! Fixed in "Fix HexDigit definition in RegexSyntax.md" by hamishknight (Pull Request #253, apple/swift-experimental-string-processing on GitHub).

1 Like

I have merged an update that pulls in run-time construction and AnyRegexOutput:

@hamishknight can you update the link in your original post to point to the new version? Thanks.

Unfortunately it seems you can't edit old posts. Updated pitch thread: [Pitch #2] Regex Syntax and Run-time Construction

1 Like

Apologies for the late reply on this, we plan on mentioning it in the regex literal pitch as we feel it's more of a detail of the literal itself than the syntax of the regex engine.

1 Like

Agreed, and I’d even go further and say, per Steve Canon’s comment, that it’s best to hew to tradition with this syntax so as to favor reuse of existing regexes from other languages, and save the bold new design ideas for the new DSL.

People are going to get frustrated fast if they can’t copy and paste that “regex for valid emails, attempt 6003” answer from Stack Overflow.

7 Likes