[Pitch] Regex Syntax

hamishknight · March 3, 2022, 8:53pm

Hello, we want to issue an update to Regular Expression Literals and prepare for a formal proposal. The great delimiter deliberation continues to unfold, so in the meantime, we have a significant amount of surface area to present for review/feedback: the syntax inside a regex literal. Additionally, this is the syntax accepted from a string used for run-time regex construction, so we're devoting an entire pitch/proposal to the topic of regex syntax, distinct from the result builder DSL or the choice of delimiters for literals.

The full pitch is available here: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md.

Joe_Groff · March 3, 2022, 9:14pm

I think that starting from a widely-deployed baseline, such as PCRE, is the right approach. From my experience with Perl, I would say that one extremely high-value tweak to the standard-ish regex syntax is its /x mode, which allows for free whitespace formatting and comments inside the regex literal (and in turn, requires you to write an explicit escape sequence like \s or character class like [ ] where you want to match whitespace). Being able to format even simple regexes greatly increases their readability, and allows for longer regexes to be manageable as well without needing to switch over to some other more explicit string matching syntax. Raku (the former "Perl 6") made /x mode the standard behavior for its pattern literals, and I think that any new language adding string processing support would be good to follow its lead.
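For readers who have not used /x mode, here is a minimal sketch of the idea using Python's `re.VERBOSE` flag, that engine's counterpart to Perl's /x. It illustrates the concept in another language and is not the syntax proposed in the pitch; the date pattern and sample strings are made up for the example.

```python
import re

# Compact form: correct, but hard to scan.
compact = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# The same pattern in verbose mode: whitespace is ignored and
# '#' starts an end-of-line comment.
verbose = re.compile(
    r"""
    (\d{4})  # year
    -
    (\d{2})  # month
    -
    (\d{2})  # day
    """,
    re.VERBOSE,
)

sample = "2022-03-03"
compact_groups = compact.fullmatch(sample).groups()
verbose_groups = verbose.fullmatch(sample).groups()
print(compact_groups == verbose_groups)

# A literal space now has to be written explicitly, e.g. as [ ].
spaced = re.compile(r"a [ ] b", re.VERBOSE)
print(spaced.fullmatch("a b") is not None)
print(spaced.fullmatch("ab") is not None)
```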

ksluder · March 3, 2022, 9:21pm

Other literal characters may also be preceded with a backslash, but it has no effect if they are unknown escape sequences, e.g. \I is literal I.

This surprised me. I would expect unknown escape sequences to generate an error, or at least a warning.
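For comparison, Python's `re` module (3.6 and later) already takes the stricter route for ASCII letters: an unknown backslash-letter escape is a compile-time error, while escaped punctuation is still accepted as a literal. A small sketch, where `compiles` is a hypothetical helper written only for this illustration:

```python
import re

def compiles(pattern: str) -> bool:
    """Hypothetical helper: report whether Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

known_escape = compiles(r"\d")     # built-in escape
unknown_letter = compiles(r"\I")   # unknown ASCII-letter escape
escaped_punct = compiles(r"\-")    # escaped punctuation, treated as a literal

print(known_escape, unknown_letter, escaped_punct)
```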

Joe_Groff · March 3, 2022, 9:24pm

I agree, that also seems like a worthwhile break from tradition.

roosterboy · March 3, 2022, 9:25pm

The lookahead and lookbehind syntax is currently described as:

?= positive lookahead
?! negative lookahead
?<= positive lookbehind
?!< negative lookbehind

Shouldn't that last one be ?<! to keep it parallel with the rest, and to match how it's done in PCRE2, Oniguruma and, I assume, the other standard regex engines mentioned in this pitch?
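To illustrate why the spelling matters, here is a short sketch in Python, whose `re` module uses the same `(?<!...)` spelling for negative lookbehind. In that engine `(?!<...)` is also valid, but it means something different: a negative lookahead for a literal `<`. The sample strings are invented for the example.

```python
import re

text = "$10 20 $30 40"

# (?<!\$) is a negative lookbehind: match numbers NOT preceded by '$'.
unpriced = re.findall(r"(?<!\$)\b\d+", text)
print(unpriced)

# (?!<) is a negative lookahead for a literal '<':
# match word characters that are NOT followed by '<'.
not_before_angle = re.findall(r"\w(?!<)", "a<b")
print(not_before_angle)
```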

hamishknight · March 3, 2022, 9:44pm

Good catch! That is correct, let me update it.

rintaro · March 3, 2022, 9:59pm

A metacharacter may be treated as literal by preceding it with a backslash. Other literal characters may also be preceded with a backslash, but it has no effect if they are unknown escape sequences, e.g \I is literal I

Is it worth mentioning that \ can escape the literal delimiters, or defining "metacharacter" to include the delimiter?

Nevin · March 4, 2022, 12:34am

I remain adamant in my stance that designing regex syntax before, or at the same time as, a true native Swift parsing and pattern-matching feature, is actively harmful and will cause long-lasting damage to the language.

I understand this is a stark position to hold, so please let me explain my reasoning.

We should design a first-class, powerful, and versatile solution for parsing and pattern-matching in native Swift, and we should make it so convenient and so useful that nobody ever wants to use regular expressions instead.

This is a lofty goal. It might even be an impossible goal. But we should nonetheless take it as a goal, and do our level best to achieve it if we can.

If we add regular expressions to Swift before such a parsing and pattern-matching feature, or if we design them at the same time, or if we so much as plan to eventually support regex literals, then we will inevitably fall into a design pitfall which compromises the usefulness of the general feature.

What will happen is, while designing the general feature, there will be a tendency to engineer for complex cases at the expense of simple cases. People will say, “Yes, the general feature is verbose or unwieldy for some use-cases, but those use-cases are simple enough to solve with a short regex instead, so that’s okay.”

Except that is absolutely and fundamentally not okay.

The general solution must be designed for ease of use in simple cases, every bit as much as it must also be designed for comprehensive utility in advanced cases.

It is easy to say, “Of course we will design the general feature for ease of use, and we would never compromise its convenience in simple cases.”

But I am convinced that is exactly what will happen, unintentional though it be.

Even if we make an active effort not to do so, despite our best intentions, it will nonetheless be in the back of everyone’s mind that regular expressions are available for certain use-cases. This will tint our view of the design even if we don’t want it to.

Even if no one ever says out loud, “That’s okay, regexes can handle it,” and even if we consciously endeavor to ignore them while designing the general feature, their mere presence in Swift will subtly affect the way we think about the possible solution space.

Instead, we should design the general feature first, with an overt goal and intention of being so powerful and so delightfully convenient that we will never need nor want to introduce regex literals at all.

scanon · March 4, 2022, 1:59am

We have a first-class, powerful, and versatile solution for parsing and pattern-matching in native Swift. That’s a separate proposal (actually multiple proposals, some of which have already been pitched).

We still want to have regexes, because:

* Programs need to be able to leverage user-specified parsers and pattern matchers, not only those fixed in their source code, and those need to support a string representation. For the purposes of interoperability with existing tools, it is advantageous for regexes to be one such representation.
* They are the lingua franca of such programs today, and that familiarity confers substantial advantages.

That’s this proposal.

Dante-Broggi · March 4, 2022, 2:26am

I agree, and have another reason regex literals should come strictly after a true native Swift parsing and pattern-matching feature:

It is my opinion that when Swift supports Regex literals, it should be extensible to all regex varieties. Specifically, regardless of where one gets the regex from, be it Python, PCRE, Javascript, etc, one should simply need to specify the applicable variety name in the initializer, alongside the regex string.
In addition, Swift regexes should have an accessor to get a string version of the regex parameterized by the desired variety name.

This is essentially similar to how Swift handles string encoding forms.

masters3d · March 4, 2022, 3:35pm

On properties, can we offer fix-its for special properties so that checked-in code has a consistent representation? The fuzzy matching doesn’t seem necessary for checked-in code. If the compiler can infer the property, it could make the spelling consistent via a fix-it. It’s going to be very difficult to write linters for regex literals, so I am hoping the compiler can help standardize this part via a warning and fix-it.

* The special Java properties `javaLowerCase`, `javaUpperCase`, `javaWhitespace`, `javaMirrored`.

We follow [UTS#18](https://www.unicode.org/reports/tr18/)'s guidance for character properties, including fuzzy matching for property name parsing, according to rules set out by [UAX44-LM3](https://www.unicode.org/reports/tr44/#UAX44-LM3). The following property names are equivalent:

* `whitespace`
* `isWhitespace`
* `is-White_Space`
* `iSwHiTeSpaCe`
* `i s w h i t e s p a c e`
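A minimal sketch of how such loose matching could work, based on a simplified reading of UAX44-LM3 (ignore case, whitespace, underscores, hyphens, and an initial "is" prefix). The function name `loose_property_key` is hypothetical, and this is not the implementation used by the pitch.

```python
def loose_property_key(name: str) -> str:
    """Hypothetical helper: reduce a property name to a loose-matching key."""
    # Ignore case, whitespace, underscores, and hyphens.
    key = "".join(
        ch for ch in name.lower() if not ch.isspace() and ch not in "_-"
    )
    # Ignore an initial "is" prefix, as long as something remains after it.
    if key.startswith("is") and len(key) > 2:
        key = key[2:]
    return key

spellings = [
    "whitespace",
    "isWhitespace",
    "is-White_Space",
    "iSwHiTeSpaCe",
    "i s w h i t e s p a c e",
]
keys = {loose_property_key(s) for s in spellings}
print(keys)
```

All five spellings collapse to the same key. A real implementation would also need to handle property names that legitimately begin with "is", which this sketch strips unconditionally.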

Michael_Ilseman · March 4, 2022, 3:57pm

We support extended syntax: Extended syntax modes

Various regex engines offer an "extended syntax" where whitespace is treated as non-semantic (e.g. a b c is equivalent to abc), in addition to allowing end-of-line comments (# comment). In both PCRE and Perl, this is enabled through the (?x) and, in later versions, (?xx) matching options. The former allows non-semantic whitespace outside of character classes, and the latter also allows non-semantic whitespace in custom character classes.

Oniguruma, Java, and ICU, however, enable the broader behavior under (?x). We therefore propose following this behavior, with (?x) and (?xx) being treated the same.

An additional tidbit is that Perl's (?x) came historically before (?xx), and we propose that unifying on the more modern interpretation is a better and more consistent story.
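As a concrete point of comparison, Python's `re` module implements the narrower, Perl-(?x)-style behavior: whitespace is ignored outside character classes but stays significant inside them. The sketch below shows that split; under the unified behavior proposed here, the second pattern would instead be expected to behave like `[ab]`.

```python
import re

# Outside a character class, whitespace is non-semantic under (?x).
outside = re.compile(r"(?x) a b c")
matches_abc = outside.fullmatch("abc") is not None

# Inside a custom character class, Python keeps the space as a member,
# which is the older (?x) behavior rather than the (?xx) behavior.
inside = re.compile(r"(?x) [a b]")
space_in_class = inside.fullmatch(" ") is not None

print(matches_abc, space_in_class)
```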

It is definitely an interesting discussion of what the default should be and whether a choice of whitespace treatment is also reflected in API (in addition to options specified within the regex). There's no API in this pitch, but I think the final proposal would include initializers on a Regex type (to be proposed elsewhere), and labels could clarify that aspect. So this definitely is something we'll have to decide for the next version of this pitch or proposal.

I think we should error out for any backslash-escaped ASCII characters in [a-zA-Z] that are not builtins. I (weakly-held opinion) think that we should also error for any backslash-escaped non-ASCII non-whitespace characters. @hamishknight what do you think?

Michael_Ilseman · March 4, 2022, 4:12pm

We discuss this some in the Swift canonical syntax section.

Character properties can be spelled \p{...} or [:...:]. We recommend preferring \p{...} as the bracket syntax historically meant POSIX-defined character classes, and still has that connotation in some engines. The spelling of properties themselves can be fuzzy and we (weakly) recommend the shortest spelling (no opinion on casing yet).

I think it's good to discuss what the best Swift spelling is as well as what mechanisms to employ.

I'm generally in favor of erroring on almost-certainly-an-error regexes, warning with fix-its towards a better or "canonical" syntax, etc. But I'm also sympathetic towards the fact that Swift doesn't have a fine-grained warning suppression story, which would hurt anyone trying to keep regexes in Swift in sync with regexes elsewhere. I don't know how common or important that use case is, and there are always workarounds such as suppressing all warnings or run-time construction from a raw string.

masters3d · March 4, 2022, 4:35pm

Ah thank you. Perhaps we have a preferred init that gives our warnings and another escape hatch init that doesn’t enforce any canonical spellings.

Another option might be to introduce a global "canonical" regex flag that is used at the beginning of the regex.

Joe_Groff · March 4, 2022, 6:09pm

Yeah, I agree that what Perl calls xx is a more reasonable baseline for x behavior, so I support unifying the behavior even if it doesn't end up being the default. Having "extended" syntax be the default could also be beneficial for the choice of delimiter syntax—with extended syntax, there's less reason to begin a regex literal with whitespace, because if you want to match a space you have to write a space-matching pattern using printable characters. If we're going to reserve an existing operator character like / for introducing regex literals, we could potentially say that it must be immediately followed by non-whitespace, so that we only have to reserve the prefix operator form and don't break code that spreads a binary division expression across lines, for instance.

hamishknight · March 4, 2022, 9:35pm

Yeah, initially the acceptance of unknown letter character escapes was for compatibility with e.g. Oniguruma, but I agree it's worth breaking compatibility in this case, as it's not a useful thing to write and would block the addition of future escape sequences. Extending it to non-ASCII non-whitespace characters also seems reasonable.

hamishknight · March 4, 2022, 9:55pm

Probably more of a personal style thing, but I feel like if extended syntax were the default, I would be more likely to want to start the regex with whitespace, e.g I would rather write:

foo(/ [a-z A-Z]+ \s* : \s* \d /)

than:

foo(/[a-z A-Z]+ \s* : \s* \d/)

ksluder · March 4, 2022, 9:58pm

When reading code, I would expect whitespace to be significant without a flag indicating otherwise.

nnnnnnnn · March 4, 2022, 10:26pm

I agree with this. It matches the behavior of string literals and will alert people to near misses (or unrecognized metacharacters from other regex engines, if they're able to find one that we missed).

bjhomer · March 5, 2022, 2:25am

As much as extended mode is nice, if we use it by default I worry that it will silently break a lot of regexes that a user might paste in from some other source.
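A small sketch of that failure mode, using Python's `re.VERBOSE` as a stand-in for an extended-by-default engine. The pasted patterns are invented examples; the point is that both still compile without complaint but match different input.

```python
import re

pasted = r"hello world"  # written for an engine where whitespace is literal

default_mode = re.compile(pasted)
extended_mode = re.compile(pasted, re.VERBOSE)

literal_ok = default_mode.fullmatch("hello world") is not None
extended_ok = extended_mode.fullmatch("hello world") is not None
extended_joined = extended_mode.fullmatch("helloworld") is not None
print(literal_ok, extended_ok, extended_joined)

# A '#' is even worse: in extended mode it starts a comment, so the
# rest of the pattern disappears and only the empty string matches.
issue_tag = re.compile(r"#\d+", re.VERBOSE)
tag_matches = issue_tag.fullmatch("#42") is not None
empty_matches = issue_tag.fullmatch("") is not None
print(tag_matches, empty_matches)
```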