[Pitch] Regular Expression Literals

hamishknight · October 15, 2021, 11:08am

allevato:

So I hope the authors can elaborate more on what changes they think might be necessary. I think requiring semicolons to disambiguate a terminated statement from a wrapped one would be a major shortcoming (although result builders seem to have already cracked the door open a bit to that, based on the other examples they gave).

rintaro:

If we really need to change the parsing rule (under a version check), that would be something like "treat the / at the beginning of a line as the start of a regular expression literal".

EDIT: Wait, yeah, I now understand your concern, @allevato. Let's see...

Right—a rule like that would make it impossible to wrap expressions such that binary operators are placed at the beginning of the next line when they occur at line breaks, without introducing an awkward special case for division, even if it was guarded by a language mode flag. (I'm not just saying that because operator-at-the-beginning-of-the-line is the default wrapping style used by swift-format, although that's part of the reason.) Asking users to rewrap their code when upgrading to the next version of Swift would be... unfortunate.

Yeah, I think if we did introduce a parsing rule like that, we'd likely also want to consider the token that comes after on the same line, so e.g.

1
/ 2 / 3

1
/ 2 
/ 3

would both continue to parse as binary chains. But more ambiguous cases like:

0
/ 1 / .foo

0
/ 1 /
2

would start parsing as regex literals.
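To make the four cases above concrete, here is a toy sketch of such a heuristic. It is not the logic in the linked PR; the names `SlashInterpretation` and `classifyLeadingSlash` are made up for illustration, and it assumes a simplified rule: a line-leading `/` starts a regex only if a second `/` appears on the same line and is followed either by nothing or by a `.` member access. Escaped slashes and comments are ignored.

```swift
// Toy model only: hypothetical names, not the compiler's lexer.
enum SlashInterpretation { case binaryOperator, regexLiteral }

/// Classifies a line whose first non-whitespace character is `/`.
func classifyLeadingSlash(_ line: String) -> SlashInterpretation {
    let isBlank: (Character) -> Bool = { $0 == " " || $0 == "\t" }
    let trimmed = line.drop(while: isBlank)
    // Lines that do not start with `/` are outside this sketch's scope.
    guard trimmed.first == "/" else { return .binaryOperator }
    let body = trimmed.dropFirst()
    // No second `/` on this line: keep treating the first one as division.
    guard let closing = body.firstIndex(of: "/") else { return .binaryOperator }
    let rest = body[body.index(after: closing)...].drop(while: isBlank)
    // Nothing after the second `/`: read the span as a regex literal.
    guard let next = rest.first else { return .regexLiteral }
    // A member access such as `.foo` after the second `/` also suggests a regex.
    return next == "." ? .regexLiteral : .binaryOperator
}
```

Under these assumptions, `/ 2 / 3` and `/ 2 ` stay binary chains, while `/ 1 / .foo` and `/ 1 /` are read as regex literals.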

hamishknight · October 15, 2021, 11:16am

Yup, that's essentially the approach I'm trying out in the draft PR "[DNM] Try out some / delimiter lexing" (apple/swift, Pull Request #39755). It does have some logic to also look ahead to the next token in the case where it's a regex that starts a new line, but I'm not sure whether that will end up being necessary.

Saklad5 · October 15, 2021, 1:40pm

I’d be fine with using single quotation marks for regex right now, and expanding that to arbitrary custom literals in the future (which regex could eventually use instead for the sake of simplifying the language).

schutt · October 15, 2021, 1:50pm

Huge +1 for adding a regex literal using / for the delimiter and following standard PCRE syntax. In my experience most languages or libraries using different delimiters or constructors lead to confusion about what escape sequences are required.

Michael_Ilseman · October 15, 2021, 3:02pm

AliSoftware:

That would make way more sense indeed to have finalize(…) in this example be mutating … which means it should thus return Void and be used like below instead:
builder.finalize(__D1)
return T(regexLiteral: builder)
That would solve my initial confusion

I opened a PR to incorporate this change and accumulate others. This example is meant to guide intuition, so whatever formulation guides intuition best is fine with me.

It's a little artificial this early in the process and the final design would need to be vetted by several conformers. It's irrelevant to what we need feedback and further investigation on, such as the choice of supported syntax (PCRE) and choice of delimiters. That being said, I'm always happy to spitball compiler-library interfaces.

@beccadax does a great job presenting 4 such styles of conformers.

Breaking this down a bit, we see the potential for 4 types:

The resulting constructed value, which might be partially compiled. (e.g. Regex above)
A finalized AST. This could be the return value of finalize() or stored in the builder. (RegexLiteral above)
A reference to an AST fragment, the return value of buildConcatenate(). (e.g. UInt)
The stateful builder that constructs the AST. (also RegexLiteral above)

That is, what we pitched uses the same type for a stateful builder and a finalized AST representation. It would be interesting to see if splitting out the two concerns enables any more functionality. Conversely, if we always immediately pass the builder in to init(regexLiteral:) after finalize(), we might decide to merge the two operations if there's no benefit.
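As a rough illustration of the split described above, the sketch below separates the stateful builder from the finalized AST. Every name here (`ToyRegexBuilder`, `ToyRegexAST`, `ToyRegex`, `buildCharacter`) is hypothetical and only loosely mirrors the pitch's `buildConcatenate()`, `finalize()` and `init(regexLiteral:)`; the AST is modeled as a plain string to keep it short.

```swift
// Hypothetical types for illustration; this is not the pitched protocol.
struct ToyRegexAST {
    let description: String
}

struct ToyRegexBuilder {
    // A fragment is just an index into `nodes` (the pitch example uses UInt).
    typealias Fragment = Int
    var nodes: [String] = []

    mutating func buildCharacter(_ c: Character) -> Fragment {
        nodes.append("char(\(c))")
        return nodes.count - 1
    }

    mutating func buildConcatenate(_ parts: [Fragment]) -> Fragment {
        let rendered = parts.map { nodes[$0] }.joined(separator: ", ")
        nodes.append("concat(\(rendered))")
        return nodes.count - 1
    }

    // Returns a separate, immutable AST instead of mutating the builder.
    func finalize(_ root: Fragment) -> ToyRegexAST {
        ToyRegexAST(description: nodes[root])
    }
}

struct ToyRegex {
    let ast: ToyRegexAST
    init(regexLiteral ast: ToyRegexAST) { self.ast = ast }
}

// What compiler-generated code might look like under this split.
var builder = ToyRegexBuilder()
let a = builder.buildCharacter("a")
let b = builder.buildCharacter("b")
let root = builder.buildConcatenate([a, b])
let regex = ToyRegex(regexLiteral: builder.finalize(root))
```

With this shape the resulting value never sees the builder, which is one way to test whether keeping the two roles in separate types buys anything.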

ensan-hcl · October 15, 2021, 4:23pm

Does it mean that you adopt PCRE syntax but not semantics? IIUC, the word characters in the PCRE document mean Unicode scalars. PCRE regex with grapheme semantics seems strange to me.

al45tair · October 15, 2021, 4:45pm

I mentioned this on a PR already, but it would be nice to support UTS#18 as much as possible.

UTS#18 aims to be similar to Perl/PCRE, so should be broadly compatible. It has some significant extensions to character class syntax IIRC, including (in the Level 2 support) ways to deal with things that occupy multiple code points.

Michael_Ilseman · October 15, 2021, 5:07pm

We should be a syntactic superset of UTS#18 and PCRE if possible. If UTS#18 has an incompatibility with PCRE, we're more likely to side with PCRE. @hamishknight, do you see any issue with this approach?

Semantics are explicitly listed as being outside of this pitch's scope. They're a loose end that needs more focused investigation and their own discussions. The dependency graph of pitching regex is cyclical, and this is where we're breaking the cycle. Resolving the loose ends would be required to graduate this pitch into a proper proposal.

Alejandro_Martinez · October 15, 2021, 5:19pm

Could an example of this be provided? At a first read I thought a conforming type would be able to change how a regex is parsed, which doesn't make much sense, does it? But then I'm not sure what it implies exactly.

I'm probably lacking the imagination and experience for this so apologies, I just think a small example would be nice.

Michael_Ilseman · October 15, 2021, 6:33pm

The idea is for different libraries to provide different semantics using the same regex syntax.

For example, a higher level framework that knows the current locale of the reader and/or application domain, might want to provide more linguistically sophisticated matching. Examples:

Digraphs, such as "ch" in Czech, as a single distinct letter for matching purposes, if applicable to the user's current language
Ligatures, such as ﬁ, might be comparable with their expanded form fi, or not, depending on application
Word boundaries, such as \b, could incorporate large language dictionaries to better understand where boundaries are inside languages that don't separate words by whitespace (e.g. Chinese).
Fuzzy matching, such as allowing to match the same word whether it is typed as a compound, properly hyphenated, or two separate words (windswept, wind-swept, wind swept, wind-\n\s*swept, etc).
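A small sketch of the ligature case may help. It uses a made-up protocol, `ToyLiteralMatcher`, with two conformers that accept the same pattern text but apply different matching semantics. Only literal patterns are handled, and none of these names come from the pitch.

```swift
// Hypothetical protocol and conformers, for illustration only.
protocol ToyLiteralMatcher {
    init(pattern: String)
    func matches(_ input: String) -> Bool
}

/// Compares the input to the pattern exactly as written.
struct ExactMatcher: ToyLiteralMatcher {
    let pattern: String
    init(pattern: String) { self.pattern = pattern }
    func matches(_ input: String) -> Bool { input == pattern }
}

/// Expands the "ﬁ" ligature (U+FB01) on both sides before comparing.
struct LigatureFoldingMatcher: ToyLiteralMatcher {
    let pattern: String
    init(pattern: String) { self.pattern = Self.fold(pattern) }
    func matches(_ input: String) -> Bool { Self.fold(input) == pattern }

    private static func fold(_ s: String) -> String {
        var out = ""
        for ch in s {
            out += (ch == "\u{FB01}") ? "fi" : String(ch)
        }
        return out
    }
}

// Same pattern text, different answers for an input containing the ligature.
let ligatureInput = "\u{FB01}nd"
let exactResult = ExactMatcher(pattern: "find").matches(ligatureInput)
let foldedResult = LigatureFoldingMatcher(pattern: "find").matches(ligatureInput)
```

Here `exactResult` is false and `foldedResult` is true, even though both matchers were given the same pattern.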

ksluder · October 15, 2021, 7:05pm

Given that the type of a literal expression is not itself a literal, how would this work at the call site? For example, let’s say I’m writing a web app using a framework that routes URLs using regular expressions. I have a route that handles products by ID:

WebFramework.route(.GET, /products\/([0-9]+)/) { showProduct(request: $0) }

Now I want to use a locale-aware regular expression implementation so I can also route based on SEO-friendly product names. Because a stand-alone /…/ literal produces a Swift.Regex, I have to be more verbose:

WebFramework.route(.GET, /products\/([[:alpha:]]|-)+/ as BetterLanguageRegex) { showProduct(request: $0) }

This privileges the stdlib’s regular expression compiler over other libraries’, and might lead to confusion if someone tries to factor out the expression literal. I’ll also note that this highly visible application of regular expressions immediately runs into the slash escaping issue.
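The same effect already exists for today's literals, which gives a way to see the concern in running code. The sketch below uses a string literal as a stand-in for a regex literal; `BetterLanguagePattern` and `routeTypeName` are hypothetical names, not part of any framework.

```swift
// Hypothetical stand-in for a custom regex type; a string literal plays the
// role of the regex literal here.
struct BetterLanguagePattern: ExpressibleByStringLiteral {
    let source: String
    init(stringLiteral value: String) { source = value }
}

// A generic entry point that accepts any literal-expressible pattern type
// and reports which type the literal was given.
func routeTypeName<P: ExpressibleByStringLiteral>(_ pattern: P) -> String {
    String(describing: P.self)
}

// With no other type context, the literal falls back to the default type.
let defaultType = routeTypeName("products/[0-9]+")
// Opting in to another implementation needs an explicit `as`.
let explicitType = routeTypeName("products/[0-9]+" as BetterLanguagePattern)
```

In the first call `P` is inferred as `String`, the standard library's default literal type; the second call only picks the custom type because of the annotation.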

If we made #regex(…) the canonical syntax, it puts it on equal footing with other implementations, which could be spelled e.g. #BetterLanguageRegex(…).

beccadax · October 15, 2021, 7:25pm

The lexer (!) cannot know which keywords are regular expression types, so this would amount to grabbing every currently unused #<keyword>(…) syntax for regular expressions. I think we’d like to keep that syntax available for other uses.

(This is not an argument against #regex, just against the idea of allowing custom type names in place of regex.)

JoeyKL · October 15, 2021, 8:21pm

I really like this comment by Chris Lattner in an old thread about adding regexes:

Just to throw some more ideas into the mix:

regex pattern matching is the dual of printing/formatting. I see them as very similar to string interpolation in a lot of ways. Just like interpolation, matching should be type driven (types should specify their matching rules) and there should be some way to customize formatting (e.g. the equivalent of printf style modifiers).

regex matching in Swift should integrate with pattern matching in general.

Perl 6 has some really great things in this department. That community has spent a very very large amount of time thinking about regex's. perl6 is not taking off in a huge way as a general language, but it makes sense to look at the things they are really great at and learn from them.

Plain regexes are inherently very not Swifty. It may make sense to add them as a legacy convenience, but I think the first approach should be to try to intelligently redesign regexes to create the best balance of familiarity, ease of use, and integration with Swift's features and ethos.

JoeyKL · October 15, 2021, 9:00pm

A very brief sketch of what this could look like:

if "0x\(let number: Int, radix: 16)" = input {
  // ...
}
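For comparison, a minimal version of the same extraction in current Swift, without any new pattern syntax, might look like this. The function name `parseHex` is made up for the example.

```swift
/// Parses strings such as "0xff" into an integer, or returns nil.
func parseHex(_ input: String) -> Int? {
    guard input.hasPrefix("0x") else { return nil }
    // Int(_:radix:) accepts any StringProtocol value, including Substring.
    return Int(input.dropFirst(2), radix: 16)
}

if let number = parseHex("0xff") {
    print(number)
}
```

The sketched syntax would fold the prefix check and the typed conversion into one pattern.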

ksluder · October 16, 2021, 5:28am

Not precisely. It would amount to grabbing every currently unused #keyword(…) syntax for literals. All current uses of #keyword are effectively literals, even constructs like if #available(). Lexing is tractable as long as all literal expressions have balanced parentheses.
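A sketch of why balanced parentheses keep this tractable: the scanner below finds the end of a `#keyword(…)` construct by counting depth and does not need to know what the keyword means. The function name is hypothetical, and the sketch assumes a backslash escapes the next character; it does not handle string literals or regex character classes such as `[(]`.

```swift
/// Returns the index just past the `)` that closes the `(` at `open`,
/// or nil if the parentheses never balance.
func endOfBalancedParens(in source: String, from open: String.Index) -> String.Index? {
    precondition(source[open] == "(", "scan must start at an opening parenthesis")
    var depth = 0
    var i = open
    while i < source.endIndex {
        let ch = source[i]
        if ch == "\\" {
            // Step onto the escaped character so it is not counted below.
            i = source.index(after: i)
            if i == source.endIndex { return nil }
        } else if ch == "(" {
            depth += 1
        } else if ch == ")" {
            depth -= 1
            if depth == 0 { return source.index(after: i) }
        }
        i = source.index(after: i)
    }
    return nil
}

let sample = #"#regex(a(b)\)c) + 1"#
let openParen = sample.firstIndex(of: "(")!
let literalEnd = endOfBalancedParens(in: sample, from: openParen)
```

For `sample`, the scan stops right after `#regex(a(b)\)c)`, leaving ` + 1` to be lexed normally.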

xAlien95 · October 16, 2021, 2:07pm

NSRegularExpression uses ICU, which isn't a subset of PCRE(2) (\a, \e, \Uhhhhhhhh are missing in PCRE and \N parses differently between the two). Does it imply that Swift's RegexLiteralProtocol will support both ICU and PCRE in the fullness of time? Or that not every NSRegularExpression string will be able to be replaced with the corresponding regex literal?

Overall, I like this direction. Having syntax highlighting not only for Swift regex literals, but also for Objective-C ones (via Foundation) and JavaScript ones (via JavaScriptCore) can help a lot, especially if alt+clicking the literal can show an informative popup containing its explanation a la regex101.com.

Michael_Ilseman · October 16, 2021, 2:12pm

Our aim is to parse a superset. This pitch is describing syntax only (there's already a lot to talk about) and not semantics such as definitions of character classes, nor what will be supported at any point in time.

ensan-hcl · October 16, 2021, 3:02pm

(I attached the wrong reply target... I'm sorry.)

I think modes (or flags) of regex should be expressed by modifiers like /regex/.mode(), rather than /.../m syntax. It would allow customized modes in libraries.

By the way, does regex literal support interpolation? I sometimes want to combine regexes like /\(regex1)|\(regex2)/.

michelf · October 16, 2021, 3:14pm

It'd be nice. But it'll have to be another syntax because \( already has the meaning of a parenthesis character (escaped) inside a regex.
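Both points can be checked with string-based patterns in Foundation today. The sketch below shows `\(` matching a literal parenthesis, and a hypothetical helper, `alternation`, that combines two patterns by ordinary string interpolation on the pattern text. It says nothing about how interpolation inside a regex literal would be spelled.

```swift
import Foundation

// In regex syntax, `\(` and `\)` are escaped literal parentheses.
let literalParens = #"\(x\)"#
let foundParens = "f(x)".range(of: literalParens, options: .regularExpression) != nil

// With string patterns, combining is done on the text before compilation.
// Non-capturing groups keep each side intact.
func alternation(_ first: String, _ second: String) -> String {
    "(?:\(first))|(?:\(second))"
}

let digitsOrLetters = alternation("[0-9]+", "[a-z]+")
let anchored = "^(?:" + digitsOrLetters + ")$"
let matchesLetters = "abc".range(of: anchored, options: .regularExpression) != nil
let matchesMixed = "4b".range(of: anchored, options: .regularExpression) != nil
```

Inside a Swift string, `\(` is interpolation, while inside the raw string `#"\(x\)"#` it reaches the regex engine as an escaped parenthesis, which is exactly the clash described above.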

michelf · October 16, 2021, 3:20pm

One syntax I like is this one:

#([a-z0-9]+)

Upsides:

It's less verbose than #regex().
The opening delimiter "#(" is unambiguous with any existing Swift syntax, so there's no need for a lot of new lexing rules.
It's balanced so it's easy to figure out where it ends when mixed with other expressions.
There's one less character to escape because parens already need to be escaped anyway.
Can write an empty regex.
It's also a nice touch that those parens look like a capture group because it will accurately represent capture group 0.

One downside:

string.match(#([a-z0-9]+))

That's a lot of parentheses when the regex is already inside a parenthesized expression.

I can't say I dislike the idea of single quotes '' regexes. It would certainly read better when using the regex inside a parenthesized expression like a function call:

string.match('[a-z0-9]+')

Upsides:

It's more lightweight than #regex() or #().
The delimiters ' are unambiguous with any existing Swift syntax so lexing is easy and unlikely to produce unexpected results.
It's using a delimiter character not used elsewhere in Swift, so it's easy to read where the regex syntax starts and ends, like in the parenthesized expression above.
Can write an empty regex.
It looks like a string, so it's not expected its content will be Swift syntax.

Downsides:

You need to escape single quotes with \' in the regex. In my experience this is less common than having to escape /.
It looks like a string, and could be confused for one.

Now, trying to compare with the syntax in the pitch itself:

string.match(/[a-z0-9]+/)

Upsides:

It's more lightweight than #regex() or #()
It closely matches the regex syntax in a couple of other languages.

Downsides:

The opening delimiter / is ambiguous with existing Swift syntax, so there's a need for more complex lexing rules.
Syntax highlighters not based on SourceKit are more likely to do a bad job at telling apart regex from non-regex stuff, or properly identifying the boundaries of the regex literal.
You need to escape slashes with \/ in the regex, which are a more frequent occurrence than single quotes in my experience. Also, I find escaping slashes more confusing than escaping other characters because it's the same shape mirrored that gets repeated: /\/\/.*/.
Cannot write an empty regex.

It's not unworkable, but I don't see many upsides to this choice.
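To compare the escaping burden of the delimiters side by side, here is a naive helper that escapes only the delimiter character. The name `escapeDelimiter` is made up, and the sketch assumes the input pattern contains no backslash escapes of its own.

```swift
/// Prefixes every occurrence of `delimiter` in `pattern` with a backslash.
func escapeDelimiter(in pattern: String, delimiter: Character) -> String {
    var out = ""
    for ch in pattern {
        if ch == delimiter { out.append("\\") }
        out.append(ch)
    }
    return out
}

// A pattern that matches a `//` line comment.
let commentPattern = "//.*"
let slashDelimited = "/" + escapeDelimiter(in: commentPattern, delimiter: "/") + "/"
let quoteDelimited = "'" + escapeDelimiter(in: commentPattern, delimiter: "'") + "'"
```

`slashDelimited` comes out as `/\/\/.*/`, the mirrored shape mentioned above, while `quoteDelimited` is simply `'//.*'`.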