[Pitch #2] Regex Literals

Note that nothing proposed precludes a general solution, so we're not missing anything. The work involved in making regex work is necessary to... making regex work, so it's not wasted even if we had a general language escape hatch to leverage.

This is formally outside the scope of the pitch review, but we're careful to integrate the regex parser with the compiler in a library-driven fashion that makes it easier to migrate more compiler code into modular libraries. This also helps carve a path for more general-purpose foreign language support in the future. My totally speculative, not-even-close-to-formalized dream would be to open up foreign language snippet support to third-party libraries, similar to how we open up literals, custom string interpolation, property wrappers, and result builders to libraries. That requires the compiler integration mechanisms we're developing with regex, and regex can piggyback on whatever escape hatch emerges in the future. Such libraries wouldn't declare their own single-character delimiters; they'd use the more general escape hatch.

My familiarity with AMPL is limited, but I would not expect AMPL constraint fragments to appear directly in API calls and individual lines of a result builder the way regexes do. I don't know why a single-character delimiter would be appealing for AMPL over #ampl(...) or ampl'...'. This proposal does not forbid such foreign language excerpts, nor does it require foreign language excerpts to use single-character delimiters if they're ever added.

Adding a custom parser for a foreign language is a sizable amount of work, but it is independent of its lexical integration with Swift. Integrating with Swift requires care and a proposal, which invites a great deal of scrutiny; that is the process we are engaging in now. There is no slippery slope where we wake up one day to find an assortment of foreign language literals delineated by a shrinking pool of single characters. It's an uphill climb every step of the way.

Better library extension is how we achieve this. We have library-extensible literals, string interpolations, property wrappers, and result builders. I hope library-extensible parsing of clearly and unambiguously delineated foreign language snippets happens one day.
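
As a reminder of what "library-extensible literals" already looks like today, here is a minimal sketch. The `Glob` type and its `source` property are hypothetical names used only for illustration; they are not part of the pitch or of any existing library.

```swift
// Hypothetical library type (an assumption for illustration only).
// Conforming to ExpressibleByStringLiteral lets a library opt its own
// type into the string literal syntax the language already has.
struct Glob: ExpressibleByStringLiteral {
    let source: String

    init(stringLiteral value: String) {
        source = value
    }
}

// The literal is type-checked as a Glob because of the annotation.
let pattern: Glob = "*.swift"
```

The library picks the meaning of the literal, but the delimiters stay the language's own, which is the shape a more general escape hatch for foreign language snippets would presumably keep.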

I'm happy to engage further, though it's getting pretty far afield of this pitch.

The #URL("") syntax is interesting to discuss but probably beyond the scope of this pitch (unless it's part of an argument that shorthand literal syntax for regexes is unnecessary… but that discussion is probably something that can be had without fleshing out a more generalized/verbose alternative).

(It leads to various questions that definitely merit their own separate thread, e.g.: Is it a way of saying "an initializer, but all arguments must be literals"? Or is it shorthand for saying all arguments must be @const? Would it rely on some kind of generalized compile-time interpretation feature? Would it enforce some kind of "must be evaluable at compile time" rule? If so, would it be required for any compile-time-evaluable function or just optional, i.e. is it a proposed spelling for "this function must be evaluated at compile time"? How would the typing of the result be generalized to a language feature?)

This is all very well for the compiler, but not so much for editor syntax highlighting. Ideally the two wouldn't need different rules (editors might perhaps choose to simplify Swift's parsing rules for ease of implementation, but it's less OK to go the other way and require editors to do something the compiler doesn't have to do).

Would it also require special casing the diagnostic for unused results?

Yes, good point.

Yes, indeed, which is precisely why I think the idea intriguing even as I'd otherwise prefer a less verbose regex syntax.

Much as the core team has adopted @Sendable closures without generalizing the feature to arbitrary protocol conformances, while still spelling the feature with an eye towards how it could be generalizable, my point here is that if we are to lean towards a verbose regex literal syntax, we ought to really lean into it with a spelling that can be later generalized with all the interesting possibilities and questions you enumerate above.

Unfortunately this would indeed become a regex literal. This is a variant of the unapplied infix / operator case; I will update the pitch to cover it. However, unlike infix /, this cannot be disambiguated with parens. I think the best way of disambiguating would likely be to write it as a closure, e.g.:

foo(op: { $0 /¢*¢/ $1 })

The inability to disambiguate with parens also affects other infix operators that start with / and are followed by other operator characters, e.g. /^. It may be necessary to tweak the lexing rule to look through operator characters to reject a closing ).

Note it wouldn't be entirely like StaticString (or literals in general), as you also wouldn't be able to intermix any expressions between #regex(...) and the "..." argument. For example you wouldn't be able to write:

#regex(b ? "[abc]" : "[def]")

Or, if you were, it would lose out on editor support.
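
For contrast, here is a minimal sketch of what StaticString permits today: literals can sit inside an ordinary expression and still be type-checked as StaticString. The `describe` helper and the `useFirst` flag are hypothetical names used only for illustration.

```swift
// Hypothetical helper (an assumption for illustration only).
// It takes a StaticString, which is normally created from a string literal.
func describe(_ pattern: StaticString) -> String {
    pattern.description
}

let useFirst = true

// Both branches are string literals, so each one is inferred as a
// StaticString and the conditional expression as a whole type-checks.
let chosen = describe(useFirst ? "[abc]" : "[def]")
```

A #regex(...) form that needs to see the pattern text directly could not accept the conditional expression in the same way, which is the difference described above.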

Just to clarify, contextual information is required while parsing, specifically "are we parsing an expression?". This is necessary to avoid parsing a regex literal in the following cases:

infix operator /^/ // An operator, not a regex
func /^/ (lhs: Int, rhs: Int) -> Int { 0 } // Also an operator
let i = 0 /^/ 1 // A binary operator, not a regex

We originally tried to do this purely based on the previous token while lexing, but it was less robust. In any case, this is all strictly contained within the parser; no semantic analysis is required.

It's perhaps important to note that if the operator were implemented as func /¢*¢/ <T>(lhs: T, rhs: T) { }, and if foo were also generic over op, this reformulation would not compile because T would not be deducible.

Could you elaborate? As far as I'm aware you would face the same issue for an unapplied reference, e.g. both of these fail to compile:

infix operator /^/
func /^/ <T>(lhs: T, rhs: T) {}

func foo<T>(_ fn: (T, T) -> Void) {}
foo(/^/)
// error: Generic parameter 'T' could not be inferred

foo({ $0 /^/ $1 })
// error: Unable to infer type of a closure parameter '$0' in the current context
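
For completeness, a minimal sketch of how the closure form can be made to type-check: annotating the closure parameters gives the compiler a concrete T. This variant of foo returns the name of the inferred type purely so the result is observable; that return value is an assumption added for illustration, not something from the pitch. It also assumes /^/ with whitespace on both sides is lexed as a binary operator, as described earlier in the thread.

```swift
infix operator /^/

// Generic operator, as in the example above.
func /^/ <T>(lhs: T, rhs: T) {}

// Variant of foo that reports which T was inferred (an assumption for
// illustration; the original foo returns nothing).
func foo<T>(_ fn: (T, T) -> Void) -> String {
    "\(T.self)"
}

// Explicit parameter types pin T to Int, so inference succeeds.
let inferred = foo({ (a: Int, b: Int) in a /^/ b })
```

The unapplied reference foo(/^/) has nowhere to carry such an annotation, so it would need a contextual type supplied some other way.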

Sorry, I'm still in the _openExistential headspace where the inability for a closure to carry generic parameters is an issue.

+1 from me. I like it a lot, and I do think the /…/ spelling is worth fighting for, despite the edge cases.
