[Pitch #2] Regex Literals

Good catch! That is correct, I've just fixed it.

Yes, that is right. To be clear, the example is mainly drawing a distinction between string literals, where raw syntax is useful for passing backslash sequences directly to an underlying consumer such as NSRegularExpression, and regex literals, where that is unnecessary. I've edited it to clarify.
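To make the string-literal half of that distinction concrete, here is a minimal sketch (the helper name `digitRuns` is made up for illustration) of a raw string carrying a backslash sequence through to NSRegularExpression unchanged:

```swift
import Foundation

// Hypothetical helper: returns every run of digits found in `text`.
func digitRuns(in text: String) throws -> [String] {
    // Raw string: the backslash reaches NSRegularExpression untouched.
    // The non-raw spelling of the same pattern would be "\\d+".
    let regex = try NSRegularExpression(pattern: #"\d+"#)
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, range: whole).compactMap { result in
        // Convert each NSRange back into a Swift string range.
        Range(result.range, in: text).map { String(text[$0]) }
    }
}
```

In a regex literal the backslash is already part of the regex syntax, so no raw-string layer is needed.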

The problem with #regex("...") is that it looks like a string literal argument to a magic literal, when in fact the quotes are part of the delimiter itself. For example, you wouldn't be able to do:

let pattern = "[abc]+"
let regex = #regex(pattern)

which would likely be unexpected.
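For a pattern held in a variable, the run-time initializer is the natural route. A minimal sketch, assuming the `Regex(_:)` spelling (the pitch-era spelling was `Regex(compiling:)`) and a made-up helper name:

```swift
// Hypothetical helper: builds a regex from a string at run time and
// reports whether it matches the whole input.
func matchesWhole(_ input: String, pattern: String) throws -> Bool {
    // Parsed at run time; throws if the pattern is not valid regex syntax.
    let regex: Regex<AnyRegexOutput> = try Regex(pattern)
    return input.wholeMatch(of: regex) != nil
}
```

The trade-off is that the pattern is only validated when this code runs, and the captures are not reflected in the static type.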

After reading the pitch and feedback, I suggest that the bare syntax be moved to the "Future Directions" section. There could still be an experimental compiler flag (e.g. -enable-experimental-bare-regex-syntax) to indicate that this isn't a permanent language dialect. A future proposal could then try to add the bare syntax to Swift 6.

Maybe this is an unpopular argument, but if the Regex DSL doesn't (or can't) support named captures, I think regex literals should not support them either. I believe regex literals should have no attractive features other than their brevity. If regex literals have powerful features not found in the DSL, developers will end up choosing literals. But regex literals are a source of bugs because of their poor readability. Regex literals should be a way to write a quick script when developers want to try out an implementation.
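For context, this is roughly what a named capture looks like in a literal. It is only a sketch, assuming the `#/.../#` syntax and labeled-tuple output; `parseYearMonth` is a made-up helper:

```swift
// Hypothetical helper: parses "YYYY-MM" using named captures.
func parseYearMonth(_ text: String) -> (year: Int, month: Int)? {
    // The capture names become tuple labels on the match output.
    let regex = #/(?<year>\d{4})-(?<month>\d{2})/#
    guard let match = text.wholeMatch(of: regex),
          let year = Int(match.output.year),
          let month = Int(match.output.month) else { return nil }
    return (year, month)
}
```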

Considering the behavior of StaticString, I don't think it's all that unnatural.

General remarks about bare /.../:

I guess the reason some people support /.../ is that "we have seen it in other languages".
However, we have to remember that Swift differs from other languages in many ways.


First (as mentioned repeatedly in this thread), we can define prefix/infix/postfix operators containing /.
The authors seem to think it is enough to change Swift's syntax rules, but it has come to light that a certain number of projects will break.


Second, regex in Swift may differ from regex in other languages.
Although this is out of scope for this pitch, it is still related.

Such a feature would confuse some folks, especially those coming from other languages.
Let me quote my opinion from the pitch #1 thread:


Lastly, Swift has a sublime philosophy (I hope).
I agree that /.../ is simple and easy to write.
However, being simple is not enough to be good in Swift.
I want to quote Mr. Lattner's remark:


/.../ will certainly break Swift.
/.../ is not Swift's.
Does Swift have to borrow the syntax from others to break itself?
Will we get more benefit from /.../ than loss from it?
Think different.

It would be nice to match expressions within a switch case, but I'm concerned about how it would perform.

As an example, I believe something like this would be slow, since it would have to compile the expression on every use of the switch:

switch userInput {
case try! Regex(compiling: #"[aeiou]+"#):
    return "All vowels here"
    
default:
    return "Not all vowels"
}

However, I'm hopeful a regex literal in the same position would perform well:

switch userInput {
case /[aeiou]+/:
    return "All vowels here"
    
default:
    return "Not all vowels"
} 

Any thoughts on this use case?

Extra: The pattern matching operator driving the switch

It would be great if this were supported out-of-the-box in the Standard Library, but anyone can try out the first example by defining these operators:

func ~=<Output>(a: Regex<Output>, b: String) -> Bool {
    guard let _ = try? a.matchWhole(b) else { return false }
    return true
}

func ~=<Output>(a: Regex<Output>, b: Substring) -> Bool {
    guard let _ = try? a.matchWhole(b) else { return false }
    return true
}
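Until literals are available in that position, one way to sidestep the recompilation worry is to build the regex once and reuse it. A rough sketch, assuming the `Regex(_:)` run-time initializer; `VowelClassifier` is a made-up type, and it uses a `where` clause rather than the `~=` operators above:

```swift
// Hypothetical type: compiles the pattern once, then reuses it per call.
struct VowelClassifier {
    private let vowels: Regex<AnyRegexOutput>

    init() throws {
        // Parsed a single time here instead of on every pass through the switch.
        vowels = try Regex("[aeiou]+")
    }

    func describe(_ input: String) -> String {
        switch input {
        case _ where input.wholeMatch(of: vowels) != nil:
            return "All vowels here"
        default:
            return "Not all vowels"
        }
    }
}
```

Whether a literal in a `case` would actually be cheaper than this depends on how the compiler and runtime handle literals, which I can't speak to.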

Is this scalable? What happens the next time Swift wants to co-opt a popular operator symbol for a built-in language feature?

I feel that the division between operators and quote marks should remain distinct. Not only does it prevent ill will with the community, but it simplifies understanding of the language.

As for the choice of / considered on its own: I feel that it is a mistake. I learned Perl over 20 years ago, and one of its best features is the ability to use a custom delimiter for regular expressions. Perhaps it's unique to what Perl is often used for, but I found that many times my expressions included path manipulation. Using a delimiter that allowed unescaped forward slashes was so common that the standard syntax might as well not have existed.

I don't find value in harming the Swift ecosystem, even slightly, for a feature no one is asking for.

This was a concern of mine and one of the reasons I was a proponent of the re'...' alternative. It has a clear extension to raw(ish) and multi-line modes as well as establishing a convention for other kinds of "foreign program literals". That is, these are not data literals like numbers or strings, they're algorithm literals with richer structure and should avoid further conflating string literal syntax, hence the '. This distinction is most apparent when the contents of a literal affect the type, as with regex captures. This would extend to, say, a sql'...' or a doc'...', uint8'...', etc.

That being said, regex literals are unique in their prevalence as fragments passed directly to API or as components of a result builder. I don't think that, e.g., SQL fragments would be used in this fashion.

Nothing in this proposal precludes a scalable approach to foreign language literals using #lang(...) or lang'...'. If that happened, regex could clearly participate as well. Even in an alternate reality where we had a formal concept of foreign language literals with a convention, there's still value to a dedicated regex literal. The alternate reality would just shift priorities around (which is inherent to alternate reality scenarios).

There is a significant division. Backslash means something completely different in a regex than a string literal and the contents of the regex must be parsed in order to determine the type. These really are not data literals, except under a pedantically von Neumann view of computing.

Not sure about how a division prevents ill-will, unless you mean the language-mode-gated source breaks proposed. I definitely view breaking TCA as the biggest downside to this proposal.

This is why #/.../# is being proposed, which does not require escaping interior slashes and is available immediately without a language mode check. This is, alongside the multi-line behavior of #/.../#, what got me (somewhat reluctantly) off the re'...' train.
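As a small illustration of the path case, here is a sketch assuming the `#/.../#` syntax as pitched; `middleComponent` is a made-up helper:

```swift
// Hypothetical helper: extracts the component between "usr/" and "/bin".
func middleComponent(of path: String) -> Substring? {
    // Interior slashes need no escaping inside #/.../#.
    let regex = #/usr/(\w+)/bin/#
    return path.firstMatch(of: regex)?.output.1
}
```

With bare /.../ the same pattern would need each interior slash written as `\/`.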

I appreciate your perspective regarding harm to the ecosystem, I really do. However, the last bit about there being "no one" asking for this feature is unfounded and trivially falsifiable.

The discussion of scalability and of foreign language and algorithm literals makes me wonder if we are not missing an opportunity for something a bit more general here. It strikes me that in the larger sense, with regexes, we are embedding a non-Swift programming language inside of Swift and looking for a syntax that escapes into that language in a way that allows interoperability with the hosting language and its tool support.

It is easy for me to imagine other small special purpose languages that make sense in any number of fields (AMPL would be my own choice for one). The notion of consuming more and more custom operator characters as we find interesting extensions like these seems wrong. @scanon's comment above about #regex(...) makes me wish that extensions like this could be a normal feature of the language.

I'd like to make sure we avoid the situation Haskell is in with its massive number of language extensions, but I would like to be able to extend in this manner.

Two distinct (potentially provocative, or alternatively very silly) thoughts:


First, regarding #regex("...") syntax—

I've wholeheartedly agreed with @scanon above that a more succinct syntax for regex literals is ideal. However, if we're going to lean in the verbose direction, I'd much rather that we lean into it all the way:

// Literal, with all the build-time validation and strong typing goodness:
let x = #Regex("[abc]+")

// Not literal, but validated at runtime
// (see proposal review re dropping the `compiling` label):
let y = Regex("[abc]+")

This would be generalizable to a variety of existing types when the build-time evaluation facilities permit (see other proposal about @const and its future directions). I'm thinking of URL, for example:

let z = URL("http://example.com")
let w = #URL("http://example.com") // Not possible (yet!), regardless of syntax.

Yes, this would imply that we should support both multiline arguments and what @hamishknight says that one wouldn't be able to do:

let a = #Regex("""
    [abc]+
    """)

let pattern = """
    [abc]+
    """
let b = #Regex(pattern)

Second, regarding /.../ versus #/.../# versus alternatives in that vein, I haven't seen the following alternative mentioned—

In Perl, strings can be delimited by ', ", or custom delimiters (yes, with differences in which delimiters allow for interpolation inside), while in Swift we only support the double quotation marks. So...why not use double slashes as Swift's regex delimiter?

let c = //[abc]+//

Won't it be ambiguous with comments? I'm inspired by the approach taken in certain parts of this proposal where "[t]o avoid parsing confusion, [...] a literal will not be parsed if a closing delimiter is not present." I think we could adopt a similar approach to make double slashes work as delimiters: to avoid parsing confusion, parse as a regex literal only if a closing delimiter is present on the same line.

It is true that this would break some commented-out code that itself has inline comments, but in the future version of Swift where it's enabled such code could be migrated to use outer /* ... */-style comments. Certainly less destructive than making existing operators illegal.

I also know there are some file headers styled // ====== //, but as it happens, nothing is harmed by parsing that as a regex literal and then just dropping it...
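To make the same-line rule concrete, here is a toy sketch of the decision. It is not real lexer code; the names are made up, and it ignores string literals, block comments, and the multiline form:

```swift
import Foundation

// Hypothetical result of looking at the first "//" on a line.
enum DoubleSlashToken: Equatable {
    case regexLiteral(String)   // a closing "//" was found on the same line
    case comment(String)        // no closing "//": treat as an ordinary comment
    case none                   // the line contains no "//" at all
}

// Toy classifier for the proposed rule: parse as a regex literal only if
// a closing delimiter is present on the same line.
func classifyDoubleSlash(in line: String) -> DoubleSlashToken {
    guard let open = line.range(of: "//") else { return .none }
    let rest = line[open.upperBound...]
    if let close = rest.range(of: "//") {
        return .regexLiteral(String(rest[..<close.lowerBound]))
    }
    return .comment(String(rest))
}
```

Under this toy rule `// ====== //` does come out as a regex literal, `////` as the empty regex, and `/// doc` as a comment.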

Multiline regex literals, then, would be delimited by //////, which ought to be similarly capable of disambiguation versus the empty regex //// (just as """ is from "") as well as /// doc comments.

I'm sure I'm missing something obvious, but mulling over this for a bit, it seems workable from here.

#URL is a very nice idea that would seem to align well with current work.

Note that nothing proposed precludes a general solution, so we're not missing anything. The work involved in making regex work is necessary to... making regex work, so it's not wasted even if we had a general language escape hatch to leverage.

This is formally outside the scope of the pitch review, but we're careful to integrate the regex parser with the compiler in a library-driven fashion that makes it easier to migrate more compiler code into modular libraries. This also helps carve a path for more general purpose foreign language support in the future. My totally-speculative and not even close to a formalized plan dream would be to open up foreign language snippet support to 3rd party libraries, similarly to how we open up literals, custom string interpolation, property wrappers, and result builders to libraries. That requires the compiler integration mechanisms we're developing with regex, and regex can piggy back on whatever escape hatch emerges in the future. They wouldn't declare their own single-character delimiters, they'd use the more general escape hatch.

My familiarity with AMPL is limited, but I would not expect AMPL constraint fragments to appear directly in API calls and individual lines of a result builder the way regexes do. I don't know why a single character delimiter would be appealing for AMPL over #ampl(...) or ampl'...'. This proposal does not forbid such foreign language excerpts nor does it require foreign language excerpts to use single-character delimiters if they're ever added.

Adding a custom parser for a foreign language is a sizable amount of work, but independent from its lexical integration with Swift. Integrating with Swift requires care and a proposal which produces a great deal of scrutiny, which we are engaging in. There is no slippery slope where we wake up one day to find an assortment of foreign language literals delineated by a shrinking pool of single characters. It's an up-hill climb every step of the way.

Better library extension is how we achieve this. We have library-extensible literals, string interpolations, property wrappers, and result builders. I hope library-extensible parsing of clearly and unambiguously delineated foreign language snippets happens one day.

I'm happy to engage further, though it's getting pretty far afield of this pitch.

The #URL("") syntax is interesting to discuss but probably beyond the scope of this pitch (unless it’s part of an argument that shorthand literal syntax for regexes is unnecessary… but that discussion is probably something that can be had without fleshing out a more generalized/verbose alternative)

(It leads to various questions that definitely merit their own separate thread, e.g. Is it a way of saying “an initializer, but all arguments must be literals”? Or is it shorthand for saying all arguments must be @const? Would it rely on some kind of generalized compile-time interpretation feature? Would it enforce some kind of “must be evaluable at compile time” rule? If so, would it be required for any compile-time-evaluable function or just optional, i.e. is it a proposed spelling for “this function must be evaluated at compile time”? How would the typing of the result be generalized to a language feature?)

This is all very well for the compiler, but not so much for editor syntax highlighting. Ideally the two wouldn’t need different rules (editors might perhaps choose to simplify Swift’s parsing rules for ease of implementation purposes, but it’s less OK to go the other way and require editors to do something the compiler doesn’t have to do).

Would it also require special casing the diagnostic for unused results?

Yes, good point.

Yes, indeed, which is precisely why I think the idea intriguing even as I’d otherwise prefer a less verbose regex syntax.

Much as the core team has adopted @Sendable closures without generalizing the feature to arbitrary protocol conformances, while still spelling the feature with an eye towards how it could be generalizable, my point here is that if we are to lean towards a verbose regex literal syntax, we ought to really lean into it with a spelling that can be later generalized with all the interesting possibilities and questions you enumerate above.

Unfortunately this would indeed become a regex literal. This is a variant of the unapplied infix / operator case; I will update the pitch to cover it. However, unlike infix /, this cannot be disambiguated with parens. I think the best way of disambiguating would likely be to write it as a closure, e.g.:

foo(op: { $0 /¢*¢/ $1 })

The inability to disambiguate with parens also affects other infix operators that start with / and are followed by other operator characters, e.g /^. It may be necessary to tweak the lexing rule to look through operator characters to reject a closing ).

Note it wouldn't be entirely like StaticString (or literals in general), as you also wouldn't be able to intermix any expressions between #regex(...) and the "..." argument. For example you wouldn't be able to write:

#regex(b ? "[abc]" : "[def]")

Or, if you were, it would lose out on editor support.
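The choice can still be made between two complete literals rather than inside one. A minimal sketch, assuming the `#/.../#` syntax; `pickRegex` is a made-up helper:

```swift
// Hypothetical helper: each branch is a complete literal, so both are
// checked at build time and share the type Regex<Substring>.
func pickRegex(_ useFirst: Bool) -> Regex<Substring> {
    useFirst ? #/[abc]/# : #/[def]/#
}
```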

Just to clarify, contextual information is required while parsing, specifically "are we parsing an expression?". This is necessary to avoid parsing a regex literal in the following cases:

infix operator /^/ // An operator, not a regex
func /^/ (lhs: Int, rhs: Int) -> Int { 0 } // Also an operator
let i = 0 /^/ 1 // A binary operator, not a regex

We originally tried to do this purely based on the previous token while lexing, but it was less robust. However in any case, this is all strictly contained within the parser, no semantic analysis is required.

It’s perhaps important to note that if the operator were implemented as func /¢*¢/<T>(lhs: T, rhs: T) { }, and if foo were also generic over op, this reformulation would not compile because T would not be deducible.

Could you elaborate? As far as I'm aware you would face the same issue for an unapplied reference, e.g both of these fail to compile:

infix operator /^/
func /^/ <T>(lhs: T, rhs: T) {}

func foo<T>(_ fn: (T, T) -> Void) {}
foo(/^/)
// error: Generic parameter 'T' could not be inferred

foo({ $0 /^/ $1 })
// error: Unable to infer type of a closure parameter '$0' in the current context
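One way past both errors, offered as a sketch rather than anything the pitch prescribes, is to annotate the closure parameters so that T is fixed. The operator and foo below are variations on the snippet above that return values, purely so the result can be observed:

```swift
infix operator /^/

// Variation on the operator above: returns a Bool so the call is observable.
func /^/ <T: Equatable>(lhs: T, rhs: T) -> Bool { lhs == rhs }

// Variation on foo: reports which T was inferred.
func foo<T>(_ fn: (T, T) -> Bool) -> String {
    String(describing: T.self)
}

// Explicit parameter types give the compiler enough to infer T == Int.
let inferred = foo({ (a: Int, b: Int) in a /^/ b })
```

The closure form also keeps /^/ in infix position, so it reads as a binary operator.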

Sorry, I’m still in the _openExistential headspace where the inability for a closure to carry generic parameters is an issue.