[Pitch #2] Regex Literals

They aren’t, though. Escape sequences are still parsed within regex literals.

This would be highly undesirable, because it would change the type of a resulting regex depending on whether or not the pattern string was visible to the compiler. "Simple" changes such as moving a pattern into a separate module, or assigning it to a variable that the compiler cannot prove remains unchanged, would result in captures changing from typed to untyped, which is a distinctly unpleasant and unpredictable user model. Literals don't just unlock compile-time optimization; that's not even the most important thing that they unlock. Typed captures and other forms of checking are the real payoff.

(also, "comptime" doesn't exist in the language yet).
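
To make the typed-versus-untyped distinction concrete, here is a minimal sketch. It assumes the API as it later shipped in Swift 5.7 (the `#/.../#` literal, `Regex(_:)`, and `wholeMatch(of:)`), which differs from the spellings used elsewhere in this thread. The function names are made up for illustration.

```swift
// Requires Swift 5.7 or later (macOS 13 / iOS 16 or later on Apple platforms).

// Literal: the compiler parses the pattern, so the regex has the type
// Regex<(Substring, Substring, Substring)> and captures are tuple elements.
func typedBounds(_ text: String) -> (Substring, Substring)? {
    let typed = #/(\d+)-(\d+)/#
    guard let match = text.wholeMatch(of: typed) else { return nil }
    return (match.1, match.2)
}

// Run-time string: the compiler cannot see the pattern, so the output is the
// type-erased AnyRegexOutput and each capture is an optional looked up by position.
func untypedBounds(_ text: String, pattern: String) throws -> (Substring, Substring)? {
    let untyped = try Regex(pattern)
    guard let match = text.wholeMatch(of: untyped),
          let low = match.output[1].substring,
          let high = match.output[2].substring else { return nil }
    return (low, high)
}
```

Both functions return the same pair for "10-20", but only the first has its capture count and types checked at compile time.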

2 Likes

Using bare /.../ as delimiters allows for far too much ambiguity for my comfort. The syntax for prefix/postfix/infix operators in Swift already has a bunch of awkward edge cases around spacing, and the proposed syntax compounds it. Consider one example from the pitch text:

let x = arr.reduce(1, /) / 5

It was mentioned that this would be fine because a regex literal that starts with a ) wouldn't be valid so it wouldn't be parsed as one. That's fine, but if we change the function signature slightly, we end up with it being lexed entirely differently:

let x = arr.reduceButAlsoSomethingElse(1, /, foo) / 5

And now we have another edge case that requires the user to wrap the operator function reference in (...), but only sometimes. We already have some of those in the language (e.g., let x: (Int, Int) -> Int = + is impossible, you have to write ... = (+)), and we ought to avoid adding more unless we're going to just say "source compatibility break: all unbound operator references must be wrapped in parens". But I don't think that kind of source compatibility break would pass muster, so why should the others described here?
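
For reference, the existing spellings mentioned above can be sketched as follows. This only shows what compiles today; the variable names are illustrative, and it makes no claim about how the pitched lexing rules would treat each line.

```swift
let values = [10, 5]

// A bare operator reference is accepted in argument position today.
let viaBareOperator = values.reduce(1000, /) / 5

// The parenthesized spelling is also accepted in argument position.
let viaParenthesized = values.reduce(1000, (/)) / 5

// On the right-hand side of `=`, only the parenthesized form compiles.
let add: (Int, Int) -> Int = (+)
// let add2: (Int, Int) -> Int = +   // does not compile

print(viaBareOperator, viaParenthesized, add(2, 3))
```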

If Swift's syntax was being designed from the ground up, maybe it could be possible to fit /.../ into the syntax while not sacrificing other features. But we have an already-existing language with years of evolution, and IMO the parsing challenges and special cases described by the pitch are proof that bare /.../ is not the right solution for Swift today. There is no real harm in not using /.../ other than that they don't look exactly like regular expressions in other languages, but there is active harm in arbitrarily prohibiting entire classes of custom operators in a language that claims to support those.

The un-pitched-but-mentioned-by-others-in-this-thread alternative of "just use #/.../# everywhere" sounds ideal to me. It's unambiguous, barely more intrusive, and doesn't cause harm to other parts of the language grammar or its usability.

23 Likes

They don't "own it", but the point is that it would be a pretty big source breaking change that should really be avoided unless completely necessary. Even if those libraries didn't exist, it would still be a source breaking change.

I will point out that these aren't the only examples of libraries that break due to the /.../ syntax. As @mishal_shah pointed out, this syntax breaks 16 projects out of the 2968 in the Swift Package Index. That certainly isn't insignificant. I believe that following the precedent set by other languages is no reason to introduce source breaking changes, especially since two popular packages would get broken along with all their clients.

2 Likes

In the Escaping of backslashes section, I was confused by the following example:

// Matches '\' <word char> <whitespace>* '=' <whitespace>* <digit>+
let regex = try NSRegularExpression(pattern: "\\\\w\\s*=\\s*\\d+", options: [])

I'd expect the bare string literal to start with six backslashes:

"\\\\\\w\\s*=\\s*\\d+"

I'd expect the extended string and regex literals to be identical within their delimiters:

#"\\\w\s*=\s*\d+"#
#/\\\w\s*=\s*\d+/#
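
As a quick check of the string-literal half of this, the sketch below compares the six-backslash form with the extended (raw) string and feeds the result to NSRegularExpression. The helper name `matches` is made up for illustration.

```swift
import Foundation

// In a regular string literal every backslash must be doubled, so three
// literal backslashes in the pattern take six in the source.
let escaped = "\\\\\\w\\s*=\\s*\\d+"

// In an extended string literal the backslashes reach the regex engine as written.
let raw = #"\\\w\s*=\s*\d+"#

// Hypothetical helper: true if `pattern` matches anywhere in `text`.
func matches(_ text: String, pattern: String) throws -> Bool {
    let regex = try NSRegularExpression(pattern: pattern)
    let range = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, range: range) != nil
}

print(escaped == raw)
```

Both literals produce the same pattern text: an escaped backslash, then \w, then the rest.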

Mishal is not around today but he sent me the logs from the failures, and it might help (without judgement on whether 15 is a high or low number) to break those projects impacted by prefix operator / down a little further:

  • 5 are packages that are part of the composable architecture suite (including CasePaths itself)
  • 5 are users of CasePaths
  • 1 is something that looks like CasePaths
  • 1 is a parser written by @rxwei (sadly not an author of this particular proposal, for irony purposes)
  • 3 are part of a suite that uses pre/postfix / to simulate regular expression syntax

Incidentally, I wanted to give a shout out to @daveverwer and @finestructure for creating such a fantastic resource in SwiftPackageIndex.com that allows for this kind of analysis (as well as all the community members open-sourcing their packages).

20 Likes

It seems to me that #regex() would be an interesting syntax, with the possibility of being reused going forward for other data types. In my shallow understanding, the drawbacks of this syntax are parentheses balancing and inconsistency with #literal(). Would #regex("") be a possible solution?

There are no new delimiters beyond the ones we already use, balancing is handled within the scope of the "string", and a "string" is valid Swift syntax.

Even #regex(#""#) and

#regex("""

""")

could be available depending on the need for raw or multi-line regex.

3 Likes

Good catch! That is correct, I've just fixed it.

Yes, that is right. To be clear, the example is mainly drawing a distinction between string literals, where raw syntax is useful for passing backslash sequences directly to an underlying consumer such as NSRegularExpression, and regex literals, where that is unnecessary. I've edited it to clarify.

1 Like

The problem with #regex("...") is that it looks like a string literal argument to a magic literal, when in fact the quotes are part of the delimiter itself. For example, you wouldn't be able to do:

let pattern = "[abc]+"
let regex = #regex(pattern)

which would likely be unexpected.

4 Likes

After reading the pitch and feedback, I suggest that the bare syntax be moved to the "Future Directions" section. There could still be an experimental compiler flag (e.g. -enable-experimental-bare-regex-syntax) to indicate that this isn't a permanent language dialect. A future proposal could then try to add the bare syntax to Swift 6.

Maybe this is an unpopular argument, but if the Regex DSL doesn't (can't) support named captures, I think regex literals should not support them either. I believe that regex literals should not have attractive features other than their brevity. If regex literals have powerful features not found in the DSL, developers will tend to choose literals. But regex literals are a source of bugs because of their poor readability. Regex literals should be something like a way to write a quick script when developers want to try out an implementation.

Considering the behavior of StaticString, I don't think it's all that unnatural.
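
The StaticString comparison can be sketched like this: a parameter of that type accepts a literal but rejects a String variable holding the same text. The function name is made up for illustration.

```swift
// Hypothetical function whose argument must be a literal known at compile time.
func codeUnitCount(ofLiteral pattern: StaticString) -> Int {
    pattern.utf8CodeUnitCount
}

// A string literal converts to StaticString implicitly.
let fromLiteral = codeUnitCount(ofLiteral: "[abc]+")

// A variable of type String does not.
let pattern = "[abc]+"
// codeUnitCount(ofLiteral: pattern)   // does not compile: String is not StaticString

print(fromLiteral, pattern)
```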

4 Likes

General remarks about bare /.../:

I guess the reason some people support /.../ is that "we have seen it in other languages".
However, we have to remember that Swift is different from other languages in many senses.


First (as mentioned repeatedly in this thread), we can define prefix/infix/postfix operators containing /.
The authors seem to think it is enough to change Swift's syntax rules, but it has come to light that a certain number of projects would be broken.


Second, regex in Swift may differ from regex in other languages.
Although this is out of scope for this pitch, it is still related.

Such a feature would confuse some folks, especially those coming from other languages.
Let me quote my opinion from pitch#1 thread:


Lastly, Swift has a sublime philosophy (I hope).
I agree that /.../ is simple and easy to write.
However, being simple is not enough to be good in Swift.
I want to quote Mr. Lattner's utterance:


/.../ will certainly break Swift.
/.../ is not Swift's.
Does Swift have to borrow the syntax from others to break itself?
Will we get more benefit from /.../ than loss from it?
Think different.

5 Likes

It would be nice to match expressions within a switch case, but I'm concerned about how it would perform.

As an example, I believe something like this would be slow, since it would have to compile the expression on every use of the switch:

switch userInput {
case try! Regex(compiling: #"[aeiou]+"#):
    return "All vowels here"
    
default:
    return "Not all vowels"
}

However I'm hopeful a RegexLiteral in the same position would perform well:

switch userInput {
case /[aeiou]+/:
    return "All vowels here"
    
default:
    return "Not all vowels"
} 

Any thoughts on this use case?

Extra: The pattern matching operator driving the switch

It would be great if this were supported out-of-the-box in the Standard Library, but anyone can try out the first example by defining these operators:

func ~=<Output>(a: Regex<Output>, b: String) -> Bool {
    guard let _ = try? a.matchWhole(b) else { return false }
    return true
}

func ~=<Output>(a: Regex<Output>, b: Substring) -> Bool {
    guard let _ = try? a.matchWhole(b) else { return false }
    return true
}
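
One way to sidestep the repeated-construction concern, whatever the literal syntax ends up being, is to build the regex once and reuse it. This is a minimal sketch that assumes the API as it later shipped in Swift 5.7 (`Regex(_:)` and `wholeMatch(in:)`) rather than the `Regex(compiling:)` and `matchWhole` spellings above. The type name `VowelClassifier` is made up for illustration.

```swift
// Requires Swift 5.7 or later (macOS 13 / iOS 16 or later on Apple platforms).

struct VowelClassifier {
    // Built once, when the classifier is created, instead of on every call.
    private let vowels: Regex<AnyRegexOutput>

    init() throws {
        vowels = try Regex("[aeiou]+")
    }

    func classify(_ input: String) -> String {
        // wholeMatch(in:) returns nil unless the entire input matches.
        let isAllVowels = (try? vowels.wholeMatch(in: input)) != nil
        return isAllVowels ? "All vowels here" : "Not all vowels"
    }
}

let classifier = try VowelClassifier()
print(classifier.classify("aeiou"))
print(classifier.classify("swift"))
```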

1 Like

Is this scalable? What happens the next time Swift wants to co-opt a popular operator symbol for a built-in language feature?

I feel that the division between operators and quote marks should remain distinct. Not only does it prevent ill will with the community, but it simplifies understanding of the language.

As for the choice of / considered on its own: I feel that it is a mistake. I learned Perl over 20 years ago, and one of its best features is the ability to use a custom delimiter for regular expressions. Perhaps it's unique to what Perl is often used for, but I found that many times my expressions included path manipulation. Using a delimiter that allowed unescaped forward slashes was so common that the standard syntax might as well not have existed.

I don't find value in harming the Swift ecosystem, even slightly, for a feature no one is asking for.

8 Likes

This was a concern of mine and one of the reasons I was a proponent of the re'...' alternative. It has a clear extension to raw(ish) and multi-line modes as well as establishing a convention for other kinds of "foreign program literals". That is, these are not data literals like numbers or strings, they're algorithm literals with richer structure and should avoid further conflating string literal syntax, hence the '. This distinction is most apparent when the contents of a literal affects the type, as with regex captures. This would extend to, say, a sql'...' or a doc'...', uint8'...', etc.

That being said, regex literals are unique in their prevalence as fragments passed directly to API or as components of a result builder. I don't think that, e.g., SQL fragments would be used in this fashion.

Nothing in this proposal precludes a scalable approach to foreign language literals using #lang(...) or lang'...'. If that happened, regex could clearly participate as well. Even in an alternate reality where we had a formal concept of foreign language literals with a convention, there's still value to a dedicated regex literal. The alternate reality would just shift priorities around (which is inherent to alternate reality scenarios).

There is a significant division. Backslash means something completely different in a regex than a string literal and the contents of the regex must be parsed in order to determine the type. These really are not data literals, except under a pedantically von Neumann view of computing.

Not sure about how a division prevents ill-will, unless you mean the language-mode-gated source breaks proposed. I definitely view breaking TCA as the biggest downside to this proposal.

This is why the #/.../# is being proposed, which does not require escaping interior slashes and is available immediately without a language mode check. This is, alongside the multi-line behavior of #/.../#, what got me (somewhat reluctantly) off the re'...' train.
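
A small sketch of the path-matching case with the extended delimiter, assuming the `#/.../#` literal and `firstMatch(of:)` as they later shipped in Swift 5.7. The function name is made up for illustration.

```swift
// Requires Swift 5.7 or later (macOS 13 / iOS 16 or later on Apple platforms).

// Hypothetical helper: returns the path component between "usr/" and "/bin".
func middleComponent(of path: String) -> Substring? {
    // The literal ends only at `/#`, so the interior slashes need no escaping.
    let regex = #/usr/(\w+)/bin/#
    return path.firstMatch(of: regex)?.1
}

print(middleComponent(of: "/usr/local/bin") ?? "no match")
```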

I appreciate your perspective regarding harm to the ecosystem, I really do. However, the last bit about there being "no one" asking for this feature is unfounded and trivially falsifiable.

8 Likes

The discussion of scalability and of foreign language and algorithm literals makes me wonder if we are not missing an opportunity for something a bit more general here. It strikes me that in the larger sense, with regexes, we are embedding a non-Swift programming language inside of Swift and looking for a syntax that escapes into that language in a way that allows interoperability with the hosting language and its tool support.

It is easy for me to imagine other small special purpose languages that make sense in any number of fields (AMPL would be my own choice for one). The notion of consuming more and more custom operator characters as we find interesting extensions like these seems wrong. @scanon's comment above about #regex(...) makes me wish that extensions like this could be a normal feature of the language.

I'd like to make sure we avoid the situation Haskell is in with its massive number of language extensions, but I would like to be able to extend in this manner.

3 Likes

Two distinct (potentially provocative, or alternatively very silly) thoughts:


First, regarding #regex("...") syntax—

I've wholeheartedly agreed with @scanon above that a more succinct syntax for regex literals is ideal. However, if we're going to lean in the verbose direction, I'd much rather that we lean into it all the way:

// Literal, with all the build-time validation and strong typing goodness:
let x = #Regex("[abc]+")

// Not literal, but validated at runtime
// (see proposal review re dropping the `compiling` label):
let y = Regex("[abc]+")

This would be generalizable to a variety of existing types when the build-time evaluation facilities permit (see other proposal about @const and its future directions). I'm thinking of URL, for example:

let z = URL("http://example.com")
let w = #URL("http://example.com") // Not possible (yet!), regardless of syntax.

Yes, this would imply that we should support both multiline arguments and what @hamishknight says that one wouldn't be able to do:

let a = #Regex("""
    [abc]+
    """)

let pattern = """
    [abc]+
    """
let b = #Regex(pattern)

Second, regarding /.../ versus #/.../# versus alternatives in that vein, I haven't seen the following alternative mentioned—

In Perl, strings can be delimited by ', ", or custom delimiters (yes, with differences in which delimiters allow for interpolation inside), while in Swift we only support the double quotation marks. So...why not use double slashes as Swift's regex delimiter?

let c = //[abc]+//

Won't it be ambiguous with comments? I'm inspired by the approach taken in certain parts of this proposal where "[t]o avoid parsing confusion, [...] a literal will not be parsed if a closing delimiter is not present." I think we could adopt a similar approach to make double slashes work as delimiters: to avoid parsing confusion, parse as a regex literal only if a closing delimiter is present on the same line.

It is true that this would break some commented-out code that itself has inline comments, but in the future version of Swift where it's enabled such code could be migrated to use outer /* ... */-style comments. Certainly less destructive than making existing operators illegal.

I also know there are some file headers styled // ====== //, but as it happens, nothing is harmed by parsing that as a regex literal and then just dropping it...

Multiline regex literals, then, would be delimited by //////, which ought to be similarly capable of disambiguation versus the empty regex //// (just as """ is from "") as well as /// doc comments.

I'm sure I'm missing something obvious, but mulling over this for a bit, it seems workable from here.

2 Likes

#URL is a very nice idea that would seem to align well with current work.

2 Likes

Note that nothing proposed precludes a general solution, so we're not missing anything. The work involved in making regex work is necessary to... making regex work, so it's not wasted even if we had a general language escape hatch to leverage.

This is formally outside the scope of the pitch review, but we're careful to integrate the regex parser with the compiler in a library-driven fashion that makes it easier to migrate more compiler code into modular libraries. This also helps carve a path for more general purpose foreign language support in the future. My totally-speculative and not even close to a formalized plan dream would be to open up foreign language snippet support to 3rd party libraries, similarly to how we open up literals, custom string interpolation, property wrappers, and result builders to libraries. That requires the compiler integration mechanisms we're developing with regex, and regex can piggy back on whatever escape hatch emerges in the future. They wouldn't declare their own single-character delimiters, they'd use the more general escape hatch.

My familiarity with AMPL is limited, but I would not expect AMPL constraint fragments to appear directly in API calls and individual lines of a result builder the way regex are. I don't know why a single character delimiter would be appealing for AMPL over #ampl(...) or ampl'...'. This proposal does not forbid such foreign language excerpts nor does it require foreign language excerpts to use single-character delimiters if they're ever added.

Adding a custom parser for a foreign language is a sizable amount of work, but independent from its lexical integration with Swift. Integrating with Swift requires care and a proposal which produces a great deal of scrutiny, which we are engaging in. There is no slippery slope where we wake up one day to find an assortment of foreign language literals delineated by a shrinking pool of single characters. It's an up-hill climb every step of the way.

Better library extension is how we achieve this. We have library-extensible literals, string interpolations, property wrappers, and result builders. I hope library-extensible parsing of clearly and unambiguously delineated foreign language snippets happens one day.

I'm happy to engage further, though it's getting pretty far afield of this pitch.

4 Likes

The #URL("") syntax is interesting to discuss but probably beyond the scope of this pitch (unless it’s part of an argument that shorthand literal syntax for regexes is unnecessary… but that discussion is probably something that can be had without fleshing out a more generalized/verbose alternative).

(It leads to various questions that definitely merit their own separate thread, e.g.: Is it a way of saying “an initializer, but all arguments must be literals”? Or is it shorthand for saying all arguments must be @const? Would it rely on some kind of generalized compile-time interpretation feature? Would it enforce some kind of “must be evaluable at compile time” rule? If so, would it be required for any compile-time-evaluable function or just optional, i.e. is it a proposed spelling for “this function must be evaluated at compile time”? How would the typing of the result be generalized to a language feature?)

This is all very well for the compiler, but not so much for editor syntax highlighting. Ideally the two wouldn’t need different rules (editors might perhaps choose to simplify Swift’s parsing rules for ease of implementation purposes, but it’s less OK to go the other way and require editors to do something the compiler doesn’t have to do).