[Pitch #2] Regex Literals

Jon_Shier · April 15, 2022, 5:17pm

This is a pretty big assumption. I don't think a succinct delimiter is necessary at all and prefer something like #regex which is actually readable. But given the choice between / and #/, #/ seems rather obviously superior given it avoids edge cases.

And your point about the use of # needing to align with raw strings is rather easily explained by saying that regex literals are always raw strings. Since they have different rules from string literals anyway, that seems to make sense.

ksluder · April 15, 2022, 5:22pm

I think this is a very fair point, and it makes a lot of sense when considering the pitch from the perspective of most users who will encounter this feature.

ksluder · April 15, 2022, 5:23pm

They aren’t, though. Escape sequences are still parsed within regex literals.

scanon · April 15, 2022, 5:25pm

This would be highly undesirable, because it would change the type of a resulting regex depending on whether or not the pattern string was visible to the compiler. "Simple" changes such as moving a pattern into a separate module, or assigning it to a variable that the compiler cannot prove remains unchanged would result in captures changing from typed to untyped, which is a distinctly unpleasant and unpredictable user model. Literals don't just unlock compile-time optimization; that's not even the most important thing that they unlock--typed captures and other forms of checking are the real payoff.

(also, "comptime" doesn't exist in the language yet).
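A rough sketch of the typed-versus-untyped difference described above. It assumes the API spellings that later shipped in Swift 5.7 (the extended `#/.../#` literal, `Regex<AnyRegexOutput>(_:)`, and `firstMatch(of:)`), which are not taken from this pitch. `typedYear` and `untypedYear` are hypothetical helper names.

```swift
// Literal: the compiler sees the pattern, so the output is a typed tuple.
func typedYear(in input: String) -> Substring? {
    let literal = #/(\d+)-(\d+)/#   // output type: (Substring, Substring, Substring)
    guard let match = input.firstMatch(of: literal) else { return nil }
    let (_, year, _) = match.output
    return year
}

// Runtime string: the compiler cannot see the pattern, so captures are
// type-erased and have to be looked up by index and unwrapped.
func untypedYear(in input: String, pattern: String) -> Substring? {
    guard let regex = try? Regex<AnyRegexOutput>(pattern),
          let match = input.firstMatch(of: regex),
          match.output.count > 1 else { return nil }
    return match.output[1].substring
}
```

Both helpers extract the same text, but only the first one has its capture count and capture types checked at build time.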

allevato · April 15, 2022, 5:33pm

Using bare /.../ as delimiters allows for far too much ambiguity for my comfort. The syntax for prefix/postfix/infix operators in Swift already has a bunch of awkward edge cases around spacing, and the proposed syntax compounds it. Consider one example from the pitch text:

let x = arr.reduce(1, /) / 5

It was mentioned that this would be fine because a regex literal that starts with a ) wouldn't be valid so it wouldn't be parsed as one. That's fine, but if we change the function signature slightly, we end up with it being lexed entirely differently:

let x = arr.reduceButAlsoSomethingElse(1, /, foo) / 5

And now we have another edge case that requires the user to wrap the operator function reference in (...), but only sometimes. We already have some of those in the language (e.g., let x: (Int, Int) -> Int = + is impossible, you have to write ... = (+)), and we ought to avoid adding more unless we're going to just say "source compatibility break: all unbound operator references must be wrapped in parens". But I don't think that kind of source compatibility break would pass muster, so why should the others described here?
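For reference, a minimal sketch of the operator-reference forms mentioned above, as they behave when bare /.../ literals are not enabled. The names `arr`, `quotient`, and `add` are only illustrative.

```swift
let arr = [2, 5]

// An unbound operator reference passed as an argument needs no parentheses.
let quotient = arr.reduce(100, /)      // (100 / 2) / 5

// The shape of the pitch's example: an operator reference followed by an infix division.
let x = arr.reduce(100, /) / 5

// As an initializer value the operator has to be parenthesized;
// writing `= +` without the parentheses does not parse.
let add: (Int, Int) -> Int = (+)
```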

If Swift's syntax was being designed from the ground up, maybe it could be possible to fit /.../ into the syntax while not sacrificing other features. But we have an already-existing language with years of evolution, and IMO the parsing challenges and special cases described by the pitch are proof that bare /.../ is not the right solution for Swift today. There is no real harm in not using /.../ other than that they don't look exactly like regular expressions in other languages, but there is active harm in arbitrarily prohibiting entire classes of custom operators in a language that claims to support those.

The un-pitched-but-mentioned-by-others-in-this-thread alternative of "just use #/.../# everywhere" sounds ideal to me. It's unambiguous, barely more intrusive, and doesn't cause harm to other parts of the language grammar or its usability.

stackotter · April 15, 2022, 9:23pm

They don't "own it", but the point is that it would be a pretty big source breaking change that should really be avoided unless completely necessary. Even if those libraries didn't exist, it would still be a source breaking change.

I will point out that these aren't the only examples of libraries that break due to the /.../ syntax. As @mishal_shah pointed out, this syntax breaks 15 projects out of the 2968 in the Swift Package Index. That certainly isn't insignificant. I believe that following the precedent set by other languages is no reason to introduce source breaking changes, especially since two popular packages would get broken along with all their clients.

benrimmington · April 15, 2022, 9:48pm

In the Escaping of backslashes section, I was confused by the following example:

// Matches '\' <word char> <whitespace>* '=' <whitespace>* <digit>+
let regex = try NSRegularExpression(pattern: "\\\\w\\s*=\\s*\\d+", options: [])

I'd expect the bare string literal to start with six backslashes:

"\\\\\\w\\s*=\\s*\\d+"

I'd expect the extended string and regex literals to be identical within their delimiters:

#"\\\w\s*=\s*\d+"#
#/\\\w\s*=\s*\d+/#
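A small check of the backslash counts above, sketched with Foundation's NSRegularExpression. The `matches(_:in:)` helper and the sample input are hypothetical.

```swift
import Foundation

// Returns true if `pattern` matches anywhere in `input`.
func matches(_ pattern: String, in input: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return false
    }
    let range = NSRange(input.startIndex..., in: input)
    return regex.firstMatch(in: input, options: [], range: range) != nil
}

let fourBackslashes = "\\\\w\\s*=\\s*\\d+"     // regex sees: \\w\s*=\s*\d+  (backslash, then a literal 'w')
let sixBackslashes  = "\\\\\\w\\s*=\\s*\\d+"   // regex sees: \\\w\s*=\s*\d+ (backslash, then a word character)
let extended        = #"\\\w\s*=\s*\d+"#       // same characters as sixBackslashes

let input = #"\x = 42"#   // backslash, word character, '=', digits
```

With these definitions, `sixBackslashes` and `extended` are the same string and match `input`, while `fourBackslashes` only matches text that contains a literal `\w`.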

Ben_Cohen · April 15, 2022, 10:12pm

Mishal is not around today but he sent me the logs from the failures, and it might help (without judgement on whether 15 is a high or low number) to break those projects impacted by prefix operator / down a little further:

5 are packages that are part of the composable architecture suite (including CasePaths itself)
5 are users of CasePaths
1 is something that looks like CasePaths
1 is a parser written by @rxwei (sadly not an author of this particular proposal, for irony purposes)
3 are part of a suite that uses pre/postfix / to simulate regular expression syntax
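For readers who have not seen one, a sketch of the kind of declaration at stake: a user-defined prefix / operator. The reciprocal operator here is hypothetical and is not how CasePaths or the other packages define theirs. It assumes a compiler mode in which bare /.../ regex literals are not enabled.

```swift
prefix operator /

// Hypothetical prefix `/`: returns the reciprocal of its operand.
prefix func / (value: Double) -> Double {
    return 1 / value
}

let half = /2.0
let quarter = /4.0
```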

Incidentally, I wanted to give a shout out to @daveverwer and @finestructure for creating such a fantastic resource in SwiftPackageIndex.com that allows for this kind of analysis (as well as all the community members open-sourcing their packages).

aj_ortiz · April 16, 2022, 12:31am

It seems to me that #regex() would be an interesting syntax, with the possibility of being reused for other data types in the future. With my shallow understanding of the drawbacks of this syntax (parenthesis balancing and inconsistency with #literal()), would #regex("") be a possible solution?

There are no new delimiters beyond the ones we already use, balancing is handled within the scope of the "string", and a "string" is valid Swift syntax.

Even #regex(#""#) and

#regex("""

""")

could be available depending on the need for raw or multi-line regex.

hamishknight · April 16, 2022, 11:49am

benrimmington:

In the Escaping of backslashes section, I was confused by the following example:
// Matches '\' <word char> <whitespace>* '=' <whitespace>* <digit>+
let regex = try NSRegularExpression(pattern: "\\\\w\\s*=\\s*\\d+", options: [])
I'd expect the bare string literal to start with six backslashes:
"\\\\\\w\\s*=\\s*\\d+"

Good catch! That is correct, I've just fixed it.

Yes, that is right. To be clear, the example is mainly drawing a distinction between string literals, where raw syntax is useful for passing backslash sequences directly to an underlying consumer such as NSRegularExpression, and regex literals, where that is unnecessary. I've edited it to clarify.

hamishknight · April 16, 2022, 11:57am

The problem with #regex("...") is that it looks like a string literal argument to a magic literal, when in fact the quotes are part of the delimiter itself. For example, you wouldn't be able to do:

let pattern = "[abc]+"
let regex = #regex(pattern)

which would likely be unexpected.

benrimmington · April 16, 2022, 1:41pm

After reading the pitch and feedback, I suggest that the bare syntax be moved to the "Future Directions" section. There could still be an experimental compiler flag (i.e. -enable-experimental-bare-regex-syntax) to indicate that this isn't a permanent language dialect. A future proposal could then try to add the bare syntax to Swift 6.

ensan-hcl · April 16, 2022, 2:52pm

This may be an unpopular argument, but if the Regex DSL doesn't (or can't) support named captures, I think regex literals should not support them either. I believe that regex literals must not have attractive features other than their brevity. If regex literals have powerful features not found in the DSL, developers will end up choosing literals. But regex literals are a source of bugs because of their awful readability. Regex literals should be something like a way to write a quick script when developers want to try out an implementation.

Considering the behavior of StaticString, I don't think that would be so unnatural.

YOCKOW · April 16, 2022, 3:34pm

General remarks about bare /.../:

I guess the reason some people support /.../ is that "we have seen it in other languages".
However, we have to remember that Swift differs from other languages in many ways.

First (as mentioned repeatedly in this thread), we can define prefix/infix/postfix operators containing /.
The authors seem to think it is enough to change Swift's syntax rules, but it has come to light that a certain number of projects would break.

Second, regex in Swift may differ from regex in other languages.
Although this is out of scope for this pitch, it is still related.

Such a feature would confuse some folks, especially those coming from other languages.
Let me quote my opinion from pitch#1 thread:

Lastly, Swift has a sublime philosophy (I hope).
I agree that /.../ is simple and easy to write.
However, to be simple is not enough to be good in Swift.
I want to quote Mr. Lattner's utterance:

/.../ will certainly break Swift.
/.../ is not Swift's.
Does Swift have to borrow the syntax from others to break itself?
Will we get more benefit from /.../ than loss from it?
Think different.

christopherweems · April 16, 2022, 5:16pm

It would be nice to match expressions within a switch case, but I'm concerned about how it would perform.

As an example, I believe something like this would be slow, since it would have to compile the expression on every use of the switch:

switch userInput {
case try! Regex(compiling: #"[aeiou]+"#):
    return "All vowels here"
    
default:
    return "Not all vowels"
}

However I'm hopeful a RegexLiteral in the same position would perform well:

switch userInput {
case /[aeiou]+/:
    return "All vowels here"
    
default:
    return "Not all vowels"
}

Any thoughts on this use case?

Extra: The pattern matching operator driving the switch

It would be great if this were supported out-of-the-box in the Standard Library, but anyone can try out the first example by defining these operators:

func ~=<Output>(a: Regex<Output>, b: String) -> Bool {
    guard let _ = try? a.matchWhole(b) else { return false }
    return true
}

func ~=<Output>(a: Regex<Output>, b: Substring) -> Bool {
    guard let _ = try? a.matchWhole(b) else { return false }
    return true
}
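One way to sidestep the recompilation concern is to build the regex once and reuse it across switch statements. The sketch below swaps in Foundation's NSRegularExpression so that it does not depend on the pitched Regex initializer names. The `~=` overload and `classify(_:)` are hypothetical.

```swift
import Foundation

// Hypothetical operator: succeeds only if the whole string matches.
func ~= (pattern: NSRegularExpression, value: String) -> Bool {
    let full = NSRange(value.startIndex..., in: value)
    // `.anchored` pins the match to the start; the length check pins the end.
    guard let match = pattern.firstMatch(in: value, options: [.anchored], range: full) else {
        return false
    }
    return match.range.length == full.length
}

func classify(_ inputs: [String]) -> [String] {
    // Compiled once, outside the switch, then reused for every input.
    let vowels = try! NSRegularExpression(pattern: "[aeiou]+", options: [])
    return inputs.map { input -> String in
        switch input {
        case vowels:
            return "All vowels here"
        default:
            return "Not all vowels"
        }
    }
}
```

Whether a regex literal in a `case` would get the same treatment automatically is exactly the open question here; the sketch only shows the manual hoisting.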

Avi · April 16, 2022, 5:21pm

Is this scalable? What happens the next time Swift wants to co-opt a popular operator symbol for a built-in language feature?

I feel that the division between operators and quote marks should remain distinct. Not only does it prevent ill will with the community, but it simplifies understanding of the language.

As for the choice of / considered on its own: I feel that it is a mistake. I learned Perl over 20 years ago, and one of the best features is the ability to use a custom delimiter for regular expressions. Perhaps it's unique to what Perl is often used for, but I found that many times my expressions included path manipulation. Using a delimiter that allowed unescaped forward slashes was so common, the standard syntax might as well not have existed.

I don't find value in harming the Swift ecosystem, even slightly, for a feature no one is asking for.

Michael_Ilseman · April 16, 2022, 8:04pm

This was a concern of mine and one of the reasons I was a proponent of the re'...' alternative. It has a clear extension to raw(ish) and multi-line modes as well as establishing a convention for other kinds of "foreign program literals". That is, these are not data literals like numbers or strings, they're algorithm literals with richer structure and should avoid further conflating string literal syntax, hence the '. This distinction is most apparent when the contents of a literal affect the type, as with regex captures. This would extend to, say, a sql'...' or a doc'...', uint8'...', etc.

That being said, regex literals are unique in their prevalence as fragments passed directly to API or as components of a result builder. I don't think that, e.g., SQL fragments would be used in this fashion.

Nothing in this proposal precludes a scalable approach to foreign language literals using #lang(...) or lang'...'. If that happened, regex could clearly participate as well. Even in an alternate reality where we had a formal concept of foreign language literals with a convention, there's still value to a dedicated regex literal. The alternate reality would just shift priorities around (which is inherent to alternate reality scenarios).

There is a significant division. Backslash means something completely different in a regex than a string literal and the contents of the regex must be parsed in order to determine the type. These really are not data literals, except under a pedantically von Neumann view of computing.

Not sure about how a division prevents ill-will, unless you mean the language-mode-gated source breaks proposed. I definitely view breaking TCA as the biggest downside to this proposal.

This is why the #/.../# is being proposed, which does not require escaping interior slashes and is available immediately without a language mode check. This is, alongside the multi-line behavior of #/.../#, what got me (somewhat reluctantly) off the re'...' train.

I appreciate your perspective regarding harm to the ecosystem, I really do. However, the last bit about there being "no one" asking for this feature is unfounded and trivially falsifiable.

rvsrvs · April 16, 2022, 9:30pm

The discussion of scalability and of foreign language and algorithm literals makes me wonder if we are not missing an opportunity for something a bit more general here. It strikes me that in the larger sense, with regexes, we are embedding a non-Swift programming language inside of Swift and looking for a syntax that escapes into that language in a way that allows interoperability with the hosting language and its tool support.

It is easy for me to imagine other small special purpose languages that make sense in any number of fields (AMPL would be my own choice for one). The notion of consuming more and more custom operator characters as we find interesting extensions like these seems wrong. @scanon's comment above about #regex(...) makes me wish that extensions like this could be a normal feature of the language.

I'd like to make sure we avoid the situation Haskell is in with its massive number of language extensions, but I would like to be able to extend in this manner.

xwu · April 16, 2022, 9:51pm

Two distinct (potentially provocative, or alternatively very silly) thoughts:

First, regarding #regex("...") syntax—

hamishknight:

The problem with #regex("...") is that it looks like a string literal argument to a magic literal, when in fact the quotes are part of the delimiter itself. For example, you wouldn't be able to do:
let pattern = "[abc]+"
let regex = #regex(pattern)
which would likely be unexpected.

I've wholeheartedly agreed with @scanon above that a more succinct syntax for regex literals is ideal. However, if we're going to lean in the verbose direction, I'd much rather that we lean into it all the way:

// Literal, with all the build-time validation and strong typing goodness:
let x = #Regex("[abc]+")

// Not literal, but validated at runtime
// (see proposal review re dropping the `compiling` label):
let y = Regex("[abc]+")

This would be generalizable to a variety of existing types when the build-time evaluation facilities permit (see other proposal about @const and its future directions). I'm thinking of URL, for example:

let z = URL("http://example.com")
let w = #URL("http://example.com") // Not possible (yet!), regardless of syntax.
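For comparison, a sketch of the runtime-validated form as it exists today, using Foundation's failable `URL(string:)` initializer. The unlabeled `URL("...")` and `#URL("...")` spellings above are hypothetical.

```swift
import Foundation

// Validation happens at run time, so the result is optional
// and every caller has to handle the failure case.
let valid = URL(string: "http://example.com")
let empty = URL(string: "")

let host = valid?.host
let emptyWasRejected = (empty == nil)
```

A build-time-checked form would presumably turn the second case into a compile error and make the first non-optional.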

Yes, this would imply that we should support both multiline arguments and what @hamishknight says that one wouldn't be able to do:

let a = #Regex("""
    [abc]+
    """)

let pattern = """
    [abc]+
    """
let b = #Regex(pattern)

Second, regarding /.../ versus #/.../# versus alternatives in that vein, I haven't seen the following alternative mentioned—

In Perl, strings can be delimited by ', ", or custom delimiters (yes, with differences in which delimiters allow for interpolation inside), while in Swift we only support the double quotation marks. So...why not use double slashes as Swift's regex delimiter?

let c = //[abc]+//

Won't it be ambiguous with comments? I'm inspired by the approach taken in certain parts of this proposal where "[t]o avoid parsing confusion, [...] a literal will not be parsed if a closing delimiter is not present." I think we could adopt a similar approach to make double slashes work as delimiters: to avoid parsing confusion, parse as a regex literal only if a closing delimiter is present on the same line.

It is true that this would break some commented-out code that itself has inline comments, but in the future version of Swift where it's enabled such code could be migrated to use outer /* ... */-style comments. Certainly less destructive than making existing operators illegal.

I also know there are some file headers styled // ====== //, but as it happens, nothing is harmed by parsing that as a regex literal and then just dropping it...

Multiline regex literals, then, would be delimited by //////, which ought to be similarly capable of disambiguation versus the empty regex //// (just as """ is from "") as well as /// doc comments.
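A toy sketch of the single-line rule above: treat // as opening a regex literal only if another // follows on the same line, otherwise treat it as a comment. `DoubleSlashToken` and `classifyDoubleSlash(in:)` are hypothetical names, and the sketch ignores string literals, the ////// multi-line form, and everything else a real lexer would need to handle.

```swift
import Foundation

enum DoubleSlashToken: Equatable {
    case regexLiteral(String)   // text between the opening and closing //
    case comment(String)        // text after an unclosed //
    case noSlashes              // the line contains no //
}

func classifyDoubleSlash(in line: String) -> DoubleSlashToken {
    guard let open = line.range(of: "//") else { return .noSlashes }
    let rest = line[open.upperBound...]
    // A closing delimiter on the same line turns the span into a regex literal.
    if let close = rest.range(of: "//") {
        return .regexLiteral(String(rest[..<close.lowerBound]))
    }
    return .comment(String(rest))
}
```

Under this rule, a header line such as `// ====== //` classifies as a regex literal, which is the case the post argues is harmless.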

I'm sure I'm missing something obvious, but mulling over this for a bit, it seems workable from here.

rvsrvs · April 16, 2022, 9:58pm

#URL is a very nice idea that would seem to align well with current work.