[Pitch] Regular Expression Literals

Is there a way to set modifiers/options/flags (e.g. /foo/i) the way other languages that use / support them? I believe /(?i)foo/ is less well known and more cryptic than /foo/i
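
For anyone unfamiliar with the inline form, here is a minimal sketch of what the two spellings mean, using Foundation's existing NSRegularExpression rather than the pitched literal. The `matches` helper is a hypothetical name of mine, not part of the pitch.

```swift
import Foundation

// Hypothetical helper (not from the pitch): true if `pattern` matches anywhere in `text`.
// An invalid pattern is treated as "no match" to keep the sketch short.
func matches(_ pattern: String, in text: String,
             options: NSRegularExpression.Options = []) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return false
    }
    let range = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

// Inline flag inside the pattern, the /(?i)foo/ spelling.
let viaInlineFlag = matches("(?i)foo", in: "FOO")
// Flag supplied outside the pattern, closer in spirit to /foo/i.
let viaOption = matches("foo", in: "FOO", options: [.caseInsensitive])
// No flag at all: matching stays case sensitive.
let withoutFlag = matches("foo", in: "FOO")
```

The first two calls should behave the same way; the question is only which spelling the literal syntax makes convenient.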

Assuming Swift's Regex can't be initialized from a String value (Regex(string: strVal)), punting on the "raw" regex literal (#/.../#) is unfortunate, maybe too unfortunate. There's no way to treat / as a normal character. Also, #/.../# could be used to disambiguate all the cases listed in the "Impact of using / as the delimiter" section. Most other languages supporting /.../ have alternative quotes (e.g. qr{...}), or at least the regex is constructible from a string (e.g. RegExp("...")).

I think those are great points in favor of #regex() -- and it leaves open a path for future rich (maybe user defined) literals, as well as borrowing Perl's ability to use non-default delimiters for a given regex (e.g. m{^/path/name/} in Perl).

It isn't a huge increase over /pattern/ unless the pattern is trivially short, and it is a great decrease in complexity of finding rules that let us use slashes without breaking "too much" existing valid Swift code.

4 Likes

The pitch states

So at this point, the parser just parses them as it does now.

If we really need to change the parsing rule (under a version check), that would be something like "treat the / at the beginning of a line as the start of a regular expression literal".

EDIT: Wait, yeah I now understand your concern, @allevato . Let's see...

It's the sentence right after that one that concerns me:

So I hope the authors can elaborate more on what changes they think might be necessary. I think requiring semicolons to disambiguate a terminated statement from a wrapped one would be a major shortcoming (although result builders seem to have already cracked the door open a bit to that, based on the other examples they gave).

Right—a rule like that would make it impossible to wrap expressions such that binary operators are placed at the beginning of the next line when they occur at line breaks, without introducing an awkward special case for division, even if it was guarded by a language mode flag. (I'm not just saying that because operator-at-the-beginning-of-the-line is the default wrapping style used by swift-format, although that's part of the reason.) Asking users to rewrap their code when upgrading to the next version of Swift would be... unfortunate.
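
A small illustration of the wrapping style at stake, as a sketch with made-up names. Today the second line is the continuation of a division; a "leading / starts a regex literal" rule would have to special-case exactly this.

```swift
let total = 120
let count = 4

// Operator-leading wrap: the continuation line begins with `/`
// and is parsed as the right-hand side of a division.
let average = total
    / count
```

The same layout with `+` or `*` at the start of the line would be unaffected, which is what makes division the awkward special case.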

The Core Team is always reminding us that source compatibility is a high priority for changes to the language—even if a new language version flag is added, users should be able to upgrade their project without major disruptions, or my understanding is that the benefit should at least outweigh the disruption. This has blocked a number of places where Swift could smooth out long-time inconsistencies or rough edges. I hope that for a totally new feature, where we have the freedom to choose any delimiter we want at the beginning, we'll be consistent with that philosophy and not choose a delimiter that has the potential to be incredibly risky in terms of introducing ambiguity in existing code.

6 Likes

You could support both / and e.g. #/ by only allowing the former when the type is expected, using #/ as a shorthand for Regex() elsewhere.

let p: Regex = /0x([0-9a-fA-F]+)/
let q = #/0x([0-9a-fA-F]+)/
let r = Regex(/0x([0-9a-fA-F]+)/)

This mirrors how assignment works more generally; if the type is explicitly specified the right hand side can often omit some information, if it can be inferred by the compiler.

enum T {case a, b}
let i: T = .a
let j = T.b


var a: Int? = 1
var b = Optional<Int>(1)
1 Like

Overall being a big fan and fluent user of RegExes, I really like the pitch and the fact that we're planning on introducing RegEx literals in Swift.

My main concern, like many others above, is with all the impacts of using /.

At first I was like "yeah / makes sense because it's a term of art"… but now that I've realised how ambiguous it could be in all these situations (including for people already using libraries like PointFree's CasePath), I'm now +💯 on @ksluder 's suggestion of using #regex(…), which also matches precedents in Swift like #colorLiteral(…) and #imageLiteral(…) for other literals, and thus seems to fit very well here, all while avoiding the issues with /.

As for / being a term-of-art kind of separator, I'm not so sure that's strongly true. I mean yes many other languages use it for RegExes, but:

  • Languages like Ruby also allow things like %r{…} as an alternative to /…/ or Regexp.new(…)
  • If you're copy/pasting your RegEx from another language or from StackOverflow, it is really not a big deal to replace /…/ with #regex(…): what matters is not the delimiters but what's inside them.

Overall, the argument that "this is what other languages use as delimiter and it will make it easy to copy/paste from SO" is not critical imho. Making it easy to copy/paste a PCRE-compliant RegEx content from SO or another language? Definitely yes, 100%. Considering that it's as important to be able to also copy the delimiters around that RegEx content from the SO answer without having to replace the surrounding / with #regex() when pasting into Swift? Not so much.
The important thing imho to be able to copy/paste from another source is the RegEx itself, not its delimiters.


So, TL;DR: Since #regex(…) seems to match with other precedent for specific literals in Swift, and avoids all the issues that / delimiters would raise, I think it's an excellent candidate.

15 Likes

This actually points at a subtle problem with #regex(…), which is that /this syntax/ will include escaped slashes which will not need escaping in #regex(…) syntax. Would copy-pasting an expression and transforming its delimiters change the meaning of those escaped characters?

Perhaps this is best solved by implementing both #/…/ and #regex(…). The latter is the "formal" name for the feature, and the former is a shorthand. If you use the shorthand, you are required to escape any forward slashes. The shorthand is also extensible in raw-string-like ways, such as ##/this syntax/## which would permit unescaped forward slashes.
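
To make the copy-paste concern concrete, here is a rough sketch of the delimiter-only transformation. The helper name is hypothetical, and it assumes (as in PCRE-style engines) that `\/` and `/` match the same character, so removing the backslash should not change what the pattern matches.

```swift
// Hypothetical helper: turns `\/` into `/` and leaves every other escape untouched.
func unescapingDelimiterSlashes(_ body: String) -> String {
    var result = ""
    var iterator = body.makeIterator()
    while let ch = iterator.next() {
        guard ch == "\\" else {
            result.append(ch)
            continue
        }
        // A backslash always consumes the character after it.
        guard let next = iterator.next() else {
            result.append(ch)   // trailing lone backslash: keep as-is
            break
        }
        if next == "/" {
            result.append("/")  // `\/` becomes `/`
        } else {
            result.append(ch)   // keep escapes such as `\w` or `\\`
            result.append(next)
        }
    }
    return result
}

// Pattern body as it might be copied from between /…/ delimiters.
let copied = #"^\/usr\/(\w+)\/bin\\$"#
let pasted = unescapingDelimiterSlashes(copied)
```

Note that an escaped backslash followed by a slash (`\\/`) has to be scanned as a pair first, which is why a plain find-and-replace of `\/` would not be safe.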

2 Likes

@hamishknight Oh, one question I had about the pitch though: it's unclear to me with the straw-person example provided around the builder transformation, if the RegexLiteral associated type (and RegexLiteralProtocol) are supposed to be a kind of builder itself, or the result / return type of one.

For example, in the example we have:

let regex = {
  var builder = T.RegexLiteral()

  let __A1 = builder.buildCharacterClass_POSIX_alpha()
…
  let __B1 = builder.buildLiteral(" = ")
…
  let __D1 = builder.buildConcatenate(__A4, __B1, __C3)
  return T(regexLiteral: builder.finalize(__D1))
}()

So in that example, T.RegexLiteral, which is the associated type conforming to RegexLiteralProtocol:

  • Is used as a builder (it's actually also reflected by how you also named the variable, var builder) as the generated code would call builder.buildXXX() on it…
  • But is also used as the return type of the builder, as the last line of the example implies. Indeed, you end up calling builder.finalize(__D1), which must itself return a RegexLiteral – given that's the parameter type expected by T(regexLiteral: RegexLiteral)

So… is it a Regex builder, or the result produced by one? :thinking: I think we might either need an additional, intermediate type to differentiate the builder from the literal type it builds… or if the goal is to make this work very similarly to how StringInterpolation works, that the straw-person example might be slightly misleading, and that we might not need the finalize and that the last line could instead be return T(regexLiteral: __D1) (or, maybe T(regexLiteral: __D1.finalize()) if we do need a finalize operation).

PS: I'm sorry if this might seem nitpicking at an example that is explicitly said to only be illustrative and be straw-person transformation and not the official thing, but I still think it would help understanding the proposal by fixing/clarifying this. Thanks!

2 Likes

That is a good point that I didn't think about.

That being said, I feel like it's more important to avoid potential issues or ambiguities with existing Swift features (like custom operators and use cases like CasePath) than having to unescape any copy/pasted RegEx in order to paste it into your Swift code. And, if anything, removing those escape characters will make the resulting regex literal more readable anyway :stuck_out_tongue:
And I also feel like we'd be almost as likely to write a RegEx manually from scratch in our Swift code as to copy/paste one from another language or from SO, and I'd very much appreciate a solution where we could avoid the escaping-hell if possible :wink:

Also, this will only be a problem if you copy the RegEx from a language that does use /, like Perl. If you copy the RegEx from, say, Ruby, most Ruby developers would use %r{…} instead of /…/ when the RegEx contains / literals, exactly because they would otherwise have to be escaped, so using %r{…} makes them more readable, just like using #regex(…) in Swift would.


I like the alternative you suggest of having #regex(…) be the canonical way to do it, and allow a / variant to be a shorthand.
My vote would go for #/…/# rather than just #/…/ though for such shorthand, especially because it would mirror nicely Raw String Literals #"…"# – and would even open the future direction of supporting ###/…/### for RegExes if we want to go in that direction, just like we support ###"…"### for Strings.
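
Here is a rough sketch of how the matching-pound rule could find the end of such a literal, mirroring raw strings. This is purely my own illustration with a hypothetical function name; it does not handle escapes and is not how the pitch says it would be lexed.

```swift
// Hypothetical scanner: expects N `#` then `/`, and ends the body at the
// first `/` that is followed by the same number of `#`.
func rawRegexBody(_ source: String) -> String? {
    let chars = Array(source)
    let poundCount = chars.prefix(while: { $0 == "#" }).count
    guard poundCount > 0, chars.count > poundCount, chars[poundCount] == "/" else {
        return nil
    }
    let bodyStart = poundCount + 1
    let closer: [Character] = ["/"] + Array(repeating: "#", count: poundCount)

    var i = bodyStart
    while i + closer.count <= chars.count {
        if Array(chars[i..<(i + closer.count)]) == closer {
            return String(chars[bodyStart..<i])
        }
        i += 1
    }
    return nil  // no closing delimiter found
}

let single = rawRegexBody("#/usr/bin/#")
let double = rawRegexBody("##/a/#b/##")
```

With one pound the interior `/` needs no escaping, and with two pounds even an interior `/#` is fine, which is the same escalation raw string literals offer.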

1 Like

I don't think the discussion/investigation is far enough along to conclude that there are "so many weird edge cases". There's a lot of prose in the pitch devoted to this topic, but there's not a lot of changes or edge cases in parsing behavior being pitched.

The pitch goes over comments and concludes there's no issue there (beyond future directions concerning multi-line regex literal syntax, which we already have alternatives for). It goes over custom infix operators containing / and concludes there is no issue there, the parsing is the exact same and users disambiguate with whitespace (like they currently would do).

Custom prefix/postfix operators with / are the first place where issues come up. It is true that we may change the set of available prefix/postfix operator characters under a language mode check. Or, alternatively, we may have some way of quoting or escaping an operator, not unlike identifiers. Often, parentheses disambiguate, just like they do for expressions elsewhere.

The division operator is pitched as parsing the same way it does now if that's "sufficient", pending investigation. If not, then it may be the case that regex literals are preferred (at least under a language mode check) and here is where there are still some unknowns. But I think it's too early to assume that the end result would be a pile of weird edge cases. If it is, then we'd pick another option (e.g. #/ ... /# or '/ ... /').

I'm not trying to understate the impact and it's very much possible that the end result of the investigation is to pick something other than just /. I just don't think we've accumulated as much weirdness as one might think.

Perl's quote operators are mentioned in future directions. Just as with raw string literals, it's more likely we'll be looking into raw regex literals if we are going this route (see below).

From future directions:

If / doesn't work out, one option is to jump straight to this (strawperson) formulation of a raw regex literal, where #/ ... /# would fix the parsing issue and not require escaping an interior / character (though there's nothing wrong with escaping it).

Yes, it would, and IIUC this is not a direction even being considered. The more likely scenario, as pitched, would be that if you wanted something that would normally parse as a chain of divisions over lines to parse as a regex literal, you would terminate the preceding statement.

The big question is if this is enough, but I think there's a decent chance it is (@hamishknight and @rintaro know this area better than me, though). Regex literals to the right-hand-side of assignment wouldn't suffer from this issue, nor would regex literals passed to API. The main place where you would have an expression without surrounding syntactic context would be inside result builders, which already suffer from this syntax issue. It would be really nice to not have to terminate the prior line to use closures, .member, or regex literals in a result builder, and I think this is where the discussion starts.

4 Likes
  let __D1 = builder.buildConcatenate(__A4, __B1, __C3)
  return T(regexLiteral: builder.finalize(__D1))

__D1 is a (type unspecified in this pitch) token or reference to an AST node. It is not a literal type itself.

The builder.finalize(__D1) might be formulated as just a mutating method that doesn't return the final literal. As you said, it might not even be necessary, but I could imagine wanting to post-process your AST for some reason before trying to run the initializer.

2 Likes

Indeed, future directions hint at workarounds for the escaping problem, but I'd rather the default syntax didn't create that problem in the first place so we wouldn't need another syntax as a workaround. Using () for delimiters we wouldn't need two syntaxes at all.

4 Likes

I’d like to somewhat reiterate my earlier request for help understanding why the pitch is so strongly in favor of choosing the proposed delimiter.

Subsequent comments have made additional arguments for favoring consistency within Swift itself over consistency with other languages.

And if there’s a syntax that is held favorably, that has zero ambiguity, no need for version modes, and is consistent with other parts of swift syntax, wouldn’t that be the most desirable route?

9 Likes

That would make way more sense indeed to have finalize(…) in this example be mutating :+1: … which means it should thus return Void and be used like below instead:

builder.finalize(__D1)
return T(regexLiteral: builder)

That would solve my initial confusion of having builder: RegexLiteral seemingly playing a dual role, because otherwise, to draw a parallel, the current code looked to me as if I had a BurgerBuilder with methods like addPatty(), addOnions(), … but its burgerBuilder.finalize() returned another BurgerBuilder instead of a Burger
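
To spell out the shape I have in mind, here is a minimal sketch. The protocol names are borrowed from the pitch, but every signature below (including the `Node` associated type) is my own assumption, and `PatternBuilder` / `Pattern` are toy types invented for the example.

```swift
// Assumed shape only: the pitch does not specify these requirements.
protocol RegexLiteralProtocol {
    associatedtype Node
    init()
    mutating func buildLiteral(_ text: String) -> Node
    mutating func buildConcatenate(_ nodes: Node...) -> Node
    mutating func finalize(_ root: Node)   // returns Void, just records the root
}

protocol ExpressibleByRegexLiteral {
    associatedtype RegexLiteral: RegexLiteralProtocol
    init(regexLiteral: RegexLiteral)
}

// Toy builder: its nodes are plain pattern strings.
struct PatternBuilder: RegexLiteralProtocol {
    private(set) var source = ""
    init() {}
    mutating func buildLiteral(_ text: String) -> String { text }
    mutating func buildConcatenate(_ nodes: String...) -> String { nodes.joined() }
    mutating func finalize(_ root: String) { source = root }
}

// Toy product type: consumes the finished builder.
struct Pattern: ExpressibleByRegexLiteral {
    let source: String
    init(regexLiteral: PatternBuilder) { source = regexLiteral.source }
}

// What the compiler-generated code might look like under these assumptions.
var literalBuilder = Pattern.RegexLiteral()
let key = literalBuilder.buildLiteral("key")
let separator = literalBuilder.buildLiteral(" = ")
let whole = literalBuilder.buildConcatenate(key, separator)
literalBuilder.finalize(whole)
let pattern = Pattern(regexLiteral: literalBuilder)
```

In this reading the builder is the accumulating context and the product type is whatever `init(regexLiteral:)` makes out of it, so the burger never has to be a burger builder.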


Again, I know it might sound nitpicky (and I'm sorry about that), but I think fixing that tiny thing in the example of the proposal would go a long way in avoiding the confusion and helping better understand the role we plan RegexLiteral, ExpressibleByRegexLiteral and RegexLiteralProtocol to have and how they'd work together. Thanks!

1 Like

From the implementation point of view, the question is: when the lexer finds a / in the source text, how does it decide whether to tokenize it as an operator or a regex literal? Since the lexer should not know the grammar, we want to decide that (hopefully) only from the preceding characters.
@hamishknight do you have any thoughts around here?

I think we should tokenize it as a regex literal only if the preceding non-whitespace character is one of the following:

  • equal: ... = / foo ...
  • open parens: ... ( / foo ... incl. [ and {
  • operators: e.g. ... * / foo ...
  • colon: ... : / foo ...
  • comma: ... , / foo ...
  • question: ... ? / foo ... (assuming in a ternary operator)
  • semicolon: ... ; / foo ...
  • start of the file: / foo ...

Otherwise we should keep tokenizing it as an operator:

  • identifier : e.g. ... bar / foo ... But how about keywords (e.g. try / foo ...) or contextual keywords (e.g. await / foo ...)
  • close parens: ... ] / foo ... incl. ) and }
  • number: e.g. ... 0.2 / foo ...
  • quote: ... " / foo
  • hash: ... # / foo (# might be the end of a string literal)
  • period: ... . / foo (probably an error)
  • exclaim: ... ! / foo (probably an error)
  • at mark: ... @ / foo (probably an error)
  • backslash: ... \ / foo (probably an error)
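
The two lists above can be sketched as a small classifier. This is only an illustration of the heuristic, not the actual Swift lexer: the type and function names are hypothetical, and the set of "operator" characters is my own guess at what the list intends.

```swift
// Hypothetical classifier for the heuristic above.
enum SlashRole { case regexLiteralStart, operatorToken }

func classifySlash(precededBy previous: Character?) -> SlashRole {
    // Start of file: nothing precedes the slash.
    guard let previous = previous else { return .regexLiteralStart }

    // `=`, opening brackets, `:`, `,`, `?`, `;`
    let regexContexts: Set<Character> = ["=", "(", "[", "{", ":", ",", "?", ";"]
    // Assumed operator characters; `!` is deliberately excluded per the list.
    let operatorCharacters: Set<Character> = ["+", "-", "*", "/", "%", "<", ">", "&", "|", "^", "~"]

    if regexContexts.contains(previous) || operatorCharacters.contains(previous) {
        return .regexLiteralStart
    }
    // Identifiers, numbers, closing brackets, quotes, `#`, `.`, `!`, `@`, `\`
    // keep today's behaviour.
    return .operatorToken
}

// Last non-whitespace character before `index`, or nil at the start of the file.
func previousSignificantCharacter(in source: String, before index: String.Index) -> Character? {
    source[..<index].last(where: { !$0.isWhitespace })
}

let assignment = "let r = /foo/"
let assignmentSlash = assignment.firstIndex(of: "/")!
let assignmentRole = classifySlash(
    precededBy: previousSignificantCharacter(in: assignment, before: assignmentSlash))

let division = "bar / foo"
let divisionSlash = division.firstIndex(of: "/")!
let divisionRole = classifySlash(
    precededBy: previousSignificantCharacter(in: division, before: divisionSlash))
```

Keywords such as try or await end in identifier characters, so this sketch classifies them as operator context, which is exactly the open question raised in the identifier bullet.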
2 Likes

I’m deeply opposed to using / literal / as a literal. I feel that / is only indicative of regex in the context of regex. It is not immediately obvious to me that that is regex outside that context. / is primarily used as part of a comment marker or binary infix operator in Swift right now, and even reading this proposal I have trouble shaking that interpretation.

If we are going to have a specialized literal in Swift, we should follow current precedents and spell it out with #regex(literal). The additional verbosity is important, and it would make parsing far easier.

13 Likes

As for the problem of escaping /: I realized we could make a rule that slashes enclosed in parens are not delimiters. E.g. /(?:/usr/bin)/ is a valid regex literal equivalent to qr{/usr/bin}. Not so cute, but I personally can live with this.
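
A minimal sketch of that rule, assuming a hypothetical scanner that tracks paren depth and treats a backslash as escaping the next character. It ignores character classes such as [/] for brevity, so it is an illustration rather than a complete lexer.

```swift
// Hypothetical scanner: the literal ends at the first unescaped `/`
// that is not inside parentheses.
func regexLiteralBody(_ source: String) -> String? {
    var chars = source.makeIterator()
    guard chars.next() == "/" else { return nil }

    var body = ""
    var depth = 0
    while let ch = chars.next() {
        switch ch {
        case "\\":
            // Keep the escape and whatever it escapes, including `\/` and `\(`.
            body.append(ch)
            if let escaped = chars.next() { body.append(escaped) }
        case "(":
            depth += 1
            body.append(ch)
        case ")":
            depth = max(0, depth - 1)
            body.append(ch)
        case "/" where depth == 0:
            return body          // closing delimiter
        default:
            body.append(ch)      // includes `/` nested inside parens
        }
    }
    return nil                   // ran out of input before a closing `/`
}

let grouped = regexLiteralBody("/(?:/usr/bin)/")
let ungrouped = regexLiteralBody("/usr/bin/")
```

The ungrouped form stops at the first interior slash, which shows why the non-capturing group is needed as the workaround.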

I had basically the same question when I read over the pitch, and the ultimate answer is that the exact set of calls hasn't really been developed yet so this is sort of a placeholder. But I see the builder instance as a sort of context that can accumulate information about the literal on the side. The build* methods store information into the builder, then return values that are used to relate whatever they added to other parts of the literal, but the exact split of information is for the builder to decide. So, for instance, if we used this example from the pitch with different builder types:

  var builder = T.RegexLiteral()

  // __A4 = /([[:alpha:]]\w*)/
  let __A1 = builder.buildCharacterClass_POSIX_alpha()
  let __A2 = builder.buildCharacterClass_w()
  let __A3 = builder.buildConcatenate(__A1, __A2)
  let __A4 = builder.buildCaptureGroup(__A3)

  // __B1 = / = /
  let __B1 = builder.buildLiteral(" = ")

  // __C3 = /([0-9A-F]+)/
  let __C1 = builder.buildCustomCharacterClass(["0"..."9", "A"..."F"])
  let __C2 = builder.buildOneOrMore(__C1)
  let __C3 = builder.buildCaptureGroup(__C2)

  let __D1 = builder.buildConcatenate(__A4, __B1, __C3)
  builder.finalize(__D1)

Then one type's builder could maintain a list of rules and return indices from the build* methods which can be used to reference previous rules:

startingRule = Optional.some(8)
rules = [
    .characterClass([.posixAlpha]),             // 0
    .characterClass([.w]),                      // 1
    .sequence([0, 1]),                          // 2
    .capture(2),                                // 3
    .literal(" = "),                            // 4
    .characterClass(["0"..."9", "A"..."F"]),    // 5
    .repeat(5, 1 ..< .max),                     // 6
    .capture(6),                                // 7
    .sequence([3, 4, 7])                        // 8
]

Another type could maintain a stack of regex fragments in the builder and return Void values from the build* methods, simply using the number of parameters to buildConcatenate(...) to figure out how many values to pop from the stack:

fragments = [#"[[:alpha:]]"#]
fragments = [#"[[:alpha:]]"#, #"\w"#]
fragments = [#"[[:alpha:]]\w"#]
fragments = [#"([[:alpha:]]\w)"#]
fragments = [#"([[:alpha:]]\w)"#, #" = "#]
fragments = [#"([[:alpha:]]\w)"#, #" = "#, #"[0-9A-F]"#]
fragments = [#"([[:alpha:]]\w)"#, #" = "#, #"[0-9A-F]+"#]
fragments = [#"([[:alpha:]]\w)"#, #" = "#, #"([0-9A-F]+)"#]
fragments = [#"([[:alpha:]]\w) = ([0-9A-F]+)"#]

A third type could build up some sort of bytecode representation in the builder. A fourth could have nothing at all in the builder and just put all of the information in the return values and parameters. The point is, the code we generate would be flexible enough to support many different implementation approaches.
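
For illustration, the rule-table variant described above could be sketched roughly like this. The `Rule` cases, the builder type, and the method signatures are all hypothetical simplifications of mine (character classes are plain strings here), since the pitch leaves the real set of calls open.

```swift
// Hypothetical rule-table builder: each build* call appends a rule and
// returns its index so later calls can refer back to it.
enum Rule: Equatable {
    case characterClass(String)
    case literal(String)
    case sequence([Int])
    case capture(Int)
    case oneOrMore(Int)
}

struct RuleTableBuilder {
    private(set) var rules: [Rule] = []
    private(set) var startingRule: Int? = nil

    private mutating func add(_ rule: Rule) -> Int {
        rules.append(rule)
        return rules.count - 1
    }

    mutating func buildCharacterClass(_ name: String) -> Int { add(.characterClass(name)) }
    mutating func buildLiteral(_ text: String) -> Int { add(.literal(text)) }
    mutating func buildConcatenate(_ parts: Int...) -> Int { add(.sequence(parts)) }
    mutating func buildCaptureGroup(_ inner: Int) -> Int { add(.capture(inner)) }
    mutating func buildOneOrMore(_ inner: Int) -> Int { add(.oneOrMore(inner)) }
    mutating func finalize(_ root: Int) { startingRule = root }
}

// Same call sequence as the example from the pitch.
var tableBuilder = RuleTableBuilder()
let a1 = tableBuilder.buildCharacterClass("[:alpha:]")
let a2 = tableBuilder.buildCharacterClass("\\w")
let a3 = tableBuilder.buildConcatenate(a1, a2)
let a4 = tableBuilder.buildCaptureGroup(a3)
let b1 = tableBuilder.buildLiteral(" = ")
let c1 = tableBuilder.buildCharacterClass("0-9A-F")
let c2 = tableBuilder.buildOneOrMore(c1)
let c3 = tableBuilder.buildCaptureGroup(c2)
let d1 = tableBuilder.buildConcatenate(a4, b1, c3)
tableBuilder.finalize(d1)
```

The generated call sequence never looks inside the returned values, which is what would let a stack-based or bytecode-based builder slot in behind the same calls.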

4 Likes

While it may not matter in practice, I think it is worth noting that it is impossible to have an empty regex literal with / delimiters.
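
The reason, as far as I can tell, is simply that two adjacent slashes already start a line comment. A tiny sketch with a made-up constant:

```swift
// Everything after the two slashes on the next line is a comment,
// so this is neither a division nor an empty regex literal.
let value = 8 // 2
```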

I am in strong agreement with this response. Swift had an opportunity to “Think Different” and create a more modern, much clearer approach to pattern matching.

3 Likes