[Pitch] Regular Expression Literals

ksluder · October 14, 2021, 11:03pm

This actually points at a subtle problem with #regex(…), which is that /this syntax/ will include escaped slashes which will not need escaping in #regex(…) syntax. Would copy-pasting an expression and transforming its delimiters change the meaning of those escaped characters?

Perhaps this is best solved by implementing both #/…/ and #regex(…). The latter is the "formal" name for the feature, and the former is a shorthand. If you use the shorthand, you are required to escape any forward slashes. The shorthand is also extensible in raw-string-like ways, such as ##/this syntax/## which would permit unescaped forward slashes.
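One way to probe the copy-paste question is with an existing engine. The sketch below uses Foundation's NSRegularExpression (ICU syntax) as a stand-in, not the pitched literal, and the helper name matchesWhole is hypothetical. In ICU-style syntax a backslash before a non-alphanumeric character simply quotes it, so the escaped and unescaped spellings should match the same text; whether the pitched literals would behave identically is an assumption.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches the whole of `text`.
func matchesWhole(_ pattern: String, _ text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return false }
    let whole = NSRange(text.startIndex..., in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: whole) else { return false }
    return match.range == whole
}

let path = "/tmp/foo"
let escaped = #"\/tmp\/foo"#    // as copied out of a /…/ literal
let unescaped = #"/tmp/foo"#    // as one might write it inside #regex(…)

print(matchesWhole(escaped, path), matchesWhole(unescaped, path))  // prints "true true"
```

Under that assumption, changing delimiters leaves the meaning alone and the leftover backslashes are merely redundant.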

AliSoftware · October 14, 2021, 11:10pm

@hamishknight Oh, one question I had about the pitch though: it's unclear to me with the straw-person example provided around the builder transformation, if the RegexLiteral associated type (and RegexLiteralProtocol) are supposed to be a kind of builder itself, or the result / return type of one.

For example, in the example we have:

let regex = {
  var builder = T.RegexLiteral()

  let __A1 = builder.buildCharacterClass_POSIX_alpha()
…
  let __B1 = builder.buildLiteral(" = ")
…
  let __D1 = builder.buildConcatenate(__A4, __B1, __C3)
  return T(regexLiteral: builder.finalize(__D1))
}()

So in that example, T.RegexLiteral, which is the associated type conforming to RegexLiteralProtocol:

Is used as a builder (it's actually also reflected by how you also named the variable, var builder) as the generated code would call builder.buildXXX() on it…
But is also used as the return type of the builder, as the last line of the example implies. Indeed, you end up calling builder.finalize(__D1), which must itself return a RegexLiteral – given that's the parameter type expected by T(regexLiteral: RegexLiteral)

So… is it a Regex builder, or the result produced by one? I think we might either need an additional, intermediate type to differentiate the builder from the literal type it builds… or if the goal is to make this work very similarly to how StringInterpolation works, that the straw-person example might be slightly misleading, and that we might not need the finalize and that the last line could instead be return T(regexLiteral: __D1) (or, maybe T(regexLiteral: __D1.finalize()) if we do need a finalize operation).

PS: I'm sorry if this might seem nitpicking at an example that is explicitly said to only be illustrative and be straw-person transformation and not the official thing, but I still think it would help understanding the proposal by fixing/clarifying this. Thanks!

AliSoftware · October 14, 2021, 11:18pm

That is a good point that I didn't think about.

That being said, I feel like it's more important to avoid potential issues or ambiguities with existing Swift features (like custom operators and use cases like CasePath) than to avoid having to unescape any copy/pasted RegEx in order to paste it into your Swift code. And, if anything, removing those escapes will make the resulting regex literal more readable anyway.
And I also feel like we'd be almost as likely to write a RegEx manually from scratch in our Swift code as to copy/paste one from another language or from SO, and I'd very much appreciate a solution where we could avoid the escaping-hell if possible.

Also, this will only be a problem if you copy the RegEx from a language that does use /, like Perl. If you copy the RegEx from, say, Ruby, most Ruby developers would use %r{…} instead of /…/ when the RegEx contains / literals, exactly because they would otherwise have to be escaped, so using %r{…} makes them more readable, just like using #regex(…) in Swift would.

I like the alternative you suggest of having #regex(…) be the canonical way to do it, and allow a / variant to be a shorthand.
My vote would go for #/…/# rather than just #/…/ though for such shorthand, especially because it would mirror nicely Raw String Literals #"…"# – and would even open the future direction of supporting ###/…/### for RegExes if we want to go in that direction, just like we support ###"…"### for Strings.

Michael_Ilseman · October 14, 2021, 11:31pm

I don't think the discussion/investigation is far enough along to conclude that there are "so many weird edge cases". There's a lot of prose in the pitch devoted to this topic, but there's not a lot of changes or edge cases in parsing behavior being pitched.

The pitch goes over comments and concludes there's no issue there (beyond future directions concerning multi-line regex literal syntax, which we already have alternatives for). It goes over custom infix operators containing / and concludes there is no issue there, the parsing is the exact same and users disambiguate with whitespace (like they currently would do).

Custom prefix/postfix operators with / are the first place where issues come up. It is true that we may change the set of available prefix/postfix operator characters under a language mode check. Or, alternatively, we may have some way of quoting or escaping an operator, not unlike identifiers. Often, parentheses disambiguate, just like they do for expressions elsewhere.

The division operator is pitched as parsing the same way it does now if that's "sufficient", pending investigation. If not, then it may be the case that regex literals are preferred (at least under a language mode check) and here is where there are still some unknowns. But I think it's too early to assume that the end result would be a pile of weird edge cases. If it is, then we'd pick another option (e.g. #/ ... /# or '/ ... /').

I'm not trying to understate the impact and it's very much possible that the end result of the investigation is to pick something other than just /. I just don't think we've accumulated as much weirdness as one might think.

Perl's quote operators are mentioned in future directions. Just as with raw string literals, it's more likely we'll be looking into raw regex literals if we are going this route (see below).

From future directions:

hamishknight:

User-specified choice of quote delimiters is considered future work. A related approach to this could be a "raw" regex literal analogous to raw strings. For example (total strawperson), an approach where n #s before the opening delimiter would require n #s at the end of the trailing delimiter, as well as requiring n-1 #s to access metacharacters.
// All of the below are trying to match a path like "/tmp/foo/bar/File.app/file.txt"

/\/tmp\/.*\/File\.app\/file\.txt/
#//tmp/.*/File\.app/file\.txt/#
##//tmp/#.#*/File.app/file.txt/##

If / doesn't work out, one option is to jump straight to this (strawperson) formulation of a raw regex literal, where #/ ... /# would fix the parsing issue and not require escaping an interior / character (though there's nothing wrong with escaping it).
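To make the delimiter matching concrete, here is a rough sketch of how the body of such a literal could be located. The function name rawRegexBody is hypothetical, and the sketch only finds delimiters: it assumes a backslash escapes the next character in the plain / ... / form, and it ignores the n-1 # metacharacter rule entirely.

```swift
// Hypothetical sketch: given the full text of one literal, return its body,
// or nil if the opening and closing delimiters do not line up.
func rawRegexBody(_ source: String) -> String? {
    let chars = Array(source)

    // Count the leading # characters (n).
    var pounds = 0
    while pounds < chars.count, chars[pounds] == "#" { pounds += 1 }
    guard pounds < chars.count, chars[pounds] == "/" else { return nil }

    let bodyStart = pounds + 1
    var i = bodyStart
    while i < chars.count {
        // Assumption: only the plain form treats "\" as escaping the next character.
        if pounds == 0, chars[i] == "\\" { i += 2; continue }
        if chars[i] == "/" {
            // A closing delimiter is "/" followed by exactly n "#" at the end of the text.
            let closeEnd = i + 1 + pounds
            if closeEnd <= chars.count,
               chars[(i + 1)..<closeEnd].allSatisfy({ $0 == "#" }) {
                return closeEnd == chars.count ? String(chars[bodyStart..<i]) : nil
            }
        }
        i += 1
    }
    return nil
}

print(rawRegexBody(#"#//tmp/.*/File\.app/file\.txt/#"#) ?? "nil")
```

With one or more #s, an interior / never ends the literal unless it is followed by the full run of #s, which is why the second and third examples need no escaped slashes.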

Yes, it would, and IIUC this is not a direction even being considered. The more likely scenario, as pitched, would be that if you wanted something that would normally parse as a chain of divisions over lines to parse as a regex literal, you would terminate the preceding statement.

The big question is if this is enough, but I think there's a decent chance it is (@hamishknight and @rintaro know this area better than me, though). Regex literals to the right-hand-side of assignment wouldn't suffer from this issue, nor would regex literals passed to API. The main place where you would have an expression without surrounding syntactic context would be inside result builders, which already suffer from this syntax issue. It would be really nice to not have to terminate the prior line to use closures, .member, or regex literals in a result builder, and I think this is where the discussion starts.

Michael_Ilseman · October 14, 2021, 11:39pm

  let __D1 = builder.buildConcatenate(__A4, __B1, __C3)
  return T(regexLiteral: builder.finalize(__D1))

__D1 is a (type unspecified in this pitch) token or reference to an AST node. It is not a literal type itself.

The builder.finalize(__D1) might be formulated as just a mutating method that doesn't return the final literal. As you said, it might not even be necessary, but I could imagine wanting to post-process your AST for some reason before trying to run the initializer.

michelf · October 14, 2021, 11:42pm

Indeed, future directions hint at workarounds for the escaping problem, but I'd rather the default syntax didn't create that problem in the first place so we wouldn't need another syntax as a workaround. Using () for delimiters we wouldn't need two syntaxes at all.

allenh · October 15, 2021, 12:11am

I’d like to somewhat reiterate my earlier request for help understanding why the pitch is so strongly in favor of choosing the proposed delimiter.

Subsequent comments have made additional arguments for favoring consistency within Swift itself over consistency with other languages.

And if there’s a syntax that is held favorably, that has zero ambiguity, no need for version modes, and is consistent with other parts of swift syntax, wouldn’t that be the most desirable route?

AliSoftware · October 15, 2021, 12:31am

That would make way more sense indeed to have finalize(…) in this example be mutating … which means it should thus return Void and be used like below instead:

builder.finalize(__D1)
return T(regexLiteral: builder)

That would solve my initial confusion of having builder: RegexLiteral seemingly playing a dual role – because otherwise, to make a parallel, the current code looked to me as if I had a BurgerBuilder with methods like addPatty(), addOnions(), … but its burgerBuilder.finalize() would return another BurgerBuilder instead of a Burger…
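For illustration, the split between "builder" and "thing built" could look roughly like the sketch below. The protocol names are prefixed with Sketch because their shapes are assumptions (the pitch leaves the real signatures open), and PatternBuilder, Pattern, and build are hypothetical names.

```swift
// Assumed shape of the builder protocol: build* methods hand back opaque
// nodes, and finalize(_:) mutates the builder instead of returning a literal.
protocol SketchRegexLiteralProtocol {
    associatedtype Node
    init()
    mutating func buildLiteral(_ text: String) -> Node
    mutating func buildConcatenate(_ parts: Node...) -> Node
    mutating func finalize(_ root: Node)
}

protocol SketchExpressibleByRegexLiteral {
    associatedtype RegexLiteral: SketchRegexLiteralProtocol
    init(regexLiteral: RegexLiteral)
}

// A toy builder whose nodes are indices into a list of pattern strings.
struct PatternBuilder: SketchRegexLiteralProtocol {
    private(set) var nodes: [String] = []
    private(set) var root: Int? = nil

    init() {}

    mutating func buildLiteral(_ text: String) -> Int {
        nodes.append(text)
        return nodes.count - 1
    }

    mutating func buildConcatenate(_ parts: Int...) -> Int {
        let joined = parts.map { nodes[$0] }.joined()
        nodes.append(joined)
        return nodes.count - 1
    }

    mutating func finalize(_ root: Int) { self.root = root }
}

// The type being built; it only reads the finished builder.
struct Pattern: SketchExpressibleByRegexLiteral {
    typealias RegexLiteral = PatternBuilder
    let source: String

    init(regexLiteral builder: PatternBuilder) {
        source = builder.root.map { builder.nodes[$0] } ?? ""
    }
}

// Roughly what compiler-generated code might do for a tiny literal.
func build<T: SketchExpressibleByRegexLiteral>(_ type: T.Type) -> T {
    var builder = T.RegexLiteral()
    let a = builder.buildLiteral("key")
    let b = builder.buildLiteral(" = ")
    let d = builder.buildConcatenate(a, b)
    builder.finalize(d)
    return T(regexLiteral: builder)
}

print(build(Pattern.self).source)  // prints "key = "
```

Here the builder is the burger-in-progress and Pattern is the burger: finalize returns nothing, and the builder itself is what gets passed to init(regexLiteral:).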

Again, I know it might sound nitpicky (and I'm sorry about that), but I think fixing that tiny thing in the example of the proposal would go a long way in avoiding the confusion and helping better understand the role we plan RegexLiteral, ExpressibleByRegexLiteral and RegexLiteralProtocol to have and how they'd work together. Thanks!

rintaro · October 15, 2021, 12:39am

From the implementation point of view, the question is: when the Lexer finds a / in the source text, how does it decide whether to tokenize it as an operator or a regex literal? Since the Lexer should not know the grammar, we want to decide that (hopefully) only from the preceding characters.
@hamishknight do you have any thoughts around here?

I think we should tokenize it as a regex literal only if the preceding non-whitespace character is one of a certain set. Specifically:

equal: ... = / foo ...
open parens: ... ( / foo ... incl. [ and {
operators: e.g. ... * / foo ...
colon: ... : / foo ...
comma: ... , / foo ...
question: ... ? / foo ... (assuming in a ternary operator)
semicolon: ... ; / foo ...
start of the file: / foo ...

Otherwise we should keep tokenizing it as an operator:

identifier: e.g. ... bar / foo ... But how about keywords (e.g. try / foo ...) or contextual keywords (e.g. await / foo ...)?
close parens: ... ] / foo ... incl. ) and }
number: e.g. ... 0.2 / foo ...
quote: ... " / foo
hash: ... # / foo (# might be the end of a string literal)
period: ... . / foo (probably an error)
exclaim: ... ! / foo (probably an error)
at mark: ... @ / foo (probably an error)
backslash: ... \ / foo (probably an error)
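The two lists above can be sketched as a lookup on the previous non-whitespace character. This is only an illustration: the names SlashKind, classifySlash, and previousSignificantCharacter are hypothetical, the operator characters in the set are an assumed subset, and the open question about keywords such as try and await is not handled (anything ending in an identifier character is treated as an operator context).

```swift
enum SlashKind { case regexLiteralStart, binaryOperator }

// Decide how to tokenize a "/" from the character that precedes it.
func classifySlash(precededBy previous: Character?) -> SlashKind {
    // Start of file: nothing precedes the slash.
    guard let previous = previous else { return .regexLiteralStart }

    // "=", opening brackets, ":", ",", "?", ";" and (assumed) operator characters.
    let regexContexts: Set<Character> = [
        "=", "(", "[", "{", ":", ",", "?", ";",
        "+", "-", "*", "/", "%", "<", ">", "&", "|", "^", "~",
    ]
    // Everything else (identifiers, digits, closing brackets, quotes, "#",
    // ".", "!", "@", "\") keeps the operator interpretation.
    return regexContexts.contains(previous) ? .regexLiteralStart : .binaryOperator
}

// Find the last non-whitespace character before `index`, if any.
func previousSignificantCharacter(in source: String, before index: String.Index) -> Character? {
    source[..<index].last(where: { !$0.isWhitespace })
}

let assignment = "let x = /foo/"
let division = "let y = bar / foo"
for source in [assignment, division] {
    let slash = source.firstIndex(of: "/")!
    print(classifySlash(precededBy: previousSignificantCharacter(in: source, before: slash)))
}
```

A real lexer would look at the previous token rather than a single character, but the decision table has the same shape.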

Saklad5 · October 15, 2021, 12:53am

I’m deeply opposed to using / literal / as a literal. I feel that / is only indicative of regex in the context of regex. It is not immediately obvious to me that that is regex outside that context. / is primarily used as part of a comment marker or binary infix operator in Swift right now, and even reading this proposal I have trouble shaking that interpretation.

If we are going to have a specialized literal in Swift, we should follow current precedents and spell it out with #regex(literal). The additional verbosity is important, and it would make parsing far easier.

rintaro · October 15, 2021, 12:54am

As for the problem of escaping /: I realized we could make a rule that slashes enclosed in parens are not delimiters. E.g. /(?:/usr/bin)/ is a valid regex literal equivalent to qr{/usr/bin}. Not so cute, but I personally can live with this.
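A minimal sketch of that rule, assuming the scanner only needs to track parenthesis depth and backslash escapes (character classes such as [(] are ignored here, and the name closingDelimiterIndex is hypothetical):

```swift
// Return the offset of the closing "/" of a literal that starts at offset 0,
// treating any "/" inside parentheses as part of the pattern.
func closingDelimiterIndex(in literal: String) -> Int? {
    let chars = Array(literal)
    guard chars.first == "/" else { return nil }

    var depth = 0
    var i = 1
    while i < chars.count {
        switch chars[i] {
        case "\\":
            i += 1                        // skip the escaped character
        case "(":
            depth += 1
        case ")":
            depth = max(0, depth - 1)
        case "/" where depth == 0:
            return i                      // first slash outside any group
        default:
            break
        }
        i += 1
    }
    return nil                            // unterminated literal
}

print(closingDelimiterIndex(in: "/(?:/usr/bin)/") ?? -1)  // prints "13"
```

Without the depth check, the scan would stop at the slash before usr and the literal would be cut short.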

beccadax · October 15, 2021, 1:00am

I had basically the same question when I read over the pitch, and the ultimate answer is that the exact set of calls hasn't really been developed yet so this is sort of a placeholder. But I see the builder instance as a sort of context that can accumulate information about the literal on the side. The build* methods store information into the builder, then return values that are used to relate whatever they added to other parts of the literal, but the exact split of information is for the builder to decide. So, for instance, if we used this example from the pitch with different builder types:

  var builder = T.RegexLiteral()

  // __A4 = /([[:alpha:]]\w*)/
  let __A1 = builder.buildCharacterClass_POSIX_alpha()
  let __A2 = builder.buildCharacterClass_w()
  let __A3 = builder.buildConcatenate(__A1, __A2)
  let __A4 = builder.buildCaptureGroup(__A3)

  // __B1 = / = /
  let __B1 = builder.buildLiteral(" = ")

  // __C3 = /([0-9A-F]+)/
  let __C1 = builder.buildCustomCharacterClass(["0"..."9", "A"..."F"])
  let __C2 = builder.buildOneOrMore(__C1)
  let __C3 = builder.buildCaptureGroup(__C2)

  let __D1 = builder.buildConcatenate(__A4, __B1, __C3)
  builder.finalize(__D1)

Then one type's builder could maintain a list of rules and return indices from the build* methods which can be used to reference previous rules:

startingRule = Optional.some(8)
rules = [
    .characterClass([.posixAlpha]),             // 0
    .characterClass([.w]),                      // 1
    .sequence([0, 1]),                          // 2
    .capture(2),                                // 3
    .literal(" = "),                            // 4
    .characterClass(["0"..."9", "A"..."F"]),    // 5
    .repeat(5, 1 ..< .max),                     // 6
    .capture(6),                                // 7
    .sequence([3, 4, 7])                        // 8
]
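A rough sketch of a builder that produces a table like this one follows. The Rule cases and the RuleTableBuilder type are hypothetical, and the character-class methods from the pitch are collapsed into a single buildCharacterClass(_:) taking a string, purely to keep the example short.

```swift
// Each rule refers to earlier rules by index.
enum Rule: Equatable {
    case characterClass(String)
    case literal(String)
    case sequence([Int])
    case capture(Int)
    case oneOrMore(Int)
}

struct RuleTableBuilder {
    private(set) var rules: [Rule] = []
    private(set) var startingRule: Int? = nil

    init() {}

    // Append a rule and return its index as the handle for later calls.
    private mutating func add(_ rule: Rule) -> Int {
        rules.append(rule)
        return rules.count - 1
    }

    mutating func buildCharacterClass(_ name: String) -> Int { add(.characterClass(name)) }
    mutating func buildLiteral(_ text: String) -> Int { add(.literal(text)) }
    mutating func buildConcatenate(_ parts: Int...) -> Int { add(.sequence(parts)) }
    mutating func buildCaptureGroup(_ inner: Int) -> Int { add(.capture(inner)) }
    mutating func buildOneOrMore(_ inner: Int) -> Int { add(.oneOrMore(inner)) }
    mutating func finalize(_ root: Int) { startingRule = root }
}

var builder = RuleTableBuilder()
let a1 = builder.buildCharacterClass("[:alpha:]")   // 0
let a2 = builder.buildCharacterClass(#"\w"#)        // 1
let a3 = builder.buildConcatenate(a1, a2)           // 2
let a4 = builder.buildCaptureGroup(a3)              // 3
let b1 = builder.buildLiteral(" = ")                // 4
let c1 = builder.buildCharacterClass("0-9A-F")      // 5
let c2 = builder.buildOneOrMore(c1)                 // 6
let c3 = builder.buildCaptureGroup(c2)              // 7
let d1 = builder.buildConcatenate(a4, b1, c3)       // 8
builder.finalize(d1)

print(builder.startingRule ?? -1, builder.rules.count)  // prints "8 9"
```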

Another type could maintain a stack of regex fragments in the builder and return Void values from the build* methods, simply using the number of parameters to buildConcatenate(...) to figure out how many values to pop from the stack:

fragments = [#"[[:alpha:]]"#]
fragments = [#"[[:alpha:]]"#, #"\w"#]
fragments = [#"[[:alpha:]]\w"#]
fragments = [#"([[:alpha:]]\w)"#]
fragments = [#"([[:alpha:]]\w)"#, #" = "#]
fragments = [#"([[:alpha:]]\w)"#, #" = "#, #"[0-9A-F]"#]
fragments = [#"([[:alpha:]]\w)"#, #" = "#, #"[0-9A-F]+"#]
fragments = [#"([[:alpha:]]\w)"#, #" = "#, #"([0-9A-F]+)"#]
fragments = [#"([[:alpha:]]\w) = ([0-9A-F]+)"#]
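The stack-based variant can be sketched in the same spirit. FragmentStackBuilder is a hypothetical type, its build* methods return Void, and buildConcatenate uses only the count of its arguments to decide how many fragments to pop; as before, the character-class methods are collapsed into one for brevity.

```swift
struct FragmentStackBuilder {
    private(set) var fragments: [String] = []

    init() {}

    mutating func buildCharacterClass(_ source: String) { fragments.append(source) }
    mutating func buildLiteral(_ text: String) { fragments.append(text) }

    // The Void arguments carry no data; the operand is the top of the stack.
    mutating func buildOneOrMore(_: Void) {
        let inner = fragments.removeLast()
        fragments.append(inner + "+")
    }

    mutating func buildCaptureGroup(_: Void) {
        let inner = fragments.removeLast()
        fragments.append("(" + inner + ")")
    }

    // Pop as many fragments as there are arguments and push them back joined.
    mutating func buildConcatenate(_ parts: Void...) {
        let joined = fragments.suffix(parts.count).joined()
        fragments.removeLast(parts.count)
        fragments.append(joined)
    }
}

var builder = FragmentStackBuilder()
let a1: Void = builder.buildCharacterClass("[[:alpha:]]")
let a2: Void = builder.buildCharacterClass(#"\w"#)
let a3: Void = builder.buildConcatenate(a1, a2)
let a4: Void = builder.buildCaptureGroup(a3)
let b1: Void = builder.buildLiteral(" = ")
let c1: Void = builder.buildCharacterClass("[0-9A-F]")
let c2: Void = builder.buildOneOrMore(c1)
let c3: Void = builder.buildCaptureGroup(c2)
builder.buildConcatenate(a4, b1, c3)

print(builder.fragments)
```

The same sequence of calls drives both this builder and an index-based one, which is the flexibility the generated code is meant to allow.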

A third type could build up some sort of bytecode representation in the builder. A fourth could have nothing at all in the builder and just put all of the information in the return values and parameters. The point is, the code we generate would be flexible enough to support many different implementation approaches.

Saklad5 · October 15, 2021, 1:04am

While it may not matter in practice, I think it is worth noting that it is impossible to have an empty regex literal with / delimiters.

Cgaafary · October 15, 2021, 1:57am

I am in strong agreement with this response. Swift had an opportunity to “Think Different” and create a more modern, much clearer approach to pattern matching.

kiel · October 15, 2021, 2:36am

The opportunity remains. This regex pitch is just one tool for the tool belt:

sindresorhus · October 15, 2021, 7:08am

Any thoughts on supporting non-backtracking regexes like RE2 and Rust regex? Catastrophic backtracking and ReDoS is a big problem in the JavaScript community. The V8 JavaScript engine has an experimental l (linear) flag for this reason.

dlbuckley · October 15, 2021, 8:49am

There is definitely a need for first class regex support and I think the pitch is great for the most part. But the number of edge cases around using the / delimiter concerns me. While it might be prior art from other languages, we have an opportunity to forge our own path here and not follow the mistakes of old.

I'm not sure that I like the #regex(..) suggestion, as it feels like it reaches out of Swift into some other system, which isn't the case. Something short and concise is preferable, but then it should also be obvious.

johnno1962 · October 15, 2021, 9:09am

I'm as enthusiastic as anybody about there being a case for more support for using regular expressions in Swift, but I'm really not at all sure picking up the / delimiter syntax just because it has a precedent in Perl is a good direction lexically for Swift. At a minimum we should be talking about some sort of #/regex/# syntax, or I simply don't see how it is going to be parsable and unambiguous. This is from someone who coded Perl for a living for 15 years. Frequently you wanted to have / inside regexes, and I don't see how the lexer, on encountering a / followed by practically anything, could know it happens to be a regular expression.

Tino · October 15, 2021, 10:33am

I wonder why single quotation marks ('regex') have not been mentioned in this thread... wasn't this one of the possible cases to finally utilize this character for something?
AFAIR, the single quotation mark was considered to be too "valuable" for character literals, but where's the value in not using it at all?

hamishknight · October 15, 2021, 10:33am

allevato:

Looking through our code base, I see a handful of line-wrapped expressions of this form:
let result = (Double(someValue) - Double(someOtherValue))
  / Double(somethingElse) / someOtherThing
Would this be unambiguous because a regular expression literal wouldn't be juxtaposed with another identifier, so / Double(somethingElse) / someOtherThing must refer to division?

Yes this would currently be parsed as a binary operator sequence, as the preceding token is ) which is likely part of an expression, and therefore we'd determine that it shouldn't be immediately followed by another expression. If we decided to additionally change the parser behavior to consider regex literals that start on a new line, we could still disambiguate it by considering that the following token someOtherThing also cannot be sequenced with an expression.

allevato:

But let's tweak this a little bit:
let result: Double = (Double(someValue) - Double(someOtherValue))
  / Double(somethingElse) / .greatestFiniteMagnitude
Should this divide Double(somethingElse) by the inferred static property Double.greatestFiniteMagnitude, or be a regular expression with a member access to the instance property greatestFiniteMagnitude? This could be potentially resolved by removing the space after the second / (which would be my choice anyway), but it's a place where users don't have to do that today.

Yeah, this is a more tricky case. Currently we'd continue to parse as a binary operation due to the preceding ), but if we wanted a parser rule to consider regex literals starting a new line, it would likely change to parsing as a regex literal.

That being said, I'm not sure at this point whether we need the additional parsing rule to consider regex literals that start on a new line, as they can be disambiguated by using ; on the previous line. Outside of result builders I expect them to be fairly uncommon; they would usually come after e.g. = or as an argument to a function call.

I think most cases (at least those where the regex literal doesn't begin on a new line) can be disambiguated by looking only at the previous token. And I agree that if possible it would be great to not require checking one token ahead (mainly as it could cause an odd typing experience). That being said, I don't believe lexing one token ahead is completely untenable. If you're interested, I have a PR with a rough sketch of what the lexer behavior might look like (it's currently just hardcoding checks for specific tokens, but that would need to be formalized). It currently only lexes ahead a token if we're looking at a regex that starts on a new line, but it's possible we may not need to do that given that case should be uncommon outside of result builders.