SE-0355: Regex Syntax and Runtime Construction

Ben_Cohen · April 28, 2022, 2:24pm

Hello, Swift community.

The review of SE-0355: Regex Syntax and Runtime Construction begins now and runs through May 10, 2022.

This review is part of a collection of proposals for better string processing in Swift. The proposal authors have put together a proposal overview with links to in-progress pitches and reviews. This proposal introduces the syntax for creating a Regex type from a String or a literal. It will be run simultaneously with a proposal regarding adding literal syntax to the language.

As with the concurrency initiative last year, the core team acknowledges that reviewing a large number of interlinked proposals can be challenging. In particular, acceptance of one of the proposals should be considered provisional on future discussions of follow-on proposals that are closely related but have not yet completed the evolution review process. Similarly, reviewers should hold back on in-depth discussion of a subject of an upcoming review. Please do your best to review each proposal on its own merits, while still understanding its relationship to the larger feature.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to the review manager. If you do email me directly, please put "SE-0355" somewhere in the subject line.

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer in your review:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/main/process.md

As always, thank you for contributing to Swift.

Ben Cohen

Review Manager

benrimmington · April 29, 2022, 9:45pm

Multi-line regex literals enable the (?x) option by default.
Would it be possible to do the same when compiling multi-line strings?

Some of the terminology is inconsistent between the proposals.

Multi-line mode

SE-0354: either #/ followed by newline, which enables (?x) by default.
SE-0355: or the (?m) option, which changes the ^ and $ anchors.

Extended syntax

SE-0354: either #/…/# and ##/…/## (i.e. not bare) delimiters.
SE-0355: or the (?x) option, which allows non-semantic whitespace.
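
To make the distinction between the two options concrete, here is a minimal sketch using the run-time initializer. It assumes a Swift 5.7 or later toolchain; the variable names are only illustrative.

```swift
// (?x) enables extended syntax: whitespace inside the pattern is non-semantic.
let extended = try Regex("(?x) a b c")
let extendedMatches = "abc".contains(extended)

// (?m) enables multi-line mode: ^ and $ also match at line boundaries.
let anchored = try Regex("^b$")
let multiline = try Regex("(?m)^b$")
let anchoredMatches = "a\nb\nc".contains(anchored)    // "b" is not at the start of the input
let multilineMatches = "a\nb\nc".contains(multiline)  // "b" is a whole line of the input

print(extendedMatches, anchoredMatches, multilineMatches)
```

The two options are independent of each other: (?x) changes how the pattern is parsed, while (?m) changes what the anchors match.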

benlings · May 1, 2022, 7:07pm

It’s good to see the addition of APIs for accessing captures by name with AnyRegexOutput. I would have liked to have extra introspection APIs on Regex<AnyRegexOutput> to find out the names/numbers of captures (as discussed in the pitch thread), but appreciate that this is something that can be added later. The use case I see for this is for implementing search with user defined regexes, with either pre-defined behaviour for capture group names, or offering 'replace' functionality that can use capture groups.

I asked in the pitch thread about whether names mattered when converting to strongly typed output (example changed here to cast Regex, not Regex.Match, assuming the same rules would apply to both).

let regex = try! Regex("(?<name>abc)(de)")
let typed1 = regex.as((Substring, name: Substring, Substring).self) // Can cast with names?
let typed2 = regex.as((Substring, Substring, Substring).self) // Can cast without names? 
let typed3 = regex.as((whole: Substring, foo: Substring, bar: Substring).self) // Can cast with different names?

Taking this further, what happens with roundtripping via a strongly typed Output? Does it use the name from the original regex, or the name of the typed regex?

let regex = try! Regex("(?<name>abc)(de)")
let typed = regex.as((whole: Substring, foo: Substring, bar: Substring).self) 
let regex2 = Regex(typed)
regex2.contains(captureNamed: "name") // Uses original names?
regex2.contains(captureNamed: "foo") // Uses strongly typed names?

How is type compatibility handled for AnyRegexOutput to typed conversions of Regex/Match? eg

let regex = try! Regex("(abc)")
let typed1 = regex.as((Substring, Substring).self) // 'Intrinsic' type - succeeds
let typed2 = regex.as((Substring, Substring?).self) // OK? `Substring` is a subtype of `Substring?`
// Similarly for protocols for transformed capture types:
// (Substring, Foo) -> AnyRegexOutput -> (Substring, any Fooable)

As a more general point, it would be helpful to have some examples in the proposal of using AnyRegexOutput, with matches with quantified captures (abc)+, conversions, interaction with literals and the DSL.
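
As a rough illustration of the kind of example meant here, the sketch below reads captures from a run-time regex by name and by position. It assumes the AnyRegexOutput API as it appears in Swift 5.7 (a name subscript returning an optional element, and a `substring` property on each element). Typed conversion is deliberately left out, since its rules are what is being discussed.

```swift
// A regex compiled from a string has the type-erased output AnyRegexOutput.
let regex = try Regex("(?<name>abc)(de)")
let hasName = regex.contains(captureNamed: "name")

let match = try regex.wholeMatch(in: "abcde")
let byName = match?.output["name"]?.substring   // capture looked up by name
let byIndex = match?.output[2].substring        // capture looked up by position (0 is the whole match)

// With a quantified capture, the capture holds the last iteration.
let repeated = try Regex("(abc)+")
let lastIteration = try repeated.wholeMatch(in: "abcabc")?.output[1].substring

print(hasName, byName ?? "nil", byIndex ?? "nil", lastIteration ?? "nil")
```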

hamishknight · May 3, 2022, 8:10pm

I think that seems like a reasonable default behavior, as it seems likely you would want to use \n if you want literal newlines. We could provide API to customize the behavior, but that can be considered future work.

Thanks for pointing that out! In relation to the literals, extended and multi-line should be "extended literals" and "multi-line literals" respectively.

ensan-hcl · May 5, 2022, 11:14pm

Can't we clean up messy character classes in regex literals? I don't think it's worth having three different ways to match digit characters (\d, [:digit:], and \p{Digit}), especially for regex literals.

Ben_Cohen · May 5, 2022, 11:38pm

Since this is about the regex syntax, I moved it to the more appropriate parallel review.

I think one thing that is likely considered important is that the literals match what the runtime regex construction can do. So while it's worth discussing which character classes should be supported, the literal won't support more/less than that.

ensan-hcl · May 5, 2022, 11:51pm

Can you elaborate on why? I thought regex literals and the runtime-constructed, String-based API serve different purposes. I don’t think we have to make the syntax equal.

Michael_Ilseman · May 6, 2022, 12:22pm

I addressed your points in the literal thread regarding type system features like named captures. We do not typically make features arbitrarily worse without technical reasons or rationale. What is your rationale for making literals worse than run-time strings regarding accepted syntax?

ensan-hcl · May 6, 2022, 12:36pm

I don't think my suggestion makes regex literals 'worse'. I just proposed cleaning up, for example by merging \d, [:digit:], and \p{Digit} into a single [:digit:] expression.
(Possibly I was wrong in my choice of the term "clean up". I meant "make simpler".)

From what I have read in the SE-0355 proposal, portability is a major goal of the String-based API. I think supporting all these expressions is reasonable for this API. However, regex literals can be simpler, so that we can limit the increase in the language's surface area. Simpler syntax enables more consistent use of regex literals. Generally speaking, that is better, not worse.

Michael_Ilseman · May 6, 2022, 12:38pm

This is future work, but we could vend something similar to:


extension Regex {
  var captureList: some Collection<CaptureDescription>

  struct CaptureDescription {
    var name: String?
    var type: Any.Type // Default is Substring.self
    var isOptional: Bool
  }
}

Sorry, this area is getting less attention than some of the more controversial and pressing regex needs.

Regex has access to the AST and answers queries such as contains(captureNamed:), so I think we want the behavior of requiring tuple labels. And similarly, we'd want to preserve optionality depth.

I think an argument could be made that casting AnyRegexOutput should follow Swift's looser tuple casting rules by allowing labels to be added or removed. What do you think of that view?

If we're clear on the desired behavior for Regex and AnyRegexOutput, but Match is still murky, we can defer Match as future work.

Definitely, working on that for testing purposes as well.

Michael_Ilseman · May 6, 2022, 12:45pm

Ah, then I think you've come to the right thread. See Swift canonical syntax.

[:digit:] would be a poor choice of canonical form. From the proposal:

Character properties can be spelled \p{...} or [:...:] . We recommend preferring \p{...} as the bracket syntax historically meant POSIX-defined character classes, and still has that connotation in some engines. The spelling of properties themselves can be fuzzy and we (weakly) recommend the shortest spelling (no opinion on casing yet). For script extensions, we (weakly) recommend e.g. \p{Greek} instead of \p{Script_Extensions=Greek} . We would like more discussion with the community here.

I don't think we'd want to canonicalize \d to \p{digit} though, as \d is so widely precedented and shorthand spellings are beneficial for regex.

I still don't understand. Why should we make the literal under-featured without a technical need to do so? If you don't like literals in your code or certain spellings, that's a linting problem.

ensan-hcl · May 6, 2022, 1:26pm

Thank you. In that case, I'm suggesting that regex literals support only the hypothetical 'Swift canonical syntax'. Even though it's a literal, we accept the syntax as part of Swift. As far as I know from the Swift Evolution discussions so far, the principle is that Swift's syntax should be as simple and consistent as possible unless the complexity or inconsistency is useful enough to justify it. Because portability matters less for regex literals than for the String-based API, I believe supporting complex syntax in regex literals is not desirable enough.

But hmmmm, if introducing a messy syntax is just a "linting problem", then we should introduce every proposed syntactic sugar as long as there are no "technical" difficulties. We wouldn't have needed the long, long discussion about the if let x syntax. If you don't like a proposed syntax, you can just remove it with a linter.

Zollerboy1 · May 6, 2022, 2:03pm

I think it would be weird not to support all the regex syntax in the literals, because then we lose the ability to copy-paste a regex from some other language into our code, and that would be a shame. We could very well emit a warning, though, when the regex in a literal uses constructs that are not considered part of the Swift canonical syntax. That would be my preferred way of keeping regex literals somewhat consistent.

jaredgrubb · May 6, 2022, 4:48pm

I think this is fair, but that's the goal of the DSL version: an expressive description of a regex that is kind to the human eyes.

It's a feature that the string-form conforms as closely to some "standard" as possible without forking new extensions. I already find it difficult enough to keep track of which programs want me to type "\(" and which want just "(". We definitely shouldn't try to invent yet-another-dialect that is Swiftier, IMO.

Michael_Ilseman · May 6, 2022, 7:04pm

Beyond the strawperson argument and the strained analogy, future extensibility is a technical difficulty introduced by excessive sugar. Swift is a new(ish) programming language and as such has more syntactic growth in its future. It's also a programming language. Regex syntax is already well established externally to the Swift project and doesn't have nearly as much syntactic growth in its future.

But you do bring up a good point though, we should consider whether a syntactic restriction on literals would allow us to make them even more appealing. There are so many more things expressible in the DSL than a literal, and we should close that gap by making literals more powerful. We can do this by reserving some syntax for literals specifically for future improvements.

If we can reasonably reserve non-meta-character < and > (meaning outside of named-capture syntax) inside the literals, we could use them for interpolation in the future. Any RegexComponent can be used in an interpolation, bringing the literals important expressive power currently only available in the builder DSL.

This would mean that /prefix<dateParser>suffix/ would be roughly equivalent to Regex { /prefix/ ; dateParser ; /suffix/ }
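
For comparison, the builder form on the right-hand side can already be written today. The sketch below uses string components instead of regex literals so that it compiles without extra compiler flags, and `dateParser` is a hypothetical stand-in defined only for illustration. The `<dateParser>` interpolation syntax itself remains a future possibility and is not shown.

```swift
import RegexBuilder

// Hypothetical stand-in for `dateParser`: digits separated by dashes, e.g. 2022-05-06.
let dateParser = Regex {
    OneOrMore(.digit)
    "-"
    OneOrMore(.digit)
    "-"
    OneOrMore(.digit)
}

// Builder DSL composition: any RegexComponent can appear between the literal parts.
let combined = Regex {
    "prefix"
    dateParser
    "suffix"
}

let composedMatches = "prefix2022-05-06suffix".wholeMatch(of: combined) != nil
let composedRejects = "prefix-suffix".wholeMatch(of: combined) == nil
print(composedMatches, composedRejects)
```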

This would also enable useful source tool refactorings like converting portions of a DSL back into a literal for brevity. This is only possible to a very limited extent currently, but interpolation would open it up to the broader case.

We could also reserve <{ ... }> for an in-line closure that receives something from the engine (e.g. current capture state or a direct interface to the engine itself), etc.

If we can't reserve plain '<' and '>', we should definitely still reserve <{ ... }>.

Thank you for the inspiration!

Michael_Ilseman · May 6, 2022, 7:13pm

I'm hesitant to add those warnings until we have a way to suppress them. If the goal is to keep them in their original form and they're not actively harmful/ambiguous/misleading, we wouldn't want to keep throwing up warnings. If we do have an acceptable suppression mechanism, then they're like advisory notes with fixits to convert into canonical syntax. The conversion is feasible by pretty-printing the AST.

benlings · May 7, 2022, 8:49pm

That would be great, and fully understand that it's future (possible) work.

My question about converting to tuple-of-optional (Regex<AnyRegexOutput> -> Regex<(Substring, Substring?)>) and changing the names was about whether this would follow Swift's existing tuple casting rules, or would be more strict.

I don't think this is directly possible because tuple casting also allows changing names:

let a = (foo: 1, "2")
let b: (Int, String) = a
let c: (Int, bar: String) = b

I can see two possible models for names:

Capture names are intrinsic to the regex (as stored in the AST). When casting, only allow casts that keep the same names as defined. This would mean that contains(captureNamed:) has the same behaviour after casting.
Treat names as just labels for capture indexes. The AST stores only the indexes and the regex then has an additional mapping of name to index. This would allow casts to strongly typed tuples with different / no labels, and this would change/remove the name-index mapping. contains(captureNamed:) would have different behaviour after casting.
This might be more practical. For example, if there were an API that had names in its captures: func foo(regex: Regex<(Substring, bar: Substring, baz: Substring)>), they would be treated as documentation. It would be possible to pass a Regex<AnyRegexOutput> to this function, as long as it had two captures.

Similarly for types:

These could either require an exact type match (i.e. can only convert to a tuple with the exact match types), or could follow Swift's tuple subtyping rules. I think the latter is probably the right option, for the same reasons as above. If an API takes Regex<(Substring, Substring?, Substring?)>, passing in a Regex<AnyRegexOutput> with 2 non-optional captures should be fine - the 'optional' captures will always have a value.

The difference I see with types, compared to labels, is that the level of optionality would be retained in the AST and just have a super type expressed in the static type.

Michael_Ilseman · May 8, 2022, 3:03am

For Regex, it seems desirable to be strict about both names and optionality. We could consider a strippingCaptureNames member on Regex as well as AnyRegexOutput to allow people to cast by shape instead.

For AnyRegexOutput, I could see arguments either way. We have enough information to reconstruct tuple labels. In fact, we could consider having the (optional) name be a member of AnyRegexOutput.Element.

I'm not sure about an implicit lift operation, seems a little weird to me, but not being able to do so is also weird. This should be possible via an explicit map. Relatedly, @nnnnnnnn, how do map and tuple labels interact?

christopherweems · May 9, 2022, 9:26pm

I have a bit of API that needs to prevent an end-user from injecting arbitrary symbols into a runtime-compiled Regex. The API I'm trying to implement exposes a named capture as part of its type-checked input, leaving RegexBuilder out as a possibility for this implementation.

A straightforward, but unsafe example:

func buildPrefixExpression(_ prefix: String) throws -> Regex<(Substring, suffix: Substring)> {
    try Regex("\(prefix)(?<suffix>.*)") // unsafe: `prefix` could be any bit of regex
}

Sanitizing input before compiling seems like it will be a common requirement across the ecosystem, so one might expect there to be some implementation of ExpressibleByStringInterpolation such that this could be written:

func buildPrefixExpression(_ prefix: String) throws -> Regex<(Substring, suffix: Substring)> {
    try Regex("\(verbatim: prefix)(?<suffix>.*)") // safe: `prefix` cannot inject regex symbols here
}

It seems like we're partially there with Regex.init(verbatim:), but I can't make it work for this use case.
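
One possible stopgap, sketched below, is to escape the prefix before interpolating it. This borrows Foundation's NSRegularExpression.escapedPattern(for:), which is not part of the proposal, and it assumes the backslash escapes it produces are also accepted by Regex. The function returns Regex<AnyRegexOutput> here, so the typed (Substring, suffix: Substring) output is not covered by this sketch.

```swift
import Foundation

// Assumption: escapes produced by NSRegularExpression are valid in Regex syntax too.
func buildPrefixExpression(_ prefix: String) throws -> Regex<AnyRegexOutput> {
    // Backslash-escape regex metacharacters so `prefix` is matched literally.
    let escaped = NSRegularExpression.escapedPattern(for: prefix)
    return try Regex("\(escaped)(?<suffix>.*)")
}

let prefixRegex = try buildPrefixExpression("a.b")

// The "." in the prefix only matches a literal ".", not any character.
let literalSuffix = try prefixRegex.wholeMatch(in: "a.bXYZ")?.output["suffix"]?.substring
let injectedMatch = try prefixRegex.wholeMatch(in: "aXbXYZ")

print(literalSuffix ?? "nil", injectedMatch == nil)
```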

hamishknight · May 10, 2022, 7:33pm

After exploring this further, we're going to leave the default behavior of compiling a multi-line regex with Regex(...) as matching a multi-line input (i.e. the newlines remain semantic). This is consistent with engines such as PCRE, ICU, and Oniguruma, none of which infer (?x) based on the contents of the pattern, and treat newlines literally by default.

While inferring (?x) would be more consistent with multi-line literals, we feel that the behavior would be even more subtle as you wouldn't have to use a particular delimiter, e.g. it would apply to "a\nb" as well as:

"""
a
b
"""

Additionally it would mean that Regex("a\nb") would have a subtly different behavior to Regex(#"a\nb"#), which seems undesirable. Furthermore, the regex might be provided externally (e.g through user input), and may not therefore be aware of such inference behavior.

Users can always begin the regex with (?x) to explicitly enable extended syntax, and we can explore adding API to customize the default matching option behavior as future work.
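
A small sketch of the behavior described above, assuming a Swift 5.7 or later toolchain (variable names are illustrative): newlines in a pattern compiled with Regex(...) stay semantic unless the pattern opts into (?x) itself.

```swift
// Newlines in the pattern are matched literally by default.
let literalNewlines = try Regex("""
a
b
""")
let matchesTwoLines = "a\nb".contains(literalNewlines)
let matchesJoined = "ab".contains(literalNewlines)

// Opting into extended syntax makes whitespace, including newlines, non-semantic.
let extendedPattern = try Regex("""
(?x)
a
b
""")
let extendedMatchesJoined = "ab".contains(extendedPattern)

// An actual newline character and the escape sequence \n match the same input.
let fromEscapedString = try Regex("a\nb")
let fromRawString = try Regex(#"a\nb"#)
let sameBehavior = "a\nb".contains(fromEscapedString) && "a\nb".contains(fromRawString)

print(matchesTwoLines, matchesJoined, extendedMatchesJoined, sameBehavior)
```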