SE-0355: Regex Syntax and Runtime Construction

I think it would be weird not to support all of the regex syntax in literals, because then we lose the ability to copy-paste a regex from some other language into your code, and that would be a shame. We could very well emit a warning, though, when a regex literal uses constructs that are not considered part of the Swift canonical syntax. That would be my preferred way of keeping regex literals somewhat consistent.

I think this is fair, but that's the goal of the DSL version: an expressive description of a regex that is kind to human eyes.

It's a feature that the string-form conforms as closely to some "standard" as possible without forking new extensions. I already find it difficult enough to keep track of which programs want me to type "\(" and which want just "(". We definitely shouldn't try to invent yet-another-dialect that is Swiftier, IMO.

Beyond the strawperson argument and the strained analogy, future extensibility is a technical difficulty introduced by excessive sugar. Swift is a new(ish) programming language and as such has more syntactic growth in its future. It's also a programming language. Regex syntax is already well established externally to the Swift project and doesn't have nearly as much syntactic growth in its future.

You do bring up a good point, though: we should consider whether a syntactic restriction on literals would allow us to make them even more appealing. There are so many more things expressible in the DSL than in a literal, and we should close that gap by making literals more powerful. We can do this by reserving some syntax in literals specifically for future improvements.

If we can reasonably reserve non-meta-character < and > (meaning outside of named-capture syntax) inside the literals, we could use them for interpolation in the future. Any RegexComponent can be used in an interpolation, bringing the literals important expressive power currently only available in the builder DSL.

This would mean that /prefix<dateParser>suffix/ would be roughly equivalent to Regex { /prefix/ ; dateParser ; /suffix/ }
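
As a rough sketch of the builder form that such an interpolation would be equivalent to (this is the composition that works today; the interpolated literal itself is only a possible future direction). The `dateParser` below is a hypothetical stand-in, and the extended `#/.../#` delimiters are used only so the snippet does not depend on the bare-slash compiler flag:

```swift
import RegexBuilder

// Hypothetical stand-in for `dateParser`; any RegexComponent could sit here.
let dateParser = Regex {
    OneOrMore(.digit)
    "-"
    OneOrMore(.digit)
}

// Roughly what /prefix<dateParser>suffix/ would mean if '<' and '>' were
// reserved for interpolation inside literals.
let combined = Regex {
    #/prefix/#
    dateParser
    #/suffix/#
}

// The component is matched as a regex, not as the literal text "<dateParser>".
let accepted = "prefix2022-05suffix".wholeMatch(of: combined) != nil
let rejected = "prefix<dateParser>suffix".wholeMatch(of: combined) != nil
```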

This would also enable useful source tool refactorings like converting portions of a DSL back into a literal for brevity. This is only possible to a very limited extent currently, but interpolation would open it up to the broader case.

We could also reserve <{ ... }> for an in-line closure that receives something from the engine (e.g. current capture state or a direct interface to the engine itself), etc.

If we can't reserve plain '<' and '>', we should definitely still reserve <{ ... }>.

Thank you for the inspiration!

I'm hesitant to add those warnings until we have a way to suppress them. If the goal is to keep them in their original form and they're not actively harmful/ambiguous/misleading, we wouldn't want to keep throwing up warnings. If we do have an acceptable suppression mechanism, then they're like advisory notes with fixits to convert into canonical syntax. The conversion is feasible by pretty-printing the AST.

That would be great, and fully understand that it's future (possible) work.

My question about converting to tuple-of-optional (Regex<AnyRegexOutput> -> Regex<(Substring, Substring?)>) and changing the names was about whether this would follow Swift's existing tuple casting rules, or would be more strict.

I don't think this is directly possible because tuple casting also allows changing names:

let a = (foo: 1, "2")
let b: (Int, String) = a
let c: (Int, bar: String) = b

I can see two possible models for names:

  1. Capture names are intrinsic to the regex (as stored in the AST). When casting, only allow casts that keep the same names as defined. This would mean that contains(captureNamed:) has the same behaviour after casting.

  2. Treat names as just labels for capture indexes. The AST stores only the indexes and the regex then has an additional mapping of name to index. This would allow casts to strongly typed tuples with different / no labels, and this would change/remove the name-index mapping. contains(captureNamed:) would have different behaviour after casting.
    This might be more practical. For example, if there were an API that had names in its captures: func foo(regex: Regex<(Substring, bar: Substring, baz: Substring)>), they would be treated as documentation. It would be possible to convert a Regex<AnyRegexOutput> and pass it to this function, as long as it had two captures.
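
To make the question concrete, here is a minimal sketch of the conversion being discussed, assuming the failable Regex(_:as:) initializer and contains(captureNamed:) as proposed. The pattern and input are made up for illustration; whether the conversion should also succeed with different or missing labels is exactly the open point between the two models, so the sketch only uses matching labels:

```swift
// A regex compiled at run time has the type-erased output AnyRegexOutput.
let erased: Regex<AnyRegexOutput> = try! Regex(#"(?<bar>\w+)-(?<baz>\w+)"#)

// Name lookup is answered from the compiled regex itself.
let hasBar = erased.contains(captureNamed: "bar")
let hasQux = erased.contains(captureNamed: "qux")

// Conversion to a strongly typed regex whose labels match the capture names.
// The initializer is failable, so an incompatible shape yields nil.
let typed = Regex<(Substring, bar: Substring, baz: Substring)>(erased)

if let typed, let match = "left-right".wholeMatch(of: typed) {
    // Labeled tuple elements are available on the typed match.
    print(match.bar, match.baz)
}
```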

Similarly for types:

These could either require an exact type match (i.e. can only convert to a tuple with exactly matching types), or could follow Swift's tuple subtyping rules. I think the latter is probably the right option, for the same reasons as above. If an API takes Regex<(Substring, Substring?, Substring?)>, passing in a Regex<AnyRegexOutput> with 2 non-optional captures should be fine - the 'optional' captures will always have a value.

The difference I see with types, compared to labels, is that the level of optionality would be retained in the AST and just have a super type expressed in the static type.
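
For reference, a small sketch of how capture optionality shows up in the static type today. The patterns are illustrative only; the extended `#/.../#` delimiters avoid any dependence on the bare-slash flag:

```swift
// Captures inside an alternation may not participate in a match,
// so the literal's output type marks them as optional.
let either: Regex<(Substring, Substring?, Substring?)> = #/(a)|(b)/#

// Captures that always participate are non-optional.
let both: Regex<(Substring, Substring, Substring)> = #/(a)(b)/#

let eitherMatch = "a".wholeMatch(of: either)
let firstCapture = eitherMatch?.1   // "a"
let secondCapture = eitherMatch?.2  // nil, the second branch did not match

let bothMatch = "ab".wholeMatch(of: both)
let secondOfBoth = bothMatch?.2     // "b"
```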

For Regex, it seems desirable to be strict about both names and optionality. We could consider a strippingCaptureNames member on Regex as well as AnyRegexOutput to allow people to cast by shape instead.

For AnyRegexOutput, I could see arguments either way. We have enough information to reconstruct tuple labels. In fact, we could consider having the (optional) name be a member of AnyRegexOutput.Element.

I'm not sure about an implicit lift operation; it seems a little weird to me, but not being able to do so is also weird. This should be possible via an explicit map. Relatedly, @nnnnnnnn, how do map and tuple labels interact?

I have a bit of API that needs to prevent an end-user from injecting arbitrary symbols into a runtime-compiled Regex. The API I'm trying to implement exposes a named capture as part of its type-checked input, leaving RegexBuilder out as a possibility for this implementation.

A straightforward, but unsafe example:

func buildPrefixExpression(_ prefix: String) throws -> Regex<(Substring, suffix: Substring)> {
    try Regex("\(prefix)(?<suffix>.*)") // unsafe: `prefix` could be any bit of regex
}

Sanitizing input before compiling seems likely to be a common requirement across the ecosystem, so one might expect Regex to provide some implementation of ExpressibleByStringInterpolation such that this could be written:

func buildPrefixExpression(_ prefix: String) throws -> Regex<(Substring, suffix: Substring)> {
    try Regex("\(verbatim: prefix)(?<suffix>.*)") // safe: `prefix` cannot inject regex symbols here
}

It seems like we're partially there with Regex.init(verbatim:), but I can't make it work for this use case.

After exploring this further, we're going to leave the default behavior of compiling a multi-line regex with Regex(...) as matching a multi-line input (i.e. the newlines remain semantic). This is consistent with engines such as PCRE, ICU, and Oniguruma, none of which infer (?x) based on the contents of the pattern, and all of which treat newlines literally by default.

While inferring (?x) would be more consistent with multi-line literals, we feel that the behavior would be even more subtle as you wouldn't have to use a particular delimiter, e.g. it would apply to "a\nb" as well as:

"""
a
b
"""

Additionally, it would mean that Regex("a\nb") would have subtly different behavior from Regex(#"a\nb"#), which seems undesirable. Furthermore, the regex might be provided externally (e.g. through user input), and its author may therefore not be aware of such inference behavior.

Users can always begin the regex with (?x) to explicitly enable extended syntax, and we can explore adding API to customize the default matching option behavior as future work.
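
A short sketch of the behavior described above, assuming the run-time Regex(_:) initializer as proposed (the variable names are only for illustration):

```swift
// The ordinary string literal contains a real newline character,
// which stays semantic: the pattern matches "a", a newline, then "b".
let literalNewline: Regex<AnyRegexOutput> = try! Regex("a\nb")

// The raw string contains a backslash followed by "n", which the regex
// parser reads as a newline escape. Same matching behavior as above.
let escapedNewline: Regex<AnyRegexOutput> = try! Regex(#"a\nb"#)

// Opting in explicitly with (?x) makes whitespace in the pattern,
// including the newline, non-semantic.
let extended: Regex<AnyRegexOutput> = try! Regex("(?x)a\nb")

let literalMatchesTwoLines = "a\nb".wholeMatch(of: literalNewline) != nil
let escapedMatchesTwoLines = "a\nb".wholeMatch(of: escapedNewline) != nil
let literalMatchesJoined = "ab".wholeMatch(of: literalNewline) != nil
let extendedMatchesJoined = "ab".wholeMatch(of: extended) != nil
```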

You can work around it, poorly, by making sure prefix doesn't contain the subsequence \E and wrapping it in a \Q...\E.
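
A minimal sketch of that workaround, assuming the \Q...\E quoting from the proposed syntax. The error type is hypothetical and only there to reject inputs that would terminate the quote early:

```swift
import Foundation

// Hypothetical error for prefixes that cannot be quoted safely.
enum PrefixError: Error {
    case containsQuoteTerminator
}

func buildPrefixExpression(_ prefix: String) throws -> Regex<(Substring, suffix: Substring)> {
    // A literal \E inside the prefix would end the quoted run early,
    // letting the rest of the prefix be parsed as regex syntax.
    guard !prefix.contains(#"\E"#) else {
        throw PrefixError.containsQuoteTerminator
    }
    // Everything between \Q and \E is matched verbatim.
    return try Regex(#"\Q"# + prefix + #"\E(?<suffix>.*)"#)
}

let dotted = try! buildPrefixExpression("a.b")
let verbatimSuffix = "a.bcd".wholeMatch(of: dotted)?.output.suffix
// The "." is quoted, so it does not act as a wildcard.
let wildcardMatch = "axbcd".wholeMatch(of: dotted)
let rejectsTerminator = (try? buildPrefixExpression(#"x\Ey"#)) == nil
```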

I think the better general solution is to support regex interpolations, which is future work.

My understanding (please correct me if I'm wrong @rxwei @nnnnnnnn) is that this is possible with the soon-to-be-revised DSL's mapOutput:

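import RegexBuilder

func buildPrefixExpression(_ prefixStr: String) throws -> Regex<(Substring, suffix: Substring)> {
    Regex {
        prefixStr            // a String component is matched verbatim
        Capture { /.*/ }
    }.mapOutput {
        ($0, suffix: $1)     // attach the `suffix` label to the capture
    }
}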

That's right — this kind of control over composition is one of the primary motivations for creating the RegexBuilder approach to building regexes.

That's fantastic!

.mapOutput(..) will be a heavy hitter for loads of regex code for sure.

I fully support the work being done here and it looks really good. But I can't vote on it. I can't provide valid detailed feedback on the proposal because of my limited exposure to, and lack of an actual use case for, many of the advanced and somewhat problematic corners of regex syntax, the unification, and the full Unicode adoption effort. A huge amount of work has been done, but I am afraid it might be too soon to commit to this at the standard library level and make it subject to source break rules.

I didn't get a chance to read the responses, so please accept my apology if this question is duplicate:

If this proposal is accepted and released, are we going to be locked out of breaking changes to the syntax until Swift 7? Strings with this literal syntax might be stored externally. If we do make a breaking change, will the compiler and Xcode be able to help migrate the existing strings, especially if we create the string at runtime using literal string fragments plus dynamic runtime information (such as a user-provided word to match)?

No. There are several mechanisms available that could assist us in doing a migration if we had to (though I don't think that we will). The first one that came to mind is that rather than migrate existing strings, we would continue to support the existing syntax via a labeled Regex(swift5_7syntax: pattern) or similar, and migrate existing unlabeled inits to that via tooling. I can think of a few other ways to address it as well, so I do not believe we would have painted ourselves into a corner.

Good to hear. How about the ABI?

We'd be able to do a similar thing at the ABI level so that already-compiled code continued to see the same behavior.

Great. In that case I am fully +1 on this.

Apologies for another extremely late review.

My primary concern is the proposal's adherence to the group numbering convention.

I'm not aware of the historical reasons for this convention, but I suspect it was because it was good enough for people back then without overly complicating regex parsing, perhaps out of concern for technical constraints at the time. I don't know; I'm only speculating.

Regardless of what the reasons might have been, I don't think it's good to stay with this convention for nested groups. Most human eyes/brains are not good at counting (this is why we have things like rainbow parentheses), which means the numbering is an error-prone area. Especially with only-Substring captures, it could be difficult to find wrong numbering until the program is run. Additionally, with this linear numbering, editing a group may very likely result in editing many unrelated match calls. These match calls could be very far away from the regex pattern definition, maybe even in different projects, and thus very difficult to keep track of and update for numbering changes. This seems to go contrary to Swift's stance on good local reasoning.

Perhaps nested numbering via nested tuples is a better solution for nested groups?
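
To illustrate the concern, a small sketch of how nested groups are numbered linearly, by the position of their opening parenthesis, in the flat output tuple. The pattern is made up for illustration:

```swift
// Nested groups are flattened into a single tuple:
// .1 is the outer group, .2 and .3 are its children, .4 is the last group.
let nested = #/((a)(b))(c)/#

let numbered = "abc".wholeMatch(of: nested)
let outer = numbered?.1    // "ab"
let innerA = numbered?.2   // "a"
let innerB = numbered?.3   // "b"
let last = numbered?.4     // "c"
```

Wrapping or removing a group earlier in the pattern shifts every later index, which is the non-local editing cost described above.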

Review Conclusion

The proposal has been accepted.
