SE-0355: Regex Syntax and Runtime Construction

This is future work, but we could vend something similar to:


extension Regex {
  var captureList: some Collection<CaptureDescription>

  struct CaptureDescription {
    var name: String?
    var type: Any.Type // Default is Substring.self
    var isOptional: Bool
  }
}

Sorry, this area is getting less attention than some of the more controversial and pressing regex needs.

Regex has access to the AST and answers queries such as contains(captureNamed:), so I think we want the behavior of requiring tuple labels. And similarly, we'd want to preserve optionality depth.
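For reference, a minimal sketch of the name query mentioned above, assuming the Swift 5.7 standard library; the pattern and the capture names year, month, and day are made up for illustration:

```swift
// Compile a pattern with two named captures at run time.
// The result is a type-erased Regex<AnyRegexOutput>.
let regex = try Regex(#"(?<year>\d{4})-(?<month>\d{2})"#)

// The regex can answer name queries without any static type information.
let hasYear = regex.contains(captureNamed: "year")
let hasDay = regex.contains(captureNamed: "day")
print(hasYear, hasDay)
```

Whether these answers should survive a cast to a tuple with different labels is exactly the open question here.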

I think an argument could be made that casting AnyRegexOutput should follow Swift's looser tuple casting rules by allowing labels to be added or removed. What do you think of that view?

If we're clear on the desired behavior for Regex and AnyRegexOutput, but Match is still murky, we can defer Match as future work.

Definitely, working on that for testing purposes as well.

Ah, then I think you've come to the right thread. See Swift canonical syntax.

[:digit:] would be a poor choice of canonical form. From the proposal:

Character properties can be spelled \p{...} or [:...:] . We recommend preferring \p{...} as the bracket syntax historically meant POSIX-defined character classes, and still has that connotation in some engines. The spelling of properties themselves can be fuzzy and we (weakly) recommend the shortest spelling (no opinion on casing yet). For script extensions, we (weakly) recommend e.g. \p{Greek} instead of \p{Script_Extensions=Greek} . We would like more discussion with the community here.

I don't think we'd want to canonicalize \d to \p{digit} though, as \d is so widely precedented and shorthand spellings are beneficial for regex.

I still don't understand. Why should we make the literal under-featured without a technical need to do so? If you don't like literals in your code or certain spellings, that's a linting problem.

Thank you. What I'm suggesting, then, is to support only the hypothetical 'Swift canonical syntax' in regex literals. Even though it's a literal, we are accepting its syntax as part of Swift. As far as I understand the Swift Evolution discussions so far, the principle is that Swift's syntax should be as simple and consistent as possible unless the added complexity or inconsistency is useful enough to justify it. Because portability matters less for regex literals than for the String-based API, I don't believe supporting a complex syntax in regex literals is desirable enough.

But hmmmm, if introducing a messy syntax is just a "linting problem", then we should introduce every proposed syntactic sugar as long as there are no "technical" difficulties. We wouldn't have needed the long, long discussion about the if let x syntax. If you don't like a proposed syntax, you can just remove it with a linter.

I think it would be weird not to support all the regex syntax in the literals, because then we lose the ability to copy-paste a regex from some other language into your code, and that would be a shame. We could very well emit a warning, though, when the regex in a literal uses constructs that are not considered part of the Swift canonical syntax. That would be my preferred way of keeping regex literals somewhat consistent.

I think this is fair, but that's the goal of the DSL version: an expressive description of a regex that is kind to the human eyes.

It's a feature that the string-form conforms as closely to some "standard" as possible without forking new extensions. I already find it difficult enough to keep track of which programs want me to type "\(" and which want just "(". We definitely shouldn't try to invent yet-another-dialect that is Swiftier, IMO.

Beyond the strawperson argument and the strained analogy, future extensibility is a technical difficulty introduced by excessive sugar. Swift is a new(ish) programming language and as such has more syntactic growth in its future. It's also a programming language. Regex syntax is already well established externally to the Swift project and doesn't have nearly as much syntactic growth in its future.

You do bring up a good point, though: we should consider whether a syntactic restriction on literals would allow us to make them even more appealing. There are so many more things expressible in the DSL than in a literal, and we should close that gap by making literals more powerful. We can do this by reserving some syntax in literals specifically for future improvements.

If we can reasonably reserve non-meta-character < and > (meaning outside of named-capture syntax) inside the literals, we could use them for interpolation in the future. Any RegexComponent can be used in an interpolation, bringing the literals important expressive power currently only available in the builder DSL.

This would mean that /prefix<dateParser>suffix/ would be roughly equivalent to Regex { /prefix/ ; dateParser ; /suffix/ }
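As a rough sketch of the builder side of that equivalence as it can be written today, assuming Swift 5.7 with RegexBuilder: dateParser below is a hypothetical stand-in for any RegexComponent, and plain strings are used in place of the /prefix/ and /suffix/ literals so the example does not depend on bare-slash literal support.

```swift
import RegexBuilder

// Hypothetical component standing in for `dateParser`; any RegexComponent would do.
let dateParser = Regex {
    OneOrMore(.digit)
    "-"
    OneOrMore(.digit)
    "-"
    OneOrMore(.digit)
}

// Builder spelling of what /prefix<dateParser>suffix/ might mean.
let combined = Regex {
    "prefix"
    dateParser
    "suffix"
}

// The embedded component participates in matching like any other part.
let matched = "prefix2022-05-01suffix".wholeMatch(of: combined) != nil
print(matched)
```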

This would also enable useful source tool refactorings like converting portions of a DSL back into a literal for brevity. This is only possible to a very limited extent currently, but interpolation would open it up to the broader case.

We could also reserve <{ ... }> for an in-line closure that receives something from the engine (e.g. current capture state or a direct interface to the engine itself), etc.

If we can't reserve plain '<' and '>', we should definitely still reserve <{ ... }>.

Thank you for the inspiration!

I'm hesitant to add those warnings until we have a way to suppress them. If the goal is to keep them in their original form and they're not actively harmful/ambiguous/misleading, we wouldn't want to keep throwing up warnings. If we do have an acceptable suppression mechanism, then they're like advisory notes with fixits to convert into canonical syntax. The conversion is feasible by pretty-printing the AST.

That would be great, and fully understand that it's future (possible) work.

My question about converting to tuple-of-optional (Regex<AnyRegexOutput> -> Regex<(Substring, Substring?)>) and changing the names was about whether this would follow Swift's existing tuple casting rules, or would be more strict.

I don't think this is directly possible because tuple casting also allows changing names:

let a = (foo: 1, "2")
let b: (Int, String) = a
let c: (Int, bar: String) = b

I can see two possible models for names:

  1. Capture names are intrinsic to the regex (as stored in the AST). When casting, only allow casts that keep the same names as defined. This would mean that contains(captureNamed:) has the same behaviour after casting.

  2. Treat names as just labels for capture indexes. The AST stores only the indexes and the regex then has an additional mapping of name to index. This would allow casts to strongly typed tuples with different / no labels, and this would change/remove the name-index mapping. contains(captureNamed:) would have different behaviour after casting.
    This might be more practical. For example, if there were an API that had names in its captures: func foo(regex: Regex<(Substring, bar: Substring, baz: Substring)>), they would be treated as documentation. It would be possible to pass a Regex<AnyRegexOutput> to this function, as long as it had two captures.

Similarly for types:

These could either require an exact type match (i.e. can only convert to a tuple with exactly matching types), or could follow Swift's tuple subtyping rules. I think the latter is probably the right option, for the same reasons as above. If an API takes Regex<(Substring, Substring?, Substring?)>, passing in a Regex<AnyRegexOutput> with 2 non-optional captures should be fine - the 'optional' captures will always have a value.

The difference I see with types, compared to labels, is that the level of optionality would be retained in the AST and just have a super type expressed in the static type.

For Regex, it seems desirable to be strict about both names and optionality. We could consider a strippingCaptureNames member on Regex as well as AnyRegexOutput to allow people to cast by shape instead.

For AnyRegexOutput, I could see arguments either way. We have enough information to reconstruct tuple labels. In fact, we could consider having the (optional) name be a member of AnyRegexOutput.Element.

I'm not sure about an implicit lift operation. It seems a little weird to me, but not being able to do so is also weird. This should be possible via an explicit map. Relatedly, @nnnnnnnn, how do map and tuple labels interact?

I have a bit of API that needs to prevent an end-user from injecting arbitrary symbols into a runtime-compiled Regex. The API I'm trying to implement exposes a named capture as part of its type-checked input, leaving RegexBuilder out as a possibility for this implementation.

A straightforward, but unsafe example:

func buildPrefixExpression(_ prefix: String) throws -> Regex<(Substring, suffix: Substring)> {
    try Regex("\(prefix)(?<suffix>.*)") // unsafe: `prefix` could be any bit of regex
}

Sanitizing input before compiling seems likely to be a common requirement across the ecosystem, so one might expect Regex to provide some implementation of ExpressibleByStringInterpolation such that this could be written:

func buildPrefixExpression(_ prefix: String) throws -> Regex<(Substring, suffix: Substring)> {
    try Regex("\(verbatim: prefix)(?<suffix>.*)") // safe: `prefix` cannot inject regex symbols here
}

It seems like we're partially there with Regex.init(verbatim:), but I can't make it work for this use case.

After exploring this further, we're going to leave the default behavior of compiling a multi-line regex with Regex(...) as matching a multi-line input (i.e. the newlines remain semantic). This is consistent with engines such as PCRE, ICU, and Oniguruma, none of which infer (?x) based on the contents of the pattern, and all of which treat newlines literally by default.

While inferring (?x) would be more consistent with multi-line literals, we feel that the behavior would be even more subtle, as you wouldn't have to use a particular delimiter, e.g. it would apply to "a\nb" as well as:

"""
a
b
"""

Additionally, it would mean that Regex("a\nb") would behave subtly differently from Regex(#"a\nb"#), which seems undesirable. Furthermore, the regex might be provided externally (e.g. through user input), by someone who may not be aware of such inference behavior.

Users can always begin the regex with (?x) to explicitly enable extended syntax, and we can explore adding API to customize the default matching option behavior as future work.
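To make the distinction concrete, here is a small sketch of the behavior described above, assuming the Swift 5.7 Regex(_:) initializer. The patterns and variable names are made up for this example.

```swift
// Default: the newline inside the pattern is semantic and must appear in the input.
let literalNewline = try Regex("a\nb")

// Explicit (?x): whitespace in the pattern, including the newline, is ignored.
let extended = try Regex("(?x)a\nb")

// Raw string: the pattern contains the two-character escape \n,
// which matches a newline even under (?x).
let escaped = try Regex(#"(?x)a\nb"#)

let defaultMatchesNewline = "a\nb".wholeMatch(of: literalNewline) != nil
let extendedMatchesJoined = "ab".wholeMatch(of: extended) != nil
let escapedMatchesNewline = "a\nb".wholeMatch(of: escaped) != nil
print(defaultMatchesNewline, extendedMatchesJoined, escapedMatchesNewline)
```

Had (?x) been inferred from the presence of a newline, the first and third patterns would no longer have agreed, which is the subtlety being avoided.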

You can work around it, poorly, by making sure prefix doesn't contain the subsequence \E and wrapping it in a \Q...\E.

I think the better general solution is to support regex interpolations, which is future work.
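A minimal sketch of the \Q...\E workaround, for anyone who needs something before interpolation exists. The error type and the choice to return Regex<AnyRegexOutput> are assumptions made for this example, not part of any proposed API, and the sketch has not been checked against every edge case (an empty prefix, for instance).

```swift
// Hypothetical error type for this sketch.
struct UnsafePrefixError: Error {}

// Quotes `prefix` with \Q...\E so its characters are matched literally.
func buildPrefixExpression(_ prefix: String) throws -> Regex<AnyRegexOutput> {
    // A "\E" inside the prefix would end the quoted run early, so reject it.
    guard !prefix.contains("\\E") else { throw UnsafePrefixError() }
    return try Regex("\\Q\(prefix)\\E(?<suffix>.*)")
}

// The "." in the prefix is matched literally rather than as a wildcard.
let quoted = try buildPrefixExpression("a.b")
print("a.bXYZ".wholeMatch(of: quoted) != nil)
```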

My understanding (please correct me if I'm wrong @rxwei @nnnnnnnn) is that this can be written with the soon-to-be-revised DSL's mapOutput:

import RegexBuilder // needed for the Regex { ... } builder syntax

func buildPrefixExpression(_ prefixStr: String) throws -> Regex<(Substring, suffix: Substring)> {
    Regex {
      prefixStr
      Capture { /.*/ }
    }.mapOutput {
      ($0, suffix: $1)
    }
}

That's right — this kind of control over composition is one of the primary motivations for creating the RegexBuilder approach to building regexes.

That's fantastic!

.mapOutput(..) will be a heavy hitter for loads of regex code for sure.

I fully support the work being done here and it looks really good. But I can't vote on it. I can't provide valid detailed feedback on the proposal because of my limited exposure to, and lack of an actual use case for, many of the advanced and somewhat problematic corners of regex syntax, the unification effort, and the full Unicode adoption effort. A huge amount of work has been done, but I am afraid it might be too soon to commit to this at the standard library level and make it subject to source break rules.

I didn't get a chance to read the responses, so please accept my apology if this question is a duplicate:

If this proposal is accepted and released, are we going to be locked out of breaking changes to the syntax until Swift 7? Strings with this literal syntax might be stored externally. If we do make a breaking change, will the compiler and Xcode be able to help migrate the existing strings? Especially if we create the string at runtime using literal string fragments plus dynamic runtime information (such as a user-provided word to match)?

No. There are several mechanisms available that could assist us in doing a migration if we had to (though I don't think that we will). The first one that came to mind is that rather than migrate existing strings, we would continue to support the existing syntax via a labeled Regex(swift5_7syntax: pattern) or similar, and migrate existing unlabeled inits to that via tooling. I can think of a few other ways to address it as well, so I do not believe we would have painted ourselves into a corner.

Good to hear. How about the ABI?

We'd be able to do a similar thing at the ABI level so that already-compiled code continued to see the same behavior.

Great. In that case I am fully +1 on this.