SE-0351: Regex Builder DSL

Hello, Swift community.

The review of SE-0351: Regex Builder DSL begins now and runs through April 18, 2022.


This review is part of a collection of proposals for better string processing in Swift. The proposal authors have put together a proposal overview with links to in-progress pitches and reviews. This proposal builds on the Regex type being reviewed simultaneously with this one, and is one of the proposed ways of creating such a type.

As with the concurrency initiative last year, the core team acknowledges that reviewing a large number of interlinked proposals can be challenging. In particular, acceptance of one of the proposals should be considered provisional on future discussions of follow-on proposals that are closely related but have not yet completed the evolution review process. Similarly, reviewers should hold back on in-depth discussion of a subject of an upcoming review. Please do your best to review each proposal on its own merits, while still understanding its relationship to the larger feature.


Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to the review manager. If you do email me directly, please put "SE-0351" somewhere in the subject line.

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer in your review:

  • What is your evaluation of the proposal?
  • Is the problem being addressed significant enough to warrant a change to Swift?
  • Does this proposal fit well with the feel and direction of Swift?
  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/main/process.md

As always, thank you for contributing to Swift.

Ben Cohen

Review Manager

10 Likes

Is .eagerly the right default for quantification, rather than .possessively? It's well known that eager quantification can be a performance footgun in regex engines when the expression doesn't match, to the point of being considered a DoS vulnerability. The improved composability of this syntax, and particularly the ability to progressively fold results into a well-typed tree, seems to encourage using this in more situations where the performance matters. Traditional regular expressions default to eager quantification, and matching those semantics for that syntax is important for people migrating regular expressions from other languages and APIs. But this syntax requires the regular expression to be completely rewritten, so it seems to me that we have an opportunity for a better default.

13 Likes

Also, can you discuss generally where (if anywhere) this falls short of combinator parsing? The knot-tying you can do with the special Regex constructor seems to allow fairly general parsing.

Please, could you explain what does DSL mean?

Apart from that I think the proposal is excellent. Finally a readable syntax for regexes. Long live Swift.

DSL stands for domain-specific language. Here's a related WWDC session: write a DSL in Swift using result builders.

3 Likes

Thank you.

I may have missed this; but if you use the DSL to build a regex could you output the raw regex from it? I'm thinking of scenarios where someone new to regex would use the DSL to build up their pattern and then they want to see the output to learn more about the underlying structure, or even someone more versed in regex themselves wanting to see the underlying raw regex for debugging/porting purposes.

12 Likes

This could also be used to validate a regex after porting it to Swift by pasting the output into the original test suite.

2 Likes

Some minor discussion points below, that might not have been as obvious in context of the large proposal.

edit: I didn't realize I was listed as an author, as such I've rephrased my questions as discussion points :sweat_smile:


// Parse the transaction kind.
  TryCapture {
    ChoiceOf {
      "CREDIT"
      "DEBIT"
    }
  } transform: {
    TransactionKind(rawValue: String($0))
  }

Any thoughts/bikeshedding for the transform label? Transform describes what it does, but it is a pretty generic term especially since the transform can fail producing a matching failure.


 OneOrMore(.custom([
      .characterClass(.word),
      .characterClass(.whitespace)
    ]))

Note that .custom and .characterClass are not proposed here, they are proposed in Unicode for String Processing.


    Repeat(.digit, count: 2)

Inside the detailed design, we see that the argument order is reversed for the resuilt-builder-taking variant, allowing something like

  Repeat(count: 2) { CharacterClass.digit }
  Repeat(2...2) { CharacterClass.digit }

AlternationBuilder is a result builder type for creating alternations from components of a block.

Alternation is technically precise name for advanced library authors, and I don't feel that it has to use the exact same terminology as the end-user-facing ChoiceOf.


With TryCapture, when the closure returns nil or throws, the failure becomes a no-match.

Clarification: a throw signals a pattern matching abort, meaning no backtracking to try alternatives. Returning nil will signal a local matching failure, backtracking to try alternatives.


public subscript(_ reference: Reference) -> Capture { get }

More details: This will trap if a reference wasn't present in the regex. We can debate whether it should return nil in that case (like Dictionary, but would be obnoxious at the use site, potentially) or whether there should be API to query a reference. This becomes interesting when the definition and use site grow farther apart in source code and time.


extension Regex {
  public init<R: RegexComponent>(
    @RegexComponentBuilder _ content: (Regex<Substring>) -> R
  ) where R.Output == Match
}
Regex { wholeSentence in
  ChoiceOf {
   "I"
   "you"
  }
  "say"
  ChoiceOf {
    "goodbye"
    "hello"
    wholeSentence
  }
}

Does this impact type checking and error messages at all? Most important would be any impact on diagnostics for non-recursive regexes, since closures and especially result builders can give less-than-useful error messages. I'm wondering if the use of a label somewhere could help, especially since it's not super clear at the use site what the role of the wholeSentence parameter is.


Support buildOptional and buildEither

Just to double check, we're not precluding ourselves from supporting builder-time if in the future. Parameterized regex construction could be very useful, but it totally makes sense to defer it as future work. There's a lot going on here as it is.


Flatten optionals

Clarification: this behavior matches the literal typing behavior. We're interested in ways to flatten them in the future, no concrete plans at this moment.

3 Likes

I think it's a useful future direction! However, I wanted to clarify two things:

  • Regex's underlying representation is not the textual regex, but a general-purpose pattern matching program (for efficiency). So obtaining a textual regex from Regex would be a conversion.
  • Not every Regex built with regex builder can be converted to a textual regex. Regex builder supports arbitrary types that conform to the RegexComponent protocol, including CustomMatchingRegexComponent (pitched in regex-powered string processing algorithms) which can be implemented with arbitrary code. If a Regex contains a CustomMatchingRegexComponent, it cannot be converted to a textual regex.
7 Likes

As a bit of background, I was very strongly in favor of this approach initially. Attempts to make the Regex result builders behave like Pattern<T> (a type for parsers) increasingly created uncanny hybrids, and I've reversed my position for Regex specifically. A proper parser system needs more and deserves its own presentation, even if it's built on top of the same underlying capabilities here.

I think it's important to reduce the semantic divergence between the literals and result builders, and my main concern is that possessive-by-default creates a less understandable and less consistent system. This is especially the case if we only do it for quantification but keep global backtracking across subpatterns.

x*+ is syntactic sugar for (?>x*), that is wrapping the quantification in an atomic group. An atomic group establishes a local backtracking scope where, upon any successful exiting of that scope, all backtracking entries added within that scope are discarded. The result builder spelling of this is Local.

OneOrMore(..., .possessive)

Local { OneOrMore { ... } }

Personally I prefer the latter, as I feel it more directly and faithfully surfaces the semantics and seems an easier system to learn and teach. I could even see the argument to drop the quantification kind entirely from API in favor of OneOrMore, MinimallyOneOrMore (or some better name; these are fairly rare), and Local { OneOrMore }.

Future work can be a (names TBD) ratchet operation that will discard any backtracking entries since some prior ratchet point. These would be a lower-level formulation of Local that's suitable when needs transcend nesting. I.e. Local is akin to a pseudo-code defer { ratchet } at the top of the scope.


Possessive by default is most commonly found in grammar systems, where it's much easier to compose these linear-ish expressions and rules provide a fully local backtracking approach. There non-possessive can naturally desugar to S -> x S | <empty>.

I'm a big fan of possessive-by-default local-backtracking-only rules which are combined into a grammar. Unfortunately, we're not building a grammar system here (yet) and we're not claiming regex provides a full fledged parser combinator system (even if one can be forcibly wedged in via CustomMatchingRegexComponent). Regex could participate in such a system, but I'm wary of diverging fundamental concepts too much from what they mean within the world of regex.

What we're proposing is the regexiest regex, while still very much being a regex. Regex is inherently global backtracking by default with local backtracking opt-in; quantification is just a specific instance of this.

I realize that possessive-by-default is less strongly motivated by this line of reasoning than some of the capture representations, and I do hope it can be alleviated through API design. Ideally, I feel this would be best solved with a clear naming scheme, but my mind fails to find such a suitable scheme.

2 Likes

How about adding something like func buildRegexString() -> String to RegexComponent, allowing (and enforcing) every Regex component to have its own String representation? A CustomMatchingRegexComponent provider can choose its preferred spelling (or placeholder) for the component (or maybe leave it up to end-user through the initializer).

Conversion to textual regex is a possible future direction, but I think the functionality is purely additive and therefore is out of scope for this proposal.

I added a future direction section about this in apple/swift-evolution#1612.

Future directions

Conversion to textual regex

Sometimes it may be useful to convert a regex created using regex builder to textual regex. This may be achieved in the future by extending RegexComponent with a computed property.

extension RegexComponent {
  public func makeTextualRegex() -> String?
}

It is worth noting that the internal representation of a Regex is not textual regex, but an efficient pattern matching bytecode compiled from an abstract syntax tree. Moreover, not every Regex can be converted to textual regex. Regex builder supports arbitrary types that conform to the RegexComponent protocol, including CustomMatchingRegexComponent (pitched in [String Processing Algorithms]) which can be implemented with arbitrary code. If a Regex contains a CustomMatchingRegexComponent, it cannot be converted to textual regex.

2 Likes

Clarified the precondition:

extension Regex.Match {
  /// Returns the capture referenced by the given reference.
  ///
  /// - Precondition: The reference must have been captured in the regex that produced this match.
  public subscript<Capture>(_ reference: Reference<Capture>) -> Capture { get }
}

Good point. The recursive subpattern API could use some more exploration. I moved it to an alternative considered section in apple/swift-evolution#1612.

Future directions

...

Recursive subpatterns

Sometimes, a textual regex may also use (?R) or (?0) to recusively evaluate the entire regex. For example, the following textual regex matches "I say you say I say you say hello".

(you|I) say (goodbye|hello|(?R))

For this, Regex offers a special initializer that allows its pattern to recursively reference itself. This is somewhat akin to a fixed-point combinator.

extension Regex {
  public init<R: RegexComponent>(
    @RegexComponentBuilder _ content: (Regex<Substring>) -> R
  ) where R.Output == Match
}

With this initializer, the above regex can be expressed as the following using regex builder.

Regex { wholeSentence in
  ChoiceOf {
   "I"
   "you"
  }
  "say"
  ChoiceOf {
    "goodbye"
    "hello"
    wholeSentence
  }
}

There are some concerns with this design which we need to consider:

  • Due to the lack of labeling, the argument to the builder closure can be arbitrarily named and cause confusion.
  • When there is an initializer that accepts a result builder closure, overloading that initializer with the same argument labels could lead to bad error messages upon interor type errors.

Builder-time conditions can still be supported with buildOptional and buildEither in the future. I think we need some future investigation to make sure it supports heterogeneous types in branches and won't conflict with regex conditionals.

2 Likes

Agreed. Richard and I discussed this in some detail, and concluded that not having support initially was the best option to leave the door open for implementing the most useful behavior in the future, rather than tying us to something that we'll regret later.

Since this is also my dream, I am afraid that allowing statements within the builder will preclude this eventuality. It's also unfortunate to have regular expression literals having labeled captures (which, in Swift, I guess will happen almost 100% of the times) to break code when converted into a combination of Regex DSLs and References. Ideally, you shouldn't do anything after converting a regex literal into its DSL equivalent.

Also, is the current design capable of placing labeled regexes in Regex DSLs or are labeled tuples not handled by the 10-nple overloads?


Should the transform closure on Try/Captures become a .transform method on instances of RegexComponent? It could be useful to be able to try a transformation without actually capturing the value.


Regarding the Structured rather than flat captures final section in the considered alternatives: I'm not sure it's a future-proof decision to split Regex and an hypothetical Pattern based on their capture structure and default semantics. If a more powerful, "Swifty" and structured Pattern is available, would users still use this hybrid Regex DSL?

Regex literals with named captures (labeled tuples) currently do not work as a builder component because builder methods on unlabeled tuples won't accept labeled tuples as arguments. To support this as a future direction I think we need a new type system feature.

In today's regex builder DSL you can use Reference instead to achieve similar functionality.

Example
let a = Reference(Substring.self)
let b = Reference(Substring.self)
let regex = Regex {
  Capture("abc", as: a)
  Capture("def", as: b)
  a
  Capture(b)
}

if let result = input.firstMatch(of: regex) {
  print(result[a]) // => "abc"
  print(result[b]) // => "def"
}

This has been discussed in the pitch thread. A method named transform(_:) or map(_:) on RegexComponent would be totally useful as a future direction, but it should have very different semantics from the transform: parameter on Capture/TryCapture initializer. A map(_:), similar to other map methods, would transform the entire Output type.

The transform: parameter transforms the most recent capture, so if we were to make it a method, it would be called something like mapFirstCapture(_:) which I'm not sure reads as clearly as Capture(..., transform: ...). Moreover, mapFirstCapture(_:) needs 10 overloads to support each capture arity.

1 Like

Probably an unpopular opinion and not really constructive to the proposal itself: much of the appeal of regex is in its concise form and reusability, oftentimes between different languages and platforms. With external tools able to decompose complex regex and help you create, test and understand it, I can't say I'd use a harder to read, less-concise, non-portable DSL (that I also have to learn first).
Getting strong typed results from a regular regex (or fail) + compile time valid regex checks would be ideal (and enough in terms of safety) IMO.

1 Like

That’s a very reasonable request, which is covered by a separate proposal (discussed some in SE-350, and more in a forthcoming proposal for the literal syntax).

1 Like

Could you elaborate? I don't see why this would preclude that at all. Also note that this is in the RegexBuilder module.

Could you explain what you mean by future-proofing?

If Pattern existed powered by an improved Swift DSL / result builder story alongside better type operations, then clearly Regex would use that. Regex would be able to be used within a pattern and patterns could be regex components. This is all a little hand-wavy, because Pattern hasn't been fully designed and depends on language improvements.

As it is, this Regex DSL is helping to clarify and motivate those language improvements, such as variadic generics and issues around tuple labels.