SE-0351: Regex Builder DSL

DSL stands for domain-specific language. Here's a related WWDC session: write a DSL in Swift using result builders.

3 Likes

Thank you.

I may have missed this; but if you use the DSL to build a regex could you output the raw regex from it? I'm thinking of scenarios where someone new to regex would use the DSL to build up their pattern and then they want to see the output to learn more about the underlying structure, or even someone more versed in regex themselves wanting to see the underlying raw regex for debugging/porting purposes.

12 Likes

This could also be used to validate a regex after porting it to Swift by pasting the output into the original test suite.

2 Likes

Some minor discussion points below, that might not have been as obvious in context of the large proposal.

edit: I didn't realize I was listed as an author, as such I've rephrased my questions as discussion points :sweat_smile:


// Parse the transaction kind.
  TryCapture {
    ChoiceOf {
      "CREDIT"
      "DEBIT"
    }
  } transform: {
    TransactionKind(rawValue: String($0))
  }

Any thoughts/bikeshedding for the transform label? Transform describes what it does, but it is a pretty generic term especially since the transform can fail producing a matching failure.


 OneOrMore(.custom([
      .characterClass(.word),
      .characterClass(.whitespace)
    ]))

Note that .custom and .characterClass are not proposed here, they are proposed in Unicode for String Processing.


    Repeat(.digit, count: 2)

Inside the detailed design, we see that the argument order is reversed for the resuilt-builder-taking variant, allowing something like

  Repeat(count: 2) { CharacterClass.digit }
  Repeat(2...2) { CharacterClass.digit }

AlternationBuilder is a result builder type for creating alternations from components of a block.

Alternation is technically precise name for advanced library authors, and I don't feel that it has to use the exact same terminology as the end-user-facing ChoiceOf.


With TryCapture, when the closure returns nil or throws, the failure becomes a no-match.

Clarification: a throw signals a pattern matching abort, meaning no backtracking to try alternatives. Returning nil will signal a local matching failure, backtracking to try alternatives.


public subscript(_ reference: Reference) -> Capture { get }

More details: This will trap if a reference wasn't present in the regex. We can debate whether it should return nil in that case (like Dictionary, but would be obnoxious at the use site, potentially) or whether there should be API to query a reference. This becomes interesting when the definition and use site grow farther apart in source code and time.


extension Regex {
  public init<R: RegexComponent>(
    @RegexComponentBuilder _ content: (Regex<Substring>) -> R
  ) where R.Output == Match
}
Regex { wholeSentence in
  ChoiceOf {
   "I"
   "you"
  }
  "say"
  ChoiceOf {
    "goodbye"
    "hello"
    wholeSentence
  }
}

Does this impact type checking and error messages at all? Most important would be any impact on diagnostics for non-recursive regexes, since closures and especially result builders can give less-than-useful error messages. I'm wondering if the use of a label somewhere could help, especially since it's not super clear at the use site what the role of the wholeSentence parameter is.


Support buildOptional and buildEither

Just to double check, we're not precluding ourselves from supporting builder-time if in the future. Parameterized regex construction could be very useful, but it totally makes sense to defer it as future work. There's a lot going on here as it is.


Flatten optionals

Clarification: this behavior matches the literal typing behavior. We're interested in ways to flatten them in the future, no concrete plans at this moment.

3 Likes

I think it's a useful future direction! However, I wanted to clarify two things:

  • Regex's underlying representation is not the textual regex, but a general-purpose pattern matching program (for efficiency). So obtaining a textual regex from Regex would be a conversion.
  • Not every Regex built with regex builder can be converted to a textual regex. Regex builder supports arbitrary types that conform to the RegexComponent protocol, including CustomMatchingRegexComponent (pitched in regex-powered string processing algorithms) which can be implemented with arbitrary code. If a Regex contains a CustomMatchingRegexComponent, it cannot be converted to a textual regex.
7 Likes

As a bit of background, I was very strongly in favor of this approach initially. Attempts to make the Regex result builders behave like Pattern<T> (a type for parsers) increasingly created uncanny hybrids, and I've reversed my position for Regex specifically. A proper parser system needs more and deserves its own presentation, even if it's built on top of the same underlying capabilities here.

I think it's important to reduce the semantic divergence between the literals and result builders, and my main concern is that possessive-by-default creates a less understandable and less consistent system. This is especially the case if we only do it for quantification but keep global backtracking across subpatterns.

x*+ is syntactic sugar for (?>x*), that is wrapping the quantification in an atomic group. An atomic group establishes a local backtracking scope where, upon any successful exiting of that scope, all backtracking entries added within that scope are discarded. The result builder spelling of this is Local.

OneOrMore(..., .possessive)

Local { OneOrMore { ... } }

Personally I prefer the latter, as I feel it more directly and faithfully surfaces the semantics and seems an easier system to learn and teach. I could even see the argument to drop the quantification kind entirely from API in favor of OneOrMore, MinimallyOneOrMore (or some better name; these are fairly rare), and Local { OneOrMore }.

Future work can be a (names TBD) ratchet operation that will discard any backtracking entries since some prior ratchet point. These would be a lower-level formulation of Local that's suitable when needs transcend nesting. I.e. Local is akin to a pseudo-code defer { ratchet } at the top of the scope.


Possessive by default is most commonly found in grammar systems, where it's much easier to compose these linear-ish expressions and rules provide a fully local backtracking approach. There non-possessive can naturally desugar to S -> x S | <empty>.

I'm a big fan of possessive-by-default local-backtracking-only rules which are combined into a grammar. Unfortunately, we're not building a grammar system here (yet) and we're not claiming regex provides a full fledged parser combinator system (even if one can be forcibly wedged in via CustomMatchingRegexComponent). Regex could participate in such a system, but I'm wary of diverging fundamental concepts too much from what they mean within the world of regex.

What we're proposing is the regexiest regex, while still very much being a regex. Regex is inherently global backtracking by default with local backtracking opt-in; quantification is just a specific instance of this.

I realize that possessive-by-default is less strongly motivated by this line of reasoning than some of the capture representations, and I do hope it can be alleviated through API design. Ideally, I feel this would be best solved with a clear naming scheme, but my mind fails to find such a suitable scheme.

2 Likes

How about adding something like func buildRegexString() -> String to RegexComponent, allowing (and enforcing) every Regex component to have its own String representation? A CustomMatchingRegexComponent provider can choose its preferred spelling (or placeholder) for the component (or maybe leave it up to end-user through the initializer).

Conversion to textual regex is a possible future direction, but I think the functionality is purely additive and therefore is out of scope for this proposal.

I added a future direction section about this in apple/swift-evolution#1612.

Future directions

Conversion to textual regex

Sometimes it may be useful to convert a regex created using regex builder to textual regex. This may be achieved in the future by extending RegexComponent with a computed property.

extension RegexComponent {
  public func makeTextualRegex() -> String?
}

It is worth noting that the internal representation of a Regex is not textual regex, but an efficient pattern matching bytecode compiled from an abstract syntax tree. Moreover, not every Regex can be converted to textual regex. Regex builder supports arbitrary types that conform to the RegexComponent protocol, including CustomMatchingRegexComponent (pitched in [String Processing Algorithms]) which can be implemented with arbitrary code. If a Regex contains a CustomMatchingRegexComponent, it cannot be converted to textual regex.

2 Likes

Clarified the precondition:

extension Regex.Match {
  /// Returns the capture referenced by the given reference.
  ///
  /// - Precondition: The reference must have been captured in the regex that produced this match.
  public subscript<Capture>(_ reference: Reference<Capture>) -> Capture { get }
}

Good point. The recursive subpattern API could use some more exploration. I moved it to an alternative considered section in apple/swift-evolution#1612.

Future directions

...

Recursive subpatterns

Sometimes, a textual regex may also use (?R) or (?0) to recusively evaluate the entire regex. For example, the following textual regex matches "I say you say I say you say hello".

(you|I) say (goodbye|hello|(?R))

For this, Regex offers a special initializer that allows its pattern to recursively reference itself. This is somewhat akin to a fixed-point combinator.

extension Regex {
  public init<R: RegexComponent>(
    @RegexComponentBuilder _ content: (Regex<Substring>) -> R
  ) where R.Output == Match
}

With this initializer, the above regex can be expressed as the following using regex builder.

Regex { wholeSentence in
  ChoiceOf {
   "I"
   "you"
  }
  "say"
  ChoiceOf {
    "goodbye"
    "hello"
    wholeSentence
  }
}

There are some concerns with this design which we need to consider:

  • Due to the lack of labeling, the argument to the builder closure can be arbitrarily named and cause confusion.
  • When there is an initializer that accepts a result builder closure, overloading that initializer with the same argument labels could lead to bad error messages upon interor type errors.

Builder-time conditions can still be supported with buildOptional and buildEither in the future. I think we need some future investigation to make sure it supports heterogeneous types in branches and won't conflict with regex conditionals.

2 Likes

Agreed. Richard and I discussed this in some detail, and concluded that not having support initially was the best option to leave the door open for implementing the most useful behavior in the future, rather than tying us to something that we'll regret later.

Since this is also my dream, I am afraid that allowing statements within the builder will preclude this eventuality. It's also unfortunate to have regular expression literals having labeled captures (which, in Swift, I guess will happen almost 100% of the times) to break code when converted into a combination of Regex DSLs and References. Ideally, you shouldn't do anything after converting a regex literal into its DSL equivalent.

Also, is the current design capable of placing labeled regexes in Regex DSLs or are labeled tuples not handled by the 10-nple overloads?


Should the transform closure on Try/Captures become a .transform method on instances of RegexComponent? It could be useful to be able to try a transformation without actually capturing the value.


Regarding the Structured rather than flat captures final section in the considered alternatives: I'm not sure it's a future-proof decision to split Regex and an hypothetical Pattern based on their capture structure and default semantics. If a more powerful, "Swifty" and structured Pattern is available, would users still use this hybrid Regex DSL?

Regex literals with named captures (labeled tuples) currently do not work as a builder component because builder methods on unlabeled tuples won't accept labeled tuples as arguments. To support this as a future direction I think we need a new type system feature.

In today's regex builder DSL you can use Reference instead to achieve similar functionality.

Example
let a = Reference(Substring.self)
let b = Reference(Substring.self)
let regex = Regex {
  Capture("abc", as: a)
  Capture("def", as: b)
  a
  Capture(b)
}

if let result = input.firstMatch(of: regex) {
  print(result[a]) // => "abc"
  print(result[b]) // => "def"
}

This has been discussed in the pitch thread. A method named transform(_:) or map(_:) on RegexComponent would be totally useful as a future direction, but it should have very different semantics from the transform: parameter on Capture/TryCapture initializer. A map(_:), similar to other map methods, would transform the entire Output type.

The transform: parameter transforms the most recent capture, so if we were to make it a method, it would be called something like mapFirstCapture(_:) which I'm not sure reads as clearly as Capture(..., transform: ...). Moreover, mapFirstCapture(_:) needs 10 overloads to support each capture arity.

1 Like

Probably an unpopular opinion and not really constructive to the proposal itself: much of the appeal of regex is in its concise form and reusability, oftentimes between different languages and platforms. With external tools able to decompose complex regex and help you create, test and understand it, I can't say I'd use a harder to read, less-concise, non-portable DSL (that I also have to learn first).
Getting strong typed results from a regular regex (or fail) + compile time valid regex checks would be ideal (and enough in terms of safety) IMO.

1 Like

That’s a very reasonable request, which is covered by a separate proposal (discussed some in SE-350, and more in a forthcoming proposal for the literal syntax).

1 Like

Could you elaborate? I don't see why this would preclude that at all. Also note that this is in the RegexBuilder module.

Could you explain what you mean by future-proofing?

If Pattern existed powered by an improved Swift DSL / result builder story alongside better type operations, then clearly Regex would use that. Regex would be able to be used within a pattern and patterns could be regex components. This is all a little hand-wavy, because Pattern hasn't been fully designed and depends on language improvements.

As it is, this Regex DSL is helping to clarify and motivate those language improvements, such as variadic generics and issues around tuple labels.

Assuming that let bindings can be used to specify labeled captures,

Regex {
  let a = "abc"
  let b = "def"
  a
  let _ = b
}

would change its behavior (from being equivalent to /abc/ to /(?x) (?<a>abc) (?<b>def) \k<a> (\k<b>)/) and be of type Regex<(Substring, a: Substring, b: Substring, Substring)> instead of the current Regex<Substring>.

I'm sorry for derailing the conversation. I was thinking of Pattern as being capable of containing regexes but not viceversa. It would be nice to have a Future directions section in this proposal stating how the authors expect structured captures and/or collected captures (i.e. having OneOrMore(Capture(...)) to return an array of all the captured values instead of the last one only) to be handled.

Just pointing out that this is a huge assumption and not workable with result builders as they exist for the foreseeable future. The Swift parser does not know whether something is a result builder or not, that's resolved at type checking and name lookup time. We really don't want parsing to depend on type checking. So this would have to either be a dramatic overhaul of result builders or some other Swift DSL solution, or else other significant changes to Swift.

That would not be result builders at that point, which allow you to use lets inside to refactor common subexpressions.

Finally, the issues surrounding labeled tuples is a separate issue than result builders. IIUC, we need type-level operations that operate over tuples with labels, and this work is actually helping to motivate some of these tricky details surrounding variadic generics with tuple labels. @rxwei and @codafi would know the details much better than me.

No worries, but we might want to take this off-thread if we're debating speculative language features. Those are not scheduled on any rigid timeline, and even if they were it's not clear to me that we'd change anything we're proposing.

History preservation would be explicitly opt-in. We're going with a model that encourages interleaving code, either from closures or CustomMatchingRegexComponent conformers in with the regex algorithm itself. History preservation by default would either end up wasting a ton of space for uses that don't care about it, or would try to save engine state to replay upon request. The latter would re-run closures, which could produce spooky-action-at-a-distance, especially if invariants were maintained when the search happened that are no longer maintained now that the result has been passed off.

Even then, we'd be less likely to use Array, instead it'd be some lazy collection that provided O(1) access to the last element but would have to replay the matching algorithm to get any other elements. That's a pretty nuanced concept to introduce right now, so it's future work.

Recursively nested structured captures would be explicitly transformed to/from flattened captures when working with something like Pattern, details TBD. It's less clear that users would want to write tree-traversal code inside a find/replace API, but they certainly would for a parse result.

4 Likes

Yeah, I think making let be a monadic binding in result builders would be a pretty problematic reinterpretation. And there are a lot of good reasons to support both non-monadic and monadic binding; among other things, note that Haskell allows both, with let vs. <-. So if we want to add monadic binding, I think we should find a new syntax for it instead of repurposing let.

Also, traditional monadic binding would directly conflict with what I think the authors want to do with Regex (and maybe also the eventual parser-combinator library), since monadic computations downstream of the binding cannot be evaluated without binding the value first, which means you'd never have a static regex / parser grammar. You really want the binding to be rewritten in a much less compositional way so that the dominant pattern where bindings are just used to form a result:

do lhs <- expr
   op <- tok_operator
   rhs <- expr
   return $ binary_operator(lhs, op, rhs)

gets turned into something with a static grammar that accumulates the captures and then transforms them in one step.

4 Likes

Review manager note:

The proposal authors have made an update to the proposal. This is mostly clarification following this discussion, but it also moves recursive subpatterns into the future directions section, as it is not currently implemented.

The diff can be found here.

2 Likes