SE-0351: Regex Builder DSL

How about adding something like func buildRegexString() -> String to RegexComponent, allowing (and enforcing) every Regex component to have its own String representation? A CustomMatchingRegexComponent provider can choose its preferred spelling (or placeholder) for the component (or maybe leave it up to end-user through the initializer).

Conversion to textual regex is a possible future direction, but I think the functionality is purely additive and therefore is out of scope for this proposal.

I added a future direction section about this in apple/swift-evolution#1612.

Future directions

Conversion to textual regex

Sometimes it may be useful to convert a regex created using regex builder to textual regex. This may be achieved in the future by extending RegexComponent with a computed property.

extension RegexComponent {
  public func makeTextualRegex() -> String?
}

It is worth noting that the internal representation of a Regex is not textual regex, but an efficient pattern matching bytecode compiled from an abstract syntax tree. Moreover, not every Regex can be converted to textual regex. Regex builder supports arbitrary types that conform to the RegexComponent protocol, including CustomMatchingRegexComponent (pitched in [String Processing Algorithms]) which can be implemented with arbitrary code. If a Regex contains a CustomMatchingRegexComponent, it cannot be converted to textual regex.

2 Likes

Clarified the precondition:

extension Regex.Match {
  /// Returns the capture referenced by the given reference.
  ///
  /// - Precondition: The reference must have been captured in the regex that produced this match.
  public subscript<Capture>(_ reference: Reference<Capture>) -> Capture { get }
}

Good point. The recursive subpattern API could use some more exploration. I moved it to an alternative considered section in apple/swift-evolution#1612.

Future directions

...

Recursive subpatterns

Sometimes, a textual regex may also use (?R) or (?0) to recusively evaluate the entire regex. For example, the following textual regex matches "I say you say I say you say hello".

(you|I) say (goodbye|hello|(?R))

For this, Regex offers a special initializer that allows its pattern to recursively reference itself. This is somewhat akin to a fixed-point combinator.

extension Regex {
  public init<R: RegexComponent>(
    @RegexComponentBuilder _ content: (Regex<Substring>) -> R
  ) where R.Output == Match
}

With this initializer, the above regex can be expressed as the following using regex builder.

Regex { wholeSentence in
  ChoiceOf {
   "I"
   "you"
  }
  "say"
  ChoiceOf {
    "goodbye"
    "hello"
    wholeSentence
  }
}

There are some concerns with this design which we need to consider:

  • Due to the lack of labeling, the argument to the builder closure can be arbitrarily named and cause confusion.
  • When there is an initializer that accepts a result builder closure, overloading that initializer with the same argument labels could lead to bad error messages upon interor type errors.

Builder-time conditions can still be supported with buildOptional and buildEither in the future. I think we need some future investigation to make sure it supports heterogeneous types in branches and won't conflict with regex conditionals.

2 Likes

Agreed. Richard and I discussed this in some detail, and concluded that not having support initially was the best option to leave the door open for implementing the most useful behavior in the future, rather than tying us to something that we'll regret later.

Since this is also my dream, I am afraid that allowing statements within the builder will preclude this eventuality. It's also unfortunate to have regular expression literals having labeled captures (which, in Swift, I guess will happen almost 100% of the times) to break code when converted into a combination of Regex DSLs and References. Ideally, you shouldn't do anything after converting a regex literal into its DSL equivalent.

Also, is the current design capable of placing labeled regexes in Regex DSLs or are labeled tuples not handled by the 10-nple overloads?


Should the transform closure on Try/Captures become a .transform method on instances of RegexComponent? It could be useful to be able to try a transformation without actually capturing the value.


Regarding the Structured rather than flat captures final section in the considered alternatives: I'm not sure it's a future-proof decision to split Regex and an hypothetical Pattern based on their capture structure and default semantics. If a more powerful, "Swifty" and structured Pattern is available, would users still use this hybrid Regex DSL?

Regex literals with named captures (labeled tuples) currently do not work as a builder component because builder methods on unlabeled tuples won't accept labeled tuples as arguments. To support this as a future direction I think we need a new type system feature.

In today's regex builder DSL you can use Reference instead to achieve similar functionality.

Example
let a = Reference(Substring.self)
let b = Reference(Substring.self)
let regex = Regex {
  Capture("abc", as: a)
  Capture("def", as: b)
  a
  Capture(b)
}

if let result = input.firstMatch(of: regex) {
  print(result[a]) // => "abc"
  print(result[b]) // => "def"
}

This has been discussed in the pitch thread. A method named transform(_:) or map(_:) on RegexComponent would be totally useful as a future direction, but it should have very different semantics from the transform: parameter on Capture/TryCapture initializer. A map(_:), similar to other map methods, would transform the entire Output type.

The transform: parameter transforms the most recent capture, so if we were to make it a method, it would be called something like mapFirstCapture(_:) which I'm not sure reads as clearly as Capture(..., transform: ...). Moreover, mapFirstCapture(_:) needs 10 overloads to support each capture arity.

1 Like

Probably an unpopular opinion and not really constructive to the proposal itself: much of the appeal of regex is in its concise form and reusability, oftentimes between different languages and platforms. With external tools able to decompose complex regex and help you create, test and understand it, I can't say I'd use a harder to read, less-concise, non-portable DSL (that I also have to learn first).
Getting strong typed results from a regular regex (or fail) + compile time valid regex checks would be ideal (and enough in terms of safety) IMO.

1 Like

That’s a very reasonable request, which is covered by a separate proposal (discussed some in SE-350, and more in a forthcoming proposal for the literal syntax).

1 Like

Could you elaborate? I don't see why this would preclude that at all. Also note that this is in the RegexBuilder module.

Could you explain what you mean by future-proofing?

If Pattern existed powered by an improved Swift DSL / result builder story alongside better type operations, then clearly Regex would use that. Regex would be able to be used within a pattern and patterns could be regex components. This is all a little hand-wavy, because Pattern hasn't been fully designed and depends on language improvements.

As it is, this Regex DSL is helping to clarify and motivate those language improvements, such as variadic generics and issues around tuple labels.

Assuming that let bindings can be used to specify labeled captures,

Regex {
  let a = "abc"
  let b = "def"
  a
  let _ = b
}

would change its behavior (from being equivalent to /abc/ to /(?x) (?<a>abc) (?<b>def) \k<a> (\k<b>)/) and be of type Regex<(Substring, a: Substring, b: Substring, Substring)> instead of the current Regex<Substring>.

I'm sorry for derailing the conversation. I was thinking of Pattern as being capable of containing regexes but not viceversa. It would be nice to have a Future directions section in this proposal stating how the authors expect structured captures and/or collected captures (i.e. having OneOrMore(Capture(...)) to return an array of all the captured values instead of the last one only) to be handled.

Just pointing out that this is a huge assumption and not workable with result builders as they exist for the foreseeable future. The Swift parser does not know whether something is a result builder or not, that's resolved at type checking and name lookup time. We really don't want parsing to depend on type checking. So this would have to either be a dramatic overhaul of result builders or some other Swift DSL solution, or else other significant changes to Swift.

That would not be result builders at that point, which allow you to use lets inside to refactor common subexpressions.

Finally, the issues surrounding labeled tuples is a separate issue than result builders. IIUC, we need type-level operations that operate over tuples with labels, and this work is actually helping to motivate some of these tricky details surrounding variadic generics with tuple labels. @rxwei and @codafi would know the details much better than me.

No worries, but we might want to take this off-thread if we're debating speculative language features. Those are not scheduled on any rigid timeline, and even if they were it's not clear to me that we'd change anything we're proposing.

History preservation would be explicitly opt-in. We're going with a model that encourages interleaving code, either from closures or CustomMatchingRegexComponent conformers in with the regex algorithm itself. History preservation by default would either end up wasting a ton of space for uses that don't care about it, or would try to save engine state to replay upon request. The latter would re-run closures, which could produce spooky-action-at-a-distance, especially if invariants were maintained when the search happened that are no longer maintained now that the result has been passed off.

Even then, we'd be less likely to use Array, instead it'd be some lazy collection that provided O(1) access to the last element but would have to replay the matching algorithm to get any other elements. That's a pretty nuanced concept to introduce right now, so it's future work.

Recursively nested structured captures would be explicitly transformed to/from flattened captures when working with something like Pattern, details TBD. It's less clear that users would want to write tree-traversal code inside a find/replace API, but they certainly would for a parse result.

4 Likes

Yeah, I think making let be a monadic binding in result builders would be a pretty problematic reinterpretation. And there are a lot of good reasons to support both non-monadic and monadic binding; among other things, note that Haskell allows both, with let vs. <-. So if we want to add monadic binding, I think we should find a new syntax for it instead of repurposing let.

Also, traditional monadic binding would directly conflict with what I think the authors want to do with Regex (and maybe also the eventual parser-combinator library), since monadic computations downstream of the binding cannot be evaluated without binding the value first, which means you'd never have a static regex / parser grammar. You really want the binding to be rewritten in a much less compositional way so that the dominant pattern where bindings are just used to form a result:

do lhs <- expr
   op <- tok_operator
   rhs <- expr
   return $ binary_operator(lhs, op, rhs)

gets turned into something with a static grammar that accumulates the captures and then transforms them in one step.

4 Likes

Review manager note:

The proposal authors have made an update to the proposal. This is mostly clarification following this discussion, but it also moves recursive subpatterns into the future directions section, as it is not currently implemented.

The diff can be found here.

2 Likes

I'm a big fan of making regex easier to read and refactor. I recently ported some code from a JS codebase into Swift and while I could pretty easily understand and reason about the behavior of the JS code, the regexes were much more difficult to understand, especially when characters had to be escaped. For some of the larger regexes, I just copied them over verbatim without really exploring how they worked. I'm sure I've written lots of regexes myself that no one (including myself) would ever be able to reverse engineer.

Focusing on the solution presented specifically, I was hesitant at first glance. It almost feels like a result builder takes things too far into the other direction, but I think it's growing on me. I grabbed a regex from a project and converted it over and it's certainly easier to read and composable.

One thing I might have missed: how are quantitative captures handled? Something like this:

OneOrMore {
	Capture {
		OneOrMove(.word)
	}
}
1 Like

Applying a quantifier to a capturing regex will cause the capture to store the last occurrence upon matching, same behavior as textual regex.

let regex = OneOrMore {
	Capture {
		OneOrMore(.word) // .Output == Substring
	} // .Output == (Substring, Substring)
} // .Output == (Substring, Substring)
let input = "abcdef"
let match = input.firstMatch(of: regex)!
match.0 // "abcdef"
match.1 // "f"

One future direction could be to provide a quantifier variant which results in a collection capture type that either preserves the entire capture history or re-matches lazily. We decided that the current behavior (saving the last occurrence) is the best default, as preserving history is memory-intensive and lazy re-matching can lead to misuse.

ChoiceOf enums

SE-0350 describes Regex-backed enums as future work. However, could ChoiceOf support either CaseIterable enums or buildArray loops in this release?


Predefined regexes

The introduction has a simplified example of email addresses. Some servers and clients also support addresses containing ASCII symbols, comments, quoting, escaping, etc.

As a future direction, should the standard library (or a standard package) have predefined regexes, similar to NSDataDetector?

1 Like

I'm still really happy that the proposal authors have tackled a DSL in addition to traditional regex syntax. I feel now, as I did previously, that this is the right approach for offering users the right balance of readability and concision.

For that reason, I do think this problem is significant enough to warrant a change, and I think this proposal generally fits the language well.

I've not used another language that offers a similar DSL. I participated in earlier phases of the discussion in some detail, and I studied this proposal in depth.


I am sad to hear that the use of operators for quantification isn't feasible without variadic generics. Although many aspects of the DSL are very readable, I still find that ?, +, and * can be more easily scanned by eye than the written-out alternatives, and do wish that they were possible to use today without turning to traditional regex syntax. However, I understand the technical limitation. Hopefully the authors are amenable to revisiting the topic if/when variadic generics become available.

Some naming quibbles:

IIUC, the scoping type Local is intended to be an advanced feature; certainly I've never myself written a regex which defines a backtracking scope (although I don't count myself a regex aficionado). Since the DSL proposed here is supposed to be easy to read and offer good code completion, I think this type could do with a different, more obvious name. For example, BacktrackingScope seems like it'd do the trick? Local reads like jargon to me, and considering that these types may appear in autocomplete lists outside of regex builder blocks, when stripped of context it'd be a totally mystifying type for a user to find.

Perhaps I understood at one point, but reviewing again I really can't call to mind the distinction between Capture and TryCapture (and the text doesn't show an example that really compares and contrasts). I vaguely get that the latter is something to do with error handling, but these names really aren't helpful IMO: after all, aren't we trying to capture something in all cases?

It's plainly obvious why Optionally can't be Optional, but it seems a little unsatisfying that we've got OneOrMore, ZeroOrMore, and then an adverb instead of ZeroOrOne—which would be what I'd search for if I were looking for the missing option after having seen the first two quantification types in an autocomplete list.

Last nit: Why Repeat(..., count: n) with the label for n, but no label with the version that has a range parameter? Those ranges are just as much of counts, and I would expect that Repeat(..., count: n...m) wouldn't be a problem for type-checking performance.

17 Likes

Interesting; to my mind names involving Backtracking are much more jargon-y, in that they refer to the details of how such a feature is commonly implemented rather than what it does to the language accepted by the Regex. I don't quite love Local either, but it does feel less like jargon to me.

1 Like

Both can take transform closures that construct a value. But, a Capture's transform closure is a total function, it cannot locally-fail and thus the capture always succeeds. A TryCapture returns an optional value, meaning that whatever post-processed it could signal a local-failure (and thus backtrack). Thus TryCapture signals that its transform function actively participates in the pattern matching attempt.

Both can abort the whole matching operation in case it detects something malicious or some violated invariant (or it somehow knows that matching can never ever succeed) via throwing. And of course, inside any closure there can be fatalErrors or other traps to bring down the whole process as well.

I think If you have thoughts on better API unification, I'd love to explore that a little.

Any thoughts on a better argument label than transform? That's always irked me a tiny bit but I just can't seem to come up with something better.

If a Capture can also throw to abort, then TryCapture might be the wrong name.
Could TryCapture be merged into Capture, with different trailing closure labels?

I suggest renaming transform: to

  • flatMap: for backtracking.
  • map: for non-backtracking.
2 Likes