SE-0351: Regex Builder DSL

Assuming that let bindings can be used to specify labeled captures,

Regex {
  let a = "abc"
  let b = "def"
  a
  let _ = b
}

would change its behavior (from being equivalent to /abc/ to /(?x) (?<a>abc) (?<b>def) \k<a> (\k<b>)/) and be of type Regex<(Substring, a: Substring, b: Substring, Substring)> instead of the current Regex<Substring>.

I'm sorry for derailing the conversation. I was thinking of Pattern as being capable of containing regexes but not viceversa. It would be nice to have a Future directions section in this proposal stating how the authors expect structured captures and/or collected captures (i.e. having OneOrMore(Capture(...)) to return an array of all the captured values instead of the last one only) to be handled.

Just pointing out that this is a huge assumption and not workable with result builders as they exist for the foreseeable future. The Swift parser does not know whether something is a result builder or not, that's resolved at type checking and name lookup time. We really don't want parsing to depend on type checking. So this would have to either be a dramatic overhaul of result builders or some other Swift DSL solution, or else other significant changes to Swift.

That would not be result builders at that point, which allow you to use lets inside to refactor common subexpressions.

Finally, the issues surrounding labeled tuples is a separate issue than result builders. IIUC, we need type-level operations that operate over tuples with labels, and this work is actually helping to motivate some of these tricky details surrounding variadic generics with tuple labels. @rxwei and @codafi would know the details much better than me.

No worries, but we might want to take this off-thread if we're debating speculative language features. Those are not scheduled on any rigid timeline, and even if they were it's not clear to me that we'd change anything we're proposing.

History preservation would be explicitly opt-in. We're going with a model that encourages interleaving code, either from closures or CustomMatchingRegexComponent conformers in with the regex algorithm itself. History preservation by default would either end up wasting a ton of space for uses that don't care about it, or would try to save engine state to replay upon request. The latter would re-run closures, which could produce spooky-action-at-a-distance, especially if invariants were maintained when the search happened that are no longer maintained now that the result has been passed off.

Even then, we'd be less likely to use Array, instead it'd be some lazy collection that provided O(1) access to the last element but would have to replay the matching algorithm to get any other elements. That's a pretty nuanced concept to introduce right now, so it's future work.

Recursively nested structured captures would be explicitly transformed to/from flattened captures when working with something like Pattern, details TBD. It's less clear that users would want to write tree-traversal code inside a find/replace API, but they certainly would for a parse result.

4 Likes

Yeah, I think making let be a monadic binding in result builders would be a pretty problematic reinterpretation. And there are a lot of good reasons to support both non-monadic and monadic binding; among other things, note that Haskell allows both, with let vs. <-. So if we want to add monadic binding, I think we should find a new syntax for it instead of repurposing let.

Also, traditional monadic binding would directly conflict with what I think the authors want to do with Regex (and maybe also the eventual parser-combinator library), since monadic computations downstream of the binding cannot be evaluated without binding the value first, which means you'd never have a static regex / parser grammar. You really want the binding to be rewritten in a much less compositional way so that the dominant pattern where bindings are just used to form a result:

do lhs <- expr
   op <- tok_operator
   rhs <- expr
   return $ binary_operator(lhs, op, rhs)

gets turned into something with a static grammar that accumulates the captures and then transforms them in one step.

4 Likes

Review manager note:

The proposal authors have made an update to the proposal. This is mostly clarification following this discussion, but it also moves recursive subpatterns into the future directions section, as it is not currently implemented.

The diff can be found here.

2 Likes

I'm a big fan of making regex easier to read and refactor. I recently ported some code from a JS codebase into Swift and while I could pretty easily understand and reason about the behavior of the JS code, the regexes were much more difficult to understand, especially when characters had to be escaped. For some of the larger regexes, I just copied them over verbatim without really exploring how they worked. I'm sure I've written lots of regexes myself that no one (including myself) would ever be able to reverse engineer.

Focusing on the solution presented specifically, I was hesitant at first glance. It almost feels like a result builder takes things too far into the other direction, but I think it's growing on me. I grabbed a regex from a project and converted it over and it's certainly easier to read and composable.

One thing I might have missed: how are quantitative captures handled? Something like this:

OneOrMore {
	Capture {
		OneOrMove(.word)
	}
}
1 Like

Applying a quantifier to a capturing regex will cause the capture to store the last occurrence upon matching, same behavior as textual regex.

let regex = OneOrMore {
	Capture {
		OneOrMore(.word) // .Output == Substring
	} // .Output == (Substring, Substring)
} // .Output == (Substring, Substring)
let input = "abcdef"
let match = input.firstMatch(of: regex)!
match.0 // "abcdef"
match.1 // "f"

One future direction could be to provide a quantifier variant which results in a collection capture type that either preserves the entire capture history or re-matches lazily. We decided that the current behavior (saving the last occurrence) is the best default, as preserving history is memory-intensive and lazy re-matching can lead to misuse.

ChoiceOf enums

SE-0350 describes Regex-backed enums as future work. However, could ChoiceOf support either CaseIterable enums or buildArray loops in this release?


Predefined regexes

The introduction has a simplified example of email addresses. Some servers and clients also support addresses containing ASCII symbols, comments, quoting, escaping, etc.

As a future direction, should the standard library (or a standard package) have predefined regexes, similar to NSDataDetector?

1 Like

I'm still really happy that the proposal authors have tackled a DSL in addition to traditional regex syntax. I feel now, as I did previously, that this is the right approach for offering users the right balance of readability and concision.

For that reason, I do think this problem is significant enough to warrant a change, and I think this proposal generally fits the language well.

I've not used another language that offers a similar DSL. I participated in earlier phases of the discussion in some detail, and I studied this proposal in depth.


I am sad to hear that the use of operators for quantification isn't feasible without variadic generics. Although many aspects of the DSL are very readable, I still find that ?, +, and * can be more easily scanned by eye than the written-out alternatives, and do wish that they were possible to use today without turning to traditional regex syntax. However, I understand the technical limitation. Hopefully the authors are amenable to revisiting the topic if/when variadic generics become available.

Some naming quibbles:

IIUC, the scoping type Local is intended to be an advanced feature; certainly I've never myself written a regex which defines a backtracking scope (although I don't count myself a regex aficionado). Since the DSL proposed here is supposed to be easy to read and offer good code completion, I think this type could do with a different, more obvious name. For example, BacktrackingScope seems like it'd do the trick? Local reads like jargon to me, and considering that these types may appear in autocomplete lists outside of regex builder blocks, when stripped of context it'd be a totally mystifying type for a user to find.

Perhaps I understood at one point, but reviewing again I really can't call to mind the distinction between Capture and TryCapture (and the text doesn't show an example that really compares and contrasts). I vaguely get that the latter is something to do with error handling, but these names really aren't helpful IMO: after all, aren't we trying to capture something in all cases?

It's plainly obvious why Optionally can't be Optional, but it seems a little unsatisfying that we've got OneOrMore, ZeroOrMore, and then an adverb instead of ZeroOrOne—which would be what I'd search for if I were looking for the missing option after having seen the first two quantification types in an autocomplete list.

Last nit: Why Repeat(..., count: n) with the label for n, but no label with the version that has a range parameter? Those ranges are just as much of counts, and I would expect that Repeat(..., count: n...m) wouldn't be a problem for type-checking performance.

17 Likes

Interesting; to my mind names involving Backtracking are much more jargon-y, in that they refer to the details of how such a feature is commonly implemented rather than what it does to the language accepted by the Regex. I don't quite love Local either, but it does feel less like jargon to me.

1 Like

Both can take transform closures that construct a value. But, a Capture's transform closure is a total function, it cannot locally-fail and thus the capture always succeeds. A TryCapture returns an optional value, meaning that whatever post-processed it could signal a local-failure (and thus backtrack). Thus TryCapture signals that its transform function actively participates in the pattern matching attempt.

Both can abort the whole matching operation in case it detects something malicious or some violated invariant (or it somehow knows that matching can never ever succeed) via throwing. And of course, inside any closure there can be fatalErrors or other traps to bring down the whole process as well.

I think If you have thoughts on better API unification, I'd love to explore that a little.

Any thoughts on a better argument label than transform? That's always irked me a tiny bit but I just can't seem to come up with something better.

If a Capture can also throw to abort, then TryCapture might be the wrong name.
Could TryCapture be merged into Capture, with different trailing closure labels?

I suggest renaming transform: to

  • flatMap: for backtracking.
  • map: for non-backtracking.
2 Likes

I also couldn't see why Capture and TryCapture need to be distinct types any more (since the transform in Capture was changed to be throwing too) and agree that it would make more sense to have the distinction be done by the closure labels. However, I don't think that's possible with unlabelled trailing closures?

Or compactMap? This would fit better with the equivalent collection API. Scala's parser combinators call this 'into' (I think?)

1 Like

Regarding Capture:

As mentioned by @xwu, this sentence doesn't really capture what @Michael_Ilseman described above and I think needs updating:

The difference between Capture and TryCapture is that TryCapture works better with transform closures that can return nil or throw, whereas Capture relies on the user to handle errors within a transform closure. With TryCapture , when the closure returns nil or throws, the failure becomes a no-match.

From Michael's description, it isn't that TryCapture 'works better' with nil return values, it's that the match relies on both the regex and the transform succeeding. TryCapture therefore feels to me a lot more like Capture(_: CustomConsumingRegexComponent) in that the closure participates in the match (in the same way that the CustomConsumingRegexComponent parser does).

Given the above, what about Capture { ... } matching: { /* can return nil */ } for this case?


The Detailed Design for Capture/TryCapture doesn't seem to match the implementation. I think all the transform closures should all be throws and for TryCapture the return type of the transform closures should all be NewCapture?.

1 Like

There's a revision up here: [SE-0351] Revise regex builder proposal by rxwei · Pull Request #1634 · apple/swift-evolution · GitHub

Any type can be Captured, including optional types, and so there is a difference between a closure that returns nil to signal failure and a transform that returns nil as a strongly-typed optional capture value.

I like the idea of alternate closure labels to disambiguate, but it might not be enough. There's an overload that takes a RegexComponent as the first argument (rather than a builder resulting in a RegexComponent), which means the closure label wouldn't always be required.

Another name to matching: { ... } could be attempting { ... }.

1 Like

Thank you. Those changes to the proposal make it much clearer.

For distinguishing the non- vs backtracking cases, I had hoped that the possibility of nil or not would be enough for the RegexComponent overload. I hadn’t considered that the transform could succeed with a nil.

There were some discussions during the review of SE-0279 Multiple Trailing Closures about giving a way for an API to require a label for the first trailing closure. I think this is (another?) case where that facility might be useful for the design of an API.

6 Likes

Review Conclusion

The proposal has been returned for revision.

1 Like

Appending to this review thread (apologies, not sure where best to post this):

Working out this example in another review:

let kind = Reference(Substring.self)  // 🫤
let date = Reference(Substring.self)
let account = Reference(Substring.self)
let amount = Reference(Substring.self)

let regex = Regex {
  // Match a line of the format e.g "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  let fieldBreak = /\s\s+/
  Capture(/\w+/,               as: kind);    fieldBreak
  Capture(/\S+/,               as: date);    fieldBreak
  Capture(/(?: (?!\s\s) . )+/, as: account); fieldBreak  // Note that account names may contain spaces.
  Capture(/.*/,                as: amount)
}

…made me realize that the builder DSL presents an ugly tradeoff:

  1. either tolerate verbose repetition of Reference(Substring.self), or
  2. do away with the legibility and safety of named captures.

Positional captures are something we’d want to discourage: they not only harm readability, but make code more brittle because of the danger of position errors (i.e. capture 2 became capture 3, code far below still assume it is 2 and is now broken, oops).


It would be nice if Substring were the default Capture for a no-args Reference initializer:

let kind = Reference()  // 😐 better
let date = Reference()
let account = Reference()
let amount = Reference()

Perhaps it would even be worth providing a method overloaded for various tuple sizes that returns a tuple of many references of the same type, and makes Substring the default type via overloading:

let (kind, date, account, amount) = Regex.references()   // 🙂 not so bad! (all Substring captures)

let (startDate, endDate) = Regex.references(Date.self)  // multiple references of a different type

Something along these lines could make named captures in builder DSL regexes much more tolerable.

Something to consider in the revision?

4 Likes