Full proposal: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexBuilderDSL.md
Regex builder DSL
- Status: Pitch
- Implementation: apple/swift-experimental-string-processing
Table of Contents
- Introduction
- Motivation
- Proposed solution
- Detailed design
- Source compatibility
- Effect on ABI stability
- Effect on API resilience
- Alternatives considered
Introduction
Declarative string processing aims to offer powerful pattern matching capabilities with expressivity, clarity, type safety, and ease of use. To achieve this, we propose to introduce a result-builder-based DSL, regex builder, for creating and composing regular expressions (regexes).
Regex builder is part of the Swift Standard Library but resides in a standalone module named RegexBuilder. By importing RegexBuilder, you get all necessary API for building a regex.
import RegexBuilder
let emailPattern = Regex {
    let word = OneOrMore(.word)
    Capture {
        ZeroOrMore {
            word
            "."
        }
        word
    }
    "@"
    Capture {
        word
        OneOrMore {
            "."
            word
        }
    }
} // => Regex<(Substring, Substring, Substring)>
let email = "My email is my.name@mail.swift.org."
if let match = email.firstMatch(of: emailPattern) {
    let (wholeMatch, name, domain) = match.output
    // wholeMatch: "my.name@mail.swift.org"
    // name: "my.name"
    // domain: "mail.swift.org"
}
This proposal introduces all core API for creating and composing regexes that echo the textual regex syntax and strongly typed regex captures, but does not formally specify the matching semantics or define character classes.
Motivation
Regex is a fundamental and powerful tool for textual pattern matching. It is a domain-specific language often expressed as text. For example, given the following bank statement:
CREDIT 04062020 PayPal transfer $4.99
CREDIT 04032020 Payroll $69.73
DEBIT 04022020 ACH transfer $38.25
DEBIT 03242020 IRS tax payment $52249.98
One can write the following textual regex to match each line:
(CREDIT|DEBIT)\s+(\d{2}\d{2}\d{4})\s+([\w\s]+\w)\s+(\$\d+\.\d{2})
While a regex like this is very compact and expressive, it is difficult to read, write, and use:
- Syntactic special characters, e.g. \, (, [, {, are too dense to be readable.
- It contains a hierarchy of subpatterns fit into a single line of text.
- No code completion when typing syntactic components.
- Capturing groups produce raw data (i.e. a range or a substring) and can only be converted to other data structures after matching.
- While comments (?#...) can be added inline, they only complicate readability.
Proposed solution
We introduce regex builder, a result-builder-based API for creating and composing regexes. This API resides in a new module named RegexBuilder that is to be shipped as part of the Swift toolchain.
With regex builder, the regex for matching a bank statement can be written as the following:
import RegexBuilder
enum TransactionKind: String {
    case credit = "CREDIT"
    case debit = "DEBIT"
}
struct Date {
    var month, day, year: Int
    init?(mmddyyyy: String) { ... }
}
struct Amount {
    var valueTimes100: Int
    init?(twoDecimalPlaces text: Substring) { ... }
}
let statementPattern = Regex {
    // Parse the transaction kind.
    TryCapture {
        ChoiceOf {
            "CREDIT"
            "DEBIT"
        }
    } transform: {
        TransactionKind(rawValue: String($0))
    }
    OneOrMore(.whitespace)
    // Parse the date, e.g. "01012021".
    TryCapture {
        Repeat(.digit, count: 2)
        Repeat(.digit, count: 2)
        Repeat(.digit, count: 4)
    } transform: {
        // The captured text is a Substring; convert it for the String-based initializer.
        Date(mmddyyyy: String($0))
    }
    OneOrMore(.whitespace)
    // Parse the transaction description, e.g. "ACH transfer".
    Capture {
        OneOrMore(.custom([
            .characterClass(.word),
            .characterClass(.whitespace)
        ]))
        CharacterClass.word
    } transform: { String($0) }
    OneOrMore(.whitespace)
    "$"
    // Parse the amount, e.g. `$100.00`.
    TryCapture {
        OneOrMore(.digit)
        "."
        Repeat(.digit, count: 2)
    } transform: { Amount(twoDecimalPlaces: $0) }
} // => Regex<(Substring, TransactionKind, Date, String, Amount)>
let statement = """
    CREDIT 04062020 PayPal transfer $4.99
    CREDIT 04032020 Payroll $69.73
    DEBIT 04022020 ACH transfer $38.25
    DEBIT 03242020 IRS tax payment $52249.98
    """
for match in statement.matches(of: statementPattern) {
    let (line, kind, date, description, amount) = match.output
    ...
}
Regex builder addresses all of textual regexes' shortcomings presented in the Motivation section:
- Capture groups and quantifiers are expressed as API calls that are easy to read.
- Scoping and indentations clearly distinguish subpatterns in the hierarchy.
- Code completion is available when the developer types an API call.
- Capturing groups can be transformed into structured data at the regex declaration site.
- Normal code comments can be written within a regex declaration to further improve readability.
Detailed design
RegexComponent protocol
One of the goals of the regex builder DSL is to allow developers to easily compose regexes from common currency types and literals, or even define custom patterns to use for matching. We introduce RegexComponent, a protocol that unifies all types that can represent a component of a regex.
public protocol RegexComponent {
associatedtype Output
@RegexComponentBuilder
var regex: Regex<Output> { get }
}
By conforming standard library types to RegexComponent, we allow them to be used inside the regex builder DSL as a match target.
// A string represents a regex that matches the string.
extension String: RegexComponent {
public var regex: Regex<Substring> { get }
}
// A substring represents a regex that matches the substring.
extension Substring: RegexComponent {
public var regex: Regex<Substring> { get }
}
// A character represents a regex that matches the character.
extension Character: RegexComponent {
public var regex: Regex<Substring> { get }
}
// A unicode scalar represents a regex that matches the scalar.
extension UnicodeScalar: RegexComponent {
public var regex: Regex<Substring> { get }
}
// To be introduced in a future pitch.
extension CharacterClass: RegexComponent {
public var regex: Regex<Substring> { get }
}
Since regexes are composable, the Regex type itself also conforms to RegexComponent.
extension Regex: RegexComponent {
public var regex: Self { self }
}
All of the regex builder DSL in the rest of this pitch will accept generic components that conform to RegexComponent.
Concatenation
A regex can be viewed as a concatenation of smaller regexes. In the regex builder DSL, RegexComponentBuilder is the basic facility to allow developers to compose regexes by concatenation.
@resultBuilder
public enum RegexComponentBuilder { ... }
A closure marked with @RegexComponentBuilder will be transformed to produce a Regex by concatenating all of its components, where the result type's Output type will be a Substring followed by concatenated captures (tuple when plural).
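As a rough illustration of how concatenation shapes the Output type, here is a minimal sketch. It assumes a Swift 5.7 or later toolchain in which the RegexBuilder module is available, and the spellings in that shipped module may differ from the declarations in this pitch. The names keyValue, key, and value are hypothetical.

```swift
import RegexBuilder

// Two capturing components and one literal, concatenated in order.
// The Output is the whole match (Substring) followed by one element per capture.
let keyValue: Regex<(Substring, Substring, Substring)> = Regex {
    Capture { OneOrMore(.word) }   // capture 1
    "="                            // a literal contributes no capture
    Capture { OneOrMore(.digit) }  // capture 2
}

let match = "width=42".wholeMatch(of: keyValue)
let key = match?.output.1    // text matched by capture 1
let value = match?.output.2  // text matched by capture 2
```

The explicit type annotation is only there so that the compiler checks the claimed Output type; it can be omitted in practice.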
Recap: Regex capturing basics
Regex is a generic type with generic parameter Output.
struct Regex<Output> { ... }
When a regex does not contain any capturing groups, its Output type is Substring, which represents the whole matched portion of the input.
let noCaptures = #/a/# // => Regex<Substring>
When a regex contains capturing groups, i.e. (...), the Output type is extended as a tuple to also contain capture types. Capture types are tuple elements after the first element.
let yesCaptures = #/a(?:(b+)c(d+))+e(f)?/# // => Regex<(Substring, Substring, Substring, Substring?)>
// .0: the whole match, a(?:(b+)c(d+))+e(f)? => Substring
// .1: (b+) => Substring
// .2: (d+) => Substring
// .3: (f)? => Substring?
// Elements .1 through .3 are the capture types.
We introduce a new initializer Regex.init(_:) which accepts a @RegexComponentBuilder closure. This initializer is the entry point for creating a regex using the regex builder DSL.
extension Regex {
public init<R: RegexComponent>(
@RegexComponentBuilder _ content: () -> R
) where R.Output == Output
}
Example:
Regex {
regex0 // Regex<Substring>
regex1 // Regex<(Substring, Int)>
if condition {
regex2 // Regex<(Substring, Float)>
} else {
regex3 // Regex<(Substring, Float)>
}
} // Regex<(Substring, Int, Float)>
This regex will be transformed to:
Regex {
    let e0 = RegexComponentBuilder.buildExpression(regex0) // Component<Regex<Substring>>
    let e1 = RegexComponentBuilder.buildExpression(regex1) // Component<Regex<(Substring, Int)>>
    let e2: Regex<(Substring, Float)>
    if condition {
        let comp = RegexComponentBuilder.buildExpression(regex2) // Component<Regex<(Substring, Float)>>
        e2 = RegexComponentBuilder.buildEither(first: comp) // Regex<(Substring, Float)>
    } else {
        let comp = RegexComponentBuilder.buildExpression(regex3) // Component<Regex<(Substring, Float)>>
        e2 = RegexComponentBuilder.buildEither(second: comp) // Regex<(Substring, Float)>
    }
    let r0 = RegexComponentBuilder.buildPartialBlock(first: e0)
    let r1 = RegexComponentBuilder.buildPartialBlock(accumulated: r0, next: e1)
    let r2 = RegexComponentBuilder.buildPartialBlock(accumulated: r1, next: e2)
    return r2
} // Regex<(Substring, Int, Float)>
Basic methods in RegexComponentBuilder, e.g. buildBlock(), provide support for creating the most fundamental blocks. The buildExpression method wraps a user-provided component in a RegexComponentBuilder.Component structure before passing the component to other builder methods. This is used for saving the source location of the component so that runtime errors can be reported with an accurate location.
@resultBuilder
public enum RegexComponentBuilder {
/// Returns an empty regex.
public static func buildBlock() -> Regex<Substring>
/// A builder component that stores a regex component and its source location
/// for debugging purposes.
public struct Component<Value: RegexComponent> {
public var value: Value
public var file: String
public var function: String
public var line: Int
public var column: Int
}
/// Returns a component by wrapping the component regex in `Component` and
/// recording its source location.
public static func buildExpression<R: RegexComponent>(
_ regex: R,
file: String = #file,
function: String = #function,
line: Int = #line,
column: Int = #column
) -> Component<R>
}
When it comes to concatenation, RegexComponentBuilder utilizes the recently proposed buildPartialBlock feature to be able to concatenate all components' capture types to a single result tuple. buildPartialBlock(first:) provides support for creating a regex from a single component, and buildPartialBlock(accumulated:next:) provides support for creating a regex from multiple results.
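The left-to-right folding performed by buildPartialBlock can be sketched with a toy result builder. This is not RegexComponentBuilder: Piece, PieceBuilder, and concatenate are hypothetical names, a Piece merely carries pattern text and a capture count, and the sketch assumes a toolchain that supports buildPartialBlock (Swift 5.7 or later).

```swift
// A toy stand-in for a regex: pattern text plus a running capture count.
struct Piece {
    var pattern: String
    var captureCount: Int
}

@resultBuilder
enum PieceBuilder {
    // Called once, for the first component in the block.
    static func buildPartialBlock(first: Piece) -> Piece {
        first
    }

    // Called for each later component, folding left to right.
    static func buildPartialBlock(accumulated: Piece, next: Piece) -> Piece {
        Piece(pattern: accumulated.pattern + next.pattern,
              captureCount: accumulated.captureCount + next.captureCount)
    }
}

func concatenate(@PieceBuilder _ content: () -> Piece) -> Piece {
    content()
}

let amount = concatenate {
    Piece(pattern: #"\$"#, captureCount: 0)
    Piece(pattern: #"(\d+)"#, captureCount: 1)
    Piece(pattern: #"\.(\d{2})"#, captureCount: 1)
}
// amount.pattern is the three patterns joined in order;
// amount.captureCount is the sum of the three counts.
```

In the real builder the accumulated value changes type at each step, because capture types are appended to the Output tuple; that is why the pitch needs one overload per supported arity rather than the single pair of methods shown here.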
Before Swift supports variadic generics, buildPartialBlock(first:) and buildPartialBlock(accumulated:next:) must be overloaded to support concatenating regexes of supported capture quantities (arities).
- buildPartialBlock(first:) is overloaded arity times such that a unary block with a component of any supported capture arity will produce a regex with capture type Substring followed by the component's capture types. The base overload, buildPartialBlock<R>(first:) -> Regex<Substring>, must be marked with @_disfavoredOverload to prevent it from shadowing other overloads.
- buildPartialBlock(accumulated:next:) is overloaded up to arity^2 times to account for all possible pairs of regexes that make up 10 captures.
In the initial version of the DSL, we plan to support regexes with up to 10 captures, as 10 captures are sufficient for most use cases. These overloads can be superseded by variadic versions of buildPartialBlock(first:) and buildPartialBlock(accumulated:next:) in a future release.
extension RegexComponentBuilder {
@_disfavoredOverload
public static func buildPartialBlock<R: RegexComponent>(
first r: Component<R>
) -> Regex<Substring>
public static func buildPartialBlock<W, C0, R: RegexComponent>(
first r: Component<R>
) -> Regex<(Substring, C0)> where R.Output == (W, C0)
// ... `O(arity)` overloads of `buildPartialBlock(first:)`
public static func buildPartialBlock<W0, W1, C0, R0: RegexComponent, R1: RegexComponent>(
accumulated: R0, next: Component<R1>
) -> Regex<(Substring, C0)> where R0.Output == W0, R1.Output == (W1, C0)
public static func buildPartialBlock<W0, W1, C0, C1, R0: RegexComponent, R1: RegexComponent>(
accumulated: R0, next: Component<R1>
) -> Regex<(Substring, C0, C1)> where R0.Output == W0, R1.Output == (W1, C0, C1)
// ... `O(arity^2)` overloads of `buildPartialBlock(accumulated:next:)`
}
To support if statements, buildEither(first:), buildEither(second:) and buildOptional(_:) are defined with overloads to support up to 10 captures because each capture type needs to be transformed to an optional. The overload for non-capturing regexes, due to the lack of generic constraints, must be annotated with @_disfavoredOverload in order not to shadow other overloads. We expect that a variadic-generic version of this method will eventually supersede all of these overloads.
Due to the word limit in forum posts, the definition of buildOptional(_:) has been omitted. See the Concatenation section in the full proposal.
To support if #available(...) statements, buildLimitedAvailability(_:) is defined with overloads to support up to 10 captures. Similar to buildOptional, the overload for non-capturing regexes must be annotated with @_disfavoredOverload.
Due to the word limit in forum posts, the definition of buildLimitedAvailability(_:) has been omitted. See the Concatenation section in the full proposal.
Alternation
Alternations are used to match one of multiple patterns. An alternation wraps its underlying patterns' capture types in an Optional and concatenates them together, first to last.
let choice = ChoiceOf {
regex1 // Regex<(Substring, Int)>
regex2 // Regex<(Substring, Float)>
regex3 // Regex<(Substring, Substring)>
regex0 // Regex<Substring>
} // => Regex<(Substring, Int?, Float?, Substring?)>
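To make the Optional wrapping concrete, here is a small sketch. It assumes a Swift 5.7 or later toolchain with the shipped RegexBuilder module, whose spellings may differ from this pitch. The names token, number, and name are hypothetical.

```swift
import RegexBuilder

// Each branch contributes one capture; only the branch that matched is non-nil.
let token: Regex<(Substring, Substring?, Substring?)> = Regex {
    ChoiceOf {
        Capture { OneOrMore(.digit) }  // capture 1: set when the digit branch matches
        Capture { OneOrMore(.word) }   // capture 2: set when the word branch matches
    }
}

let number = "42".wholeMatch(of: token)   // first branch matches
let name = "abc".wholeMatch(of: token)    // first branch fails, second matches
```

Branches are tried in order, so an all-digit input is claimed by the first branch even though it would also satisfy the second.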
AlternationBuilder is a result builder type for creating alternations from components of a block.
@resultBuilder
public struct AlternationBuilder { ... }
To the developer, the top-level API is a type named ChoiceOf. This type has an initializer that accepts an @AlternationBuilder closure.
public struct ChoiceOf<Output>: RegexComponent {
public var regex: Regex<Output> { get }
public init<R: RegexComponent>(
@AlternationBuilder builder: () -> R
) where R.Output == Output
}
AlternationBuilder is mostly similar to RegexComponentBuilder with the following distinctions:
- Empty blocks are not supported.
- Capture types are wrapped in a layer of Optional before being concatenated in the resulting Output type.
- buildEither(first:) and buildEither(second:) are overloaded for each supported capture arity because they need to wrap capture types in Optional.
@resultBuilder
public enum AlternationBuilder {
public typealias Component<Value> = RegexComponentBuilder.Component<Value>
/// Returns a component by wrapping the component regex in `Component` and
/// recording its source location.
public static func buildExpression<R: RegexComponent>(
_ regex: R,
file: String = #file,
function: String = #function,
line: Int = #line,
column: Int = #column
) -> Component<R>
@_disfavoredOverload
public static func buildPartialBlock<R: RegexComponent>(
first r: Component<R>
) -> Regex<Substring>
public static func buildPartialBlock<W, C0, R: RegexComponent>(
first r: Component<R>
) -> Regex<(Substring, C0?)> where R.Output == (W, C0)
// ... `O(arity)` overloads of `buildPartialBlock(first:)`
public static func buildPartialBlock<W0, W1, C0, R0: RegexComponent, R1: RegexComponent>(
accumulated: R0, next: Component<R1>
) -> Regex<(Substring, C0?)> where R0.Output == W0, R1.Output == (W1, C0)
public static func buildPartialBlock<W0, W1, C0, C1, R0: RegexComponent, R1: RegexComponent>(
accumulated: R0, next: Component<R1>
) -> Regex<(Substring, C0?, C1?)> where R0.Output == W0, R1.Output == (W1, C0, C1)
// ... `O(arity^2)` overloads of `buildPartialBlock(accumulated:next:)`
}
Due to the word limit in forum posts, the definitions of buildEither(first:), buildEither(second:), buildOptional(_:) and buildLimitedAvailability(_:) have been omitted. See the Alternation section in the full proposal.
Quantification
Quantifiers are generic types that can be created from a regex component or from a @RegexComponentBuilder closure that produces a regex. Their Output type is inferred from initializers: it matches the component's Output when the lower bound of quantification is greater than 0; otherwise, the capture types are wrapped in Optional. Each of these types corresponds to a quantifier in the textual regex.
Quantifier in regex builder | Quantifier in textual regex |
---|---|
OneOrMore(...) | ...+ |
ZeroOrMore(...) | ...* |
Optionally(...) | ...? |
Repeat(..., count: n) | ...{n} |
Repeat(..., n...) | ...{n,} |
Repeat(..., n...m) | ...{n,m} |
public struct OneOrMore<Output>: RegexComponent {
public var regex: Regex<Output> { get }
}
public struct ZeroOrMore<Output>: RegexComponent {
public var regex: Regex<Output> { get }
}
public struct Optionally<Output>: RegexComponent {
public var regex: Regex<Output> { get }
}
public struct Repeat<Output>: RegexComponent {
public var regex: Regex<Output> { get }
}
As with quantifiers in textual regexes, the developer can specify how eagerly a pattern should be matched using QuantificationBehavior. Static properties in QuantificationBehavior are named like adverbs for fluency at a quantifier call site.
/// Specifies how much to attempt to match when using a quantifier.
public struct QuantificationBehavior {
/// Match as much of the input string as possible, backtracking when
/// necessary.
public static var eagerly: QuantificationBehavior { get }
/// Match as little of the input string as possible, expanding the matched
/// region as necessary to complete a match.
public static var reluctantly: QuantificationBehavior { get }
/// Match as much of the input string as possible, performing no backtracking.
public static var possessively: QuantificationBehavior { get }
}
Each quantification behavior corresponds to a quantification behavior in the textual regex.
Quantifier behavior in regex builder | Quantifier behavior in textual regex |
---|---|
.eagerly | no suffix |
.reluctantly | suffix ? |
.possessively | suffix + |
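The difference between eager and reluctant matching can be seen in a short sketch. It assumes a Swift 5.7 or later toolchain with the shipped RegexBuilder module; as far as we know, that module spells the behaviors .eager, .reluctant, and .possessive rather than the adverb forms used in this pitch, so the spelling below is an assumption about the toolchain and not this pitch's API. The names eagerTag and reluctantTag are hypothetical.

```swift
import RegexBuilder

let input = "<a><b>"

// Default (eager) behavior: take as much as possible, then backtrack to find ">".
let eagerTag = Regex {
    "<"
    Capture { OneOrMore(.any) }
    ">"
}

// Reluctant behavior: take as little as possible, expanding only as needed.
let reluctantTag = Regex {
    "<"
    Capture { OneOrMore(.any, .reluctant) }
    ">"
}

let eagerCapture = input.firstMatch(of: eagerTag)?.output.1
let reluctantCapture = input.firstMatch(of: reluctantTag)?.output.1
// eagerCapture spans up to the last ">", reluctantCapture stops at the first ">".
```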
OneOrMore and count-based Repeat are quantifiers that produce a new regex with the original capture types. Their Output type is Substring followed by the component's capture types. ZeroOrMore, Optionally, and range-based Repeat are quantifiers that produce a new regex with optional capture types. Their Output type is Substring followed by the component's capture types wrapped in Optional.
Quantifier | Component Output | Result Output |
---|---|---|
OneOrMore, Repeat(..., count: ...) | (WholeMatch, Capture...) | (Substring, Capture...) |
OneOrMore, Repeat(..., count: ...) | WholeMatch (non-tuple) | Substring |
ZeroOrMore, Optionally, Repeat(..., n...m) | (WholeMatch, Capture...) | (Substring, Capture?...) |
ZeroOrMore, Optionally, Repeat(..., n...m) | WholeMatch (non-tuple) | Substring |
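A short sketch of these Output rules follows. It assumes a Swift 5.7 or later toolchain with the shipped RegexBuilder module, whose spellings may differ from this pitch. The names signedInteger, negative, and positive are hypothetical.

```swift
import RegexBuilder

let signedInteger: Regex<(Substring, Substring?, Substring)> = Regex {
    // May match zero times, so its capture type is wrapped in Optional.
    Optionally { Capture { "-" } }
    // Always participates in a successful match, so it stays non-optional.
    Capture { OneOrMore(.digit) }
}

let negative = "-12".wholeMatch(of: signedInteger)
let positive = "7".wholeMatch(of: signedInteger)
// negative carries a sign capture; positive has nil in that position.
```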
Due to the lack of variadic generics, these initializers must be overloaded for every supported capture arity.
extension OneOrMore {
@_disfavoredOverload
public init<Component: RegexComponent>(
_ component: Component,
_ behavior: QuantificationBehavior = .eagerly
) where Output == Substring
@_disfavoredOverload
public init<Component: RegexComponent>(
_ behavior: QuantificationBehavior = .eagerly,
@RegexComponentBuilder _ component: () -> Component
) where Output == Substring
public init<W, C0, Component: RegexComponent>(
_ component: Component,
_ behavior: QuantificationBehavior = .eagerly
) where Output == (Substring, C0), Component.Output == (W, C0)
public init<W, C0, Component: RegexComponent>(
_ behavior: QuantificationBehavior = .eagerly,
@RegexComponentBuilder _ component: () -> Component
) where Output == (Substring, C0), Component.Output == (W, C0)
// ... `O(arity)` overloads
}
Due to the word limit in forum posts, the definitions of ZeroOrMore, Optionally, and Repeat have been omitted. See the Quantification section in the full proposal.
Capture and reference
Capture and TryCapture produce a new Regex by inserting the captured pattern's whole match (.0) into the .1 position of Output. When a transform closure is provided, the whole match of the captured content will be transformed using the closure.
public struct Capture<Output>: RegexComponent {
public var regex: Regex<Output> { get }
}
public struct TryCapture<Output>: RegexComponent {
public var regex: Regex<Output> { get }
}
The difference between Capture and TryCapture is that TryCapture works better with transform closures that can return nil or throw, whereas Capture relies on the user to handle errors within a transform closure. With TryCapture, when the closure returns nil or throws, the failure becomes a no-match.
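The failure-becomes-no-match behavior can be sketched as follows. This assumes a Swift 5.7 or later toolchain with the shipped RegexBuilder module, whose spellings may differ from this pitch. The names byte, inRange, and outOfRange are hypothetical.

```swift
import RegexBuilder

// The transform returns nil when the digits do not fit in a UInt8,
// which turns the capture into a failed match rather than an error.
let byte: Regex<(Substring, UInt8)> = Regex {
    TryCapture {
        OneOrMore(.digit)
    } transform: {
        UInt8($0)
    }
}

let inRange = "200".wholeMatch(of: byte)?.output.1   // transform succeeds
let outOfRange = "300".wholeMatch(of: byte)          // transform returns nil, so no match
```

Note that the capture type in Output is UInt8, not UInt8?: a nil from the transform never reaches the caller because the match as a whole fails.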
// Below are `Capture` and `TryCapture` initializer variants on capture arity 0.
// Higher capture arities are omitted for simplicity.
extension Capture {
public init<R: RegexComponent, W>(
_ component: R
) where Output == (Substring, W), R.Output == W
public init<R: RegexComponent, W>(
_ component: R, as reference: Reference<W>
) where Output == (Substring, W), R.Output == W
public init<R: RegexComponent, W, NewCapture>(
_ component: R,
transform: @escaping (Substring) -> NewCapture
) where Output == (Substring, NewCapture), R.Output == W
public init<R: RegexComponent, W, NewCapture>(
_ component: R,
as reference: Reference<NewCapture>,
transform: @escaping (Substring) -> NewCapture
) where Output == (Substring, NewCapture), R.Output == W
public init<R: RegexComponent, W>(
@RegexComponentBuilder _ component: () -> R
) where Output == (Substring, W), R.Output == W
public init<R: RegexComponent, W>(
as reference: Reference<W>,
@RegexComponentBuilder _ component: () -> R
) where Output == (Substring, W), R.Output == W
}
Due to the word limit in forum posts, the definition of TryCapture has been omitted. See the Capture and reference section in the full proposal.
Example:
let regex = Regex {
OneOrMore("a")
Capture {
TryCapture("b") { Int($0) }
ZeroOrMore {
TryCapture("c") { Double($0) }
}
Optionally("e")
}
}
Variants of Capture and TryCapture accept a Reference argument. References can be used to achieve named captures and named backreferences from textual regexes.
public struct Reference<Capture>: RegexComponent {
public init(_ captureType: Capture.Type = Capture.self)
public var regex: Regex<Capture>
}
When capturing some regex with a reference specified, the reference will refer to the most recently captured content. The reference itself can be used as a regex to match the most recently captured content. This API is also designed to work with match result lookup, which will be introduced in a different pitch.
// The capture type is spelled out because it cannot be inferred from a bare `Reference()`.
let a = Reference(Substring.self)
let b = Reference(Substring.self)
let regex = Regex {
    Capture("abc", as: a)
    Capture("def", as: b)
    a
    Capture(b)
}
// To be introduced in a different pitch:
if let result = input.firstMatch(of: regex) {
    print(result[a]) // => "abc"
    print(result[b]) // => "def"
}
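For a runnable variant of the idea above, the sketch below uses a reference both as a named capture and as a backreference. It assumes a Swift 5.7 or later toolchain with the shipped RegexBuilder module, where the reference-taking initializer is written Capture(as:) with a trailing builder closure and match results can be subscripted by reference; these spellings are assumptions about that toolchain, not declarations from this pitch. The names word and repeatedWord are hypothetical.

```swift
import RegexBuilder

let word = Reference(Substring.self)

let repeatedWord = Regex {
    Capture(as: word) {
        OneOrMore(.word)
    }
    " "
    word    // matches the same text that `word` most recently captured
}

let result = "hello hello world".firstMatch(of: repeatedWord)
let wholeMatch = result?.output.0   // the doubled word, including the space
let captured = result?[word]        // lookup by reference instead of tuple position
```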
A regex is considered invalid when it contains a use of a reference that is never captured in the regex. When this occurs in the regex builder DSL, a runtime error will be reported.
Subpattern
In textual regex, one can refer to a subpattern to avoid duplicating the subpattern, for example:
(you|I) say (goodbye|hello); (?1) say (?2)
The above regex is equivalent to
(you|I) say (goodbye|hello); (you|I) say (goodbye|hello)
With regex builder, there is no special API required to reuse existing subpatterns, as a subpattern can be defined modularly using a let binding inside or outside a regex builder closure.
Regex {
let subject = ChoiceOf {
"I"
"you"
}
let object = ChoiceOf {
"goodbye"
"hello"
}
subject
"say"
object
";"
subject
"say"
object
}
Sometimes, a textual regex may also use (?R) or (?0) to recursively evaluate the entire regex. For example, the following textual regex matches "I say you say I say you say hello".
(you|I) say (goodbye|hello|(?R))
For this, Regex offers a special initializer that allows its pattern to recursively reference itself. This is somewhat akin to a fixed-point combinator.
extension Regex {
    public init<R: RegexComponent>(
        @RegexComponentBuilder _ content: (Regex<Substring>) -> R
    ) where R.Output == Output
}
With this initializer, the above regex can be expressed as the following using regex builder.
Regex { wholeSentence in
ChoiceOf {
"I"
"you"
}
"say"
ChoiceOf {
"goodbye"
"hello"
wholeSentence
}
}
Source compatibility
Regex builder will be shipped in a new module named RegexBuilder, and thus will not affect the source compatibility of the existing code.
Effect on ABI stability
The proposed feature does not change the ABI of existing features.
Effect on API resilience
Due to word limit in forum posts, please click the link to visit this section in the full proposal.
Alternatives considered
Due to word limit in forum posts, please click the link to visit this section in the full proposal.