[Pitch] Regex builder DSL

rxwei · March 17, 2022, 12:49am

Full proposal: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexBuilderDSL.md

Regex builder DSL

Status: Pitch
Implementation: apple/swift-experimental-string-processing

Table of Contents

Introduction
Motivation
Proposed solution
Detailed design
Source compatibility
Effect on ABI stability
Effect on API resilience
Alternatives considered

Introduction

Declarative string processing aims to offer powerful pattern matching capabilities with expressivity, clarity, type safety, and ease of use. To achieve this, we propose to introduce a result-builder-based DSL, regex builder, for creating and composing regular expressions (regexes).

Regex builder is part of the Swift Standard Library but resides in a standalone module named RegexBuilder. By importing RegexBuilder, you get all necessary API for building a regex.

import RegexBuilder

let emailPattern = Regex {
  let word = OneOrMore(.word)
  Capture {
    ZeroOrMore {
      word
      "."
    }
    word
  }
  "@"
  Capture {
    word
    OneOrMore {
      "."
      word
    }
  }
} // => Regex<(Substring, Substring, Substring)>

let email = "My email is my.name@mail.swift.org."
if let match = email.firstMatch(of: emailPattern) {
  let (wholeMatch, name, domain) = match.output
  // wholeMatch: "My email is my.name@mail.swift.org."
  //       name: "my.name"
  //     domain: "mail.swift.org"
}

This proposal introduces all core API for creating and composing regexes that echos the textual regex syntax and strongly typed regex captures, but does not formally specify the matching semantics or define character classes.

Motivation

Regex is a fundemental and powerful tool for textual pattern matching. It is a domain-specific language often expressed as text. For example, given the following bank statement:

CREDIT    04062020    PayPal transfer    $4.99
CREDIT    04032020    Payroll            $69.73
DEBIT     04022020    ACH transfer       $38.25
DEBIT     03242020    IRS tax payment    $52249.98

One can write the follow textual regex to match each line:

(CREDIT|DEBIT)\s+(\d{2}\d{2}\d{4})\s+([\w\s]+\w)\s+(\$\d+\.\d{2})

While a regex like this is very compact and expressive, it is very difficult read, write and use:

Syntactic special characters, e.g. \, (, [, {, are too dense to be readable.
It contains a hierarchy of subpatterns fit into a single line of text.
No code completion when typing syntactic components.
Capturing groups produce raw data (i.e. a range or a substring) and can only be converted to other data structures after matching.
While comments (?#...) can be added inline, it only complicates readability.

Proposed solution

We introduce regex builder, a result-builder-based API for creating and composing regexes. This API resides in a new module named RegexBuilder that is to be shipped as part of the Swift toolchain.

With regex builder, the regex for matching a bank statement can be written as the following:

import RegexBuilder

enum TransactionKind: String {
   case credit = "CREDIT"
   case debit = "DEBIT"
}

struct Date {
  var month, day, year: Int
  init?(mmddyyyy: String) { ... }
}

struct Amount {
  var valueTimes100: Int
  init?(twoDecimalPlaces text: Substring) { ... }
}

let statementPattern = Regex {
  // Parse the transaction kind.
  TryCapture {
    ChoiceOf {
      "CREDIT"
      "DEBIT"
    }
  } transform: {
    TransactionKind(rawValue: String($0))
  }
  OneOrMore(.whitespace)
  // Parse the date, e.g. "01012021".
  TryCapture {
    Repeat(.digit, count: 2)
    Repeat(.digit, count: 2)
    Repeat(.digit, count: 4)
  } transform: { Date(mmddyyyy: $0) }
  OneOrMore(.whitespace)
  // Parse the transaction description, e.g. "ACH transfer".
  Capture {
    OneOrMore(.custom([
      .characterClass(.word),
      .characterClass(.whitespace)
    ]))
    CharacterClass.word
  } transform: { String($0) }
  OneOrMore(.whitespace)
  "$"
  // Parse the amount, e.g. `$100.00`.
  TryCapture {
    OneOrMore(.digit)
    "."
    Repeat(.digit, count: 2)
  } transform: { Amount(twoDecimalPlaces: $0) }
} // => Regex<(Substring, TransactionKind, Date, String, Amount)>


let statement = """
  CREDIT    04062020    PayPal transfer    $4.99
  CREDIT    04032020    Payroll            $69.73
  DEBIT     04022020    ACH transfer       $38.25
  DEBIT     03242020    IRS tax payment    $52249.98
  """
for match in statement.matches(of: statementPattern) {
  let (line, kind, date, description, amount) = match.output
  ...
}

Regex builder addresses all of textual regexes' shortcomings presented in the Motivation section:

Capture groups and quantifiers are expressed as API calls that are easy to read.
Scoping and indentations clearly distinguish subpatterns in the hierarchy.
Code completion is available when the developer types an API call.
Capturing groups can be transformed into structured data at the regex declaration site.
Normal code comments can be written within a regex declaration to further improve readability.

Detailed design

`RegexComponent` protocol

One of the goals of the regex builder DSL is allowing the developers to easily compose regexes from common currency types and literals, or even define custom patterns to use for matching. We introduce RegexComponent, a protocol that unifies all types that can represent a component of a regex.

public protocol RegexComponent {
  associatedtype Output
  @RegexComponentBuilder
  var regex: Regex<Output> { get }
}

By conforming standard library types to RegexComponent, we allow them to be used inside the regex builder DSL as a match target.

// A string represents a regex that matches the string.
extension String: RegexComponent {
  public var regex: Regex<Substring> { get }
}

// A substring represents a regex that matches the substring.
extension Substring: RegexComponent {
  public var regex: Regex<Substring> { get }
}

// A character represents a regex that matches the character.
extension Character: RegexComponent {
  public var regex: Regex<Substring> { get }
}

// A unicode scalar represents a regex that matches the scalar.
extension UnicodeScalar: RegexComponent {
  public var regex: Regex<Substring> { get }
}

// To be introduced in a future pitch.
extension CharacterClass: RegexComponent {
  public var regex: Regex<Substring> { get }
}

Since regexes are composable, the Regex type itself also conforms to RegexComponent.

extension Regex: RegexComponent {
  public var regex: Self { self }
}

All of the regex builder DSL in the rest of this pitch will accept generic components that conform to RegexComponent.

Concatenation

A regex can be viewed as a concatenation of smaller regexes. In the regex builder DSL, RegexComponentBuilder is the basic facility to allow developers to compose regexes by concatenation.

@resultBuilder
public enum RegexComponentBuilder { ... }

A closure marked with @RegexComponentBuilder will be transformed to produce a Regex by concatenating all of its components, where the result type's Output type will be a Substring followed by concatenated captures (tuple when plural).

Recap: Regex capturing basics

Regex is a generic type with generic parameter Output.

struct Regex<Output> { ... }

When a regex does not contain any capturing groups, its Output type is Substring, which represents the whole matched portion of the input.

let noCaptures = #/a/# // => Regex<Substring>

When a regex contains capturing groups, i.e. (...), the Output type is extended as a tuple to also contain capture types. Capture types are tuple elements after the first element.

//                           ________________________________
//                        .0 |                           .0 |
//                  ____________________                _________
let yesCaptures = #/a(?:(b+)c(d+))+e(f)?/# // => Regex<(Substring, Substring, Substring, Substring?)>
//                      ---- ----   ---                            ---------  ---------  ----------
//                    .1 | .2 |   .3 |                              .1 |       .2 |       .3 |
//                       |    |      |                                 |          |          |
//                       |    |      |_______________________________  |  ______  |  ________|
//                       |    |                                        |          |
//                       |    |______________________________________  |  ______  |
//                       |                                             |
//                       |_____________________________________________|
//                                                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                                                                          Capture types

We introduce a new initializer Regex.init(_:) which accepts a @RegexComponentBuilder closure. This initializer is the entry point for creating a regex using the regex builder DSL.

extension Regex {
  public init<R: RegexComponent>(
    @RegexComponentBuilder _ content: () -> R
  ) where R.Output == Output
}

Example:

Regex {
   regex0 // Regex<Substring>
   regex1 // Regex<(Substring, Int)>
   if condition {
     regex2 // Regex<(Substring, Float)>
   } else {
     regex3 // Regex<(Substring, Float)>
   }
} // Regex<(Substring, Int, Float)>

This regex will be transformed to:

Regex {
  let e0 = RegexComponentBuilder.buildExpression(regex0) // Component<Regex<Substring>>
  let e1 = RegexComponentBuilder.buildExpression(regex1) // Component<Regex<(Substring, Int)>>
  let e2: Regex<(Substring, Float)>
  if condition {
    let comp = RegexComponentBuilder.buildExpression(regex2) // Component<Regex<(Substring, Float)>>
    e2 = RegexComponentBuilder.buildEither(first: comp) // Regex<(Substring, Float)>
  } else {
    let comp = RegexComponentBuilder.buildExpression(regex3) // Component<Regex<(Substring, Float)>>
    e2 = RegexComponentBuilder.buildEither(first: comp) // Regex<(Substring, Float)>
  }
  let r0 = RegexComponentBuilder.buildPartialBlock(first: e0)
  let r1 = RegexComponentBuilder.buildPartialBlock(accumulated: r0, next: e1)
  let r2 = RegexComponentBuilder.buildPartialBlock(accumulated: r1, next: e2)
  return r2
} // Regex<(Substring, Int, Float)>

Basic methods in RegexComponentBuilder, e.g. buildBlock(), provides support for creating the most fundamental blocks. The buildExpression method wraps a user-provided component in a RegexComponentBuilder.Component structure, before passing the component to other builder methods. This is used for saving the source location of the component so that runtime errors can be reported with an accurate location.

@resultBuilder
public enum RegexComponentBuilder {
  /// Returns an empty regex.
  public static func buildBlock() -> Regex<Substring>

  /// A builder component that stores a regex component and its source location
  /// for debugging purposes.
  public struct Component<Value: RegexComponent> {
    public var value: Value
    public var file: String
    public var function: String
    public var line: Int
    public var column: Int
  }

  /// Returns a component by wrapping the component regex in `Component` and
  /// recording its source location.
  public static func buildExpression<R: RegexComponent>(
    _ regex: R, 
    file: String = #file, 
    function: String = #function, 
    line: Int = #line,
    column: Int = #column
  ) -> Component<R>
}

When it comes to concatenation, RegexComponentBuilder utilizes the recently proposed buildPartialBlock feature to be able to concatenate all components' capture types to a single result tuple. buildPartialBlock(first:) provides support for creating a regex from a single component, and buildPartialBlock(accumulated:next:) support for creating a regex from multiple results.

Before Swift supports variadic generics, buildPartialBlock(first:) and buildPartialBlock(accumulated:next:) must be overloaded to support concatenating regexes of supported capture quantities (arities).

buildPartialBlock(first:) is overloaded arity times such that a unary block with a component of any supported capture arity will produce a regex with capture type Substring followed by the component's capture types. The base overload, buildPartialBlock<R>(first:) -> Regex<Substring>, must be marked with @_disfavoredOverload to prevent it from shadowing other overloads.
buildPartialBlock(accumulated:next:) is overloaded up to arity^2 times to account for all possible pairs of regexes that make up 10 captures.

In the initial version of the DSL, we plan to support regexes with up to 10 captures, as 10 captures are sufficient for most use cases. These overloads can be superceded by variadic versions of buildPartialBlock(first:) and buildPartialBlock(accumulated:next:) in a future release.

extension RegexComponentBuilder {
  @_disfavoredOverload
  public static func buildPartialBlock<R: RegexComponent>(
    first r: Component<R>
  ) -> Regex<Substring>

  public static func buildPartialBlock<W, C0, R: RegexComponent>(
    first r: Component<R>
  ) -> Regex<(Substring, C0)> where R.Output == (W, C0)

  // ... `O(arity)` overloads of `buildPartialBlock(first:)`
  
  public static func buildPartialBlock<W0, W1, C0, R0: RegexComponent, R1: RegexComponent>(
    accumulated: R0, next: Component<R1>
  ) -> Regex<(Substring, C0)>  where R0.Output == W0, R1.Output == (W1, C0)
  
  public static func buildPartialBlock<W0, W1, C0, C1, R0: RegexComponent, R1: RegexComponent>(
    accumulated: R0, next: Component<R1>
  ) -> Regex<(Substring, C0, C1)>  where R0.Output == W0, R1.Output == (W1, C0, C1)

  // ... `O(arity^2)` overloads of `buildPartialBlock(accumulated:next:)`
}

To support if statements, buildEither(first:), buildEither(second:) and buildOptional(_:) are defined with overloads to support up to 10 captures because each capture type needs to be transformed to an optional. The overload for non-capturing regexes, due to the lack of generic constraints, must be annotated with @_disfavoredOverload in order not shadow other overloads. We expect that a variadic-generic version of this method will eventually superseded all of these overloads.

Due to word limit in forum posts, the definition of buildOptional(_:) has been omitted. See the Concatenation section in the full proposal.

To support if #available(...) statements, buildLimitedAvailability(_:) is defined with overloads to support up to 10 captures. Similar to buildOptional, the overload for non-capturing regexes must be annotated with @_disfavoredOverload.

Due to word limit in forum posts, the definition of buildLimitedAvailability(_:) has been omitted. See the Concatenation section in the full proposal.

Alternation

Alternations are used to match one of multiple patterns. An alternation wraps its underlying patterns' capture types in an Optional and concatenates them together, first to last.

let choice = ChoiceOf {
  regex1 // Regex<(Substring, Int)>
  regex2 // Regex<(Substring, Float)>
  regex3 // Regex<(Substring, Substring)>
  regex0 // Regex<Substring>
} // => Regex<(Substring, Int?, Float?, Substring?)>

AlternationBuilder is a result builder type for creating alternations from components of a block.

@resultBuilder
public struct AlternationBuilder { ... }

To the developer, the top-level API is a type named ChoiceOf. This type has an initializer that accepts an @AlternationBuilder closure.

public struct ChoiceOf<Output>: RegexComponent {
  public var regex: Regex<Output> { get }
  public init<R: RegexComponent>(
    @AlternationBuilder builder: () -> R
  ) where R.Output == Output
}

AlternationBuilder is mostly similar to RegexComponent with the following distinctions:

Empty blocks are not supported.
Capture types are wrapped in a layer of Optional before being concatenated in the resulting Output type.
buildEither(first:) and buildEither(second:) are overloaded for each supported capture arity because they need to wrap capture types in Optional.

@resultBuilder
public enum AlternationBuilder {
  public typealias Component<Value> = RegexComponentBuilder.Component<Value>

  /// Returns a component by wrapping the component regex in `Component` and
  /// recording its source location.
  public static func buildExpression<R: RegexComponent>(
    _ regex: R, 
    file: String = #file, 
    function: String = #function, 
    line: Int = #line,
    column: Int = #column
  ) -> Component<R>

  @_disfavoredOverload
  public static func buildPartialBlock<R: RegexComponent>(
    first r: Component<R>
  ) -> Regex<Substring>

  public static func buildPartialBlock<W, C0, R: RegexComponent>(
    first r: Component<R>
  ) -> Regex<(Substring, C0?)> where R.Output == (W, C0)

  // ... `O(arity)` overloads of `buildPartialBlock(first:)`
  
  public static func buildPartialBlock<W0, W1, C0, R0: RegexComponent, R1: RegexComponent>(
    accumulated: R0, next: Component<R1>
  ) -> Regex<(Substring, C0?)>  where R0.Output == W0, R1.Output == (W1, C0)
  
  public static func buildPartialBlock<W0, W1, C0, C1, R0: RegexComponent, R1: RegexComponent>(
    accumulated: R0, next: Component<R1>
  ) -> Regex<(Substring, C0?, C1?)>  where R0.Output == W0, R1.Output == (W1, C0, C1)

  // ... `O(arity^2)` overloads of `buildPartialBlock(accumulated:next:)`
}

Due to word limit in forum posts, the definitions of buildEither(first:), buildEither(second:), buildOptional(_:) and buildLimitedAvailability(_:) have been omitted. See the Alternation section in the full proposal.

Quantification

Quantifiers are free functions that take a regex or a @RegexComponentBuilder closure that produces a regex. The result is a regex whose Output type is the same as the argument's, when the lower bound of quantification is greater than 0; otherwise, it is an Optional thereof.

Quantifiers are generic types that can be created from a regex component. Their Output type is inferred from initializers. Each of these types corresponds to a quantifier in the textual regex.

Quantifier in regex builder	Quantifier in textual regex
`OneOrMore(...)`	`...+`
`ZeroOrMore(...)`	`...*`
`Optionally(...)`	`...?`
`Repeat(..., count: n)`	`...{n}`
`Repeat(..., n...)`	`...{n,}`
`Repeat(..., n...m)`	`...{n,m}`

public struct OneOrMore<Output>: RegexComponent {
  public var regex: Regex<Output> { get }
}

public struct ZeroOrMore<Output>: RegexComponent {
  public var regex: Regex<Output> { get }
}

public struct Optionally<Output>: RegexComponent {
  public var regex: Regex<Output> { get }
}

public struct Repeat<Output>: RegexComponent {
  public var regex: Regex<Output> { get }
}

Like quantifiers in textual regexes, the developer can specify how eager the pattern should be matched against using QuantificationBehavior. Static properties in QuantificationBehavior are named like adverbs for fluency at a quantifier call site.

/// Specifies how much to attempt to match when using a quantifier.
public struct QuantificationBehavior {
  /// Match as much of the input string as possible, backtracking when
  /// necessary.
  public static var eagerly: QuantificationBehavior { get }
  
  /// Match as little of the input string as possible, expanding the matched
  /// region as necessary to complete a match.
  public static var reluctantly: QuantificationBehavior { get }
  
  /// Match as much of the input string as possible, performing no backtracking.
  public static var possessively: QuantificationBehavior { get }
}

Each quantification behavior corresponds to a quantification behavior in the textual regex.

Quantifier behavior in regex builder	Quantifier behavior in textual regex
`.eagerly`	no suffix
`.reluctantly`	suffix `?`
`.possessively`	suffix `+`

OneOrMore and count-based Repeat are quantifiers that produce a new regex with the original capture types. Their Output type is Substring followed by the component's capture types. ZeroOrMore, Optionally, and range-based Repeat are quantifiers that produce a new regex with optional capture types. Their Output type is Substring followed by the component's capture types wrapped in Optional.

Quantifier	Component `Output`	Result `Output`
`OneOrMore` `Repeat(..., count: ...)`	`(WholeMatch, Capture...)`	`(Substring, Capture...)`
`OneOrMore` `Repeat(..., count: ...)`	`WholeMatch` (non-tuple)	`Substring`
`ZeroOrMore` `Optionally` `Repeat(..., n...m)`	`(WholeMatch, Capture...)`	`(Substring, Capture?...)`
`ZeroOrMore` `Optionally` `Repeat(..., n...m)`	`WholeMatch` (non-tuple)	`Substring`

Due to the lack of variadic generics, these functions must be overloaded for every supported capture arity.

extension OneOrMore {
  @_disfavoredOverload
  public init<Component: RegexComponent>(
    _ component: Component,
    _ behavior: QuantificationBehavior = .eagerly
  ) where Output == Substring
  
  @_disfavoredOverload
  public init<Component: RegexComponent>(
    _ behavior: QuantificationBehavior = .eagerly,
    @RegexComponentBuilder _ component: () -> Component
  ) where Output == Substring
  
  public init<W, C0, Component: RegexComponent>(
    _ component: Component,
    _ behavior: QuantificationBehavior = .eagerly
  ) where Output == (Substring, C0), Component.Output == (W, C0)
  
  public init<W, C0, Component: RegexComponent>(
    _ behavior: QuantificationBehavior = .eagerly,
    @RegexComponentBuilder _ component: () -> Component
  ) where Output == (Substring, C0), Component.Output == (W, C0)
  
  // ... `O(arity)` overloads
}

Due to word limit in forum posts, the definitions of ZeroOrMore, Optionally, and Repeat have been omitted. See the Quantification section in the full proposal.

Capture and reference

Capture and TryCapture produce a new Regex by inserting the captured pattern's whole match (.0) to the .1 position of Output. When a transform closure is provided, the whole match of the captured content will be transformed to using the closure.

public struct Capture<Output>: RegexComponent {
  public var regex: Regex<Output> { get }
}

public struct TryCapture<Output>: RegexComponent {
  public var regex: Regex<Output> { get }
}

The difference between Capture and TryCapture is that TryCapture works better with transform closures that can return nil or throw, whereas Capture relies on the user to handle errors within a transform closure. With TryCapture, when the closure returns nil or throws, the failure becomes a no-match.

// Below are `Capture` and `TryCapture` initializer variants on capture arity 0.
// Higher capture arities are omitted for simplicity.
  
extension Capture {
  public init<R: RegexComponent, W>(
    _ component: R
  ) where Output == (Substring, W), R.Output == W
  
  public init<R: RegexComponent, W>(
    _ component: R, as reference: Reference<W>
  ) where Output == (Substring, W), R.Output == W
  
  public init<R: RegexComponent, W, NewCapture>(
    _ component: R,
    transform: @escaping (Substring) -> NewCapture
  ) where Output == (Substring, NewCapture), R.Output == W
  
  public init<R: RegexComponent, W, NewCapture>(
    _ component: R,
    as reference: Reference<NewCapture>,
    transform: @escaping (Substring) -> NewCapture
  ) where Output == (Substring, NewCapture), R.Output == W
  
  public init<R: RegexComponent, W>(
    @RegexComponentBuilder _ component: () -> R
  ) where Output == (Substring, W), R.Output == W
  
  public init<R: RegexComponent, W>(
    as reference: Reference<W>,
    @RegexComponentBuilder _ component: () -> R
  ) where Output == (Substring, W), R.Output == W
}

Due to word limit in forum posts, the definition of TryCapture has been omitted. See the Capture and reference section in the full proposal.

Example:

let regex = Regex {
  OneOrMore("a")
  Capture {
    TryCapture("b") { Int($0) }
    ZeroOrMore {
      TryCapture("c") { Double($0) }
    }
    Optionally("e")
  }
}

Variants of Capture and TryCapture accept a Reference argument. References can be used to achieve named captures and named backreferences from textual regexes.

public struct Reference<Capture>: RegexComponent {
  public init(_ captureType: Capture.Type = Capture.self)
  public var regex: Regex<Capture>
}

When capturing some regex with a reference specified, the reference will refer to the most recently captured content. The reference itself can be used as a regex to match the most recently captured content. This API is also designed to work with match result lookup, which will be introduced in a different pitch.

let a = Reference()
let b = Reference()
let regex = Regex {
  Capture("abc", as: a)
  Capture("def", as: b)
  a
  Capture(b)
}

// To be introduced in a different pitch:
if let result = input.firstMatch(of: regex) {
  print(result[a]) // => "abc"
  print(result[b]) // => "def"
}

A regex is considered invalid when it contains a use of reference without it ever being captured in the regex. When this occurs in the regex builder DSL, an runtime error will be reported.

Subpattern

In textual regex, one can refer to a subpattern to avoid duplicating the subpattern, for example:

(you|I) say (goodbye|hello); (?1) say (?2)

The above regex is equivalent to

(you|I) say (goodbye|hello); (you|I) say (goodbye|hello)

With regex builder, there is no special API required to reuse existing subpatterns, as a subpattern can be defined modularly using a let binding inside or outside a regex builder closure.

Regex {
   let subject = ChoiceOf {
     "I"
     "you"
   }
   let object = ChoiceOf {
     "goodbye"
     "hello"
   }
   subject
   "say"
   object
   ";"
   subject
   "say"
   object
}

Sometimes, a textual regex may also use (?R) or (?0) to recusively evaluate the entire regex. For example, the following textual regex matches "I say you say I say you say hello".

(you|I) say (goodbye|hello|(?R))

For this, Regex offers a special initializer that allows its pattern to recursively reference itself. This is somewhat akin to a fixed-point combinator.

extension Regex {
  public init<R: RegexComponent>(
    @RegexComponentBuilder _ content: (Regex<Substring>) -> R
  ) where R.Output == Match
}

With this initializer, the above regex can be expressed as the following using regex builder.

Regex { wholeSentence in
  ChoiceOf {
   "I"
   "you"
  }
  "say"
  ChoiceOf {
    "goodbye"
    "hello"
    wholeSentence
  }
}

Source compatibility

Regex builder will be shipped in a new module named RegexBuilder, and thus will not affect the source compatibility of the existing code.

Effect on ABI stability

The proposed feature does not change the ABI of existing features.

Effect on API resilience

Due to word limit in forum posts, please click the link to visit this section in the full proposal.

Alternatives considered

Due to word limit in forum posts, please click the link to visit this section in the full proposal.

Full proposal: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexBuilderDSL.md

Jeehut · March 17, 2022, 2:50am

Thank you for the proposal draft. I love the idea of making regular expressions more readable. And while I didn't have the time to review your proposal in more detail, I would like to mention a project where – in my opinion – some interesting groundwork has been already laid to explore this area right away – it's swift-parsing from Point-Free.

In particular, see the thread I started on the GitHub Discussions site here:

Playing around with the library and its limitations could help inform the shape of this proposal or the naming of some types in the implementation. I will take a deeper look another time, but looks good on first sight!

jayton · March 17, 2022, 12:07pm

My first thought on reading this was, “why can’t I just use try here?”. On further inspection, both cases in the example will return nil on failure (although TryCapture handles throwing matchers as well).

Still, one could imagine a result builder method buildThrowing(_ component: (Component) throws -> Void) -> Component, which could be used to handle builders that turn failures into error states, retries etc. Using try in such cases would be ergonomic, but perhaps too magical if it’s used to treat nils as failures.

This is surprisingly clunky; one would like to write:

OneOrMore {
    CharacterClass.word  // Ideally, result builders would be enhanced so we don’t need CharacterClass here
    CharacterClass.whitespace
}

rxwei:

Regex {
  let e0 = RegexComponentBuilder.buildExpression(regex0) // Regex<Substring>
  let e1 = RegexComponentBuilder.buildExpression(regex1) // Regex<(Substring, Int)>
  let e2: Regex<Substring>
  if condition {
    e2 = RegexComponentBuilder.buildEither(first: regex2) // Regex<(Substring, Float)>
  } else {
    e2 = RegexComponentBuilder.buildEither(second: regex3) // Regex<(Substring, Substring)>
  }
  let r0 = RegexComponentBuilder.buildPartialBlock(first: e0)
  let r1 = RegexComponentBuilder.buildPartialBlock(accumulated: r0, next: r1)
  let r2 = RegexComponentBuilder.buildPartialBlock(accumulated: r1, next: r2)
  return r2
} // Regex<(Substring, Int, Float, Substring)>

I don’t understand this example. First, I assume that towards thee end next: r1 and next: r2 should be referring to e1 and e2 (with buildExpression elided). But e2 is of type Regex<Substring>, and then has both Regex<(Substring, Float)> and Regex<(Substring, Substring)> assigned to it, and then this Float and second Substring are somehow magically extracted from it?

The non-optionality of the Float and Substring types is also troubling, is this correct? Does that mean that all matches are optional? I would have expected to see something like Regex<(Substring, Int, Float?, Substring?)> (or, better, Regex<(Substring, Int, Either<Float, Substring>)>).

hisekaldma · March 17, 2022, 12:13pm

This jumped out at me when skimming through as well. If the Float and Substring matches are indeed non-optional, it would be great to explain in the text why it’s necessary and how it works in practice.

Michael_Ilseman · March 17, 2022, 1:02pm

More specifically, returning nil signals a local match failure and the engine backtracks to try an alternative. throwing aborts pattern matching.

What to do with custom abort errors and how to surface them throughout API is an open question. CC @itingliu. I opened API design: Custom abort errors and API threading.

jayton:

This is surprisingly clunky; one would like to write:

OneOrMore {
    CharacterClass.word  // Ideally, result builders would be enhanced so we don’t need CharacterClass here
    CharacterClass.whitespace
}

That would be the equivalent of (?:\w\s)+, that is a concatenation of the two classes. Instead, you'd probably want to say:

OneOrMore {
  CharacterClass {
    CharacterClass.word
    CharacterClass.whitespace
  }
}

And we have to use the explicit CharacterClass before the .word etc. because of limitations of result builder syntax. I believe @rxwei was showing a formulation that doesn't have that because it's using the initializers directly. (It's possible that we could omit the .characterClass part for this formulation, though).

That listing is not building a ChoiceOf (i.e. an alternation or ordered-choice of alternatives), i.e. the if statement gets resolved at result builder construction-time. I'm not sure how that type signature follows, though. @rxwei?

sky · March 17, 2022, 1:06pm

Thanks for this! It's been great to see the string processing repo grow over the last year-ish. My main question is, while I'd love to have this in the standard library, what are the motivations to having this in the stdlib as opposed to a package? There's as much precedent in other language ecosystems for pattern-matching capabilities being a separate package as there is built-in to the standard library, so I'd love to hear what your thoughts are.

Dante-Broggi · March 17, 2022, 1:24pm

While I like the API in general, I was wondering about the possibility of extending this syntax to non-String Collections and custom Element types.

Max_Desiatov · March 17, 2022, 3:33pm

Strong +1 to this, I'm hoping this would be available on [UInt8], which is a natural way to parse binary formats.

mtsrodrigues · March 17, 2022, 4:18pm

@rxwei Will it be possible to declare a regex pattern like this?

struct Pattern: RegexComponent {
    
    var regex: some RegexComponent {
        Regex {
            Capture {
                OneOrMore(.word)
            }
            "."
            Capture {
                OneOrMore(.word)
            }
        }
    }
    
}

Paul_Cantrell · March 17, 2022, 6:32pm

While I hope to have a chance to review the proposal in detail, I just want to say quickly while I have the time: nice work! This is a really cool proposal, and I’m delighted to see people exploring further into creating eDSLs like this in Swift.

John_McCall · March 17, 2022, 6:47pm

Hmm, this is an interesting choice. So, given a block like this:

  if <1> {
   // captures Int
  } else if <2> {
   // captures Float
  } else if <3> {
   // captures nothing
  } else {
   // captures (String, Double)
  }

I assume the overall capture is one of these:

(Int?, Float?, (String, Double)?)
(Int?, Float?, Void?, (String, Double)?)
(Int?, Float?, String?, Double?)
(Int?, Float?, Void?, String?, Double?)

Abstractly, it seems like it would be better if we could produce a OneOf<Int, Float, Void, (String, Double)>, but of course that would require both variadic generics and some unclear way of testing and projecting out the different components.

Morten_Bek_Ditlevsen · March 17, 2022, 7:15pm

I also think that a tuple with optional values is not ideal in Swift.

Has the idea of anonymous sum types - as an analog to tuples ever been explored?

Like (Int | String) which could perhaps be tested like:

let a: (Int | String) = 1
switch a {
  case .0(let value):
    () // value is an Int
  case .1(let value):
    () // value is a String
}

EDITED: Changed $0 to .0 and $1 to .1 in strawman syntax.

rxwei · March 18, 2022, 12:57am

I agree it looks clunky and could use some sugar. Note that this proposal does not introduce CharacterClass, so this is an out-of-scope example, and the clunky example uses what works in the existing implementation. CharacterClass will be introduced soon by @nnnnnnnn in a separate pitch. I imagine a simplified version could be something like:

OneOrMore {
    CharacterClass(.word, .whitespace)
}

I made an error in the example at the top of the detailed design section. buildEither(first:) and buildEither(second:) currently return non-optionals, and therefore require all branches to have the same return type. Here's the fixed example:

Regex {
   regex0 // Regex<Substring>
   regex1 // Regex<(Substring, Int)>
   if condition {
     regex2 // Regex<(Substring, Float)>
   } else {
     regex3 // Regex<(Substring, Float)>
   }
} // Regex<(Substring, Int, Float)>

Regex {
  let e0 = RegexComponentBuilder.buildExpression(regex0) // Component<Regex<Substring>>
  let e1 = RegexComponentBuilder.buildExpression(regex1) // Component<Regex<(Substring, Int)>>
  let e2: Regex<(Substring, Float)>
  if condition {
    let comp = RegexComponentBuilder.buildExpression(regex2) // Component<Regex<(Substring, Float)>>
    e2 = RegexComponentBuilder.buildEither(first: comp) // Regex<(Substring, Float)>
  } else {
    let comp = RegexComponentBuilder.buildExpression(regex3) // Component<Regex<(Substring, Float)>>
    e2 = RegexComponentBuilder.buildEither(first: comp) // Regex<(Substring, Float)>
  }
  let r0 = RegexComponentBuilder.buildPartialBlock(first: e0)
  let r1 = RegexComponentBuilder.buildPartialBlock(accumulated: r0, next: e1)
  let r2 = RegexComponentBuilder.buildPartialBlock(accumulated: r1, next: e2)
  return r2
} // Regex<(Substring, Int, Float)>

I'm not quite sure if builder support is relevant for this. The way Capture(..., transform: ...) and TryCapture(..., transform: ...) work is by storing the failable/throwing closure inside the underlying regex program, and the closure will not be called until matching. The failable/throwing closure is not a component of the builder.

As for the naming of Capture and TryCapture, we could unify the both capture operations under Capture with different argument labels, e.g. transform: and tryTransform:. Would love to hear some opinions on this.

Yeah, in @John_McCall's example, we could have a tuple of optionals, but it may be confused with the result of ChoiceOf. An anonymous sum type or a sum type with variadic cases would be nice. Currently all branches are expected to have the same type.

Yes, and the protocol requirement var regex has already been marked with @RegexComponentBuilder. However the type of var regex would not be some RegexComponent, because the protocol requires a Regex<Output> type. Plus, a big difference between a regex and a SwiftUI view is that a user actually wants to access the regex output (Output) with non-opaque types. A working example would be:

struct Pattern: RegexComponent {
    var regex: Regex<(Substring, Substring, Substring)> {
         Capture {
             OneOrMore(.word)
         }
         "."
         Capture {
             OneOrMore(.word)
         }
    }
}

I did wonder if it would be possible to write var regex: Regex<_> and have the Output inferred, but @hborla reminded me that it would mean return type inference and complicate type checking.

rxwei · March 18, 2022, 1:15am

In the fullness of time, we want to have a generic Pattern type with a builder DSL like this as shown in Declarative String Processing Overview to support general-purpose data processing. Regex is a scoped-down version of that, but aims to offer better familiarity and adopt regex conventions, e.g. to have .0 refer to the whole match followed by captures. Regex could conform to a future PatternComponent protocol and be used in a pattern builder block.

Michael_Ilseman · March 18, 2022, 1:20am

I would add that regexes make for a very poor data processing solution, e.g. due to their use of unrestricted backtracking. Not to mention that a huge swath of concerns are inherently string-specific, such as character classes, literals where elements are separated by adjacency, etc.

Far better would be a data pipeline system comprised of combinations of linear automata that run over a moving window (even if dynamically resizable) over asynchronous sources. This is clearly future work .

John_McCall · March 18, 2022, 1:36am

I know this is a broader question than this pitch, but isn’t unrestricted backtracking a major problem even for string processing? What controls does the programmer have to avoid it under this model?

Michael_Ilseman · March 18, 2022, 1:40am

We support the typical: possessive quantification and atomic grouping. There's options to switch default to possessive, etc., which would bring regex in line with vanilla PEGs.

Future work include supporting more backtracking engine directives (ala PCRE).

I think adding engine limiters (future work) is particularly interesting, given that pathological behavior is often a function of what the input string ends up being. Given the matching engine is built for something more general purpose (originally PEGs), coming up with a good limiter story would help inform other domains.

John_McCall · March 18, 2022, 2:51am

Okay. I can understand why we might not want to change the default for regexp literals, since people will expect to be able to copy things like .*, from other sources, but should we consider that for these alternative forms and require backtracking to be more opt-in?

Morten_Bek_Ditlevsen · March 18, 2022, 7:16am

Regarding the naming - the argument of the map() operator is called transform, but at the call site we rarely see it. So when I read the named argument transform I immediately thought that I would have an easier time reading it if it was actually a map instead.

Would it be possible to make map a modifier to Capture?
Like:

Regex {
  Capture {
     OneOrMore(.word)
  }
  .map(User.init)
}

That would be the non-throwing scenario.

I don't know if such a spelling would be possible, but I think that I would prefer reading that over a transform argument.

Morten_Bek_Ditlevsen · March 18, 2022, 8:09am

Regarding throwing or non-throwing transforms:
Couldn’t this be implicit?
If the transform throws the parsing fails, if it can’t throw then the transform simply can’t be the source for failing?
I don’t think it would be surprising to the user even if the throwing and non-throwing APIs were the same.