[Pitch] Strongly Typed Regex Captures

rxwei · November 10, 2021, 12:02am

Strongly Typed Regex Captures

See latest revision below

Original version:

Introduction

Capturing groups are a commonly used component of regular expressions as they allow the programmer to extract information from matched input. A capturing group collects multiple characters together as a single unit that can be backreferenced within the regular expression and accessed in the result of a successful match. For example, the following regular expression contains the capturing groups (cd*) and (ef).

// Literal version.
let regex = /ab(cd*)(ef)gh/
// => `Regex<(Substring, Substring)>`

// Equivalent result builder syntax:
//     let regex = Pattern {
//         "ab"
//         Group {
//             "c"
//             Repeat("d")
//         }.capture()
//         "ef".capture()
//         "gh"
//     }

if let match = "abcddddefgh".firstMatch(of: regex) {
    print(match) // => ("abcddddefgh", "cdddd", "ef")
}

We introduce a generic type Regex<Captures>, which treats the type of captures as part of a regular expression's type information for type safety and ease of use. As we explore a fundamental design aspect of the regular expression feature, this pitch discusses the following topics.

A type definition of the generic type Regex<Captures> and firstMatch method.
Capture type inference and composition in regular expression literals and the forthcoming result builder syntax.
New language features which this design may require.

This focus of this pitch is the structural properties of capture types and how regular expression patterns compose to form new capture types. The semantics of string matching, its effect on the capture types (i.e. UnicodeScalarView or Substring), the result builder syntax, or the literal syntax will be discussed in future pitches.

For background on Declarative String Processing, see related topics:

Motivation

Across a variety of programming languages, many established regular expression libraries present captures as a collection of captured content to the caller upon a successful match [1][2]. However, to know the structure of captured contents, programmers often need to carefully read the regular expression or run the regular expression on some input to find out. Because regular expressions are oftentimes statically available in the source code, there is a missed opportunity to use generics to present captures as part of type information to the programmer, and to leverage the compiler to infer the type of captures based on a regular expression literal. As we propose to introduce declarative string processing capabilities to the language and the Standard Library, we would like to explore a type-safe approach to regular expression captures.

Proposed solution

We introduce a generic structure Regex<Captures> whose generic parameter Captures denotes the type of the captured content of such a regular expression. With a single generic parameter Captures, we make use of tuples to represent multiple and nested captures.

let regex = /ab(cd*)(ef)gh/
// => Regex<(Substring, Substring)>
if let match = "abcddddefgh".firstMatch(of: regex) {
  print(match) // => ("abcddddefgh", "cdddd", "ef")
}

During type inference for regular expression literals, the compiler infers the capture type based on the regular expression's content. Same for the result builder syntax, except that the type inference rules are expressed as method declarations in the result builder type.

Detailed design

`Regex` type

Regex is a structure that represents a regular expression. Regex is generic over an unconstrained generic parameter Captures. Upon a regex match, the captured values are available as part of the result.

public struct Regex<Captures>: RegexProtocol, ExpressibleByRegexLiteral {
    ...
}

`firstMatch` method

The firstMatch method returns a Substring of the first match of the provided regex in the string, or nil if there are no matches. If the provided regex contains captures, the result is a tuple of the match and the flattened capture type (described more below).

extension String {
    public func firstMatch<R: RegexProtocol, C...>(of regex: R)
        -> (Substring, C...)? where R.Captures == (C...)
}

// Expands to:
//     extension String {
//         func firstMatch<R: RegexProtocol>(of regex: R)
//             -> Substring? where R.Captures == ()
//         func firstMatch<R: RegexProtocol, C1>(of regex: R)
//             -> (Substring, C1)? where R.Captures == (C1)
//         func firstMatch<R: RegexProtocol, C1, C2>(of regex: R)
//             -> (Substring, C1, C2)? where R.Captures == (C1, C2)
//         ...
//     }

This signature is consistent with traditional regex backreference numbering. The numbering of backreferences to captures starts at \1 (with \0 refering to the entire match). Flattening the match and captures into the same tuple aligns the tuple index numbering of the result with the regex backreference numbering:

let scalarRangePattern = /([0-9A-F]+)(?:\.\.([0-9A-F]+))?/
// Result tuple index:  0 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                      1 ^~~~~~~~~~~     2 ^~~~~~~~~~~
if let match = line.firstMatch(of: scalarRangePattern) {
    print(match.0, match.1, match.2) // => 007F..009F, 007F, 009F
}

Capture type

In this section, we dive into capture types for regular expression patterns and how they compose.

By default, a regular expression literal has type Regex. Its generic argument Captures is its capture type.

Basics

Regular expressions without any capturing groups have type Regex<Void>, for example:

let identifier = /[_a-zA-Z]+[_a-zA-Z0-9]*/  // => `Regex<Void>`

// Equivalent result builder syntax:
//     let identifier = Pattern {
//         OneOrMore(/[_a-zA-Z]/)
//         Repeat(/[_a-zA-Z0-9]/)
//     }

Capturing group: `(...)`

In regular expression literals, a capturing group saves the portion of the input matched by its contained pattern. A capturing group's capture type is Substring.

let graphemeBreakLowerBound = /([0-9a-fA-F]+)/ // => `Regex<Substring>`

// Equivalent result builder syntax:
//     let graphemeBreakLowerBound = OneOrMore(.hexDigit).capture()

Concatenation: `abc`

Concatenating a sequence of patterns, r0, r1, r2, ..., will cause the resulting capture type to reflect the concatenated capture type, represented as a tuple of capture types or a single capture type depending on the overall quantity of captures in r0, r1, r2, ... If the overall capture quantity is 1, the resulting capture type is the capture type of the single pattern that has a capture; otherwise, the resulting capture type is a tuple of capture types of all patterns that have a capture.

let graphemeBreakLowerBound = /([0-9a-fA-F]+)\.\.[0-9a-fA-F]+/
// => `Regex<Substring>`

// Equivalent result builder syntax:
//     let graphemeBreakLowerBound = Pattern {
//         OneOrMore(.hexDigit).capture()
//         ".."
//         OneOrMore(.hexDigit)
//     }

let graphemeBreakRange = /([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)/
// => `Regex<(Substring, Substring)>`

// Equivalent result builder syntax:
//     let graphemeBreakRange = Pattern {
//         OneOrMore(.hexDigit).capture()
//         ".."
//         OneOrMore(.hexDigit).capture()
//     }

Named capturing group: `(?<name>...)`

A named capturing group in a pattern with multiple captures causes the resulting tuple to have a tuple element label at the corresponding capture type position. When the pattern has only one capture, there will be no tuple element label because there are no 1-element tuples.

let graphemeBreakLowerBound = /(?<lower>[0-9A-F]+)\.\.[0-9A-F]+/
// => `Regex<Substring>`

let graphemeBreakRange = /(?<lower>[0-9A-F]+)\.\.(?<upper>[0-9A-F]+)/
// => `Regex<(lower: Substring, upper: Substring)>`

Non-capturing group: `(?:...)`

A non-capturing group's capture type is the capture type of its underlying pattern. That is, it does not capture anything by itself, but transparently propagates its underlying pattern's captures.

let graphemeBreakLowerBound = /([0-9A-F]+)(?:\.\.([0-9A-F]+))?/
// => `Regex<(Substring, Substring?)>`

// Equivalent result builder syntax:
//     let graphemeBreakLowerBound = Pattern {
//         OneOrMore(.hexDigit).capture()
//         Optionally {
//             ".."
//             OneOrMore(.hexDigit).capture()
//         }
//     }

Nested capturing group: `(...(...)...)`

When capturing group is nested within another capturing group, they count as two distinct captures in the order their left parenthesis first appears in the regular expression literal. This is consistent with PCRE and allows us to use backreferences (e.g. \2) with linear indices.

let graphemeBreakPropertyData = /(([0-9A-F]+)(\.\.([0-9A-F]+)))\s*;\s(\w+).*/
// Positions in result tuple:  1 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~    5 ^~~~~
//                                         3 ^~~~~~~~~~~~~~~~~
//                              2 ^~~~~~~~~~~   4 ^~~~~~~~~~~
// => `Regex<(Substring, Substring, Substring, Substring, Substring)>`

// Equivalent result builder syntax:
//     let graphemeBreakPropertyData = Pattern {
//         Group {
//             OneOrMore(.hexDigit).capture() // (2)
//             Group {
//                 ".."
//                 OneOrMore(.hexDigit).capture() // (4)
//             }.capture() // (3)
//         }.capture() // (1)
//         Repeat(.whitespace)
//         ";"
//         CharacterClass.whitespace
//         OneOrMore(.word).capture() // (5)
//         Repeat(.any)
//     }

let input = "007F..009F   ; Control"
// Match result for `input`:
// ("007F..009F   ; Control", "007F..009F", "007F", "..009F", "009F", "Control")

Quantification: `*`, `+`, `?`, `{n}`, `{n,}`, `{n,m}`

A quantifier wraps its underlying pattern's capture type in either an Optional or Array. Zero-or-one quantification (?) produces an Optional and all others produce an Array. The kind of quantification, i.e. greedy vs reluctant vs possessive, is irrelevant to determining the capture type.

Syntax	Description	Capture type
`*`	0 or more	`Array` of the sub-pattern capture type
`+`	1 or more	`Array` of the sub-pattern capture type
`?`	0 or 1	`Optional` of the sub-pattern capture type
`{n}`	Exactly n	`Array` of the sub-pattern capture type
`{n,m}`	Between n and m	`Array` of the sub-pattern capture type
`{n,}`	n or more	`Array` of the sub-pattern capture type

/([0-9a-fA-F]+)+/
// => `Regex<[Substring]>`

// Equivalent result builder syntax:
//     OneOrMore {
//         OneOrMore(.hexDigit).capture()
//     }

/([0-9a-fA-F]+)*/
// => `Regex<[Substring]>`

// Equivalent result builder syntax:
//     Repeat {
//         OneOrMore(.hexDigit).capture()
//     }

/([0-9a-fA-F]+)?/
// => `Regex<Substring?>`

// Equivalent result builder syntax:
//     Optionally {
//         OneOrMore(.hexDigit).capture()
//     }

/([0-9a-fA-F]+){3}/
// => `Regex<[Substring]>

// Equivalent result builder syntax:
//     Repeat(3) {
//         OneOrMore(.hexDigit).capture()
//     )

/([0-9a-fA-F]+){3,5}/
// => `Regex<[Substring]>`

// Equivalent result builder syntax:
//     Repeat(3...5) {
//         OneOrMore(.hexDigit).capture()
//     )

/([0-9a-fA-F]+){3,}/
// => `Regex<[Substring]>`

// Equivalent result builder syntax:
//     Repeat(3...) {
//         OneOrMore(.hexDigit).capture()
//     )

Note that capturing collections of repeated captures like this is a departure from most regular expression implementations, which only provide access to the last match of a repeated capture group. For example, Python only captures the last group in this dash-separated string:

rep = re.compile('(?:([0-9a-f]+)-?)+')
match = rep.match("1234-5678-9abc-def0")
print(match.group(1))
# Prints "def0"

By contrast, the proposed Swift version captures all four sub-matches:

let pattern = /(?:([0-9a-f]+)-?)+/
if let match = "1234-5678-9abc-def0".firstMatch(of: pattern) {
    print(match.1)
}
// Prints ["1234", "5678", "9abc", "def0"]

Despite the deviation from prior art, we believe that the proposed capture behavior leads to better consistency with the meaning of these quantifiers.

Note, the alternative behavior does have the advantage of a smaller memory footprint because the matching algorithm would not need to allocate storage for capturing anything but the last match. As a future direction, we could introduce a variant of quantifiers for this behavior that the programmer would opt in for memory-critical use cases.

Alternation: `a|b`

Alternations are used to match one of multiple patterns. If there are one or more capturing groups within an alternation, the resulting capture type is an Alternation that's generic over each option's underlying pattern.

/([01]+)|[0-9]+|([0-9A-F]+)/
// => `Regex<Alternation<(Substring, Void, Substring)>>`

If there are no capturing groups within an alternation the resulting capture type is Void.

/[01]+|[0-9]+|[0-9A-F]+/
// => `Regex<Void>`

Nested captures follow the algebra previously described.

/([01]+|[0-9]+|[0-9A-F]+)/
// => `Regex<Substring>`
/(([01]+)|([0-9]+)|([0-9A-F]+))/
// => `Regex<(Substring, Alternation<(Substring, Substring, Substring))>>`
/(?<overall>(?<binary>[01]+)|(?<decimal>[0-9]+)|(?<hex>[0-9A-F]+))/
// => `Regex<(overall: Substring, Alternation<(binary: Substring, decimal: Substring, hex: Substring))>>`

At the use site, ideally Alternation would behave like an enum allowing you to exhaustively switch over all the captures.

let number = line
    .firstMatch(of: /([01]+)|([0-9]+)|([0-9A-F]+)/)
    .flatMap { (_, n) in
        switch n {
        case let .0(binary):
            return Int(binary, radix: 2)
        case let .1(decimal):
            return Int(decimal, radix: 10)
        case let .2(hex):
            return Int(hex, radix: 16)
        }
    }

// Or with named captures:
let number = line
    .firstMatch(of: /(?<binary>[01]+)|(?<decimal>[0-9]+)|(?<hex>[0-9A-F]+)/)
    .flatMap { (_, n) in
        switch n {
        case let .binary(str):
            return Int(str, radix: 2)
        case let .decimal(str):
            return Int(str, radix: 10)
        case let .hex(str):
            return Int(str, radix: 16)
        }
    }

In the fullness of time, we'd like to design a language feature to support this. In the meantime, we would like to do the best we can and leave the door open for a source-compatible migration.

With variadic generics we think we can define the following Alternation type.

@dynamicMemberLookup
public struct Alternation<Captures> { ... }

extension<Option...> Alternation where Captures == (Option...) {
    public var options: (Option?...) { get }
    
    public subscript<T>(dynamicMember keyPath: KeyPath<(Option?...), T>) -> T {
      options[keyPath: keyPath]
    }
}

An optional projection property, options, presents all options as a tuple of optionals. The Standard Library will provide the runtime guarantee that only one element of the options tuple will be non-nil. The programmer can use a switch statement to pattern-match the tuple and handle each case, or directly access the tuple's properties via key path dynamic member lookup.

let number = line
    .firstMatch(of: /([01]+)|([0-9]+)|([0-9A-F]+)/)
    .flatMap { (_, n) in
        switch n.options {
        case let (binary?, nil, nil):
            return Int(binary, radix: 2)
        case let (nil, decimal?, nil):
            return Int(decimal, radix: 10)
        case let (nil, nil, hex?):
            return Int(hex, radix: 16)
        default:
            fatalError("unreachable")
        }
    }

// Or with named captures:
let number = line
    .firstMatch(of: /(?<binary>[01]+)|(?<decimal>[0-9]+)|(?<hex>[0-9A-F]+)/)
    .flatMap { (_, n) in
        if let binary = n.binary {
            return Int(binary, radix: 2)
        } else if let decimal = n.decimal {
            return Int(decimal, radix: 10)
        } else if let hex = n.hex {
            return Int(hex, radix: 16)
        } else {
            fatalError("unreachable")
        }
    }

We acknowledge that it is unforunate that the programmer has to write an unreachable fatalError(...) even when all possible cases are handled. We believe that, in the fullness of time, this will motivate the design of a language-level solution to variadic enumerations.

Effect on ABI stability

None. This is a purely additive change to the Standard Library.

Effect on API resilience

None. This is a purely additive change to the Standard Library.

Alternatives considered

Lazy collections instead of arrays of substrings

For quantifiers that produce an array, it is arguable that a lazy collection based on matched ranges could minimize reference counting operations on Substring and reduce allocations.

let regex = /([a-z])+/
// => `Regex<CaptureCollection<Substring>>`

// `CaptureCollection` implemented as... 
public struct CaptureCollection<Captures>: BidirectionalCollection {
    private var ranges: [ClosedRange<String.Index>]
    ...
}

However, we believe the use of arrays in capture types would make a much cleaner type signature.

Homogeneous tuples for exact-count quantification

For exact-count quantifications, e.g. [a-z]{5}, it would slightly improve type safety to make its capture type be a homogeneous tuple instead of an array, e.g. (5 x Substring) as pitched in Improved Compiler Support for Large Homogenous Tuples.

/[a-z]{5}/     // => Regex<(5 x Substring)> (exact count)
/[a-z]{5, 8}/  // => Regex<[Substring]>     (bounded count) 
/[a-z]{5,}/    // => Regex<[Substring]>     (lower-bounded count)

However, this would cause an inconsistency between exact-count quantification and bounded quantification. We believe that the proposed design will result in fewer surprises as we associate the {...} quantifier syntax with Array.

`Never` as empty capture instead of `Void`

Past swift evolution proposals (SE-0215, SE-0319) have added conformances for Never in order to support its use a bottom type. Never may seem like a natural fit for the empty capture type instead of Void, such that a regex of type Regex<Never> means it never captures.

However, a Never value never exists. Functions with return type Never will never return. As a result, calling an API like captures on a regex with no captures would cause the program to abort or hang.

let identifier = /[_a-zA-Z]+[_a-zA-Z0-9]*/  // => `Regex<Never>`
print(str.firstMatch(of: identifier)?.captures)
// ❗️ Program aborts or hangs.

In contrast, using Void as the empty capture type would allow captures to be accessed safely at anytime. When a regex has no captures, the match result's capture is simply ().

let identifier = /[_a-zA-Z]+[_a-zA-Z0-9]*/  // => `Regex<Void>`
print(str.firstMatch(of: identifier)?.captures)
// Prints `()`.

() is also just a more consistent, continuous representation of a type with 0 captures:

Number	Capture type
0	`()`
1	`(Substring)`
2	`(Substring, Substring)`

`Regex<(EntireMatch, Capture...)>` instead of `Regex<Captures>`

There are a few downsides when Regex's generic parameter represents captures:

When there is only one capture and it is a named one, e.g. \d{4}-(?<month>\d{2})-\d{2}, the name will be lost in the type Regex<Substring> because Swift does not support single-element tuples.
The indices in the Captures tuple do not align with the conventional backreference indices, where the latter uses \0 to refer to the entire match and \1 for the start of captures.

One alternative design to address these downsides is to include the whole match in the generic parameter. When a regex has captures, the generic argument is a tuple that can be seen as (EntireMatch, Capture...). Capturing can then be seen as appending substrings on to an existing Match (as additional tuple elements).

public struct Regex<Match>: RegexProtocol, ExpressibleByRegexLiteral {
    ...
}

extension String {
    public func firstMatch<R: RegexProtocol>(of regex: R) -> R.Match?
}

With this design, regexes with a single named capture will preserve the name as a tuple element label.

\d{4}-(?<month>\d{2})-\d{2} // (Substring, month: Substring)

However, one downside to this approach is that the typing rules of concatenation, alternation, etc would require discarding the first element of Match when forming the new type. It would also lead to a harder requirement on variadic generics for the result builder syntax down the road, where the concatenation pattern needs to drop the first element from Match to be able to concatenate each component's capture type.

extension<EntireMatch, Capture...> Regex where Match == (EntireMatch, Capture...) {
    public typealias Captures = (Capture...) // Dropping first from `Match`
}

The other downside is that the role of the first tuple element can be unclear at call sites, even though Match is now consistent with backreference numbering in most other regex implementations.

Future directions

Dynamic captures

So far, we have explored offering static capture types for using a regular expression that is available in source code. Meanwhile, we would like to apply Swift's string processing capabilities to fully dynamic use cases, such as matching a string using a regular expression obtained at runtime.

To support dynamism, we could introduce a new type, DynamicCaptures that represents a tree of captures, and add a Regex initializer that accepts a string and produces Regex<DynamicCaptures>.

public struct DynamicCaptures: Equatable, RandomAccessCollection {
  var range: Range<String.Index> { get }
  var substring: Substring? { get }
  subscript(name: String) -> DynamicCaptures { get }
  subscript(position: Int) -> DynamicCaptures { get }
}

extension Regex where Captures == DynamicCaptures {
  public init(_ string: String) throws
}

Example usage:

let regex = readLine()! // (\w*)(\d)+z(\w*)?
let input = readLine()! // abcd1234xyz
print(input.firstMatch(of: regex)?.1)
// [
//     "abcd",
//     [
//         "1",
//         "2",
//         "3",
//         "4",
//     ],
//     .some("xyz")
// ]

Single-element labeled tuples

Swift doesn't currently support single-element labeled tuples, which leads to a discontinuity at arity 1:

let noCaptures = /[0-9A-F]+\.\.[0-9A-F]+/
// => `Regex<()>`

let oneCapture = /(?<lower>[0-9A-F]+)\.\.[0-9A-F]+/
// => `Regex<Substring>`

let twoCaptures = /(?<lower>[0-9A-F]+)\.\.(?<upper>[0-9A-F]+)/
// => `Regex<(lower: Substring, upper: Substring)>`

Dropping the argument label is particularly undesirable because firstMatch concatenates the match and the captures, make the argument label more significant:

let str = "007F..009F    ; Control # Cc  [33] <control-007F>..<control-009F>"

if let m = str.firstMatch(of: /(?<lower>[0-9A-F]+)\.\.(?<upper>[0-9A-F]+)/) {
    print(type(of: m)) // Prints (Substring, lower: Substring, upper: Substring)
    print(m.0) // Prints "007F..009F"
    print(m.lower) // Prints "007F"
    print(m.upper) // Prints "009F"
}

if let m = str.firstMatch(of: /(?<lower>[0-9A-F]+)\.\.[0-9A-F]+/) {
    print(type(of: m)) // Prints (Substring, Substring)
    print(m.0) // Prints "007F..009F"
    print(m.lower) // error
}

Forum discussion suggests there isn't a technical reason why support for single-element labeled tuples can't be added in the future. In particular, the examples here would be source compatible if as suggested (T), which is equivalent to T, is made a supertype of (label: T).

sveinhal · November 10, 2021, 11:16am

rxwei:

// Literal version.
let regex = /ab(cd*)(ef)gh/
// => `Regex<(Substring, Substring)>`
if let match = "abcddddefgh".firstMatch(of: regex) {
    print(match) // => ("abcddddefgh", "cdddd", "ef")
}

Wait? What? Does this include support for variadic generics?

Or did you mean:

if let match = "abcddddefgh".firstMatch(of: regex) {
    print(match) // => ("abcddddefgh", ("cdddd", "ef"))
}

jjrscott · November 10, 2021, 11:47am

Great work, but I've come to disagree with text encoded regular expressions (and that's coming from a Perl Groupie).

Long story short, I think we should replace string based regular expressions with idiomatic Swift, whether that be as suggested below or Decodable, result builders, or Scanner (but for arbitrary element types).

xAlien95 · November 10, 2021, 11:51am

All the examples in the pitch have firstMatch(of:) returning (Substring, Captures...), thus resulting in .0 being the whole match and the rest of the tuple containing the 1-indexed captures. Did you read an old version of the pitch?

It's amazing to see progress in this direction, especially given the fact that this will foster the development of variadic generics.

The Alternation section is intriguing, but, as proposed, it steps away from the PCRE/ICU/etc. capture group linear indexing. With that in mind, this example

/(?<overall>(?<binary>[01]+)|(?<decimal>[0-9]+)|(?<hex>[0-9A-F]+))/
// => `Regex<(overall: Substring, Alternation<(binary: Substring, decimal: Substring, hex: Substring))>>`

should give

(Substring, overall: Substring, binary: Substring?, decimal: Substring?, hex: Substring?)

as the result of applying firstMatch(of:). The downside of sticking to PCRE indexing is that checking which branch you got can be tedious: not only do you need to inexhaustibly check for non-nilness among .binary, .decimal and .hex, but you also need to account (and count) for 5 tuple members instead of just three when pattern matching. Furthermore, the proposed future direction of having "structural enums" (.0(binary), etc.) is more Swifty, if we may say.

In my opinion, the most Swifty way to store captures, would be to store them structurally in deeply nested tuples. The example above would result in

(
    Substring,
    overall: (
        Substring,
        Alternation<(
            binary: Substring,
            decimal: Substring,
            hex: Substring
        )>
    )
)

and I wonder if having multiple "views" on the captures may be beneficial. Something like

string.firstMatch(of: regex)?.match // => (Substring, overall: ( ... ))?
string.firstMatch(of: regex)?.captures // => (Substring, overall: Substring, binary: ...)?

with .match being structured and deeply nested, while .captures containing the flatten out linearly indexed captures?

If Patterns will share the same strongly typed regex captures layout of Regex, in the case of complex Patterns, we would end up with unmanageable results having dozens of captures flatten out.

Food for thoughts

Given the fact that Swift's regex syntax could use a superset of PCRE, we may even extend backreferences to allow for nested references. As an example, the token \k<name>, which matches the capture named name, doesn't allow non-alphabetic names in PCRE et al., so we could eventually extend it to support \k<.1> (equivalent to \1), \k<.overall> (equivalent to \k<overall>), \k<.overall.2> (refers to the second capture in the capture named overall).

kylemacomber · November 10, 2021, 4:58pm

This is a good observation and I think was an oversight on my part. I think we'll want to update the pitch so indexing alternations in the result tuple are consistent with the backreference numbering.

xAlien95:

I wonder if having multiple "views" on the captures may be beneficial. Something like
string.firstMatch(of: regex)?.match // => (Substring, overall: ( ... ))?
string.firstMatch(of: regex)?.captures // => (Substring, overall: Substring, binary: ...)?
with .match being structured and deeply nested, while .captures containing the flatten out linearly indexed captures?

This is definitely an intriguing idea. One downside I see is that it'd probably require Regex to be generic over both the nested and flattened types, which would be a little much.

One idea is that Regex is generic over the nested type and we use an integer subscript to lookup the value at the corresponding flattened index. Though the result of the subscript would have to be type erased, which isn't as convenient.

Another idea is that Regex supports the traditional flattened style, since the motivation behind regex is knowledge and code reuse. But maybe Pattern supports both (e.g. capture() which is nested and flatCapture() which is flattened)?

nnnnnnnn · November 10, 2021, 4:59pm

Just want to add a quick note for anyone tripped up about the numbering of the captures in a match result: the generic parameter for Regex does not include the whole match, but the generic parameter for the match result does.

That is, for this example:

let graphemeBreakRange = /([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)/
// => `Regex<(Substring, Substring)>`

...the match result would give you access to a (Substring, Substring, Substring) tuple:

if let match = "009A..00FE".firstMatch(of: graphemeBreakRange) {
    print("Full match: '\(match.0)'")
    print("Groups: '\(match.1)' & '\(match.2)'")
}
// Prints "Full match: '009A..00FE'"
// Prints "Groups: '009A' & '00FE'"

davedelong · November 10, 2021, 5:10pm

Yes, thank you. My (now deleted) comment was a result of me not quite making this connection. Thanks for clarifying!

Michael_Ilseman · November 10, 2021, 5:34pm

Semantic-level switching is out-of-scope for this pitch, but it's possible handling that could introduce constraints. IIUC, this pitch is pitching an unconstrained generic parameter for brevity and simplicity, but isn't trying to close the door on constraints (or even nominal types).

Again, a tuple for simplicity and brevity in this pitch. IIUC, we're not closing the door to it being a proper nominal type with .0, .1 etc overloads and dynamic member lookup for named cptures. It's just that for the content in this pitch, a tuple more than suffices. Either way, the developer experience is meant to be light-weight and tuple-y,

What would it take in the language to support writing this as Regex<>? Default generic parameter values, or something simpler? CC @beccadax

An interesting note is that the Substrings are created on-the-fly, the actual memory representation uses Range<String.Index>. So, we have firstMatch which slices the input many times (incurring ARC, etc.). We could also have a firstMatch that more directly provides (Range<String.Index>, ...). We could even provide a firstMatch which provides (String...), which is not as crazy as it sounds for scripting or when captures/matches fit within 15 UTF-8 code units for the small string form.

In that sense, the Captures generic type is just an encoding of the arity and kind of captured content. We could have instead written Regex_sos for Regex<Substring, Substring?, Substring> and conform to a common protocol, avoiding any generics at all. We could equivalently write Regex<Void, Void?, Void> or Regex<String, String?, String>, they're all fundamentally equivalent ways to represent the structure of the captures in a literal or regex DSL.

I'm interested in how we make and evaluate this decision. I do like Regex<Substring, Substring?, Substring> currently, which:

A) Gives people a nice slice by "default" while sharing memory
B) Hints that the most common usage is sharing memory (safely with COW)
C) Helps drives developer intuition that it's contained within the match

But I could totally see arguments for String or even a custom type with an apt name. Perhaps the final decision is pending more work on Pattern<...>, which is capable of expressing failable transformations from Substring -> T? (e.g. the Unicode.Scalar value produced in the overview example).

Noting that we're talking about the default or common presentation. Since the type is just encoding structural information, we could in theory (perhaps pending language features) present a heirarchical view as well in which the nesting structure is present in the type system.

I do agree that the flattened view is probably the most useful, especially for the literal. Perhaps we'll face this when we start to flesh out the DSL (currently, the nesting looks weird unless you mentally mapping back to a literal).

Noting that what we introduce is less likely to be encoded in literal syntax. It's more likely to be an API-level option on the callee or an option/property on Regex.

This means that aggregates of alternations will preserve exactly what was matched and how. E.g. ((ab)|(cd))* is Array<Alternation<Substring, Substring>> which preserves match information. That could be cool. Do you know how this compares to other engines?

I feel like this is conflating Captures with the in-memory representation. Since we are fundamentally operating with Range<String.Index>, a firstMatchRanges or such would produce [Range<String.Index>]. We don't know yet if the engine will store these as an allocated Array or some custom-allocated type instead.

I feel like a better alternative here is a firstMatchRanges equivalent since that lets us skip all ARC, not just ARC inside aggregates. Then, firstMatch is a convenience layer on top of this. Future work would be exposing access to the engine's underlying representation to avoid allocation for the Array.

I agree. There's sharply diminishing returns in being clever here and increasing hazards. It would be annoying to change a {5} to {5, 6} and have that result in a massive shift in the capture arity and position of everything coming after it.

rxwei:

In contrast, using Void as the empty capture type would allow captures to be accessed safely at anytime. When a regex has no captures, the match result's capture is simply () .
let identifier = /[_a-zA-Z]+[_a-zA-Z0-9]*/  // => `Regex<Void>`
print(str.firstMatch(of: identifier)?.captures)
// Prints `()`.
() is also just a more consistent, continuous representation of a type with 0 captures:

Why is it Regex<Void> instead of Regex<()>? Also, we conceivably could introduce a new None or NoCapture type (or something better named). It's conceivable to even have NonCapturingRegex without any generic parameters, though I don't like having multiple types just for this purpose.

Either way, having a special type in that position for non-capturing regex does make sense. Just as a single-capture concatenation degenerates into that capture itself, so too does a no-capture regex degenerate into Regex<Void>, we're just figuring out how best to spell that.

I feel like this is only an artifact of modeling everything as tuples rather than nominal types. There's a chance we'll derive additional value from a nominal type, in which case we can map numbers however we want and use argument labels to clarify.

Are we sure we'll even provide a .captures API if we have a perfectly fine nominal match result with both captures and the full match which aligns with backreference numbering and common regex conventions? It's possible that accessing just the captures is done via normal operations over variadic generics (hopefully). Another alternative is type slices, but I don't know the aims of variadic generics here.

beccadax · November 10, 2021, 5:57pm

Surely \0 is NUL (U+0000)?

(I think that would be fine because using \0 as a backreference at any point in a regular expression would be incoherent for the same reason /(foo\1)/ is—it doesn't make sense to use a backreference inside the capture group it references. Perl, at least, simply fails the match if you try to use a backreference to a capture group that’s still open, but in Swift we may want to diagnose it as a syntax error.)

kylemacomber · November 10, 2021, 6:14pm

Ya we should re-phrase that. All I was trying to say is that the entire match is conceptually backreference #0.

beccadax · November 10, 2021, 8:32pm

I've been thinking about this more than a little. We're trying to get capture numbers and tuple numbers to line up, but even if we use .0 to address the starting number, there are at least two other places where the numbering will deviate:

When a quantifier wraps a concatenation with multiple captures:

let quotedPalindromes: Regex<(Substring, [(Substring, Substring)], Substring)> =
    /(['"])(?:(\w)(\w)\3\2)+\1(u)?/

if let match = "'ababcdcd'".firstMatch(of: quotedPalindromes) {
    print(match.1, match.2[0].0, match.2[0].1, match.3)
    // Notice that \3 is `match.2[i].1`, and `match.3` corresponds to /\4/
}

When captures are inside an Alternation:

let necessaryAccessModifier: Regex<(Alternation<(Substring, Substring, Substring, Substring, Substring)>, Substring)> =
    /(?:(open)|(public)|(internal)|(fileprivate)|(private)) (?:enum|class|struct) \w+ \{\n    \1\2 (var|func|subscript|init)/

if let match = sourceCode.firstMatch(of: necessaryAccessModifier) {
    print(match.1.0, match.1.1, match.1.2, match.1.3, match.1.4, match.2)
    // Notice that /\2/ is `match.1.1`, and `match.2` corresponds to /\6/
}

So if we're really committed to making the numbering match, we probably need to flatten the nested structure of the captures. Instead of these types:

let quotedPalindromes: Regex<(Substring, [(Substring, Substring)], Substring)> = ...

let necessaryAccessModifier: Regex<(Alternation<(Substring, Substring, Substring, Substring, Substring)>, Substring)> = ...

We'd need these types:

let quotedPalindromes: Regex<(Substring, [Substring], [Substring], Substring)> = ...

let necessaryAccessModifier: Regex<(Substring?, Substring?, Substring?, Substring?, Substring?, Substring)> = ...

This loses some of the structure of the match—for instance, there is no longer anything to indicate that the two arrays in quotedPalindromes will always be of the same length, or that only one of the five Substring?s in necessaryAccessModifier will be non-nil—but it ensures that backreferences really will line up correctly with match values.

If we aren't willing to do that, I'm not sure how much value there really is in adding .0 to ensure captures line up as expected. At least the numbering is consistently misaligned without it, instead of being aligned as long as you don't use certain features in certain ways.

(On the bright side, I suspect such flattening may simplify the implementation of captures. We could store all information about captures in a simple [[Range<String.Index>]] array and then convert each sub-array to Substring, Substring?, or [Substring] as needed when we construct the capture tuple, without having to construct nested or alternating structures.)

benrimmington · November 11, 2021, 7:14pm

Most of the examples show the regex literal syntax, followed by the equivalent result builder syntax.

Could the default type for both of these syntaxes be the same?
- Either Regex would be the @resultBuilder root type;
- or Pattern would be ExpressibleByRegexLiteral.
Could the compiler implement regex literals by constructing the result builder syntax internally?
- This might make switching between syntaxes easier.
- ExpressibleByRegexLiteral might not be needed.
Will other kinds of result builders be able to implement similar functionality?
- For example, a database query DSL with strongly typed result columns.

rxwei · November 11, 2021, 9:26pm

We are aiming for Regex to have capture numbering that's consistent with prior art, whereas the Pattern's generic parameter could use more structure to reflect the scoping of the result builder syntax. Regex would conform to ExpressibleByRegexLiteral. Pattern is out of scope for this pitch, but I think we might end up having a PatternProtocol whose conforming types include Regex, Pattern, and various combinator types such as OneOrMore and Optionally. A generic version of firstMatch would be defined over PatternProtocol.

In the overview we noted that it would be desirable to have a refactor action that converts a regex literal to a Pattern. We could even provide an option to obtain the textual description of a regex literal as the Pattern result builder syntax for readability. But an ExpressibleByRegexLiteral would still be useful as it would allow libraries to define custom types that can be constructed with regex literals.

benrimmington · November 12, 2021, 8:20am

If you have different capture numbering for Regex and Pattern, then I imagine that would make it more difficult to refactor?

If the regex literal only defines the pattern, then existing libraries (e.g. ICU, JavaScriptCore, POSIX) may also need an OptionSet of flags; a JSContext object; and some way to return error codes for invalid patterns.

I'll take another look at the related threads, since my feedback is mostly out of scope for this pitch.

rxwei · November 12, 2021, 9:09am

Yeah, to mitigate this, we were thinking that Pattern could have a flatten() modifier which transforms structural captures to PCRE-style captures. Refactoring would generate .flatten() at the end of the pattern.

let r = /(a)((?:(b)(c)d)+)e(f)/
// Regex<(Substring, Substring, [Substring], [Substring], Substring)>

Refactored:

let r = Pattern {
    "a".capture() // Pattern<Substring>
    OneOrMore {
        "b".capture() // Pattern<Substring>
        "c".capture() // Pattern<Substring>
        "d"
    }.capture()  // Pattern<(Substring, [(Substring, Substring)])>
    "e"
    "f".capture() // Pattern<Substring>
}.flatten()
// Before flatten: `Pattern<(Substring, (Substring, [(Substring, Substring)]), Substring)>`
// After flatten: `Pattern<(Substring, Substring, [Substring], [Substring], Substring)>`

Karl · November 12, 2021, 12:40pm

I haven't yet fully digested this proposal, but on first reading it seems really good.

The way nested capture groups get flattened in the resulting Captures type certainly caught my attention because of how different it is to result builders.

WRT to the ARC stuff, I would imagine it should be possible to optimise. If we know we have 5 captures from the same String, then when forming the slices, surely it should be possible to just bump the retain count by 5 in a single op? (And also release them in a single op if all/many of them don't escape the function - release/deinit performance is just as important as retaining)

And with ownership, hopefully these will be recognised as borrows and just extend the String's lifetime rather than retaining storage in the common case when the slices don't escape the function. But we haven't heard about ownership for years, so

What's more interesting (IMO) is the ability to include regex capture groups in types conforming to ExpressibleByStringInterpolation, including the type information proposed here.

For custom patterns, you often don't want an enormous regex matching against all of the data - what you want is 90+% string literal, with a few small embedded patterns (capture groups), where those patterns may be expressed using a regex.

So something like: let p: URLPathPattern = "/user/\(?<id>\d+)/profile"

I don't see there being many types (other than regexes themselves) which are constructible from a regex literal and nothing else. So I'm sceptical that let p: MyPattern = /(?<lower>[0-9A-F]+)\.\.[0-9A-F]+/ will actually be that useful.

kylemacomber · November 13, 2021, 1:48am

Below is an updated pitch. You can also find the full document here.

Strongly Typed Regex Captures

Authors: Richard Wei, Kyle Macomber

Revision history

v1
- Initial pitch.
v2
- Includes entire match in Regex's generic parameter.
- Fixes Quantification and Alternation capture types to be consistent with traditional back reference numbering.

Introduction

Capturing groups are a commonly used component of regular expressions as they allow the programmer to extract information from matched input. A capturing group collects multiple characters together as a single unit that can be backreferenced within the regular expression and accessed in the result of a successful match. For example, the following regular expression contains the capturing groups (cd*) and (ef).

// Regex literal syntax:
let regex = /ab(cd*)(ef)gh/
// => `Regex<(Substring, Substring, Substring)>`

// Equivalent result builder syntax:
//     let regex = Pattern {
//         "ab"
//         Group {
//             "c"
//             Repeat("d")
//         }.capture()
//         "ef".capture()
//         "gh"
//     }

if let match = "abcddddefgh".firstMatch(of: regex) {
    print(match) // => ("abcddddefgh", "cdddd", "ef")
}

Note: The Regex type includes, and firstMatch(of:) returns, the entire match as the "0th element".

We introduce a generic type Regex<Match>, which treats the type of captures as part of a regular expression's type information for clarity, type safety, and convenience. As we explore a fundamental design aspect of the regular expression feature, this pitch discusses the following topics:

A type definition of the generic type Regex<Match> and firstMatch(of:) method.
Capture type inference and composition in regular expression literals and the forthcoming result builder syntax.
New language features which this design may require.

The focus of this pitch is the structural properties of capture types and how regular expression patterns compose to form new capture types. The semantics of string matching, its effect on the capture types (i.e. UnicodeScalarView.SubSequence or Substring), and the result builder syntax will be discussed in future pitches.

For background on Declarative String Processing, see related topics:

Motivation

Across a variety of programming languages, many established regular expression libraries present captures as a collection of captured content to the caller upon a successful match [1][2]. However, to know the structure of captured contents, programmers often need to carefully read the regular expression or run the regular expression on some input to find out. Because regular expressions are oftentimes statically available in the source code, there is a missed opportunity to use generics to present captures as part of type information to the programmer, and to leverage the compiler to infer the type of captures based on a regular expression literal. As we propose to introduce declarative string processing capabilities to the language and the Standard Library, we would like to explore a type-safe approach to regular expression captures.

Proposed solution

We introduce a generic structure Regex<Match> whose generic parameter Match includes the match and any captures, using tuples to represent multiple and nested captures.

let regex = /ab(cd*)(ef)gh/
// => Regex<(Substring, Substring, Substring)>
if let match = "abcddddefgh".firstMatch(of: regex) {
  print(match) // => ("abcddddefgh", "cdddd", "ef")
}

During type inference for regular expression literals, the compiler infers the type of Match from the content of the regular expression. The same will be true for the result builder syntax, except that the type inference rules are expressed as method declarations in the result builder type.

Because much of the motivation behind providing regex literals in Swift is their familiarity, a top priority of this design is for the result of calling firstMatch(of:) with a regex to align with the traditional numbering of backreferences to capture groups, which start at \1.

let regex = /ab(cd*)(ef)gh/
if let match = "abcddddefgh".firstMatch(of: regex) {
  print((match.1, match.2)) // => ("cdddd", "ef")
}

Detailed design

`Regex` type

Regex is a structure that represents a regular expression. Regex is generic over an unconstrained generic parameter Match. Upon a regex match, the entire match and any captured values are available as part of the result.

public struct Regex<Match>: RegexProtocol, ExpressibleByRegexLiteral {
    ...
}

Note: Semantic-level switching (i.e. matching grapheme clusters with canonical equivalence vs Unicode scalar values) is out-of-scope for this pitch, but handling that will likely introduce constraints on Match. We use an unconstrained generic parameter in this pitch for brevity and simplicity. The Substrings we use for illustration throughout this pitch are created on-the-fly; the actual memory representation uses Range<String.Index>. In this sense, the Match generic type is just an encoding of the arity and kind of captured content.

`firstMatch(of:)` method

The firstMatch(of:) method returns a Substring of the first match of the provided regex in the string, or nil if there are no matches. If the provided regex contains captures, the result is a tuple of the matching string and any captures (described more below).

extension String {
    public func firstMatch<R: RegexProtocol>(of regex: R) -> R.Match?
}

This signature is consistent with the traditional numbering of backreferences to capture groups starting at \1. Many regex libraries make the entire match available at position 0. We propose to do the same in order to align the tuple index numbering with the regex backreference numbering:

let scalarRangePattern = /([0-9a-fA-F]+)(?:\.\.([0-9a-fA-F]+))?/
// Positions in result: 0 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                      1 ^~~~~~~~~~~~~~     2 ^~~~~~~~~~~~~~
if let match = line.firstMatch(of: scalarRangePattern) {
    print((match.0, match.1, match.2)) // => ("007F..009F", "007F", "009F")
}

Note: Additional features like efficient access to the matched ranges are out-of-scope for this pitch, but will likely mean returning a nominal type from firstMatch(of:). In this pitch, the result type of firstMatch(of:) is a tuple of Substrings for simplicity and brevity. Either way, the developer experience is meant to be light-weight and tuple-y. Any nominal type would likely come with dynamic member lookup for accessing captures by index (i.e. .0, .1, etc.) and name.

Capture type

In this section, we describe the inferred capture types for regular expression patterns and how they compose.

By default, a regular expression literal has type Regex. Its generic argument Match can be viewed as a tuple of the entire matched substring and any captures.

(EntireMatch, Captures...)
              ^~~~~~~~~~~
              Capture types

When there are no captures, Match is just the entire matched substring.

Basics

Regular expressions without any capturing groups have type Regex<Substring>, for example:

let identifier = /[_a-zA-Z]+[_a-zA-Z0-9]*/  // => `Regex<Substring>`

// Equivalent result builder syntax:
//     let identifier = Pattern {
//         OneOrMore(/[_a-zA-Z]/)
//         Repeat(/[_a-zA-Z0-9]/)
//     }

Capturing group: `(...)`

A capturing group saves the portion of the input matched by its contained pattern. Its capture type is Substring.

let graphemeBreakLowerBound = /([0-9a-fA-F]+)/
// => `Regex<(Substring, Substring)>`

// Equivalent result builder syntax:
//     let graphemeBreakLowerBound = OneOrMore(.hexDigit).capture()

Concatenation: `abc`

A concatenation's Match is a tuple of Substrings followed by every pattern's capture type. When there are no capturing groups, the Match is just Substring.

let graphemeBreakLowerBound = /([0-9a-fA-F]+)\.\.[0-9a-fA-F]+/
// => `Regex<(Substring, Substring)>`

// Equivalent result builder syntax:
//     let graphemeBreakLowerBound = Pattern {
//         OneOrMore(.hexDigit).capture()
//         ".."
//         OneOrMore(.hexDigit)
//     }

let graphemeBreakRange = /([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)/
// => `Regex<(Substring, Substring, Substring)>`

// Equivalent result builder syntax:
//     let graphemeBreakRange = Pattern {
//         OneOrMore(.hexDigit).capture()
//         ".."
//         OneOrMore(.hexDigit).capture()
//     }

Named capturing group: `(?<name>...)`

A named capturing group's capture type is Substring. In its Match type, the capture type has a tuple element label specified by the capture name.

let graphemeBreakLowerBound = /(?<lower>[0-9a-fA-F]+)\.\.[0-9a-fA-F]+/
// => `Regex<(Substring, lower: Substring)>`

let graphemeBreakRange = /(?<lower>[0-9a-fA-F]+)\.\.(?<upper>[0-9a-fA-F]+)/
// => `Regex<(Substring, lower: Substring, upper: Substring)>`

Non-capturing group: `(?:...)`

A non-capturing group's capture type is the same as its underlying pattern's. That is, it does not capture anything by itself, but transparently propagates its underlying pattern's captures.

let graphemeBreakLowerBound = /([0-9a-fA-F]+)(?:\.\.([0-9a-fA-F]+))?/
// => `Regex<(Substring, Substring, Substring?)>`

// Equivalent result builder syntax:
//     let graphemeBreakLowerBound = Pattern {
//         OneOrMore(.hexDigit).capture()
//         Optionally {
//             ".."
//             OneOrMore(.hexDigit).capture()
//         }
//     }

Nested capturing group: `(...(...))`

When capturing group is nested within another capturing group, they count as two distinct captures in the order their left parenthesis first appears in the regular expression literal. This is consistent with traditional regex backreference numbering.

let graphemeBreakPropertyData = /(([0-9a-fA-F]+)(\.\.([0-9a-fA-F]+)))\s*;\s(\w+).*/
// Positions in result:        0 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                             1 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~    5 ^~~~~
//                                            3 ^~~~~~~~~~~~~~~~~~~~
//                              2 ^~~~~~~~~~~~~~   4 ^~~~~~~~~~~~~~
// => `Regex<(Substring, Substring, Substring, Substring, Substring, Substring)>`

// Equivalent result builder syntax:
//     let graphemeBreakPropertyData = Pattern {
//         Group {
//             OneOrMore(.hexDigit).capture() // (2)
//             Group {
//                 ".."
//                 OneOrMore(.hexDigit).capture() // (4)
//             }.capture() // (3)
//         }.capture() // (1)
//         Repeat(.whitespace)
//         ";"
//         CharacterClass.whitespace
//         OneOrMore(.word).capture() // (5)
//         Repeat(.any)
//     }
//     .flattened()

let input = "007F..009F   ; Control"
// Match result for `input`:
// ("007F..009F   ; Control", "007F..009F", "007F", "..009F", "009F", "Control")

Quantification: `*`, `+`, `?`, `{n}`, `{n,}`, `{n,m}`

A quantifier wraps its underlying pattern's capture type in either an Optional or Array. Zero-or-one quantification (?) produces an Optional and all others produce an Array. The kind of quantification, i.e. greedy vs reluctant vs possessive, is irrelevant to determining the capture type.

Syntax	Description	Capture type
`*`	0 or more	`Array` of the sub-pattern capture type
`+`	1 or more	`Array` of the sub-pattern capture type
`?`	0 or 1	`Optional` of the sub-pattern capture type
`{n}`	Exactly n	`Array` of the sub-pattern capture type
`{n,m}`	Between n and m	`Array` of the sub-pattern capture type
`{n,}`	n or more	`Array` of the sub-pattern capture type

/([0-9a-fA-F]+)+/
// => `Regex<(Substring, [Substring])>`

// Equivalent result builder syntax:
//     OneOrMore {
//         OneOrMore(.hexDigit).capture()
//     }

/([0-9a-fA-F]+)*/
// => `Regex<(Substring, [Substring])>`

// Equivalent result builder syntax:
//     Repeat {
//         OneOrMore(.hexDigit).capture()
//     }

/([0-9a-fA-F]+)?/
// => `Regex<(Substring, Substring?)>`

// Equivalent result builder syntax:
//     Optionally {
//         OneOrMore(.hexDigit).capture()
//     }

/([0-9a-fA-F]+){3}/
// => `Regex<(Substring, [Substring])>`

// Equivalent result builder syntax:
//     Repeat(3) {
//         OneOrMore(.hexDigit).capture()
//     )

/([0-9a-fA-F]+){3,5}/
// => `Regex<(Substring, [Substring])>`

// Equivalent result builder syntax:
//     Repeat(3...5) {
//         OneOrMore(.hexDigit).capture()
//     )

/([0-9a-fA-F]+){3,}/
// => `Regex<(Substring, [Substring])>`

// Equivalent result builder syntax:
//     Repeat(3...) {
//         OneOrMore(.hexDigit).capture()
//     )

let multipleAndNestedOptional = /(([0-9a-fA-F]+)\.\.([0-9a-fA-F]+))?/
// Positions in result:        0 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                             1 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                              2 ^~~~~~~~~~~~~~  3 ^~~~~~~~~~~~~~
// => `Regex<(Substring, Substring?, Substring?, Substring?)>`

// Equivalent result builder syntax:
//     let multipleAndNestedOptional = Pattern {
//         Optionally {
//             OneOrMore(.hexDigit).capture()
//             ".."
//             OneOrMore(.hexDigit).capture()
//         }
//         .capture()
//     }
//     .flattened()

let multipleAndNestedQuantifier = /(([0-9a-fA-F]+)\.\.([0-9a-fA-F]+))+/
// Positions in result:          0 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                               1 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                                2 ^~~~~~~~~~~~~~  3 ^~~~~~~~~~~~~~
// => `Regex<(Substring, [Substring], [Substring], [Substring])>`

// Equivalent result builder syntax:
//     let multipleAndNestedQuantifier = Pattern {
//         OneOrMore {
//             OneOrMore(.hexDigit).capture()
//             ".."
//             OneOrMore(.hexDigit).capture()
//         }
//         .capture()
//     }
//     .flattened()

Note that capturing collections of repeated captures like this is a departure from most regular expression implementations, which only provide access to the last match of a repeated capture group. For example, Python only captures the last group in this dash-separated string:

rep = re.compile('(?:([0-9a-fA-F]+)-?)+')
match = rep.match("1234-5678-9abc-def0")
print(match.group(1))
# Prints "def0"

By contrast, the proposed Swift version captures all four sub-matches:

let pattern = /(?:([0-9a-fA-F]+)-?)+/
if let match = "1234-5678-9abc-def0".firstMatch(of: pattern) {
    print(match.1)
}
// Prints ["1234", "5678", "9abc", "def0"]

We believe that the proposed capture behavior leads to better consistency with the meaning of these quantifiers. However, the alternative behavior does have the advantage of a smaller memory footprint because the matching algorithm would not need to allocate storage for capturing anything but the last match. As a future direction, we could introduce some way of opting into this behavior.

Alternation: `a|b`

Alternations are used to match one of multiple patterns. An alternation wraps
its underlying pattern's capture type in an Optional.

let numberAlternationRegex = /([01]+)|[0-9]+|([0-9a-fA-F]+)/
// Positions in result:     0 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                          1 ^~~~~~~      2 ^~~~~~~~~~~~~~
// => `Regex<(Substring, Substring?, Substring?)>`

// Equivalent result builder syntax:
//     let numberAlternationRegex = Pattern {
//         OneOf {
//             OneOrMore(.binaryDigit).capture()
//             OneOrMore(.decimalDigit)
//             OneOrMore(.hexDigit).capture()
//         }
//     }
//     .flattened()

let scalarRangeAlternation = /([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)|([0-9a-fA-F]+)/
// Positions in result:     0 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                          1 ^~~~~~~~~~~~~~  2 ^~~~~~~~~~~~~~
//                                                           3 ^~~~~~~~~~~~~~
// => `Regex<(Substring, Substring?, Substring?, Substring?)>

// Equivalent result builder syntax:
//     let scalarRangeAlternation = Pattern {
//         OneOf {
//             Group {
//                 OneOrMore(.hexDigit).capture()
//                 ".."
//                 OneOrMore(.hexDigit).capture()
//             }
//             OneOrMore(.hexDigit).capture()
//         }
//     }
//     .flattened()

let nestedScalarRangeAlternation = /(([0-9a-fA-F]+)\.\.([0-9a-fA-F]+))|([0-9a-fA-F]+)/
// Positions in result:           0 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                                1 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~   
//                                 2 ^~~~~~~~~~~~~~  3 ^~~~~~~~~~~~~~
//                                                                   4 ^~~~~~~~~~~~~~
// => `Regex<(Substring, Substring?, Substring?, Substring?, Substring?)>

// Equivalent result builder syntax:
//     let scalarRangeAlternation = Pattern {
//         OneOf {
//             Group {
//                 OneOrMore(.hexDigit).capture()
//                 ".."
//                 OneOrMore(.hexDigit).capture()
//             }
//             .capture()
//
//             OneOrMore(.hexDigit).capture()
//         }
//     }
//     .flattened()

Effect on ABI stability

None. This is a purely additive change to the Standard Library.

Effect on API resilience

None. This is a purely additive change to the Standard Library.

Alternatives considered

Lazy collections instead of arrays of substrings

For quantifiers that produce an array, it is arguable that a lazy collection based on matched ranges could minimize reference counting operations on Substring and reduce allocations.

let regex = /([a-z])+/
// => `Regex<(Substring, CaptureCollection<Substring>)>`

// `CaptureCollection` implemented as... 
public struct CaptureCollection<Captures>: BidirectionalCollection {
    private var ranges: [ClosedRange<String.Index>]
    ...
}

However, we believe the use of arrays in capture types would make a much cleaner type signature.

Homogeneous tuples for exact-count quantification

For exact-count quantifications, e.g. [a-z]{5}, it would slightly improve type safety to make its capture type be a homogeneous tuple instead of an array, e.g. (5 x Substring) as pitched in Improved Compiler Support for Large Homogenous Tuples.

/[a-z]{5}/     // => Regex<(Substring, (5 x Substring))> (exact count)
/[a-z]{5, 8}/  // => Regex<(Substring, [Substring])>     (bounded count)
/[a-z]{5,}/    // => Regex<(Substring, [Substring])>     (lower-bounded count)

However, this would cause an inconsistency between exact-count quantification and bounded quantification. We believe that the proposed design will result in fewer surprises as we associate the {...} quantifier syntax with Array.

`Regex<Captures>` instead of `Regex<Match>`

In the initial version of this pitch, Regex was only generic over its captures and firstMatch(of:) was responsible for flattening together the match and captures into a tuple.

extension String {
    public func firstMatch<R: RegexProtocol, C...>(of regex: R)
        -> (Substring, C...)? where R.Captures == (C...)
}

// Expands to:
//     extension String {
//         func firstMatch<R: RegexProtocol>(of regex: R)
//             -> Substring? where R.Captures == ()
//         func firstMatch<R: RegexProtocol, C1>(of regex: R)
//             -> (Substring, C1)? where R.Captures == (C1)
//         func firstMatch<R: RegexProtocol, C1, C2>(of regex: R)
//             -> (Substring, C1, C2)? where R.Captures == (C1, C2)
//         ...
//     }

For simple regular expressions this had the benefit of aligning the generic signature more obviously with the captures in the regex.

let regex = /ab(cd*)(ef)gh/
// => `Regex<(Substring, Substring)>`

However, it came with a number of (not necessarily insurmountable) open questions:

Will variadic generic tuple splatting preserve element labels?
Will variadic generic tuple splatting eliminate Voids? (We don't want firstMatch(of:) to return (Substring, Void) for a regex with no captures).
Will we be able to add single-element labeled tuples? (This would be needed to preserve the name of a capture in a regex with a single named capturing group.)
What should be the type of Captures for a regex with no captures (e.g. Void or Never or something else)?

Given all of this, it seems simpler and more pragmatic to make Regex generic over both the match and the captures.

Structured rather than flat captures

This pitch proposes inferring capture types in such a way as to align with the traditional numbering of backreferences. This is because much of the motivation behind providing regex literals in Swift is their familiarity.

If we decided to deprioritize this motivation, there are opportunities to infer safer, more ergonomic, and arguably more intuitive types for captures.

For example, to be consistent with traditional regex backreferences quantifications of multiple or nested captures had to produce parallel arrays rather than an array of tuples.

/(?:(?<lower>[0-9a-fA-F]+)\.\.(?<upper>[0-9a-fA-F]+))+/
// Flat capture type:
// => `Regex<(Substring, lower: [Substring], upper: [Substring])>`

// Structured capture type:
// => `Regex<(Substring, [(lower: Substring, upper: Substring)])>`

The structured capture type is safer because the type system encodes that there are an equal number of lower and upper hex numbers. It's also more convenient because you're likely to be processing lower and upper in parallel (e.g. to create ranges).

Similarly, alternations of multiple or nested captures produces flat optionals rather than a structured alternation type.

/([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)|([0-9a-fA-F]+)/
// Flat capture type:
// => `Regex<(Substring, Substring?, Substring?, Substring?)>`

// Structured capture type:
// => `Regex<(Substring, Alternation<((Substring, Substring), Substring)>)>`

The structured capture type is safer because the type system encodes which options in the alternation of mutually exclusive. It'd also be much more convenient if, in the future, Alternation could behave like an enum, allowing exhaustive switching over all the options.

It's possible to derive the flat type from the structured type (but not vice versa), so Regex could be generic over the structured type and firstMatch(of:) could return a result type that vends both.

extension String {
    struct MatchResult<R: RegexProtocol> {
        var flat: R.Match.Flat { get }
        var structured: R.Match { get }
    }
    func firstMatch<R>(of regex: R) -> MatchResult<R>?
}

This is cool, but it adds extra complexity to Regex and it isn't as clear because the generic type no longer aligns with the traditional regex backreference numbering. Because the primary motivation for providing regex literals in Swift is their familiarity, we think the consistency of the flat capture type trumps the added safety and ergonomics of the structured captures type.

We think the calculus probably flips in favor of a structured capture type for the result builder syntax, for which familiarity is not as high a priority.

Future directions

Dynamic captures

So far, we have explored offering static capture types for using a regular expression that is available in source code. Meanwhile, we would like to apply Swift's string processing capabilities to fully dynamic use cases, such as matching a string using a regular expression obtained at runtime.

To support dynamism, we could introduce a new type, DynamicCaptures that represents a tree of captures, and add a Regex initializer that accepts a string and produces Regex<(Substring, DynamicCaptures)>.

public struct DynamicCaptures: Equatable, RandomAccessCollection {
  var range: Range<String.Index> { get }
  var substring: Substring? { get }
  subscript(name: String) -> DynamicCaptures { get }
  subscript(position: Int) -> DynamicCaptures { get }
}

extension Regex where Match == DynamicCaptures {
  public init(_ string: String) throws
}

Example usage:

let regex = readLine()! // (\w*)(\d)+z(\w*)?
let input = readLine()! // abcd1234xyz
print(input.firstMatch(of: regex)?.1)
// [
//     "abcd",
//     [
//         "1",
//         "2",
//         "3",
//         "4",
//     ],
//     .some("xyz")
// ]

beccadax · November 13, 2021, 3:28am

I have no functional changes to suggest, but I do see a few places where we could improve the proposal's description of capture behavior.

Drafting suggestions

I personally find it useful if the "proposed solution" section calls out most of the major design decisions implied by the "Detailed design", but I don't think we're really doing that here with captures. We may want to briefly outline a few facts about captures and illustrate these points with an example:

Quantifiers (*, +, and ?) and alternations (|) wrap each capture inside them in Array or Optional. These structures can be nested, so a capture which is inside multiple levels of quantifiers or alternations will end up with a type like [Substring?]?. To ensure that backreference numbering and tuple element numbering match, each capture is separately wrapped in the structure implied by the quantifiers and alternations around it, rather than wrapping tuples of adjacent captures in the structure:
let regex = /ab(?:c(d)*(ef))?gh/
if let match = "abcddddefgh".firstMatch(of: regex) {
  print((match.1, match.2)) // => (Optional(["d","d","d","d"]), Optional("ef"))
}

(The point about multiple levels of quantifiers/alternations creating multiple wrapping types is particularly useful to mention here, because it's implied by the rules in the "detailed design" section but never stated outright; a casual reader might assume that we're envisioning flattening these structures as well as tuples.)

For clarity, we should also reword a few parts of the "capture type" section to better describe the flattening behavior:

Because concatenations are allowed at deeper levels in the pattern, I would not describe the leading Substring as being introduced by a concatenation, and I would not talk about forming a tuple here (which would imply non-flat behavior). Instead, I would say something like:

A concatenation's capture types are a concatenation of the capture types of its underlying patterns, ignoring any underlying patterns with no captures.

If we feel that we need to restate the existence of the initial Substring more clearly, we should reiterate it in the "Basics" section. We might even rename that to "top-level" or "root". We could then (accurately!) position the fact that a captureless regex's type is Regex<Substring> as a degenerate case falling out of Swift's normal type system rules, which treat a 1-tuple as synonymous with the element itself.

The other changes are just minor tweaks:

"…capture types are the same as its underlying pattern's."

"…underlying pattern's capture types in either Optionals or Arrays."

Syntax	Description	Capture type
		...
`+`	1 or more	`Array`s of the sub-pattern capture types
`?`	0 or 1	`Optional`s of the sub-pattern capture types
		...

(Apply that to the other rows too, of course.)

"…wraps its underlying patterns' capture types in Optionals and concatenates them together, first to last."

There's a subtle shift in the apostrophe here and a little extra verbiage to reflect the fact that, unlike the other constructs, alternations have several sub-patterns and we are combining their capture types together into a single capture type.

[Pitch] Strongly Typed Regex Captures

Strongly Typed Regex Captures

See latest revision below

Original version:

Introduction

Motivation

Proposed solution

Detailed design

Regex type

firstMatch method

Capture type

Basics

Capturing group: (...)

Concatenation: abc

Named capturing group: (?<name>...)

Non-capturing group: (?:...)

Nested capturing group: (...(...)...)

Quantification: *, +, ?, {n}, {n,}, {n,m}

Alternation: a|b

Effect on ABI stability

Effect on API resilience

Alternatives considered

Lazy collections instead of arrays of substrings

Homogeneous tuples for exact-count quantification

Never as empty capture instead of Void

Regex<(EntireMatch, Capture...)> instead of Regex<Captures>

Future directions

Dynamic captures

Single-element labeled tuples

Strongly Typed Regex Captures

Revision history

Introduction

Motivation

Proposed solution

Detailed design

Regex type

firstMatch(of:) method

Capture type

Basics

Capturing group: (...)

Concatenation: abc

Named capturing group: (?<name>...)

Non-capturing group: (?:...)

Nested capturing group: (...(...))

Quantification: *, +, ?, {n}, {n,}, {n,m}

Alternation: a|b

Effect on ABI stability

Effect on API resilience

Alternatives considered

Lazy collections instead of arrays of substrings

Homogeneous tuples for exact-count quantification

Regex<Captures> instead of Regex<Match>

Structured rather than flat captures

Future directions

Dynamic captures

`Regex` type

`firstMatch` method

Capturing group: `(...)`

Concatenation: `abc`

Named capturing group: `(?<name>...)`

Non-capturing group: `(?:...)`

Nested capturing group: `(...(...)...)`

Quantification: `*`, `+`, `?`, `{n}`, `{n,}`, `{n,m}`

Alternation: `a|b`

`Never` as empty capture instead of `Void`

`Regex<(EntireMatch, Capture...)>` instead of `Regex<Captures>`

`Regex` type

`firstMatch(of:)` method

Capturing group: `(...)`

Concatenation: `abc`

Named capturing group: `(?<name>...)`

Non-capturing group: `(?:...)`

Nested capturing group: `(...(...))`

Quantification: `*`, `+`, `?`, `{n}`, `{n,}`, `{n,m}`

Alternation: `a|b`

`Regex<Captures>` instead of `Regex<Match>`