[Pitch] Regex Type and Overview

Michael_Ilseman · March 18, 2022, 1:43am

Regex Type and Overview

Authors: Michael Ilseman and the Standard Library Team

Introduction

Swift strings provide an obsessively Unicode-forward model of programming with strings. String processing with Collection's algorithms is woefully inadequate for many day-to-day tasks compared to other popular programming and scripting languages.

We propose addressing this basic shortcoming through an effort we are calling regex. What we propose is more powerful, extensible, and maintainable than what is traditionally thought of as regular expressions from other programming languages. This effort is presented as 6 interrelated proposals:

Regex<Output> and Regex<Output>.Match types with support for typed captures, both static and dynamic.
A best-in-class treatment of traditional, familiar regular expression syntax for run-time construction of regex.
A literal for compile-time construction of a regex with statically-typed captures, enabling powerful source tools.
An expressive and composable result-builder DSL, with support for capturing strongly-typed values.
A modern treatment of Unicode semantics and string processing.
A treasure trove of string processing algorithms, along with library-extensible protocols enabling industrial-strength parsers to be used seamlessly as regex components.

This proposal provides details on #1, the Regex type and captures, and gives an overview of how each of the other proposals fit into regex in Swift.

At the time of writing, these related proposals are in various states of being drafted, pitched, or proposed. For the current status, see Pitch and Proposal Status.

Obligatory differentiation from formal regular expressions

Regular expressions originated in formal language theory as a way to answer yes-or-no whether a string is in a given regular language. They are more powerful (and less composable) than star-free languages and less powerful than context-free languages. Because they just answer a yes-or-no question, how that answer is determined is irrelevant; i.e. their execution model is ambiguous.

Regular expressions were brought into practical applications for text processing and compiler lexers. For searching within text, where the result (including captures) is a portion of the searched text, how a match happened affects the result. Over time, more and more power was needed and "regular expressions" diverged from their formal roots.

For compiler lexers, especially when implemented as a discrete compilation phase, regular expressions were often ingested by a separate tool from the rest of the compiler. Understanding formal regular expressions can help clarify the separation of concerns between lexical analysis and parsing. Beyond that, they are less relevant for structuring modern parsers, which interweave error handling and recovery, debuggability, and fine-grained source location tracking across this traditional separation-of-tools.

The closest formal analogue to what we are proposing are Parsing Expression Grammars ("PEGs"), which describe a recursive descent parser. Our alternation is ordered choice and we support possessive quantification, recursive subpattern calls, and lookahead. However, we are first and foremost providing a regexy presentation: quantification, by default, is non-possessive.

Motivation

Imagine processing a bank statement in order to extract transaction details for further scrutiny. Fields are separated by 2-or-more spaces:

struct Transaction {
  enum Kind { case credit; case debit }

  let kind: Kind
  let date: Date
  let accountName: String
  let amount: Decimal
}

let statement = """
  CREDIT    03/02/2022    Payroll                   $200.23
  CREDIT    03/03/2022    Sanctioned Individual A   $2,000,000.00
  DEBIT     03/03/2022    Totally Legit Shell Corp  $2,000,000.00
  DEBIT     03/05/2022    Beanie Babies Forever     $57.33
  """

One option is to split() around whitespace, hard-coding field offsets for everything except the account name, and join()ing the account name fields together to restore their spaces. This carries a lot of downsides such as hard-coded offsets, many unnecessary allocations, and this pattern would not easily expand to supporting other representations.

Another option is to process an entry in a single pass from left-to-right, but this can get unwieldy:

// Parse dates using a simple (localized) numeric strategy
let dateParser = Date.FormatStyle(date: .numeric).parseStrategy

// Parse currencies as US dollars
let decimalParser = Decimal.FormatStyle.Currency(code: "USD")

func processEntry(_ s: String) -> Transaction? {
  var slice = s[...]
  guard let kindEndIdx = slice.firstIndex(of: " "),
        let kind = Transaction.Kind(slice[..<kindEndIdx])
  else {
    return nil
  }

  slice = slice[kindEndIdx...].drop(while: \.isWhitespace)
  guard let dateEndIdx = slice.firstIndex(of: " "),
        let date = try? Date(
          String(slice[..<dateEndIdx]), strategy: dateParser)
  else {
    return nil
  }
  slice = slice[dateEndIdx...].drop(while: \.isWhitespace)

  // Account can have spaces, look for 2-or-more for end-of-field
  // ...
  // You know what, let's just bail and call it a day
  return nil
}

Foundation provides NSRegularExpression, an API around ICU's regular expression engine. An NSRegularExpression is constructed at run time from a string containing regex syntax. Run-time construction is very useful for tools, such as SwiftPM's swift test --filter and search fields inside text editors.

let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
let nsRegEx = try! NSRegularExpression(pattern: pattern)

func processEntry(_ line: String) -> Transaction? {
  let range = NSRange(line.startIndex..<line.endIndex, in: line)
  guard let result = nsRegEx.firstMatch(in: line, range: range),
        let kindRange = Range(result.range(at: 1), in: line),
        let kind = Transaction.Kind(line[kindRange]),
        let dateRange = Range(result.range(at: 2), in: line),
        let date = try? Date(String(line[dateRange]), strategy: dateParser),
        let accountRange = Range(result.range(at: 3), in: line),
        let amountRange = Range(result.range(at: 4), in: line),
        let amount = try? Decimal(
          String(line[amountRange]), format: decimalParser)
  else {
    return nil
  }

  return Transaction(
    kind: kind, date: date, account: String(line[accountRange]), amount: amount)
}

The pattern is represented as a string literal, missing out on better source tooling. For example, the pattern describes an algorithm and would benefit from syntax highlighting more akin to code than uniform data. We also miss the opportunity to represent the number and kinds of captures in the type system: the programmer must remember to check for NSNotFound or make sure that whatever the capture is passed to does.

Traditional regular expression engines and tooling present an all-in or all-out world, making them impervious to refactoring or sharing common sub-components. This also encourages programmers to use regular expressions to parse things they shouldn't, such as dates, times, numbers, and currencies. In the code above, an overly-permissive parser is used and validation and interpretation is left as a post-processing phase, increasing complexity and maintenance burden.

Fundamentally, ICU's regular expression engine operates over a different model of string than Swift's model. The results may split grapheme clusters apart (potentially introducing degenerate grapheme clusters), ICU does not support comparing under canonical equivalence, etc. This means that using NSRegularExpression will often produce different results than the equivalent algorithm ran over String.

Finally, NSRegularExpression, due to both compatibility reasons and needing to use ICU, incurs bridging overhead and is unable to take advantage of Swift's native string representations.

Proposed solution

A Regex<Output> describes a string processing algorithm. Captures surface the portions of the input that were matched by subpatterns. By convention, capture 0 is the entire match.

Creating Regex

Regexes can be created at run time from a string containing familiar regex syntax. If no output type signature is specified, the regex has type Regex<AnyRegexOutput>, in which captures are existentials and the number of captures is queryable at run time. Alternatively, providing an output type signature produces strongly-typed outputs, where captures are concrete types embedded in a tuple, providing safety and enabling source tools such as code completion.

let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
let regex = try! Regex(compiling: pattern)
// regex: Regex<AnyRegexOutput>

let regex: Regex<(Substring, Substring, Substring, Substring, Substring)> =
  try! Regex(compiling: pattern)

Note: The syntax accepted and further details on run-time compilation, including AnyRegexOutput and extended syntaxes, are discussed in Run-time Regex Construction.

Type mismatches and invalid regex syntax are diagnosed at construction time by throwing errors.

When the pattern is known at compile time, regexes can be created from a literal containing the same regex syntax, allowing the compiler to infer the output type. Regex literals enable source tools, e.g. syntax highlighting and actions to refactor into a result builder equivalent.

let regex = re'(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)'
// regex: Regex<(Substring, Substring, Substring, Substring, Substring)>

Note: Regex literals, most notably the choice of delimiter, are discussed in Regex Literals. For this example, I used the less technically-problematic option of re'...'.

This same regex can be created from a result builder, a refactoring-friendly representation:

let fieldSeparator = Regex {
  CharacterClass.whitespace
  OneOrMore(.whitespace)
}

let regex = Regex {
  Capture(OneOrMore(.word))
  fieldSeparator

  Capture(OneOrMore(.whitespace.inverted))
  fieldSeparator

  Capture {
    OneOrMore {
      NegativeLookahead(fieldSeparator)
      CharacterClass.any
    }
  }
  fieldSeparator

  Capture { OneOrMore(.any) }
}
// regex: Regex<(Substring, Substring, Substring, Substring, Substring)>

Note: The result builder API is discussed in Regex Builders. Character classes and other Unicode concerns are discussed in Unicode for String Processing.

Regex itself is a valid component for use inside a result builder, meaning that embedded literals can be used for concision.

Using Regex

A Regex<Output>.Match contains the result of a match, surfacing captures by number, name, and reference.

func processEntry(_ line: String) -> Transaction? {
  let regex = re'''
    (?x) # Ignore whitespace and comments
    (?<kind>    \w+)                \s\s+
    (?<date>    \S+)                \s\s+
    (?<account> (?: (?!\s\s) . )+)  \s\s+
    (?<amount>  .*)
    '''
  //  regex: Regex<(
  //    Substring,
  //    kind: Substring,
  //    date: Substring,
  //    account: Substring,
  //    amount: Substring
  //  )>

  guard let match = regex.matchWhole(line),
        let kind = Transaction.Kind(match.kind),
        let date = try? Date(String(match.date), strategy: dateParser),
        let amount = try? Decimal(String(match.amount), format: decimalParser)
  else {
    return nil
  }

  return Transaction(
    kind: kind, date: date, account: String(match.account), amount: amount)
}

Note: Details on typed captures using tuple labels are covered in Regex Literals.

The result builder allows for inline failable value construction, which participates in the overall string processing algorithm: returning nil signals a local failure and the engine backtracks to try an alternative. This not only relieves the use site from post-processing, it enables new kinds of processing algorithms, allows for search-space pruning, and enhances debuggability.

Swift regexes describe an unambiguous algorithm, were choice is ordered and effects can be reliably observed. For example, a print() statement inside the TryCapture's transform function will run whenever the overall algorithm naturally dictates an attempt should be made. Optimizations can only elide such calls if they can prove it is behavior-preserving (e.g. "pure").

CustomMatchingRegexComponent, discussed in String Processing Algorithms, allows industrial-strength parsers to be used a regex components. This allows us to drop the overly-permissive pre-parsing step:

func processEntry(_ line: String) -> Transaction? {
  let fieldSeparator = Regex {
    CharacterClass.whitespace
    OneOrMore(.whitespace)
  }

  // Declare strongly-typed references to store captured values into
  let kind = Reference<Transaction.Kind>()
  let date = Reference<Date>()
  let account = Reference<Substring>()
  let amount = Reference<Decimal>()

  let regex = Regex {
    TryCapture(as: kind) {
      OneOrMore(.word)
    } transform: {
      Transaction.Kind($0)
    }
    fieldSeparator

    TryCapture(as: date) { dateParser }
    fieldSeparator

    Capture(as: account) {
      OneOrMore {
        NegativeLookahead(fieldSeparator)
        CharacterClass.any
      }
    }
    fieldSeparator

    TryCapture(as: amount) { decimalParser }
  }
  // regex: Regex<(Substring, Transaction.Kind, Date, Substring, Decimal)>

  guard let match = regex.matchWhole(line) else { return nil }

  return Transaction(
    kind: match[kind],
    date: match[date],
    account: String(match[account]),
    amount: match[amount])
}

Note: Details on how references work is discussed in Regex Builders. Regex.Match supports referring to all captures by position (match.1, etc.) whether named or referenced or neither. Due to compiler limitations, result builders do not support forming labeled tuples for named captures.

Algorithms, algorithms everywhere

Regexes can be used right out of the box with a variety of powerful and convenient algorithms, including trimming, splitting, and finding/replacing all matches within a string.

These algorithms are discussed in String Processing Algorithms.

Onward Unicode

A regex describes an algorithm to be ran over some model of string, and Swift's String has a rather unique Unicode-forward model. Character is an extended grapheme cluster and equality is determined under canonical equivalence.

Calling dropFirst() will not drop a leading byte or Unicode.Scalar, but rather a full Character. Similarly, a . in a regex will match any extended grapheme cluster. A regex will match canonical equivalents by default, strengthening the connection between regex and the equivalent String operations.

Additionally, word boundaries (\b) follow UTS#29 Word Boundaries, meaning contractions ("don't") and script changes are detected and separated, without incurring significant binary size costs associated with language dictionaries.

Regex targets UTS#18 Level 2 by default, but provides options to switch to scalar-level processing as well as compatibility character classes. Detailed rules on how we infer necessary grapheme cluster breaks inside regexes, as well as options and other concerns, are discussed in Unicode for String Processing.

Detailed design

/// A regex represents a string processing algorithm.
public struct Regex<Output> {
  /// Match a string in its entirety.
  ///
  /// Returns `nil` if no match and throws on abort
  public func matchWhole(_: String) throws -> Match?

  /// Match at the front of a string
  ///
  /// Returns `nil` if no match and throws on abort
  public func matchFront(_: String) throws -> Match?

  /// The result of matching a regex against a string.
  @dynamicMemberLookup
  public struct Match {
    /// The range of the overall match
    public let range: Range<String.Index>
  
    /// The produced output from the match operation
    public var output: Output
  
    /// Lookup a capture by number
    public subscript<T>(dynamicMember keyPath: KeyPath<Output, T>) -> T
  
    /// Lookup a capture by number
    @_disfavoredOverload
    public subscript(
      dynamicMember keyPath: KeyPath<(Output, _doNotUse: ()), Output>
    ) -> Output
    // Note: this allows `.0` when `Match` is not a tuple.
  
  }
}

Note: The below are covered by other proposals, but listed here to help round out intuition.


// Result builder interfaces
extension Regex: RegexComponent {
  public var regex: Regex<Output> { self }

  /// Create a regex out of a single component
  public init<Content: RegexComponent>(
    _ content: Content
  ) where Content.Output == Output

  /// Result builder interface
  public init<Content: RegexComponent>(
    @RegexComponentBuilder _ content: () -> Content
  ) where Content.Output == Output

}
extension Regex.Match {
  /// Lookup a capture by reference
  public subscript<Capture>(_ reference: Reference<Capture>) -> Capture
}

// Run-time compilation interfaces
extension Regex {
  /// Parse and compile `pattern`.
  public init(compiling pattern: String, as: Output.Type = Output.self) throws
}
extension Regex where Output == AnyRegexOutput {
  /// Parse and compile `pattern`.
  public init(compiling pattern: String) throws
}

On severability and related proposals

The proposal split presented is meant to aid focused discussion, while acknowledging that each is interconnected. The boundaries between them are not completely cut-and-dry and could be refined as they enter proposal phase.

Accepting this proposal in no way implies that all related proposals must be accepted. They are severable and each should stand on their own merit.

Source compatibility

Everything in this proposal is additive. Regex delimiters may have their own source compatibility impact, which is discussed in that proposal.

Effect on ABI stability

Everything in this proposal is additive. Run-time strings containing regex syntax are represented in the ABI as strings. For this initial release, literals are strings in the ABI as well (they get re-parsed at run time), which avoids baking an intermediate representation into Swift's ABI as we await better static compilation support (see future work).

Effect on API resilience

N/A

Alternatives considered

Regular expressions are a blight upon computing!

"I had one problem so I wrote a regular expression, now I have two problems!"

Regular expressions have a deservedly mixed reputation, owing to their historical baggage and treatment as a completely separate tool or subsystem. Despite this, they still occupy an important place in string processing. We are proposing the "regexiest regex", allowing them to shine at what they're good at and providing mitigations and off-ramps for their downsides.

"Regular expressions are bad because you should use a real parser"
- In other systems, you're either in or you're out, leading to a gravitational pull to stay in when... you should get out
- Our remedy is interoperability with real parsers via CustomMatchingRegexComponent
- Literals with refactoring actions provide an incremental off-ramp from regex syntax to result builders and real parsers
"Regular expressions are bad because ugly unmaintainable syntax"
- We propose literals with source tools support, allowing for better syntax highlighting and analysis
- We propose result builders and refactoring actions from literals into result builders
"Regular expressions are bad because Unicode"
- We propose a modern Unicode take on regexes
- We treat regexes as algorithms to be ran over some model of String, like's Swift's default Character-based view.
"Regular expressions are bad because they're not powerful enough"
- Engine is general-purpose enough to support recursive descent parsers with captures, back-references, and lookahead
- We're proposing a regexy presentation on top of more powerful functionality
"Regular expressions are bad because they're too powerful"
- We provide possessive quantifications, atomic groups, etc., all the normal ways to prune backtracking
- We provide clear semantics of how alternation works as ordered-choice, allowing for understandable execution
- Pathological behavior is ultimately a run-time concern, better handled by engine limiters (future work)
- Optimization is better done as a compiler problem, e.g. static compilation to DFAs (future work)
- Formal treatment of power is better done by other presentations, like PEGs and linear automata (future work)

Alternative names

The generic parameter to Regex is Output and the erased version is AnyRegexOutput. This is... fairly generic sounding.

An alternative could be Captures, doubling down on the idea that the entire match is implicitly capture 0, but that can make describing and understanding how captures combine in the result builder harder to reason through (i.e. a formal distinction between explicit and implicit captures).

An earlier prototype used the name Match for the generic parameter, but that quickly got confusing with all the match methods and was confusing with the result of a match operation (which produces the output, but isn't itself the generic parameter). We think Match works better as the result of a match operation.

What's with all the `String(...)` initializer calls at use sites?

We're working on how to eliminate these, likely by having API to access ranges, slices, or copies of the captured text.

We're also looking for more community discussion on what the default type system and API presentation should be. As pitched, Substring emphasizes that we're referring to slices of the original input, with strong sharing connotations.

The actual Match struct just stores ranges: the Substrings are lazily created on demand. This avoids unnecessary ARC traffic and memory usage.

Future work: static optimization and compilation

Swift's support for static compilation is still developing, and future work here is leveraging that to compile regex when profitable. Many regex describe simple DFAs and can be statically compiled into very efficient programs. Full static compilation needs to be balanced with code size concerns, as a matching-specific bytecode is typically far smaller than a corresponding program (especially since the bytecode interpreter is shared).

Regex are compiled into an intermediary representation and fairly simple analysis and optimizations are high-value. This compilation currently happens at run time (as such the IR is not ABI), but more of this could happen at compile time to save load/compilation time of the regex itself. Ideally, this representation would be shared along the fully-static compilation path and can be encoded in the ABI as a compact bytecode.

Future work: parser combinators

What we propose here is an incremental step towards better parsing support in Swift using parser-combinator style libraries. The underlying execution engine supports recursive function calls and mechanisms for library extensibility. CustomMatchingRegexComponent's protocol requirement is effectively a monadic parser, meaning Regex provides a regex-flavored combinator-like system.

An issues with traditional parser combinator libraries are the compilation barriers between call-site and definition, resulting in excessive and overly-cautious backtracking traffic. These can be eliminated through better compilation techniques. As mentioned above, Swift's support for custom static compilation is still under development.

Future work is a parser combinator system which leverages tiered static compilation and presents a parser-flavored approach, such as limited backtracking by default and more heavily interwoven recursive calls.

Future work: `Regex`-backed enums

Regexes are often used for tokenization and tokens can be represented with Swift enums. Future language integration could include Regex backing somewhat analogous to RawRepresentable enums. A Regex-backed enum could conform to RegexComponent producing itself upon a match by forming an ordered choice of its cases.

xwu · March 18, 2022, 3:24am

What sorts of errors are planned to be thrown, and is the distinction among them going to be meaningful for users? What I’m driving at is whether making these failable instead of throwing could be considered.

Splitting “don’t” into two words is a bit counterintuitive. I recognize that if the only generally satisfactory definition of a Unicode word boundary requires it then it’s hard to argue against, but if there is another workable definition that doesn’t have this behavior it may be worth exploring.

In other scenarios (not only in Swift but also including, for example, in CSS terminology), the term for the beginning of a string (agnostic to writing direction) is “start” (with the other end being “end”) rather than “front” and “back”… Was that considered and rejected here?

Agree on the generic sounding Output. How about…CapturedOutput and AnyRegexCapturedOutput.

Alejandro · March 18, 2022, 3:27am

With UTS#29, we don't split "don't" into 2 words, its a singular word.

xwu · March 18, 2022, 3:30am

That is great—and how I recalled it—which is why the text was surprising:

Alejandro · March 18, 2022, 3:33am

Yeah, I think what he means by this is that the following: Chicago에. We notice the script change and can separate these two words out to Chicago and 에.

Michael_Ilseman · March 18, 2022, 2:01pm

This is still being fleshed out, but the most likely case is that the errors will be the same ones given to the Swift compiler to present as compile-time diagnostics. So there'd be a reason, a range over the input which triggered the reason, and potentially auxiliary information like a suggested fix. This can also drive pretty-printing, where the problematic range is underlined. That being said, there is a lot being proposed in this initial release so it's likely that some of these features will appear over time.

I think having some interface to access that information would be very valuable for tools taking a user-provided regex. At the very least, even if the tool is just going to fatal-error, they can present the user with a reason why their command-line invocation failed.

It certainly can be considered, but we'd likely want a throwing interface anyways. In what ways were you thinking failable would be better than try? / try!?

Argh, language. I meant to imply:

Contractions ("don't") are correctly detected and script changes are separated

"start" works for me, although we also use the term "prefix". These are the interfaces invoked by the higher-level algorithms and it would be great to align the naming scheme. There, we have trimPrefix() and starts(with:). Thoughts?

CapturedOutput seems it would have both the issues of Captures and the rather generic-sounding Output.

This distinction mostly comes up when explaining the semantic model of how concatenation of regexes work (CC @rxwei).

westacular · March 18, 2022, 4:27pm

Regarding the need for String(...) initializer calls:

One possible way to reduce those would be to support using Reference<String> with Capture(as:), and have the resulting Match lazily create a String instead of Substring. That would at least remove that wart from the final Transaction example here.

benlings · March 18, 2022, 5:07pm

My thought when first reading the pitch was that matchPrefix would probably be clearer than matchFront. This would also fit with String’s hasPrefix.

xwu · March 18, 2022, 5:36pm

If there isn’t a way in code for the end user’s code to behave differently depending on the error, then just making it failable would comport best with Swift’s error handling design, I think. Definitely agree that it’d be great to surface all relevant diagnostics somehow, and if there’s different kinds of error and users can recover from it in meaningfully distinct ways, then throwing is the way to go!

To my mind “prefix” strongly suggests that the result is anchored to the starting element, but I don’t know that this is anything but an idiosyncratic intuition. Either seems fine honestly.

rxwei · March 19, 2022, 12:25am

Regex<(A, B, C, D)>
       ^ Whole match
          ^~~~~~~ Captures

"Output" is defined as "whole match" followed by "captures". Concatenation's output is defined as Substring (its whole match) followed by each component's captures. Other types Outputs are defined similarly based on the fact that "captures" refers to "output except the first element".

If we put "captures" in the generic parameter, we would be re-defining this concept and including the whole match as the 0-th capture. There are two issues:

The 0-th capture would be implicit and does not correspond to any Capture(...) component in the builder. It's possible set up this mental model, but I couldn't find any existing regex documentation that refers to the entire match as "the first capture"/"the first group".
I'm not sure what's the best word to use to accurately refer to explicit captures especially when it comes to describing the behavior of concatenation, alternation, and quantifiers. "Captures except the first element" seems too verbose and arbitrary to be used for that.

hisekaldma · March 19, 2022, 9:16am

What would happen if you separated the whole match and the captures into two separate type parameters, i.e. have Regex<Match, Captures>? (Regexes that don’t have any captures would be Regex<Match, Never>.)

xwu · March 19, 2022, 11:23am

By convention, the zeroth capture is the whole match—as we wouldn’t (couldn’t) ever change this for regex syntax itself, it would be unwise to make $1 in regex syntax equivalent to .0 in Swift syntax.

hisekaldma · March 19, 2022, 12:15pm

That’s a good point! Swift bucks convention in lots of ways though, so perhaps that discrepancy might be a price worth paying? From the POV of someone who uses the result builder instead of the traditional syntax, the second of these is definitely clearer (albeit more verbose):

// Regex<Substring, TransactionKind, Date, Decimal>
let whole = match.0
let kind = match.1
let date = match.2
let amount = match.3

// Regex<Substring, (TransactionKind, Date, Decimal)>
let whole = match.whole
let kind = match.captures.0
let date = match.captures.1
let amount = match.captures.2

It seems to me a design like that might at least be worth exploring, if only to reject it as a considered alternative. (It obviously has other downsides too, e.g. needing to include Never in the signature when there aren’t any captures, making capture extraction more verbose, etc.)

stackotter · March 19, 2022, 12:38pm

To me it wouldn’t make sense for that to be Never, I think it would be Void instead. Because it’s not that accessing the empty capture list should be invalid, it’s just an empty value. And as a bonus, () is interpreted as Void in swift and also resembles an empty tuple, which is what is being represented.

stackotter · March 19, 2022, 12:46pm

I agree that the second option is much nicer.

In regards to the first option: It’s definitely not clear to me (as someone that doesn’t use regex much) that the first capture would implicitly be the entire matched region. Swift is never about making code a tiny bit more compact at the cost of readability.

xwu · March 19, 2022, 3:44pm

Whatever we decide for Swift, it won’t change the fact that in order to use regex captures correctly it is essential to know that the zeroth capture is the entire matched region and that the rest start at index 1. Is it desirable for users to have to know additionally that Swift indexes captures differently from regex syntax, a Swift idiosyncrasy layered on top of the existing rules? I don’t think that solves more problems than it introduces.

We could say that there isn’t an implicit capture of the whole match when using Swift’s result builder syntax, but that would make changing from one to the other syntax have the pitfall of introducing off-by-one errors.

Michael_Ilseman · March 19, 2022, 7:16pm

matchPrefix() sounds ideal in that case. CustomMatchingRegexComponent's requirement should have "prefix" in the name as well because it's meant to nibble off the front of the range. Since it doesn't create a Regex.Match, but instead the deconstructed monadic-parser-style equivalent, I do wonder if we should have a different base name than "match" there.

Michael_Ilseman · March 19, 2022, 7:21pm

Right, internally the match result is just ranges (or ranges and Anys for value captures like Date). The Substring is produced lazily in order to avoid unnecessary ARC traffic.

I'm ok with the idea of adding a special case to produce the String when the reference has that type, or when the Regex<...> generic parameter has that type. Type inference will still prefer Substring as proposed. @rxwei what do you think?

An interesting discussion is whether the default should be flipped, that is whether they should all be Strings by default in the type system representation and default capture API. In that world, we'd have the following story for non-valued captures:

String in type system and default API (e.g. match.2)
Explicit API to access the range
Explicit/convenience API to access the slice

The alternative would be to go as proposed and have an explicit API to produce the String. I've long wanted a var copy: String on Substring and I add those kinds of utilities to my own projects for expression chains, but that kind of one-off can be a hard sell on swift-evolution.

stackotter · March 19, 2022, 8:46pm

Why does that need to be the case in swift? If the entire matched region is separate, then it is completely clear that the tuple containing other captures only contains your explicit captures.

I think that it would be quite intuitive that the captures list starts from 0 given that the implicit capture of the matched region is separate, it’s not going to be duplicated.

I don’t agree that Swift should do it a non-Swift way just because of the way other languages have done it. Separating the implicit capture from the explicit captures doesn’t seem like a bad decision, and if someone expects the captures to start at 1, it should still be pretty clear that the implicit capture is separate.

xwu · March 19, 2022, 10:39pm

Because we are not forking regex syntax; it would defeat a tentpole rationale for implementing regex syntax if an existing formula that works unambiguously with any other regex engine means a different thing when parsed by Swift’s regex engine. To be concrete:

The regex (a|b)(a|b)\1 matches aba and not abb, because \1 refers to the first capture group and not the second. It would not be acceptable for the same regex to match abb rather than aba in Swift. This much is non-negotiable.

The question at hand is whether .1 in Swift should correspond to \1 in regex syntax, or whether it should be .0 that corresponds to \1 instead. That alternative doesn’t seem like an improvement to me.