[Pitch] Regex-powered string processing algorithms

This proposal is part of a larger regex-powered string processing initiative. Throughout the document, we will reference the still-in-progress RegexProtocol, Regex, and result builder DSL, but these are in flux and not formally part of this proposal. Further discussion of regex specifics is out of scope of this proposal and better discussed in another thread (see Pitch and Proposal Status for links to relevant threads).

Introduction

The Swift standard library's string processing algorithms are underpowered compared to other popular programming and scripting languages. Some of these omissions can be found in NSString, but these fundamental algorithms should have a place in the standard library.

We propose:

  1. New regex-powered algorithms over strings, bringing the standard library up to parity with scripting languages
  2. Generic Collection equivalents of these algorithms in terms of subsequences
  3. protocol CustomMatchingRegexComponent, which allows 3rd party libraries to provide their industrial-strength parsers as intermixable components of regexes

Motivation

A number of common string processing APIs are missing from the Swift standard library. While most of the desired functionalities can be accomplished through a series of API calls, every gap adds a burden to developers doing frequent or complex string processing. For example, here's one approach to find the number of occurrences a substring ("banana") within a string:

let str = "A banana a day keeps the doctor away. I love bananas; banana are my favorite fruit."

var idx = str.startIndex
var ranges = [Range<String.Index>]()
while let r = str.range(of: "banana", options: [], range: idx..<str.endIndex) {
    if idx != str.endIndex {
        idx = str.index(after: r.lowerBound)
    }
    ranges.append(r)
}

print(ranges.count)

While in Python this is as simple as

str = "A banana a day keeps the doctor away. I love bananas; banana are my favorite fruit."
print(str.count("banana"))

We propose adding string processing algorithms so common tasks as such can be achieved as straightforwardly.

Comparison of how Swift's APIs stack up with Python's.

Note: Only a subset of Python's string processing API are included in this table for the following reasons:

  • Functions to query if all characters in the string are of a specified category, such as isalnum() and isalpha(), are omitted. These are achievable in Swift by passing in the corresponding character set to allSatisfy(_:), so they're omitted in this table for simplicity.
  • String formatting functions such as center(length, character) and ljust(width, fillchar) are also excluded here as this proposal focuses on matching and searching functionalities.
Search and replace
Python Swift
count(sub, start, end)
find(sub, start, end), index(sub, start, end) firstIndex(where:)
rfind(sub, start, end), rindex(sub, start, end) lastIndex(where:)
expandtabs(tabsize), replace(old, new, count) Foundation.replacingOccurrences(of:with:)
maketrans(x, y, z) + translate(table)
Prefix and suffix matching
Python Swift
startswith(prefix, start, end) starts(with:) or hasPrefix(:)
endswith(suffix, start, end) hasSuffix(:)
removeprefix(prefix) Test if string has prefix with hasPrefix(:), then drop the prefix with dropFirst(:)
removesuffix(suffix) Test if string has suffix with hasSuffix(:), then drop the suffix with dropLast(:)
Strip / trim
Python Swift
strip([chars]) Foundation.trimmingCharacters(in:)
lstrip([chars]) drop(while:)
rstrip([chars]) Test character equality, then dropLast() iteratively
Split
Python Swift
partition(sep) Foundation.components(separatedBy:)
rpartition(sep)
split(sep, maxsplit) split(separator:maxSplits:...)
splitlines(keepends) split(separator:maxSplits:...)
rsplit(sep, maxsplit)

Complex string processing

Even with the API additions, more complex string processing quickly becomes unwieldy. Up-coming support for authoring regexes in Swift help alleviate this somewhat, but string processing in the modern world involves dealing with localization, standards-conforming validation, and other concerns for which a dedicated parser is required.

Consider parsing the date field "Date: Wed, 16 Feb 2022 23:53:19 GMT" in an HTTP header as a Date type. The naive approach is to search for a substring that looks like a date string (16 Feb 2022), and attempt to post-process it as a Date with a date parser:

let regex = Regex {
    capture {
        oneOrMore(.digit)
        " "
        oneOrMore(.word)
        " "
        oneOrMore(.digit)
    }
}

let dateParser = Date.ParseStrategy(format: "\(day: .twoDigits) \(month: .abbreviated) \(year: .padded(4))"
if let dateMatch = header.firstMatch(of: regex)?.0 {
    let date = try? Date(dateMatch, strategy: dateParser)
}

This requires writing a simplistic pre-parser before invoking the real parser. The pre-parser will suffer from being out-of-sync and less featureful than what the real parser can do.

Or consider parsing a bank statement to record all the monetary values in the last column:

let statement = """
CREDIT    04/06/2020    Paypal transfer    $4.99
CREDIT    04/03/2020    Payroll            $69.73
DEBIT     04/02/2020    ACH transfer       ($38.25)
DEBIT     03/24/2020    IRX tax payment    ($52,249.98)
"""

Parsing a currency string such as $3,020.85 with regex is also tricky, as it can contain localized and currency symbols in addition to accounting conventions. This is why Foundation provides industrial-strength parsers for localized strings.

Proposed solution

Complex string processing

We propose a CustomMatchingRegexComponent protocol which allows types from outside the standard library participate in regex builders and RegexComponent algorithms. This allows types, such as Date.ParseStrategy and FloatingPointFormatStyle.Currency, to be used directly within a regex:

let dateRegex = Regex {
    capture(dateParser)
}

let date: Date = header.firstMatch(of: dateRegex).map(\.result.1) 

let currencyRegex = Regex {
    capture(.localizedCurrency(code: "USD").sign(strategy: .accounting))
}

let amount: [Decimal] = statement.matches(of: currencyRegex).map(\.result.1)

String algorithm additions

We also propose the following regex-powered algorithms as well as their generic Collection equivalents. See the Detailed design section for a complete list of variation and overloads .

Function Description
contains(_:) -> Bool Returns whether the collection contains the given sequence or RegexComponent
starts(with:) -> Bool Returns whether the collection contains the same prefix as the specified RegexComponent
trimPrefix(_:) Removes the prefix if it matches the given RegexComponent or collection
firstRange(of:) -> Range? Finds the range of the first occurrence of a given sequence or RegexComponent
ranges(of:) -> some Collection<Range> Finds the ranges of the all occurrences of a given sequence or RegexComponent within the collection
replace(:with:subrange:maxReplacements) Replaces all occurrences of the sequence matching the given RegexComponent or sequence with a given collection
split(by:) Returns the longest possible subsequences of the collection around elements equal to the given separator
firstMatch(of:) Returns the first match of the specified RegexComponent within the collection
matches(of:) Returns a collection containing all matches of the specified RegexComponent

Detailed design

CustomMatchingRegexComponent

CustomMatchingRegexComponent inherits from RegexComponent and satisfies its sole requirement; Conformers can be used with all of the string algorithms generic over RegexComponent.

/// A protocol for custom match functionality.
public protocol CustomMatchingRegexComponent : RegexComponent {
    /// Match the input string within the specified bounds, beginning at the given index, and return
    /// the end position (upper bound) of the match and the matched instance.
    /// - Parameters:
    ///   - input: The string in which the match is performed.
    ///   - index: An index of `input` at which to begin matching.
    ///   - bounds: The bounds in `input` in which the match is performed.
    /// - Returns: The upper bound where the match terminates and a matched instance, or `nil` if
    ///   there isn't a match.
    func match(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) -> (upperBound: String.Index, match: Match)?
}
Example for protocol conformance

We use Foundation FloatingPointFormatStyle<Decimal>.Currency as an example for protocol conformance. It would implement the match function with Match being a Decimal. It could also add a static function .localizedCurrency(code:) as a member of RegexComponent, so it can be referred as .localizedCurrency(code:) in the Regex result builder:

extension FloatingPointFormatStyle<Decimal>.Currency : CustomMatchingRegexComponent { 
    public func match(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) -> (upperBound: String.Index, match: Decimal)?
}

extension RegexComponent where Self == FloatingPointFormatStyle<Decimal>.Currency {
    public static func localizedCurrency(code: Locale.Currency) -> Self
}

Matching and extracting a localized currency amount, such as "$3,020.85", can be done directly within a regex:

let regex = Regex {
    capture(.localizedCurreny(code: "USD"))
}

String algorithm additions

Contains

extension Collection where Element: Equatable {
    /// Returns a Boolean value indicating whether the collection contains the
    /// given sequence.
    /// - Parameter other: A sequence to search for within this collection.
    /// - Returns: `true` if the collection contains the specified sequence,
    /// otherwise `false`.
    public func contains<S: Sequence>(_ other: S) -> Bool
        where S.Element == Element
}

extension BidirectionalCollection where SubSequence == Substring {
    /// Returns a Boolean value indicating whether the collection contains the
    /// given regex.
    /// - Parameter regex: A regex to search for within this collection.
    /// - Returns: `true` if the regex was found in the collection, otherwise
    /// `false`.
    public func contains<R: RegexComponent>(_ regex: R) -> Bool
}

Starts with

extension BidirectionalCollection where SubSequence == Substring {
    /// Returns a Boolean value indicating whether the initial elements of the
    /// sequence are the same as the elements in the specified regex.
    /// - Parameter regex: A regex to compare to this sequence.
    /// - Returns: `true` if the initial elements of the sequence matches the
    /// beginning of `regex`; otherwise, `false`.
    public func starts<R: RegexComponent>(with regex: R) -> Bool
}

Trim prefix

extension Collection {
    /// Returns a new collection of the same type by removing initial elements
    /// that satisfy the given predicate from the start.
    /// - Parameter predicate: A closure that takes an element of the sequence
    /// as its argument and returns a Boolean value indicating whether the
    /// element should be removed from the collection.
    /// - Returns: A collection containing the elements of the collection that are
    ///  not removed by `predicate`.
    public func trimmingPrefix(while predicate: (Element) throws -> Bool) rethrows -> SubSequence
}

extension Collection where SubSequence == Self {
    /// Removes the initial elements that satisfy the given predicate from the
    /// start of the sequence.
    /// - Parameter predicate: A closure that takes an element of the sequence
    /// as its argument and returns a Boolean value indicating whether the
    /// element should be removed from the collection.
    public mutating func trimPrefix(while predicate: (Element) throws -> Bool)
}

extension RangeReplaceableCollection {
    /// Removes the initial elements that satisfy the given predicate from the
    /// start of the sequence.
    /// - Parameter predicate: A closure that takes an element of the sequence
    /// as its argument and returns a Boolean value indicating whether the
    /// element should be removed from the collection.
    public mutating func trimPrefix(while predicate: (Element) throws -> Bool)
}

extension Collection where Element: Equatable {
    /// Returns a new collection of the same type by removing `prefix` from the
    /// start.
    /// - Parameter prefix: The collection to remove from this collection.
    /// - Returns: A collection containing the elements that does not match
    /// `prefix` from the start.
    public func trimmingPrefix<Prefix: Collection>(_ prefix: Prefix) -> SubSequence
        where Prefix.Element == Element
}

extension Collection where SubSequence == Self, Element: Equatable {
    /// Removes the initial elements that matches `prefix` from the start.
    /// - Parameter prefix: The collection to remove from this collection.
    public mutating func trimPrefix<Prefix: Collection>(_ prefix: Prefix)
        where Prefix.Element == Element
}

extension RangeReplaceableCollection where Element: Equatable {
    /// Removes the initial elements that matches `prefix` from the start.
    /// - Parameter prefix: The collection to remove from this collection.
    public mutating func trimPrefix<Prefix: Collection>(_ prefix: Prefix)
        where Prefix.Element == Element
}

extension BidirectionalCollection where SubSequence == Substring {
    /// Returns a new subsequence by removing the initial elements that matches
    /// the given regex.
    /// - Parameter regex: The regex to remove from this collection.
    /// - Returns: A new subsequence containing the elements of the collection
    /// that does not match `prefix` from the start.
    public func trimmingPrefix<R: RegexComponent>(_ regex: R) -> SubSequence
}

extension RangeReplaceableCollection
  where Self: BidirectionalCollection, SubSequence == Substring
{
    /// Removes the initial elements that matches the given regex.
    /// - Parameter regex: The regex to remove from this collection.
    public mutating func trimPrefix<R: RegexComponent>(_ regex: R)
}

First range

extension Collection where Element: Equatable {
    /// Finds and returns the range of the first occurrence of a given sequence
    /// within the collection.
    /// - Parameter sequence: The sequence to search for.
    /// - Returns: A range in the collection of the first occurrence of `sequence`.
    /// Returns nil if `sequence` is not found.
    public func firstRange<S: Sequence>(of sequence: S) -> Range<Index>? 
        where S.Element == Element
}

extension BidirectionalCollection where Element: Comparable {
    /// Finds and returns the range of the first occurrence of a given sequence
    /// within the collection.
    /// - Parameter other: The sequence to search for.
    /// - Returns: A range in the collection of the first occurrence of `sequence`.
    /// Returns `nil` if `sequence` is not found.
    public func firstRange<S: Sequence>(of other: S) -> Range<Index>?
        where S.Element == Element
}

extension BidirectionalCollection where SubSequence == Substring {
    /// Finds and returns the range of the first occurrence of a given regex
    /// within the collection.
    /// - Parameter regex: The regex to search for.
    /// - Returns: A range in the collection of the first occurrence of `regex`.
    /// Returns `nil` if `regex` is not found.
    public func firstRange<R: RegexComponent>(of regex: R) -> Range<Index>?
}

Ranges

extension Collection where Element: Equatable {
    /// Finds and returns the ranges of the all occurrences of a given sequence
    /// within the collection.
    /// - Parameter other: The sequence to search for.
    /// - Returns: A collection of ranges of all occurrences of `other`. Returns
    ///  an empty collection if `other` is not found.
    public func ranges<S: Sequence>(of other: S) -> some Collection<Range<Index>>
        where S.Element == Element
}

extension BidirectionalCollection where SubSequence == Substring {
    /// Finds and returns the ranges of the all occurrences of a given sequence
    /// within the collection.
    /// - Parameter regex: The regex to search for.
    /// - Returns: A collection or ranges in the receiver of all occurrences of
    /// `regex`. Returns an empty collection if `regex` is not found.
    public func ranges<R: RegexComponent>(of regex: R) -> some Collection<Range<Index>>
}

First match

extension BidirectionalCollection where SubSequence == Substring {
    /// Returns the first match of the specified regex within the collection.
    /// - Parameter regex: The regex to search for.
    /// - Returns: The first match of `regex` in the collection, or `nil` if
    /// there isn't a match.
    public func firstMatch<R: RegexComponent>(of regex: R) -> RegexMatch<R.Match>?
}

Matches

extension BidirectionalCollection where SubSequence == Substring {
    /// Returns a collection containing all matches of the specified regex.
    /// - Parameter regex: The regex to search for.
    /// - Returns: A collection of matches of `regex`.
    public func matches<R: RegexComponent>(of regex: R) -> some Collection<RegexMatch<R.Match>>
}

Replace

extension RangeReplaceableCollection where Element: Equatable {
    /// Returns a new collection in which all occurrences of a target sequence
    /// are replaced by another collection.
    /// - Parameters:
    ///   - other: The sequence to replace.
    ///   - replacement: The new elements to add to the collection.
    ///   - subrange: The range in the collection in which to search for `other`.
    ///   - maxReplacements: A number specifying how many occurrences of `other`
    ///   to replace. Default is `Int.max`.
    /// - Returns: A new collection in which all occurrences of `other` in
    /// `subrange` of the collection are replaced by `replacement`.    
    public func replacing<S: Sequence, Replacement: Collection>(
        _ other: S,
        with replacement: Replacement,
        subrange: Range<Index>,
        maxReplacements: Int = .max
    ) -> Self where S.Element == Element, Replacement.Element == Element
  
    /// Returns a new collection in which all occurrences of a target sequence
    /// are replaced by another collection.
    /// - Parameters:
    ///   - other: The sequence to replace.
    ///   - replacement: The new elements to add to the collection.
    ///   - maxReplacements: A number specifying how many occurrences of `other`
    ///   to replace. Default is `Int.max`.
    /// - Returns: A new collection in which all occurrences of `other` in
    /// `subrange` of the collection are replaced by `replacement`.
    public func replacing<S: Sequence, Replacement: Collection>(
        _ other: S,
        with replacement: Replacement,
        maxReplacements: Int = .max
    ) -> Self where S.Element == Element, Replacement.Element == Element
  
    /// Replaces all occurrences of a target sequence with a given collection
    /// - Parameters:
    ///   - other: The sequence to replace.
    ///   - replacement: The new elements to add to the collection.
    ///   - maxReplacements: A number specifying how many occurrences of `other`
    ///   to replace. Default is `Int.max`.
    public mutating func replace<S: Sequence, Replacement: Collection>(
        _ other: S,
        with replacement: Replacement,
        maxReplacements: Int = .max
    ) where S.Element == Element, Replacement.Element == Element
}

extension RangeReplaceableCollection where SubSequence == Substring {
    /// Returns a new collection in which all occurrences of a sequence matching
    /// the given regex are replaced by another collection.
    /// - Parameters:
    ///   - regex: A regex describing the sequence to replace.
    ///   - replacement: The new elements to add to the collection.
    ///   - subrange: The range in the collection in which to search for `regex`.
    ///   - maxReplacements: A number specifying how many occurrences of the
    ///   sequence matching `regex` to replace. Default is `Int.max`.
    /// - Returns: A new collection in which all occurrences of subsequence
    /// matching `regex` in `subrange` are replaced by `replacement`.
    public func replacing<R: RegexComponent, Replacement: Collection>(
        _ regex: R,
        with replacement: Replacement,
        subrange: Range<Index>,
        maxReplacements: Int = .max
    ) -> Self where Replacement.Element == Element
  
    /// Returns a new collection in which all occurrences of a sequence matching
    /// the given regex are replaced by another collection.
    /// - Parameters:
    ///   - regex: A regex describing the sequence to replace.
    ///   - replacement: The new elements to add to the collection.
    ///   - maxReplacements: A number specifying how many occurrences of the
    ///   sequence matching `regex` to replace. Default is `Int.max`.
    /// - Returns: A new collection in which all occurrences of subsequence
    /// matching `regex` are replaced by `replacement`.
    public func replacing<R: RegexComponent, Replacement: Collection>(
        _ regex: R,
        with replacement: Replacement,
        maxReplacements: Int = .max
    ) -> Self where Replacement.Element == Element
  
    /// Replaces all occurrences of the sequence matching the given regex with
    /// a given collection.
    /// - Parameters:
    ///   - regex: A regex describing the sequence to replace.
    ///   - replacement: The new elements to add to the collection.
    ///   - maxReplacements: A number specifying how many occurrences of the
    ///   sequence matching `regex` to replace. Default is `Int.max`.
    public mutating func replace<R: RegexComponent, Replacement: Collection>(
        _ regex: R,
        with replacement: Replacement,
        maxReplacements: Int = .max
    ) where Replacement.Element == Element
  
    /// Returns a new collection in which all occurrences of a sequence matching
    /// the given regex are replaced by another regex match.
    /// - Parameters:
    ///   - regex: A regex describing the sequence to replace.
    ///   - replacement: A closure that receives the full match information,
    ///   including captures, and returns a replacement collection.
    ///   - subrange: The range in the collection in which to search for `regex`.
    ///   - maxReplacements: A number specifying how many occurrences of the
    ///   sequence matching `regex` to replace. Default is `Int.max`.
    /// - Returns: A new collection in which all occurrences of subsequence
    /// matching `regex` are replaced by `replacement`.
    public func replacing<R: RegexComponent, Replacement: Collection>(
        _ regex: R,
        with replacement: (RegexMatch<R.Match>) throws -> Replacement,
        subrange: Range<Index>,
        maxReplacements: Int = .max
    ) rethrows -> Self where Replacement.Element == Element
  
    /// Returns a new collection in which all occurrences of a sequence matching
    /// the given regex are replaced by another collection.
    /// - Parameters:
    ///   - regex: A regex describing the sequence to replace.
    ///   - replacement: A closure that receives the full match information,
    ///   including captures, and returns a replacement collection.
    ///   - maxReplacements: A number specifying how many occurrences of the
    ///   sequence matching `regex` to replace. Default is `Int.max`.
    /// - Returns: A new collection in which all occurrences of subsequence
    /// matching `regex` are replaced by `replacement`.
    public func replacing<R: RegexComponent, Replacement: Collection>(
        _ regex: R,
        with replacement: (RegexMatch<R.Match>) throws -> Replacement,
        maxReplacements: Int = .max
    ) rethrows -> Self where Replacement.Element == Element
  
    /// Replaces all occurrences of the sequence matching the given regex with
    /// a given collection.
    /// - Parameters:
    ///   - regex: A regex describing the sequence to replace.
    ///   - replacement: A closure that receives the full match information,
    ///   including captures, and returns a replacement collection.
    ///   - maxReplacements: A number specifying how many occurrences of the
    ///   sequence matching `regex` to replace. Default is `Int.max`.
    public mutating func replace<R: RegexComponent, Replacement: Collection>(
        _ regex: R,
        with replacement: (RegexMatch<R.Match>) throws -> Replacement,
        maxReplacements: Int = .max
    ) rethrows where Replacement.Element == Element
}

Split

extension Collection where Element: Equatable {
    /// Returns the longest possible subsequences of the collection, in order,
    /// around elements equal to the given separator.
    /// - Parameter separator: The element to be split upon.
    /// - Returns: A collection of subsequences, split from this collection's
    /// elements.
    public func split<S: Sequence>(by separator: S) -> some Collection<SubSequence>
        where S.Element == Element
}

extension BidirectionalCollection where SubSequence == Substring {
    /// Returns the longest possible subsequences of the collection, in order,
    /// around elements equal to the given separator.
    /// - Parameter separator: A regex describing elements to be split upon.
    /// - Returns: A collection of substrings, split from this collection's
    /// elements.
    public func split<R: RegexComponent>(by separator: R) -> some Collection<Substring>
}

Alternatives considered

Extend Sequence instead of Collection

Most of the proposed algorithms are necessarily on Collection due to the use of indices or mutation. Sequence does not support multi-pass iteration, so even trimPrefix would problematic on Sequence because it needs to look 1 Element ahead to know when to stop trimming.

Future directions

Backward algorithms

It would be useful to have algorithms that operate from the back of a collection, including ability to find the last non-overlapping range of a pattern in a string, and/or that to find the first range of a pattern when searching from the back, and trimming a string from both sides. They are deferred from this proposal as the API that could clarify the nuances of backward algorithms are still being explored.

Nuances of backward algorithms

There is a subtle difference between finding the last non-overlapping range of a pattern in a string, and finding the first range of this pattern when searching from the back.

The currently proposed algorithm that finds a pattern from the front, e.g. "aaaaa".ranges(of: "aa"), produces two non-overlapping ranges, splitting the string in the chunks aa|aa|a. It would not be completely unreasonable to expect to introduce a counterpart, such as "aaaaa".lastRange(of: "aa"), to return the range that contains the third and fourth characters of the string. This would be a shorthand for "aaaaa".ranges(of: "aa").last. Yet, it would also be reasonable to expect the function to return the first range of "aa" when searching from the back of the string, i.e. the range that contains the fourth and fifth characters.

Trimming a string from both sides shares a similar story. For example, "ababa".trimming("aba") can return either "ba" or "ab", depending on whether the prefix or the suffix was trimmed first.

Future API

Some Python functions are not currently included in this proposal, such as trimming the suffix from a string/collection. This pitch aims to establish a pattern for using RegexComponent with string processing algorithms, so that further enhancement can to be introduced to the standard library easily in the future, and eventually close the gap between Swift and other popular scripting languages.

48 Likes

Complete side-note: I’m looking forward to the new light-weight syntax for genetics. :sunglasses:

extension Collection<any Equatable> {
    public func contains(_ other: any Sequence<Element>) -> Bool
}

To be clear, those are existentials. The new lightweight syntax for generics under review would be some Equatable and some Sequence<Element>.

11 Likes

What's the difference between index and bounds.lowerBound here? I'm also not 100% sure why this method doesn't simply take a Substring, but I assume it's because the compiler wouldn't be able to avoid ARC traffic from creating it?

1 Like

These look like very welcome additions.

Should this:

be named func hasPrefix<R: RegexComponent>(_ regex: R) -> Bool? This would be more consistent with the names of the existing String.hasPrefix and proposed Regex.matchPrefix() and trimPrefix() methods.

6 Likes

The proposed API aligns with the more universal starts(with:).

Why hasPrefix even exists is a little beyond me and predates my involvement in the project. Here's the implementation, FWIW:

  @inlinable
  public func hasPrefix<Prefix: StringProtocol>(_ prefix: Prefix) -> Bool {
    return self.starts(with: prefix)
  }
8 Likes

Regexes include lookaround assertions, which might look before or after the current search point. Additionally, some anchors will actually try to look at the surrounding context outside of the searched range, though there are modes and options to tweak this.

The main reason is that it's strictly more general to take a collection and range than a slice, especially if there's ever a need to look outside that range (e.g. some anchors mentioned above). Substring does expose a .base method to get the overall string, but it's a little cleaner and clearer for libraries to just have the range. E.g., Substring can be mutated and that will change the value of base, so trafficking in ranges for the lowest-level conformer interfaces is clearer.

Avoiding ARC is always a plus, obviously, but it's not as clear to me that it would make a difference here.

1 Like

In my experience, localised parsers like this can be a bit of a trap. It sounds like a good idea, but in reality it means you can no longer predict how the parser will behave for a given input. There are all kinds of settings and cultural conventions which can interfere with your delimiters and understanding of the text, and of course no developers should be expected to know them all.

IMO, regexes are most useful for parsing well-formatted "machine text", not localised prose. What happens when the statement fails to parse because a user with a different locale setting is viewing it?

It's good to allow custom text formats to be used as regex components, but I'm not sure specifically that making localised parsing so prominent is a great idea.

For example, if you are parsing comma-separated values - if you write your code in a locale which uses periods for decimal points (e.g. the US, UK), you might not consider that other locales (e.g. Germany) sometimes use commas. It can becomes ambiguous where each value starts and ends. Incorporating localised data in to machine-readable formats can be deceptively tricky, and I'm not sure we can do it justice with this.

"Localized" includes "explicit control of localization," not just "implicitly according to a dynamic locale".

The ability to easily switch between localized formats via API operations, rather than having to rewrite the parser from scratch.

1 Like

Even then, I'd argue that it's better to normalize and match against the specific format you're expecting, if you're going to be explicitly setting an expected locale anyway. If you want to be explicit, why add an opaque dynamic framework between you and your pattern?

Yes, building patterns from reusable components is good, and as I said, I support that in principle.

This pitch has methods for both finding matches and checking if there is a match. However, there is a gap:

Has Match Return Match Anchoring
contains(_:) -> Bool firstMatch(of:) Unanchored
starts(with:) -> Bool prefixMatch(of:) Anchored at start ^
??? wholeMatch(of:) Anchored at start and end ^...$

Is this intentional? If not, could it be added? Maybe matchesWhole(of:)?

Related:

Should the ~= operator be for anchored or unanchored matching (ie. wholeMatch/matchesWhole or firstMatch/contains)?

The two languages I’ve used that allow using Regex in a switch have gone for different options:

case "abcde"
when /bcd/
  print "matched"
end
  • Scala does anchored matching (unless unanchored is used)
val pattern = "bcd".r

"abcde" match {
  case pattern(_*) => "Matches"
  case _ => "No match"
}

The same behaviour can be achieved whether it's anchored or not:

Anchored:

switch "abcde" {
  case /bcd/: 
     // no match
  case /.*bcd.*/:
    // matches
}

Unanchored

switch "abcde" {
  case /^bcd$/: 
     // no match
  case /bcd/:
    // matches
}

I think the unanchored option is possibly less surprising?

4 Likes

I think the anchored option is less surprising in a switch context. It means /bcd/ and the literal "bcd" either both match or both do not; the regex would behave like a literal, and you'd need to add any non-literal elements in (e.g. by explicitly adding that "there may be some leading/trailing text", as you've done here with the .* elements).

IMO, there is a difference in expectations when asking if a string matches a pattern, and asking whether a pattern can be matched at all within a string; like the difference between somebody who is a baker (by profession) and somebody who, amongst other things, also bakes (as a hobby).

It would be strange if:

switch input {
case /bcd/:
  ...
}

Would also match the strings:

  • "The first four letters are abcd" (ignoring the fact that it has leading text unspecified in the pattern)
  • "bcd-encoding is underrated" (ignoring the fact that it has trailing text unspecified in the pattern)

FWIW, I've encountered a couple of similar issues in the wild, where developers parsed/matched some input but forget to check if the end of the match is the end of the input. The results can be pretty poor, so in general I think it's wise to do anchored matching and ensure you understand which inputs you're accepting/rejecting.

For example, JavaScript's URL class has this mistake baked in to its specification:

// JavaScript
var url      = new URL("https://example.com/");
url.href;
// "https://example.com/"
//  ^^^^^

url.protocol = "ftp://will.be.ignored.org/";
url.href;
// "ftp://example.com/"
//  ^^^

The way JS defines its 'protocol' setter, it checks to see if the input starts with a URL scheme and delimiter, and allows any amount of trailing garbage to follow it (it can even be an entirely different URL). It doesn't even throw an error as it ignores the vast majority of the input. It's quite surprising and almost certainly an oversight, and I think developers should have to opt-in to that kind of thing, not opt-out.

3 Likes

I agree in general. I think that first-match semantics can be very useful and we should have a nice way to explicitly request that. But, first-match being the default lands the programmer in a perilous situation when the input is dynamic and unpredictable: new inputs might just happen to trigger an earlier case statement in non-obvious ways. One of the main advantages of Swift over C++ at the memory safety level, and over scripting languages at the type safety with explicit error handling level, is that it reduces the degree of programmer vigilance required to write and understand code. This of course always needs to be balanced with brevity and ergonomics. I think whole-match by default reduces programmer vigilance here.

I think an interesting direction to explore are regex modifiers for this purpose. If the case isn't a regex literal, it's a harder and more obnoxious to try to stick anchors in, or to try to insert a .* at beginning or end.

@nnnnnnnn, I believe [Pitch] Unicode for String Processing has most of the modifiers / options. What do you think about modifiers to change how a regex matches? E.g.

switch myString {
  case aRegex: ...
  case anotherRegex.matchingAnywhere: ...
  default: ...
}
4 Likes

I was hoping if it would be possible to use the RegexBuilder DSL to wrap a regex with the extra .* at the beginning and end.

Attempt 1

import _StringProcessing
import RegexBuilder

extension RegexComponent {
  public var matchingAnywhere: Regex<RegexOutput> {
    return Regex {
      OneOrMore(.any)
      regex
      OneOrMore(.any)
    }
  }
}

DSL's whole match output is Substring, so this doesn't work, and doesn't handle captures

Attempt 2

extension RegexComponent {
  public func matchingAnywhere() -> Regex<Substring> where RegexOutput == Substring {
    return Regex {
      ZeroOrMore(.any)
      regex
      ZeroOrMore(.any)
    }
  }
  public func matchingAnywhere<C0>() -> Regex<(Substring, C0)> where RegexOutput == (Substring, C0) {
    return Regex {
      ZeroOrMore(.any)
      regex
      ZeroOrMore(.any)
    }
  }
  // plus overloads for each capture arity
}

let r = try! Regex(compiling: #"bcd"#, as: Substring.self)
let c = "abcde".wholeMatch(of: r.matchingAnywhere())
print("contains? \(c?.0 ?? "<none>")")

Handles captures, but not AnyRegexOutput.

Attempt 3

public struct Unanchored<Output>: RegexComponent {
  public var regex: Regex<Output>

  @_disfavoredOverload
  public init<Component: RegexComponent>(
    @RegexComponentBuilder _ component: () -> Component
  ) where RegexOutput == Substring {
    regex = Regex {
      ZeroOrMore(.any)
      component()
      ZeroOrMore(.any)
    }
  }

  public init<W, C0, Component: RegexComponent>(
    @RegexComponentBuilder _ component: () -> Component
  ) where RegexOutput == (Substring, C0), Component.RegexOutput == (W, C0)
  {
    regex = Regex {
      ZeroOrMore(.any)
      component()
      ZeroOrMore(.any)
    }
  }
  
  // ... `O(arity)` overloads
}

let r = try! Regex(compiling: #"bcd"#)
let c = "abcde".wholeMatch(of: Unanchored { r })
print("contains? \(c?.0 ?? "<none>")")

Is there some trick that would allow Unanchored to be made into a modifier?

What about something like this?

public struct Unanchored<Pattern>: RegexComponent where Pattern : RegexComponent {
    public let regex: Regex<Pattern.Output>
    
    public init(_ pattern: Pattern) {
        self.regex = Regex {
            ZeroOrMore(.any)
            pattern
            ZeroOrMore(.any)
        }
    }
}
public enum RegexComponentAnchoring {
    case start
    case end
    case unanchored
}

extension RegexComponent {
    @RegexComponentBuilder
    func anchored(_ anchoring: RegexComponentAnchoring) -> Regex<Self.Output> {
        switch anchoring {
        case .start:
            StartAnchored(self)
        case .end:
            EndAnchored(self)
        case .unanchored:
            Unanchored(self)
        }
    }
}

Edit: Or since RegexComponentBuilder doesn't seem to work that way (I haven't been able to test the DSL):

extension RegexComponent {
    func anchored(_ anchoring: RegexComponentAnchoring) -> Regex<Self.Output> {
        switch anchoring {
        case .start:
            return StartAnchored(self).regex
        case .end:
            return EndAnchored(self).regex
        case .unanchored:
            return Unanchored(self).regex
        }
    }
}
Extra: StartAnchored & EndAnchored Regex Component Definitions
public struct StartAnchored<Pattern>: RegexComponent where Pattern : RegexComponent {
    public let regex: Regex<Pattern.Output>
    
    public init(_ pattern: Pattern) {
        self.regex = Regex {
            pattern
            ZeroOrMore(.any)
        }
    }
}
public struct EndAnchored<Pattern>: RegexComponent where Pattern : RegexComponent {
    public let regex: Regex<Pattern.Output>
    
    public init(_ pattern: Pattern) {
        self.regex = Regex {
            ZeroOrMore(.any)
            pattern
        }
    }
}

contains and starts(with:) have existing analogues and it makes sense to extend them for regex. The ??? entry is the ~= overload as the analogue for == (or elementsEqual)

As you point out, for direct in-line literals either can be explicitly chosen. The string being switched over is often unknown and inputs could have any shape. I think whole-match is the most useful and intuitive semantics here (and consistent with e.g. switching over an array of characters). A match-anywhere is surprising as it's easy for a regex to accidentally have some match somewhere inside an unknown or untested input.

Match-anywhere should probably be opt-in and I am interested in making opting-in more ergonomic.

2 Likes

I think you'd need to work around the lack of variadic generics by generating a ton of overloads like we're doing. We definitely want a better way, and that's future work at this point.

I believe @nnnnnnnn was looking into something similar, any guidance?

1 Like

I'm kind of wary of whether it's relevant to discuss it here, but I'm eyeing this addition to Swift as a small cleanup to Swift-Colab. I currently use the Python re library for parsing magic commands, but I strive to minimize dependencies on Python libraries and stringently enforce code cleanliness. I don't have much experience with regular expressions, but I did modify one expression from the original swift-jupyter using knowledge from a Wikipedia article. Could anybody show what these commands would translate to in the proposed Swift DSL?

Example 1 (source file: PreprocessAndExecute.swift):

let systemRegularExpression = ###"""
  ^\s*%system (.*)$
  """###
let systemMatch = re.match(systemRegularExpression, line)
guard systemMatch == Python.None else {
  let restOfLine = String(systemMatch.group(1))!
  try executeSystemCommand(restOfLine: restOfLine)
  return ""
}

Example 2 - more complex in that it calls processInstallDirective (code not shown, but present in ProcessInstalls.swift) to scan various magic commands that start with %install-:

let installRegularExpression = ###"""
  ^\s*%install
  """###
let installMatch = re.match(installRegularExpression, line)
  if installMatch != Python.None {
    var isValidDirective = false
    try processInstallDirective(
      line: line, lineIndex: lineIndex, isValidDirective: &isValidDirective)
    if isValidDirective {
      return ""
    } else {
      // This was not a valid %install-XXX command. Continue through regular 
      // processing and let the Swift parser throw an error.
    }
  }
This may be even more off-topic (maybe move it to an issue on Swift-Colab)

Would the Regex DSL help in any way with the monstrosity below? I have to do a ton of workarounds just to locate a $clear flag that resets the SwiftPM flags, leaving only what appears to the right of it. Perhaps, someone could at least suggest a near-term solution using the Python re library.

Actual source code location: ProcessInstalls.swift

// %install-swiftpm-flags

fileprivate var swiftPMFlags: [String] = []

fileprivate func processSwiftPMFlags(
  restOfLine: String, lineIndex: Int
) throws {
  // Nobody is going to type this literal into their Colab notebook.
  let id = "$SWIFT_COLAB_sHcpmxAcqC7eHlgD"
  var processedLine: String
  do {
    processedLine = String(try string.Template(restOfLine).substitute.throwing
      .dynamicallyCall(withArguments: [
        "clear": id
      ])
    )!
  } catch {
    throw handleTemplateError(error, lineIndex: lineIndex)
  }
  
  // Ensure that only everything after the last "$clear" flag passes into shlex.
  let reversedID = String(id.reversed())
  let reversedLine = String(processedLine.reversed())
  if let idRange = reversedLine.range(of: reversedID) {
    let endRange = reversedLine.startIndex..<idRange.lowerBound
    processedLine = String(reversedLine[endRange].reversed())
    swiftPMFlags = []
  }
  
  let flags = shlex[dynamicMember: "split"](processedLine)
  swiftPMFlags += [String](flags)!
  
  sendStdout("\(swiftPMFlags)")
}

I think it would be unfortunate if the only way to spell this operation would be with the ~= operator. The (only?) other override of ~= that I could find in the stdlib is for Range, where it is effectively an alias for .contains(...). Would it make sense to add a Regex analogue of elementsEqual?

extension BidirectionalCollection where SubSequence == Substring {
  public func elementsEqual(_ other: some RegexComponent) -> Bool 
}

I'll admit that the naming is much less clear than for contains(_:) and starts(with:), but it is at least consistent. ~= could then be an alias for this.

Yours and @Karl 's arguments have convinced me that this is the right choice :slight_smile: .

:+1:

It would definitely be nice to make abstracting over composing regexes easier. With the current proposals, it's possible to compose regexes (which is already a huge improvement over what's available elsewhere). As I discovered above, putting this composition into a function is much harder! Your idea for regex literal interpolation looks like it might help with this?

2 Likes

If you can come up with a good name I'm all :ear:s. matches(_:) comes to mind, but that's already used to get the collection of all non-overlapping matches. Another could be isMatched(by:).

The ~= is a little hard to discover, but it does work fairly well after becoming familiar with it:

let x = myRegex ~= string ? foo() : bar()

And this is equivalent to

let x = string.wholeMatch(of: myRegex) != nil ? foo() : bar()

You'd be able to interpolate any RegexComponent, which would bring it to parity with the builder DSL in that respect. Beyond the convenience of not having to split the literal, it doesn't add too much more composition capabilities. It might allow you interpolate in between a capture and a backreference, but I suspect that isn't a super-common need (there's an argument to be made that the parser should reject certain forms of that anyways if an interpolation changes e.g. relative capture numbering).