Is CustomConsumingRegexComponent incapable of declaring the start of a match?

I believe I have found a missing capability in CustomConsumingRegexComponent, and I would like to confirm that I'm not simply using it wrong.

To implement CustomConsumingRegexComponent you must implement consuming(_:startingAt:in:) which returns (upperBound: String.Index, output: Self.RegexOutput)?. Notice how the tuple only contains an upperBound. There is no lowerBound and there is no Range. As far as I know there is no way for a CustomConsumingRegexComponent to declare where the start of the match is.

Consider the following test data:

let dateStringHaystacks = [
  "2023-04-15 some other text",
  "some other text 2023-04-15",
  "some 2023-04-15 other text"
]

All three of these strings pass the following test:

@Test(arguments: dateStringHaystacks)
func dateRegex(string: String) throws {
  // Create a Regex using RegexBuilder for YYYY-MM-DD
  let dateRegex = Regex {
    Repeat(.digit, count: 4)  // Matches exactly 4 digits for the year (YYYY)
    "-"
    Repeat(.digit, count: 2)  // Matches exactly 2 digits for the month (MM)
    "-"
    Repeat(.digit, count: 2)  // Matches exactly 2 digits for the day (DD)
  }
  
  let ranges = string.ranges(of: dateRegex)
  let range = try #require(ranges.first)
  let foundString = String(string[range])
  #expect(foundString == "2023-04-15")
}

which yields the following results:

// input: "2023-04-15 some other text" ✅ test passes
// input: "some other text 2023-04-15" ✅ test passes
// input: "some 2023-04-15 other text" ✅ test passes

I then implemented a CustomConsumingRegexComponent wrapper around NSDataDetector:

import Foundation

public struct DateDataDetector: CustomConsumingRegexComponent {
  public typealias RegexOutput = Date
  
  public init() {}
  
  public func consuming(
    _ input: String,
    startingAt index: String.Index,
    in bounds: Range<String.Index>
  ) throws -> (upperBound: String.Index, output: Date)? {
    let detector = try NSDataDetector(types: NSTextCheckingResult.CheckingType.date.rawValue)
    let range = NSRange(index..<bounds.upperBound, in: input)
    guard let match = detector.firstMatch(in: input, options: [], range: range),
          let date = match.date else {
      return nil
    }
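    // match.range is measured in UTF-16 code units, so convert it with
    // Range(_:in:) instead of offsetting by Characters from startIndex.
    guard let matchRange = Range(match.range, in: input) else {
      return nil
    }
    return (upperBound: matchRange.upperBound, output: date)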
  }
}

This custom regex component correctly identifies dates; however, the ranges are wrong. The start of the range is always the beginning of the string, even when the match begins later. Furthermore, because there is no lowerBound in the tuple, I can see no way to declare where the match begins when one is found.

The following tests fail, but only when the date is not at the beginning of the string. (They pass even if there is text after the match.)

  @Test(arguments: dateStringHaystacks)
  func rangesOf_valid(string: String) throws {
    let regex = Regex {
      DateDataDetector()
      
    }
    let ranges = string.ranges(of: regex)
    #expect(ranges.count == 1)
    
    let range = try #require(ranges.first)
    let foundString = String(string[range])
    #expect(foundString.starts(with: "2023-04-15"))
    #expect(foundString == "2023-04-15")
  }
  
  @Test(arguments: dateStringHaystacks)
  func firstRangeOf_valid(string: String) throws {
    let regex = Regex {
      DateDataDetector()
      
    }
    let range = try #require(string.firstRange(of: regex))
    
    let foundString = String(string[range])
    #expect(foundString.starts(with: "2023-04-15"))
    #expect(foundString == "2023-04-15")
  }
  
  @Test(arguments: dateStringHaystacks)
  func matches_valid(string: String) throws {
    let regex = Regex {
      OneOrMore(DateDataDetector())
      
    }
    let matches = string.matches(of: regex)
    #expect(matches.count == 1)
    let match = try #require(matches.first)
    #expect(match.output == "2023-04-15")
  }

which yields the following results:

// input: "2023-04-15 some other text" ✅ test passes
// input: "some other text 2023-04-15" ❌ match.output = "some other text 2023-04-15"
// input: "some 2023-04-15 other text" ❌ match.output = "some 2023-04-15"

As you can see, the match behavior is different for RegexBuilder than it is for CustomConsumingRegexComponent. The RegexBuilder regex reports a range that starts where the date starts, but the CustomConsumingRegexComponent reports a range that starts at the beginning of the string, no matter what. It's highly likely that I'm "using it wrong". However, if that is the case, I simply do not know how to tell the Regex system where the beginning of the match is. I can reliably tell it where the end of the match is (the upperBound), but there is no way to return the beginning of the match (a lowerBound).


It might be a prefixMatch by design. The API doc for the protocol in the proposal says:

allowing custom types to [provide] the raw functionality backing `prefixMatch`.

But that was only in the DocC comment for the protocol.

https://github.com/swiftlang/swift-evolution/blob/main/proposals/0357-regex-string-processing-algorithms.md#customconsumingregexcomponent
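
If the protocol really is prefix-match by design, then the lower bound would implicitly be the index passed to consuming(_:startingAt:in:), and a component would be expected to return nil unless its match begins exactly there. Below is a minimal sketch of that idea, not a confirmed fix. AnchoredDateDataDetector is a hypothetical name, and the sketch assumes that the search algorithms retry the component at later indices after it returns nil. It also assumes an Apple platform where NSDataDetector is available.

```swift
import Foundation
import RegexBuilder

// Hypothetical variant of DateDataDetector (name invented for illustration).
// Assumption: the regex engine treats `index` as the implicit lower bound and
// calls `consuming` again at later indices when this returns nil.
public struct AnchoredDateDataDetector: CustomConsumingRegexComponent {
  public typealias RegexOutput = Date

  public init() {}

  public func consuming(
    _ input: String,
    startingAt index: String.Index,
    in bounds: Range<String.Index>
  ) throws -> (upperBound: String.Index, output: Date)? {
    let detector = try NSDataDetector(types: NSTextCheckingResult.CheckingType.date.rawValue)
    let searchRange = NSRange(index..<bounds.upperBound, in: input)
    guard let match = detector.firstMatch(in: input, options: [], range: searchRange),
          // Reject any date that starts later than `index`; only a match
          // beginning exactly at `index` counts as a prefix match.
          match.range.location == searchRange.location,
          let date = match.date,
          // Convert the UTF-16 based NSRange back to String.Index values.
          let matchRange = Range(match.range, in: input) else {
      return nil
    }
    return (upperBound: matchRange.upperBound, output: date)
  }
}

// Usage: search a haystack where the date is not at the start.
let haystack = "some 2023-04-15 other text"
let regex = Regex { AnchoredDateDataDetector() }
if let range = haystack.firstRange(of: regex) {
  print(haystack[range])
}
```

Under those assumptions, the reported range should no longer include the text before the date. The trade-off is that the detector is invoked once per candidate start index, which may be slow on long inputs.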

The protocol-level API comment is not included in the documentation I see in Xcode's documentation viewer, nor in the interface I see when navigating to the protocol's definition.

Frustrating/confusing!


Great find. I would like to pitch a Swift Evolution proposal to add this missing functionality, but I don't even have official confirmation that it is actually missing.