I believe I have found a missing capability in CustomConsumingRegexComponent, and I would like to confirm that I'm not simply using it wrong.
To implement CustomConsumingRegexComponent you must implement consuming(_:startingAt:in:), which returns (upperBound: String.Index, output: Self.RegexOutput)?. Notice that the tuple contains only an upperBound. There is no lowerBound and there is no Range. As far as I know, there is no way for a CustomConsumingRegexComponent to declare where the start of the match is.
Consider the following test data:
let dateStringHaystacks = [
    "2023-04-15 some other text",
    "some other text 2023-04-15",
    "some 2023-04-15 other text"
]
All three of these strings pass the following test:
@Test(arguments: dateStringHaystacks)
func dateRegex(string: String) throws {
    // Create a Regex using RegexBuilder for YYYY-MM-DD
    let dateRegex = Regex {
        Repeat(.digit, count: 4) // Matches exactly 4 digits for the year (YYYY)
        "-"
        Repeat(.digit, count: 2) // Matches exactly 2 digits for the month (MM)
        "-"
        Repeat(.digit, count: 2) // Matches exactly 2 digits for the day (DD)
    }
    let ranges = string.ranges(of: dateRegex)
    let range = try #require(ranges.first)
    let foundString = String(string[range])
    #expect(foundString == "2023-04-15")
}
which yields the following results:
// input: "2023-04-15 some other text" ✅ test passes
// input: "some other text 2023-04-15" ✅ test passes
// input: "some 2023-04-15 other text" ✅ test passes
I then implemented a CustomConsumingRegexComponent wrapper around NSDataDetector:
import Foundation
import RegexBuilder

public struct DateDataDetector: CustomConsumingRegexComponent {
    public typealias RegexOutput = Date

    public init() {}

    public func consuming(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) throws -> (upperBound: String.Index, output: Date)? {
        let detector = try NSDataDetector(types: NSTextCheckingResult.CheckingType.date.rawValue)
        // Search from the current index to the end of the bounds.
        let range = NSRange(index..<bounds.upperBound, in: input)
        guard let match = detector.firstMatch(in: input, options: [], range: range),
              let date = match.date,
              // NSRange counts UTF-16 code units, so convert it back to
              // String.Index values instead of offsetting by characters.
              let matchRange = Range(match.range, in: input) else {
            return nil
        }
        return (upperBound: matchRange.upperBound, output: date)
    }
}
This custom regex component correctly identifies dates; however, the ranges are wrong. The start of the range is always the beginning of the string, even if that is not where the match began. Furthermore, because there is no lowerBound in the tuple, I can see no way to declare the beginning of the range when a match is found.
The following tests fail, but only when the date is not at the beginning of the string. (They pass even if there is text after the match.)
@Test(arguments: dateStringHaystacks)
func rangesOf_valid(string: String) throws {
    let regex = Regex {
        DateDataDetector()
    }
    let ranges = string.ranges(of: regex)
    #expect(ranges.count == 1)
    let range = try #require(ranges.first)
    let foundString = String(string[range])
    #expect(foundString.starts(with: "2023-04-15"))
    #expect(foundString == "2023-04-15")
}

@Test(arguments: dateStringHaystacks)
func firstRangeOf_valid(string: String) throws {
    let regex = Regex {
        DateDataDetector()
    }
    let range = try #require(string.firstRange(of: regex))
    let foundString = String(string[range])
    #expect(foundString.starts(with: "2023-04-15"))
    #expect(foundString == "2023-04-15")
}

@Test(arguments: dateStringHaystacks)
func matches_valid(string: String) throws {
    let regex = Regex {
        OneOrMore(DateDataDetector())
    }
    let matches = string.matches(of: regex)
    #expect(matches.count == 1)
    let match = try #require(matches.first)
    #expect(match.output == "2023-04-15")
}
which yields the following results:
// input: "2023-04-15 some other text" ✅ test passes
// input: "some other text 2023-04-15" ❌ match.output = "some other text 2023-04-15"
// input: "some 2023-04-15 other text" ❌ match.output = "some 2023-04-15"
As you can see, the match behavior is different for RegexBuilder than it is for CustomConsumingRegexComponent. The RegexBuilder pattern reports the match as starting where the date actually begins, but the CustomConsumingRegexComponent match always starts at the beginning of the string. It's highly likely that I'm "using it wrong". However, if that is the case, I simply do not know how to tell the Regex system where the beginning of the match is. I can reliably tell it where the end of the match is (the upperBound), but there is no way to return the beginning of the match (a lowerBound).
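One possible explanation, offered as an assumption rather than confirmed behavior: the regex engine may call consuming(_:startingAt:in:) once per candidate start position and treat index itself as the start of the match. A component that searches ahead of index would then produce exactly the ranges shown above. The sketch below illustrates that reading with a hypothetical AnchoredISODate component (not part of any library, and deliberately not using NSDataDetector) that returns nil unless a YYYY-MM-DD date begins exactly at index. It assumes Swift 5.7 or later and, on Apple platforms, macOS 13 or later.

```swift
import RegexBuilder

// Hypothetical component, for illustration only: it matches "YYYY-MM-DD"
// only when the date begins exactly at `index`, and returns nil otherwise.
struct AnchoredISODate: CustomConsumingRegexComponent {
    typealias RegexOutput = Substring

    func consuming(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) throws -> (upperBound: String.Index, output: Substring)? {
        // Look at no more than 10 characters, starting at `index`.
        let candidate = input[index..<bounds.upperBound].prefix(10)
        guard candidate.count == 10 else { return nil }

        // Positions 4 and 7 must be "-"; every other position must be an ASCII digit.
        let isDate = candidate.enumerated().allSatisfy { pair in
            (pair.offset == 4 || pair.offset == 7)
                ? pair.element == "-"
                : ("0"..."9").contains(pair.element)
        }
        guard isDate else { return nil }

        // The match is assumed to start at `index`; only its end is reported.
        return (upperBound: candidate.endIndex, output: candidate)
    }
}

let regex = Regex { AnchoredISODate() }
let haystacks = [
    "2023-04-15 some other text",
    "some other text 2023-04-15",
    "some 2023-04-15 other text"
]
for haystack in haystacks {
    if let range = haystack.firstRange(of: regex) {
        print(haystack[range])
    }
}
```

If that reading is right, the equivalent change in DateDataDetector would be to return nil whenever the detector's match does not begin at index, so that the engine moves on to the next start position. I have not verified this against NSDataDetector.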