I've been having a go at adding CustomConsumingRegexComponent
support to WebURL, but I'm having trouble figuring out how much of the input string to consume.
If I try the regex <(.+)>
against the string "<hello>"
, it matches, and the value of the capture group is "hello". Similarly, I can use (.+),(.+)
to match "hello,world"
, and the resulting captures are "hello" and "world". Bearing in mind that I'm not a regex expert, it appears that, even though the .+
groups could match my delimiters (>
and ,
), the regex engine knows to stop consuming there, because it sees what comes next. My understanding is that this is due to backtracking - after making the first capture, it looks at the rest, sees it doesn't match, and reduces the first capture until it matches as many elements of the pattern as possible (or something like that).
But custom components are different - the API gives them a location, and tells them to eat as much as they like. And so, if I try the analog of my previous regex (.+),(.+)
, but with a custom component:
"http://foo/abc,http://bar/def".firstMatch(of: Regex {
Capture(.webURL)
","
Capture(.webURL)
})
// ❌ nil
What is happening here is that commas are allowed in URLs (just like they may be captured by a (.+)
group), so the first capture consumes the entire string. Once the engine sees other patterns after that first capture, it advances the start location rather than backtracking by reducing the amount consumed.
I'm not sure why it does that. Is it intentional?
Log of regions the component is asked to match in
http://foo/abc,http://bar/def
ttp://foo/abc,http://bar/def
tp://foo/abc,http://bar/def
p://foo/abc,http://bar/def
://foo/abc,http://bar/def
//foo/abc,http://bar/def
/foo/abc,http://bar/def
foo/abc,http://bar/def
oo/abc,http://bar/def
o/abc,http://bar/def
/abc,http://bar/def
abc,http://bar/def
bc,http://bar/def
c,http://bar/def
,http://bar/def
http://bar/def
ttp://bar/def
tp://bar/def
p://bar/def
://bar/def
//bar/def
/bar/def
bar/def
ar/def
r/def
/def
def
ef
f
We can get that backtracking behaviour back by using TryCapture
instead, because it doesn't have the same consuming style of API:
"http://foo/abc,http://bar/def".firstMatch(of: Regex {
TryCapture {
OneOrMore(.any)
} transform: {
WebURL($0)
}
","
TryCapture {
OneOrMore(.any)
} transform: {
WebURL($0)
}
})
// ✅ [http://foo/abc, http://bar/def]
This makes many, many more attempts to capture, so it is important that the transform
builds in fast-paths to quickly reject invalid input. It's a bit different to most parsers, because generally you assume input is valid and prioritise that over the failure case, but for this use-case you actually expect most inputs to be invalid.
Given all of that, I have a couple of questions:
-
Is the lack of backtracking likely to be a surprise to Regex users? Is it intentional?
-
Are there any recommendations about how to surface this? Currently I'm thinking the component will need to accept some kind of parameter to find the end of the URL (e.g. another regex):
Regex { Capture(.webURL(until: ",")) "," Capture(.webURL) }
Clearly, this seems less than ideal. I can build in some reasonable defaults (e.g. URLs are unlikely to include spaces or newlines), but regexes are often used to parse poorly-formatted input data, so customisation is important and this way seems to be a bit repetitive.
-
Is it possible that we could (one day) introduce a variant regex component which acts more like
TryCapture
? Perhaps the component could guide backtracking in some way for better performance?
That said, I really like that we've got some kind of formal type to represent parsers. It can make a lot of processing much easier:
// Using default delimiters (spaces, quotes, angle brackets, etc)
#"hello http://🦆.de world! <ftp://x.com:22> and then "ssh://karl@localhost""#
.matches(of: .webURL).map { $0.output }
// [http://xn--5s9h.de/, ftp://x.com:22/, ssh://karl@localhost]
// Find all URLs on a webpage
let webpage = try! String(contentsOf: WebURL("https://en.wikipedia.org/wiki/Swift_(programming_language)")!)
for match in webpage.matches(of: .webURL) {
print("-", match.output)
}
- https://creativecommons.org/licenses/by-sa/3.0/
- https://en.wikipedia.org/wiki/Swift_(programming_language)
- https://www.wikidata.org/wiki/Q17118377?uselang=en#P348
- https://www.swift.org/
- https://developer.apple.com/swift/
- https://swift.org/
- https://github.com/apple/swift
...etc (179 URLs)