Best practices for CustomConsumingRegexComponent?

I've been having a go at adding CustomConsumingRegexComponent support to WebURL, but I'm having trouble figuring out how much of the input string to consume.

If I try the regex <(.+)> against the string "<hello>", it matches, and the value of the capture group is "hello". Similarly, I can use (.+),(.+) to match "hello,world", and the resulting captures are "hello" and "world". Bearing in mind that I'm not a regex expert, it appears that even though the .+ groups could match my delimiters (> and ,), the regex engine knows to stop consuming there because it sees what comes next. My understanding is that this is due to backtracking: after making the first capture as large as possible, the engine checks the rest of the pattern, sees that it doesn't match, and shrinks the first capture until the remaining pattern can match (or something like that).
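That behaviour is easy to check directly (a quick sketch using Swift 5.7's extended regex literals):

```swift
// The greedy (.+) could swallow the "," itself, but the engine backtracks
// so that the rest of the pattern can also match.
let pair = "hello,world".firstMatch(of: #/(.+),(.+)/#)!
print(pair.1, pair.2) // hello world

// Same idea with the angle-bracket example: the capture stops before ">".
let angled = "<hello>".firstMatch(of: #/<(.+)>/#)!
print(angled.1) // hello
```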

But custom components are different - the API gives them a location, and tells them to eat as much as they like. And so, if I try the analog of my previous regex (.+),(.+), but with a custom component:

"http://foo/abc,http://bar/def".firstMatch(of: Regex {
  Capture(.webURL)
  ","
  Capture(.webURL)
})
// ❌ nil

What is happening here is that commas are allowed in URLs (just as they may be matched by a (.+) group), so the first capture consumes the entire string. When the patterns after that first capture then fail to match, the engine advances the start location rather than backtracking by reducing the amount consumed.

I'm not sure why it does that. Is it intentional?

Log of the regions the component is asked to match in:
http://foo/abc,http://bar/def
ttp://foo/abc,http://bar/def
tp://foo/abc,http://bar/def
p://foo/abc,http://bar/def
://foo/abc,http://bar/def
//foo/abc,http://bar/def
/foo/abc,http://bar/def
foo/abc,http://bar/def
oo/abc,http://bar/def
o/abc,http://bar/def
/abc,http://bar/def
abc,http://bar/def
bc,http://bar/def
c,http://bar/def
,http://bar/def
http://bar/def
ttp://bar/def
tp://bar/def
p://bar/def
://bar/def
//bar/def
/bar/def
bar/def
ar/def
r/def
/def
def
ef
f

We can get that backtracking behaviour back by using TryCapture instead, because it doesn't have the same consuming style of API:

"http://foo/abc,http://bar/def".firstMatch(of: Regex {
  TryCapture {
    OneOrMore(.any)
  } transform: {
    WebURL($0)
  }
  ","
  TryCapture {
    OneOrMore(.any)
  } transform: {
    WebURL($0)
  }
})
//  ✅ [http://foo/abc, http://bar/def]

This makes many, many more attempts to capture, so it is important that the transform builds in fast paths to quickly reject invalid input. That is a bit different from most parsers: generally you assume input is valid and prioritise that over the failure case, but for this use-case you actually expect most candidate inputs to be invalid.
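As a sketch of that fast-path idea - here `fullParse` is a stand-in for an expensive parser like WebURL's, not a real API:

```swift
import RegexBuilder

// Stand-in for an expensive full parse (hypothetical, for illustration).
func fullParse(_ candidate: Substring) -> String? {
  candidate.contains("://") ? String(candidate) : nil
}

let url = TryCapture {
  OneOrMore(.any)
} transform: { (candidate: Substring) -> String? in
  // Fast path: every URL needs a scheme, so anything without ":" can be
  // rejected without invoking the full parser at all.
  guard candidate.contains(":") else { return nil }
  return fullParse(candidate)
}

let match = "http://foo/abc,http://bar/def".firstMatch(of: Regex {
  url
  ","
  url
})
// match?.output.1 == "http://foo/abc", match?.output.2 == "http://bar/def"
```

The cheap guard runs on every backtracking attempt; the expensive parse only runs on candidates that survive it.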

Given all of that, I have a couple of questions:

  1. Is the lack of backtracking likely to be a surprise to Regex users? Is it intentional?

  2. Are there any recommendations about how to surface this? Currently I'm thinking the component will need to accept some kind of parameter to find the end of the URL (e.g. another regex):

    Regex {
      Capture(.webURL(until: ","))
      ","
      Capture(.webURL)
    }
    

    Clearly, this seems less than ideal. I can build in some reasonable defaults (e.g. URLs are unlikely to include spaces or newlines), but regexes are often used to parse poorly-formatted input data, so customisation is important and this way seems to be a bit repetitive.

  3. Is it possible that we could (one day) introduce a variant regex component which acts more like TryCapture? Perhaps the component could guide backtracking in some way for better performance?
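To make question 2 a bit more concrete, here's a minimal sketch of what a delimiter-aware consuming component could look like. URLUntil is entirely hypothetical, and for brevity it only scans up to the delimiter without doing any actual URL validation:

```swift
import RegexBuilder

// Hypothetical sketch: consume up to (but not including) the first
// occurrence of a delimiter. A real implementation would also validate
// that the consumed region is a URL.
struct URLUntil: CustomConsumingRegexComponent {
  typealias RegexOutput = Substring
  var delimiter: Character

  func consuming(
    _ input: String, startingAt index: String.Index, in bounds: Range<String.Index>
  ) throws -> (upperBound: String.Index, output: Substring)? {
    let region = input[index..<bounds.upperBound]
    let end = region.firstIndex(of: delimiter) ?? bounds.upperBound
    guard index < end else { return nil }  // refuse to match an empty region
    return (end, input[index..<end])
  }
}

let match = "http://foo/abc,http://bar/def".firstMatch(of: Regex {
  Capture(URLUntil(delimiter: ","))
  ","
  Capture(OneOrMore(.any))
})
// match?.1 == "http://foo/abc", match?.2 == "http://bar/def"
```

This works for the two-URL case, but it pushes knowledge of the surrounding pattern into the component itself, which is exactly the repetition complained about above.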

That said, I really like that we've got some kind of formal type to represent parsers. It can make a lot of processing much easier:

// Using default delimiters (spaces, quotes, angle brackets, etc)

#"hello http://🦆.de world! <ftp://x.com:22> and then "ssh://karl@localhost""#
  .matches(of: .webURL).map { $0.output }
// [http://xn--5s9h.de/, ftp://x.com:22/, ssh://karl@localhost]

// Find all URLs on a webpage
let webpage = try! String(contentsOf: WebURL("https://en.wikipedia.org/wiki/Swift_(programming_language)")!)
for match in webpage.matches(of: .webURL) {
  print("-", match.output)
}

- https://creativecommons.org/licenses/by-sa/3.0/
- https://en.wikipedia.org/wiki/Swift_(programming_language)
- https://www.wikidata.org/wiki/Q17118377?uselang=en#P348
- https://www.swift.org/
- https://developer.apple.com/swift/
- https://swift.org/
- https://github.com/apple/swift
...etc (179 URLs)

what is your expected output for

"http://foo/abc,http://bar/def,http://baz/def"

?

should it be

("http://foo/abc", "http://bar/def,http://baz/def")

or

("http://foo/abc,http://bar/def", "http://baz/def")

?

One more thing:

WebURL is a normalised type, so when you construct a value from a String, the parser has a few stages:

  1. Scan the input string, marking where the URL components are in a ScannedRangesAndFlags structure.

  2. Validate the few components which need it (hostname must be valid, port must be a number, etc). Once validated, the ranges and other data are stored in a ProcessedMapping value.

  3. The mapping can be used to write the normalised URL. This is by far the most expensive step, as more of the source contents need to be interpreted and we need to correct character escaping, etc.

    We actually perform this step twice - once as a "dry run" to calculate the amount of storage to allocate, and then again to actually write the contents. This is similar to things like transcoding APIs in C, where you can specify a null pointer as the output just to calculate how much storage you need, before calling it again with a correctly-sized buffer. It avoids having to repeatedly resize the output, which can lead to quadratic complexity.
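The dry-run pattern in step 3 looks roughly like this. This is a toy writer (lowercasing ASCII and expanding "%" to "%25"), not WebURL's actual normaliser:

```swift
// Toy normaliser illustrating the two-pass technique. With sink == nil this
// is the dry run, which only counts the bytes that would be written.
func writeNormalized(_ input: String, to sink: ((UInt8) -> Void)?) -> Int {
  var count = 0
  func emit<S: Sequence>(_ bytes: S) where S.Element == UInt8 {
    for b in bytes { count += 1; sink?(b) }
  }
  for byte in input.utf8 {
    switch byte {
    case UInt8(ascii: "%"): emit("%25".utf8)
    case UInt8(ascii: "A")...UInt8(ascii: "Z"): emit([byte + 32])
    default: emit([byte])
    }
  }
  return count
}

// Pass 1: dry run to learn the exact output size.
let size = writeNormalized("HTTP://Example.com/%x", to: nil)
// Pass 2: write into storage reserved up front - no repeated resizing.
var output = [UInt8]()
output.reserveCapacity(size)
_ = writeNormalized("HTTP://Example.com/%x", to: { output.append($0) })
// String(decoding: output, as: UTF8.self) == "http://example.com/%25x"
```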

Here are some figures showing steps 3, 1, and 2: step 3 accounts for about 70% of the execution time.

But for integration with the Regex engine, we really only want to perform steps 1 and 2. That's where we determine if we can understand the String. Step 3, producing the cleaned-up result, is only necessary once the Regex engine has a complete solution to the String and we know the range that we've matched is final.

Unfortunately, if I want to do that, my CustomConsumingRegexComponent needs to return an intermediate type, and the user would need to manually call something like match.output.createURL() to get a value they can actually use.

I wonder if something like this could be incorporated in CCRC directly for a better end-user experience? We could have the consuming function return an intermediate result type which gets processed later, with a default no-op implementation. I believe this kind of protocol evolution would be ABI-compatible (everything can be defaulted), so it can happen post-5.7.

protocol CustomConsumingRegexComponent {
  associatedtype RegexOutput
  associatedtype IntermediateOutput = RegexOutput

  func consuming(
    _ input: String, startingAt index: String.Index, in bounds: Range<String.Index>
  ) throws -> (upperBound: String.Index, output: IntermediateOutput)?

  func finalizeResult(_ output: IntermediateOutput) -> RegexOutput
}

extension CustomConsumingRegexComponent where RegexOutput == IntermediateOutput {

  func finalizeResult(_ output: IntermediateOutput) -> RegexOutput {
    return output  // No-op
  }
}

My understanding is that regex components try to match as much of the string as possible, so given the input:

"http://foo/abc,http://bar/def,http://baz/def"

If you want to capture this as one gigantic URL, you can. It's a valid URL.

Regex {
  Capture(.webURL)
}
=> "http://foo/abc,http://bar/def,http://baz/def"

If you have a pattern after it, the engine starts from the previous result, then chops bits off the end until it can also satisfy those later patterns:

Regex {
  Capture(.webURL)
  ","
  Capture(.webURL)
}
=> "http://foo/abc,http://bar/def"
=> "http://baz/def"
Regex {
  Capture(.webURL)
  ","
  Capture(.webURL)
  ","
  Capture(.webURL)
}
=> "http://foo/abc"
=> "http://bar/def"
=> "http://baz/def"

I think that's how backtracking is supposed to work, but I'm not 100% sure.

For example, matching "foo,bar,baz" against (.+),(.+) results in:
=> "foo,bar"
=? "baz"

So the first component matches as much as it can.

this grammar isn’t well-defined, but it sounds like you already knew that. what you have written is sugar for

URLPair ::= 
(
    WebURL.ProcessedMapping<Commaless>, 
    Comma, 
    WebURL.ProcessedMapping<Commaless>, 
    [(Comma, WebURL.ProcessedMapping<Commaless>)]
)

and you have to do post-processing to left-coalesce the fragments. i don’t think there’s a way around that so long as your URL grammar goes left-to-right and your higher-level grammar goes right-to-left. as for the overhead of initializing WebURLs that just get thrown away anyways, i think that is just a limitation of the Regex API right now and hopefully we will get a “buildFinalResult” like we have for result builders.

maybe there are some other features in your input text that you can leverage to make this faster?

otherwise if you don’t mind me plugging my library consider porting your URL grammar to swift-grammar which is designed for libraries and allows you to order your post-processing logic in a way that doesn’t initialize objects that just get discarded by backtracking. then your API can return “cooked” WebURLs by Construction :)

won’t this produce

("http://foo/abc", "http://bar/def,http://baz/def")

?

I'm not really interested in defining grammars for URLs or strings which can include URLs; I'm interested in supporting Swift Regex specifically, where things like the extent of captures follow standard Swift Regex practice. If you have a string which includes a URL, I'd prefer as much of that context as possible be specified in the Regex.

What you're suggesting sounds more like inventing my own pattern-matching system - where I get a bunch of possible matches and other content (such as literal commas) and post-process/coalesce them. I don't think that is the same thing as integration with Swift Regex.

It seems to me that Regexes have clear rules about how large captures should be, based on the patterns which follow them. For example (regex101):

(\d+)42

5423142
└─┬─┘ 
capture

That's what I mean about the context being specified in the regex. The capture group just captures the longest string of digits it can - it has no idea that it is supposed to end at a particular pattern. The regex engine is then responsible for shortening the capture and re-trying it; the capture group itself doesn't need to care about limiting itself.
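That shortening is directly observable (Swift 5.7's extended regex literals):

```swift
// Greedy \d+ first grabs all of "5423142", then the engine shrinks it
// until the trailing literal "42" can match.
let m = "5423142".firstMatch(of: #/(\d+)42/#)!
print(m.1) // 54231
```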

--

It would, which is another reason why it may not be a great workaround.

--

I just can't find any guidance about how to create an excellent CustomConsumingRegexComponent. The documentation is rather bare, and SE-0357 doesn't elaborate very much. When I search across GitHub, the only conformances I can find are the handful of examples in the _StringProcessing test suite, which are generally very simple.

I realise that it is new and these docs are probably still being written, but SE-0357 also emphasises that components should be composable, and without clearer guidance about how to conform to the protocol, I find it hard to evaluate that composability.

I wonder if these things even can be composable without backtracking. I can't even compose a URL capture with a literal string, whereas the (\d+)42 example above composes properly because it backtracks.
