Unintuitive behavior of `RegexComponent`s in string interpolation

String interpolation is used everywhere in Swift, so of course we want to use it in the Regex world too. Consider the following snippet:

@RegexComponentBuilder
func makeSimpleXMLTagRegex(name: String) -> some RegexComponent<Substring> {
  "<\(name)>"
}

It works as expected: for any tag name passed in, the function returns a string in the correct format. And since the return type is some RegexComponent<Substring>, the compiler ensures the result can only be used as a regex component, even though the underlying type is String.

Now we want to make it more generic by allowing the tag name itself to be matched by a Regex. With one simple change, we get the following:

@RegexComponentBuilder
func makeSimpleXMLTagRegex(name: some RegexComponent<Substring>) -> some RegexComponent<Substring> {
  "<\(name)>"
}

It seems to work: the snippet compiles, and existing callers are unaffected because String is itself a RegexComponent<Substring>. But passing a regex doesn't match:

"<td>".wholeMatch(of: makeSimpleXMLTagRegex(name: #/td|tr/#)) // nil

So why? What happened?

If we break down the function, we find that it returns a String for every input. If name is a String or Substring, it's interpolated as it would be anywhere else. But if we pass in a Regex or another RegexComponent, we get something like:

"<Regex<Substring>(program: _StringProcessing.Regex<Swift.Substring>.Program)>"

which matches only that literal text, far from what we expected.
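To see this for yourself, here is a minimal sketch, assuming Swift 5.7 or later (the variable names are only for illustration). The exact text printed is an implementation detail and may differ between toolchains, so the only check is that the result is not the pattern source.

```swift
// A regex literal; its type is Regex<Substring>.
let name = #/td|tr/#

// Interpolating it yields a plain String that holds a description of the
// Regex value (as shown above), not a new regex.
let interpolated = "<\(name)>"
print(interpolated)

// The result is not the pattern source wrapped in angle brackets.
print(interpolated == "<td|tr>")
```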


The fix is easy. We can break it down into pieces (much like how AttributedString works under the hood), and let name be freestanding instead of interpolated.

@RegexComponentBuilder
func makeSimpleXMLTagRegex(name: some RegexComponent<Substring>) -> some RegexComponent<Substring> {
  "<"
  name
  ">"
}
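As a quick check of the freestanding version, here is a sketch that repeats the function above and runs it against a couple of inputs. It assumes Swift 5.7 or later with RegexBuilder available; the names tag, matched, and rejected are just for this example.

```swift
import RegexBuilder

@RegexComponentBuilder
func makeSimpleXMLTagRegex(name: some RegexComponent<Substring>) -> some RegexComponent<Substring> {
  "<"
  name
  ">"
}

// The regex argument is now composed as a component instead of being
// turned into text, so the alternation stays scoped to the tag name.
let tag = makeSimpleXMLTagRegex(name: #/td|tr/#)

let matched = "<td>".wholeMatch(of: tag) != nil   // "td" is one of the alternatives
let rejected = "<th>".wholeMatch(of: tag) == nil  // "th" is not
print(matched, rejected)
```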

What's worrying, however, is how easy it is to write the wrong version: it is simpler, looks natural, and, most importantly, doesn't trigger any warnings or errors. It even works in some cases (when name is a String), which means such error-prone code can survive unit testing.

What can we do? It feels like interpolating a RegexComponent into a String is a bad pattern that the compiler should warn about, but this may require going through Evolution.

I don't know, I feel like it should be obvious that interpolating a Regex into a string just creates a regular string. For it to work properly it would have to create a regex. But string interpolation is supposed to work for any type, and is basically the equivalent of:

"<" + String(describing: name) + ">"

and when you think about it that way, of course it doesn't work! For something like this to work, regex literals themselves would have to allow interpolation (hypothetical syntax, not valid Swift today):

/<\(name)>/ as Regex

One of the key things about string interpolation is that it can't change the underlying type of the literal, so it wouldn't be possible to have the type of a string literal transform into a Regex when you interpolate a regex.

Instances of any type can be used in string interpolation. What you're seeing is that Regex doesn't implement any custom string conversions (e.g. CustomStringConvertible, CustomDebugStringConvertible), so it gets the default String(describing: …) treatment, which falls back to reflection to print a minimal description of the value.
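Here is a small sketch of that fallback, using two made-up types (PlainTag and DescribedTag are hypothetical and exist only for this example). One has no custom string conversion; the other conforms to CustomStringConvertible.

```swift
// No custom string conversion: interpolation falls back to the same
// reflection-based text that String(describing:) produces.
struct PlainTag {
  let name = "td"
}

// Conforming to CustomStringConvertible lets the type control its text.
struct DescribedTag: CustomStringConvertible {
  let name = "td"
  var description: String { name }
}

let plain = "<\(PlainTag())>"
let described = "<\(DescribedTag())>"

// Interpolation and String(describing:) agree for the plain type.
print(plain == "<" + String(describing: PlainTag()) + ">")
print(described)  // prints "<td>"
```

Regex behaves like PlainTag here, which is why the interpolated result is a description of the value rather than anything usable as a pattern.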

In fact there's no way to retrieve a regex string from Regex. So it's impossible to compose a regex string from Regex subpieces.
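If the pieces are kept as pattern strings rather than Regex values, composing by interpolation does work, because a regex can be built from a string at run time with the throwing Regex(_:) initializer. A minimal sketch, assuming Swift 5.7 or later; makeTagRegex and its namePattern parameter are hypothetical names, and the pattern is only validated at run time, not by the compiler.

```swift
// Hypothetical helper: takes the tag-name pattern as regex source text.
func makeTagRegex(namePattern: String) throws -> Regex<AnyRegexOutput> {
  // The non-capturing group keeps an alternation such as "td|tr"
  // scoped to the tag name instead of splitting the whole pattern.
  try Regex("<(?:\(namePattern))>")
}

if let tag = try? makeTagRegex(namePattern: "td|tr") {
  print("<td>".wholeMatch(of: tag) != nil)
  print("<th>".wholeMatch(of: tag) != nil)
}
```

The trade-off is that the caller must pass regex source text, so a literal tag name containing metacharacters would need escaping first.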

I'm not sure why string interpolation accepts things which aren't string convertible. Perhaps because it's used so often in debugging and it was felt that explicitly wrapping interpolation references in String(describing: …) is too onerous? It seems like something that'd be very hard to change now since it'd be source-breaking (although conceivably it could happen in a major release, like Swift 6, where source-breaking changes are permitted).

As to why the alternative form works, I can't say, because I can't get it to compile:

error: Couldn't lookup symbols:
  _StringProcessing._RegexFactory.ignoreCapturesInTypedOutput<τ_0_0 where τ_0_0: _StringProcessing.RegexComponent>(τ_0_0) -> _StringProcessing.Regex<Swift.Substring>
  _StringProcessing._RegexFactory.ignoreCapturesInTypedOutput<τ_0_0 where τ_0_0: _StringProcessing.RegexComponent>(τ_0_0) -> _StringProcessing.Regex<Swift.Substring>
  _StringProcessing._RegexFactory.ignoreCapturesInTypedOutput<τ_0_0 where τ_0_0: _StringProcessing.RegexComponent>(τ_0_0) -> _StringProcessing.Regex<Swift.Substring>
  _StringProcessing._RegexFactory.ignoreCapturesInTypedOutput<τ_0_0 where τ_0_0: _StringProcessing.RegexComponent>(τ_0_0) -> _StringProcessing.Regex<Swift.Substring>

That's true even if I make the return type concrete, Regex<Substring>, among other variations I tried. I've used RegexBuilder before so I know this should work; no idea why it's now broken. Xcode 15 beta 5.