[Pitch] Regex Reverse Matching

JacobHearst · July 26, 2024, 1:42pm

Regex reverse matching

Proposal: SE-NNNN
Authors: Jacob Hearst Michael Ilseman
Review Manager: TBD
Status: Awaiting implementation
Bug: rdar://132158055
Implementation: Prototype
Upcoming Feature Flag: (pending)

Introduction

Regex supports matching strings forwards, including lookahead assertions, but does not currently support matching in reverse or lookbehind assertions. We propose adding these.

Motivation

Modern regular expression engines support lookbehind assertions, whether fixed length (Perl, PCRE2, Python, Java) or arbitrary length (.NET, Javascript).

Proposed solution

We propose supporting arbitrary-length lookbehind regexes which can be achieved by performing matching in reverse. We also propose API to run a regex in reverse from the end of a string.

A regex that matches a string going forwards will also match going in reverse, but may produce a different match. For example, in a regex that has multiple eager quantifications:

"abcdefg".wholeMatch(of: /(.+)(.+)/)
// Produces ("abcdefg", "abcdef", "g")

"abcdefg".wholeReverseMatch(of: /(.+)(.+)/)
// Produces ("abcdefg", "a", "bcdefg")

Detailed design

Syntax

Lookbehind assertion syntax is already supported in the existing Regex syntax.

The engine is currently incapable of running them, so a compilation error is thrown:

let regex = /(?<=a)b/
// error: Cannot parse regular expression: lookbehind is not currently supported

With this proposal, this restriction is lifted and the following syntactic forms will be accepted:

// Positive lookbehind
/a(?<=b)c/
/a(*plb:b)c/
/a(*positive_lookbehind:b)c/

// Negative lookbehind
/a(?<!b)c/
/a(*nlb:b)c/
/a(*negative_lookbehind:b)c/

Regex builders

This proposal adds support for both positive and negative lookbehind assertions when using the Regex builder, for example:

// Positive Lookbehind
Regex {
  "a"
  Lookbehind { "b" }
  "c"
}

// Negative lookbehind
Regex {
  "a"
  NegativeLookbehind { "b" }
  "c"
}

API

extension Regex {
  /// Returns the first match for this regex found in the given string when matching in reverse.
  public func firstReverseMatch(in string: String) throws -> Regex<Output>.Match?
  /// Returns the first match for this regex found in the given string when matching in reverse.
  public func firstReverseMatch(in string: Substring) throws -> Regex<Output>.Match?

  /// Returns a match if this regex matches the given string in its entirety when matching in reverse.
  public func wholeReverseMatch(in string: String) throws -> Regex<Output>.Match?
  /// Returns a match if this regex matches the given string in its entirety when matching in reverse.
  public func wholeReverseMatch(in string: Substring) throws -> Regex<Output>.Match?

  /// Returns a match if this string is matched by the given regex (matching in reverse) at its end.
  public func suffixMatch(in string: String) throws -> Regex<Output>.Match?
  /// Returns a match if this string is matched by the given regex (matching in reverse) at its end.
  public func suffixMatch(in string: Substring) throws -> Regex<Output>.Match?
}

extension BidirectionalCollection where SubSequence == Substring {
  /// Returns the first match of the specified regex within the collection when matching in reverse.
  public func firstReverseMatch<Output>(of r: some RegexComponent<Output>) -> Regex<Output>.Match?
  /// Returns the first match of the specified regex within the collection when matching in reverse.
  public func firstReverseMatch<Output>(@RegexComponentBuilder of content: () -> some RegexComponent<Output>) -> Regex<Output>.Match?

  /// Returns a match if this string is matched by the given regex in its entirety when matching in reverse.
  public func wholeReverseMatch<R: RegexComponent>(of regex: R) -> Regex<R.RegexOutput>.Match?
  /// Returns a match if this string is matched by the given regex in its entirety when matching in reverse.
  public func wholeReverseMatch<Output>(@RegexComponentBuilder of content: () -> some RegexComponent<Output>) -> Regex<Output>.Match?

  /// Matches part of the regex, starting at the end.
  public func suffixMatch<R: RegexComponent>(of regex: R) -> Regex<R.RegexOutput>.Match?
  /// Matches part of the regex, starting at the end.
  public func suffixMatch<Output>(@RegexComponentBuilder of content: () -> some RegexComponent<Output>) -> Regex<Output>.Match?

  /// Returns a new collection of the same type by removing the final elements that match the given regex when matching in reverse.
  public func trimmingSuffix(_ regex: some RegexComponent) -> SubSequence
  /// Returns a new collection of the same type by removing the final elements that match the given regex when matching in reverse.
  public func trimmingSuffix(@RegexComponentBuilder _ content: () -> some RegexComponent) -> SubSequence

  /// Returns a Boolean value indicating whether the final elements of the sequence are the same as the elements in the specified regex when matching in reverse.
  public func ends(with regex: some RegexComponent) -> Bool
  /// Returns a Boolean value indicating whether the final elements of the sequence are the same as the elements in the specified regex when matching in reverse.
  public func ends(@RegexComponentBuilder with content: () -> some RegexComponent) -> Bool
}

Source compatibility

This proposal is additive and source-compatible with existing code.

ABI compatibility

This proposal is additive and ABI-compatible with existing code.

Implications on adoption

The additions described in this proposal require a new version of the standard library and runtime.

Future directions

Support PCRE's `\K`

Future work includes supporting PCRE's \K, which resets the current produced match.

Alternatives considered

Fixed length lookbehind assertions only

Fixed-length lookbehind assertions are easier to implement and retrofit onto existing engines. Python only supports a single fixed-width concatenation sequence, PCRE2 additionally supports alternations of fixed-width concatenations, and Java additionally supports bounded quantifications within.

However, this would limit Swift's expressivity compared to Javascript and .NET, as well as be insufficient for reverse matching API.

Using the word "last" in API names

Our proposed reverse matching APIs use the word "reverse" to denote the regex is running in reverse from the end of the string. An alternative name to firstReverseMatch could be lastMatch. We rejected lastMatch because reverse matching doesn't necessarily produce the same match as str.matches(of: regex).last.

Acknowledgments

cherrycoke, bjhomer, Simulacroton, and rnantes provided use cases and rationale for lookbehind assertions.

Loubear · July 26, 2024, 4:46pm

This is an excellent addition. While lookbehind is not as common as lookahead, when it IS needed it is usually quite helpful.

bjhomer · July 26, 2024, 7:13pm

From what I can see, the development of reverse matching was necessary in order to implement lookbehinds, and the public API for is it being added because… since we have it already implemented, why not?

Are there any specific use cases for something like wholeReverseMatch, though? I can imagine use cases for suffixMatch, but I'm not sure when I would choose to use wholeReverseMatch over wholeMatch.

JacobHearst · July 26, 2024, 11:15pm

Honestly I can't think of a single practical use case. I was going for completeness in terms of parity where I thought it made sense and in hindsight, this one doesn't make much sense .

xwu · July 27, 2024, 1:53am

I'll admit that I'm not up on the scenarios in which lastMatch doesn't produce the same result as matches(...).last, but would note same issue apply to suffixMatch and matches(...).suffix and so on?

nikolai.ruhe · July 27, 2024, 11:51am

+1 for the added functionality.

Just a minor thing I would add to documentation: Stating that look behind (and look ahead) assertions are zero-width matches makes their semantics much clearer in my opinion.

Most patterns match some part of the string and matching characters are part of the .0 output. This is different for zero-width assertions.

ksluder · July 27, 2024, 4:24pm

What does "abc".wholeMatch(of: /a(?<=b)c/) return? I can imagine the following plausible results:

("abc", "a", "", "c")
("ac", "a", "", "c")
("abc", "a", "c")
("ac", "a", "c")

Michael_Ilseman · July 28, 2024, 3:39pm

"abc".firstMatch(of: /a(?<=b)c/)   // no match
"abc".firstMatch(of: /a(?<=a)bc/)  // matches "abc"
"abc".firstMatch(of: /a(?<=.)./)   // matches "ab"
"abc".firstMatch(of: /ab(?<=a)c/)  // no match
"abc".firstMatch(of: /ab(?<=.a)c/) // no match
"abc".firstMatch(of: /ab(?<=a.)c/) // matches "abc"

pyrtsa · July 28, 2024, 7:50pm

As Michael points out above, the particular regex /a(?<=b)c/ can never match any input string, because requiring a letter b to exist right before the position after a matched letter a is nonsensical. But let me try and respond to Kyle's question with one answer and another question (or two):

First, lookaheads and lookbehinds don't generally form capture groups in regex libraries, and AFAICT that's the case in Swift as well (right?). So with that in mind, a regex like /a(?<=.)./ returns no other capture groups than the whole match itself.

But now, my understanding is also that if you wrapped the lookbehind (or lookahead) in another pair of parens, say like /a((?<=.))./, then the match would return one capture group .1 besides the whole match, but that capture group would always be the zero-width substring positioned right after the initially matched letter a. (Correct me if I'm wrong!)

Finally, the question: If all of the above is true, then is it valid to form capture groups inside lookbehinds as in /a(?<=(.))./ and do they then become nonempty substrings?

Michael_Ilseman · July 29, 2024, 2:50pm

These are great questions!

"abc".firstMatch(of: /a((?<=(.)))b/)
// ("ab", "", "a")

rnantes · July 29, 2024, 2:52pm

Big +1 for this, functionality is sorely missed.

pyrtsa · July 29, 2024, 3:57pm

Thanks, this is precisely as I expected it to be! One interesting consequence of it is that the matched substring .0 doesn't necessarily contain all of the substrings of the capture groups, as in:

"abc".firstMatch(of: /((?<=(.)))bc/)
// ("bc", "", "a")

Surprising as it may be, I think this is good.

JacobHearst · July 30, 2024, 11:59am

That's a great question unfortunately I'm also not up on the scenarios in which firstReverseMatch doesn't produce the same result as matches(...).last. @Michael_Ilseman wrote this portion of the pitch, perhaps he can shed some light on this for the both of us?

Michael_Ilseman · July 30, 2024, 4:24pm

I believe @xwu was stating that the naming rationale for having "reverse" in the API name may also apply to "suffix". Naming is hard.

Someone may only really care about matching from the end of the string, as in they want something that approximates matches(of: regex).last but without paying the price of finding all the matches from the start. There, the fact that the regex is run in reverse is necessary to how the algorithm functions, but beyond that its particulars might be irrelevant. But, they may easily get a different result and it would be nice of the API could imply this fact.

For suffix, the match is anchored at the end of the input (as prefix is anchored at the start), so there's not as direct an analogue to matches(of: regex).suffix(from: i).

We can try to find more examples where the precise match differs and look at how API in other languages make this distinction.

JacobHearst · August 3, 2024, 6:41pm

100% this, especially with the nuance between forwards and reverse matching. I think suffixMatch makes sense here as the reverse variant of prefixMatch because (at least to me) there's an implication with the name suffixMatch that you're starting at the end of the string. That could however just be because I've been so wrapped up in this code...

I think reversePrefixMatch could also be a good name and it would be closer to firstReverseMatch.

Having said that, I'm now questioning whether the reverse variant of firstMatch should be firstReverseMatch or reverseFirstMatch

Naming is hard

JacobHearst · August 3, 2024, 7:02pm

Something that was brought up in the thread that started all this (for me at least) was how reverse matching could open up the possibility of APIs like String.ends(with: RegexComponent). I forgot to include that in this draft but it seems like a good inclusion. Thoughts from others?

I think in practice it would actually be implemented on BidirectionalCollection where SubSequence == Substring

Michael_Ilseman · August 3, 2024, 9:08pm

Yes, I think that would be a great addition. For naming, the open question is rather the reverse semantics, which will be mentioned in the documentation, is important enough to also surface in the names. It's also looking like with suffix and ends(with:) would be awkward to include reverse in them.

woolsweater · August 4, 2024, 1:13am

The examples in the posts immediately following this recommendation highlight its importance. The fact that the lookbehind does not change the match position and that it is zero-width is crucial to understanding why a(?<=b) cannot match anything.

Michael_Ilseman · August 9, 2024, 3:55pm

I'm going to recommend that we strip the reverse matching API and focus only on lookbehind. I surveyed other engines and languages and reverse matching is extremely obscure. I was mistakenly under the impression that NSString.range(of:options:) supported it, but it turns out that passing both .backwards and .regularExpression isn't supported.

Because we support arbitrary regexes inside lookbehinds, we have the nuts and bolts in place to add reverse matching API in the future, if we decide to do so.

Michael_Ilseman · August 23, 2024, 10:44pm

Pitch has been updated to only propose lookbehind assertions.