Swift Regex: lookbehind

Notably missing from Swift's new regex features is a way to perform a lookbehind. Is this an intentional omission or something that will eventually arrive?

6 Likes

It is intended to be eventually supported. Can you share your use case? There are many different kinds of lookbehind, and many engines do not support the fully general case, as it can add extra algorithmic factors that aren't present with lookahead.

For example, there's lookbehind of fixed-length verbatim content, lookbehind of an alternation of fixed-length verbatim content, and lookbehind of simple and reversible regexes. These can be supported with different levels of efficiency without really changing the regex's algorithmic complexity.

But, in the most general form, a lookbehind regex component would need to attempt the regex from every starting position in the input prior to the current position. This is very different from a lookahead, which only tries from the current position.

7 Likes

A place where I just wanted to use a lookbehind was in implementing a hashtag detector that follows the Unicode specification for Hashtag Identifiers. (Who knew such a thing existed?!)

The spec includes the following:

<Hashtag-Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*

When parsing hashtags in flowing text, it is recommended that an extended Hashtag only be recognized when there is no Continue character before a Start character. For example, in “abc#def” there would be no hashtag, while there would be in “abc #def” or “abc.#def”.

One natural way to implement this would be with a negative lookbehind, verifying that there is not a <Continue> character before the <Start> character (which is generally '#').

So, this would be a single-character lookbehind.
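
As a rough sketch of what I mean, here's the pattern in a flavor that already supports lookbehind (NSRegularExpression/ICU), with \w standing in for the much broader <Continue> class:

// Simplified sketch only: \w approximates <Continue>, and '#' is the <Start> character.
let pattern = #"(?<!\w)#\w+"#
// "abc#def"  -> no match (a Continue character precedes '#')
// "abc #def" -> matches "#def"
// "abc.#def" -> matches "#def"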

7 Likes

I have regexes, written as strings with NSRegularExpression, that I'm looking to convert to Regex Builder and that require lookbehinds and negative lookbehinds. As a simple example, I'm trying to extract valid comparison operators. Without negative lookbehinds I can't filter out repeating occurrences, e.g. ===, <<, or >>= should not match, but ==, =, <=, etc. should.

This is always evaluated on short strings, so for me the performance argument doesn't carry much weight. For any API there is potential for bad performance to creep in if you use it incorrectly. NSRegularExpression supports this, so it seems a bit arbitrary to exclude it.
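
As a rough sketch of the kind of pattern I mean (simplified character classes, not my exact regex), the idea is to reject an operator when another operator character immediately precedes or follows it:

let pattern = #"(?<![=<>!])(==|!=|<=|>=|=|<|>)(?![=<>])"#
// "a == b"  -> matches "=="
// "x <= y"  -> matches "<="
// "a === b" -> no match
// "x >>= y" -> no match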

2 Likes

+1 for look-behind support

RegExp lookbehind assertions would be useful to me, especially now that Safari supports them as of iOS 16.4.

My use case is wanting to match on a full word (not just part of a word), so the preceding character must be the start of the string, a whitespace or newline, or a period or comma. However, I would prefer not to capture what precedes the first word, as this is not important to me.

// EX 1: Colder with a low of 5.
// EX 2: a low of 5 below zero.
// EX 3: high of 5. low of 6.
let lowRegex = /(?<=^|\s|\.|\,)low of (\d+|zero)( below zero)?/

Without the lookbehind assertion I would capture low zero in below zero from example 2, which was not what I intended. Alternatively, if I just decide to capture the first character, then my matches would often be prefixed by a whitespace/newline or period/comma, which I'd rather ignore, especially if I intend on replacing the match in the original string.

Perhaps a more general version of this would be something like PCRE's \K:

The escape sequence \K causes any previously matched characters not to be included in the final matched sequence. For example, the pattern:

foo\Kbar

matches "foobar", but reports that it has matched "bar".

The documentation says that \K can be used to work around the fact that lookbehinds must be fixed-length. I wonder if \K and variable-length lookbehinds have different performance implications.
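
For comparison, the lowRegex above could be sketched with \K in a PCRE-style pattern (illustrative only; the character class is a simplification of the lookbehind alternation):

// The prefix is matched but then dropped from the reported match.
let pcrePattern = #"(?:^|[\s.,])\Klow of (\d+|zero)( below zero)?"#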

Any update on this? Would there need to be a pitch to get this added to the stdlib?

Yes, it would need a pitch and then formal proposal.

I'd be interested in taking a look at this, although after reading through the _StringProcessing target of swift-experimental-string-processing I have some serious doubts about my ability to actually write an implementation.

Before reading through that target, my understanding was that Swift was using ICU under the hood as a regex engine, which led me to believe that the stdlib Regex object was simply a facade in front of that library. My interpretation of the contents of _StringProcessing.Processor, though, is that this is not the case for the more modern version of Regex that was introduced in Swift 5.7. Instead, it seems like an entirely new regex engine was created, so adding support for lookbehinds will require writing new algorithms as opposed to simply calling some functions from ICU. Is that accurate?

I spent some more time reading through the Processor and trying to understand the pipeline from the Regex DSL -> bytecode -> execution, and I think I have enough of an understanding that I could at least attempt an implementation. In the long term I think it would be better for lookbehinds to be more flexible rather than less, but maybe for an initial implementation fixed-length expressions would suffice? I suppose this is all discussion that could/should happen in the pitch, but for the sake of discussion in this thread I did a quick survey of what other popular implementations support; here's what I found:

Fixed-length (ex: abc and a|b|c are allowed but a+ and a{1,2} are not)

  • PCRE
    • Allows top-level alternatives to have different (fixed) lengths, but not alternatives nested inside a group. ex: (?<=bullock|donkey) is allowed but (?<=a(c|de)) is not
  • Python
  • Ruby
  • PHP

Unbounded variable length (ex: [abc]+ or [abc]{1,} is allowed)

  • .NET
  • JavaScript (ECMAScript 2018 and later)

Bounded variable length (ex: [abc]{1,3} is allowed but [abc]+ or [abc]{1,} are not)

  • Java
  • Perl
    • Perl allows bounded variable length expressions between 1 and 255 characters as an experimental feature as of Perl 5.30.
    • In Perl 5.35.10 the scope of this construct's experimental status was reduced, and experimental warnings are only produced when the construct contains capturing parentheses.

3 Likes

@Michael_Ilseman I was reading through the Implementation Musings document and found myself uncertain about the state of swift-experimental-string-processing. Based on the repo's README and the fact that it's checked out as part of the setup for development of the Swift toolchain itself, I felt fairly confident that this library was shipping as part of the stdlib. However, after reading this paragraph from "Frontends", I wanted to double-check that conclusion.

For string processing, this repo contains an example regex parser, compiler, and a few execution strategies. The regex parser itself serves as an example of writing parsers in Swift for simple languages. It hasn't (yet) been ported to run on the MatchingEngine, and doing so requires picking an execution strategy or strategies. It's AST can serve as the start of a simple result-builder API (tbd).

The result-builder frontend shipped in Swift 5.7 so can I assume that this document was simply written before then?

As a side note, I'm sure it's a non-trivial amount of work, but would it make sense to change the repo's name to remove the word "experimental" if this is in fact shipping with Swift?

That's right; the _StringProcessing module that's built by that package is part of the standard library (up to handwaving about what exactly "standard library" means), and contains the execution engine for Regex and other string processing operations.

1 Like

EDIT: This has turned more into a log of my thoughts and learning process as I answer the questions in "Unanswered Questions". I plan to continue updating this post with the hope that others who are interested in this topic will find my musings useful.

Bytecode is pretty foreign to me so I wanted to write up my initial thoughts/implementation plans with two goals:

  • Force myself to think more thoroughly about my design before actually starting the implementation. I often find that the simple act of writing things out accomplishes this.
  • Get feedback/guidance from some of the original authors (if they have time)

Trying to understand emitPositiveLookahead

I think there are some core similarities in execution pattern between lookaheads and lookbehinds. At a high level, they both need to:

  1. Make a save state for successful matches
  2. Make a save state to intercept failed matches
  3. Start at some position
    a. For lookahead, this is just the current position
    b. For lookbehind, this is the start position of the input
  4. Perform a non-consuming match
  5. Did the match succeed?
    Yes: Go back to the original position and continue execution
    No: Go back to the original position and fail (lookbehinds will actually just go to the position after the last one it checked and repeat the above process until it reaches the current position)

Due to these similarities I thought it would be a good idea to try and get a working knowledge of how the bytecode for positive lookaheads is generated. I've copied emitPositiveLookahead below and annotated each line with my interpretation of what it does. Lines with three slashes (///) are my comments, all others were left by the authors.

mutating func emitPositiveLookahead(_ child: DSLTree.Node) throws {
  /*
    save(restoringAt: success)
    save(restoringAt: intercept)
    <sub-pattern>    // failure restores at intercept
    clearThrough(intercept)       // remove intercept and any leftovers from <sub-pattern>
   fail(preservingCaptures: true) // ->success
  intercept:
    clearSavePoint   // remove success
    fail             // propagate failure
  success:
    ...
  */
  /// If the lookahead match fails, it'll restore cpu state to the save point created with this address.
  let intercept = builder.makeAddress() 
  /// If the lookahead match succeeds, this will cause the first `fail` instruction created below to restore to the save point created with this address.
  let success = builder.makeAddress()

  /// Build a save point with the success address. Do this first because save points are treated as a stack and we want the intercept to be popped first
  builder.buildSave(success)
  /// Build a save point with the intercept address
  builder.buildSave(intercept)
  /// Build the instructions for the expression in the lookahead
  try emitNode(child)
  /// The sub-pattern matched (a failure would have jumped to `intercept`), so remove the intercept save point along with any leftover save points the sub-pattern created
  builder.buildClearThrough(intercept)
  /// Restore to the `success` save point but don't throw away any captures that happened
  builder.buildFail(preservingCaptures: true) // Lookahead succeeds here

  /// Store this index (after the `fail` instruction that's built above) as the position the `intercept` save point should restore to
  builder.label(intercept)
  /// Clear the success save point
  builder.buildClear()
  /// Fail the match
  builder.buildFail()

  /// Store this index (after the `fail` instruction that's built above) as the position the `success` save point should restore to
  builder.label(success)
}

A simple design

I decided it would be best to start with the simplest implementation I can think of so I can build my understanding of the bytecode generation process and then work from there. With that in mind, here's the test case I want to be able to execute with this implementation:

  • Given:
    • Pattern: (?<=abc)def
    • Input: abcdef
  • Expect that:
    • abc is not consumed
    • def is consumed

Notably, the lookbehind pattern matches exactly the characters between the start of the string and the pattern we're actually matching on (def). This means the lookbehind will succeed on the first step and looping through more indices is not necessary.

// This snippet is just pseudocode following the pattern from `emitPositiveLookahead`

save(restoringAt: success)
save(restoringAt: intercept)
goToStart // This won't work if the lookbehind pattern's length is less than the distance from the start index to the current position.
<sub-pattern>    // failure restores at intercept
clearThrough(intercept)       // remove intercept and any leftovers from <sub-pattern>
fail(preservingCaptures: true) // ->success
intercept:
clearSavePoint   // remove success
fail             // propagate failure
success:
...

Unanswered questions

Questions that are rolling around my head, potential rambling ahead.

Q: Why is this function called label? To me, labeling something means assigning an object some identifier for later use but here it seems like it just tells the processor to continue to the next instruction.

// Resolves the address token to the next instruction (one past the most
// recently added one), updating prior and future address references.
mutating func label(_ t: AddressToken) {
  addressTokens[t.rawValue] =
    InstructionAddress(instructions.count)
}

A:
label isn't telling the processor to go to the next instruction; it resolves the address token so that any instruction referencing that token (like a save) points at whatever instruction gets emitted next. builder.makeAddress essentially reserves space in the MEProgram.Builder.addressTokens array by appending nil, and then the label function replaces the nil value with the end index of the MEProgram.Builder.instructions array. Because the save instructions are emitted with an addressToken, when the cpu builds a save state, it's able to restore to the instruction index stored in the addressTokens array at the addressToken index. This is a bit complicated to explain as text, so I also put together a diagram mapping out this process here.
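
To make that concrete, here's a toy model of the mechanism (heavily simplified; the real MEProgram.Builder is more involved):

struct ToyBuilder {
  var instructions: [String] = []
  // Index = address token, value = resolved instruction index (nil until labeled).
  var addressTokens: [Int?] = []

  // Reserve a slot for an address that will be resolved later.
  mutating func makeAddress() -> Int {
    addressTokens.append(nil)
    return addressTokens.count - 1
  }

  // Resolve the token to "one past the most recently added instruction",
  // i.e. wherever the next emitted instruction will land.
  mutating func label(_ token: Int) {
    addressTokens[token] = instructions.count
  }
}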

Q: Why is a failure emitted after clearing the success save point? signalFailure sets the processor's state to fail if there aren't any other save points, so wouldn't this fail the entire match? Even if it doesn't fail the entire match, the captures aren't being preserved, so how does this not throw away all the captures that have happened up to this point?

mutating func signalFailure(preservingCaptures: Bool = false) {
  guard !savePoints.isEmpty else {
    state = .fail
    return
  }
  //...
  if !preservingCaptures {
    // Reset all capture information
    storedCaptures = capEnds
  }
}

A:

  • The failure is emitted so that if the intercept save point is the one restored (caused by the lookahead match failing), it restores to that failure and ends the capture. When the fail that triggers the success save point to be restored is hit, it restores the instruction position to endCapture.
  • signalFailure doesn't throw away the already stored captures; it sets storedCaptures to the captures contained in the save point that's being restored.

Q: How can I represent iterating through each index from the start to the current position in bytecode without a ton of duplicate instructions? At time of bytecode generation I don't know the length of the input string so I guess it has to be relative? Maybe something like this?

start = input.startIndex
goto start
<execute the match>
if success:
  exit
else:
  start += 1
  goto start

Not entirely sure how that translates into opcodes though.

A:
TODO

I think this design actually doesn't make sense. Lookaheads want to continue execution at the current position because by the time a lookahead is being processed, we've already consumed the characters that are dependent on the lookahead matching.

Lookbehinds haven't consumed those characters though so creating a save point at the position of the lookbehind to restore to on success doesn't make sense. I think what actually needs to happen is that the lookbehind should restore at the end of the pattern that's immediately after it.

// Lookahead
abc(?=def)
   |-> create success save point here. If `def` are the next
   | characters, we want to continue matching the characters
   | that come after `abc`

// Lookbehind
(?<=abc)def
           |-> Create success save point here, if `abc` are the
           | preceding characters, we want to continue matching
           | the characters that come after `def`

Of course I'm not quite sure how to create a save point at a position that hasn't been reached yet...

Looking behind is fundamentally different than looking ahead, so you'll need to add new instructions to the engine.

For fixed-length lookbehinds, you'd want a move-back instruction to move the current position back. E.g. (?<=abc)def would move current position back 3 characters and then try matching. (?<=abc|d|efghi)jkl would emit an alternation, and in that alternation would be a move-back of 3, 1, and 5, respectively.
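
Roughly speaking (an illustrative sketch, not the engine's actual instruction set or types), a move-back step amounts to:

// Returns false (failing the lookbehind) if fewer than n characters precede the position.
func moveBack(_ input: String, _ position: inout String.Index, by n: Int) -> Bool {
  guard let newPosition = input.index(position, offsetBy: -n, limitedBy: input.startIndex) else {
    return false
  }
  position = newPosition
  return true
}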

A straightforward implementation of bounded variable-length lookbehinds could be to go back the max amount, then subtract one and try again, etc. More efficient approaches are a little more involved, and might be better handled as compiler optimizations. E.g., (?<=\d{1,3}-.{1,3}-\d{1,3})suffix could split the lookbehind into different segments and match each one in reverse order. But, this can be nuanced when the bounded variable-length segment's match can overlap with the segment before or after it (e.g. the .{1,3} in -.{1,3}- can also match -).
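
Sketched out (again only illustrative, with a hypothetical matches closure standing in for running the sub-pattern and requiring it to end exactly at the current position), that straightforward strategy is just a descending loop over the move-back amounts:

func boundedLookbehind(_ input: String,
                       endingAt current: String.Index,
                       min: Int, max: Int,
                       matches: (Substring) -> Bool) -> Bool {
  // Try the largest move-back first, then shrink one character at a time.
  for n in stride(from: max, through: min, by: -1) {
    if let start = input.index(current, offsetBy: -n, limitedBy: input.startIndex),
       matches(input[start..<current]) {
      return true
    }
  }
  return false
}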

If all we are able to design in a reasonable time frame is bounded variable-length lookbehinds using this straightforward implementation, that's still a massive improvement.

For fully variable length lookbehind, I'm not certain what a good implementation strategy would be. A naive approach could be to spin up a fully separate matching sub-process on the portion of the string already matched. That will require more additions to the engine. Those additions would be what would also support a String.ends(with: RegexComponent) API.

1 Like

Reading up a little more on Javascript:

/(?<=([ab]+)([bc]+))$/.exec("abc"); // ['', 'a', 'bc']
// Not ['', 'ab', 'c']

This means that they do a reverse match from the end of the prefix. This is far simpler semantically (and more efficient) than the alternative. We could probably support this by adding match-back, etc., variant instructions.

3 Likes