EDIT: This has turned more into a log of my thoughts and learning process as I answer the questions in "Unanswered Questions". I plan to continue updating this post with the hope that others who are interested in this topic will find my musings useful.
Bytecode is pretty foreign to me so I wanted to write up my initial thoughts/implementation plans with two goals:
- Force myself to think more thoroughly about my design before actually starting the implementation. I often find that the simple act of writing things out accomplishes this.
- Get feedback/guidance from some of the original authors (if they have time)
Trying to understand emitPositiveLookahead
I think there are some core similarities in execution pattern between lookaheads and lookbehinds. At a high level, they both need to:
- Make a save state for successful matches
- Make a save state to intercept failed matches
- Start at some position
a. For lookahead, this is just the current position
b. For lookbehind, this is the start position of the input
- Perform a non-consuming match
- Did the match succeed?
Yes: Go back to the original position and continue execution
No: Go back to the original position and fail (lookbehinds will actually just go to the position after the last one it checked and repeat the above process until it reaches the current position)
Due to these similarities I thought it would be a good idea to try and get a working knowledge of how the bytecode for positive lookaheads is generated. I've copied emitPositiveLookahead below and annotated each line with my interpretation of what it does. Lines with three slashes (///) are my comments, all others were left by the authors.
mutating func emitPositiveLookahead(_ child: DSLTree.Node) throws {
/*
save(restoringAt: success)
save(restoringAt: intercept)
<sub-pattern> // failure restores at intercept
clearThrough(intercept) // remove intercept and any leftovers from <sub-pattern>
fail(preservingCaptures: true) // ->success
intercept:
clearSavePoint // remove success
fail // propagate failure
success:
...
*/
/// If the lookahead match fails, it'll restore cpu state to the save point created with this address.
let intercept = builder.makeAddress()
/// If the lookahead match succeeds, this will cause the first `fail` instruction created below to restore to the save point created with this address.
let success = builder.makeAddress()
/// Build a save point with the success address. Do this first because save points are treated as a stack and we want the intercept to be popped first
builder.buildSave(success)
/// Build a save point with the intercept address
builder.buildSave(intercept)
/// Build the instructions for the expression in the lookahead
try emitNode(child)
/// Get rid of the intercept save point. If the lookahead failed, the cpu was restored to the intercept save point but now that the pattern has been evaluated, we can get rid of this save point.
builder.buildClearThrough(intercept)
/// Restore to the `success` save point but don't throw away any captures that happened
builder.buildFail(preservingCaptures: true) // Lookahead succeeds here
/// Store this index (after the `fail` instruction that's built above) as the position the `intercept` save point should restore to
builder.label(intercept)
/// Clear the success save point
builder.buildClear()
/// Fail the match
builder.buildFail()
/// Store this index (after the `fail` instruction that's built above) as the position the `success` save point should restore to
builder.label(success)
}
A simple design
I decided it would be best to start with the simplest implementation I can think of so I can build my understanding of the bytecode generation process and then work from there. With that in mind, here's the test case I want to be able to execute with this implementation:
- Given:
- Pattern:
(?<=abc)def
- Input:
abcdef
- Expect that:
abc is not consumed
def is consumed
Notably, the lookbehind pattern matches exactly the characters between the start of the string and the pattern we're actually matching on (def). This means the lookbehind will succeed on the first step and looping through more indices is not necessary.
// This snippet is just pseudocode following the pattern from `emitPositiveLookahead`
save(restoringAt: success)
save(restoringAt: intercept)
goToStart // This won't work if the lookbehind pattern's length is less than the distance from the start index to the current position.
<sub-pattern> // failure restores at intercept
clearThrough(intercept) // remove intercept and any leftovers from <sub-pattern>
fail(preservingCaptures: true) // ->success
intercept:
clearSavePoint // remove success
fail // propagate failure
success:
...
Unanswered questions
Questions that are rolling around my head, potential rambling ahead.
Q: Why is this function called label? To me, labeling something means assigning an object some identifier for later use but here it seems like it just tells the processor to continue to the next instruction.
// Resolves the address token to the next instruction (one past the most
// recently added one), updating prior and future address references.
mutating func label(_ t: AddressToken) {
addressTokens[t.rawValue] =
InstructionAddress(instructions.count)
}
A:
label isn't telling the processor to go to the next instruction, it's a GOTO that tells the cpu to just go to the instruction at the address you pass to label. builder.makeAddress essentially reserves space in the MEProgram.Builder.addressTokens array by appending nil and then the label function replaces the nil value with the end index of the MEProgram.Builder.instructions array. Because the save instructions are emitted with an addressToken, when the cpu builds a save state, it's able to restore to the instruction index stored in the addressTokens array at the addressToken index. This is a bit complicated to explain as text so I also put together a diagram mapping out this process here.
Q: Why is a failure emitted after clearing the success save point? signalFailure sets the processor's state to fail if there aren't any other save points so wouldn't this fail the entire match? Even if it doesn't fail the entire match, the captures aren't being preserved so how does this not throw away all the captures that have happened up to this point?
mutating func signalFailure(preservingCaptures: Bool = false) {
guard !savePoints.isEmpty else {
state = .fail
return
}
//...
if !preservingCaptures {
// Reset all capture information
storedCaptures = capEnds
}
}
A:
- The failure is emitted so that if the intercept save point is the one restored (caused by the lookahead match failing), it restores to that failure and ends the capture. When the
fail that triggers the success save point to be restored is hit, it restores the instruction position to endCapture.
signalFailures doesn't throw away the already stored captures, it sets storedCaptures to the stored captures contained in the save point that's being restored.
Q: How can I represent iterating through each index from the start to the current position in bytecode without a ton of duplicate instructions? At time of bytecode generation I don't know the length of the input string so I guess it has to be relative? Maybe something like this?
start = input.startIndex
goto start
<execute the match>
if success:
exit
else:
start += 1
goto start
Not entirely sure how that translates into opcodes though.
A:
TODO