Incremental syntax parsing

I have recently been thinking about how to add incremental syntax parsing to the Swift compiler. You can find an up-to-date version of the proposal here and inline below. Feedback is highly welcomed.

Incremental Parsing Proposal

Swift incremental syntax parsing

This proposal aims to add an incremental syntax parsing mode to the Swift compiler that takes a syntax tree of a pre-edit file, a post-edit file, and a diff between the pre-edit and the post-edit file as its input. It then generates the libSyntax tree of the modified file based on this information.

Motivation

In order to be able to use libSyntax for syntax colouring in IDEs, libSyntax needs to be able to rapidly regenerate the syntax tree based on minor edits. A complete reparse of the file is not acceptable. This proposal adds an incremental syntax parsing mode to the compiler that is able to rearrange on old syntax tree based on minor edits that have been performed on the source file.

The goal is that this incremental compilation is (nearly) in O(1) in terms of the size of the original source file so that the delay for regenerating syntax colouring after a single edit is independent of the file's size.

Proposed solution

Overview

The basic idea for incremental compilation is that instead of reparsing the entire file, nodes in the tree that have not changed will be reused from the old syntax tree. In doing so the lexer will fast forward over this reused region.

We will consider the old syntax tree as a cache in which nodes can be looked up by source location. Nodes can be reused for the incremental syntax parse if:

  • They occur at exactly the same position in the pre-edit and post-edit source file after taking the characters removed and inserted are taken into account
  • The node to be reused has the same SyntaxKind as the node that is expected from the parser in the incremental parse
  • The range this node takes up in the source code does not contain any edits
  • No edits are inside the next node's leading trivia
    • This is because new text can attach to a previous node, e.g. () attaches to foo to make it a call expression

Note that because the nodes in the syntax tree are ordered by source location, for any given source location a tree search can be performed to find the node starting at any given source location. This search can be even further optimised when taking into account that the parser will usually request cache entries that have a similar source location.

When the parser finds a reusable node in the cache, it inserts it and all of its children into the current syntax tree. Furthermore, it advances the buffer pointer of the lexer by the number of bytes that node takes up when spelt out in the source code. This text length of the node will either be stored in the serialised syntax tree or calculated on-the-fly when deserialising it.

Implementation

Add a new class SytnaxParsingCache that owns the old syntax tree and knows about all the edits that have been performed since that syntax tree was generated. For any given source location and syntax kind, the parser can ask the cache if it has a node of this kind at that (adjusted to the edits) source location which is reusable as described above.

This SyntaxParsingCache will be added as a member to SyntaxParsingContext and a reference to it gets passed down to all child contexts.

The SyntaxParsingContext can after its creation be asked to look up if a reusable cache entry exists for it. If this is the case, it will insert the cached node into the storage and return the number of bytes the reused node takes up spelt out in source. On the call-side, parsing must take an early successful exit and advance the lexer buffer by the returned number of bytes.

Initially, the cache lookup and early exit behaviour will be implemented for the dominating syntax collections CodeBlockList and MemberDeclList. Addition of further nodes of kind SyntaxCollection should be simple and straightforward.

Other changes required

There currently exist places in the parser where the generated syntax tree depends on the environment in which the syntax is used. For example in

struct Foo {
  init() {
    init() {
    }
  }
}

the outer initialiser parses as an initialiser declaration while the inner init parses as an unknown node because syntactically self. is missing to make it an initialiser call with a trailing closure. We do not currently know of any ambiguities in which a spelling is correct in two different environments but needs to be parsed differently in both environments. We thus propose to make the incorrect case parse into the same syntax tree as the correct case.

Further improvements

Implementing the cache lookup for all syntax collections should be simple and allow for more fine-graded reuse than just on the declaration or expression level by e.g. reusing expressions used as function arguments or parts of a complex sequence expression.

It may also be desirable to investigate reusing nodes that are not part of a syntax collection, e.g. the attributes and parameter list of a function declaration in which only the name was changed. The model does not have any limitations that would disallow such a reuse.

If an item of a syntax collection is requested and the cache notices that it is followed by more than one reusable item, it may add all of these items to the storage and tell the parser to fast-forward the lexer's buffer by their combined length. This should significantly speed up the parsing of long unnested stretches of code in which only a single item is modified.

For actual use, the serialisation and deserialisation of the syntax tree into its JSON representation will have an unacceptably high cost. Thus incremental parsing should be incorporated into the SourceKit service that keeps the latest syntax tree in memory.

Instead of taking the post-edit file and a diff, the parser could also accept the pre-edit file and generate the post-edit file using the changes on-the-fly, depending on the client's needs.

Testing

Case specific tests

I propose writing a test utility that supports special markup to generate a pre-edited and a post-edited source file, e.g. by being able to insert the edit markup <<<pre|||post>>> into the test file where the entire markup gets replaced by pre in the pre-edit file and post for the post-edit file.

Tests can then be written that take the pre-edit file, parse it, generate the syntax tree of the post-edit file using incremental parsing and compare that syntax tree to the one generated if the post-edit file is parsed from scratch. Any deviations will be test failures.

Multiple test cases can be written in a single test file by specifying the test case in the edit markup, e.g. as <<test_case<pre|||post>>>. All edit markups that are not currently run get replaced by their respective pre text.

To test that certain regions of the test file are reused during incremental parsing, I am proposing to add markup like <REUSE test_case> and </REUSE test_case>. When running the test, all of these markers will be removed and the test will fail if any character within these markers is not reused from the old syntax tree.

Performance test

To measure performance consistently, we want to gather metrics like the number of tokens lexed during incremental compilation. The performance of these metrics can then be monitored independently of the workload of the testing machine.

The scaling behaviour of incremental parsing can also easily be measured by generating a series of test cases with increasing size using gyb.

Fuzzer tests

It should be fairly simple to write a fuzzer tool that takes a (not necessarily correct) Swift program (e.g. from the Swift test suite) performs a set of edits and undoes them (potentially in a different order). After every change incremental parsing is invoked based on the previous state and we can test that the final syntax tree is the same as the one we started with.

There is also a work-in-progress implementation that can be found here: https://github.com/apple/swift/pull/16340

9 Likes