[GSoC-2024] SwiftLexicalLookup - a new lexical name lookup library

MAJKFL · November 7, 2024, 5:56pm

Hi everyone,

I'm Jakub, and this summer during GSoC, I've worked with @Douglas_Gregor on a new swift-syntax library: SwiftLexicalLookup.

It consists of various lexical queries, answering questions like "Where is an error thrown here handled?" or "Which switch cases does this fallthrough keyword refer to?". Most importantly though, the library implements a new unqualified lookup API, which aims to replace the current ASTScope implementation in the long run.
In this brief post, I'd like to show you how to use the library, explain how it works, and mention some fun corner cases we've encountered along the way.

Unqualified lookup

Let's first briefly explain what unqualified lookup actually is. It's a query used by IDEs or the compiler to determine the availability of names at a given position in a source file. It traverses lexical scopes, gathers names, filters them by position, and returns the results to the client. Let's consider a simple example:

class Foo { // Class Decl Scope (Nominal Type Decl Scope)
  // Member Block Scope
  let someVar = 0

  func bar(a: Int, b: Int?) { // Function Decl Scope
    // Code Block Scope
    let a = 42

    if let b { // If Stmt Scope
      // Code Block Scope
      let c = 123
    
      print(b) // <-- 1. lookup here
    }

    print(d) // <-- 2. lookup here
    
    let d = 1
  }
}

For the first lookup, the result would consist of: let c, the conditional declaration b, let a, the parameters a and b, the implicit self introduced at the function boundary, and Self introduced by the nominal type declaration, in this exact order. For the second query, the result would largely stay the same, with the first two names being absent due to their nested introduction in the if statement.

It's important to note that the lookup only considers names that can be inferred from enclosing lexical scopes, so it's unable to reason about, for example, nominal type members. Handling those is part of the qualified lookup (one could say it's out of scope for this library ;)). However, this doesn't mean unqualified lookup ignores the existence of members. In fact, it can actually prompt clients where to perform such lookup. In the previous two examples, this would be indicated by a special result flag associated with class Foo.

SwiftLexicalLookup

Now let's dive into the SwiftLexicalLookup lookup API.

Usage

The API is stateless, traverses the syntax tree bottom-up, and collects names matching the lookup. It partitions results according to the scopes they were introduced in, which, for example, guides shadowing. Any syntax node can be an entry point for the query. Here's an example call to the API from an arbitrary syntax node:

someNode.lookup(myIdentifier, with: myConfig)

myIdentifier is an identifier that should be used to filter introduced (match names). It can be set to nil to skip name matching (as in the earlier example). myConfig is a LookupConfig object that allows clients to specify particular behavior during lookup (e.g., how to handle #if clauses). It can also be left out to provide default behavior.

The result is an array of LookupResult enums that represent different result kinds and flags (such as prompting where to perform member lookup). It associates names with the scopes where they were introduced. Speaking of names, they are also represented by an array of LookupName enums, which correspond to specific syntax nodes as queried names. These can be of specific kinds, such as: identifier, declaration, implicit, and more. LookupName also enables easy name extraction through LookupName.getNames(from:) which collects all syntax nodes that can be represented as a LookupName in the given syntax node.

This result data structure retains a good bit of information useful for clients and ensures flexibility for future maintenance and extensions.

Under the hood

Scopes are represented as extensions to specific syntax nodes with a new protocol, ScopeSyntax. Scopes reference each other through the parentScope property, which (by default) points to the closest scope ancestor in the syntax tree. The lookup method performs a lookup at a particular scope and passes the lookup to the parentScope. Inside the defaultIntroducedNames computed property, scopes can define names available within themselves regardless of lookup configurations or starting position (such names could include function parameters, variable declarations, and more). Here's the implementation of the for scope. Inside lookup, you can see how the API avoids names introduced at this scope based on the starting position.

@_spi(Experimental) extension ForStmtSyntax: ScopeSyntax {
  /// Names introduced in the `for` body.
  @_spi(Experimental) public var defaultIntroducedNames: [LookupName] {
    LookupName.getNames(from: pattern)
  }

  @_spi(Experimental) public var scopeDebugName: String {
    "ForStmtScope"
  }

  /// Returns results with names matching lookup.
  /// Doesn't include names introduced at this scope
  /// if lookup started inside it's `pattern` or `sequence`.
  @_spi(Experimental) public func lookup(
    _ identifier: Identifier?,
    at lookUpPosition: AbsolutePosition,
    with config: LookupConfig
  ) -> [LookupResult] {
    if pattern.range.contains(lookUpPosition) || sequence.range.contains(lookUpPosition) {
      return lookupInParent(identifier, at: lookUpPosition, with: config)
    } else {
      return defaultLookupImplementation(identifier, at: lookUpPosition, with: config)
    }
  }
}

Some scopes are more complex than others, providing shared functionality or closely collaborating with other scopes to offer more advanced behavior. Let's see an example:

func foo<A>(a: A) {
  let b = 42
  x // <-- lookup here
}

Functions and nominal types can have generic parameter clauses that themselves introduce names. To generalize this behavior and simplify lookup from inside of it, the generic parameter clause is a scope on its own, but this introduces a bit of complexity. In the syntax tree representation, it's not an ancestor of the function body, it's a sibling! That's why there are two specialized scope kinds that provide alternative lookup routing: GenericParameterScopeSyntax and WithGenericParameterScopeSyntax. They ensure that the lookup passes through the generic parameter clause and doesn't create an infinite cycle on return.

Testing

One of the biggest challanges of this project, is properly testing and ensuring quality of the results. Integration of changes in a system that's supposed to replace an important piece in the compiler should be a top priority. That's why we came up with two ways of testing SwiftLexicalLookup.

Cheap "unit" tests

Inside swift-syntax, we've defined a suite of tests that are meant to test individual scope kinds and behaviors in isolation. These are standard XCTests with a custom harness function that are easy to write and inexpensive to run. Each test takes a snippet of code marked with markers and defines entry points for lookup along with the respective expected results.

Result expectations are defined with a special data structure that closely resembles the results produced by an unqualified lookup query. As of now, it has to be hard-coded, so testing a newly added result or name kind requires some manual effort. I hope we will be able to rewrite it into a nice macro in the future.

func testNameLookupForNilParameter() {
  assertLexicalNameLookup(
    source: """
      🔟class foo {
        let 1️⃣a = 0
        let 2️⃣b = 0

        3️⃣func foo() {
          let 4️⃣a = 0
          let 5️⃣c = 0

          if let 6️⃣a = 7️⃣x {
            let (8️⃣a, 9️⃣b) = (0, 0)

            0️⃣x
          }
        }
      }
      """,
    references: [
      "7️⃣": [
        .fromScope(CodeBlockSyntax.self, expectedNames: ["5️⃣", "4️⃣"]),
        .fromScope(
          FunctionDeclSyntax.self,
          expectedNames: [NameExpectation.implicit(.self("3️⃣"))]
        ),
        .lookInMembers(ClassDeclSyntax.self),
      ],
      "0️⃣": [
        .fromScope(CodeBlockSyntax.self, expectedNames: ["8️⃣", "9️⃣"]),
        .fromScope(IfExprSyntax.self, expectedNames: ["6️⃣"]),
        .fromScope(CodeBlockSyntax.self, expectedNames: ["5️⃣", "4️⃣"]),
        .fromScope(
          FunctionDeclSyntax.self,
          expectedNames: [NameExpectation.implicit(.self("3️⃣"))]
        ),
        .lookInMembers(ClassDeclSyntax.self),
      ],
    ],
    expectedResultTypes: .all(
      IdentifierPatternSyntax.self,
      except: [
        "3️⃣": FunctionDeclSyntax.self,
        "🔟": ClassDeclSyntax.self,
      ]
    ),
    useNilAsTheParameter: true
  )
}

The assertLexicalNameLookup test harness provides a number of additional optional assertions that further help in defining the expected result. On top of that, the data structure used for representing expected results offers additional assertions as well, such as explicitly specifying the name kind.

Integration tests

So far, we've discussed SwiftLexicalLookup in isolation from the current compiler ASTScope implementation. This is also how we initially developed the query, based on our understanding, simple empirical tests, and a mental model of Swift lexical scope semantics.

Of course, by only following this approach, there are many corner cases and differences that go unnoticed. That's why we also created a special function that could automatically perform a comparison between the results from ASTScope and SwiftLexicalLookup. Here's a sample output of the validation before a fix for one of the files from the Swift standard library:

-----> Lookup started at: 1706:7 ("debugPrint") finishInSequentialScope: false
     |           ASTScope           |      SwiftLexicalLookup
> ✅ |          k 1699:10           |          k 1699:10
> ✅ |          v 1699:13           |          v 1699:13
> ❌ |         first 1698:9         |         type 1690:18
> ❌ |        result 1697:9         |         self 1689:17
> ❌ |         type 1690:18         |         ↕️ V 1689:49
> ❌ |         self 1689:17         |         ↕️ K 1689:46
> ℹ️ | Omitted SwiftLexicalLookup name: Self 1683:1
> ❌ |          V 1689:49           |      Look memb: 1684:11
> ❌ |          K 1689:46           |            -----
> ℹ️ | Omitted ASTScope name: K 1689:46
> ℹ️ | Omitted ASTScope name: V 1689:49
> ❌ | End here: Look memb: 1684:11 |            -----

indicate valid result matches, while indicate the exact opposite. There are also partial matches indicated by . Needless to say, after initially running validation for a couple of relatively small files, my console got absolutely filled with red color :).

You might also notice gray markers, which indicate that some results got reordered or omitted. Sometimes, when performing real-world tests, we've encountered rather... weird ASTScope behavior. In this case, you can see that K and V generic parameters are introduced twice in different orders, leading to filtering on the ASTScope side and reordering in the SwiftLexicalLookup result. Also, nominal types consistently do not introduce the Self keyword in ASTScope (which is then accounted for in the type checker) that's why here, it's filtered on the SwiftLexicalLookup side.

Real-world testing is also the current state of SwiftLexicalLookup development. Currently, validation can be successfully run for the entire source code of SwiftIfConfig, and I'm working to fix all the discrepancies in the files of the Swift standard library. At this point, it's mainly about refining the validation function and making tweaks in the name representation (such as the exact position the name is introduced at which is especially problematic with implicit names), with a few major discrepancies popping up every now and then.

The friends we've made along the way...

When working on SwiftLexicalLookup we've encountered and documented some really fun scope behaviour. Some of them gave us a bit of a headache when implementing and sometimes it prompted us to slightly rethink our approach.

`guard`

All of us know and love guard! It's really powerful when combined with unwrapping optionals. It's also really simple in the way it works: If the variables are not nil, unwrap and continue execution of this code block, otherwise perform some action and exit the scope this guard is part of.

Interestingly, this simplicity doesn't translate well to our scope representation. Here we have a scope that introduces names to it's parent scope, but completely ignores those names when looking up from the body. Also the introduced names are visible in subsequent introduced names in the same guard. Oh and also we need to somehow interleave the partitioned results with results from parent code block scope so clients have a way to figure out shadowing etc...

This was the first major obstacle we've faced when developing this library and it prompted us to create the first specialized scope kind pair: SequentialScope and IntroducingToSequentialParentScopeSyntax. First one provides an algorithm that "sequentially" introduces names from inside code block items and interleaves them with the results found inside the second one. So as an example:

func foo() {
  let a = 1
  let b = 2

  guard let b = a else { return }
  
  x // <-- lookup here [ guard(b), codeBlock(b, a) ]

  let c = 3
  let d = 4
  
  x // <-- lookup here [ codeBlock(d, c), guard(b), codeBlock(b, a) ]

  guard let a = b else { return }

  x // <-- lookup here [ guard(a), codeBlock(d, c), guard(b), codeBlock(b, a) ]
}

You can see how, in this pseudo-representation, results from guards are combined with the ones from the code block. Another added complexity to sequential scopes, are types declared inside of them which should be visible at any point of the scope. That's why they need to be hoisted to the top of a code block (so that they are available in the entire scope).

`subscript`

This one was a bit surprising to me. I've initially thought we could handle it similarly to a function declaration, but there's one big catch here, and it's the self keyword.

A simple read-only subscript can be written like this:

class Y {
  var arr = ["AB", "CD"]
  subscript(i: Int) -> String {
    return arr[i]
  }
}

And here's the output of the validating function:

-----> Lookup started at: 4:12 ("arr") finishInSequentialScope: false
     |           ASTScope           |      SwiftLexicalLookup      
> ✅ |            i 3:13            |            i 3:13            
> ✅ |          self 3:31           |          self 3:31           
> ✅ |   End here: Look memb: 1:7   |        Look memb: 1:7

You can notice that self is being introduced at the boundary of the accessor block i.e. the opening curly brace of the subscript. That's a similar behavior to function declarations which also introduce self at their boundaries.

Now let's consider a subscript with getters and setters:

class X {
  var arr = ["AB", "CD"]
  subscript(i: Int) -> String {
    get {
      return arr[i]
    }
    set(newValue) {
      arr[i] = newValue
    }
  }
}

And when we run the validating function...

-----> Lookup started at: 5:14 ("arr") finishInSequentialScope: false
     |           ASTScope           |      SwiftLexicalLookup      
> ✅ |            i 3:13            |            i 3:13            
> ⚠️ |           self 4:5           |           self 3:31           
> ✅ |        Look memb: 1:7        |        Look memb: 1:7

There's a discrepency! In this case, turns out the compiler is introducing self at the boundary of individual accessors, which are inside the accessor block, but does it after the parameters. That greatly complicates handling of this scope as we need to check what is the exact kind of a subscript, in which accessor declaration did the lookup start and also we need to associate the result with that accessor.

Invalid macro parameter ordering

In the process, we haven't just discovered bugs in SwiftLexicalLookup. Out of those, the most interesting one, which we also decided to report as an issue 77141, is about an incorrect parameter order in macro declarations within ASTScope.

We assume that most discrepancies between implementations, which we account for during validation, are likely filtered by clients. However, this particular issue could potentially lead to unexpected behavior. Let’s look at this macro declaration from the Swift standard library:

@freestanding(expression)
public macro externalMacro<T>(module: String, type: String) -> T = Builtin.ExternalMacro

SwiftLexicalLookup models macro declarations similarly to function, initializer, and nominal type declarations i.e. it first returns the regular parameters, followed by the generic parameters. This ordering makes sense when, for example, looking up from within the parameter clause. Let's run the validation now and...

-----> Lookup started at: 2:39 ("String") finishInSequentialScope: false
     |           ASTScope           |      SwiftLexicalLookup
> ❌ |            T 2:28            |         module 2:31
> ℹ️ | Omitted ASTScope name: T 2:28
> ❌ |         module 2:31          |          type 2:47
> ❌ |          type 2:47           |            T 2:28

... see that ASTScope introduces generic parameters first, followed by regular parameters! This is rather unexpected behavior, which could cause some issues. For this reason, instead of just accounting for it during validation, we’ve also reported the bug.

I think this example nicely illustrates the current process of bringing the two implementations closer together: running tests on lots of code, identifying issues, and determining how to address them. This may involve either updating SwiftLexicalLookup to explicitly model these cases or developing a new representation and adjusting validation accordingly. We could also assume erroneous ASTScope behavior, which could mean simply filtering certain cases during validation, or, in rare cases like this one, filing an issue.

Future of the library

There's still plenty of work to be done with SwiftLexicalLookup. The top priority is currently to verify the implementation with a large amount of tests. The validation function I've been discussing in this post is part of my recent PR 77140. Concurrently, I'm also distilling a strong suite of integration tests that could possibly be later incorporated in the CI pipeline.

There's also one thing that remains to be modeled by the new unqualified lookup, namely "almost visible" names. They are used by the compiler to produce diagnostics and those could be e.g. variable names daclared later in a code block scope.

func foo() {
  print(a) // Error: Use of local variable 'a' before its declaration
  let a = 123
}

After fixing remaining discrepencies and building a strong suite of unit and integration tests, I'd also like to try to refractor some parts of the unqualified lookup API. Especially inside particular scope lookup method implementations, I was thinking about incorporating some kind of result builder functions to make them more readable and maintainable, but that's definitely low priority right now.

Beyond unqualified lookup, SwiftLexicalLookup also houses some simple lexical queries (that I've briefly mentioned at the start of this post) which remain to be tested against the compiler. We could possibly make the current validation method more flexible and reuse it in this case. Also there's a potential for adding more lexical queries like: "Where does this return return from?". I think that could actually make a great "Good First Issue".

The ultimate goal of SwiftLexicalLookup is to be incorporated in the compiler itself and replace ASTScope. That goal would further advance introduction of Swift code in the compiler and bring the language closer to being self-hosted. SwiftLexicalLookup could be also incorporated with the RequestEvaluator central infrastructure (which should be easier due to it's stateless nature!) to properly cache the results.

Beyond the compiler, SwiftLexicalLookup also enables a whole universe of scope based queries that could be used by IDEs and could be implemented in SourceKit-LSP. Furthermore, such queries would be guaranteed to be correct, as the same unqualified lookup API would be used by the compiler!

How to currently use the library?

Right now, the easiest way to try out the library is to use the current implementation in swift-syntax. After adding the package as a dependency to your project, simply use @_spi(Experimental) import SwiftLexicalLookup to access the new APIs. There's also currently open PR 2883 that introduces handling of #if clauses (thanks to SwiftIfConfig).

If you'd like to try out the validation, you can download this toolchain and use -enable-experimental-feature UnqualifiedLookupValidation.

Wrapping up...

I hope you liked and had fun reading this brief post about SwiftLexicalLookup as did I this summer. So far I'm really proud and happy how the library turned out and, even though there's still quite a bit of work left to do, I can already think of some use cases the library could be used for.

From here, I’d also like to thank my mentor @Douglas_Gregor again for his extensive knowledge, guidance throughout the summer, and for continuing to work with me on SwiftLexicalLookup. It means a lot to me to have you as a mentor and I really appreciate your patience with my many questions. :)

I’d also like to specifically thank @ahoppen for his consistently thorough and helpful code reviews. Thank you for always ensuring the quality of my PRs and for your valuable input on the implementation.

Additionally, I’d like to thank everyone else from the Swift community who helped me along the way. It wouldn’t have been possible without your support!

Lastly, I’m really looking forward to continuing my work on this project, bringing the library to a production-ready state, and unlocking its full potential. I’d love to hear your thoughts and ideas for the library and hope I’ve convinced some of you to already give it a try.

Jakub

Douglas_Gregor · November 13, 2024, 6:35am

I didn't see you mention it below, but I really like the design we ended up with here: the "prompt" isn't a callback (which is how the compiler's implementation of this code works), but it's more like an instruction to the client to go use some other means to perform lookup (say, into the extended nominal type) and use the results there. That way, the lexical lookup library always produces the same results whether in the compiler or some other tool (say, the IDE), and that client can use an appropriate strategy for following the instructions.

This is excellent! The tests are clear and easy to write / maintain.

This is something we should probably fix in the compiler, even if we don't think it can manifest in a "real" bug users could notice. What the compiler is doing is here is... weird.

This sounds intriguing! What kind of IDE functionality do you think could be built on the SwiftLexicalLookup APIs that exist today?

Awesome! This library has made a ton of progress, and the path to replacing ASTScope feels achievable.

Doug

MAJKFL · November 14, 2024, 4:54pm

Thank you for your kind words @Douglas_Gregor!

I think it could enable some pretty elaborate renaming functionalities that could deal with some special Swift corner cases. For example, for the guard sugar syntax:

guard let a else { ... }

print(a)

When renaming a reference inside print, this example could be rewritten to:

guard let x = a else { ... }

print(x)

I actually made a (very) rough demo with this particular case.

Also, since we have a nice notion of implicit names and we can easily access the associated parent scope, we could „rename” them by introducing them explicitly so for example:

do {
  // …
catch {
  print(error)
}

Could become:

do {
  // …
catch(let myError) {
  print(myError)
}

bnbarham · November 15, 2024, 6:08am

Very nice :) Thanks for all the awesome work @MAJKFL, it's been great to see the progress you've been able to make here! Especially cool to see this work finding existing (or at least potential) bugs.

Interesting idea! We currently rename such cases by renaming them all, but I can definitely see use cases where that would be undesirable and the option would be great to have.

Even more generally it seems like this could be used for a fast path of various IDE functionality we have today, ie. for cases where full typechecking isn't actually needed (eg. jump to definition/rename for local variables).

One aspect that might be interesting to look into is the performance of SwiftLexicalLookup vs ASTScope. Have you noticed any differences between the two in your integration testing?

MAJKFL · November 15, 2024, 9:52pm

Thank you so much @bnbarham!

Yes, I also think it could be great to use SwiftLexicalLookup with some existing functionality as it’s just a simple, stateless API. As soon as I find some time, I will try to experiment with sourcekit-lsp. :)

It’s a bit tricky to compare the performance just yet, specifically because in order to use SwiftLexicalLookup in the compiler, we need to parse the source twice. Besides that, I also have a suspicion the performance of lookup itself could be a bit worse when compared to ASTScope, but I hope it won’t be a big issue especially since qualified lookup will still significantly outweigh the cost of performing unqualified lookup.

Additionally, once the old parser is fully replaced by the new implementation (which I hope SLL will help to accelerate), that would remove a significant overhead. I hope it would be also relatively easy to employ some clever caching and do some optimizations in the current implementation once we have a strong suite of tests to integrate those changes and not break the library in the process.