Sourcekit-lsp and the optimal index store size for efficient go-to-def latency?

Hello, a quick question, since I was unable to find an answer to this elsewhere.

Is there some kind of optimal, or maximum supported, index store size for efficiency? I know one of the limitations in Xcode is that it can't support indexing / go-to-definition for huge monorepos (hence the usage of focus targets). I am currently using VS Code, but the index for the project would be huge if I were to build all targets. My underlying assumption is that lookups can only be efficient up to some index store size. E.g. if my --index-store-path is set to /foo/bar/indexstore and that path contains about 10 gigabytes worth of data, I'm assuming general LSP requests like go-to-definition will face a decent amount of latency. My questions are:

  1. Do we know what the "max size" for an index store should be in order to have the most efficient lookups for go-to-definition?
  2. With a huge index store, does some underlying mechanism in sourcekit-lsp cause it to keep using extra CPU? I've found that its CPU usage seems to keep increasing, and I am wondering if it's somehow retaining a lot of things due to the big index store size.

Thanks in advance!

Also, for additional context: with index store sizes of 11~17 GB, I've discovered that something in sourcekit-lsp seems to cause swap paging to increase by tremendous amounts. Is this expected behavior with large index stores?

It gets bad enough that my machine instance begins to freeze up.

In general, there shouldn’t be a limit on the size of the index store that is supported. One thing that might be worth noting is that there are multiple levels to the index:

  • The data is stored in record files inside index/store/v5/records. Having many potentially old record files in this directory should not pose an issue.
  • These record files are referenced by unit files in index/store/v5/units. Only the record files referenced by unit files should be relevant for the index.
  • And then there’s indexstore-db which effectively forms an index on top of the index store (aka the record + unit files) and allows us to look up which unit and record files contain references to which USR etc.

The indexstore-db is built by scanning through the index store when you launch your editor and the database is memory-mapped, IIRC. This raises the following questions:

  • Are all the record files referenced by unit files or are there dead files in there?
  • How big is the indexstore-db? I could imagine that you get surprising behavior if the indexstore-db approaches the available memory on your system.
  • Could you provide a sample of SourceKit-LSP while it’s using CPU (run sample sourcekit-lsp)?
  • Does SourceKit-LSP eventually calm down (I would expect some initial CPU usage as it builds the indexstore-db but for it to calm down once that is done and the database persisted to disk from that point onwards)?
  • Are all the record files referenced by unit files or are there dead files in there?

The files are quite fresh, so there should be close to no dead files in there

  • How big is the indexstore-db? I could imagine that you get surprising behavior if the indexstore-db approaches the available memory on your system.

Is there a separate indexstore-db size? The 17 gigabytes comes from summing up all the files in index/store, so that includes v5/units and v5/records.
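To compare the sizes being discussed, here is a minimal sketch in Python that sums file sizes separately for the units directory, the records directory, and the indexstore-db directory. The function names are made up for illustration, the v5/units and v5/records layout is taken from the description earlier in the thread, and the database path is whatever your setup uses.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files below root (symlinks are skipped).

    A missing directory simply counts as 0 bytes.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # The file may have disappeared while the index is being updated.
                pass
    return total


def size_report(index_store, index_db):
    """Return sizes in bytes for the unit files, record files, and the database."""
    return {
        "units": directory_size_bytes(os.path.join(index_store, "v5", "units")),
        "records": directory_size_bytes(os.path.join(index_store, "v5", "records")),
        "indexstore-db": directory_size_bytes(index_db),
    }
```

Calling size_report("/path/to/index/store", "/path/to/index/db") with your own paths gives the three numbers side by side, which makes it easier to tell the index store size apart from the database size.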

  • Could you provide a sample of SourceKit-LSP while it’s using CPU (run sample sourcekit-lsp)?

Unfortunately this is not on macOS, so I can't run sample. Is there a different tool whose output you would like to see?

  • Does SourceKit-LSP eventually calm down (I would expect some initial CPU usage as it builds the indexstore-db but for it to calm down once that is done and the database persisted to disk from that point onwards)?

It runs for ~10 minutes with a huge amount of IO after opening a Swift file, dies down, but restarts later for some reason. It registers almost ~200GB in the write_bytes field of /proc/${pid}/io, but I'm not really sure where that's going.
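For tracking the IO counters mentioned here over time, a small Python sketch that parses the key/value format of /proc/<pid>/io on Linux and computes how many bytes were written between two snapshots. The helper names are illustrative. Per the proc documentation, wchar counts bytes passed to write-style system calls, while write_bytes counts bytes the process caused to be sent to the storage layer, so comparing the two may help narrow down where the writes come from.

```python
def parse_proc_io(text):
    """Parse the contents of /proc/<pid>/io into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value.strip())
    return counters


def read_proc_io(pid):
    """Read and parse /proc/<pid>/io (Linux only; needs permission to read it)."""
    with open(f"/proc/{pid}/io") as f:
        return parse_proc_io(f.read())


def written_between(before, after):
    """Difference in wchar and write_bytes between two snapshots."""
    return {
        "wchar": after["wchar"] - before["wchar"],
        "write_bytes": after["write_bytes"] - before["write_bytes"],
    }
```

Taking one snapshot with read_proc_io(pid) when the editor starts and another some minutes later, then passing both to written_between, shows how fast the counters grow during the database build.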

Is your project large? Probably millions of lines of source code?

yup, the project is large

How do you provide build settings to SourceKit-LSP? That defines where indexstore-db is saved. A BSP server would set the indexstore-db location using indexDatabasePath in the initialize request; for compile commands, the index database is in IndexDatabase next to the index store; and for SwiftPM, it’s in .build/debug/index/db. That size would be interesting to know as well.
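The three locations described above can be written down as a small lookup. This is only a sketch of the description in this post, not of SourceKit-LSP's actual code: the function and its "kind" strings are hypothetical, and "next to the index store" is read here as "a sibling directory of the index store", which is an assumption.

```python
from pathlib import Path


def index_db_path(kind, *, project_root=None, index_store=None,
                  bsp_index_database_path=None):
    """Return where the indexstore-db is expected to live for a given setup.

    kind is a made-up label: "bsp", "compile-commands", or "swiftpm".
    """
    if kind == "bsp":
        # The BSP server reports the location itself (indexDatabasePath).
        return Path(bsp_index_database_path)
    if kind == "compile-commands":
        # Assumption: "next to the index store" means a sibling directory.
        return Path(index_store).parent / "IndexDatabase"
    if kind == "swiftpm":
        return Path(project_root) / ".build" / "debug" / "index" / "db"
    raise ValueError(f"unknown build setup: {kind!r}")
```

For example, with compile commands and an index store at /foo/bar/indexstore, this sketch would point at /foo/bar/IndexDatabase as the directory whose size is worth measuring.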

There is the pull request "[6.0] Use an `AtomicInt32` to count `pendingUnitCount` instead of using `AsyncQueue`" (#1744 by ahoppen in swiftlang/sourcekit-lsp), which addresses an issue that caused huge CPU usage and that might affect you. That issue is fixed in recent Swift 6.0 or main development snapshots but hasn’t made its way into a release yet. Could you check whether you’re seeing the same CPU usage with a toolchain snapshot from the Install Swift page on Swift.org?

If you’re still seeing the issue in the development snapshots, do you know if there’s an equivalent to sample on Linux? Or, could you build sourcekit-lsp from source and attach a debugger? I assume that we’re busy building the indexstore-db from the index store but that would be good to confirm.

This is using the perf command on Linux. We are actually using sourcekit-lsp with the main development snapshot, I believe. I didn't include the entirety of the report for obvious reasons, and I only recorded it for a bit over a minute, but some of these numbers do seem very high in terms of CPU usage (or would you say they are expected)?

Also FWIW, sourcekit-lsp gets initialized upon opening a Swift file.
I've monitored the db size, and it stops at ~3.1G while the index store has ~14G.
However, as I open more random Swift files, the indexstore-db seems to increase in size by ~0.1 to ~0.2G. Is this expected behavior? I was under the impression that it scans through the index store to build out the db, and that technically the db shouldn't keep getting bigger if nothing changes in the index store. I've seen the indexstore-db size increase up to 4.5G (and continue to increase randomly), for example.
FWIW, I consider it finished if I don't see a change in the size of the db for a bit, so this may not be the most accurate method. Please advise if there is a better way to do this.
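The "no size change for a while" heuristic described here can be made a bit more systematic with a polling loop. The sketch below is in Python and is only that heuristic, not a real completion signal from SourceKit-LSP; the function name, the interval, and the number of unchanged checks are arbitrary assumptions.

```python
import time


def wait_until_stable(get_size, interval_s=30.0, stable_checks=10, sleep=time.sleep):
    """Poll get_size() until it returns the same value stable_checks times in a row.

    get_size is any zero-argument callable, e.g. one that sums the file sizes
    of the indexstore-db directory. Returns the final, stable size.
    """
    last = get_size()
    unchanged = 0
    while unchanged < stable_checks:
        sleep(interval_s)
        current = get_size()
        if current == last:
            unchanged += 1
        else:
            # Any change resets the counter, so short pauses are not
            # mistaken for the build being finished.
            unchanged = 0
            last = current
    return last
```

With the defaults, the size has to stay constant for about five minutes before the function returns. Given that the build was later observed to pause and resume, a longer window is probably safer for a database of this size.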

I've attached another sample of the captured perf output; it seems to be pretty consistent in terms of CPU usage and the symbols involved.


Also fwiw, noticed something like this today

$ perf record -F 99 -p 1721559 -g
WARNING: Ignored open failure for pid 1774882
WARNING: Ignored open failure for pid 1774971
WARNING: Ignored open failure for pid 1775026
WARNING: Ignored open failure for pid 1775028
WARNING: Ignored open failure for pid 1775036
WARNING: Ignored open failure for pid 1775039
WARNING: Ignored open failure for pid 1775040
WARNING: Ignored open failure for pid 1775041
WARNING: Ignored open failure for pid 1775042
WARNING: Ignored open failure for pid 1775043
WARNING: Ignored open failure for pid 1775044
WARNING: Ignored open failure for pid 1775050
WARNING: Ignored open failure for pid 1775051
WARNING: Ignored open failure for pid 1775054
WARNING: Ignored open failure for pid 1775055
WARNING: Ignored open failure for pid 1775118
WARNING: Ignored open failure for pid 1775120

but I haven't been able to track down what those pids represent, as ps aux | grep <pid> doesn't seem to return what processes those are.

Edit: I've also noticed that the sourcekit-lsp PID had changed when I checked the machine later. I had done nothing with the machine up to this point, and the indexstore-db is now at 5.4 GB. I've checked whether the sourcekit-lsp trace says anything, but nothing seems to indicate it had crashed (notice the timestamps), so I'm not sure why the PID would change.

[Trace - 2:10:54 PM] Received response 'textDocument/diagnostic - (116)' in 621ms.
Result: {
    "items": [],
    "kind": "full"
}


[Trace - 5:31:19 PM] Sending request 'textDocument/documentHighlight - (118)'.
Params: {
    "textDocument": {
        "uri": "file:some/file/here"
    },
    "position": {
        "line": 44,
        "character": 29
    }
}

This is the perf report at this point

Edit 2: This is the state at 6.6 GB of indexstore-db, but it took a very long time to get to this point (at least 3+ hours, it seems). I guess I was wrong initially about the size stopping at ~3.1G; it was still in the process of building the db even without me opening any Swift files.

Currently this is the stagnant state; the db doesn't seem to increase in size anymore. Is the sourcekit-lsp CPU usage normal at this point?

Extra update: yes, it does seem like sourcekit-lsp crashes. It gets reinitialized and begins to rewrite the indexstore-db.

For example, I had a case where I observed the indexstore-db at about 5.1 G. Soon after, my machine froze for a bit, and I noticed the message about initializing sourcekit-lsp, which most likely indicates that it had crashed. I checked the size of the indexstore-db again, and it was at 384 M, so it had removed the previously written indexstore-db and started writing a new one from scratch. Is this a configuration issue, or is resuming the indexstore-db build from the point where it stopped not supported?

This also makes me want to ask, is there a way to tell sourcekit-lsp to not build the indexstore-db if there is already an existing indexstore-db? (For a use case e.g. we have already pre-processed the entire index store / indexstore db building for a machine and want to later import it)

Thanks for the perf snapshot. It does look like we’re indeed spending the CPU resources while creating indexstore-db.

I hope that I captured all of your questions.

However, as I open more random swift files, the indexstore-db seems to increase in size ~0.1 to ~0.2G. Is this an expected behavior?

That seems odd to me. I would have expected indexstore-db to stay constant in size after it has been built. Navigating source files should not cause it to increase in size unless new unit or record files are added to the index store.

SourceKit-LSP PID changing

That does look like SourceKit-LSP is crashing, which is also what you concluded. Given that you also say that your machine hangs, I’m wondering whether SourceKit-LSP grows in memory and is eventually killed by the OS. Did you see how SourceKit-LSP’s memory usage develops while building the indexstore-db?
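On Linux, one way to answer the memory question is to sample the resident and swapped-out memory of the process from /proc/<pid>/status while the database is being built. The Python sketch below only parses the VmRSS and VmSwap lines; the helper names are illustrative, and since the database is said to be memory-mapped, the resident figure may include mapped file pages.

```python
def parse_status_kb(text, fields=("VmRSS", "VmSwap")):
    """Extract selected kB-valued fields from the contents of /proc/<pid>/status."""
    result = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in fields:
            # Lines look like "VmRSS:\t  123456 kB"; the first token is the number.
            result[key] = int(rest.split()[0])
    return result


def read_memory_kb(pid):
    """Read VmRSS and VmSwap (in kB) for a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status_kb(f.read())
```

Logging read_memory_kb(pid) once per minute alongside the database size would show whether memory grows steadily up to the point where the process disappears.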

Currently this is the stagnant state; the db doesn't seem to increase in size anymore. Is the sourcekit-lsp CPU usage normal at this point?

If SourceKit-LSP is idle, it shouldn’t use any CPU. I am not familiar with how to read the perf output, but assuming that the leftmost column is based on the process’s total CPU usage, it seems like it’s just doing idle kernel work every now and then, which doesn’t strike me as very odd.

This also makes me want to ask, is there a way to tell sourcekit-lsp to not build the indexstore-db if there is already an existing indexstore-db? (For a use case e.g. we have already pre-processed the entire index store / indexstore db building for a machine and want to later import it)

When SourceKit-LSP quits normally, it should move the indexstore-db to a saved folder and then re-use that when it’s restarted (this is implemented in lib/Database/Database.cpp of swiftlang/indexstore-db, at commit 54212fce1aecb199070808bdb265e7f17e396015). There is no specific flag to control this.

And a couple questions from my side:

  • Do you see the same behavior on macOS or only Linux?
  • I assume you’re not able to share the index store so I can try reproducing it or can you?
  • Do you see any way in which you could create a project that reproduces this issue?

Sorry, I know it's a long post and I've been continuously digging, so I will summarize a few things here with the latest update.

That seems odd to me. I would have expected indexstore-db to stay constant in size after it has been built. Navigating source files should not cause it to increase in size unless new unit or record files are added to the index store.

Yes, I believe you are correct. It's hard to know when the indexstore-db has been fully built. (Please advise if there's an easy way to see the current progress, which would be really nice to have. My assumption is that, for now, the best I can do is add some logging for when it's done?) Right now I can only conclude that it's done building when I don't see a change in the file size after some time. That said, with the same index store, I've had varying sizes for the "end result" (e.g. 6.5, 7.0, 8.1 GB). That could be caused by something else, so take it with a grain of salt until I find a way to know for sure that it's done writing.

Did you see how SourceKit-LSP’s memory usage develops while building the indexstore-db?

I'll update with this when I get a chance

When SourceKit-LSP quits normally, it should move the indexstore-db to a saved folder and then re-use that when it’s restarted (this is implemented in lib/Database/Database.cpp of swiftlang/indexstore-db, at commit 54212fce1aecb199070808bdb265e7f17e396015). There is no specific flag to control this.

Are we opposed to this? E.g. creating a simple flag to tell sourcekit-lsp to read the db from a given path instead of looking at the saved dir / writing a new one? I'll have to continue looking at the code, but from an initial look, it seems like indexstore-db uses the PID of sourcekit-lsp in the directory name. I'd have to figure out why, in my case, sourcekit-lsp seems to want to write a new db from scratch when I already have one in the designated path (I assume that has something to do with it not being the "saved" path?).

Do you see the same behavior on macOS or only Linux?

I'll attempt to replicate it on macOS as well

I assume you’re not able to share the index store so I can try reproducing it or can you? Do you see any way in which you could create a project that reproduces this issue?

Unfortunately I don't think so :/. I'll look into whether this is a possibility.

You should be able to increase the log level to debug by modifying your configuration file and look at the SourceKit-LSP logs that are being written to ~/.sourcekit-lsp/logs. If you don’t see any IndexStoreDB changed messages anymore, then indexstore-db should be finished building.
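Checking the logs for the last such message can be scripted. The Python sketch below scans every file in a log directory for a substring and returns the most recent match by file modification time. The search string "IndexStoreDB changed" is taken from the post above; the exact wording and layout of the log lines are assumptions, so adjust the needle to whatever your logs actually contain.

```python
import os


def last_match(log_dir, needle="IndexStoreDB changed"):
    """Return (mtime, filename, line) for the last matching line in the newest
    log file that contains the needle, or None if nothing matches."""
    newest = None
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue
        mtime = os.path.getmtime(path)
        with open(path, errors="replace") as f:
            for line in f:
                # ">=" lets later lines in the same file replace earlier ones.
                if needle in line and (newest is None or mtime >= newest[0]):
                    newest = (mtime, name, line.rstrip("\n"))
    return newest
```

Running last_match(os.path.expanduser("~/.sourcekit-lsp/logs")) a few minutes apart and getting the same result each time would suggest that the database is no longer changing.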

I don’t think such a flag is the right direction. We should figure out why we are not properly saving the indexstore-db to the saved folder or why building indexstore-db takes so long.

The way that saving is supposed to work is that sourcekit-lsp creates an indexstore-db in a directory that has its pid in the name. That ensures that we don’t have two processes accessing the same database. Now, when sourcekit-lsp exits, we should move the indexstore-db from that folder with the pid into a folder named saved. And when sourcekit-lsp launches again, it should check for the presence of a saved folder and move that to the pid-folder. All of this happens in lib/Database/Database.cpp of swiftlang/indexstore-db (at commit 54212fce1aecb199070808bdb265e7f17e396015).
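The hand-off between the pid folder and the saved folder can be illustrated with a few lines of Python. This is a simplified model of the behavior described in this post, not a port of Database.cpp: the "p<pid>" directory naming and both function names are made up for illustration, and the real implementation may differ in the details.

```python
import os
import shutil


def claim_database(db_root, pid):
    """On launch: take over the saved database if there is one, else start empty."""
    saved = os.path.join(db_root, "saved")
    mine = os.path.join(db_root, f"p{pid}")  # hypothetical naming scheme
    if os.path.isdir(saved):
        # Moving (not copying) means no other process can pick up the same database.
        os.rename(saved, mine)
    else:
        os.makedirs(mine)
    return mine


def save_database(db_root, pid):
    """On clean exit: move this process's database to the saved folder."""
    saved = os.path.join(db_root, "saved")
    mine = os.path.join(db_root, f"p{pid}")
    if os.path.isdir(saved):
        shutil.rmtree(saved)  # replace an older saved database
    os.rename(mine, saved)
    return saved
```

In this model, a process that exits without calling save_database leaves its pid folder behind and no saved folder, so the next launch starts from an empty database. That matches the observation of the database dropping from 5.1 G back to 384 M.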

Worth noting that this won't happen if sourcekit-lsp crashed - which seems to be the case here?

We specifically delete any indexstore-db that no longer has a running process if it isn't in the saved folder. We can't re-use it, as it could potentially be corrupted.
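The cleanup rule can be sketched the same way: any per-process database folder whose process is gone gets removed, while the saved folder is left alone. This Python sketch assumes a hypothetical "p<pid>" folder naming and a POSIX system, where sending signal 0 checks for a process without affecting it. It is an illustration of the rule as stated here, not the actual indexstore-db code.

```python
import os
import shutil


def pid_is_running(pid):
    """Best-effort check whether a process with this pid exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_stale_databases(db_root, is_running=pid_is_running):
    """Delete p<pid> folders without a live process; return the removed names."""
    removed = []
    for name in sorted(os.listdir(db_root)):
        path = os.path.join(db_root, name)
        if name == "saved" or not os.path.isdir(path):
            continue
        if not (name.startswith("p") and name[1:].isdigit()):
            continue
        pid = int(name[1:])
        if pid > 0 and not is_running(pid):
            # A database left behind by a crashed process may be corrupted.
            shutil.rmtree(path)
            removed.append(name)
    return removed
```

The is_running parameter exists only so the rule can be exercised without real processes.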

I don’t think such a flag is the right direction. We should figure out why we are not properly saving the indexstore-db to the saved folder or why building indexstore-db takes so long.

Sorry, a bit of clarification from my side. So my use case is something like this:

  1. I have a big codebase, lots of index stores to be generated
  2. This also means that writing the indexstore-db seems to take at least 1.5+ hours. Due to the high CPU usage (and probably also memory usage), sourcekit-lsp is prone to crashing.
  3. Due to these, I want to be able to pre-process the index store + db generation.
  4. If I have 5 machines, hypothetically all with the same file structure (hence the file paths shouldn't be an issue), I was thinking of having 1 machine pre-process, and then 4 other machines occasionally fetching this processed index store + db.

I guess with the workflow you described, if I can just "transfer" the files into this saved folder, then I assume sourcekit-lsp will automatically pick it up as a saved database?

Worth noting that this won't happen if sourcekit-lsp crashed - which seems to be the case here?

One of the problems I've been running into is that the crash occurs frequently, so I can't build out the indexstore-db reliably

I’m a little surprised that you take 1.5h for ~15GB. The ballpark I have in my head right now is that we need about 1 minute to build an indexstore-db for 1GB of index store data and you’re well above that.

I think that should/could work, but I never tried it.

We should figure out why we’re crashing and fix that crash. If you can reproduce the crashes on macOS, could you file an issue with a crash log? On Linux, maybe you can attach lldb to sourcekit-lsp. That should give you a stack trace when sourcekit-lsp is crashing.

I’m a little surprised that you take 1.5h for ~15GB. The ballpark I have in my head right now is that we need about 1 minute to build an indexstore-db for 1GB of index store data and you’re well above that.

It is able to build the first ~1 GB of index store very rapidly. However, as the db grows, it takes longer and longer to build (e.g. 0 -> 1 GB might take a minute, while 3.5 GB -> 3.7 GB can take ~5 minutes).
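To put the numbers from the last few posts next to each other, here is a short back-of-the-envelope calculation in Python. It uses only figures quoted in the thread: the ballpark of about 1 minute per GB of index store data, the ~15 GB / 1.5 h observation, and the two growth intervals above, which are read here as database sizes (an assumption).

```python
def expected_minutes(index_store_gb, minutes_per_gb=1.0):
    """Ballpark build time from the '1 minute per 1 GB of index store' estimate."""
    return index_store_gb * minutes_per_gb


def growth_rate_gb_per_min(start_gb, end_gb, minutes):
    """Average growth of the database over an interval."""
    return (end_gb - start_gb) / minutes


expected = expected_minutes(15)                   # about 15 minutes expected
observed = 1.5 * 60                               # 90 minutes reported
early = growth_rate_gb_per_min(0.0, 1.0, 1.0)     # 1 GB per minute at the start
late = growth_rate_gb_per_min(3.5, 3.7, 5.0)      # about 0.04 GB per minute later

print(f"observed / expected build time: {observed / expected:.1f}x")
print(f"early / late growth rate:       {early / late:.0f}x")
```

So the build as a whole takes roughly six times longer than the ballpark suggests, and the growth rate late in the build is roughly 25 times lower than at the start. That pattern is what one might expect if the database stops fitting in memory, though the thread does not confirm that as the cause.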

I think that should/could work, but I never tried it.

Seems like it works. I was able to get it to use the db by renaming the folder to saved.

We should figure out why we’re crashing and fix that crash. If you can reproduce the crashes on macOS, could you file an issue with a crash log? On Linux, maybe you can attach lldb to sourcekit-lsp. That should give you a stack trace when sourcekit-lsp is crashing.

I will definitely try to get back to this, but your initial theory seems to be on the right track: memory / CPU usage balloons, and that causes sourcekit-lsp to crash. If I leave the machine alone, it takes a long time, but it is able to build out the db. However, if you do additional things while it is in the middle of building the db (such as opening files in VS Code), it seems to be pretty easy to crash sourcekit-lsp.