[Pitch] Speed up protocol conformance checks on first launch using bloom filters

benpious · February 23, 2024, 1:01am

Hi all! I wanted to get folks' thoughts on a potential change that could improve the performance of swift_conformsToProtocol on the first launch of an application, at the cost of a relatively small amount of memory and code size increase.

I’ve not prototyped this at all, mainly because I think it would be pointless to start working without getting the thoughts of folks on the core team and in the community, to ensure that there is no better approach we haven’t considered, or something that prevents a data structure like Bloom Filters from being employed in this situation, like not being able to generate an appropriate hash for a type or something of that nature.

I'm also not sure if this makes more sense in the Development section, feel free to move it there if that is the case.

Motivation

For context on why this is an issue, this blog post gives a detailed description. The TL:DR is that Swift does a linear search whenever it needs to create type metadata for a type.

My employer’s applications have 100,000+ protocol conformance records in one of our binaries. While we’re aware that there is a dyld cache of the conformances, the cadence of our updates mean that the majority of users will have a poor experience once a week. We believe that many other applications which update on a weekly cadence may also be experiencing this issue.

Furthermore, improving the speed of the lookup can also have benefits for engineers who are profiling their applications in instruments. Many engineering teams may be wasting time trying to understand why their application is taking so much time in swift_conformsToProtocol, when in reality this cost is only paid on the first launch after updating. Conversely, engineers who do know may ignore real performance issues because their application running slowly during development is “normal.”

Improvements could also be beneficial on non-Apple platforms, such as running Swift on a Linux server. While there may not be many conformances in a microservice, the frequency it is updated and deployed likely makes relying solely on a dyld cache impractical, even if it was available on Linux.

High level approach

For every N protocol records (let’s say every 10,000, for now) the Swift compiler would emit a bloom filter in addition to the metadata. So an app with 100,000 protocols would have 10 filters emitted.

When the Swift runtime receives a request for type metadata, instead of naively walking all the records, it would check the bloom filter first. This would enable it to skip large chunks of the conformances entirely, turning a 100,000+ iteration loop into one as small as 10,000 iterations and 10 bloom filter checks in the best case.

This does assume that it is possible to generate a hash for the type; which I have yet to fully investigate, and may be the thing that dooms this proposal.

Memory/code size impact

Let’s say that we are targeting an error rate of 30%, and segmenting every 10,000 records. According to wikipedia, we need (10000 * 1.44 * log2(1/0.3)) / 8 = 3,126 bytes per segment. At 100,000 records, we need about 31k bytes in total.

While this may not be acceptable for some applications, if your application’s code size is already measured in the tens or hundreds of megabytes, it is a drop in the ocean for the resulting performance benefits.

Therefore, this could be an opt-in change. For ABI compatibility, it would probably have to be regardless. If the filter for a particular segment is not present, we could simply traverse it anyway.

It’s also not impossible to imagine that vendors like Apple could treat the filters like a dynamic resource instead of embedding them in the binary, and delete them from the system after the dyld cache was sufficiently populated.

ktoso · February 23, 2024, 1:06am

While not one to decide yay/nay for these parts of the compiler (I thought I should clarify since technically I am part of the Swift team so this could be over-interpreted), on a personal level I think that's an awesome idea.

Especially for server systems paying a bit in size really doesn't matter and we've definitely seen systems spend a lot of time in swift_conformsToProtocol sometimes, so I think that'd be a welcome addition.

Though as I said, someone closer to actual type-system like @Slava_Pestov maybe should chime in to give you a proper "worth the time" go ahead.

tera · February 23, 2024, 1:22am

Interesting idea. Have you considered other data structures like Set / Dictionary and why aren't they applicable in this case.

Joe_Groff · February 23, 2024, 1:38am

Do you know if the protocols that these conformances are being looked up are defined in your application, or do they come from the system frameworks? Of those 100,000 conformance records, do you have a sense of how many types are conforming to each protocol? While accelerating the global lookup table might be a good idea, it was my original hope that we would eventually be able to put per-protocol lookup tables and/or functions in front of it, for the set of conformances we know locally exist at the point of the type and/or protocol definition, and fall back to the global table only as a last resort for retroactive conformances that don't clearly belong to the same binary as the type or protocol. If you have a lot of protocols with a few interesting types conforming to each, having a smaller per-protocol lookup table seems like it could be a big win even with an accelerated global lookup table. (OTOH, if you have a few protocols with tens or hundreds of thousands of conformances each, then I can see that just moves the problem to the "local" lookup table.)

benpious · February 23, 2024, 2:48am

I/my coworkers can get you more precise info about the distribution of protocols in our applications if that would be helpful. But at a high level, I think there are several classes of protocols in our application:

Protocols that exist exclusively for mocking/unit tests, and have one conformant, maybe 2 (the mock) depending on which modules you're building/linking
Protocols that are similar to Codable/SwiftUI.View, which might have hundreds or thousands of conforming types. This is typically our own flavor of Codable which is used for network models
A long tail of protocols with 2-10 conforming types

So we would probably benefit from both the approach you are proposing and from the acceleration of the global lookup table.

it was my original hope that we would eventually be able to put per-protocol lookup tables and/or functions in front of it, for the set of conformances we know locally exist at the point of the type and/or protocol definition

Just curious, what kind of data structure would these live in? If this list is itself 50,000 protocols long, and the runtime is doing a linear search of it before it checks the global one, it seems like the problem could get worse in some cases.

coff · February 23, 2024, 2:50am

144k total conformances
41k classes
15k protocols

The number of conformers for each protocol is fairly top-heavy with the top 5 protocols amounting for ~55k conformances

benpious · February 23, 2024, 2:53am

I think they're not applicable for the same reason that whoever chose to do the linear search didn't pick them for the global lookup table in the first place.

I'm speculating somewhat about what the reasons for that might be, but my guess is that the person who made this choice was concerned about the size of the Dictionary, and the expense of producing it from the serialized representation, which would probably not be stored on disk ready to go. My understanding from a cursory reading of the runtime over six months ago is that the conformance records are read straight off of disk into memory (or mmapped?) fully formed and ready to be searched.

Joe_Groff · February 23, 2024, 3:20am

We were limited in the data structure we can build both by size concerns and by what we can reasonably expect a linker to be able to assemble from multiple object files at link time. We could potentially introduce an additional build step that builds a more specialized data structure, like the shared cache builder and dyld4 launch closure builder do on Apple platforms, though we'd have to integrate that tool into all the various build systems people use. You definitely might have some tricks in mind I'm not seeing, but it seems like a bloom filter would need a secondary tool to generate and incorporate into the binding, to the best of my knowledge. You're right that a second lookup table with tens of thousands of entries would not be very effective. From @coff's comment, I wonder if having per-type lookup tables would have fewer entries per table overall than per-protocol lookup tables would.

One thing we might be able to do with the linear lookup tables, that might be within reach of typical linker capabilities, is to use linkers' tendency to order subsections by ASCIIbetical order to sort the linear lookup table in a way that we could do a binary search instead of a brute linear search.

tera · February 23, 2024, 3:50am

A Dictionary like structure that could be saved to / read from disk without serialisation could work here. Further, if you mmap it - you won't nee reading/writing explicitly.

benpious · February 23, 2024, 4:17am

and by what we can reasonably expect a linker to be able to assemble from multiple object files at link time

You definitely might have some tricks in mind I'm not seeing, but it seems like a bloom filter would need a secondary tool to generate and incorporate into the binding, to the best of my knowledge

It did not occur to me that the ability of the linker to assemble this would be a limiting factor. I was imagining SwiftC would output them.

If memory serves, versions of the Swift compiler circa 2017 sometimes output all the generic specializations needed for a module into one .O file. So if that behavior still holds, maybe it can just output all of these into one .O file per module too, and the filters can be generated on a per-module or per-segment-of-module basis?

One thing we might be able to do with the linear lookup tables, that might be within reach of typical linker capabilities, is to use linkers' tendency to order subsections by ASCIIbetical order to sort the linear lookup table in a way that we could do a binary search instead of a brute linear search

If this could work, it seems far simpler and less error prone than the bloom filter idea. Is the difficulty from integrating it into the build system before the linker is run? If so, is there any reason why it can't be a "post processing" step after the linker is finished but before code signing, etc? Presumably this hypothetical tool just has to find the lookup tables in a binary, sort them, and then put them back where it found them, so the integration into the build system would be as simple as pointing it at the binary and maybe giving it a hint about where to look for the lookup tables.

tevelee · February 23, 2024, 10:13am

I'm wondering whether details about the runtime, such as these implementation specifics, fall under the purview of Swift Evolution, especially when it comes to performance enhancements that are not user-facing modifications of the language. Perhaps such changes don't require a formal proposal like this.

Joe_Groff · February 23, 2024, 4:17pm

For the flat protocol conformance table, we want to make sure we can build a runtime-discoverable structure that contains a record of all of the conformances in the binary, even if the module happens to be linked in with other modules in an executable, dylib, or object file. If we still have a build phase that can generate a .o file per-module, that has full knowledge of all the code that was built into that module including private and local types, that could help a lot. A bloom filter or other accelerator table doesn't necessarily have to be exhaustive, since we can fall back to linear search, so a best effort to generate the table could still be workable. We'd want to make sure if there are multiple such tables in a build product, that they all get put together somewhere the linker can find them, though.

Yeah, anything we do here shouldn't affect developer-observable behavior, so I would say this is a development topic and not a language evolution one.

tera · February 23, 2024, 4:23pm

When lookup ends with a false positive result will that be a problem or not?

George · February 23, 2024, 4:25pm

One additional question: Is my understanding correct that this is only an issue if you use dynamic casts? Does the compiler do any detection of “this protocol is never dynamically cast to, so we don’t need it in the hash table”?

Joe_Groff · February 23, 2024, 4:30pm

Not currently, though that is a code size optimization that we've long wanted to do. If a protocol conformance's protocol or type descriptor is never used dynamically, then the conformance itself should also be dead.

tapthaker · February 24, 2024, 6:18am

Could you validate that I understood this correctly ?

I love the idea to use binary-search to find protocol conformances but we would definitely need some implementation change in the linker. We would basically want to say that if the symbol is protocol conformance it should be placed in this section in a ASCII alphabetical order. This shouldn't be too challenging to achieve.

Do you think that the symbols exposed via tbd files for dylibs would have protocol conformance symbols for the linker ? If not we could try having similar protocol conformance table in every dylib.

David_Smith · February 24, 2024, 7:33am

We can actually do significantly better than alphabetical for binary searching static data: Eytzinger Binary Search - Algorithmica

jrose · February 24, 2024, 8:23am

…but it’s a lot trickier to convince a linker to do that for you without whole-image optimization.

dreampiggy · March 1, 2024, 7:42am

If memory serves, versions of the Swift compiler circa 2017 sometimes output all the generic specializations needed for a module into one .O file. So if that behavior still holds, maybe it can just output all of these into one .O file per module too, and the filters can be generated on a per-module or per-segment-of-module basis?

Do you means the -enable-single-module-llvm-emission ?
It also effect the codegen, so I guess this is still some optional feature

coff · March 6, 2024, 1:24am

Could someone advise what the path might be to make this a reality?

Not sure how this usually works, but would any of y'all insiders be willing to "sponsor" this and guide us through the process?