The Swift concurrency runtime uses thread-local storage to track a small amount of data (most importantly, pointers to the current task and executor). This is required in any case because of the designed ability to access certain kinds of task state from non-async functions, but we've committed fairly hard to it in the ABI: async functions do not pass around the current task and instead expect to be able to efficiently recover it from thread-local storage.
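To make the shape of that state concrete, here is a minimal sketch in C. It is not the runtime's actual layout or API: the `Task` and `Executor` types, the `ConcurrencyTLS` struct, and all function names are hypothetical stand-ins chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the runtime's task and executor objects. */
typedef struct Task { int id; } Task;
typedef struct Executor { int id; } Executor;

/* The small amount of per-thread state: two pointers. */
typedef struct {
    Task *current_task;
    Executor *current_executor;
} ConcurrencyTLS;

/* One instance per thread, zero-initialized when the thread starts. */
static _Thread_local ConcurrencyTLS concurrency_tls;

/* Install a task on the current thread and return the previous state
   so the caller can restore it when the task stops running. */
static ConcurrencyTLS enter_task(Task *task, Executor *executor) {
    ConcurrencyTLS saved = concurrency_tls;
    concurrency_tls.current_task = task;
    concurrency_tls.current_executor = executor;
    return saved;
}

/* Restore whatever was installed before enter_task. */
static void leave_task(ConcurrencyTLS saved) {
    concurrency_tls = saved;
}

/* Any function, async or not, can recover the current task without
   having it passed in as an argument. */
static Task *get_current_task(void) {
    return concurrency_tls.current_task;
}
```

The point of the sketch is only that nothing is threaded through call signatures, so the cost of `get_current_task` is entirely the cost of the thread-local access.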
On (most) Darwin platforms, when the runtime is integrated into the OS, we can take advantage of that integration so that thread-local accesses are quite efficient. For example, on arm64, these accesses are just a move from a system register plus a load/store at a constant offset. Some people have expressed a concern that thread-local storage is less efficient on other platforms. The purpose of this thread is to explore that.
I believe this is how it works on ELF. There are basically two classes of thread-local storage: the static TLS block and library-specific blocks. The static TLS block is allocated as part of the thread object, so its size must be the same for all threads and must be determined at load time.
The static TLS block can in turn be broken down into two parts: a portion that's automatically added by the thread library, which has negative offsets from the TLS base pointer, and a portion that's requested by the executable, which has non-negative offsets from the TLS base pointer. Only select system libraries like libc are supposed to use the first portion, and I believe they generally use static offsets; this is essentially the optimized path we use on Darwin. The second portion is laid out by the static linker based on the set of thread-local variables defined within the executable. Code inside the executable can use static offsets within this portion; this is called the Local Exec access model. Code outside the executable that knows that a thread-local variable is defined by the executable can do the next best thing, which is to load the offset dynamically from a variable; this is the Initial Exec access model.
If a variable is defined in a shared library, it is generally allocated in a library-specific block, which is laid out by the static linker; an access first derives the library-specific TLS base pointer, then adds the appropriate static offset for that variable within the library's block. Unfortunately, deriving the library-specific TLS base pointer requires a function call, chiefly because the memory is often lazily allocated. This function call is optimized to preserve most registers, and it's relatively efficient, but it's still a significant downgrade in performance.
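For anyone who wants to compare the generated code, GCC and Clang let you request a specific access model per variable with the `tls_model` attribute. The sketch below assumes one of those compilers targeting ELF; the variable and function names are made up for illustration. Note that the linker is allowed to relax a more general model into a cheaper one when producing an executable, so the function-call path is really only visible if you compile the general case as position-independent code for a shared library. The Local Exec variable would fail to link in a shared library.

```c
/* General case: always correct. In a shared library, this is the path
   that needs a function call to find the library's TLS block. */
static _Thread_local int counter_general
    __attribute__((tls_model("global-dynamic"))) = 0;

/* Initial Exec: the offset from the TLS base pointer is loaded from a
   variable that the dynamic linker fills in at load time. */
static _Thread_local int counter_initial_exec
    __attribute__((tls_model("initial-exec"))) = 0;

/* Local Exec: the offset is a link-time constant. Only valid for code
   and variables that are both in the executable itself. */
static _Thread_local int counter_local_exec
    __attribute__((tls_model("local-exec"))) = 0;

/* Each function increments its own per-thread counter and returns the
   new value; only the way the address is computed differs. */
int bump_general(void)      { return ++counter_general; }
int bump_initial_exec(void) { return ++counter_initial_exec; }
int bump_local_exec(void)   { return ++counter_local_exec; }
```

Compiling this with `-S` (and with `-fPIC` for the shared-library case) and reading the three functions side by side is probably the quickest way to see what each model costs on a given target.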
If we can put the concurrency runtime's thread-local storage in the static TLS block, we should be able to use at worst the Initial Exec access pattern, which should be efficient enough to quiet any concerns as long as deriving the TLS base pointer itself isn't too slow. (I don't know what deriving the TLS base pointer looks like on different platforms; if that alone requires a function call, we're somewhat doomed.) I believe we can do this if we can include a small object file that defines this storage in every executable that will load Swift code. That's not a reasonable request for most dynamic libraries, but it might not be unreasonable for a language runtime. It would make it difficult to implement something in Swift like a plugin for an existing executable, though.
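Roughly, the split I have in mind would look like the sketch below. Everything here is hypothetical, including the `ExampleRuntimeTLS` type and the `example_*` names; it is shown as a single file for brevity, with comments marking which piece would live where. It also assumes GCC or Clang for the `tls_model` attribute.

```c
#include <stddef.h>

/* Hypothetical layout of the runtime's per-thread state. */
typedef struct {
    void *current_task;
    void *current_executor;
} ExampleRuntimeTLS;

/* Shared header, seen by the runtime library: the variable is declared
   but not defined, and accesses are asked to use Initial Exec. */
extern _Thread_local ExampleRuntimeTLS example_runtime_tls
    __attribute__((tls_model("initial-exec")));

/* The small object file linked into every executable: the definition,
   which places the storage in the executable's static TLS block. */
_Thread_local ExampleRuntimeTLS example_runtime_tls = { NULL, NULL };

/* Runtime library code: reads and writes go through the declaration
   above, so no function call is needed to locate the storage. */
void *example_get_current_task(void) {
    return example_runtime_tls.current_task;
}

void example_set_current_task(void *task) {
    example_runtime_tls.current_task = task;
}
```

The plugin problem shows up here directly: if the host executable was not linked with the defining object file, the runtime library's reference to `example_runtime_tls` has nothing to resolve against.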