Concurrency's use of thread-local variables

John_McCall · May 20, 2021, 10:52pm

The Swift concurrency runtime uses thread-local storage to track a small amount of data (most importantly, pointers to the current task and executor). This is required in any case because of the designed ability to access certain kinds of task state from non-async functions, but we've committed fairly hard to it in the ABI: async functions do not pass around the current task and instead expect to be able to efficiently recover it from thread-local storage.
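
To make the shape of this concrete, here is a minimal sketch in C of the pattern described above: a per-thread record that a scheduler fills in, and an accessor that recovers the current task without it being passed as an argument. Every name here (`HypotheticalTask`, `hypothetical_current_task`, and so on) is made up for illustration and is not the Swift runtime's actual API or layout.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the runtime's task and executor objects. */
typedef struct { int id; } HypotheticalTask;
typedef struct { int id; } HypotheticalExecutor;

/* Per-thread record of what is currently running on this thread. */
typedef struct {
    HypotheticalTask *task;
    HypotheticalExecutor *executor;
} HypotheticalThreadState;

/* One instance per thread, zero-initialized for every new thread. */
static _Thread_local HypotheticalThreadState current_state = { NULL, NULL };

/* Called by the (hypothetical) scheduler when a task starts running here. */
void hypothetical_enter_task(HypotheticalTask *task,
                             HypotheticalExecutor *executor) {
    current_state.task = task;
    current_state.executor = executor;
}

/* Called when the task suspends or finishes on this thread. */
void hypothetical_leave_task(void) {
    current_state.task = NULL;
    current_state.executor = NULL;
}

/* Any function, async or not, can recover the task without a parameter. */
HypotheticalTask *hypothetical_current_task(void) {
    return current_state.task;
}
```

The cost of `hypothetical_current_task` is exactly the cost of one thread-local access, which is why the access model matters so much for this design.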

On (most) Darwin platforms, when the runtime is integrated into the OS, we can take advantage of that integration so that thread-local accesses are quite efficient. For example, on arm64, these accesses are just a move from a system register plus a load/store at a constant offset. Some people have expressed a concern that thread-local storage is less efficient on other platforms. The purpose of this thread is to explore that.

I believe this is how it works on ELF. There are basically two classes of thread-local storage: the static TLS block, and then library-specific blocks. The static TLS block is allocated as part of the thread object, and so its size must be the same for all threads and must be determined at load time. The static TLS block can in turn be broken down into two parts: a portion that's automatically added by the thread library, which has negative offsets from the TLS base pointer, and a portion that's requested by the executable, which has non-negative offsets from the TLS base pointer. Only select system libraries like libc are supposed to use this first portion, and I believe they generally use static offsets; this is essentially the optimized path we use on Darwin. The second portion is laid out by the static linker based on the set of thread-local variables defined within the executable. Code inside the executable can use static offsets within this portion; this is called the Local Exec access model. Code outside the executable that knows that a thread-local variable is defined by the executable can do the next best thing, which is to load the offset dynamically from a variable; this is the Initial Exec access model. If a variable is defined in a shared library, it is generally allocated in a library-specific block which is laid out by the static linker; an access first derives the library-specific TLS base pointer, then adds the appropriate static offset for that variable within the library's block. Unfortunately, deriving the library-specific TLS base pointer requires a function call, chiefly because the memory is often lazily allocated. This function call is optimized to preserve most registers, and it's relatively efficient, but still, it's a significant downgrade in performance.
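
As a rough illustration of the access models named above, GCC and Clang let you request a model per variable with the `tls_model` attribute (a compiler extension, not standard C). The sketch below assumes an ELF target and is meant to be compiled into an executable; the variable and function names are hypothetical. Note that the compiler and linker are free to relax a requested model to a cheaper one when they can prove it is safe, so the generated code for these three functions may end up identical in a small test program.

```c
/* Local Exec: a static offset from the TLS base pointer. Only valid for
   variables defined in the executable itself. */
static _Thread_local int counter_local_exec
    __attribute__((tls_model("local-exec")));

/* Initial Exec: the offset is loaded from a variable (a GOT slot), then
   added to the TLS base pointer. Requires the variable to live in the
   static TLS block. */
static _Thread_local int counter_initial_exec
    __attribute__((tls_model("initial-exec")));

/* Global Dynamic: the general model for variables in shared libraries,
   which derives the address through a function call. */
static _Thread_local int counter_global_dynamic
    __attribute__((tls_model("global-dynamic")));

/* Each function increments its counter for the calling thread only. */
int bump_local_exec(void)     { return ++counter_local_exec; }
int bump_initial_exec(void)   { return ++counter_initial_exec; }
int bump_global_dynamic(void) { return ++counter_global_dynamic; }
```

Comparing the disassembly of the three functions when built as a shared library versus an executable is a quick way to see the difference on a given platform.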

If we can put the concurrency runtime's thread-local storage in the static TLS block, we should be able to use at worst the Initial Exec access pattern, which should be efficient enough to quiet any concerns as long as deriving the TLS base pointer itself isn't too slow. (I don't know what deriving the TLS base pointer looks like on different platforms; if that alone requires a function call, we're somewhat doomed.) I believe we can do this if we can include a small object file that defines this storage in every executable that will load Swift code. That's not a reasonable request for most dynamic libraries, but it might not be unreasonable for a language runtime. It would make it difficult to implement something in Swift like a plugin for an existing executable, though.
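
A minimal sketch of that arrangement, assuming a GCC- or Clang-style toolchain on ELF: one declaration that the runtime would see, marked Initial Exec, and one definition that would live in the small object file linked into the executable. The symbol name `hypothetical_swift_concurrency_tls` and the two-slot layout are invented for illustration. Both parts are shown in a single file here only so the example builds on its own; in the scheme described above they would be in separate files.

```c
#include <stddef.h>

/* What the runtime (possibly a shared library) would declare. The
   initial-exec model is only correct if the definition ends up in the
   static TLS block, i.e. in the executable. */
extern _Thread_local void *hypothetical_swift_concurrency_tls[2]
    __attribute__((tls_model("initial-exec")));

/* What the small force-linked object file would define. */
_Thread_local void *hypothetical_swift_concurrency_tls[2] = { NULL, NULL };

/* Slot 0 holds the current task in this made-up layout. */
void *hypothetical_get_current_task(void) {
    return hypothetical_swift_concurrency_tls[0];
}

void hypothetical_set_current_task(void *task) {
    hypothetical_swift_concurrency_tls[0] = task;
}
```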

compnerd · May 21, 2021, 1:39am

I believe that, in general, accessing the TLS base on Linux on ARM is always a function call. Usually, the cost of that function call is negligible (the function is part of the AEABI and is __aeabi_read_tp). On Android, though, threading is emulated. The result is that it doesn't conform to AEABI and will instead go through the emulated TLS path (calling __emutls_get_address, IIRC). This, unfortunately, is much more expensive. (If you are interested, Windows does a pretty good job here, but that is not ELFish.)

The story on ARM64 is better: Linux can avoid the function call because the value is in a register. (Windows does well here too.) The problematic area is again Android, where it is an expensive function call due to the emulated TLS.

Thinking a bit more about this, it seems that on Android specifically, if we always force static linking of the runtime in order to get Concurrency support, we could use Local Exec. However, I recall that I had experimented with doing some unspeakable things (which I would need to go back to the sources to recall) which allowed me to steal a slot from the loader which is reserved by the system. In that case, the performance on Android would actually be fairly comparable to the other platforms.

John_McCall · May 21, 2021, 7:20pm

By "emulated", you mean they're using userspace context-switching? I don't suppose there's actually a guarantee of non-concurrency.

compnerd · May 21, 2021, 8:47pm

Right, the emulated TLS is a user-space implementation which has a local (locked) array that it walks for the TLS data.

John_McCall · May 21, 2021, 9:03pm

Well. You know, at some point, it would be a better use of our time to submit a patch to give Android a better TLS implementation. If they want to force the use of a function call for ABI purposes, that's totally within their rights, but there's no way their userspace thread scheduler doesn't have efficient access to a userspace thread object that could spare a single pointer to make TLS access not involve a locked lookup table search.

John_McCall · May 21, 2021, 9:17pm

It looks like __get_tls() itself is quite efficient (and does indeed take advantage of integration with the scheduler), and pthread_getspecific and pthread_setspecific are dominated by that, so maybe Android just needs to degrade to that the same way we do on Darwin simulator platforms.
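
For reference, a fallback built on `pthread_getspecific` and `pthread_setspecific` could look roughly like the sketch below. This is only an illustration of the standard POSIX calls with hypothetical names (`fallback_get_current_task` and friends); it is not the runtime's actual fallback code, and it creates its key lazily rather than relying on any reserved slot.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_key_t fallback_task_key;
static pthread_once_t fallback_task_key_once = PTHREAD_ONCE_INIT;

/* Runs exactly once per process to create the key. No destructor is
   registered because the stored pointer is not owned by this code. */
static void fallback_make_key(void) {
    int rc = pthread_key_create(&fallback_task_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Store the current task pointer for the calling thread. */
void fallback_set_current_task(void *task) {
    pthread_once(&fallback_task_key_once, fallback_make_key);
    int rc = pthread_setspecific(fallback_task_key, task);
    assert(rc == 0);
    (void)rc;
}

/* Load the current task pointer for the calling thread (NULL if unset). */
void *fallback_get_current_task(void) {
    pthread_once(&fallback_task_key_once, fallback_make_key);
    return pthread_getspecific(fallback_task_key);
}
```

The `pthread_once` check on every access is the price of lazy key creation in this sketch; a runtime that can create its key during startup would not need it.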

John_McCall · May 21, 2021, 10:20pm

That said, I'm not sure Android's emulated TLS implementation is quite as bad as you say. There's a load-acquire on the fast path that I think could be relaxed with a little bit of work, and it loses some performance from not being integrated into libc, and it's too bad that it has to dynamically check whether the environment is threaded (?); but it at least avoids locking on the fast path. Assuming this is the libgcc emutls implementation, that is.
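
To illustrate the general shape being described (a load-acquire on the fast path, with locking only when a variable is first used), here is a simplified toy version. It is not the libgcc or Android code: the names are hypothetical, the check for whether the environment is threaded is omitted, and per-thread memory is never freed.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical control object, one per emulated thread-local variable. */
typedef struct {
    size_t size;          /* size of the variable in bytes */
    atomic_size_t index;  /* 1-based slot number, 0 until first use */
} ToyEmuTlsControl;

/* Per-thread table of lazily allocated variable instances. */
typedef struct {
    size_t count;
    void **slots;
} ToyEmuTlsTable;

static pthread_key_t toy_key;
static pthread_once_t toy_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t toy_lock = PTHREAD_MUTEX_INITIALIZER;
static size_t toy_next_index = 0;

static void toy_make_key(void) { pthread_key_create(&toy_key, NULL); }

void *toy_emutls_get_address(ToyEmuTlsControl *control) {
    /* Fast path: one load-acquire of the slot index, no lock taken. */
    size_t index = atomic_load_explicit(&control->index, memory_order_acquire);
    if (index == 0) {
        /* Slow path, first use of this variable in the process:
           assign a slot number under a lock. */
        pthread_once(&toy_once, toy_make_key);
        pthread_mutex_lock(&toy_lock);
        index = atomic_load_explicit(&control->index, memory_order_relaxed);
        if (index == 0) {
            index = ++toy_next_index;
            atomic_store_explicit(&control->index, index, memory_order_release);
        }
        pthread_mutex_unlock(&toy_lock);
    }

    /* Find the calling thread's table, creating it on first use. */
    ToyEmuTlsTable *table = pthread_getspecific(toy_key);
    if (table == NULL) {
        table = calloc(1, sizeof *table);
        if (table == NULL) abort();
        pthread_setspecific(toy_key, table);
    }
    /* Grow the table if this slot is beyond its current end. */
    if (table->count < index) {
        void **grown = realloc(table->slots, index * sizeof *grown);
        if (grown == NULL) abort();
        memset(grown + table->count, 0, (index - table->count) * sizeof *grown);
        table->slots = grown;
        table->count = index;
    }
    /* Allocate this thread's zero-initialized copy on first access. */
    void **slot = &table->slots[index - 1];
    if (*slot == NULL) {
        *slot = calloc(1, control->size);
        if (*slot == NULL) abort();
    }
    return *slot;
}
```

Even on the fast path this does noticeably more work than a register read plus a load: an atomic load, a `pthread_getspecific` call, and a couple of dependent loads and branches.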

ktoso · May 25, 2021, 12:34pm

Thank you for the thread John!

I was personally somewhat nervous about this part of the ABI to be honest.

We double-checked this with @johannesweiss and @lukasa, who had some great insights here. In general we think it's likely going to be fine, at least on 64-bit Linux platforms – it is no worse than Swift's existing assumptions about cheap thread-locals, for which swift_beginAccess is the precedent.

A good resource has been https://www.akkadia.org/drepper/tls.pdf; pages 5 and 6 describe the offsets for dynamically loaded modules.

That would be great, but I don't know how to achieve that; I assume you do, however.

John_McCall · May 25, 2021, 6:23pm

I think we might already have a little .o that we force-link, we'd just need to add a thread-local definition to that.

ktoso · May 25, 2021, 9:16pm

I see, thanks -- that'd be excellent