How expensive is a task cancellation check?

I'm trying to build an intuition for how expensive a single task cancellation check is. My ultimate goal is to get (or write myself) some guidance on how often a computation-intensive function should perform manual cancellation checks to be a good concurrency citizen.
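For concreteness, this is the kind of pattern I have in mind (the function and the check interval are made up; picking a good interval is exactly the question):

```swift
// Hypothetical CPU-bound function that polls for cancellation every
// `checkInterval` iterations instead of on every pass through the loop.
// How large `checkInterval` should be is what I'm trying to figure out.
func sumOfSquares(upTo n: Int, checkInterval: Int = 1_000) async throws -> Int {
    var total = 0
    for i in 0..<n {
        if i % checkInterval == 0 {
            // Throws CancellationError if the surrounding task was cancelled.
            try Task.checkCancellation()
        }
        total += i * i
    }
    return total
}
```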

I started by writing a benchmark, but I'm not yet confident enough in the results (i.e. my benchmarking method) to trust or share them.

Next, I looked at the implementation of `static var Task.isCancelled` in the standard library to get an idea of the potentially expensive operations. From what I gathered, the getter effectively performs these steps:

  1. Get a pointer to the current task object from thread-local storage.
  2. Retain the task object (= an atomic increment of the refcount field?).
  3. Perform an atomic read (ordering: `std::memory_order_relaxed`) of the task object's status field to get the cancellation status.

To people familiar with the implementation, is this sequence correct or did I miss something?
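Tangentially: if the retain/release pair in step 2 turns out to matter, my understanding is that `withUnsafeCurrentTask` exposes a cancellation check that skips it. A sketch (I haven't measured whether the difference is significant):

```swift
// Sketch: UnsafeCurrentTask is an unmanaged handle to the current task,
// so reading its isCancelled flag should avoid the retain/release that
// the static Task.isCancelled accessor performs. The handle must not
// escape the closure.
func isCancelledWithoutRetain() -> Bool {
    withUnsafeCurrentTask { task in
        // `task` is nil when called from outside any task.
        task?.isCancelled ?? false
    }
}
```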

Even if I can't assign a fixed "cost" to each of the operations listed above, knowing what happens under the hood is helpful because it lets me compare the cost of a cancellation check to the cost of my actual code.


Related question: how expensive is it (relative to e.g. reading an atomic or reading from an uncontended Mutex) to read a pointer from thread-local storage?

There are two "modes" for TLS on Darwin (I haven't looked into how other platforms do it):

  • a set of hardcoded TLS keys the system uses, which are "read system register + load with offset". This is relevant for e.g. Swift Concurrency's internal use of TLS.
  • dynamically allocated keys, which are that plus a table lookup of some sort.
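The dynamically-allocated-key mode is the one exposed through the pthread API. A minimal sketch (the stored value is arbitrary):

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// Sketch of the "dynamically allocated key" mode: pthread_key_create
// hands back an index into a per-thread table, and each subsequent
// pthread_getspecific is (roughly) a TLS base read plus a table lookup
// with that index.
var key = pthread_key_t()
pthread_key_create(&key, nil)

let stored = UnsafeMutableRawPointer(bitPattern: 0x1234)
pthread_setspecific(key, stored)
let loaded = pthread_getspecific(key)
```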

Locking a mutex actually needs to do very similar work to the former as part of its operation, since the thread identifier is used as the "locked" value. So an uncontended mutex lock/unlock pair is basically "hardcoded TLS read + compare-and-swap + compare-and-swap" (one CAS to lock, one to unlock).

Also worth noting that for extremely fast operations like these, the overhead of the dyld stub to call the function at all may actually be higher than the cost of the function itself. It accounts for about 30% of the execution time in a tight retain-release loop benchmark.
