[Concurrency] Performant approaches for global vars that are only set once

I’m currently looking at converting a number of old libraries to Swift 6. For various reasons both technical and non-technical, I am trying to minimize the blast radius of the changes, from both an API standpoint and a performance one; some of these libraries are accessed very often. I am aware that there are much better patterns than the one I am about to describe, but they all require changes to the implementations that aren’t really feasible under the current circumstances.

These libraries all have a few things in common:

  1. They’re synchronously, globally accessible (and should be synchronously, globally accessible, in the same way that, say, the file system is synchronously, globally accessible)
  2. They’re responsible for returning values (think localized strings or UserDefaults); they don’t just listen to events
  3. Their implementations were intended to be thread safe, and after we migrate to Swift 6 this should be basically a guarantee
  4. They need dependencies that can’t be obtained globally
  5. …so they are implemented as a global var
  6. That global var is only set once, in a “startup sequence.” There’s no compile-time safeguard for this, but in practice it’s as close to guaranteed as you can get; let’s just take it as a given for the purposes of this discussion

Here’s an example:

// global scope

var myThing: MyThingProtocol? // don't @ me about the existential, it was 2016

protocol MyThingProtocol: Sendable { // they don't currently have this conformance, but this was the original intention for all of them
    // requirements elided
}

// startup sequence:

func startup(someDependency: SomeDependency) {
    let theThing = MyThing(someDependency)
    myThing = theThing
}

Swift 6, naturally, complains that this is unsafe. And an obvious solution would be to simply swap this out for something like:

import os

let myThing: OSAllocatedUnfairLock<MyThingProtocol?> = .init(initialState: nil)
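Concretely, reads and writes would then all funnel through withLock. Here's a sketch, reusing MyThing and SomeDependency from the example above (and assuming SomeDependency is Sendable, since withLock takes a Sendable closure):

// startup sequence: set the state once, under the lock
func startup(someDependency: SomeDependency) {
    myThing.withLock { $0 = MyThing(someDependency) }
}

// call sites: every read briefly takes and releases the lock
func currentMyThing() -> MyThingProtocol? {
    myThing.withLock { $0 }
}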

You could also very reasonably argue that if the lock is hoisted out to this level, you might be able to remove locks from other places inside the library to “pay off” the performance deficit we’d be incurring with a second lock.

In practice, however, we’ve never seen any crashes that can be directly attributed to this pattern, and if I write a simple program like the following:

class C: P {
    
    var a: [Int] = [99]
    
}

protocol P: AnyObject {
    
    var a: [Int] { get set }
    
}

var p: P? = nil

// in main.swift

p = C()

await withTaskGroup { group in
    for _ in 0..<1000 {
        group.addTask {
            print(p?.a.first as Any)
        }
    }
}

TSAN has no complaints.

Of course, my real use case might look more like

// main.swift

await withTaskGroup { group in
    for i in 0..<1000 {
        group.addTask {
            if i == 50 { // picking a random time to actually set this, don't worry about accesses that missed the opportunity to read a real value, they'll be fine 
                p = C()
            }
            print(p?.a.first as Any)
        }
    }
}

…since I have no control over when people try to access these. And of course, this does trigger TSAN, but it’s not clear to me that it will ever crash.

It occurred to me that perhaps I could use some of the machinery that’s used in global lets, since those are only set once and presumably are very fast to read. I’m also aware of concurrency primitives like memory barriers that might help, though I’ve never used them in practice, and based on what I’ve read of the original dispatch_once implementation, they seem difficult to use correctly without prior experience.

So in summary, I’m looking for a solution that lets me keep the fast reads, or some hard evidence that this pattern can crash even with the access pattern I’m describing, which I can bring to the powers-that-be to convince them that they always needed a lock. I’m okay with slowing down setting the global var.

Do you have any measurable performance concerns about using a lock? If not, the benefits of just accepting one (even “for now”) far outweigh the theoretical costs. Since your classes themselves are already Sendable (or will become so eventually), all you’ll be doing with those locks is reading and writing the one pointer stored in them, not holding them for a prolonged time while performing some critical section of a convoluted algorithm. I wouldn’t be surprised if this compiles to some mere 10 CPU instructions or even fewer.

This means that statistically there’s an incredibly low chance that two threads will ever contend on that lock to begin with, and unless you’re developing a real-time video renderer or something similar, you will never notice it.

If you’re still concerned, you should also grab that instance once, rather than rereading the global in every task. That is, not like this:

await withTaskGroup { group in
    for i in 0..<1000 {
        group.addTask {
            print(p?.a.first as Any) // every task rereads the global
        }
    }
}

but like this:

let myP = p // read the global once, up front
await withTaskGroup { group in
    for i in 0..<1000 {
        group.addTask {
            print(myP?.a.first as Any)
        }
    }
}


The issue with global lets (which use dispatch_once-style machinery) is that whatever first accesses the value actually needs to observe and confirm that the dependency MyThing requires has itself been written somewhere (since you only get one chance to initialize that thing), so now it’s a cascading problem.


Yeah, I wouldn’t be at all surprised if the dyld stub overhead of merely calling the lock() function takes as long as actually acquiring the lock, iff uncontended. The fast path for os_unfair_lock_lock is just:

  1. load TSD base + offset
  2. CAS
  3. check if the CAS succeeded
  4. return

plus one instruction for the branch to the stub, and… I think 3-4 for the stub, so yeah 8-9 instructions total probably.


Could you skip the lock and just put it on some type? From the Type Properties section of The Swift Programming Language:

Stored type properties are lazily initialized on their first access. They’re guaranteed to be initialized only once, even when accessed by multiple threads simultaneously
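If the value can be built on demand, a minimal sketch looks like this (Globals is an illustrative name; note this assumes the dependency can be constructed at first access rather than injected during a startup sequence, which is exactly the cascading problem described above):

enum Globals {
    // Lazily initialized on first access, and guaranteed to be
    // initialized exactly once, even across threads; this uses the
    // same once-style machinery as global lets.
    static let myThing: MyThingProtocol = MyThing(SomeDependency())
}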


What are the risks of false sharing (implied by your “iff”, of course) when declaring two or more such mutex-protected static lets one after another? How are these laid out in practice?

If the overhead of the lock is too high, you could consider using AtomicLazyReference instead, which implements set-once behavior.
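A sketch of how that might look with the Synchronization module’s AtomicLazyReference (the stored instance must be a class; the names here are illustrative):

import Synchronization

struct SomeDependency: Sendable {}

final class MyThing: Sendable {
    init(_ dependency: SomeDependency) {}
}

// a set-once slot; load() is a single atomic load
let myThingSlot = AtomicLazyReference<MyThing>()

// startup sequence: if two threads race here, exactly one instance is
// published, and the loser’s freshly created instance is discarded
func startup(someDependency: SomeDependency) {
    _ = myThingSlot.storeIfNilThenLoad(MyThing(someDependency))
}

// call sites: returns nil until startup has run
func currentMyThing() -> MyThing? {
    myThingSlot.load()
}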

Definitely can be an issue! The underlying unfair lock is just 4 bytes, so it’s easy to accidentally pack a bunch of them onto a single cache line.

In my experience, though, most people are either mostly locking them all from the same thread, not locking them that often, or have real contention to deal with first. One obvious exception is if you’re using striped mutexes for more parallelism; you’ll want to put padding between them if so.


Thanks David. It’s not clear if it’ll be uncontended: we know the current access pattern is 70% on the MainActor, but it’s not guaranteed to remain that way, especially as we adopt Swift 6, and I’m unsure of the timing of the non-main-actor accesses.

Curious if you’d consider pthread_rwlock, given that we have basically no writes, or the AtomicLazyReference suggested by @j-f1, which seems like exactly what we need for this case. Though if I’m reading the pseudocode(?) that you posted, CAS = “compare atomic set,” so I’m unclear on what the difference would be between that and the unfair lock?

CAS is “compare and swap”, which is one of the most common “building block” atomic operations CPUs provide. It does this, but without the possibility of another thread changing the variable in between the == and the =:

func nonatomicCAS<T: Equatable>(variable: inout T, expected: T, new: T) -> Bool {
  if variable == expected {
    variable = new
    return true
  } else {
    return false
  }
}

When used in a lock, variable will be the lock word itself, not the value it’s protecting, which is what makes it different from an atomic value. If it returns false, the lock function will interpret that as “someone has already set the lock to the locked value” and make a kernel call to wait patiently until it’s unlocked.
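To make that concrete, here is a toy (deliberately naive, busy-waiting) lock built on a CAS, using the Synchronization module; note the CAS targets the lock word itself, not the value the lock protects:

import Synchronization

final class ToyLock: Sendable {
    // the "variable" from the snippet above: the lock word itself
    private let isLocked = Atomic<Bool>(false)

    func lock() {
        // try to flip false -> true; failure means someone else holds
        // the lock, and a real lock (like os_unfair_lock) would make a
        // kernel call to wait instead of spinning like this
        while !isLocked.compareExchange(
            expected: false, desired: true, ordering: .acquiring
        ).exchanged {
            // spin until the lock word is released
        }
    }

    func unlock() {
        isLocked.store(false, ordering: .releasing)
    }
}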

If AtomicLazyReference is suitable for your needs it’s a great solution. I would expect it to be marginally faster than a lock when uncontended[1], and potentially quite a lot faster under heavy contention. One thing to be aware of is that while it will enforce that one and only one value eventually gets set, it will not enforce that one and only one value gets created in the first place, so if your value’s init has side effects, they may run more than once. This is hopefully obvious, given that creating the value happens entirely outside the AtomicLazyReference, so there’s nothing it could possibly do about it, but I’ve seen it catch people off guard before, so I figured it was worth mentioning.

I’m personally not a fan of pthread_rwlock, or rwlocks in general[2]. I think in your case you wouldn’t be tripping over any of the downsides of them though, so it’s not unreasonable.


  1. Reading a lock-protected value will do: atomic CAS, nonatomic read, atomic CAS; whereas reading from the lazy reference will do: atomic read. ↩︎

  2. Outside of this “set once and only once” usage pattern, rwlocks have an unfortunate tendency to either degrade into clunky regular locks under shockingly low percentages of writes vs. reads, or to allow writer starvation under heavy read loads, neither of which is an appealing trait. On top of that, they also break priority donation. ↩︎
