[Concurrency] Performant approaches for global vars that are only set once

I’m currently looking at converting a number of old libraries to Swift 6. For various reasons both technical and non-technical, I am trying to minimize the blast radius of the changes, from both an API standpoint and a performance one; some of these libraries are accessed very often. I am aware that there are much better patterns than the one I am about to describe, but they all require changes to the implementations that aren’t really feasible under the current circumstances.

These libraries all have something in common:

  1. They’re synchronously, globally accessible (and should be synchronously, globally accessible, in the same way that, say, the file system is synchronously, globally accessible)
  2. They’re responsible for returning values (think localized strings or UserDefaults), they don’t just listen to events
  3. Their implementations were intended to be thread safe, and after we migrate to Swift 6 this should be basically a guarantee
  4. They need dependencies that can’t be gotten globally
  5. …so they are implemented as a global var
  6. That global var is only set once, in a “startup sequence.” There’s no compile-time safeguard for this, but in practice it’s as close to guaranteed as you can get; let’s just take it as a given for the purposes of this discussion

Here’s an example:

// global scope

var myThing: MyThingProtocol? // don't @ me about the existential, it was 2016

protocol MyThingProtocol: Sendable { // they don't currently have this conformance, but this was the original intention for all of them

} 

// startup sequence:

func startup(someDependency: SomeDependency) {
    let theThing = MyThing(someDependency)
    myThing = theThing
}

Swift 6, naturally, complains that this is unsafe. And an obvious solution would be to simply swap this out for something like:

let myThing: OSAllocatedUnfairLock<MyThingProtocol?> = .init(initialState: nil)

You could also very reasonably argue that if the lock is pulled out to here, you might be able to remove locks from other places inside the library to “pay off” the performance deficit we’d be incurring with a second lock.
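For concreteness, here’s a minimal sketch of what that swap could look like, with a computed global preserving the original synchronous read spelling (the types are placeholders standing in for the real library, redeclared so the sketch compiles on its own):

import os

// Placeholder stand-ins for the real library types.
protocol MyThingProtocol: Sendable {}
struct SomeDependency: Sendable {}
struct MyThing: MyThingProtocol {
    init(someDependency: SomeDependency) {}
}

// The lock owns the optional existential; the computed global keeps the
// old synchronous, global read spelling at every call site.
private let _myThing = OSAllocatedUnfairLock<(any MyThingProtocol)?>(initialState: nil)

var myThing: (any MyThingProtocol)? {
    _myThing.withLock { $0 }
}

func startup(someDependency: SomeDependency) {
    _myThing.withLock { $0 = MyThing(someDependency: someDependency) }
}

Reads stay synchronous everywhere; only the internals of the accessor change.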

In practice, however, we’ve never seen any crashes that can be directly attributed to this pattern, and if I write a simple program like the following:

class C: P {
    
    var a: [Int] = [99]
    
}

protocol P: AnyObject {
    
    var a: [Int] { get set }
    
}

var p: P? = nil

// in main.swift

p = C()

await withTaskGroup { group in
    for _ in 0..<1000 {
        group.addTask {
            print(p?.a.first as Any)
        }
    }
}

TSAN has no complaints.

Of course, my real use case might look more like

// main.swift

await withTaskGroup { group in
    for i in 0..<1000 {
        group.addTask {
            if i == 50 { // picking a random time to actually set this, don't worry about accesses that missed the opportunity to read a real value, they'll be fine 
                p = C()
            }
            print(p?.a.first as Any)
        }
    }
}

…since I have no control over when people try to access these. And of course, this does trigger TSAN, but it’s not clear to me that it will ever crash.

It occurred to me that perhaps I could use some of the machinery that’s used for global lets, since those are only set once and presumably are very fast to read. I’m also aware of concurrency primitives like memory barriers that might help, though I’ve never used them in practice, and judging from what I’ve read of the original dispatch_once implementation, they seem difficult to get right without prior experience.

So in summary, I’m looking for either a solution that lets me keep the fast reads, or some hard evidence that this can crash even under the access pattern I’m describing, which I can bring to the powers-that-be to convince them that they always needed a lock. I’m okay with slowing down setting the global var.

Do you have any measurable performance concerns about using a lock? If not, the benefits of just using one (even “for now”) far outweigh the theoretical costs. Since your classes themselves are already Sendable (or will become so eventually), all you’ll be doing with those locks is reading and writing the one pointer stored in them, not holding them for a prolonged time while performing some critical section of a convoluted algorithm. I wouldn’t be surprised if this compiles down to a mere 10 CPU instructions or even fewer.

This means that, statistically, the chance that two threads will ever contend on that lock to begin with is incredibly low, and unless you’re developing a real-time video renderer or something similar, you will never notice it.
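If you want to put a number on it, a quick micro-benchmark along these lines can compare an uncontended locked read against a plain one (a sketch, not measured data; build with -O, and note the compiler may constant-fold the plain read):

import os

// main.swift — rough sketch comparing uncontended locked reads to plain
// reads. Absolute numbers vary by machine; build with optimizations.
let locked = OSAllocatedUnfairLock<Int>(initialState: 42)
let plain = 42
let iterations = 10_000_000

var sink = 0
let clock = ContinuousClock()

let lockedTime = clock.measure {
    for _ in 0..<iterations { sink &+= locked.withLock { $0 } }
}
let plainTime = clock.measure {
    for _ in 0..<iterations { sink &+= plain }
}

print("locked:", lockedTime, "plain:", plainTime, sink)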

If you’re concerned still, you should also be grabbing that instance once — i.e., not like this

await withTaskGroup { group in
    for i in 0..<1000 {
        group.addTask {
            print(p?.a.first as Any)
        }
    }
}

but like this:

let myP = p
await withTaskGroup { group in
    for i in 0..<1000 {
        group.addTask {
            print(myP?.a.first as Any)
        }
    }
}


The issue with global lets (i.e., the dispatch_once machinery) is that whatever first accesses the value needs to observe and confirm that the dependency MyThing requires has itself been written somewhere (since you only get one chance to initialize the thing), and now it’s a cascading problem.


Yeah I wouldn’t be at all surprised if the dyld stub overhead to merely call the lock() function takes as long as actually acquiring the lock IFF uncontended. The fast path for os_unfair_lock_lock is just

load TSD base + offset
CAS
check if CAS succeeded
return

plus one instruction for the branch to the stub, and… I think 3-4 for the stub, so yeah 8-9 instructions total probably.

It seems what you’re describing is something that should be globally accessible on the main actor, rather than something that is globally synchronously accessible.

If the latter is necessary then I agree with the above that a simple lock is sufficient, unavoidable, and appropriate, though I suspect from your “2016” comment that I know the project you’re referring to, and it’s worth paying the price of the one-off churn of the @MainActor annotation.

Have you explored SE-0466 to minimize the blast radius of the API change?
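As a sketch of what that direction could look like (placeholder types again; this assumes main-actor access is acceptable at every call site, and SE-0466-style per-module default isolation can cut down the annotation churn):

// Placeholder types standing in for the real library.
protocol MyThingProtocol: Sendable {}
struct SomeDependency: Sendable {}
struct MyThing: MyThingProtocol { init(someDependency: SomeDependency) {} }

// Main-actor isolation instead of a lock: the global stays synchronously
// readable from main-actor code, and the compiler enforces the isolation
// the old implementation only promised.
@MainActor var myThing: (any MyThingProtocol)?

@MainActor func startup(someDependency: SomeDependency) {
    myThing = MyThing(someDependency: someDependency)
}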

Could you skip the lock and just put it on some type? From the Type properties section:

Stored type properties are lazily initialized on their first access. They’re guaranteed to be initialized only once, even when accessed by multiple threads simultaneously
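For illustration, a sketch of that shape (hypothetical Globals namespace, placeholder types; note this only helps if the dependency can itself be produced globally, which is exactly constraint 4 from the original post):

// Placeholder types standing in for the real library.
protocol MyThingProtocol: Sendable {}
struct SomeDependency: Sendable {}
struct MyThing: MyThingProtocol { init(someDependency: SomeDependency) {} }

// The runtime guarantees the initializer expression below runs at most
// once, even under concurrent first access (dispatch_once-style machinery).
// Caveat: SomeDependency has to be producible globally, which the original
// post says isn’t possible here — the cascading problem mentioned above.
enum Globals {
    static let myThing: any MyThingProtocol = MyThing(someDependency: SomeDependency())
}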

What are the risks of false sharing (implied by your IFF, of course) when declaring two or more such mutex-protected static lets next to each other? How are these laid out in practice?

If the overhead of the lock is too high, you could consider using AtomicLazyReference instead, which implements set-once behavior.
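A minimal sketch of that, assuming a concrete final class (AtomicLazyReference stores a class instance, so an existential would need a box) and the Swift 6 Synchronization module on a recent OS:

import Synchronization

// Sketch: a set-once global. Reads are a single atomic load with no lock;
// the first successful store wins and later attempts are discarded.
final class MyThing: Sendable {}

let myThingRef = AtomicLazyReference<MyThing>()

func startup() {
    // storeIfNil returns whichever instance actually ended up stored.
    _ = myThingRef.storeIfNil(MyThing())
}

var myThing: MyThing? {
    myThingRef.load()
}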

Definitely can be an issue! The underlying unfair lock is just 4 bytes, so it’s easy to accidentally pack a bunch of them onto a single cache line.

In my experience, though, most people are either mostly locking them all from the same thread, not locking them that often, or have real contention to deal with first. One obvious exception is if you’re using striped mutexes for more parallelism; you’ll want to put padding between them if so.
