Low-Level Atomic Operations

First of all thanks @lorentey for this incredibly detailed writeup. I'm really looking forward to finally having atomics in the stdlib.

I agree that having a default ordering would make it easier to get started, but I also think that atomics are a highly specialized feature, and people should familiarize themselves with the functionality before using it. Making the ordering explicit forces people to think about it, instead of just relying on a less-than-optimal default.

As for the argument about whether atomic storage should only be accessible atomically, I think it would be okay to start that way, but many lock-free algorithms benefit from writing non-atomically and then committing all the writes in a single atomic store at the end. So I think we should definitely offer this at some point.
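To make the pattern concrete, here is a sketch written against the pitched API (the generic type name, the ordering names, and the store/load signatures are my assumptions, so treat this as illustrative pseudocode rather than compilable Swift): a producer performs plain non-atomic writes and then commits them with one releasing store; a consumer's acquiring load then observes the fully initialized data.

```swift
// Illustrative sketch only; written against the pitched (hypothetical) API.
struct Node {
  var value: Int
  var next: UnsafeMutablePointer<Node>?
}

func publish(to head: UnsafeAtomic<UnsafeMutablePointer<Node>?>, value: Int) {
  let node = UnsafeMutablePointer<Node>.allocate(capacity: 1)
  node.initialize(to: Node(value: value, next: nil)) // plain non-atomic writes
  head.store(node, ordering: .releasing)             // one atomic store commits them
}

func peek(at head: UnsafeAtomic<UnsafeMutablePointer<Node>?>) -> Int? {
  // The acquiring load orders the producer's non-atomic writes
  // before any reads we do through the pointer.
  guard let node = head.load(ordering: .acquiring) else { return nil }
  return node.pointee.value
}
```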

2 Likes

There are other use cases where this type is useful, for example single reader queues. If you think about it, UnsafeAtomicUnmanaged is pretty much the same as std::atomic for pointer types. Yes, you have to be careful when using it, but I think the name makes that very explicit.

3 Likes

FYI in case you missed it, those are already included/covered :slight_smile:
https://github.com/apple/swift/pull/27229/files#diff-c938716d1f6256da1ac82430457d71f2R14-R27

2 Likes

I think these operations are provided for convenience, to avoid duplicating the code in the caller.

Executable code might all be inlined into callers, but merely declaring a public type has ABI impact: the type has a name (which will appear in the name mangling of other ABI symbols), the type exposes metadata, and since there are inlinable functions that we want to be fast, the type will be @frozen, so its memory layout will be exposed, and so on.

2 Likes

I see. Thanks

I feel quite strongly that Pointer is not a good suffix for these. UnsafeAtomicInt is no more a pointer than Unmanaged is, and reusing an existing name clashes with atomic pointer types.

How about using the suffix Handle for these new animals? Calling something a FooHandle is a good indication that it is a reference-like thing capturing a Foo value.

"Handle" has some legacy baggage, but previous uses are roughly in the same ballpark and none of them apply to current Swift. (Here I'm following the de facto standard practice of studiously ignoring FileHandle. :fire:)

I'm also inclined to argue that the Handle suffix should imply memory unsafety, so that we can lose the Unsafe prefix:

struct AtomicHandle<Value: PrimitiveAtomic> {
  static func create(initialValue: Value) -> Self
  func destroy()

  init(at address: UnsafeMutablePointer<Value>)
}

struct UnfairLockHandle {
  static func create() -> Self
  func destroy()

  init(at address: UnsafeMutablePointer<os_unfair_lock>)

  func lock()
  func unlock()
} 

Note that this Handle/create/destroy/init(at:) mess is really only needed because we cannot have the design we really want yet. However, I think we ought to have a better (low-level) representation for synchronization constructs than classes.

None of this business will apply to the eventual non-copyable atomic types. Those will be neither value types nor reference types, which implies we have the freedom to name them without any such warts:

moveonly struct AtomicInt {
  init(_ value: Int)
  deinit
  func load(ordering: AtomicLoadOrdering = .sequentiallyConsistent) -> Int
  ...
}

moveonly struct UnfairLock {
  init()
  deinit
  func lock()
  func unlock()
}

If we had non-copyable types today, we would probably not be adding pointer-based atomics (or at least we would definitely not be adding them in the first round), because we wouldn't want to design the underlying layout of atomic types directly into the public API. (init(at:) is a critical part of the Handle concept; I find it really hard to convincingly argue against supporting "inline" storage, however ugly it is to implement with ManagedBuffer.)

2 Likes

To add one more +1 for a generic type like UnsafeAtomicPointer<Value> (however it gets named), it would be very nice for this to work with Implicitly Unwrapped Optionals as well.

1 Like

Good question! This is hitting directly at the issue of mixing atomic and nonatomic accesses, and how exactly UnsafeAtomic values get passed to other threads.

I think it's okay to require that any mechanism that communicates the atomic handle to other threads will involve appropriate acquire-release fences that order the handle's nonatomic initialization before any atomic load. (After all, std::atomic's constructor is also nonatomic.)

Yes! But I think it's fine if this only ever gets implemented through non-copyable types. (Incidentally, there is a rudimentary level of compiler support for extracting the stable memory location of the storage backing a global variable (within the same module, at least).)

I don't hate getOrCreate, but it does make it a lot less clear what atomicity means. It can also be surprising that the closure can get executed on multiple threads at once -- this isn't how e.g. dispatch_once works.

We could go with allocate/deallocate, but the pointer operations aren't a perfect match: create includes both allocation and initialization, and some cases of destroy also perform deinitialization. (UnsafeAtomicLazyReference in the initial batch is one such case.)
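To spell out what create/destroy bundle together, here is a hypothetical sketch of a plausible implementation (not the actual one; the `_address` stored property is an assumption on my part):

```swift
// Illustrative sketch only; not the actual implementation.
extension UnsafeAtomic {
  static func create(initialValue: Value) -> Self {
    let address = UnsafeMutablePointer<Value>.allocate(capacity: 1) // allocation...
    address.initialize(to: initialValue)                            // ...plus initialization
    return Self(at: address)
  }

  func destroy() {
    _address.deinitialize(count: 1) // deinitialization (where needed)...
    _address.deallocate()           // ...followed by deallocation
  }
}
```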

1 Like

I'd say UnsafeAtomicInt is more of a pointer than Unmanaged is. Unmanaged does not add an extra level of indirection on top of its base value; UnsafeAtomicInt does. (That is, Unmanaged<NSObject> has the same level of indirection as NSObject, but UnsafeAtomicInt does not have the same level of indirection as Int.)

That said, the name "Pointer" is more relevant if you have a generic wrapper anyway.

I agree that defaulting to sequential consistency will be undesirable in most production use of atomics. Sequential consistency seems to be the exception, not the rule -- so it should stick out in code review.

On the other hand, as a student of these things, I do deeply understand that concurrent programming on this level of abstraction is extraordinarily difficult to learn, and sequential consistency does provide an easier-to-understand model.

The question is: do the benefits that a default memory ordering would bring for beginners outweigh the drawbacks for scarred veterans? I believe this may require guidance from the Core Team.

2 Likes

One weird but possible option would be to put the defaulted versions in a separate module: import SimplifiedAtomics.

(I'm somewhat convinced by now that having to use a linter to find a misused default argument for atomics isn't great. I did make the same mistake with @_implementationOnly, where it was too easy to miss one and so we had to add a warning for inconsistency.)

2 Likes

ManagedBuffer.create is the closest analogue I could find in the stdlib to these create functions -- it's not a perfect match, but in both cases, the factory combines allocation and initialization.

In ManagedBuffer's case, this needed to be done through a static factory rather than an initializer because of complications with tail allocation; for UnsafeAtomic*, the factory is a stand-in for an eventual non-copyable type's initializer.

1 Like

One thing missing from the pitch and implementation, which I think is required, is a primitive for yielding the processor when implementing busy-wait loops. I think the examples provided are all suboptimal and actually need a PAUSE instruction at the end of each iteration. I'm by no means an expert, but I know it's considered best practice and that at least one CPU vendor [1] advises it.

This is not something we should expose. It is something that some low-level OS primitives might want to use to implement higher-level constructs, but unless you are the kernel and can disable preemption, in the context of a heterogeneous-priority environment, building anything around an atomic busy loop (spinlocks being the most obvious example) is incorrect for both performance and power.

I'm strongly opposed to this making it into the standard library "just because"; instead, the library should provide better higher-level constructs that the OS has a chance to optimize for you, providing adaptive spinning in a controlled way (only the OS can do that safely; the library can't).

6 Likes

Yes, this is exactly right -- the technical possibility of mixing atomic/nonatomic access is an unfortunate side effect of the requirements and limitations we need to work with.

This sounds good to me! We'll probably need to allow nonatomic deinitialization, too. (I think this is fine by the same reasoning nonatomic initialization is.)

I find that UnsafeMutablePointerToAtomic<UnsafeMutablePointer<Foo>> would be a terribly long-winded and confusing way to spell an atomic pointer type.

How about AtomicHandle<UnsafeMutablePointer<Foo>>?

Good point. I like the ifNil suffix, but I'd prefer to go with storeIfNil -- it would be nicely symmetric with load.

(What do people think of calling it storeOnce?)
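To pin down the semantics I have in mind, here is a sketch of the operation as a compare-exchange from nil (written against the pitched API; the `compareExchange` signature, the `_storage` property, and the ordering name are all my assumptions):

```swift
// Illustrative sketch only; not the actual implementation.
extension UnsafeAtomicLazyReference {
  /// Stores `desired` only if the current value is nil.
  /// Returns the value present after the operation, so a losing
  /// thread can discard its own instance and use the winner's.
  func storeIfNil(_ desired: Instance) -> Instance {
    let fresh = Unmanaged.passRetained(desired)
    let (exchanged, original) = _storage.compareExchange(
      expected: nil,
      desired: fresh,
      ordering: .acquiringAndReleasing)
    if exchanged { return desired }
    fresh.release() // We lost the race; undo our retain.
    return original!.takeUnretainedValue()
  }
}
```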

Stay tuned! I'm working on integrating feedback into a practical implementation for this.

The primary problem is that any generic atomic handle type will want to constrain its type parameter to some protocol. For example, assuming that we're happy to say that primitive atomics have the same memory layout as their corresponding value type:

protocol PrimitiveAtomic {
  static func atomicLoad(
    at address: UnsafeMutablePointer<Self>, 
    ordering: AtomicLoadOrdering
  ) -> Self
  ...
}

struct AtomicHandle<Value: PrimitiveAtomic> { 
  static func create(initialValue: Value) -> Self
  func destroy()
  init(at address: UnsafeMutablePointer<Value>)

  func load(ordering: AtomicLoadOrdering) -> Value {
    Value.atomicLoad(at: _address, ordering: ordering)
  }
  ...
}

extension Int: PrimitiveAtomic { ... }
extension UInt: PrimitiveAtomic { ... }
...
extension UInt8: PrimitiveAtomic { ... }

extension UnsafeMutablePointer: PrimitiveAtomic { ... }
extension Unmanaged: PrimitiveAtomic { ... }

This is all well and good so far. But optional pointers and optional unmanaged references pose a problem:

extension<Pointee> Optional: PrimitiveAtomic 
where Wrapped == UnsafeMutablePointer<Pointee> {
  ...
}
extension<Instance> Optional: PrimitiveAtomic
where Wrapped == Unmanaged<Instance> {
  ... 
}

We don't have generalized extensions, and even if we had implemented them, Optional wouldn't be able to conform to PrimitiveAtomic twice.

We could work around this by marking these types as "atomicable" through Optional:

protocol OptionalPrimitiveAtomic: PrimitiveAtomic { 
  static func atomicLoadOptional(
    at address: UnsafeMutablePointer<Self>, 
    ordering: AtomicLoadOrdering
  ) -> Self?
}
extension Unmanaged: OptionalPrimitiveAtomic { ... }
extension UnsafeMutablePointer: OptionalPrimitiveAtomic { ... }
extension Optional: PrimitiveAtomic where Wrapped: OptionalPrimitiveAtomic {...}

However, I don't believe the additional complexity of OptionalPrimitiveAtomic is worth it, especially when I consider that the most practical atomic pointer/reference types will likely be one-off implementations around double-wide atomics.

This is fine! It's really the other way round -- the desire is to provide direct API for commonly available dedicated atomic instructions, not to preclude operations that aren't directly implemented in hardware. The requirements in the Motivation section merely state that if direct processor support is available, then it should be used, and that implementation artifacts (like switches over orderings) get eliminated.

Good point. The intent is that there is no requirement for wait-freedom. (If, say, loadThenBitwiseAnd isn't directly available for, say, UInt64 values on some architecture, then the operation must still be implemented with a compare/exchange loop.) I'll clarify this.
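Spelled out, the compare/exchange fallback looks something like this (a sketch against the pitched API; the `compareExchange` signature and type name are assumptions):

```swift
// Illustrative sketch: emulating loadThenBitwiseAnd with a
// compare/exchange loop when there's no dedicated instruction.
func loadThenBitwiseAnd(
  _ atomic: UnsafeAtomicUInt64, with operand: UInt64
) -> UInt64 {
  var current = atomic.load(ordering: .relaxed)
  while true {
    let (exchanged, original) = atomic.compareExchange(
      expected: current,
      desired: current & operand,
      ordering: .sequentiallyConsistent)
    if exchanged { return original } // the value before the AND
    current = original               // someone beat us; retry with the fresh value
  }
}
```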

:+1: (FWIW, I'm not aware of any type or operation that wouldn't be implementable on the architectures we currently support. The current implementation assumes that all fixed-width integer types are layout compatible with their atomic variant, which seems to hold true as well -- even for things like UInt64 on 32-bit platforms. We may need to restrict availability of double-wide atomics though, esp. if we support Linuxen running on older 64-bit CPUs without CMPXCHG16B.)

Yep, Atomics would be an excellent use case for submodules. (The stdlib will be fine though -- atomics aren't boundary types, so it can simply continue to maintain its own internal atomics implementation, with no measurable impact to its performance or (hopefully) code size. Stdlib engineers will suffer a little bit, but putting up with such things is practically part of the job description! :sweat_smile:)

4 Likes

Note: since we have create/destroy, there is no need to ever extract the pointer to the underlying memory location from an UnsafeAtomic* value, so I'm removing the address property.

(Code that uses init(at:) will be able to regenerate the pointer later through the same method it used to get the pointer at the time of the init(at:) call. (This may be necessary to correctly deinitialize the storage.))

I think there is a strong and reasonable argument to be made that nothing about this API is really suitable for beginners. Optimizing for predictability seems like a higher priority than progressive disclosure of complexity. I was one of the people who suggested a default argument, and I retract that suggestion!

-Chris

11 Likes

This is not something we should expose. It is something that some low-level OS primitives might want to use to implement higher-level constructs, but unless you are the kernel and can disable preemption, in the context of a heterogeneous-priority environment, building anything around an atomic busy loop (spinlocks being the most obvious example) is incorrect for both performance and power.

Thanks for chiming in!

Yeah, I hear the argument against spin loops in general, and it's a solid one AFAICS. It's definitely not something that has to be included in an initial atomics pitch, so it seems fine to skip it here until we've proven it would really help. If we ever end up with a design or use case that wants one, I'm happy to be proven wrong then and do something better :slight_smile:

For reference: my "oh nice!" reaction to PAUSE was based on pause recently being exposed in the JVM, to the delight of people implementing queues and messaging systems (JEP 285: Spin-Wait Hints). However, that proposal and its use case are very Intel-centric when one thinks about it.

I'd be (personally) happy to not have pause exposed in this initial pitch, and revisit it with proper discussion and use cases when the time comes.

Very glad the argument resonates, and thanks for reconsidering the suggestion :+1: Predictability / readability are indeed paramount in those APIs :slight_smile:

I agree with @Pierre_Habouzit on PAUSE. Whether there is a more useful alternative available is heavily platform-dependent. As @lorentey mentioned, ARM has WFE/SEV, where the thread at least gets paused until an event is signaled. On x86 there is MONITOR/MWAIT, which waits for a change on a specific memory location, but that is unfortunately only available at privilege level 0, so it's of no use here. Intel recently added UMONITOR/UMWAIT, but AFAIK that is only available on some Atom processors at the moment. With Excavator, AMD added MONITORX/MWAITX, which seems to be pretty much equivalent to UMONITOR/UMWAIT. Both take a timeout and wake up either when the watch triggers or when the timeout is exceeded.

Given that all of this is very platform dependent, I don't think it's feasible to expose this in the stdlib.

3 Likes

I think I'm starting to convince myself that it's worth it.

It would also be desirable to have a RawRepresentable extension to allow a limited set of custom atomic types:

// (UMP abbreviates UnsafeMutablePointer throughout this sketch.)
protocol AtomicProtocol {
  associatedtype AtomicStorage = Self
  static func atomicLoad(at address: UMP<AtomicStorage>) -> Self
}

extension Int: AtomicProtocol {...}
extension UInt: AtomicProtocol {...}
...
extension UInt8: AtomicProtocol {...}

// βš›οΈŽ βš›οΈŽ βš›οΈŽ
protocol NullableAtomicProtocol: AtomicProtocol {
  static func atomicLoadOptional(at address: UMP<AtomicStorage>) -> Self?
}
extension Optional: AtomicProtocol where Wrapped: NullableAtomicProtocol {
  typealias AtomicStorage = Wrapped.AtomicStorage
  static func atomicLoad(at address: UMP<AtomicStorage>) -> Self {
    Wrapped.atomicLoadOptional(at: address)
  }
}
extension UnsafeMutablePointer: NullableAtomicProtocol {...}
extension Unmanaged: NullableAtomicProtocol {...}

// βš›οΈŽ βš›οΈŽ βš›οΈŽ
protocol AtomicRepresentable: AtomicProtocol, RawRepresentable
  where AtomicStorage == RawValue {}
extension AtomicRepresentable {
  static func atomicLoad(at address: UMP<RawValue>) -> Self {
    Self(rawValue: RawValue.atomicLoad(at: address))!
  }
}

// βš›οΈŽ βš›οΈŽ βš›οΈŽ
struct AtomicHandle<Value: AtomicProtocol> {
  init(at: UMP<Value.AtomicStorage>)
  static func create(initialValue: Value) -> Self
  func destroy()

  func load() -> Value {
    Value.atomicLoad(at: _address)
  }
}  

This way, AtomicHandle can support:

// Integer types:
let counter = AtomicHandle<Int>.create(initialValue: 42)
let cnt32 = AtomicHandle<UInt32>.create(initialValue: 23)

// Optional and non-optional pointers:
let ptr1 = AtomicHandle<UnsafeMutablePointer<Node>>.create(initialValue: ...)
let ptr2 = AtomicHandle<UnsafeMutablePointer<Node>?>.create(initialValue: nil)

// Optional and non-optional unmanaged references:
let ref1 = AtomicHandle<Unmanaged<Foo>>.create(initialValue: Unmanaged.passRetained(Foo()))
let ref2 = AtomicHandle<Unmanaged<Foo>?>.create(initialValue: nil)

// Custom atomicable types:
enum State: Int, AtomicRepresentable {
  case starting
  case running
  case stopped
}
let state = AtomicHandle<State>.create(initialValue: .starting)
...

It is my sad duty to report that supporting implicitly unwrapped optionals doesn't seem feasible without reintroducing dedicated type(s) for such things. (Which I'm not willing to do.)

This requires three more protocols than I originally planned on adding, but I really like where it's going.

(These protocols/generics would be like FixedWidthInteger in that unspecialized usages won't work very well at all.)

3 Likes

Hello Pierre, thanks for chiming in, I really appreciate your input.

This is not something we should expose. It is something that some low-level OS primitives might want to use to implement higher-level constructs, but unless you are the kernel and can disable preemption, in the context of a heterogeneous-priority environment, building anything around an atomic busy loop (spinlocks being the most obvious example) is incorrect for both performance and power.

I see your point and completely agree. What I have in mind, however, is not necessarily spinlocks (with os_unfair_lock, there really is no need for custom spinlocks), but any algorithm that includes a compareAndExchange operation, which typically needs to be retried. Are you suggesting that retrying is bad in general and should be avoided? I don't think you are, and I also don't think it can be avoided altogether (otherwise we shouldn't even expose compareAndExchange).
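For example, a typical retry loop looks like this (a sketch against the pitched API; `spinLoopHint()` is a purely hypothetical stand-in for the kind of primitive I'm arguing for, and the `compareExchange` signature is an assumption):

```swift
// Illustrative sketch: an atomic increment via compare/exchange retry.
func increment(_ counter: UnsafeAtomicInt) {
  var current = counter.load(ordering: .relaxed)
  while true {
    let (exchanged, original) = counter.compareExchange(
      expected: current,
      desired: current + 1,
      ordering: .sequentiallyConsistent)
    if exchanged { return }
    current = original
    // This retry point is where a processor hint such as PAUSE belongs:
    // spinLoopHint() // hypothetical; no such function exists in the pitch
  }
}
```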

I'm strongly opposed to this making it into the standard library "just because"; instead, the library should provide better higher-level constructs that the OS has a chance to optimize for you, providing adaptive spinning in a controlled way (only the OS can do that safely; the library can't).

Agreed. I personally don't care whether we expose PAUSE specifically; my point is that it's important to expose something (anything, really) that can hint to the OS that we're in a busy-loop context (FWIW, Rust exposes a spin_loop_hint() top-level function) and, more importantly, that we shouldn't punt on this. Otherwise, we'll end up with exactly what you want to prevent: badly performing, energy-inefficient code that also adversely affects the rest of the system, because people will either write their own PAUSE/backoff/etc. strategies or ignore the issue altogether and write naive busy-wait loops.

1 Like