Memory footprint of UnsafeAtomic

y-ich · November 26, 2023, 1:21am

Hi. Thank you for nice package development!

This time, I need many, many atomic variables.

I look at the source code of swift-atomics.
UnsafeAtomic seems use an UnsafeMutablePointer as well as a raw Pointee.
So memory footprint is raw size + pointer size.
Is my understanding right?

And I wonder if it is possible to implement an atomic struct like C++ std::atomic, whose footprint is same as raw type.

Please let me know your thought. Thank you!

dnadoba · November 26, 2023, 1:41pm

I don't think so. UnsafeAtomic just has a single stored property which is a pointer to the atomic representation.

The newly proposed Atomics in the standard library also only have the memory footprint of the underlying atomic value representation and are even stored inline. Note that Atomics shouldn't be stored tightly packed because synchronisation usually happens at the granularity of a cache line. AFAIC there isn't currently a way to add padding automatically but you can do it manually.

y-ich · November 27, 2023, 1:51am

dnadoba san,

Thank you for your comments!

So you need both memory for pointer in UnsafeAtomic and memory for raw variable(or pointee) itself, which is allocated by static create method in the case of swift-atomics.
The memory for pointer is kind of large compared with targeted UInt8, for example.

“Atomic” proposal for standard library sounds nice for me.
I am looking forward to it.
Thank you!

dnadoba · November 27, 2023, 11:01am

If you can derive the pointer through other means you don't need to store UnsafeAtomic e.g. through an offset into a ManagedObject. You can then recreate UnsafeAtomic every time you want to use it. This allows UnsafeAtomic (the pointer) to just live on the stack instead of the heap. Therefore you only really requiring storage of the underlying atomic representation.

y-ich · November 27, 2023, 12:24pm

dnadoba san,

I think that I may have different context from yours.

I need many, many independent atomic variables since my application is multi-thread game tree search.
I don't care whether it is on stack or heap so much.

If you consume memory for both variables and pointers for atomic operations, you can search less than half of nodes on memory compared with consuming memory for variables only.

In such a case, std::atomic in C++ is attractive.

tera · November 27, 2023, 2:29pm

Can you show your tree "Node" data structure? Is it class or struct or enum or what?

Say, if it is class (or the field in question has stable address by other means, e.g. it is a struct residing in a field of a class) then you don't need any overhead to do atomicity (e.g. OSAtomicAdd32 or std::atomic<Int32> equivalent).

tera · November 27, 2023, 2:46pm

@dnadoba tells you that you don't have to store UnsafeAtomic values.... You just store your int32 (or whatever) storage variables and (re)create UnsafeAtomic only when needed, that way you don't use extra space for all the tree nodes, just one or a few nodes you are currently performing atomic operations on.

y-ich · November 28, 2023, 12:23am

Here is a data structure.

struct Node {
    var edges: [Edge]
}

struct Edge {
    var totalActionValue: Float16 = 0.0
    var visitCount: UInt32 = 0
    var virtualLoss: UInt8 = 0
}

You want nodes as many as possible and each node may have hundreds of edges.
All properties in Edge should be atomic for multi-thread search.

I was so surprised by your point.
I am sorry for my poor understanding.

Don't you need UnsafeAtomics for each target?
Then how do you access one raw value via UnsafeAtomic(s?) from multiple threads?
Please let me know.
If there is a sample code on Internet, it may be enough for me.

Thank you!

tera · November 28, 2023, 12:35am

I fear you have a bigger problem at hand (lack of stable addresses and potentially a need to protect array accesses)... Please show where the Node elements live, preferably till the very "root" (I guess that would be either a field in a class or a global variable or a local variable in one of the root functions line main that would have a long life-time.

If to speculate, is it something like this?

struct TreeNode {
    var node: Node
    var children: [TreeNode]
}

var rootTree: TreeNode

Also this: do you first construct the tree (from a single thread) and then modify individual nodes from different threads but without adding / removing edges? Or do you also add/remove edges from different threads?

y-ich · November 28, 2023, 1:08am

tera san,

I omitted tree construction and edge generation part because I do them via the control of dispatchqueue.

You can assume that the tree is static when accessing properties of edges.

I want to focus data access that should be atomic.

You have one variable.
Each thread creates an UnsafeAtomic instance which refers to the variable.
Accessing via those UnsafeAtomic instances guarantees atomic access to the variable.
This is my current understanding. Is it right?

tera · November 28, 2023, 1:26am

Ok, let's assume a simple case:

class C {
    var node: Node
    init() {
        // setup a node of several edges
    }
}
let c = C()

now you have two threads each doing something like this:

c.node.edges[0].visitCount += 1

and you want that increment to be atomic.

Note that as written this is pretty much equivalent to:

var edge = c.node.edges[0]
edge.visitCount += 1 // we want this atomic
c.node.edges[0] = edge

which is not atomic, as a sequence as obviously this could happen:

Thread1
var edge = c.node.edges[0]
edge.visitCount += 1 // visitCount changed from 0 to 1

Thread2
var edge = c.node.edges[0]
edge.visitCount += 1 // visitCount changed from 0 to 1
c.node.edges[0] = edge // visit count stored is 1

Thread1
c.node.edges[0] = edge // visit count stored is STILL 1

Now, if you do have a reference...

let edge = c.node.edges[0]
let referenceObject = edge.visitCount
referenceObject.value += 1 // atomically somehow

that indeed would save you from having the whole sequence atomic... But, indeed you'd have to use reference objects (like ManagedAtomic) for the edges' content. It would indeed be quite heavy memory wise for a large number of nodes / edges.

I am not sure this is the best approach though. Maybe a simple lock is better in this case:

lock.underLock {
    var edge = c.node.edges[0]
    edge.visitCount += 1 // visitCount changed from 0 to 1
    c.node.edges[0] = edge // visit count stored is 1
}

Edit: another option is to make Edge a class: then its fields would have stable memory addresses and you could do atomic operations on them a-la std::atomic style.

let edge = c.node.edges[0]
edge.visitCount += 1 // use atomic operation here instead
// job done

Bear in mind that class instances do have a space and time overhead: about 16-20 bytes or so of space and ARC retain count management of time.

I'd love to see what others think. cc @dnadoba

lorentey · November 28, 2023, 2:18am

Like @dnadoba and @tera said, the original intended use of UnsafeAtomic was that it would be materialized on demand, not directly stored anywhere. It is almost impossible to retrieve a stable pointer to stored class properties in Swift though, and this makes UnsafeAtomic largely unusable for anything other than manually allocated chunks of memory. (Which is why it lives in a package, rather than the stdlib -- it is a flawed construct.)

The upcoming struct Atomic will be very close in spirit (but not implementation) to std::atomic in C++, and hopefully you'll find it a suitable equivalent.

dnadoba · November 28, 2023, 3:00pm

Even if you could get hold of a stable pointer the layout of the data is not ideal. This also applies to the proposed non-copyable Atomic types e.g.:

struct Node: ~Copyable {
    var edges: [Edge]
}
struct Edge: ~Copyable {
    var totalActionValue: Atomic<Float16> = 0.0
    var visitCount: Atomic<UInt32> = 0
    var virtualLoss: Atomic<UInt8> = 0
}

This looks nice, doesn't use a lot of memory but likely doesn't do what you want.

The different atomic variables are tightly packed in memory and as Array stores its elements contiguously in memory you end up with lots of Edges being in the same cache line. Roughly speaking the atomic operations on all memory in a cache line will be "serial" which might slow you significantly down, depending on your access patterns. So you either want to add padding or other data that isn't accessed through atomic operations. If you really want independent atomic variables they will require at least one cache line worth of memory each. This would be 128 bytes on Apple Silicon.

@lorentey please correct me if Atomic has planned any special padding that I'm not aware of.

tera · November 28, 2023, 3:49pm

Might be a long shot:

Is there a limit to the number of threads in your case? For example the number will be limited to a number of CPU cores (which is a small number like 8 or 32) – which would make sense if the task at hand is CPU bound – there would be no win to use, say, 1000 threads or any number bigger than the total number of CPU cores.
Can you afford using N times as much memory for the counters in question (where N is the above small number)?
Subsequently, do you need to read the "final" values from a single thread after all threads are done with changing the tree? And can tolerate that individual threads will not have the "current" counter information?

In that case you could sidetrack the whole atomicity issue by collecting the individual counters per thread and aggregating them in a single value at the post processing stage. No locks, no atomics, but yes, N times as much memory for the counters.

A side consideration: if the edges array can only hold up to a small number of edges (like 10 or 100) I'd seriously consider using a less heavy data structure than the Array.

scanon · November 28, 2023, 3:56pm

No padding is applied to Atomic values by default (and you wouldn't want it to be done unconditionally, because some algorithms that use atomics actually benefit from locality), and Swift doesn't let you specify alignment higher than 16B for a type. You can construct higher alignments with manual allocation, of course, and it's possible to introduce padding into an object's layout manually, or define its layout precisely in a header file (bridging C/C++ atomic types to Swift atomics will be an important future enhancement for this reason).

wadetregaskis · November 28, 2023, 3:59pm

Keep in mind, though, that it's entirely contingent on access pattern. Sparse (and effectively random) use of the tree will probably be fine, for example. And especially if the workers effectively subdivide the tree so that each branch ends up following a specific core¹ rather than bouncing around caches.

There might also be some platform-specific and compiler-cleverness factors to consider, e.g. whether there's transactional semantics supported in the hardware that can be used to aggregate the accesses within a single node (so effectively one "big" compare-and-swap rather than per-field compare-and-swap; less nasty under high contention).

Maybe I missed the clarification or a key implication, but it's not actually apparent to me why atomics are necessary for this…? Concurrent tree walks can be implemented virtually lock-free (and atomic-free) through subtree partitioning (which can even be dynamic if necessary to avoid long tails).

e.g. you start at the root and traverse the tree in normal BFS style, but each time you encounter a branch and there's room for more threads, you fork into multiple threads. The only mutual exclusion necessary is around thread management.

(note: a real-world, tuned algorithm might be more complicated by heuristics as to when it's worth it to delegate a subtree to a new thread, to avoid excessive overheads for trivial subtrees)

¹ Notwithstanding any bouncing between cores as a result of the kernel thread scheduler, although there's not really anything you can do about that. Apple's Xnu kernel does love to move threads about excessively.

wadetregaskis · November 28, 2023, 4:01pm

Although, upon further reflection, I'm wondering if you meant graph, not tree? i.e. can your graph have cycles?

tera · November 29, 2023, 12:12am

How many Edge structures in total are we talking about?

tera · November 29, 2023, 2:34am

Doesn't c.property have a stable memory address (until c's deinit)?

class C {
    var property: Int
}
var c = C()

lorentey · November 30, 2023, 7:29pm

Enhancing Swift so that it provides better control over the memory layout of custom types is an orthogonal language issue that is not covered by the atomics proposal. It isn't a trivial thing to add, as it will require careful design and detailed compiler work. This can (and, in my opinion, should!) be done some time after we land atomics.

(SE-0410 is already far too unwieldy for comfort, and we already had to spin off at least one major swift-atomics API feature (atomic strong references) to keep it from becoming completely unmanageable.)

As always, if you think some followup work needs to get done soon, then it's important to talk about the problems it currently causes -- on this forum, in bug reports, and elsewhere.