Does Swift guarantee sequential memory writes?

In C++, if I write this:

    #include <iostream>
    #include <thread>

    int main() {
        int a = 0;
        int b = 0;
        int c = 0;
        int d = 0;
        std::thread myThread([&a, &b]() {
            a = 10;
            b = 20;
        });

        std::thread myThread1([&b, &a, &c, &d]() {
            c = b;
            d = a;
        });

        myThread1.join();
        myThread.join();
        if (c == 20 && d == 0) {
            std::cout << "oi!!" << std::endl;
        }
    }

it's actually possible to get c == 20 and d == 0 (b can be assigned a value before a due to memory-access reordering). I tested the same thing in Swift using two DispatchQueues, but it doesn't happen. Why is that? Searching for "swift memory model" on Google doesn't give good results.

1 Like

Swift’s memory model is C’s memory model. DispatchQueues may issue a complete memory barrier depending on the architecture.
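To make that concrete: if you want explicit, well-defined orderings in Swift, the swift-atomics package (an extra dependency; nothing in the snippets in this thread uses it) exposes the same acquire/release vocabulary as C. A rough sketch:

import Atomics
import Dispatch

let a = ManagedAtomic<Int>(0)
let b = ManagedAtomic<Int>(0)

DispatchQueue.global().async {
    a.store(10, ordering: .relaxed)
    // The releasing store guarantees that any thread which sees
    // this write to `b` (via an acquiring load) also sees a == 10.
    b.store(20, ordering: .releasing)
}

DispatchQueue.global().async {
    let c = b.load(ordering: .acquiring)
    let d = a.load(ordering: .relaxed)
    // With acquire/release pairing, c == 20 implies d == 10.
    precondition(!(c == 20 && d == 0))
}

With .relaxed on every access instead, the reordering from the C++ example would be permitted again.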

1 Like

Could you elaborate? I'm not sure how a barrier for each DispatchQueue block would prevent the CPU from reordering the stores of a and b, as they would both be on the same side of the barrier (instead of across it).

Can you provide the dispatch code you used?

@Karl


import Dispatch

let dp = DispatchQueue(label: "1")
let dp2 = DispatchQueue(label: "2")

for _ in 0...1000000 {
    var a = 0
    var b = 0
    var c = 0
    var d = 0
    let sem = DispatchSemaphore(value: 0)
    dp.async {
        a = 10
        b = 20
    }

    dp2.async {
        c = b
        d = a
        sem.signal()
    }

    sem.wait()
    if c == 20, d == 0 {
       // never reaches here
        print("oi!!")
        break
    }
}

As in most languages on most CPU architectures, there's no guarantee of ordering between the writes to a and b, nor between c and d. If you're seeing ordering, it's just happenstance of the CPU you're executing it on¹.

Your first closure compiles down to (on AArch64):

ldp x8, x9, [x20, #0x10]  // load the addresses of a's and b's boxes from the closure context
mov w10, #0xa             // 10
str x10, [x8, #0x10]      // a = 10
mov w8, #0x14             // 20
str x8, [x9, #0x10]       // b = 20
ret

No barrier instructions or anything like that.

The second closure has a bit more boilerplate, for loading the closure's context and finding the address of the heap-allocated values, but it's essentially:

ldr x9, [x9, #0x10]   // load b
str x9, [x8, #0x10]   // c = b
ldr x8, [x11, #0x10]  // load a
str x8, [x10, #0x10]  // d = a

Again, no barriers of any kind (in these closures themselves - GCD likely emits full barriers between blocks).

AArch64 only guarantees a particular apparent² order of memory accesses when any of:

  1. There is a data dependency between the two accesses (e.g. the first writes to an address, the second reads from it).
  2. There is an address dependency between the two accesses (e.g. the first one reads a pointer value from memory, the second reads from the pointed-to address).
  3. There is a control dependency between the two accesses (e.g. the first one is checked for equality to zero and the second is conditionalised on the result).
  4. There is a relevant barrier instruction between them (e.g. dmb; see the Swift sketch after this list).
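You don't normally write dmb by hand from Swift, though. As a rough sketch (using the swift-atomics package, an extra dependency that the code above doesn't pull in), an explicit fence lowers to the appropriate barrier:

import Atomics
import Dispatch

let a = ManagedAtomic<Int>(0)
let b = ManagedAtomic<Int>(0)

DispatchQueue.global().async {
    a.store(10, ordering: .relaxed)
    // Releasing fence: orders the store to `a` before the store
    // to `b`; on AArch64 this lowers to a dmb-style barrier.
    atomicMemoryFence(ordering: .releasing)
    b.store(20, ordering: .relaxed)
}

DispatchQueue.global().async {
    let c = b.load(ordering: .relaxed)
    // Acquiring fence: pairs with the releasing fence above, so
    // if c observed 20 then the store to `a` is visible below.
    atomicMemoryFence(ordering: .acquiring)
    let d = a.load(ordering: .relaxed)
    precondition(!(c == 20 && d == 0))
}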

¹ CPUs don't have to reorder things, of course, and even modern CPUs might be somewhat if not entirely in-order if their priority is energy efficiency. The compiler is still free to reorder instructions, though. LLVM seems to avoid reordering unless there's a clear reason for it, although it'd be entirely within spec to literally reorder independent operations randomly for the hell of it. I don't know if swiftc / LLVM has a mode to do this, but if it did it'd be useful for fuzz-testing for missing barriers.

² Technically the CPU is free to do whatever it wants as long as it presents the appearance of obeying these rules. e.g. it might speculatively read something out of order, or guess what the read value will be, and only redo the read if it turns out it guessed wrong.

4 Likes

If you really want to compare one language to another, you should use the same threading primitives so the only difference is the language. You could use pthread or DispatchQueue or even std::thread (when using Swift C++ interop).

But in this case, they aren't even logically equivalent. You should at a minimum use two semaphores so you can wait for the completion of both tasks, like you join the two threads in the C++ version. Otherwise there's a chance the task dispatched to dp in the first iteration hasn't finished running when the second iteration starts and enqueues a new task to dp.
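For example (a rough sketch of the loop body above; semA and semB are names I'm making up):

let semA = DispatchSemaphore(value: 0)
let semB = DispatchSemaphore(value: 0)

dp.async {
    a = 10
    b = 20
    semA.signal()  // writer task finished
}

dp2.async {
    c = b
    d = a
    semB.signal()  // reader task finished
}

// Wait for *both* tasks, mirroring the two joins in the C++ version.
semA.wait()
semB.wait()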

There's also a difference in how a thread's priority can be boosted while being joined by one of higher priority (like the main thread) that could be at play here. A semaphore will never boost a thread's priority because the kernel doesn't really know which thread it's waiting on (it can't know in advance which thread will call signal()). So here's another difference that could affect the probability of your expected result.

3 Likes

It’s not impossible that Dispatch is just not spinning up a second thread on your system, for its own ineffable reasons. It’s certainly not guaranteed to occur with any ordering; the jobs sent to dp don’t even have to finish before your semaphore wait returns.

1 Like

The answer is "no". Let's simplify the code and remove the Dispatch dependency:

import Foundation

struct Worm {
    var head = 1
    var tail = 0
}

var worm = Worm()

Thread {
    while true {
        worm.head += 1
        worm.tail += 1
    }
}.start()

Here we are moving the worm's head first, then its tail. With this setup the worm's length couldn't possibly go negative, right?!

Thread {
    var moveCount = 0
    while true {
        let tail = worm.tail
        let head = worm.head
        moveCount += 1
        let length = head - tail
        precondition(length >= 0, "worm length: \(length), moveCount: \(moveCount)")
    }
}.start()

RunLoop.current.run(until: .distantFuture)

Not so when you observe the length from a different thread and there are no barriers involved. This is an example of what you could get:

main.swift:25: Precondition failed: worm length: -7, moveCount: 25291433

As you can see, the assumption that the length is non-negative worked well for quite a while... until it broke.
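For contrast, here's a sketch with both sides serialised by an NSLock (one possible fix among several); with the lock, the observer can no longer see a half-moved worm, and the precondition holds:

import Foundation

struct Worm {
    var head = 1
    var tail = 0
}

var worm = Worm()
let lock = NSLock()

Thread {
    while true {
        lock.lock()
        worm.head += 1
        worm.tail += 1
        lock.unlock()
    }
}.start()

Thread {
    while true {
        lock.lock()
        let length = worm.head - worm.tail
        lock.unlock()
        precondition(length >= 0, "worm length: \(length)")
    }
}.start()

RunLoop.current.run(until: .distantFuture)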

1 Like

Thank you for all the thoughtful answers. I was able to reproduce it after moving the code from a playground to a simple project, using the code above. I didn't encounter any reordering with my DispatchQueue snippet (using two semaphores and ruling out that the blocks run on only one thread), but I guess that's just by chance.