An unexpected deadlock with Combine only on Release build

You have a series of great questions here that deserve addressing. I don't know exactly where your knowledge starts and ends, so I'm going to try to cover a wide range of topics; I apologise if some of this is stuff you already know. However, I think it's useful to clarify a bunch of things about the way optimisers work in modern programming languages.

I should stress that I'm not the real expert here: we have actual compiler authors in this community who absolutely know more than I do. Please, compiler authors, weigh in if I'm off-base on anything here. However, this is my rough understanding of why things work the way they do.

Let's start here. You say two things, but the one I want to start with is "optimizers can change program behaviour". Yes, they absolutely can. Indeed, this is the entire point of an optimizer: to change the behaviour of the program to make it cheaper.

Consider this simple C function. (I'm using C because it's a simpler language and because the rules of the optimizer are much more clearly defined than in Swift.)

void clear(char *buf, int len) {
    for (int i = 0; i < len; i++) {
        buf[i] = 0;
    }
}

What this function does is fairly clear: it sets len bytes of buf to zero. Great.

In debug builds, we get:

clear:                                  # @clear
        push    rbp
        mov     rbp, rsp
        mov     qword ptr [rbp - 8], rdi
        mov     dword ptr [rbp - 12], esi
        mov     dword ptr [rbp - 16], 0
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        mov     eax, dword ptr [rbp - 16]
        cmp     eax, dword ptr [rbp - 12]
        jge     .LBB0_4
        mov     rax, qword ptr [rbp - 8]
        movsxd  rcx, dword ptr [rbp - 16]
        mov     byte ptr [rax + rcx], 0
        mov     eax, dword ptr [rbp - 16]
        add     eax, 1
        mov     dword ptr [rbp - 16], eax
        jmp     .LBB0_1
.LBB0_4:
        pop     rbp
        ret

This is a fairly direct translation of the program as written. The inner loop loads i, compares it to len, and if it's greater than or equal (jge) jumps out of the loop. Otherwise, it loads buf and i, stores 0 at the address buf + i, adds 1 to i, and jumps back to the start of the loop. Exactly the code we wrote.

What happens if we turn the optimizer on at -O2? We get this:

clear:                                  # @clear
        test    esi, esi
        jle     .LBB0_2
        push    rax
        mov     edx, esi
        xor     esi, esi
        call    memset
        add     rsp, 8
.LBB0_2:
        ret

This code is drastically different from the above: in particular, the optimizer has inserted a function call we never wrote!

This leads to a really important thing to understand about optimizing compilers: the "as-if" rule, and "observable behaviour". In languages like C (and Swift), the optimizer is allowed to make any change to program behaviour that does not change the "observable behaviour" of the program. That is, the program must behave "as-if" the code you wrote was executed the way you wrote it, but it is not required to actually execute that code.

So why did I put scare quotes around "observable behaviour"? Because that phrase turns out to be a term of art: it has a very specific meaning, separate from its normal English meaning. The reason for this is important: from the perspective of the CPU, all behaviour is observable behaviour. Any change, even reordering two independent instructions, is observable. It's observable to you too: you can look at the binary and observe that it's different!

So language authors have to define what kinds of effects count as "observable behaviour". Some of these seem very natural to us as programmers because we're used to them: things like loop unrolling, constant-propagation, and other straightforward optimisations seem reasonable enough. Yes, technically if we tried to break on the specific instruction we were looking for, in debug mode that instruction would be there and in release mode it would not, but we generally forgive the compiler for this.
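As a tiny illustration (my own example, not from your question), and because the same as-if reasoning applies in Swift as in C, consider constant folding:

func area() -> Int {
    let width = 3
    let height = 4
    // An optimized build may fold this straight to `return 12`: no
    // multiply instruction need ever execute, and that's allowed,
    // because the result is indistinguishable to the program.
    return width * height
}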

But there are other things that might feel observable to us which the compiler does not consider observable. One of these is certain kinds of function calls: what function calls a given program makes, and in what order, is not in and of itself considered observable behaviour. This is why the compiler is allowed to insert a call to memset in my code, even though I didn't call it: not calling memset is not an observable behaviour of my program.

Similarly, the compiler is allowed to not call functions it doesn't need to. For example, consider this code:

int clear_value(void) {
    return 0;
}

void clear(char *buf, int len) {
    int value = clear_value();
    for (int i = 0; i < len; i++) {
        buf[i] = value;
    }
}

In optimised builds, the call to clear_value simply never happens: if I added a breakpoint on all calls to clear_value, I'd get different results in debug and release builds.

For completeness, what counts as "observable behaviour"? It varies by language, but it generally includes I/O: the same data must be written into files, and the same I/O ordering must occur (that is, it's not acceptable to swap the order of a read and a write). Similarly, anything that does (or may do) any of those things also exhibits observable behaviour. In general, if a compiler can't see into the body of a function (e.g. because it's defined in a separate library), it must be assumed to do one of those things and so the call must remain.
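To make that concrete, here's a small hand-written sketch (my own example): the pure arithmetic below has no observable effect, so the compiler, which can see the whole body, may fold it away or drop the call entirely; the print is I/O and must remain.

func busyWork() -> Int {
    // Pure computation, no I/O: under the as-if rule the compiler
    // may fold this loop to a constant, or remove the call entirely
    // if the result is never used.
    var total = 0
    for i in 1...10 {
        total += i
    }
    return total
}

func report() {
    _ = busyWork()   // result discarded: nothing observable happens here
    print("done")    // I/O is observable behaviour: this call must stay
}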

I've gone through all of this lengthy material to establish that optimizers are explicitly allowed to change the behaviour of the program, within certain rules. Let's now tackle your other questions:

We do agree that lifetime is significant and must be visible. However, that agreement doesn't necessarily mean we all agree on when an object's lifetime ends.

You are applying C++'s lifetime management, which says that objects are destroyed at the end of the scope that lexically contains the point where they were created. That is, they are destroyed when their enclosing scope exits (approximately). However, nothing in Swift commits to this being the rule of lifetime management in Swift. Indeed, the mere existence of withExtendedLifetime makes it clear that Swift's lifetimes are not defined this way.
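As a side note, withExtendedLifetime is the tool you reach for when you genuinely need a reference to survive to a particular point. A minimal sketch (the Resource class is my own invention for illustration):

final class Resource {
    func touch() { print("touched") }
    deinit { print("released") }
}

func work() {
    let resource = Resource()
    withExtendedLifetime(resource) {
        // `resource` is guaranteed not to be released before this
        // closure returns, whatever the optimizer decides elsewhere.
        resource.touch()
    }
    // From here on the optimizer is again free to release it at
    // whatever point it chooses.
}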

Consider the Transitioning to ARC document, originally written for Objective-C programmers moving to ARC. Its FAQ contains this:

How do I think about ARC? Where does it put the retains/releases?

Try to stop thinking about where the retain/release calls are put and think about your application algorithms instead. Think about “strong and weak” pointers in your objects, about object ownership, and about possible retain cycles.

The Swift book also doesn't say anything about scope. In the section on ARC it says:

Additionally, when an instance is no longer needed, ARC frees up the memory used by that instance so that the memory can be used for other purposes instead. This ensures that class instances do not take up space in memory when they are no longer needed.

and

To make sure that instances don’t disappear while they are still needed, ARC tracks how many properties, constants, and variables are currently referring to each class instance. ARC will not deallocate an instance as long as at least one active reference to that instance still exists.

Notice that this does not say "when a reference exits scope". Instead it uses the somewhat vague term "exists".

This vagueness is the source of the difference in behaviour. How long does a reference exist? Well, it certainly ceases to exist when the reference exits the scope that defined it (as in C++). This is easy for the compiler to evaluate quickly, so it's a natural fallback: certainly, no later than that point. However, in principle the reference may stop existing sooner than that! As nothing has committed to the idea that the reference lasts to the end of the function, the optimizer is free to assume that the reference stops existing at the moment of its last use. Hence, the call to swift_release may happen anytime after the last time you use a reference. This is the only promise the compiler makes.

I will repeat that as a call-out, because it's important: the lifetime of a Swift reference ends at the last point where it can be reached from any part of the program through valid Swift code. However, just because the lifetime ends there does not mean that the compiler promises to call swift_release at that point. Instead, the compiler may call swift_release at any time from that point onward.
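A small, hedged sketch of what that can look like (my own example; the exact ordering of the output is up to the optimizer, which is exactly the point):

final class Tracker {
    deinit { print("Tracker deallocated") }
}

func doOtherWork() {
    print("doing other work")
}

func run() {
    let tracker = Tracker()
    print("last use of tracker: \(tracker)")
    doOtherWork()
    // In a typical debug build, "Tracker deallocated" prints after
    // doOtherWork() returns, when `tracker` leaves scope. In a release
    // build the release may legally happen any time after the print
    // above, possibly before doOtherWork() even runs.
}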

So, now to the most important question you asked:

Ideally, yes! It would be great if debug and release builds worked the same way. However, in practice, there are a number of problems with that idea.

The first is that the optimisations required to shorten the lifetime of a reference in release mode are expensive: they take time. This makes compiles slow. An important part of debug builds is that they need to be as fast as possible, to allow rapid development cycles. Making debug builds do this lifetime analysis will slow them down.

More importantly, this interacts with other optimisations. Inlining, constant-propagation, and other optimisations can all affect the lifetime of a reference. This means that in practice to achieve the same behaviour requires running all the same optimisations. This affects not only compile time, but also the debuggability of your program.

No, the core team would love predictability, but it comes with tradeoffs. As noted above, to make debug match release, we'd slow down our builds and make them less debuggable: it's a non-starter.

You might well ask: well then, why not make release match debug? Why not make the lifetime of all references explicitly end when their enclosing scope ends?

The easiest answer to this is that it makes Swift programs slower. Consider this code:

func doubleFirstAndPrint(_ array: [UInt8]) {
    var array2 = array  // Take local copy, we're going to mutate
    array2[0] *= 2
    print(array2)
}

func test() {
    let myArray: [UInt8] = [1, 2, 3]
    doubleFirstAndPrint(myArray)
}

If the lifetime of a reference extends until the end of its lexical scope, then when we call test the myArray reference stays alive until after doubleFirstAndPrint completes. That means that when we perform array2[0] *= 2 inside doubleFirstAndPrint, there are three references to that array alive: one in test (called myArray), one in doubleFirstAndPrint that will never be used again (the argument array), and another in doubleFirstAndPrint (the local var array2). Because the storage is shared, the assignment must copy the entire array before writing to it (a copy-on-write), costing performance.

With Swift's model, we can flip that. We can say: well, myArray is not reachable after we pass it to doubleFirstAndPrint. We created myArray at +1, so we can pass it at +1 to doubleFirstAndPrint, handing ownership to that function. Then, that function copies the reference into var array2. At this point, the argument array is unreachable, so we can emit a release for that name and a retain for var array2. These two cancel out, so we can optimise them both away.

Now, when we come to our assignment, the refcount for var array2 is still only 1! That means we don't need to copy-on-write: Swift has correctly observed that the array is only reachable from one place in the code, the place we're modifying it from, and so there is no need to perform the CoW. Our program is also faster, as we've eliminated some retains/releases we didn't need in the code.
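If you're curious what that uniqueness check looks like, here's a simplified copy-on-write wrapper (a sketch of the general technique, not the standard library's actual Array implementation): the fewer extra retains the compiler leaves behind, the more often isKnownUniquelyReferenced returns true and the copy is skipped.

final class Storage {
    var bytes: [UInt8]
    init(_ bytes: [UInt8]) { self.bytes = bytes }
}

struct CoWBuffer {
    private var storage: Storage

    init(_ bytes: [UInt8]) { self.storage = Storage(bytes) }

    subscript(index: Int) -> UInt8 {
        get { storage.bytes[index] }
        set {
            // Copy only if someone else also holds this storage.
            if !isKnownUniquelyReferenced(&storage) {
                storage = Storage(storage.bytes)
            }
            storage.bytes[index] = newValue
        }
    }
}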


I hope this essay has helped make it a bit clearer why things are this way: why release and debug behave differently, and why it's hard and/or undesirable to make them behave the same in Swift.

This behavioural flexibility is part of why, in Swift, it can be unwise to do resource management in deinit: as you've noted, deinit is not necessarily called deterministically at the same point in the code. In general, managing the lifetime of things that need to be cleaned up at a specific point is best done with with-style scoped functions, as sketched below.
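A rough sketch of that pattern (the names here are illustrative, not a standard API): acquire the resource, hand it to a closure, and clean up in a defer, so the cleanup point never depends on when ARC decides to release anything.

struct ScratchBuffer {
    var bytes: [UInt8]
}

func withScratchBuffer<Result>(size: Int, _ body: (inout ScratchBuffer) throws -> Result) rethrows -> Result {
    var buffer = ScratchBuffer(bytes: Array(repeating: 0, count: size))
    defer {
        // Deterministic cleanup point: runs exactly when the closure
        // returns, not whenever ARC happens to drop the last reference.
        print("scratch buffer cleaned up")
    }
    return try body(&buffer)
}

// Usage: the resource is only valid inside the closure.
let sum = withScratchBuffer(size: 4) { buffer -> Int in
    buffer.bytes[0] = 42
    return buffer.bytes.reduce(0) { $0 + Int($1) }
}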
