Why is the retain/release pair not removed in this particular case?

Hey there!

I have some performance-critical code in my project, and I noticed that 20-30% of the time on the hot path is spent on retain/release calls. I have reduced the problem to this example:

protocol P {
    func take(a: A)
}

struct V: P {
    func take(a: A) {
        a.stuff()
    }
}

class A {
    func stuff() {
        print(0)
    }
}

struct B {
    let a = A()
    let b: [P]

    init(b: [P]) {
        self.b = b
    }

    func d() {
        for t in self.b {
            t.take(a: self.a)
        }
    }
}

let b = B(b: [V()])
b.d()

The hot function in question is B.d. It passes an owned A instance to each value in the owned array of existentials [P].

Godbolt (swiftc nightly -O) shows the following disassembly of d:

output.B.d() -> ():
        ...
        mov     rbx, qword ptr [rsi + 16]
        test    rbx, rbx
        je      .LBB3_4
        mov     r14, rdi
        lea     r15, [rsi + 32]
        mov     qword ptr [rsp], rsi
        mov     rdi, rsi
        call    swift_retain@PLT
.LBB3_2:
        mov     r12, qword ptr [r15 + 24]
        mov     rbp, qword ptr [r15 + 32]
        mov     rdi, r15
        mov     rsi, r12
        call    __swift_project_boxed_opaque_existential_1
        mov     rdi, r14
        mov     r13, rax
        mov     rsi, r12
        mov     rdx, rbp
        call    qword ptr [rbp + 8]
        add     r15, 40
        dec     rbx
        jne     .LBB3_2
        mov     rdi, qword ptr [rsp]
        add     rsp, 8
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        jmp     swift_release@PLT
.LBB3_4:
        ...

In my understanding of ARC optimizations, the lifetime of B.a is guaranteed across the call to B.d, so the retain/release pair should be removed by the optimizer.

Is there any way to remove this overhead? In my code I use Unmanaged to successfully eliminate the retain/release calls in this example, but I am curious whether this is expected behavior and whether there are any "safe" ways to remove the overhead.

2 Likes

FWIW, it seems the retain/release pair is emitted for self.b._buffer._storage (not for self.a), which is of type __ContiguousArrayStorageBase.
Compiler explorer:

sil hidden @$s6output1BV1dyyF : $@convention(method) (@guaranteed B) -> () {
[%0: read v**.c*.v**, write v**.c*.v**, copy **, destroy v**.c*.v**]
[global: read,write,copy,destroy,allocate,deinit_barrier]
bb0(%0 : $B):
  debug_value %0, let, name "self", argno 1, loc "/app/example.swift":25:10, scope 59 // id: %1
  %2 = struct_extract %0, #B.b, loc "/app/example.swift":26:23, scope 60 // users: %37, %6, %4
  %3 = integer_literal $Builtin.Int64, 0, scope 66 // users: %19, %14, %23, %5
  debug_value %2, var, (name "$t$generator", loc "/app/example.swift":26:18), type $IndexingIterator<Array<any P>>, expr op_fragment:#IndexingIterator._elements, loc "/app/example.swift":26:23, scope 60 // id: %4
  debug_value %3, var, (name "$t$generator", loc "/app/example.swift":26:18), type $IndexingIterator<Array<any P>>, expr op_fragment:#IndexingIterator._position:op_fragment:#Int._value, loc "/app/example.swift":26:23, scope 60 // id: %5
  %6 = struct_extract %2, #Array._buffer, loc * "<compiler-generated>":0:0, scope 59 // user: %7
  %7 = struct_extract %6, #_ContiguousArrayBuffer._storage, loc * "<compiler-generated>":0:0, scope 59 // users: %22, %21, %8
  %8 = ref_element_addr [immutable] %7, #__ContiguousArrayStorageBase.countAndCapacity, scope 74 // user: %9
  %9 = struct_element_addr %8, #_ArrayBody._storage, scope 74 // user: %10
  %10 = struct_element_addr %9, #_SwiftArrayBodyStorage.count, scope 74 // user: %11
  %11 = struct_element_addr %10, #Int._value, scope 74 // user: %12
  %12 = load %11, scope 74                        // user: %13
  %13 = builtin "assumeNonNegative_Int64"(%12) : $Builtin.Int64, scope 74 // users: %34, %14, %19
  %14 = builtin "cmp_eq_Int64"(%13, %3) : $Builtin.Int1, scope 72 // user: %15
  cond_br %14, bb2, bb1, scope 68                 // id: %15

bb1:                                              // Preds: bb0
  %16 = integer_literal $Builtin.Int64, 1, scope 78 // user: %28
  %17 = integer_literal $Builtin.Int1, -1, scope 78 // user: %28
  %18 = struct_extract %0, #B.a, loc "/app/example.swift":27:28, scope 79 // user: %33
  %19 = builtin "cmp_sge_Int64"(%3, %13) : $Builtin.Int1, scope 78 // user: %20
  cond_fail %19, "loop induction variable overflowed", scope 78 // id: %20
  %21 = ref_tail_addr [immutable] %7, $any P, scope 85 // user: %27
  strong_retain %7, loc * "<compiler-generated>":0:0, scope 59 // id: %22 <<< here
  br bb3(%3), scope 68                            // id: %23

I don't know why it wasn't eliminated though.

2 Likes

This is because for loops use consuming iteration, which requires copying the array. There are two ways to solve this:

  1. A smarter optimiser. If you compile with -Xfrontend -enable-ossa-modules, the copy gets optimised away, and the d function does not contain any retains or releases.

    While it's interesting (and it proves that this is solvable by the compiler), it's probably not a great idea to ship things built using experimental compiler flags. The optimiser team have said they are working on improving this mode, so hopefully it will be enabled by default before too long :crossed_fingers:

  2. Use an index-based loop instead:

    func d() {
        for i in 0..<self.b.count {
            self.b[i].take(a: self.a)
        }
    }
    

    Not as pretty as iterating over the array directly, but it avoids consuming the array, and the resulting d therefore contains no ARC operations.
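For reference, the same index-based approach can also be written with `indices` instead of a hand-rolled `0..<count` range. A minimal self-contained sketch (note: the `calls` counter in A is mine, standing in for the original `print` so the behaviour is observable):

```swift
protocol P {
    func take(a: A)
}

final class A {
    var calls = 0
    func stuff() { calls += 1 }
}

struct V: P {
    func take(a: A) { a.stuff() }
}

struct B {
    let a = A()
    let b: [any P]

    // Index-based iteration borrows the array in place instead of
    // consuming it through an iterator, so the loop body performs
    // no copy of the array value.
    func d() {
        for i in self.b.indices {
            self.b[i].take(a: self.a)
        }
    }
}
```

For an `Array`, `indices` is just `0..<count`, so this is the same loop with the range spelled more idiomatically.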

6 Likes

When using indices, the compiler might have to insert bounds checks where it previously did not, although I am unsure whether the optimizer can eliminate them and whether this is an actual issue in practice.

Thanks a lot, @Karl! With indices this particular issue is fixed; however, I have another puzzle for you :grinning:

struct V {
    func take(a: Wrap) {
        a.a.stuff()
    }
}

class A {
    func stuff() { }
}

struct Wrap {
    let a: A
}

struct B {
    let a = A()
    var b = V()

    func d() {
        b.take(a: Wrap(a: self.a))
    }
}

var b = B()
b.d()

Godbolt

output.B.d() -> ():
        push    r13
        push    rbx
        push    rax
        mov     r13, rdi
        mov     rax, qword ptr [rdi]
        mov     rbx, qword ptr [rax + 56]
        call    swift_retain@PLT
        call    rbx
        mov     rdi, r13
        add     rsp, 8
        pop     rbx
        pop     r13
        jmp     swift_release@PLT

As far as I can tell, the issue is induced by the Wrap struct. Again, I have managed to optimize it by declaring Wrap.a as Unmanaged<A> and using _withUnsafeGuaranteedRef to access the A, but I am looking for a "safe" solution.

In my program, Wrap is actually doing useful logic and Wrap.a is private.

PS. Enabling -enable-ossa-modules fixes this issue. Are there any other workarounds?
PPS. Looks similar to this thread to me.
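For completeness, here is a minimal sketch of the kind of Unmanaged workaround described above, using the public `takeUnretainedValue()` rather than the underscored `_withUnsafeGuaranteedRef` (the `calls` counter and the `doStuff` name are illustrative, not from the original code):

```swift
final class A {
    var calls = 0
    func stuff() { calls += 1 }
}

// Wrap stores its reference as Unmanaged, so constructing, copying,
// and passing a Wrap performs no retain/release. The safety burden
// moves to the caller: the wrapped A must be kept alive elsewhere
// for as long as the Wrap is in use.
struct Wrap {
    private let a: Unmanaged<A>

    init(a: A) {
        // passUnretained does not retain: no ARC traffic here.
        self.a = Unmanaged.passUnretained(a)
    }

    func doStuff() {
        // takeUnretainedValue does not consume a retain either; it
        // just gives back the reference for the duration of the call.
        a.takeUnretainedValue().stuff()
    }
}
```

This trades compiler-enforced safety for performance: if the underlying A is deallocated while a Wrap still references it, using the Wrap is undefined behaviour.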

1 Like

Yep, I encountered the same issue in the thread you linked to, and OSSA modules is the only workaround I know of.

1 Like

Unfortunately, OSSA modules do not solve this issue in my real project code, though they do work in this simple example. I guess we are sticking with Unmanaged for now.