My unit tests keep crashing because a struct that conforms to Equatable has an outlined retain and it was released

Hey Folks

All I did was add an optional Int to EventModel.AppliedFilters, and it causes a chain of crashes on my CI in different tests across different files.

Xcode 14.3, Swift 5.8

I keep getting crashes on my CI. Sometimes it also crashes locally, but it reliably disappears once I turn on Address Sanitizer.

Sometimes I manage to mitigate the local crash by reducing the number of properties in this struct, or by grouping some of them into another embedded struct. But after I push to the CI, the crash still occurs.

I used godbolt to check the assembly code. There are obvious differences between Swift 5.8 and the current nightly.
Swift 5.8, latest nightly Swift. In the new assembly the outlined retain is gone...

Code gist.
In the crash logs (log1, log2) some module names are obfuscated. But the crashes happen fairly randomly on structs.

I have checked many links, but none of them pointed me to a solution. Help, please.

https://developer.apple.com/forums/thread/115715
https://developer.apple.com/forums/thread/698391

https://developer.apple.com/forums/thread/711393

All the threads at the time of the crash.

I rarely see anyone online talking about CG Timer. Is it related to garbage collection?

The gist shows some data structures but no actual code (other than the memberwise inits). If you have a small sample app that reproduces the crash, it's worth sharing; otherwise it's hard to decipher what you are talking about.

I'm struggling to find something that is small and representative of the problem, and I'm fully aware of the limitations of this snippet. I will post a reproduction once I know how to produce one. For now, I'm still trying to find out why and how it happens. The assembly I posted shows why I'm getting the crash: in Swift 5.8 there are pairs of retain and release when the filter is set, while in the new nightly build they are no longer there. Right now I'm trying to understand this, so that I can maybe find a small example to reproduce it.

A crash in retain/release related functions is often indicative of memory corruption occurring, which could be because of bugs in unsafe code, data races, or compiler/runtime bugs. If you haven't yet, building with Address Sanitizer and Thread Sanitizer enabled may catch the source of the corruption closer to where it originally occurred.

Thanks, Joe. I'm going to keep testing. So far, every time I turn on ASAN, the crash disappears on my local machine, while our CI always catches it. (Does a MacStadium virtual machine have a much more limited environment?) I haven't tried TSAN yet; I thought it might be irrelevant since this is very likely memory corruption. But I'll check now.
I thought structs were just copied when passed around. My case here seems to contradict that assumption.
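
For what it's worth, copying a struct is only a plain byte copy when every field is trivial. Here is a minimal sketch, using hypothetical types (`Storage`, `AppliedFiltersSketch`) rather than the real EventModel.AppliedFilters, of why a struct copy can still involve retain/release: any field that is a class reference (or owns heap storage, as String and Array can) has to be retained when the struct is copied and released when the copy goes away.

```swift
// Hypothetical types for illustration only; not the real EventModel.AppliedFilters.
final class Storage {
    var values: [Int] = []
}

struct AppliedFiltersSketch: Equatable {
    var name: String      // may own heap storage for longer strings
    var limit: Int?       // trivial field, no reference counting needed
    var storage: Storage  // class reference: each struct copy retains it

    static func == (lhs: Self, rhs: Self) -> Bool {
        lhs.name == rhs.name
            && lhs.limit == rhs.limit
            && lhs.storage === rhs.storage
    }
}

let original = AppliedFiltersSketch(name: "filters", limit: nil, storage: Storage())

// Copying the struct copies the reference, not the Storage object,
// so the compiler has to emit a retain for `storage` here.
var copy = original
copy.limit = 3

// The value fields are independent, but both copies share one Storage instance.
print(original.limit == nil)
print(copy.storage === original.storage)
```

So a crash inside a retain on a struct field does not contradict value semantics by itself; it suggests that one of the references stored in the struct was already invalid when the copy was made.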

After all that, no luck with TSAN. It behaved the same as with ASAN: the crash disappeared.

So, I fixed it by renaming my struct, from AppliedFilters to FilterAplied... I do not know why.

new crash log

libswiftCore.dylib`swift::RefCounts<swift::RefCountBitsT<(swift::RefCountInlinedness)1> >::incrementSlow:
0x18bfbb494 <+0>: stp x20, x19, [sp, #-0x20]!
0x18bfbb498 <+4>: stp x29, x30, [sp, #0x10]
0x18bfbb49c <+8>: add x29, sp, #0x10
0x18bfbb4a0 <+12>: mov x19, x0
0x18bfbb4a4 <+16>: cmn w1, #0x1
0x18bfbb4a8 <+20>: b.eq 0x18bfbb4d4 ; <+64>
0x18bfbb4ac <+24>: tbz x1, #0x3f, 0x18bfbb558 ; <+196>
0x18bfbb4b0 <+28>: mov x3, x2
0x18bfbb4b4 <+32>: lsl x8, x1, #3
-> 0x18bfbb4b8 <+36>: ldp x6, x5, [x8, #0x10]
0x18bfbb4bc <+40>: cmp w2, #0x1
0x18bfbb4c0 <+44>: b.eq 0x18bfbb4e4 ; <+80>
0x18bfbb4c4 <+48>: and x9, x6, #0x80000000ffffffff
0x18bfbb4c8 <+52>: mov x10, #-0x7fffffff00000001
0x18bfbb4cc <+56>: cmp x9, x10
0x18bfbb4d0 <+60>: b.ne 0x18bfbb4e4 ; <+80>
0x18bfbb4d4 <+64>: sub x0, x19, #0x8
0x18bfbb4d8 <+68>: ldp x29, x30, [sp, #0x10]
0x18bfbb4dc <+72>: ldp x20, x19, [sp], #0x20
0x18bfbb4e0 <+76>: ret
0x18bfbb4e4 <+80>: add x0, x8, #0x10
0x18bfbb4e8 <+84>: lsr x9, x5, #32
0x18bfbb4ec <+88>: lsl x8, x3, #33
0x18bfbb4f0 <+92>: adds x10, x8, x6
0x18bfbb4f4 <+96>: b.mi 0x18bfbb53c ; <+168>
0x18bfbb4f8 <+100>: mov w11, w5
0x18bfbb4fc <+104>: mov x7, x11
0x18bfbb500 <+108>: bfi x7, x9, #32, #32
0x18bfbb504 <+112>: mov x4, x6
0x18bfbb508 <+116>: mov x5, x7
0x18bfbb50c <+120>: casp x4, x5, x10, x11, [x0]
0x18bfbb510 <+124>: eor x9, x5, x7
0x18bfbb514 <+128>: eor x10, x4, x6
0x18bfbb518 <+132>: orr x9, x10, x9
0x18bfbb51c <+136>: cbz x9, 0x18bfbb4d4 ; <+64>
0x18bfbb520 <+140>: lsr x12, x5, #32
0x18bfbb524 <+144>: mov x6, x4
0x18bfbb528 <+148>: mov x9, x12
0x18bfbb52c <+152>: adds x10, x8, x4
0x18bfbb530 <+156>: b.pl 0x18bfbb4f8 ; <+100>
0x18bfbb534 <+160>: mov x9, x12
0x18bfbb538 <+164>: mov x6, x4
0x18bfbb53c <+168>: cmn w6, #0x1
0x18bfbb540 <+172>: b.eq 0x18bfbb4d4 ; <+64>
0x18bfbb544 <+176>: mov w2, w5
0x18bfbb548 <+180>: bfi x2, x9, #32, #32
0x18bfbb54c <+184>: mov x1, x6
0x18bfbb550 <+188>: bl 0x18bfbb55c ; swift::RefCounts<swift::SideTableRefCountBits>::incrementSlow(swift::SideTableRefCountBits, unsigned int)
0x18bfbb554 <+192>: b 0x18bfbb4d4 ; <+64>
0x18bfbb558 <+196>: bl 0x18bf825d0 ; swift::swift_abortRetainOverflow()

I wonder if that indicates a bug in the compiler with incremental recompilation. Does the crash also go away if you do a clean rebuild?

So the crash is gone.

I had assumed it might be because the CI has very limited resources, but I can reproduce it locally on my machine now. It does seem that something is wrong with the incremental build.

The CI is supposed to do a clean build, but maybe it is not so clean after all. For the past runs, I was just using CMD+SHIFT+K to clean. Could it be that the module cache is not actually cleaned? Those crashing structs come from another pod. So could it be that, after I updated my pod, the rebuild did not invalidate the module cache when running the unit test target?
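
To illustrate why a stale build product could matter here, this is a small sketch with hypothetical stand-in structs (`FiltersBefore`, `FiltersAfter`), not the real pod types. It only shows that adding an optional Int changes the struct's in-memory size; whether that is what actually went wrong in my build is an assumption.

```swift
// Hypothetical stand-ins for the struct before and after the change.
struct FiltersBefore {
    var query: String
    var page: Int
}

struct FiltersAfter {
    var query: String
    var page: Int
    var limit: Int?   // the newly added optional
}

// Print the layout of both versions; the exact numbers depend on the platform.
print("before:", MemoryLayout<FiltersBefore>.size,
      "stride:", MemoryLayout<FiltersBefore>.stride)
print("after: ", MemoryLayout<FiltersAfter>.size,
      "stride:", MemoryLayout<FiltersAfter>.stride)
```

If the test target were still using objects or cached module information built against the old layout, it could copy or read the struct with the wrong size, which might explain a retain on a garbage pointer. That would fit with the local crash going away after deleting the DerivedData folder, but it is only a guess.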


Another update. On our CI we get a fresh agent every time a job is triggered, so the incremental build theory does not explain the problem there, even though removing the folder inside the DerivedData folder got rid of my local crash.