Random segfaults with async/await, but only on release builds

I have a Vapor application that is experiencing random segmentation faults, but only when built with -c release. It’s really weird for several reasons:

  • The segfault always occurs when accessing a particular data model protected by an actor, nowhere else. the data was originally written by my application, so I do not think the segfault is Vapor’s fault.

  • I am not using any Unsafe APIs anywhere near where the segfault is happening. The only place I use an unsafe pointer is when doing datetime parsing (using timegm), and there are multiple layers of validation between that and where the segfault is happening.

  • all of the data looks normal immediately after it is written, there are only problems when it is read back later. when i dump the data being read back, i see all kinds of data corruption, e.g. Decimals where the exponent field contains integer values that look like stack addresses (12 nonzero digits, starts with 0x7f...).

  • this same data model gets accessed all the time by methods on its enclosing actor, but the segfault occurs in a site where i call a method directly on one of the actor’s members. the call site is, of course, gated by await/async let, as the compiler demands. i have not observed a segfault when i wrap the call site in an actor method, and call that method from the other task instead. i’ve run the release binary several times with this modification, and it can get through several hundred work units before dying a “natural” death (see next section), just as the debug binaries do. for what it’s worth, the data model is a struct which contains some normal Swift arrays, which makes me suspect this issue has something to do with copy-on-write in a concurrency context.

  • the segfault never occurs on debug builds. on release builds, it often dies after about 20–70 units of work, but the debug binaries are consistently able to run for over 500 work units (before the OS kills it for excessive memory use.) I have never seen data corruption in a debug build. this, of course, makes it very difficult to identify the cause of this issue.

if anyone has any insights into what’s happening, that would be really helpful. especially tips on how to debug something that only occurs in release builds…

1 Like

That does sound a lot like either a concurrency bug or a use-after-free, and the latter seems unlikely without unsafe code. Have you tried running with TSan yet?

2 Likes

i'd eliminate all such suspects one by one until the bug disappears - the last one removed is the culprit, so you can bring it back and look at it closely. sometimes you have to remove significant amounts of app itself to make the crash culprit obvious. in some cases i do a binary bisection like remove half of the app -> still crashes -> remove half the remaining app -> still crashes -> and so on. it helps if you have 100% reproducible test to begin with. also try to enable various malloc guard checks - even if the problem is not in those operations they check - they will likely complain whenever heap is damaged for any reason, so you'll be alerted sooner.

1 Like

i’ve tried running with --sanitize=thread, and can confirm that there are a ton of data races that simply do not appear when the actor property access is wrapped in an actor method.

this data race occurs on every work unit, when calling the method on the actor member directly:

WARNING: ThreadSanitizer: data race (pid=14520)
  Read of size 8 at 0x7b7c000f0a28 by thread T73:
    #0 $s8ephemera3FTXO9OrderbookV9LiquidityV4bids4asksAGs10ArraySliceVyAA7DecimalVys5Int64VG5price_AP4sizetG_AStcfCTf4ggd_n crtstuff.c:? (ephemera+0xe6ef41)
    #1 $ss10ArraySliceVyxSicir8ephemera7DecimalVys5Int64VG5price_AH4sizet_Tg5 <compiler-generated>:? (ephemera+0xe6ef41)
    #2 $ss10ArraySliceVyxGSlsSly7ElementQz5IndexQzcirTW8ephemera7DecimalVys5Int64VG5price_AM4sizet_Tg5 <compiler-generated>:? (ephemera+0xe6ef41)
    #3 $sSlsE5first7ElementQzSgvgs10ArraySliceVy8ephemera7DecimalVys5Int64VG5price_AL4sizetG_Tg5 <compiler-generated>:? (ephemera+0xe6ef41)
    #4 $s8ephemera3FTXO9OrderbookV9LiquidityV4bids4asksAGs10ArraySliceVyAA7DecimalVys5Int64VG5price_AP4sizetG_AStcfCTf4ggd_n sources/ephemera/io/orderbooks.swift:87 (ephemera+0xe6ef41)
    #5 $s8ephemera11CoordinatorC7transityyYaFSayAA7DecimalVys5Int64VG5price_AI4sizeAC10Statistics33_1928725A3E2B25FF0010FD6120B12F64LLV10IndicatorsV10indicatorstGyYaYbcfu0_AiJ_ApQtSgyYaYbXEfU_AA3FTXO9OrderbookV9LiquidityVyYaYbcfu0_TY1_ builders.swift.o:? (ephemera+0xe7ed5b)
    #6 $s8ephemera3FTXO9OrderbookV9liquidity5depthAE9LiquidityVSi_tF sources/ephemera/io/orderbooks.swift:200 (ephemera+0xe7ed5b)
    #7 $s8ephemera11CoordinatorC7transityyYaFSayAA7DecimalVys5Int64VG5price_AI4sizeAC10Statistics33_1928725A3E2B25FF0010FD6120B12F64LLV10IndicatorsV10indicatorstGyYaYbcfu0_AiJ_ApQtSgyYaYbXEfU_AA3FTXO9OrderbookV9LiquidityVyYaYbcfu0_TY1_ <compiler-generated>:379 (ephemera+0xe7ed5b)
    #8 swift::runJobInEstablishedExecutorContext(swift::Job*) crtstuff.c:? (libswift_Concurrency.so+0x3c7f6)
    #9 _dispatch_worker_thread queue.c:? (libdispatch.so+0x372a6)

  Previous write of size 8 at 0x7b7c000f0a28 by thread T83:
    [failed to restore the stack]

  Thread T73 (tid=14594, running) created by main thread at:
    #0 pthread_create /home/build-user/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:965 (ephemera+0x16ae6b)
    #1 _dispatch_root_queue_poke_slow queue.c:? (libdispatch.so+0x33a45)
    #2 swift_task_create_common ??:? (libswift_Concurrency.so+0x3ed3f)
    #3 $s8ephemera8EphemeraV7ConnectV14ArgumentParser15ParsableCommandAafGP3runyyKFTW crtstuff.c:? (ephemera+0xe35da2)
    #4 $s8ephemera8EphemeraV7ConnectV14ArgumentParser15ParsableCommandAafGP3runyyKFTW <compiler-generated>:90 (ephemera+0xe35da2)
    #5 $s14ArgumentParser15ParsableCommandPAAE4mainyySaySSGSgFZ .build/checkouts/swift-argument-parser/Sources/ArgumentParser/Parsable Types/ParsableCommand.swift:95 (ephemera+0x2197ad)
    #6 $s14ArgumentParser15ParsableCommandPAAE4mainyyFZ .build/checkouts/swift-argument-parser/Sources/ArgumentParser/Parsable Types/ParsableCommand.swift:105 (ephemera+0x219830)
    #7 $s8ephemera8EphemeraV5$mainyyFZ sources/ephemera/ephemera.swift:18 (ephemera+0xdfded4)
    #8 ephemera_main sources/ephemera/common/builders.swift:? (ephemera+0xdfded4)
    #9 __libc_start_main /build/glibc-eX1tMB/glibc-2.31/csu/../csu/libc-start.c:308 (libc.so.6+0x270b2)

  Thread T83 (tid=14604, running) created by thread T81 at:
    #0 pthread_create /home/build-user/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:965 (ephemera+0x16ae6b)
    #1 _dispatch_root_queue_poke_slow queue.c:? (libdispatch.so+0x33a45)
    #2 $sScTss5NeverORszABRs_rlE5yieldyyYaFZ ??:? (libswift_Concurrency.so+0x2491c)
    #3 _dispatch_worker_thread queue.c:? (libdispatch.so+0x372a6)

SUMMARY: ThreadSanitizer: data race crtstuff.c:? in $s8ephemera3FTXO9OrderbookV9LiquidityV4bids4asksAGs10ArraySliceVyAA7DecimalVys5Int64VG5price_AP4sizetG_AStcfCTf4ggd_n
Terms of Service

Privacy Policy

Cookie Policy