Random segfaults with async/await, but only on release builds

I have a Vapor application that is experiencing random segmentation faults, but only when built with -c release. It’s really weird for several reasons:

  • The segfault always occurs when accessing a particular data model protected by an actor, and nowhere else. The data was originally written by my application, so I do not think the segfault is Vapor’s fault.

  • I am not using any Unsafe APIs anywhere near where the segfault is happening. The only place I use an unsafe pointer is when doing datetime parsing (using timegm), and there are multiple layers of validation between that and where the segfault is happening.

  • All of the data looks normal immediately after it is written; there are only problems when it is read back later. When I dump the data being read back, I see all kinds of data corruption, e.g. Decimals where the exponent field contains integer values that look like stack addresses (12 nonzero digits, starting with 0x7f...).

  • This same data model gets accessed all the time by methods on its enclosing actor, but the segfault occurs at a site where I call a method directly on one of the actor’s members (see the sketch after this list). The call site is, of course, gated by await/async let, as the compiler demands. I have not observed a segfault when I wrap the call site in an actor method and call that method from the other task instead. I’ve run the release binary several times with this modification, and it can get through several hundred work units before dying a “natural” death (see the next point), just as the debug binaries do. For what it’s worth, the data model is a struct which contains some normal Swift arrays, which makes me suspect this issue has something to do with copy-on-write in a concurrency context.

  • The segfault never occurs on debug builds. On release builds, the process often dies after about 20–70 units of work, whereas the debug binaries are consistently able to run for over 500 work units (before the OS kills them for excessive memory use). I have never seen data corruption in a debug build. This, of course, makes it very difficult to identify the cause of this issue.
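
To make the two call shapes mentioned above concrete, here is a minimal sketch with hypothetical names (not the actual code, just the same shape: a struct of arrays stored inside an actor):

actor Coordinator {
    // A plain struct holding normal Swift arrays (copy-on-write storage).
    struct Model {
        var values: [Int64] = []
        func summary() -> Int64 { values.reduce(0, +) }
    }

    var model = Model()

    // Shape B: the access is wrapped in an actor method, so the struct is
    // copied while running on the actor.
    func summary() -> Int64 { model.summary() }
}

func work(coordinator: Coordinator) async {
    // Shape A (the one that segfaults in release): call the member's method
    // directly from another task; the compiler requires async let/await here.
    async let direct = coordinator.model.summary()

    // Shape B (no segfault observed): go through the wrapping actor method.
    async let wrapped = coordinator.summary()

    _ = await (direct, wrapped)
}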

If anyone has any insights into what’s happening, that would be really helpful, especially tips on how to debug something that only occurs in release builds…

That does sound a lot like either a concurrency bug or a use-after-free, and the latter seems unlikely without unsafe code. Have you tried running with TSan yet?

I'd eliminate all such suspects one by one until the bug disappears - the last one removed is the culprit, so you can bring it back and look at it closely. Sometimes you have to remove significant amounts of the app itself to make the culprit obvious. In some cases I do a binary bisection: remove half of the app -> still crashes -> remove half of the remaining app -> still crashes -> and so on. It helps if you have a 100% reproducible test to begin with. Also try enabling various malloc guard checks - even if the problem is not in the operations they check, they will likely complain whenever the heap is damaged for any reason, so you'll be alerted sooner.

I’ve tried running with --sanitize=thread, and I can confirm that there are a ton of data races that simply do not appear when the actor property access is wrapped in an actor method.

This data race occurs on every work unit when calling the method on the actor member directly:

WARNING: ThreadSanitizer: data race (pid=14520)
  Read of size 8 at 0x7b7c000f0a28 by thread T73:
    #0 $s8ephemera3FTXO9OrderbookV9LiquidityV4bids4asksAGs10ArraySliceVyAA7DecimalVys5Int64VG5price_AP4sizetG_AStcfCTf4ggd_n crtstuff.c:? (ephemera+0xe6ef41)
    #1 $ss10ArraySliceVyxSicir8ephemera7DecimalVys5Int64VG5price_AH4sizet_Tg5 <compiler-generated>:? (ephemera+0xe6ef41)
    #2 $ss10ArraySliceVyxGSlsSly7ElementQz5IndexQzcirTW8ephemera7DecimalVys5Int64VG5price_AM4sizet_Tg5 <compiler-generated>:? (ephemera+0xe6ef41)
    #3 $sSlsE5first7ElementQzSgvgs10ArraySliceVy8ephemera7DecimalVys5Int64VG5price_AL4sizetG_Tg5 <compiler-generated>:? (ephemera+0xe6ef41)
    #4 $s8ephemera3FTXO9OrderbookV9LiquidityV4bids4asksAGs10ArraySliceVyAA7DecimalVys5Int64VG5price_AP4sizetG_AStcfCTf4ggd_n sources/ephemera/io/orderbooks.swift:87 (ephemera+0xe6ef41)
    #5 $s8ephemera11CoordinatorC7transityyYaFSayAA7DecimalVys5Int64VG5price_AI4sizeAC10Statistics33_1928725A3E2B25FF0010FD6120B12F64LLV10IndicatorsV10indicatorstGyYaYbcfu0_AiJ_ApQtSgyYaYbXEfU_AA3FTXO9OrderbookV9LiquidityVyYaYbcfu0_TY1_ builders.swift.o:? (ephemera+0xe7ed5b)
    #6 $s8ephemera3FTXO9OrderbookV9liquidity5depthAE9LiquidityVSi_tF sources/ephemera/io/orderbooks.swift:200 (ephemera+0xe7ed5b)
    #7 $s8ephemera11CoordinatorC7transityyYaFSayAA7DecimalVys5Int64VG5price_AI4sizeAC10Statistics33_1928725A3E2B25FF0010FD6120B12F64LLV10IndicatorsV10indicatorstGyYaYbcfu0_AiJ_ApQtSgyYaYbXEfU_AA3FTXO9OrderbookV9LiquidityVyYaYbcfu0_TY1_ <compiler-generated>:379 (ephemera+0xe7ed5b)
    #8 swift::runJobInEstablishedExecutorContext(swift::Job*) crtstuff.c:? (libswift_Concurrency.so+0x3c7f6)
    #9 _dispatch_worker_thread queue.c:? (libdispatch.so+0x372a6)

  Previous write of size 8 at 0x7b7c000f0a28 by thread T83:
    [failed to restore the stack]

  Thread T73 (tid=14594, running) created by main thread at:
    #0 pthread_create /home/build-user/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:965 (ephemera+0x16ae6b)
    #1 _dispatch_root_queue_poke_slow queue.c:? (libdispatch.so+0x33a45)
    #2 swift_task_create_common ??:? (libswift_Concurrency.so+0x3ed3f)
    #3 $s8ephemera8EphemeraV7ConnectV14ArgumentParser15ParsableCommandAafGP3runyyKFTW crtstuff.c:? (ephemera+0xe35da2)
    #4 $s8ephemera8EphemeraV7ConnectV14ArgumentParser15ParsableCommandAafGP3runyyKFTW <compiler-generated>:90 (ephemera+0xe35da2)
    #5 $s14ArgumentParser15ParsableCommandPAAE4mainyySaySSGSgFZ .build/checkouts/swift-argument-parser/Sources/ArgumentParser/Parsable Types/ParsableCommand.swift:95 (ephemera+0x2197ad)
    #6 $s14ArgumentParser15ParsableCommandPAAE4mainyyFZ .build/checkouts/swift-argument-parser/Sources/ArgumentParser/Parsable Types/ParsableCommand.swift:105 (ephemera+0x219830)
    #7 $s8ephemera8EphemeraV5$mainyyFZ sources/ephemera/ephemera.swift:18 (ephemera+0xdfded4)
    #8 ephemera_main sources/ephemera/common/builders.swift:? (ephemera+0xdfded4)
    #9 __libc_start_main /build/glibc-eX1tMB/glibc-2.31/csu/../csu/libc-start.c:308 (libc.so.6+0x270b2)

  Thread T83 (tid=14604, running) created by thread T81 at:
    #0 pthread_create /home/build-user/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:965 (ephemera+0x16ae6b)
    #1 _dispatch_root_queue_poke_slow queue.c:? (libdispatch.so+0x33a45)
    #2 $sScTss5NeverORszABRs_rlE5yieldyyYaFZ ??:? (libswift_Concurrency.so+0x2491c)
    #3 _dispatch_worker_thread queue.c:? (libdispatch.so+0x372a6)

SUMMARY: ThreadSanitizer: data race crtstuff.c:? in $s8ephemera3FTXO9OrderbookV9LiquidityV4bids4asksAGs10ArraySliceVyAA7DecimalVys5Int64VG5price_AP4sizetG_AStcfCTf4ggd_n
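
For what it’s worth, an ArraySlice is a view into its base Array’s storage rather than an independent copy, so reads through the bids/asks slices normally touch the same buffer as the array they were sliced from. A tiny standalone illustration of the sharing (unrelated to the crash itself):

let prices: [Int64] = [10, 20, 30, 40]
let bids = prices[0..<2]  // ArraySlice: a view into the same buffer, no copy made

prices.withUnsafeBufferPointer { whole in
    bids.withUnsafeBufferPointer { slice in
        // Both point at the same allocation; the slice has no storage of its own.
        print(whole.baseAddress == slice.baseAddress)  // prints "true"
    }
}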

I was just testing my Vapor-based messaging server and hit it with 6300 concurrent WebSocket connections. I got a segmentation fault crash, which I believe must be due to some kind of async/await racing issue in a Swift actor. It seems to happen in Swift's Dictionary getter. I can't see what I would be doing wrong in my code.

Linux version: Ubuntu 20.04.2 LTS
Swift version: Swift version 5.6.1 (swift-5.6.1-RELEASE)

The crash report looks like this:

Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x558805cdb582, Backtrace.(printBacktrace in _B82A8C0ED7C904841114FDF244F9E58E)(signal: Swift.Int32) -> () at /home/username/live-messaging/.build/checkouts/swift-backtrace/Sources/Backtrace/Backtrace.swift:66
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x558805cdb817, closure #1 (Swift.Int32) -> () in static Backtrace.Backtrace.install(signals: Swift.Array<Swift.Int32>) -> () at /home/username/live-messaging/.build/checkouts/swift-backtrace/Sources/Backtrace/Backtrace.swift:80
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x558805cdb817, @objc closure #1 (Swift.Int32) -> () in static Backtrace.Backtrace.install(signals: Swift.Array<Swift.Int32>) -> () at /home/username/live-messaging/<compiler-generated>:79
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x7fa77196f41f
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x558805ad8a3f, generic specialization <Swift.String, Swift.Dictionary<Swift.ObjectIdentifier, App.Connection>> of Swift.Dictionary.subscript.getter : (A) -> Swift.Optional<B> at /home/username/live-messaging/<compiler-generated>:0
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x558805ad8a3f, closure #2 (Swift.String, App.Conversation.Participant) -> App.ConversationInfo.Participant in App.MessagingCenter.onConnectionUpgraded(ws: WebSocketKit.WebSocket, user: App.User, conversationId: Swift.String) throws -> () at /home/username/live-messaging/Sources/App/MessagingCenter/MessagingCenter.swift:145
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x558805ad8a3f, reabstraction thunk helper from @callee_guaranteed (@guaranteed Swift.String, @guaranteed App.Conversation.Participant) -> (@owned App.ConversationInfo.Participant, @error @owned Swift.Error) to @escaping @callee_guaranteed (@in_guaranteed (key: Swift.String, value: App.Conversation.Participant)) -> (@out App.ConversationInfo.Participant, @error @owned Swift.Error) at /home/username/live-messaging/<compiler-generated>:0
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x558805ad8a3f, generic specialization <Swift.Dictionary<Swift.String, App.Conversation.Participant>, App.ConversationInfo.Participant> of (extension in Swift):Swift.Collection.map<A>((A.Element) throws -> A1) throws -> Swift.Array<A1> at /home/username/live-messaging/<compiler-generated>:0
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x558805ad8a3f, App.MessagingCenter.onConnectionUpgraded(ws: WebSocketKit.WebSocket, user: App.User, conversationId: Swift.String) throws -> () at /home/username/live-messaging/Sources/App/MessagingCenter/MessagingCenter.swift:141
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x558805b05f6b, (1) suspend resume partial function for App.onUpgrade(req: Vapor.Request, ws: WebSocketKit.WebSocket) async -> () at /home/username/live-messaging/Sources/App/ws/upgrader.swift:19
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x7fa77213dcb7
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x7fa77213e398
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x7fa7718e8414
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x7fa7718e8272
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x7fa7718f3361
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x7fa771963608
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0x7fa770d18132
Sep 07 09:29:18 machine-1 runProduction[2371825]: 0xffffffffffffffff
Sep 07 09:29:18 machine-1 runProduction[2371821]: Segmentation fault (core dumped)
Sep 07 09:29:18 machine-1 systemd[1]: lmsg.service: Main process exited, code=exited, status=139/n/a
Sep 07 09:29:18 machine-1 systemd[1]: lmsg.service: Failed with result 'exit-code'.

And here is the point in my code where the crash happens (when accessing the dictionary getter):

import NIOCore
import Vapor

typealias LMID = String
typealias UserId = LMID
typealias ConnectionId = ObjectIdentifier

actor MessagingCenter {
    ...
    var connectionsByUserId: [UserId: [ConnectionId: Connection]] = [:]
    ...

    func onConnectionUpgraded(ws: WebSocket, user: User, conversationId: ConversationId) throws {
        ...
        let convInfo = ConversationInfo(
            conversationId: conversationId,
            date: conversation.dateCreated,
            participants: conversation.participants.map {
                ConversationInfo.Participant(
                    id: $1.id,
                    isAdmin: $1.isAdmin,
                    isConnected: connectionsByUserId[$1.id] != nil // <- crash occurred on this line
                )
            },
            messages: conversation.messages.map {
                ConversationInfo.Message(
                    id: $0.id,
                    senderId: $0.senderId,
                    date: $0.date,
                    payload: $0.payload
                )
            }
        )
        ...
    }
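
Not a diagnosis, but a purely hypothetical restructuring of the snippet above (same names, nothing new) would be to snapshot the connected user IDs before the map, so the closure reads a local Set instead of the actor's stored dictionary during the iteration:

        // Hypothetical restructuring, not the actual fix: take a snapshot first.
        let connectedUserIds = Set(connectionsByUserId.keys)
        let participants = conversation.participants.map {
            ConversationInfo.Participant(
                id: $1.id,
                isAdmin: $1.isAdmin,
                isConnected: connectedUserIds.contains($1.id)
            )
        }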

There were a few concurrency bugs fixed in 5.6.2 - does upgrading to that (or 5.6.3) fix anything for you?

I've rebuilt it with 5.6.3 and run my test a few times, but I get the same crash in the same spot.

Follow-up: I converted my code from using an actor to using a DispatchQueue instead (with minimal changes to the code), and now the crash does not occur.
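
For reference, a minimal sketch of the shape such a conversion can take (hypothetical names and simplified types, not the actual code): a class with a private serial DispatchQueue, where every access to the dictionary goes through queue.sync, so it stays serialized much as it did under the actor.

import Dispatch

// Hypothetical sketch of an actor -> DispatchQueue conversion: a serial queue
// guards the dictionary, so reads and writes stay serialized.
final class ConnectionRegistry {
    private let queue = DispatchQueue(label: "connection-registry")
    private var connectionsByUserId: [String: Set<ObjectIdentifier>] = [:]

    func isConnected(userId: String) -> Bool {
        queue.sync { connectionsByUserId[userId] != nil }
    }

    func add(connectionId: ObjectIdentifier, for userId: String) {
        queue.sync {
            var connections = connectionsByUserId[userId] ?? []
            connections.insert(connectionId)
            connectionsByUserId[userId] = connections
        }
    }

    func remove(connectionId: ObjectIdentifier, for userId: String) {
        queue.sync {
            guard var connections = connectionsByUserId[userId] else { return }
            connections.remove(connectionId)
            connectionsByUserId[userId] = connections.isEmpty ? nil : connections
        }
    }
}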

5.7 is released.

I can confirm that 5.7 fixed the crash! :champagne: :smiley: (tested with the actor build)
