SwiftNIO needs a noassert mode, does this exist?

the crash in HTTP2CommonInboundStreamMultiplexer.swift was fixed in late January. you are correct that when i encountered that crash the second time (after January), it had already been fixed; we simply had not picked up the fix, because our upToNextMinor requirements were stale and had not been bumped since December. we are going to change our procedures and update dependency requirements more frequently than we did before.

as of:

package            version
swift-nio          2.64.0
swift-nio-ssl      2.26.0
swift-nio-http2    1.30.0
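
concretely, the procedural change amounts to loosening the requirement style in Package.swift, something like this (a sketch: the old pin shown is hypothetical, the package name and target layout are made up, and only the dependency URLs and the versions above are real):

// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyServer", // hypothetical package name
    dependencies: [
        // before: upToNextMinor only allows patch releases within one minor
        // version, so a fix shipped in a later minor never arrives until
        // someone bumps the pin by hand
        //.package(url: "https://github.com/apple/swift-nio.git",
        //    .upToNextMinor(from: "2.62.0")), // hypothetical old pin

        // after: allow anything up to the next major version, so minor and
        // patch releases with crash fixes come in on `swift package update`
        .package(url: "https://github.com/apple/swift-nio.git",
            from: "2.64.0"),
        .package(url: "https://github.com/apple/swift-nio-ssl.git",
            from: "2.26.0"),
        .package(url: "https://github.com/apple/swift-nio-http2.git",
            from: "1.30.0"),
    ],
    targets: [
        .executableTarget(name: "MyServer",
            dependencies: [
                .product(name: "NIOCore", package: "swift-nio"),
                .product(name: "NIOHTTP2", package: "swift-nio-http2"),
                .product(name: "NIOSSL", package: "swift-nio-ssl"),
            ]),
    ]
)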

the crash in NIOAsyncWriter.InternalClass.deinit is still occurring, at least in a trivial experiment. based on Cory’s answer here, it sounds like the fix is going to require some substantial refactoring inside SwiftNIO.

in the same answer, he suggested wrapping the channel in a reference-counted class that touches the channel on deinit to work around the crash. i initially imagined he meant something like this:

deinit
{
    Task<Void, any Error>.init
    {
        try await self.frames.executeThenClose { _, _ in }
    }
}

which… also crashed…

Object 0x7f09b4001860 of class InternalClass deallocated with non-zero retain count 2. This object's deinit, or something called from it, may have created a strong reference to self which outlived deinit, resulting in a dangling reference.
swift-runtime: unable to suspend thread 101706
swift-runtime: unable to suspend thread 101706

💣 Program crashed: Bad pointer dereference at 0x0000000000000000

this is actually a very old bug that i was fortunate enough to recognize from having frequented these forums for many years. it can be worked around by copying the stored properties to local variables within the deinit. i mention it because i think it illustrates a systemic problem: not everyone we want to start using Swift on the server is going to be familiar enough with the bug lore to have a nuanced understanding of Swift runtime crashes. they are just going to see a bunch of stack dumps in journalctl and conclude that Swift is crashy and unreliable.
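
for completeness, the non-crashing form of the workaround looks something like this (a sketch; `frames` is the same stored property as in the snippet above, and the only change is copying it into a local before the Task closure captures anything):

deinit
{
    // copy the stored property into a local variable first, so the
    // escaping Task closure captures the local value rather than `self`
    let frames = self.frames
    Task<Void, any Error>.init
    {
        try await frames.executeThenClose { _, _ in }
    }
}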

this is encouraging, and i am grateful that the NIO team is willing to revise its practices in this area. this sends a positive signal about the viability of Swift on the server.

recovery time is really important here. obviously, it is best to never crash at all. but if you do crash, the paramount objective is to get up and running again as quickly as possible.

sometimes you crash on a Googlebot request. there is nothing to salvage there; you just have to eat the search penalty. but sometimes you crash on a request from a bot you don’t really care about, like Ahrefsbot. in that situation, it’s very important to get back up immediately, in case the next request is from Googlebot.

today, there are basically two things that can happen when a Swift application crashes:

  1. it goes into backtrace collection, which succeeds and takes about 20–30 seconds to complete.

  2. it goes into backtrace collection at an unlucky time, consumes all available RAM on the node, and locks up the entire node. the node then has to be rebooted through the AWS console. this type of outage can last for days unless you have a human on call 24/7.

situation #2 is an extreme case (that still happens all too often), but even situation #1 can be damaging, especially if it occurs multiple times per day. 30 seconds is a long time for a server to be down.
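
for reference, the levers that exist today are mostly outside the application: the runtime backtracer is controlled through the SWIFT_BACKTRACE environment variable, and systemd can restart the service as soon as the process actually exits. a sketch, assuming a systemd-managed service (the unit name and binary path are hypothetical):

# /etc/systemd/system/myserver.service  (hypothetical unit)
[Unit]
Description=Swift application server
After=network.target

[Service]
ExecStart=/usr/local/bin/myserver
# `enable=no` turns the runtime backtracer off entirely, so a crash does
# not spend 20-30 seconds (or all of the node's RAM) collecting a trace
Environment=SWIFT_BACKTRACE=enable=no
# restart as soon as the process exits, rather than waiting for a human
Restart=on-failure
RestartSec=1

[Install]
WantedBy=multi-user.target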
