SwiftNIO needs a noassert mode, does this exist?

i have run into several instances (1, 2) of runtime crashes in SwiftNIO due to internal precondition checks in the library, and it sounds like i am not the only one. i understand precondition checks are valuable to the library's maintainers. but as a consumer of the library, i cannot stress enough how economically damaging these precondition failures are to our internet business.

to make this more concrete for folks, here is a graph of Google search indexing for one of our storefronts:

[image: Google search indexing graph for the storefront]

around early February, we experienced transient service disruptions due to a crashing application, which led Google to delist nearly sixty percent of our pages (the green bars), and it has taken this particular storefront about a month to recover its listings and lost pagerank.

although regular disasters are an inherent part of doing any sort of business, it is hard to trace the root cause of this particular mode of disaster to anything other than “our” (my) decision to use Swift on the server.

is there a way to disable these precondition checks at compile time, so that we do not continue experiencing similar disasters?

2 Likes

To state the obvious: -Ounchecked.

if i understand correctly, that applies to the entire build tree, whereas i am only interested in disabling precondition checks within SwiftNIO.
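
to illustrate the gap, here is a sketch of a hypothetical Package.swift: unsafeFlags can scope -Ounchecked to targets i own, but SwiftPM has no way to apply build settings to a dependency's targets (and rejects unsafeFlags in packages consumed as dependencies), so there is no supported way to aim it at SwiftNIO alone.

// swift-tools-version:5.9
// a sketch of a hypothetical manifest, not a recommendation: -Ounchecked
// removes precondition checks, but only for the targets listed here.
import PackageDescription

let package = Package(
    name: "MyServer",   // hypothetical package name
    dependencies: [
        .package(url: "https://github.com/apple/swift-nio.git", from: "2.64.0"),
    ],
    targets: [
        .executableTarget(
            name: "MyServer",
            dependencies: [.product(name: "NIOCore", package: "swift-nio")],
            swiftSettings: [
                .unsafeFlags(["-Ounchecked"])   // applies to this target only,
                                                // not to SwiftNIO's own modules
            ]
        ),
    ]
)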

If the code failed a precondition check, that means there was a serious programming error. If you disable these checks, then the only place for code to proceed into is undefined behavior, where security exploits and data corruption lie. As damaging as losing service is, it's not nearly as bad as corrupting or exfiltrating your customers' data.

Rather than try to eliminate assertion checks completely, it would be more realistic to talk about what we could do to make it so that handling and recovering from an assertion failure has lower overhead. Taking down a whole process is too severe for servers and many other applications, and it would be good for us to have some way to let a supervisor process wind down gracefully when one job runs into an unrecoverable situation.

5 Likes

i understand this argument, as i have often made it myself. however, there is a crucial difference between serious programming errors in code you (or your teammate) wrote, and serious programming errors in libraries.

if you hit a precondition failure in code your team is responsible for, there is only one sensible course of action: all hands on deck until the crash is fixed.

if you hit a precondition failure in libraries you depend on, you don’t have a ton of great options. “stop all operations until somebody in a completely different organizational hierarchy resolves the issue” is not a real course of action, it is more of a signal that you have fundamental supply chain issues and need to find a different supplier. if there are no alternative suppliers, then that means your venture is a dead end. i do not want our organization to become a cautionary tale that investors use to argue that businesses that use Swift on Server are doomed. this is obviously bad for us, and bad for the Swift community as a whole.

a decent stop-gap would be to have some way to recover from fatalError and friends, as discussed in this thread: Testing fatalError and friends.

i think many developers are conceiving of it exclusively in terms of testing, but having some way of recreating a crashed MultiThreadedEventLoopGroup would also be valuable in production. however, it would be tough to separate crashes that happen in our code from crashes that happen in library code, because they all run on the same thread group. the goal is not to disable all precondition failures, just precondition failures in code we do not control.

3 Likes

@taylorswift Your two crashes are in the new NIO async support, and I believe they have since been fixed. The other crash you reference is due to a miscompilation.

As pointed out by others, SwiftNIO uses preconditions only for very severe errors that should never happen; you really shouldn't just ignore those. But software does have bugs, and the async support is still new and doesn't benefit from years' worth of testing. Finally, make sure to always run the latest versions: as I understand it, at least one of the two bugs you ran into was already fixed when you hit it.

6 Likes

To elaborate on this position, NIO uses preconditions for 3 fundamentally quite different reasons:

  1. To protect against invoking undefined behaviour, e.g. out-of-bounds memory access. These exist in our core data types, like ByteBuffer and friends. Disabling these preconditions is a very bad idea: if you hit them, you're probably screwed.

  2. To protect against users causing issues that are very challenging to debug, e.g. leaking promises, by misusing the API. This is an attempt to turn what would be a silent failure into a loud one. As an example here, if you drop all references to an unfulfilled promise then the callbacks will never execute, and so you will quietly get things "not happening" (sketched in code just after this list). It is a programmer error to drop the promise.

    For this case, we could imagine having some alternate modes of operation. It seems plausible to imagine having an environment variable that turns these into some other, more silent, failure mode. The problem is that debugging these will be nigh-on impossible. As more diagnostic tools become available in the language we could consider adopting them to make this easier: for example, emitting backtraces on dropped promises instead of running through them.

    Nonetheless, while the crash is in NIO code, the actual error is in the caller's code. This remains an action on the caller, not on the NIO team.

  3. An attempt to guard against something believed to be impossible.

    The majority of these are in NIO state machines, and exist because a certain edge was believed to be absolutely unreachable. This was best practice in our team for a long time: unreachable edges are untestable edges, and if the edge is actually impossible it's a good idea to signal that in the code so that developers and maintainers don't have to worry about what happens if it happens.
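
To make point (2) concrete, here is a rough sketch of the leaked-promise scenario; the surrounding setup is illustrative, not NIO's internal code:

import NIOCore
import NIOPosix

let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)
defer { try! group.syncShutdownGracefully() }
let eventLoop = group.next()

func startWork(on eventLoop: EventLoop) -> EventLoopFuture<Void> {
    let promise = eventLoop.makePromise(of: Void.self)
    // BUG: no code path ever calls promise.succeed(()) or promise.fail(_:),
    // so anything chained onto the returned future will never run.
    return promise.futureResult
}

let future = startWork(on: eventLoop)
future.whenComplete { _ in
    print("this never executes")   // the silent failure we try to surface
}
// once the last reference to the unfulfilled promise goes away, NIO's checks
// turn this silent failure into a loud one instead of letting the program
// quietly hang on work that will never complete.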

Unfortunately, developers are humans and so sometimes we get things wrong. The issues you're concerned about fundamentally fall into this third category. If you actually hit any of the preconditions in parts (1) and (2), they are signalling very real bugs in your program that really do need to be fixed. We absolutely cannot run past the errors in part (1), and the preconditions in part (2) are there to help make services reliable.

Part (3) is a more interesting one, because errors here tend to be the worst. That's because when these are hit it usually turns out that something believed to be impossible, actually wasn't.

To that end we've revised our best practices on these. For newer state machines, we tend to follow a pattern where we assert on the impossible path, and then throw an error. This allows users to gracefully tear the connection down, while still producing "loud" and easy-to-debug errors in debug mode.
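
As a sketch, that pattern looks something like the following; the state machine and error names here are illustrative, not NIO's actual code:

struct StreamStateMachine {
    enum State { case idle, open, closed }
    struct InvalidStateTransition: Error {}

    private var state: State = .idle

    mutating func receiveHeaders() throws {
        switch self.state {
        case .idle:
            self.state = .open
        case .open:
            break   // the normal, expected paths
        case .closed:
            // believed to be unreachable: loud in debug builds, but in
            // release builds a recoverable error the caller can use to
            // tear the connection down gracefully.
            assertionFailure("received headers on a closed stream")
            throw InvalidStateTransition()
        }
    }
}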

5 Likes

I don't think this supply chain problem is specific to Swift; in Java or Python land, you have to deal with libraries raising runtime exceptions, Go and Rust have panics, and so on. The Swift specific problem is what you're able to do about them. In those other languages, you can "handle" a serious error in a library to some degree and contain the damage, and I agree it's a real problem for Swift in server and other applications that you can't do that at all. But rarely is swallowing the error and going on the right thing to do in those other languages either.

In our development of embedded Swift, we've found that we can handle precondition traps by calling a common fatal error function rather than a trap instruction with minimal code size overhead, at least on ARM where a call and a trap are both four bytes. If we do that, then we should be able to pretty cleanly separate "clean" precondition failures from signals raised by out-of-control unsafe code.

Although we still wouldn't be able to really continue execution on the same task or thread after a precondition failure, we could potentially treat only the affected task as hung on the precondition, and allow the rest of the system to keep executing. We'd want to have some way to tell a supervisor that a precondition failure occurred so that it can wind down the process, since it's still not in a good state, and the hung task could be holding on to resources or locks that ultimately deadlock the process, but that could be a way to keep a crash serving one request from taking down all of the requests in the same process.

5 Likes

the crash in HTTP2CommonInboundStreamMultiplexer.swift was fixed in late January. you are correct that when i encountered that crash the second time (after January), it had already been fixed, and we had not obtained the fix because we had stale upToNextMinor requirements that had not been updated since December. we are going to change our procedures to bump dependency requirements more frequently than we did before.

as of

package          version
swift-nio        2.64.0
swift-nio-ssl    2.26.0
swift-nio-http2  1.30.0

the crash in NIOAsyncWriter.InternalClass.deinit is still occurring, at least from a trivial experiment. based on Cory’s answer here, it sounds like the fix is going to require some substantial refactoring inside SwiftNIO.

in the same answer, he suggested wrapping the channel in a reference-counted class that touches the channel on deinit to work around the crash. i initially imagined he meant something like this:

deinit
{
    Task<Void, any Error>.init
    {
        try await self.frames.executeThenClose { _, _ in }
    }
}

which… also crashed…

Object 0x7f09b4001860 of class InternalClass deallocated with non-zero retain count 2. This object's deinit, or something called from it, may have created a strong reference to self which outlived deinit, resulting in a dangling reference.
swift-runtime: unable to suspend thread 101706
swift-runtime: unable to suspend thread 101706

💣 Program crashed: Bad pointer dereference at 0x0000000000000000

this is actually a very old bug that i was fortunate enough to have recognized from having frequented these forums for many years. it can be worked around by copying the stored properties to local variables within the deinit. i mention it because i think it illustrates a systemic problem: not everyone we want to start using Swift on the server is going to be familiar enough with the bug lore to have a nuanced understanding of Swift runtime crashes; they are just going to view a bunch of stack dumps in journalctl and conclude that Swift is crashy and unreliable.
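
for completeness, here is the variant of the earlier deinit that avoids the dangling-self problem; a sketch, assuming frames is the only stored property the closure needs:

deinit
{
    // copy the stored property to a local first, so the escaping closure
    // captures the value directly instead of capturing (and outliving) self
    let frames = self.frames
    Task<Void, any Error>.init
    {
        try await frames.executeThenClose { _, _ in }
    }
}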

this is encouraging, and i am grateful that the NIO team is willing to revise its practices in this area. this sends a positive signal about the viability of Swift on the server.

recovery time is really important here. obviously, it is best to never crash at all. but if you do crash, the paramount objective is to get up and running again as quickly as possible.

sometimes you crash on a Googlebot request. there is nothing that can be salvaged there, you just have to eat the search penalty. but sometimes we crash on a request from someone we don’t really care about, like an Ahrefsbot. in that situation, it’s very important to get up immediately, in case that request is followed by a Googlebot request.

today, there are basically two things that can happen when a Swift application crashes:

  1. it goes into backtrace collection, which succeeds and takes about 20–30 seconds to complete.

  2. it goes into backtrace collection, at an unlucky time, and consumes all available RAM on the node, which locks up the entire node. the node must be rebooted through the AWS console. this type of outage can last for days unless you have a human on-call 24/7.

situation #2 is an extreme case (that still happens all too often), but even situation #1 can be damaging, especially if it occurs multiple times per day. 30 seconds is a long time for a server to be down.

3 Likes

That sounds like a pretty severe problem with backtrace collection then. @al45tair are there ways we could limit resource usage of backtrace collection in cases where the machine is near its limits?

1 Like

If it helps, there is a change in Swift 5.10 that should improve this; specifically, you can set SWIFT_BACKTRACE=symbolicate=fast or even SWIFT_BACKTRACE=symbolicate=off, both of which should have a fairly dramatic impact on the amount of time and memory that gets used when the program crashes.

Those options are specifically intended for the exact use-case we're talking about here; that is, getting things running again as quickly as possible while still providing actionable information about the problem that occurred.
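
For a systemd-managed service (since journalctl came up earlier in the thread), wiring that up could look like the following drop-in; the unit name and path are purely illustrative:

# /etc/systemd/system/my-server.service.d/backtrace.conf  (hypothetical unit)
[Service]
# limit symbolication work after a crash so the service restarts sooner
Environment=SWIFT_BACKTRACE=symbolicate=fast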

6 Likes