i want to preface this by saying that the events i’m about to describe only affected worker nodes deep within our infrastructure and had no material impact on any of our clients. nobody is at fault, and i am only sharing this to better inform ways we can make Swift more suitable for use on the server.
with that out of the way, let’s jump into the timeline!
incident summary
yesterday, we deployed a small library update to a cluster of servers. the update was only intended to change the behavior of one particular process running on each machine, but we now know that - through the library - it also affected a number of daemonized Swift scripts running on the same servers.
i noticed that intracluster latencies were slowly creeping upwards, but i thought little of it at the time and suspected that a database query had somehow been pessimized.
this morning we realized that the bottom had fallen out overnight and that the cluster had ground to a halt due to exceptionally high CPU usage.
we spent a little time debugging the primary worker processes, but as it turns out, there was nothing wrong with the application itself. instead, the culprit was a peripheral maintenance script that runs as a daemon: it was crashing on startup and triggering Swift backtrace collection, in an endless loop. when i inspected the backtraces, i realized the script was attempting to load a non-existent .pem file and throwing a top-level error, which the Swift runtime interprets as a crash.
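to illustrate the failure mode, here’s a minimal sketch of the pattern the script followed. the path and function name are hypothetical stand-ins, not our actual code:

```swift
import Foundation

// hypothetical stand-in for the script's credential loading.
// String(contentsOfFile:encoding:) throws if the file does not exist.
func loadKey(at path: String) throws -> String {
    try String(contentsOfFile: path, encoding: .utf8)
}

// in the script, the call appeared at the top level, like this:
//
//     let key = try loadKey(at: "/path/to/server.pem")
//
// with the file missing, the error escapes main, the runtime treats it
// as a crash, and backtrace collection kicks in before the process dies.
```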
what safeguards failed?
daemons that fail on startup are a common occurrence on servers, which is why systemd will not attempt to restart a process that fails repeatedly in quick succession. this is an important safeguard that limits the fallout from this sort of scenario in other languages.
unfortunately, Swift applications (especially scripts) have a peculiarity that renders this safeguard ineffective:
- by default, try/throws in Swift at the top level is considered an “uncontrolled exception”, which triggers backtrace collection instead of gracefully exiting and reporting a failure status.
- Swift backtrace collection takes a long time from the perspective of the OS, which causes systemd to believe the process is doing meaningful work, as opposed to “failing repeatedly in quick succession”.
- because of the long interval between process launch and the conclusion of backtrace collection, systemd will continue attempting to restart the process until a higher-level circuit breaker kicks in.
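for concreteness, here’s roughly what the relevant knobs look like in a unit file. the directive values are illustrative, and SWIFT_BACKTRACE=enable=no is the documented switch for turning the Swift runtime backtracer off:

```ini
[Unit]
Description=peripheral maintenance daemon
# stop retrying after 5 failures within 30 seconds
StartLimitIntervalSec=30
StartLimitBurst=5

[Service]
ExecStart=/usr/local/bin/maintenance-script
Restart=on-failure
# without backtrace collection, a failing start is fast enough
# for the rate limit above to actually trip
Environment=SWIFT_BACKTRACE=enable=no
```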
what could have prevented this?
the immediate cause of this incident was attempting to load a .pem file from a script that didn’t have access to it. although it’s a really bad idea to hard-code private keys into client applications, this is pretty common practice for server applications, as the security calculus is a bit different. so we could have reduced the number of possible failure points by loading the private key from a String instead of a file.
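a sketch of that approach: the key ships inside the binary as a string literal, so the file-system failure mode disappears entirely. the envelope check below is my own illustration; real parsing would come from a cryptography library such as swift-crypto, which can still throw at runtime:

```swift
import Foundation

enum CredentialError: Error {
    case malformedPEM
}

// the key lives in the binary as a string literal; only the literal
// itself can be wrong, and that is caught on the first startup.
let serverKeyPEM = """
-----BEGIN PRIVATE KEY-----
...base64-encoded key material elided...
-----END PRIVATE KEY-----
"""

// a cheap sanity check on the PEM envelope; full ASN.1 parsing would be
// done by a cryptography library.
func checkPEMEnvelope(_ pem: String) throws {
    let trimmed = pem.trimmingCharacters(in: .whitespacesAndNewlines)
    guard trimmed.hasPrefix("-----BEGIN "), trimmed.hasSuffix("-----") else {
        throw CredentialError.malformedPEM
    }
}
```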
PEM+ASN.1 parsing is still failable at runtime, but one could imagine a Swift macro that parses and validates a PEM string at compile time. there is no fundamental reason why loading credentials should be failable at all, especially in a language that has macros.
but that’s losing sight of the bigger issue, which is that Swift allows, but cannot gracefully handle, throwing from main. our server application was written defensively to catch all top-level errors and log them manually, but our scripts were not so carefully written, and many of them are still liable to throw uncaught errors from main.
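the defensive pattern looks something like this - a sketch, with illustrative names and path rather than our actual code:

```swift
import Foundation

// run the script's real work inside a function that converts any thrown
// error into an exit status, so nothing ever escapes main.
func run(keyPath: String) -> Int32 {
    do {
        let key = try String(contentsOfFile: keyPath, encoding: .utf8)
        // ... use the key ...
        _ = key
        return 0
    } catch {
        // log the failure ourselves instead of letting the runtime treat
        // it as a crash and collect a backtrace
        FileHandle.standardError.write(Data("fatal: \(error)\n".utf8))
        return 1
    }
}
```

the script’s last line then becomes exit(run(keyPath: keyPath)), which reports failure to systemd the way a daemon is expected to.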
Swift as a scripting language
some might read this as a reason why Swift should not be used for scripting, but there is really no fundamental reason why writing scripts in Swift should be more dangerous than writing them in another language. errors in Swift are controlled failures that ought to beget controlled exits. the language is making a deliberate choice to convert these controlled failures into uncontrolled exits, which does not make a lot of sense to me.
once you’ve been burned by this a few times, it’s straightforward to code defensively against it, but that’s beside the point. crashing on try is a terrible default.
conclusions
in the very long term, macros might reduce the number of possible run-time failure modes, making this a less common occurrence. but that’s no substitute for gracefully handling top-level errors, especially in scripts.
we can invest some effort in educating developers, especially new users, about the perils of throwing from main, and we can encourage them to adopt more sophisticated patterns that defend against this. but ultimately, crashing on try is just a really bad default that must be changed.