i want to preface this by saying that the events i’m about to describe only affected worker nodes deep within our infrastructure and had no material impact on any of our clients. nobody is at fault, and i am only sharing this to better inform ways we can make Swift more suitable for use on the server.
with that out of the way, let’s jump into the timeline!
incident summary
yesterday, we deployed a small library update to a cluster of servers. the update was only intended to change the behavior of one particular process running on each machine, but we now know that - through the library - it also affected a number of daemonized Swift scripts running on the same servers.
i noticed that intracluster latencies were slowly creeping upwards, but i thought little of it at the time and suspected that a database query had somehow been pessimized.
this morning we realized that the bottom had fallen out overnight and that the cluster had ground to a halt due to exceptionally high CPU usage.
we spent a little time debugging the primary worker processes, but as it turns out, there was nothing wrong with the application itself. instead, the culprit was a peripheral maintenance script that runs as a daemon: it was crashing on startup and triggering Swift backtrace collection, in an endless loop. when i inspected the backtraces, i realized the script was attempting to load a non-existent .pem file and throwing a top-level error, which the Swift runtime interprets as a crash.
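to illustrate the failure mode, here’s a minimal sketch of the pattern the script followed. the path and function name are hypothetical stand-ins, not our actual code:

```swift
import Foundation

// hypothetical stand-in for the script's credential loading.
// String(contentsOfFile:encoding:) throws if the file does not exist.
func loadKey(at path: String) throws -> String {
    try String(contentsOfFile: path, encoding: .utf8)
}

// in the script, the call appeared at the top level, like this:
//
//     let key = try loadKey(at: "/path/to/server.pem")
//
// with the file missing, the error escapes main, the runtime treats it
// as a crash, and backtrace collection kicks in before the process dies.
```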
what safeguards failed?
daemons that fail on startup are a common occurrence on servers, which is why systemd will not attempt to restart a process that fails repeatedly in quick succession. this is an important safeguard that limits the fallout from this sort of scenario in other languages.
unfortunately, Swift applications (especially scripts) have a peculiarity that renders this safeguard ineffective:
- by default, try/throws in Swift at the top level is considered an “uncontrolled exception”, which triggers backtrace collection instead of gracefully exiting and reporting a failure status.
- Swift backtrace collection takes a long time from the perspective of the OS, which causes systemd to believe the process is doing meaningful work, as opposed to “failing repeatedly in quick succession”.
- because of the long interval between process launch and the conclusion of backtrace collection, systemd will continue attempting to restart the process until a higher-level circuit breaker kicks in.
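for concreteness, here’s roughly what the relevant knobs look like in a unit file. the directive values are illustrative, and SWIFT_BACKTRACE=enable=no is the documented switch for turning the Swift runtime backtracer off:

```ini
[Unit]
Description=peripheral maintenance daemon
# stop retrying after 5 failures within 30 seconds
StartLimitIntervalSec=30
StartLimitBurst=5

[Service]
ExecStart=/usr/local/bin/maintenance-script
Restart=on-failure
# without backtrace collection, a failing start is fast enough
# for the rate limit above to actually trip
Environment=SWIFT_BACKTRACE=enable=no
```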
what could have prevented this?
the immediate cause of this incident was attempting to load a .pem file from a script that didn’t have access to it. although it’s a really bad idea to hard-code private keys into client applications, this is pretty common practice for server applications, as the security calculus is a bit different. so we could have reduced the number of possible failure points by loading the private key from a String instead of a file.
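a sketch of that approach: the key ships inside the binary as a string literal, so the file-system failure mode disappears entirely. the envelope check below is my own illustration; real parsing would come from a cryptography library such as swift-crypto, which can still throw at runtime:

```swift
import Foundation

enum CredentialError: Error {
    case malformedPEM
}

// the key lives in the binary as a string literal; only the literal
// itself can be wrong, and that is caught on the first startup.
let serverKeyPEM = """
-----BEGIN PRIVATE KEY-----
...base64-encoded key material elided...
-----END PRIVATE KEY-----
"""

// a cheap sanity check on the PEM envelope; full ASN.1 parsing would be
// done by a cryptography library.
func checkPEMEnvelope(_ pem: String) throws {
    let trimmed = pem.trimmingCharacters(in: .whitespacesAndNewlines)
    guard trimmed.hasPrefix("-----BEGIN "), trimmed.hasSuffix("-----") else {
        throw CredentialError.malformedPEM
    }
}
```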
PEM+ASN.1 parsing is still failable at runtime, but one could imagine a Swift macro that parses and validates a PEM string at compile time. there is no fundamental reason why loading credentials should be failable at all, especially in a language that has macros.
but that’s losing sight of the bigger issue, which is that Swift allows, but cannot gracefully handle, throwing from main. our server application was written defensively to catch all top-level errors and log them manually, but our scripts were not so carefully written, and many of them are still liable to throw uncaught errors from main.
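the defensive pattern looks something like this - a sketch, with illustrative names and path rather than our actual code:

```swift
import Foundation

// run the script's real work inside a function that converts any thrown
// error into an exit status, so nothing ever escapes main.
func run(keyPath: String) -> Int32 {
    do {
        let key = try String(contentsOfFile: keyPath, encoding: .utf8)
        // ... use the key ...
        _ = key
        return 0
    } catch {
        // log the failure ourselves instead of letting the runtime treat
        // it as a crash and collect a backtrace
        FileHandle.standardError.write(Data("fatal: \(error)\n".utf8))
        return 1
    }
}
```

the script’s last line then becomes exit(run(keyPath: keyPath)), which reports failure to systemd the way a daemon is expected to.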
Swift as a scripting language
some might read this as a reason why Swift should not be used for scripting, but there is really no fundamental reason why writing scripts in Swift should be more dangerous than writing them in another language. errors in Swift are controlled failures that ought to beget controlled exits. the language is making a deliberate choice to convert these controlled failures into uncontrolled exits, which does not make a lot of sense to me.
once you’ve been burned by this a few times, it’s straightforward to code defensively against it, but that’s beside the point. crashing on try is a terrible default.
conclusions
in the very long term, macros might reduce the number of possible run-time failure modes, making this a less common occurrence. but that’s no substitute for gracefully handling top-level errors, especially in scripts.
we can invest some effort in educating developers, especially new users, about the perils of throwing from main, and we can encourage them to adopt more sophisticated patterns that defend against this. but ultimately, crashing on try is just a really bad default that must be changed.