Emergency Shutdown

dabrahams · December 16, 2025, 11:54pm

When a precondition is violated, it is sometimes important to take emergency shutdown measures that increase the likelihood of user satisfaction when the program is restarted, or of reproducing the issue that caused the precondition violation. For example, one could log all of the user's actions. Developers can very often reproduce the issue using such a log, and users can be offered the opportunity to apply one or more of these actions to the last explicitly-saved state of their document. Or, a program may want to release resources that aren’t automatically reclaimed when processes exit, such as a large temporary file.

I don't see a way to do this in Swift, and my experiments with signal handlers have not been successful. Does anyone have a method that works?

benpious · December 17, 2025, 12:51am

Full disclosure: I don't have any answers to this, just a bunch of questions:

I assume you're already aware of KSCrash and it's not what you're looking for? Are you looking for something that works on non-Apple platforms or for a way to write Async Signal Safe Code in Swift so you can build the equivalent in Swift?

Genuinely curious: how did you intend to write the "log of all the user's actions" to disk in an Async Signal Safe way? Asking for a friend who is my employer. I think most apps that care about this just write logs to disk preemptively so that the logs are available if the app crashes. I'm not sure how something like Firebase handles it; perhaps they have a more sophisticated technique.

tera · December 17, 2025, 8:30pm

In case you haven't seen this already: Implementing Your Own Crash Reporter.

dabrahams · December 17, 2025, 11:23pm

I’m actually not interested in crash reporting specifically. I’m interested in a general, cross-platform way to do something arbitrary, which could mean saving a copy of the document’s last-known good state… or maybe it means deleting a large temporary file that will no longer be relevant.

It’s possible this is a misguided quest—arguably a program should be just as resilient to a person tripping over the power cord as it is to early termination due to precondition violation, which means anything you’d want to do must be done preemptively, and there’s nothing to be done about the file other than locating it in a designated temporary directory.
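A minimal sketch of the "do it preemptively" approach, assuming a newline-delimited journal file that is flushed to disk before each action is applied. The `ActionJournal` type and its method names are hypothetical, invented here for illustration; a real application would also need to handle actions that contain newlines and decide when to truncate the journal.

```swift
import Foundation

/// Hypothetical append-only journal: each action reaches the disk before it is applied,
/// so nothing has to be written at the moment of a trap.
struct ActionJournal {
    let url: URL

    /// Appends one action as a line and asks the OS to flush it to storage.
    func record(_ action: String) throws {
        if !FileManager.default.fileExists(atPath: url.path) {
            FileManager.default.createFile(atPath: url.path, contents: nil)
        }
        let handle = try FileHandle(forWritingTo: url)
        defer { try? handle.close() }
        try handle.seekToEnd()
        try handle.write(contentsOf: Data((action + "\n").utf8))
        try handle.synchronize()
    }

    /// Reads back the recorded actions in order, e.g. after an unexpected termination.
    func replay() throws -> [String] {
        let text = try String(contentsOf: url, encoding: .utf8)
        return text.split(separator: "\n").map(String.init)
    }
}
```

On the next launch, the program could call `replay()` and offer to reapply the recorded actions to the last explicitly-saved document state.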

Joe_Groff · December 18, 2025, 12:27am

That would be my general advice. A crashed process is a potentially-compromised one, and what you can do by sifting through the wreckage is always going to be limited. Along similar lines to what you said, a robust program should also be able to "crash on success" and simply terminate the process once it's ready to quit without needing any cleanup on the way out. Admittedly, persistent shared resources such as terminal status or (as you noted) temporary files aren't always friendly to this ideal.

If you do want to observe or react to a process's abnormal termination, and your target platform is amenable, a more robust way to do so is to have a minimal parent process that spawns and monitors a child process that does the actual work. That way, if the child process does go awry, the monitor code won't be compromised, and you can perform cleanup or logging actions without having to worry about being signal-safe or spreading corruption. The Swift runtime itself takes this approach when its builtin backtrace functionality is enabled.
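As a rough sketch of the monitor idea, assuming a platform where Foundation's `Process` is available (so not iOS): the parent launches the worker, waits for it, and distinguishes a normal exit from termination by a signal. The `ChildOutcome` type and `runMonitored` function are hypothetical names used only for this example, and this is not how the Swift runtime's backtracer is implemented.

```swift
import Foundation

/// How the monitored child ended, as seen from the parent.
enum ChildOutcome: Equatable {
    case exited(status: Int32)
    case crashed(signal: Int32)
}

/// Runs `executable` as a child process and blocks until it terminates.
func runMonitored(_ executable: String, arguments: [String]) throws -> ChildOutcome {
    let child = Process()
    child.executableURL = URL(fileURLWithPath: executable)
    child.arguments = arguments
    try child.run()
    child.waitUntilExit()
    if child.terminationReason == .uncaughtSignal {
        // The child was killed by a signal (a trap, abort, segfault, ...).
        // Cleanup or logging can run here, in an uncompromised process.
        return .crashed(signal: child.terminationStatus)
    }
    return .exited(status: child.terminationStatus)
}
```

The parent stays small on purpose: the less it does, the less likely it is to fail alongside the worker.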

benpious · December 18, 2025, 4:54am

"If you do want to observe or react to a process's abnormal termination, and your target platform is amenable, a more robust way to do so is to have a minimal parent process that spawns and monitors a child process that does the actual work"

cries in iOS developer

dabrahams · December 18, 2025, 5:40pm

Yeah, but recovering any information stored in the possibly-wrecked state becomes quite difficult if not impossible for the monitor. I realize of course that you may be saying I shouldn’t want to.

Joe_Groff · December 18, 2025, 7:48pm

In some ways, with the right APIs, it's easier, since most OSes these days have a way you can poke at the child process in its crashed state, without executing code within the child process and potentially disturbing its state. Doing that does require some coordination for the child to put the interesting data at easily-discoverable addresses for the parent.

mayoff · December 18, 2025, 10:13pm

If you don't need to refer to the large temporary file by name after creating it (like, you already have an open fd on it and you'll never need to call open, link, rename, etc. on its name) then you can unlink it immediately after opening it (on Unix-likes at least).
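A small sketch of the unlink-after-open trick, assuming a Unix-like system. The function name `makeUnlinkedScratchFile` and the `scratch.XXXXXX` file name pattern are made up for this example; `mkstemp`, `unlink`, `write`, `lseek`, and `read` are the standard POSIX calls.

```swift
import Foundation

/// Creates a scratch file and removes its name immediately.
/// The storage stays usable through the returned descriptor and is reclaimed by the
/// OS when the descriptor is closed or the process ends, however it ends.
func makeUnlinkedScratchFile() -> (fd: Int32, formerPath: String)? {
    let pattern = URL(fileURLWithPath: NSTemporaryDirectory())
        .appendingPathComponent("scratch.XXXXXX").path
    var template = Array(pattern.utf8CString)  // mutable, NUL-terminated buffer
    let fd = mkstemp(&template)                // fills in the XXXXXX and opens the file
    guard fd >= 0 else { return nil }
    unlink(template)                           // the name is gone; the open fd remains valid
    let path = template.withUnsafeBufferPointer { String(cString: $0.baseAddress!) }
    return (fd, path)
}

/// Writes `bytes` to the scratch file, then reads them back from the start.
func roundTrip(_ bytes: [UInt8], through fd: Int32) -> [UInt8] {
    write(fd, bytes, bytes.count)
    lseek(fd, 0, SEEK_SET)
    var buffer = [UInt8](repeating: 0, count: bytes.count)
    let count = read(fd, &buffer, buffer.count)
    return Array(buffer.prefix(max(count, 0)))
}
```

Because the file no longer has a name, there is nothing left to delete if the process traps; the trade-off is that no other code can reopen it by path.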

Assuming that's not sufficient, what about spawning a child process (early in the main process's life) to take “emergency measures”? This is sort of the reverse of Joe's minimal parent suggestion, and has the advantage of preserving the semantics of wait and kill system calls that target the parent.

I assume a Unix-like system that allows child spawning (so, not iOS). The parent creates a pipe, sets FD_CLOEXEC on the write end, and connects the read end to the child's stdin. The write end closes when the parent exits for any reason, and the child reads EOF when that happens. I'm imagining a simple protocol where the parent can send filenames to the child for cleanup-on-exit, or send debug log messages to be written only if the parent crashes, and a clean-exit message to let the child know the parent is exiting normally, in which case the child can discard the debug log.
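The helper's side of such a protocol might look roughly like the sketch below. The message format (`cleanup <path>`, `log <message>`, `clean-exit`, one per line) and the names `HelperVerdict` and `readUntilParentExits` are assumptions made up for illustration; the post does not specify a wire format. Spawning the helper and wiring the pipe to its stdin are omitted, and read errors such as `EINTR` are treated like EOF for brevity.

```swift
import Foundation

/// What the helper learned from the parent before the pipe closed.
struct HelperVerdict: Equatable {
    var parentExitedCleanly = false
    var filesToDelete: [String] = []
    var pendingLog: [String] = []

    /// The debug log is only worth writing if the parent did not say goodbye.
    var logToWrite: [String] { parentExitedCleanly ? [] : pendingLog }
}

/// Reads newline-delimited messages from `fd` until EOF, i.e. until every
/// write end of the pipe is closed because the parent exited for any reason.
func readUntilParentExits(from fd: Int32) -> HelperVerdict {
    var bytes: [UInt8] = []
    var chunk = [UInt8](repeating: 0, count: 4096)
    while true {
        let count = read(fd, &chunk, chunk.count)
        if count <= 0 { break }  // 0 is EOF; errors are treated the same in this sketch
        bytes.append(contentsOf: chunk[0..<count])
    }
    var verdict = HelperVerdict()
    for line in String(decoding: bytes, as: UTF8.self).split(separator: "\n") {
        if line == "clean-exit" {
            verdict.parentExitedCleanly = true
        } else if line.hasPrefix("cleanup ") {
            verdict.filesToDelete.append(String(line.dropFirst("cleanup ".count)))
        } else if line.hasPrefix("log ") {
            verdict.pendingLog.append(String(line.dropFirst("log ".count)))
        }
    }
    return verdict
}
```

After the function returns, the helper would delete everything in `filesToDelete` and write out `logToWrite`, which is empty when the parent announced a clean exit.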

dabrahams · December 18, 2025, 11:02pm

@Joe_Groff in general, intelligibly poking at child data requires child code, because the child implements some abstraction over the data.

@mayoff Thanks, but I think I’m settling on “just do what you need to do pre-emptively.”

This is ultimately about what advice I’m giving in a book chapter, so cannot be platform-specific, and all the schemes that involve something happening after the trap are much too complicated to be a general recommendation. Someone would have to build a portable library for this before I’d even consider discussing it.

Thanks to everybody who responded; it was interesting and I landed on a simple answer, which is all to the good!

Joe_Groff · December 19, 2025, 12:06am

It sounds like you're happy with your answer, and "just say no" is certainly a fine one here, especially since the mechanisms for IPC are finicky and platform-dependent. A sufficiently determined language runtime implementer could implement their data structure abstractions in such a way that data can be interpreted both in-process and "offline" out-of-process. We've done this in an ad-hoc way for most of Swift's runtime metadata; aside from being able to inspect crashed process state, there would be other potential benefits to doing so, such as a debugger being able to nondestructively interpret code out-of-process (unlike how lldb injects code in-process and occasionally makes things worse when trying its best to evaluate expressions in a crashed process context).

dabrahams · December 24, 2025, 8:25pm

My book isn’t targeting language runtime implementers; it’s trying to give practical, general advice. Do you still think it’s relevant?

ksluder · December 24, 2025, 11:44pm

Have you asked any other audiences? If you have, I’m curious what answers you have gotten. Particularly if there’s a consensus among people who specialize in high-availability or high-resiliency systems.

dabrahams · December 24, 2025, 11:56pm

I haven’t

ksluder · December 25, 2025, 12:11am

As a non-expert who might like to learn from such a book, I might recommend fleshing out the possibilities a little more to make a more complete recommendation. (This is where the opinion of experts, especially those who work in other languages/ecosystems, would be valuable.)

Some of the interesting questions I’ve seen raised include:

Should language runtimes implement task isolation and fatal error handling, or delegate them to the host OS’s process abstraction?
- .NET is an ecosystem that started out one way and transitioned to the other.
Should programs that have detected an error condition proactively try to save user data before raising a fault, or should they terminate immediately and rely on separate, parallel techniques like automatic backups or journaling to recover user data?
- While I’m personally on the “crash immediately to avoid writing out corrupt data” side, I don’t think there’s universal consensus on this in application development. I wonder what the story is across filesystems and databases.
How verbose should an app’s logging be when deployed in production?
- It turns out storing and searching logs can get very expensive. It also turns out the story for live debugging is virtually nonexistent for some server ecosystems, so it’s the only way to debug a program failure.

dabrahams · February 11, 2026, 11:24pm

@processeus pointed out to me that

in a robotics system, you may need to perform safety measures, such as lowering a motorized arm slowly to avoid falling and damaging components. Often, the emergency stop also doesn't just cut the power but e.g. keeps holding onto suction cups so the robot doesn't drop a 10kg glass window.

I do think, to be a truly general-purpose language, Swift needs an answer for these cases. @processeus, could these cases be handled with a monitor process? I think once upon a time robots ran on OSes without process support, but I don't know if that's part of today's reality.

I think your first bullet is out-of-scope for the book, but the others could be interesting. Would you mind opening issues for them where we can explore them in more detail?

processeus · February 12, 2026, 12:20am

Nowadays there are all kinds of control techniques for robots, from microcontrollers without an OS, to lighter-weight real-time OSes like FreeRTOS, or even embedded Linux. Whatever method you choose, we have to consider what type of crash we are planning to prepare for.

When talking about how we must handle the crash, we need to think about what caused it. I'll describe an error handling scheme I designed for a factory line.

logic error in the code, but unrelated to the critical subsystem responsible for emergency stop
- When noticing the issue, we could immediately issue an explicit wind-down routine. It requires careful consideration to be sure that the subsystem responsible for the wind-down is independent enough from the logic error that its invariants are not yet broken. This requires reasoning about the potential likely sources of the incorrectness of the program, which is all about assumed probabilities, no guarantees. Especially in embedded applications, there can be all kinds of concurrency issues with interrupts that are very hard to reason about, and defy the expectations of independence (rendering all our predictions wrong). Therefore I would discourage such attempts.
- Similarly to an external process, robots can have an external, dedicated microcontroller whose only job is to watch the beefy, complicated Linux microcontroller, and see if it crashes. It can also get emergency stop conditions from sensors directly to respond to certain deterministic events immediately. This safety unit can at any point disconnect the Linux machine and take over full control of the output peripherals.
external hardware failure in a subsystem, or unexpected sensed state (e.g. an incoming croissant's weight is measured to be 100kg)
- We need to immediately stop this subsystem in the simplest and quickest way to avoid damaging hardware or products on the line. The other subsystems may go on with their operation until they hit some dependency point with the halted subsystem (e.g. they would need to forward/receive an item from it). At that point, they can gracefully stop. In a long production line it's very useful if the system's different independent parts can continue operation and don't terminate in a hard-to-recover state. A worker would otherwise need to pick up all the croissants from the 50m long conveyor belt that the functioning half of the system could have finished on its own.
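The decision logic of the external safety unit described above can be modelled in a few lines. This is only an illustrative sketch: the heartbeat mechanism, the `Watchdog` type, and the millisecond timestamps are assumptions rather than details of the factory-line design, and a real safety unit would run on its own hardware.

```swift
/// Hypothetical model of an independent safety unit watching a main controller.
struct Watchdog {
    let timeoutMs: Int
    private(set) var lastHeartbeatMs: Int
    private(set) var hasTakenOver = false

    init(timeoutMs: Int, nowMs: Int) {
        self.timeoutMs = timeoutMs
        self.lastHeartbeatMs = nowMs
    }

    /// Called whenever the main controller signals that it is still alive.
    mutating func heartbeat(atMs nowMs: Int) {
        // Once the safety unit has taken over, late heartbeats are ignored.
        if !hasTakenOver { lastHeartbeatMs = nowMs }
    }

    /// Polled by the safety unit; returns true when it must disconnect the
    /// main controller and drive the output peripherals itself.
    mutating func check(atMs nowMs: Int, emergencyStopPressed: Bool) -> Bool {
        if emergencyStopPressed || nowMs - lastHeartbeatMs > timeoutMs {
            hasTakenOver = true
        }
        return hasTakenOver
    }
}
```

The takeover is latched in this sketch, so a controller that recovers after a missed deadline does not silently regain control; whether that is the right policy depends on the machine.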

This is probably an oversimplification, and I'm not a robotics expert, so I'm interested to hear about other error/crash classification and handling strategies.