Emergency Shutdown

When a precondition is violated, it is sometimes important to take emergency shutdown measures that increase the likelihood of user satisfaction when the program is restarted, or of reproducing the issue that caused the precondition violation. For example, one could log all of the user's actions. Developers can very often reproduce the issue using such a log, and users can be offered the opportunity to apply one or more of these actions to the last explicitly-saved state of their document. Or, a program may want to release resources that aren’t automatically reclaimed when processes exit, such as a large temporary file.

I don't see a way to do this in Swift, and my experiments with signal handlers have not been successful. Does anyone have a method that works?


Full disclosure: I don't have any answers to this, just a bunch of questions:

I assume you're already aware of KSCrash and it's not what you're looking for? Are you looking for something that works on non-Apple platforms, or for a way to write async-signal-safe code in Swift so you can build the equivalent yourself?

Genuinely curious: how did you intend to write the "log of all the user's actions" to disk in an async-signal-safe way? Asking for a friend who is my employer. I think most apps that care about this just write logs to disk preemptively, so that the logs are already available if the app crashes. I'm not sure how something like Firebase handles it; perhaps they have a more sophisticated technique.

In case you haven't seen this already: Implementing Your Own Crash Reporter.


I’m actually not interested in crash reporting specifically. I’m interested in a general, cross-platform way to do something arbitrary, which could mean saving a copy of the document’s last-known good state… or maybe it means deleting a large temporary file that will no longer be relevant.

It’s possible this is a misguided quest—arguably a program should be just as resilient to a person tripping over the power cord as it is to early termination due to precondition violation, which means anything you’d want to do must be done preemptively, and there’s nothing to be done about the file other than locating it in a designated temporary directory.
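As a rough illustration of "done preemptively", here is a minimal sketch of an append-only action journal that is flushed to disk after every entry, so nothing has to happen at crash time. It is written in Python only to keep it short; the `ActionJournal` class, the `replay` function, and the JSON-lines file format are all assumptions made for this example, not anything proposed in the thread.

```python
import json
import os


class ActionJournal:
    """Append-only log of user actions, synced to disk after every entry."""

    def __init__(self, path):
        # O_APPEND makes every write land at the current end of the file.
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def record(self, action):
        # One JSON object per line, so a torn final entry is easy to detect.
        line = json.dumps(action) + "\n"
        os.write(self._fd, line.encode("utf-8"))
        os.fsync(self._fd)  # ask the OS to push the entry to stable storage

    def close(self):
        os.close(self._fd)


def replay(path):
    """Return every complete entry; an unterminated final line is dropped."""
    actions = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.endswith("\n"):
                break  # the process died partway through this write
            actions.append(json.loads(line))
    return actions
```

After a restart, `replay` yields the actions that could be offered to the user for reapplying to the last explicitly saved document. The cost is one `fsync` per action, which may be too slow for high-frequency events; batching is the obvious tradeoff.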


That would be my general advice. A crashed process is a potentially-compromised one, and what you can do by sifting through the wreckage is always going to be limited. Along similar lines to what you said, a robust program should also be able to "crash on success" and simply terminate the process once it's ready to quit without needing any cleanup on the way out. Admittedly, persistent shared resources such as terminal status or (as you noted) temporary files aren't always friendly to this ideal.

If you do want to observe or react to a process's abnormal termination, and your target platform is amenable, a more robust way to do so is to have a minimal parent process that spawns and monitors a child process that does the actual work. That way, if the child process does go awry, the monitor code won't be compromised, and you can perform cleanup or logging actions without having to worry about being signal-safe or spreading corruption. The Swift runtime itself takes this approach when its builtin backtrace functionality is enabled.
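A minimal sketch of the monitor-process idea, in Python for brevity and assuming a POSIX system: the parent does nothing except launch the worker, wait for it, and react if it ended abnormally. The names `run_monitored` and `describe_exit` are made up for this example, and the Swift runtime's own backtracer is not implemented this way in detail.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable status."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative number.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status " + str(returncode)


def run_monitored(worker_argv, on_abnormal_exit):
    """Run the worker to completion; call on_abnormal_exit(returncode) on failure."""
    returncode = subprocess.run(worker_argv).returncode
    if returncode != 0:
        # This code runs in the healthy parent, so it is free to allocate,
        # take locks, write logs, or delete temporary files.
        on_abnormal_exit(returncode)
    return returncode
```

For instance, `run_monitored([sys.executable, "-c", "import os; os.abort()"], handler)` calls `handler` with the negated value of `SIGABRT`, with `os.abort()` standing in for a trapped precondition failure.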


"If you do want to observe or react to a process's abnormal termination, and your target platform is amenable, a more robust way to do so is to have a minimal parent process that spawns and monitors a child process that does the actual work"

cries in iOS developer


Yeah, but recovering any information stored in the possibly-wrecked state becomes quite difficult if not impossible for the monitor. I realize of course that you may be saying I shouldn’t want to.

In some ways, with the right APIs, it's easier, since most OSes these days have a way you can poke at the child process in its crashed state, without executing code within the child process and potentially disturbing its state. Doing that does require some coordination for the child to put the interesting data at easily-discoverable addresses for the parent.
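To make the "coordination" part concrete, here is a loose analogy rather than real crashed-process inspection: the worker publishes a fixed-layout record into a file-backed shared mapping, and the monitor decodes it from the layout description alone, without running any worker code. The record layout, the `STAT` magic value, and the function names are all hypothetical. A real implementation would read the child's memory through OS-specific APIs instead of a file.

```python
import mmap
import struct

# Hypothetical layout, little-endian with no padding:
# 4-byte magic, u32 version, u64 last-saved revision, 64-byte NUL-padded name.
RECORD = struct.Struct("<4sIQ64s")
MAGIC = b"STAT"


def create_status_file(path):
    """Reserve space for exactly one record."""
    with open(path, "wb") as f:
        f.write(b"\0" * RECORD.size)


def publish(path, revision, document_name):
    """Worker side: overwrite the record through a shared mapping."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), RECORD.size) as view:
        RECORD.pack_into(view, 0, MAGIC, 1, revision,
                         document_name.encode("utf-8"))


def inspect(path):
    """Monitor side: decode the record using only the agreed layout."""
    with open(path, "rb") as f:
        data = f.read(RECORD.size)
    magic, version, revision, name = RECORD.unpack(data)
    if magic != MAGIC or version != 1:
        return None  # never published, or written by an incompatible worker
    return revision, name.rstrip(b"\0").decode("utf-8", errors="replace")
```

The point of the fixed layout is that `inspect` depends only on the format, not on the worker's in-memory abstractions. That is also its limit: anything richer than a flat record needs a correspondingly richer agreed format.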


If you don't need to refer to the large temporary file by name after creating it (like, you already have an open fd on it and you'll never need to call open, link, rename, etc. on its name) then you can unlink it immediately after opening it (on Unix-likes at least).
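A minimal sketch of the unlink-after-open trick, in Python for brevity and assuming a Unix-like system. The function name `open_anonymous_scratch_file` is made up for this example.

```python
import os
import tempfile


def open_anonymous_scratch_file(directory=None):
    """Create a temp file, remove its name right away, and return (fd, old_path)."""
    fd, path = tempfile.mkstemp(dir=directory)
    # The directory entry disappears now; the storage is reclaimed by the OS
    # when the last descriptor closes, including when the process is killed.
    os.unlink(path)
    return fd, path
```

The descriptor stays fully usable for reads and writes after the unlink. Python's own `tempfile.TemporaryFile` behaves the same way on Unix, so in practice the standard library may already cover this case.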

Assuming that's not sufficient, what about spawning a child process (early in the main process's life) to take “emergency measures”? This is sort of the reverse of Joe's minimal parent suggestion, and has the advantage of preserving the semantics of wait and kill system calls that target the parent.

I assume a Unix-like system that allows child spawning (so, not iOS). The parent creates a pipe, sets FD_CLOEXEC on the write end, and connects the read end to the child's stdin. The write end closes when the parent exits for any reason, and the child reads EOF when that happens. I'm imagining a simple protocol where the parent can send filenames to the child for cleanup-on-exit, or send debug log messages to be written only if the parent crashes, and a clean-exit message to let the child know the parent is exiting normally, in which case the child can discard the debug log.
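Here is a minimal sketch of that protocol, in Python for brevity. The command names (`rm`, `log`, `clean`), the helper functions, and the crash-log path argument are all invented for this example. Python creates the parent's end of the pipe as non-inheritable by default, which plays the role of setting FD_CLOEXEC by hand.

```python
import subprocess
import sys

# Source of the helper process. It reads one command per line from stdin:
#   "rm <path>"   register a file to delete once the parent is gone
#   "log <text>"  buffer a debug message
#   "clean"       the parent is about to exit normally
HELPER_SOURCE = r'''
import os, sys
crash_log_path = sys.argv[1]
files, messages, clean = [], [], False
for line in sys.stdin:  # the loop ends at EOF, i.e. when the parent is gone
    command, _, argument = line.rstrip("\n").partition(" ")
    if command == "rm":
        files.append(argument)
    elif command == "log":
        messages.append(argument)
    elif command == "clean":
        clean = True
for path in files:
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass
if not clean and messages:  # keep the debug log only after a crash
    with open(crash_log_path, "w") as f:
        f.write("\n".join(messages) + "\n")
'''


def start_helper(crash_log_path):
    """Spawn the helper; its stdin is the read end of a pipe the parent holds."""
    return subprocess.Popen(
        [sys.executable, "-c", HELPER_SOURCE, crash_log_path],
        stdin=subprocess.PIPE,
        text=True,
    )


def send(helper, command):
    """Send one protocol line, e.g. "rm /tmp/scratch" or "log opened file"."""
    helper.stdin.write(command + "\n")
    helper.stdin.flush()
```

The parent would call `start_helper` early, `send(helper, "rm " + path)` whenever it creates a temporary file, and `send(helper, "clean")` just before a normal exit. The line-based framing breaks on filenames that contain newlines, so a real protocol would need length-prefixed messages or escaping.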


@Joe_Groff in general, intelligibly poking at child data requires child code, because the child implements some abstraction over the data.

@mayoff Thanks, but I think I’m settling on “just do what you need to do pre-emptively.”

This is ultimately about what advice I’m giving in a book chapter, so it cannot be platform-specific, and all the schemes that involve something happening after the trap are much too complicated to be a general recommendation. Someone would have to build a portable library for this before I’d even consider discussing it.

Thanks to everybody who responded; it was interesting and I landed on a simple answer, which is all to the good!


It sounds like you're happy with your answer, and "just say no" is certainly a fine one here, especially since the mechanisms for IPC are finicky and platform-dependent. A sufficiently determined language runtime implementer could implement their data structure abstractions in such a way that data can be interpreted both in-process and "offline" out-of-process. We've done this in an ad-hoc way for most of Swift's runtime metadata; aside from being able to inspect crashed process state, there would be other potential benefits to doing so, such as a debugger being able to nondestructively interpret code out-of-process (unlike how lldb injects code in-process and occasionally makes things worse when trying its best to evaluate expressions in a crashed process context).


My book isn’t targeting language runtime implementers; it’s trying to give practical, general advice. Do you still think it’s relevant?

Have you asked any other audiences? If you have, I’m curious what answers you have gotten. Particularly if there’s a consensus among people who specialize in high-availability or high-resiliency systems.

I haven’t.

As a non-expert who might like to learn from such a book, I might recommend fleshing out the possibilities a little more to make a more complete recommendation. (This is where the opinion of experts, especially those who work in other languages/ecosystems, would be valuable.)

Some of the interesting questions I’ve seen raised include:

  • Should language runtimes implement task isolation and fatal error handling, or delegate them to the host OS’s process abstraction?
    • (.NET is an ecosystem that started out one way and transitioned to the other.)
  • Should programs that have detected an error condition proactively try to save user data before raising a fault, or should they terminate immediately and rely on separate, parallel techniques like automatic backups or journaling to recover user data?
    • While I’m personally on the “crash immediately to avoid writing out corrupt data” side, I don’t think there’s universal consensus on this in application development. I wonder what the story is across filesystems and databases.
  • How verbose should an app’s logging be when deployed in production?
    • It turns out storing and searching logs can get very expensive. It also turns out the story for live debugging is virtually nonexistent for some server ecosystems, so it’s the only way to debug a program failure.