[stdlib] Cleanup callback for fatal Swift errors

kattrali · July 15, 2019, 12:12pm

In the event of a fatal error caused by Swift code, there is no direct way to get the error message and context from Swift without out-of-process log parsing. Fatal errors "fall through" to signal handlers, at which point the crash context is lost. The goal of this proposal is to provide a native Swift cleanup callback for fatal errors, without the complexity of signal handlers and without allowing attempted recovery. The context could then be written to disk, logged in a custom format, or aggregated for later analysis.

Proposed solution

Add an onFatalError function which takes a closure as an argument. The closure expects a message and optionally a file and line number, similar to the semantics of the various forms of _assertionFailure(). The onFatalError closure is invoked by any call to fatalError(), preconditionFailure(), and assertionFailure(), providing a cleanup opportunity before the app is ultimately terminated by trap().

The handler is active globally, similar to facilities in other languages like Rust's panic::set_hook, Python's sys.excepthook, and NSSetUncaughtExceptionHandler.

The onFatalError function returns the existing fatal error handler (if any) to allow handler chaining if needed. The last registration of onFatalError "wins". This is analogous to NSSetUncaughtExceptionHandler.

Trivial usage example with handler chaining:

onFatalError { message, file, line in
  print("This is a custom callback. Received error: '\(message)'")

  if let file = file, let line = line {
    print("The error occurred in \(file):\(line)")
  }
}

var prevHandler: AssertionFailureCallback? = nil

prevHandler = onFatalError { message, file, line in
  print("This is the second handler. Received error: '\(message)'")

  if let prevHandler = prevHandler {
    prevHandler(message, file, line)
  }
}

// Examples of fatal errors:
let text: String? = nil
print(text!)

let items = [1, 2, 3]
print("The fourth item is \(items[3])")

fatalError("Damage report!")
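The registration semantics described above (the last registration wins, and the previous handler is returned for chaining) can be modeled in plain Swift. This is only a sketch: FatalErrorRegistry, FatalErrorHandler, and simulateFault are hypothetical names, nothing here hooks into the real fatalError() path, and the simulated fault returns instead of trapping so the example can run.

```swift
import Foundation

// Hypothetical stand-in for the proposed closure type; not a standard library API.
typealias FatalErrorHandler = (String, String?, Int?) -> Void

final class FatalErrorRegistry {
    private let lock = NSLock()
    private var handler: FatalErrorHandler?

    /// Installs `newHandler` and returns the handler it replaced, if any.
    @discardableResult
    func onFatalError(_ newHandler: @escaping FatalErrorHandler) -> FatalErrorHandler? {
        lock.lock()
        defer { lock.unlock() }
        let previous = handler
        handler = newHandler
        return previous
    }

    /// Stands in for the runtime: calls the current handler, then would trap.
    func simulateFault(_ message: String, file: String? = nil, line: Int? = nil) {
        lock.lock()
        let current = handler
        lock.unlock()
        current?(message, file, line)
        // A real implementation would trap here; the sketch returns so it can be run.
    }
}

func runChainingDemo() -> [String] {
    let registry = FatalErrorRegistry()
    var log: [String] = []

    registry.onFatalError { message, _, _ in log.append("first: \(message)") }

    var previous: FatalErrorHandler?
    previous = registry.onFatalError { message, file, line in
        log.append("second: \(message)")
        previous?(message, file, line)  // chain to the handler that was replaced
    }

    registry.simulateFault("Damage report!", file: "main.swift", line: 42)
    return log
}

let chainingLog = runChainingDemo()
print(chainingLog)
```

The lock only protects installing and reading the handler; it makes no promise about two faults running the handler concurrently, which matches the limited guarantee discussed later in the thread.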

Example implementation

Apps would typically set a fatal handler at the end of the launch lifecycle and use the handler to add custom state to debug issues which arise.

App code example (eliding some helper functions)

// AppDelegate.swift
import UIKit

@UIApplicationMain
class AppDelegate: UIResponder, UIApplicationDelegate {

    var window: UIWindow?

    var gameWorld: World?

    func application(_ application: UIApplication, didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]?) -> Bool {
        let world = World()

        // Wrap fatal error handler to provide a reference to interesting
        // state for debugging
        let didCrash = FatalHandler.install(world)

        if didCrash {
            // load empty world and prompt user to say what's happening and why
        } else {
            world.loadFromSave()
        }
        gameWorld = world

        return true
    }
}

// FatalHandler.swift
import Foundation

var crashInfoPath: String?
var crashedLastLaunch = false
var crashedThisLaunch = false
var worldContext: World?

class FatalHandler {

    // checks and returns crash state
    public class func install(_ world: World) -> Bool {
        var didCrash = false
        // Make context available in a non-capturing scope for signal handlers
        worldContext = world

        // Store crash context in app cache
        let cacheDirs = NSSearchPathForDirectoriesInDomains(.cachesDirectory, .userDomainMask, true)
        if let cacheDir = cacheDirs.first {
            crashInfoPath = cacheDir + "/crashinfo"
            if access(crashInfoPath!, F_OK) != -1 {
                didCrash = true
                // (...) read file, handle crashing conditions, keep track of
                // which worlds have repeated errors, etc

                // Delete when done
                unlink(crashInfoPath!)
            }

            onFatalError { message, file, line in
                crashedThisLaunch = true
                // Open a file, passing the file descriptor to a closure
                openFile(path: crashInfoPath) { fd in
                    // Writes a simple structured file to a pre-configured path
                    // Format:
                    //   date
                    //   message
                    //   world seed (int)
                    //   file:line
                    writeCrashInfo(fd: fd, message: message.description, seed: worldContext?.seed, file: file, line: line)
                }
            }

            // (...) Install signal handlers to also write info in case of other
            // types of crashes if crashedThisLaunch is false
            installSignalHandler(SIGABRT)
            installSignalHandler(SIGSEGV)
            installSignalHandler(SIGFPE)
            installSignalHandler(SIGILL)
            installSignalHandler(SIGTRAP)
        }
        return didCrash
    }
}

Simple scripts would set a handler near the beginning of the file, flushing ongoing work and state to disk.

Script code example

import Foundation // provides UUID

let crashInfoPath = "\(UUID().uuidString).crashlog"
var itemsProcessed = 0

onFatalError { message, file, line in
  openFile(path: crashInfoPath) { fd in
    writeCrashInfo(fd, message, itemsProcessed, file, line)
  }
}

// (...) Do some work, incrementing itemsProcessed as needed
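The examples above elide their helper functions. As a rough idea of what they might look like for the script case, here is a sketch built on POSIX open/write/close. The names and argument order follow the script example, but the file format (message, item count, location, one per line) is an assumption made for illustration.

```swift
import Foundation

/// Hypothetical helper: opens (creating or truncating) `path` for writing and
/// passes the file descriptor to `body`, closing it afterwards.
func openFile(path: String, _ body: (Int32) -> Void) {
    let fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0o600)
    guard fd >= 0 else { return }
    defer { close(fd) }
    body(fd)
}

/// Hypothetical helper: writes three lines (message, item count, location) to `fd`.
func writeCrashInfo(_ fd: Int32, _ message: String, _ itemsProcessed: Int,
                    _ file: String?, _ line: Int?) {
    var location = "unknown"
    if let file = file, let line = line {
        location = "\(file):\(line)"
    }
    let bytes = Array("\(message)\n\(itemsProcessed)\n\(location)\n".utf8)
    _ = bytes.withUnsafeBytes { write(fd, $0.baseAddress, $0.count) }
}

// Exercise the helpers directly, without any fatal error involved.
let samplePath = FileManager.default.temporaryDirectory
    .appendingPathComponent("\(UUID().uuidString).crashlog").path
openFile(path: samplePath) { fd in
    writeCrashInfo(fd, "Damage report!", 3, "main.swift", 42)
}
let writtenCrashInfo = try? String(contentsOfFile: samplePath, encoding: .utf8)
print(writtenCrashInfo ?? "<nothing written>")
unlink(samplePath)
```

A production version would also want to handle short writes and avoid allocating inside the handler where possible.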

More complex scripts or servers which manage multiple processes would install a handler at the beginning of a work unit, aggregating failed job output for later analysis. Combinations of message, file, line number, and additional state indicating what work was happening at the time surface potentially interesting code paths which could use more testing and review.

Additional discussion

Alternatives

Supporting multiple handlers which are executed sequentially based on order of registration. This reduces the overhead of managing previous handlers though removes the option of uninstalling handlers when using the lowest-level constructs. This would be similar to Ruby's at_exit.
```
onFatalError { message, file, line in
  print("Runs next!")
}
onFatalError { message, file, line in
  print("Runs first!")
}
```
Custom file logging option. I haven't pursued this one deeply, but most cases I can imagine for using a cleanup callback involve writing the fatal error context to a file in a structured format, so the standard interface could instead involve writing the message and file/line info to a custom file path for later analysis.
```
registerFatalErrorLog(URL(fileURLWithPath: "/path/to/log"))
```
However, this style of interface would remove the ability to capture custom application state when the fatal error occurs.
Enhancing signal() to provide Swift-specific context, if any, through either a change in closure arguments or a signal-safe, Swift fatal error context.
```
signal(SIGILL) { sig, errmsg, file, line in
  // ...
}
```
```
signal(SIGILL) { sig in
  if FatalError.set {
    let message = FatalError.message
  }
}
```
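
To make the first alternative concrete, here is a minimal sketch of a multi-handler registry that runs handlers in reverse order of registration, as in the "Runs first!" / "Runs next!" example. FatalErrorHandlerStack and simulateFault are hypothetical names, and the fault is simulated rather than raised so the sketch can run to completion.

```swift
// Hypothetical closure type mirroring the one in the pitch.
typealias FatalErrorHandler = (String, String?, Int?) -> Void

final class FatalErrorHandlerStack {
    private var handlers: [FatalErrorHandler] = []

    /// Registers a handler; nothing is returned because nothing is replaced.
    func onFatalError(_ handler: @escaping FatalErrorHandler) {
        handlers.append(handler)
    }

    /// Stands in for the runtime: runs every handler, most recent first.
    func simulateFault(_ message: String, file: String? = nil, line: Int? = nil) {
        for handler in handlers.reversed() {
            handler(message, file, line)
        }
        // A real implementation would trap here.
    }
}

func runStackDemo() -> [String] {
    let stack = FatalErrorHandlerStack()
    var output: [String] = []
    stack.onFatalError { _, _, _ in output.append("Runs next!") }
    stack.onFatalError { _, _, _ in output.append("Runs first!") }
    stack.simulateFault("Damage report!")
    return output
}

let handlerOrder = runStackDemo()
print(handlerOrder)
```

Because registration returns nothing, callers have no token with which to remove a handler, which is the trade-off noted for this alternative.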

On naming

The callback is named onFatalError; however, it runs for fatalError(), assertionFailure(), and preconditionFailure(). onAssertionFailure() could be a better name because all three functions funnel through _assertionFailure(), though that's not obvious without looking through the source code. There's also the option of getting rid of the "error" part altogether in favor of something like onFatal().

jdmcd · July 15, 2019, 1:12pm

I love this idea. Having a handler like this would make server side swift so much easier to deal with. +1

Michael_Ilseman · July 15, 2019, 5:08pm

CC @ktoso, who has also been looking into adapting something similar to Rust’s approach for Swift.

This kind of functionality is important for Swift to add, but it will likely require careful design and iteration to make it sound with Swift’s semantics. For example, you may want to ensure defer blocks and deinits are executed (i.e. unwinding).
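
For readers less familiar with the term, the unwinding that Swift already performs for thrown errors looks like the sketch below: defer blocks run and locals are released as the error propagates. A fatalError() today traps without doing either. EventLog, Resource, and doWork are names made up for this example.

```swift
struct WorkFailed: Error {}

// Small reference type used to record what happened, in order.
final class EventLog {
    var events: [String] = []
}

final class Resource {
    let name: String
    let log: EventLog
    init(name: String, log: EventLog) {
        self.name = name
        self.log = log
    }
    deinit { log.events.append("deinit \(name)") }
}

func doWork(log: EventLog) throws {
    let resource = Resource(name: "db", log: log)
    defer { log.events.append("defer") }
    log.events.append("working with \(resource.name)")
    throw WorkFailed()
    // On the throw, the defer block runs and `resource` is released before
    // the caller's catch block executes. Their relative order is not asserted here.
}

let unwindLog = EventLog()
do {
    try doWork(log: unwindLog)
} catch {
    unwindLog.events.append("caught")
}
print(unwindLog.events)
```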

ktoso · July 16, 2019, 2:59am

Thanks for the good idea, analysis and proposal, @kattrali

This is indeed something that's close to our hearts: @drexin and I have been looking into ways to improve the status quo for some time. Some of this work is ongoing in the Server Side Work Group (links below), but we are aiming for those improvements to help not only the server ecosystem, but the Swift ecosystem as a whole.

For the sake of discussions we have recently been trying to stick to the following terms when discussing failure handling features (none of these are official or anything, just to make sure we use specific words for each of the failure types):

- errors – Swift's current Error type and how one deals with them, also by passing them around in Result, EventLoopFuture and similar types.
- faults
  - "soft" faults – e.g. Swift's fatalError, array out of bounds, divisions by zero, force unwraps of nil values and similar situations.
    - These situations do not lead to memory unsafety, and are most often issued right before things could have ended in memory-unsafety, e.g. an array write outside of the array's bounds first results in a soft fault, rather than allowing the write to proceed into some arbitrary memory location.
  - "hard" faults – what currently gets mixed together with soft faults in Swift, since both are delivered as signals (e.g. from UD2), so programs have no chance to figure out "was it bad...? or really bad?"
    - Hard faults also include "random C code did something nasty"; we never want to capture those, and propose to keep those as "faults" that a Swift program should never be able to capture. Those who want to could still install a signal handler and e.g. capture a backtrace there (even though this may be "very unsafe").

Also, let us collectively refer to all those errors or faults as "failures."

Currently, Swift does not really have a good way to distinguish or capture the latter two — the “faults”.
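
To illustrate the terminology, here is a small sketch of the difference as it exists in Swift today: an "error" is an Error value the caller can catch, while a "soft fault" such as an out-of-bounds subscript traps and cannot be caught. ParseError, parse, describe, and element(at:in:) are made-up names, and the faulting line is left commented out so the example runs to completion.

```swift
enum ParseError: Error {
    case notANumber(String)
}

// An "error": surfaced through `throws`, handled by the caller.
func parse(_ text: String) throws -> Int {
    guard let value = Int(text) else {
        throw ParseError.notANumber(text)
    }
    return value
}

func describe(_ text: String) -> String {
    do {
        return "parsed \(try parse(text))"
    } catch ParseError.notANumber(let raw) {
        return "error: not a number: \(raw)"
    } catch {
        return "error: \(error)"
    }
}

// A "soft fault": the bounds check traps before any out-of-bounds read happens.
// There is no way to catch it, so the line stays commented out:
// let items = [1, 2, 3]; _ = items[4]

// Callers that want to avoid the fault have to check the precondition themselves.
func element(at index: Int, in items: [Int]) -> Int? {
    return items.indices.contains(index) ? items[index] : nil
}

print(describe("42"))
print(describe("forty-two"))
print(element(at: 4, in: [1, 2, 3]) as Any)
```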

Changes we do here should be part of a larger "failure handling" story we believe, which in part has to untangle the soft and hard faults, but also improve the user experience around them (see: ongoing backtrace improvement discussions).

I believe that there is a number of things which could be done to improve the failure handling story in Swift, and they all somewhat are linked to each other yet have varying levels of effort and benefits.

1. uncaught "soft" fault capture
   - this proposal (or similar), where a global "call-me-before-you-crash" handler is provided. As you said @kattrali, this has the benefit of not forcing implementations to rely on signal handlers for this.
   - this should be safe to set and access in concurrent settings, though likely no guarantees about concurrent execution of the handler can be made.
   - this should not fire for hard faults, as during those we may be facing memory corruption or other "very bad" situations.
2. panics and unwinding (?) – by promoting the "soft" faults to an actual concept Swift is aware of, we can enable those to be treated as "very bad, terminate execution of this thread", however we can allow thread-pools or similar to "isolate" the issue.
   - These failures are after all about "some logical invariant was broken" and not "random memory corruption"
   - panics "should not happen" in well behaved programs, unlike errors which may be used for validation etc
   - these may want to execute defer, deinit and/or similar code blocks, and continue crashing until isolated; if not "isolated", they'd leak to the handler which this thread proposes – the "uncaught fault handler" – as proposed in this topic.
   - panics (soft faults) are "less bad" than hard faults, and we would still want to be able to get (in-process) backtraces for them, unlike for Errors, which shall remain light-weight.
3. improved backtrace experience, ongoing work:
   - Crash backtraces: an attempt to improve the status quo (esp. on server, but aiming to provide an improved experience across platforms) for backtraces in Swift, such that they are predictable and of good quality. We are also investigating which ways are the best to obtain the best quality backtraces, and investigating TSAN's implementation of gathering those etc. This topic is currently handled by @IanPartridge, and we hope to collaborate on this as soon as we confirm some things about the best way of implementing this.
   - Demangle Function: would expose Swift's demangle mechanism to user land, so libraries can rely on it for in- (as well as off-) process symbolication. This matters more for the server ecosystem, and has a number of aspects to it... however exposing the demangling method is the first step here.
   - Better runtime failure messages (not yet enabled by default): Pull Request #25978 by eeckstein, apple/swift on GitHub
   - ongoing investigations across existing implementations, including TSAN's handling of this.

That said, there remains a lot to figure out to see if and how we can get there. It is, as @Michael_Ilseman said, something that needs careful design in multiple steps. (And it is not really up to me to decide what will land here and how, but from a user's perspective, we see these as some of the main things to address.)

In light of the latter two topics/ideas though: how would we design the first "global handler" such that the latter two can still land and feel like a natural fit? If we had those features, this proposal could be seen as an "uncaught panic (soft fault) handler" – if a panic was not isolated/stopped, it would reach the outermost "layer", and there it'd invoke the user-installed global handler; it would not be allowed to survive, though, and the process would be forced to crash as it does nowadays. So that could "fit" the model quite well.

The proposal here is a nice improvement and lower effort than topics 2) and 3). It would allow some users to get away from signal handlers, and it would allow us to install better backtrace libraries (an ongoing effort of the SSWG, which we are developing right now and might be able to upstream later) using this handler. I would want to make sure that, however it is exposed, it allows future developments to happen and fit in nicely – e.g. this affects what type of parameters the callback should receive, and what kinds of guarantees about invoking this closure we are able to provide.
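
As a rough illustration of a backtrace library plugging into such a handler, the sketch below builds a handler that captures an in-process backtrace with Foundation's Thread.callStackSymbols. The FatalErrorHandler type, makeBacktraceHandler, and ReportCollector are hypothetical; the handler is invoked by hand because no onFatalError exists to install it into, and the frames it returns are not demangled.

```swift
import Foundation

// Hypothetical closure type mirroring the one in the pitch.
typealias FatalErrorHandler = (String, String?, Int?) -> Void

/// Builds a handler that formats the fault plus a backtrace and hands it to `sink`.
func makeBacktraceHandler(sink: @escaping (String) -> Void) -> FatalErrorHandler {
    return { message, file, line in
        var report = "fatal: \(message)\n"
        if let file = file, let line = line {
            report += "at \(file):\(line)\n"
        }
        // Frames of the current thread, one string per frame.
        report += Thread.callStackSymbols.joined(separator: "\n")
        sink(report)
    }
}

final class ReportCollector {
    var reports: [String] = []
}

let collector = ReportCollector()
let backtraceHandler = makeBacktraceHandler { collector.reports.append($0) }

// Invoked directly here; under the pitch the runtime would call it before trapping.
backtraceHandler("Damage report!", "main.swift", 42)
print(collector.reports[0])
```

What the callback receives (a plain message, or something richer such as a pre-captured backtrace) is exactly the kind of parameter question raised above.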

In other words, I hope that we can figure out a plan for failure handling in Swift as a whole, such that these incremental improvements can build up to a solid and cohesive story and all play nicely with each other. Right now we don't have more details though.

kattrali · July 17, 2019, 4:48pm

This is great feedback, thank you!

@ktoso - It's especially great to be connected to the existing conversations around this topic, and the terminology being used.

I see a few immediate refinements to the pitch for a global soft fault handler, in particular:

Clearly stating the difference between soft vs hard faults and that hard faults will always require signal handling
Proposing possible ways that a global soft fault handler can fit in a future where there could be localized soft fault handling, and perhaps building a few rough example implementations and usages

So I'm going to spend a bit of time reading more of the existing discussions for the next draft.

> these may want to execute defer, deinit and/or similar code blocks, and continue crashing until isolated; if not "isolated", they'd leak to the handler which this thread proposes – the "uncaught fault handler," – as proposed in this topic.

Is there more discussion about this component?

Joe_Groff · July 17, 2019, 5:02pm

If I understand @kattrali's proposal correctly, it seems like onFatalError would be something that runs immediately before program termination, since this API doesn't provide any indication of where to continue from or whether the error is considered to be handled. An API like this still seems useful to be able to log backtraces, but to me, the safest thing to do would be to crash without trying to unwind anything.

If we introduced an interface that also allowed the program to continue executing, like what @ktoso is talking about with catching soft faults, I think leaving the crashed subprogram in a hung state, without unwinding it, might be a good incremental step toward improving the robustness of Swift programs. While not ideal, that could still let the supervisor part of your program finish servicing other requests if one crashes, for instance.

kirilltitov · July 17, 2019, 5:48pm

I really like the idea. However, I agree with @ktoso: sometimes fatalError really means fatal error, when things have gone utterly wrong (memory corruption etc.), and in these cases a cleanup callback might at the very least fail, but I'm afraid in some cases it might make things horribly worse.

I think what we actually can do now is try to brainstorm a brand new error model for Swift where we can separate fatal (panic) and non-fatal errors and implement cleanup handlers for the latter.

kattrali · July 17, 2019, 7:49pm

> If I understand @kattrali's proposal correctly, it seems like onFatalError would be something that runs immediately before program termination, since this API doesn't provide any indication of where to continue from or whether the error is considered to be handled. An API like this still seems useful to be able to log backtraces, but to me, the safest thing to do would be to crash without trying to unwind anything.

Correct. The intent is to add a means to record information about the fault or perform final cleanup within the app/server before entering a signal-safety-required context. It also would separate, for example, logical errors in Swift from failed syscalls.

While any kind of recovery and continuation is outside of the scope of what's being proposed here, the proposal could be refined a bit to state that more clearly and illustrate where it would fit if localized fatal error handling were to be added.

drexin · July 17, 2019, 11:46pm

It's important to differentiate between nasty things like segfaults and not so bad things like fatalError calls from user code or the runtime (e.g. force unwrap nil, array index out of bounds etc.). Those fatal errors are triggered because it was detected that the operation that was about to be executed would result in memory corruption or similar bad things. So the corruption did not actually occur, because the runtime detected the attempt and prevented it. In this case it should be safe to continue running the program and execute cleanup code, or print some debug information etc. If we actually corrupted memory, there's nothing we can do about it and even the attempt to run code afterwards could result in unpredictable behavior, e.g. executing a cleanup callback that tries to print some information and incidentally touches the corrupted memory. So these cases should not be handled.

I think a good first step would be to allow custom code to be hooked into the _assertionFailure logic, as @kattrali proposed. One question is: if a callback gets registered as part of a function that calls code that could potentially fail, how do we ensure that it gets unregistered if the code did not fail? Or should we only allow registration of a global handler that can, for example, print the backtrace just before the app crashes?

lukasa · July 18, 2019, 8:09am

I know that @drexin knows this, but I want to make explicit something that is implicit in this sentence for the sake of others viewing this thread: "safe" here explicitly means "memory safe", not "free of bugs".

By definition if you hit a fatalError or precondition you must have some state in your program that is logically inconsistent: that's what those are for. This logical inconsistency may persist into any state you share with your cleanup code. As a result, your cleanup code needs to be extremely conservative to avoid falling foul of the same issue. This means that this is not a mechanism for arbitrary resumption of logic: you really do need to be taking action to throw away all the state you have that may be in a bad shape, and in a shared-memory system like Swift that means almost all of it.

This means that most users should never write code that "recovers" from a panic, because the odds of getting that right are pretty low. Not that the language shouldn't have the facility, of course: just that we should think of recovery the way we think of unsafeBitCast, as a tool that is fundamentally a bit dangerous and to be avoided in almost all code.

Jean-Daniel · July 18, 2019, 8:23am

I agree with that. While low-level languages like C provide a way to handle and recover from such critical failures (signal + setjmp, longjmp for instance), nobody uses it, for a good reason. It is fundamentally unsafe to recover from such a condition.

I think that any method that allows executing code in such a condition must be explicitly tagged as unsafe.

And by the way, I can't remember the last time I saw a signal handler that uses only signal safe functions, so make sure to design that feature to avoid such limitation, or to be able to enforce it at compiler level.
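
For reference, a handler restricted to async-signal-safe calls might look like the sketch below: it only calls write(2) on a fixed descriptor with a static message, and captures no context (a C function pointer cannot). This is a best-effort illustration, since Swift makes no formal promise that any Swift code is async-signal-safe; SIGUSR1 is used so the example can raise the signal itself without crashing.

```swift
import Foundation

// Install a handler that only calls write(2), which POSIX lists as async-signal-safe.
// The closure captures nothing, so it can be converted to a C function pointer.
signal(SIGUSR1) { _ in
    let message: StaticString = "fault logged from signal handler\n"
    _ = write(STDERR_FILENO, message.utf8Start, message.utf8CodeUnitCount)
}

// Raise the signal on the current thread; the handler runs before raise() returns.
let raiseResult = raise(SIGUSR1)
print("raise returned \(raiseResult), process still running")
```

Anything richer (string formatting, allocation, locks, file APIs beyond raw descriptors) falls outside that safe set, which is part of the motivation for a callback that runs before the signal is raised.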

lukasa · July 18, 2019, 8:30am

I think I disagree with this.

In Swift and other similar languages unsafe has a fairly clear meaning, which is that the operation may perform memory-unsafe operations. In this context none of the operations you can perform will be any more memory-unsafe than they were before: they'll just potentially be logically-unsafe. Swift does not annotate such code today (how could it), and so I wouldn't propose that it should do so in response to this feature.

However, we should develop community guidelines and documentation that strongly warn that arbitrary panic recovery doesn't lead to good outcomes, the same way the Python community warns against using except: to catch exceptions, instead strongly encouraging except Exception:.

drexin · July 18, 2019, 8:24pm

@lukasa Thanks for clarifying that.

Jean-Daniel · July 18, 2019, 9:35pm

Yes, that's a good point.

If we follow the actual proposal and trap() is unconditionally called after the handler, that will already be a strong signal that a fatalErrorHandler is not designed to perform any recovery.

Lantua · July 18, 2019, 9:46pm

It seems that the only purpose of fatalErrorHandler is to log the relevant data.

Since it happens mainly when the program is logically inconsistent but still memory-consistent, exposing an entire callback to userland doesn't seem like a very good approach.

Would it be better if we only mark relevant data (in addition to error message) in fatalError call, and have it logged in a searchable/navigable format? Even things like out-of-bound access could use a variable name/index.

lukasa · July 19, 2019, 8:28am

I don't think that's quite right: it would also be useful to be able to use this handler to gracefully release other resources where possible, at least in the long term.

I think Rust is illustrative here. It allows catching panics with catch_unwind and makes it clear that there are circumstances where doing this may be acceptable. Emulating that in Swift may be tricky due to the absence of the explicit Rust lifetime system, but approaching that space could be profoundly useful to potentially allow recovery of the program in some systems.

The biggest downside here is that what data is relevant depends very much on the program in question, and the faulting code rarely knows. Many programs attempt to do this already, if only in debug mode, by passing strings to precondition or fatalError.

Lantua · July 19, 2019, 12:05pm

Yeah, I'd say that it's not nearly enough in many situations. The "logging" so far deals specifically with StringConvertible types. It'd be much better to be able to do deep logging on a struct/class and navigate them later.

lukasa · July 19, 2019, 12:18pm

I agree. What I'm trying to convey is only that the code at the point of failure knows what is wrong, but rarely knows "why". The "why" is usually elsewhere in the stack, and that's the data you really want to see.

ben-cohen · July 19, 2019, 3:04pm

It's worth also noting that the "why" might still be memory corruption, even if the precondition that's failed might seem like it's a logic error that's perhaps ok to recover from. If you've accidentally zeroed out the wrong memory with unsafe shenanigans, it can manifest as an unexpected nil, for example.

Joe_Groff · July 19, 2019, 5:08pm

Retrofitting unwinding to Swift would also poke a few holes in things we've been taking for granted. One of the nice things about the explicit error handling model is that it makes the error propagation back edges explicit with try annotations, which is particularly helpful when writing unsafe code so that you know where you have to be mindful of maintaining invariants. One of the important lessons from C++ is that writing correct unsafe code in the face of implicit exceptions is humanly impossible—not even Rust can do it. It would be a shame to lose this property in Swift.

The other big hole I see is the interaction of unwinding and inout. Right now, Swift code and the optimizer both benefit from a lot of freedom with exclusive borrows, since we can freely move values out of an exclusively-borrowed memory location and leave it temporarily invalid, as long as you move something back before your borrow ends. Dictionary for instance uses this to move values into a temporary Optional value that can be modified in-place during subscripting and then moved back into place when the subscript ends. You can't do that generally in Rust because of the threat of unwinding.
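
The in-place mutation through a Dictionary subscript mentioned above is visible in everyday code. The sketch below shows the user-facing side only, with made-up data and a made-up appendTwice function; how the standard library moves the value in and out of storage is an implementation detail not shown here.

```swift
var scores: [String: [Int]] = ["alice": [1, 2]]

// Mutates the stored array through the Optional returned by the subscript.
scores["alice"]?.append(3)

// The `default:` subscript inserts an empty array first if the key is missing.
scores["bob", default: []].append(7)

// Optional chaining on a missing key is a no-op rather than a fault.
scores["carol"]?.append(1)

// The same storage can be handed to an inout parameter. For the duration of
// the call the function has exclusive access to that dictionary slot.
func appendTwice(_ value: Int, to list: inout [Int]) {
    list.append(value)
    list.append(value)
}
appendTwice(9, to: &scores["alice", default: []])

print(scores["alice"] ?? [], scores["bob"] ?? [])
```

If unwinding could interrupt appendTwice halfway through, the dictionary slot would need to be left in some destructible state at every such point, which is the concern raised in this post.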

Aside from that, there's also the issue that an interrupted function may not leave behind a well-formed value in an inout because of optimizations or other transformations; before we adopted the exclusivity model for inout, we had problems like this with our "notionally noalias, but we'll still try to remain memory safe" model, which severely limited our ability to optimize inout functions without potentially leading to invalid states in the face of aliasing.

If unwinding is strictly upward and non-interruptible, there might still be mitigations to the inout problems, since we'd only need to ensure that destruction is still safe—for Dictionary, we could for instance leave behind a safely-destructible, but otherwise invalid, sentinel representation when we want to move the value out of the table. Similarly, the optimizer could still perform transformations that might expose transient invalid states of a value, as long as that state can still be destructed. I don't have an obvious answer for what we could do about unsafe code, and I suspect it could end up being a more pervasive problem in Swift than in Rust, since safe Rust can go a lot "closer to the metal" than safe Swift today can.