[stdlib] Cleanup callback for fatal Swift errors

ktoso · July 16, 2019, 2:59am

Thanks for the good idea, analysis and proposal, @kattrali

This is indeed close to our hearts: together with @drexin, we have been looking into ways to improve the status quo for some time. Some of this work is ongoing in the Server Side Work Group (links below), but we are aiming for those improvements to help not only the server ecosystem, but the Swift ecosystem as a whole.

For the sake of discussions we have recently been trying to stick to the following terms when discussing failure handling features (none of these are official or anything, just to make sure we use specific words for each of the failure types):

errors – Swift's current Error type and how one deals with them, also by passing them around in Result, EventLoopFuture and similar types.
faults
- "soft" faults – e.g. Swift's fatalError, array out of bounds, divisions by zero, force unwraps of nil values and similar situations.
  - These situations do not lead to memory unsafety, and are most often issued right before things could have ended in memory unsafety, e.g. an array write outside of the array's bounds first results in a soft fault, rather than the write being allowed to proceed into some arbitrary memory location.
- "hard" faults – what currently gets mixed together with soft faults in Swift, since both surface as signals (e.g. from a UD2 trap), so programs have no chance to figure out "was it bad...? or really bad?"
  - Hard faults also include "random C code did something nasty". We never want to capture those, and propose to keep them as "faults" that a Swift program should never be able to capture. Those who want to could still install a signal handler and e.g. capture a backtrace there (even though this may be "very unsafe").
Also, let us refer to all of those errors and faults collectively as "failures."
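
To make the terms concrete, here is a small sketch of the difference as it stands today. `ParseError`, `parsePort` and `portOrDefault` are made-up names for illustration only. The faulting lines are left commented out because running them would terminate the process.

```swift
// Hypothetical error type, for illustration only.
enum ParseError: Error, Equatable {
    case notANumber(String)
}

// An "error": recoverable, and visible in the function's signature.
func parsePort(_ text: String) throws -> Int {
    guard let port = Int(text) else {
        throw ParseError.notANumber(text)
    }
    return port
}

// Errors can be handled in place with do/catch ...
func portOrDefault(_ text: String) -> Int {
    do {
        return try parsePort(text)
    } catch {
        return 8080
    }
}

// ... or passed around as values, e.g. in a Result.
let outcome: Result<Int, Error> = Result(catching: { try parsePort("http") })

// A "soft" fault, by contrast, cannot be caught today.
// Uncommenting either pair of lines below traps and terminates the process:
// let ports = [80, 443]
// _ = ports[5]              // index out of range
// let missing: Int? = nil
// _ = missing!              // force unwrap of nil
```

The point is only that the first category has language support for handling it (`throws`, `do`/`catch`, `Result`), while the second currently has none.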

Currently, Swift does not really have a good way to distinguish or capture the latter two, the "faults".

We believe that changes made here should be part of a larger "failure handling" story, which in part has to untangle the soft and hard faults, but also improve the user experience around them (see: ongoing backtrace improvement discussions).

I believe that there are a number of things which could be done to improve the failure handling story in Swift. They are all somewhat linked to each other, yet have varying levels of effort and benefit.

1) uncaught "soft" fault capture
- this proposal (or similar), where a global "call-me-before-you-crash" handler is provided. As you said @kattrali, this has the benefit of not forcing implementations to rely on signal handlers for this.
- this should be safe to set and access in concurrent settings, though likely no guarantees can be made about concurrent execution of the handler.
- this should not fire for hard faults, as during those we may be facing memory corruption or other "very bad" situations.
2) panics and unwinding (?) – by promoting the "soft" faults to an actual concept Swift is aware of, we can enable those to be treated as "very bad, terminate execution of this thread", while still allowing thread pools or similar to "isolate" the issue.
- These failures are after all about "some logical invariant was broken" and not "random memory corruption"
- panics "should not happen" in well-behaved programs, unlike errors, which may be used for validation etc.
- these may want to execute defer, deinit and/or similar code blocks, and continue crashing until isolated; if not "isolated", they'd reach the "uncaught fault handler" proposed in this thread.
- panics (soft faults) are "less bad" than hard faults, and we would still want to be able to get (in-process) backtraces for them, unlike for Errors, which shall remain lightweight.
3) improved backtrace experience, ongoing work:
- Crash backtraces: an attempt to improve the status quo for backtraces in Swift (esp. on server, but aiming to provide an improved experience across platforms), such that they are predictable and of good quality. We are also investigating which ways of obtaining backtraces give the best quality, and looking at TSAN's implementation of gathering those etc. This topic is currently handled by @IanPartridge, and we hope to collaborate on this as soon as we confirm some things about the best way of implementing this.
- Demangle Function: this would expose Swift's demangle mechanism to user land, so libraries can rely on it for in- (as well as off-) process symbolication. This matters more for the server ecosystem, and has a number of aspects to it... however, exposing the demangling method is the first step here.
- Better runtime failure messages (not yet enabled by default), by eeckstein · Pull Request #25978 · apple/swift
- ongoing investigations across existing implementations, including TSAN's handling of this.
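
For the first item, the "safe to set and access in concurrent settings" requirement could be sketched roughly as below. This is purely hypothetical: `FaultHandlerRegistry`, `install`, `invokeBeforeCrash` and the `(String) -> Void` handler shape are invented here and are not part of the standard library or of the proposal. A real implementation would live in the runtime and would terminate the process after the handler returns.

```swift
import Foundation

// Hypothetical handler shape: receives only a message, for simplicity.
typealias UncaughtFaultHandler = (String) -> Void

// Hypothetical registry; a real one would be a single process-wide instance.
final class FaultHandlerRegistry {
    private let lock = NSLock()
    private var handler: UncaughtFaultHandler?

    /// Installs a new handler and returns the previous one,
    /// so that callers can chain or restore it.
    @discardableResult
    func install(_ newHandler: UncaughtFaultHandler?) -> UncaughtFaultHandler? {
        lock.lock()
        defer { lock.unlock() }
        let previous = handler
        handler = newHandler
        return previous
    }

    /// Stand-in for what the runtime would do just before crashing.
    /// The handler is copied out under the lock but invoked outside of it,
    /// so a slow handler does not block concurrent install calls.
    func invokeBeforeCrash(message: String) {
        lock.lock()
        let current = handler
        lock.unlock()
        current?(message)
        // A real runtime would terminate the process here, unconditionally.
    }
}

// Usage: record what the handler was told.
func demo() -> [String] {
    var seen: [String] = []
    let registry = FaultHandlerRegistry()
    registry.install { message in seen.append(message) }
    registry.invokeBeforeCrash(message: "Index out of range")
    return seen
}
```

Note that the lock only protects setting and reading the handler. It says nothing about two threads faulting at the same time and both running the handler, which matches the "no guarantees about concurrent execution" caveat above.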

That said, there remains a lot to figure out, and we need to see if and how we can get there. It is, as @Michael_Ilseman said, something that needs careful design in multiple steps. (And it is not really up to me to decide what will land here and how, but from a user's perspective, we see these as some of the main things to address.)

In light of the latter two topics/ideas though: how would we design the first "global handler" such that the latter two can still land and feel like a natural fit? If we had those features, this proposal could be seen as an "uncaught panic (soft fault) handler": if a panic was not isolated/stopped, it would reach the outermost "layer", and there it'd invoke the user-installed global handler. The panic would not be allowed to survive though, and the process would still be forced to crash as it does today. So that could "fit" the model quite well.

The proposal here is a nice improvement and lower effort than topics 2) and 3). It would allow some users to get away from signal handlers, and it would allow us to install better backtrace libraries using this handler (these are an ongoing effort of the SSWG; we are developing them right now, and would perhaps be able to upstream them later). I would want to make sure that, however it is exposed, it allows future developments to happen and fit in nicely; e.g. this affects what type of parameters the callback should receive, and what kinds of guarantees about invoking this closure we are able to provide.
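
As one illustration of why the parameter type matters: if the callback received a single struct rather than a fixed list of arguments, fields such as a backtrace could be added later without changing the handler's signature. Everything below (`FaultInfo`, its fields, `describe`) is a hypothetical sketch, not something from the proposal.

```swift
// Hypothetical payload passed to the handler; all names are assumptions.
struct FaultInfo {
    enum Kind {
        case preconditionFailure, indexOutOfRange, forceUnwrapOfNil, other
    }

    let kind: Kind
    let message: String
    let file: String
    let line: UInt
    // Optional, so it can stay empty until in-process backtraces are available.
    var backtrace: [String]? = nil
}

// The handler takes the struct, so its signature survives new fields.
typealias UncaughtFaultHandler = (FaultInfo) -> Void

// Example consumer: formats the fault for a log line.
func describe(_ info: FaultInfo) -> String {
    var text = "\(info.file):\(info.line): \(info.kind): \(info.message)"
    if let frames = info.backtrace {
        text += "\n" + frames.joined(separator: "\n")
    }
    return text
}

let handler: UncaughtFaultHandler = { info in
    print(describe(info))
}

handler(FaultInfo(kind: .indexOutOfRange,
                  message: "Index out of range",
                  file: "main.swift",
                  line: 12))
// prints "main.swift:12: indexOutOfRange: Index out of range"
```

A handler written against this shape today would keep compiling if, say, a thread identifier or a structured backtrace type were added later.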

In other words, I hope that we can figure out a plan for failure handling in Swift as a whole, such that these incremental improvements can build up to a solid and cohesive story and all play nicely with each other. Right now we don't have more details though.