Crash backtraces

When we were discussing the Charter of the SSWG last year, one area several people mentioned they were interested in collaborating on is improving the experience of deploying Swift applications and managing them at scale.

A critical part of that is ensuring that Swift has a good story for FFDC (First Failure Data Capture - please forgive the IBM terminology; I'm sure other people have different/better names for the same thing). The idea is that when your app falls over, you should have useful diagnostics available immediately, without having to recreate the problem a second time (which of course may be impossible for heisenbugs).

@johannesweiss posted some useful code for printing crash backtraces in release mode and I packaged it up into a tiny SPM library:

I'm also aware of two other libraries which tackle this problem in different ways.

(uses libunwind)

(uses some hackery: @_silgen_name to call into the stdlib for demangling)

Would people be interested in collaborating on a stacktrace library through the SSWG? Does anyone have thoughts on the technical approaches taken by the libraries I mention above?

12 Likes

As you said, this is going to be critical for production apps at scale.

I'd be happy to help.

3 Likes

It would be fantastic for the community, especially if it also works on iOS and macOS; nowadays you basically have to sell out your customers when you use any of the "free" solutions out there.

2 Likes

It's not clear to me what the best/recommended path is on Darwin. libunwind ships with macOS, but there is also CoreSymbolication.framework. Now that we are ABI stable on Darwin, what is the official way?

First, thanks @IanPartridge for starting this discussion. I think we all agree that the current situation is not optimal and we are very eager to do something about it.

We have been collecting some information about how other languages with a similar scope to Swift handle this today. The most prominent ones are probably C++, Go and Rust. They all have support for retrieving backtraces, with varying degrees of configuration and manual work required.

Go and Rust both have built-in support for printing backtraces. Go always does it with no configuration required, while Rust only prints the backtrace when the environment variable RUST_BACKTRACE is set to "1". Rust also omits debug symbols from release builds by default (just like Swift), so building with "-g" is required in release mode; Go does not differentiate between debug and release builds and always includes debug symbols. Both languages also allow users to catch and recover from panics, in which case the backtrace must be printed manually by the user: in Go with PrintStack() from the runtime/debug package, and in Rust via the backtrace crate (https://crates.io/crates/backtrace). C++ does not print a backtrace on crashes by default, but one can be retrieved manually using backtrace() and then demangled with __cxa_demangle(), or more comfortably by using the Boost stacktrace module (https://github.com/boostorg/stacktrace).
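For illustration only, a Rust-style opt-in could look something like this in Swift; the SWIFT_BACKTRACE variable name here is made up, not an existing convention.

// Hypothetical sketch of a Rust-style opt-in: only print backtraces when an
// environment variable asks for them. The variable name is illustrative only.
import Foundation

let backtraceOnCrash =
    ProcessInfo.processInfo.environment["SWIFT_BACKTRACE"] == "1"

func reportCrash() {
    guard backtraceOnCrash else { return }
    // walk and print the current stack here, e.g. via backtrace() or libunwind
}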

So there is quite a bit of prior art here that could be used as inspiration. In Swift today there's no built-in way to trap panics, but signal handlers can be used instead (as Ian already does in his library). I think a good start would be to check our options (e.g. backtrace() vs. libunwind) and see how well each of them works on the platforms we want to support. We should also decide how much control we want to give the user: is it sufficient to install a pre-defined hook that prints the backtrace, or should users be able to install their own hooks and retrieve the backtrace as a proper data structure (like in Rust's backtrace crate)?
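As a rough illustration of the backtrace() option, a sketch like the following could walk and print the current thread's stack on Linux. It assumes the execinfo.h functions are reachable from Swift (through the Glibc module or a small C shim target), which we'd need to verify.

// A minimal sketch, assuming backtrace() and backtrace_symbols() from
// <execinfo.h> are visible to Swift (via the Glibc module or a C shim).
#if os(Linux)
import Glibc

func printCurrentBacktrace(maxFrames: Int32 = 64) {
    var addresses = [UnsafeMutableRawPointer?](repeating: nil, count: Int(maxFrames))
    let frameCount = backtrace(&addresses, maxFrames)
    guard frameCount > 0, let symbols = backtrace_symbols(&addresses, frameCount) else {
        return
    }
    defer { free(symbols) }
    for i in 0..<Int(frameCount) {
        if let symbol = symbols[i] {
            print(String(cString: symbol)) // still mangled; demangling is a separate step
        }
    }
}
#endif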

We think this would be a great candidate for an SSWG-hosted project and would love to join your effort in making this real, as it would dramatically improve the overall experience of developing server-side Swift code.

Looking forward to discussing this more.

7 Likes

On Apple platforms, we rely on the system crash tracer, which collects a crash report once a process crashes and handles symbolication of the backtrace. It might be interesting to consider a similar out-of-process monitor-based approach for the server, since in-process signal handlers might fail to capture some forms of failure (particularly SIGKILLs) and could interfere with an app's own signal handlers.

We've also been looking into addressing shortcomings of backtrace symbolication itself in Swift, and looking for ways we can improve that which would likely benefit both Apple and server platforms. Traditional backtrace libraries which merely walk a callstack and rely on the symbol table for symbolication are limited in how well they can deal with inline frames. Something that uses DWARF debug info to symbolicate backtraces could give a more accurate account of inlined functions. On Apple platforms, the inessential debug info is separated from the binary so that customer machines don't need to download it, but the developer can use the debug info to symbolicate crash reports on their end; this separation of concerns may be less important on the server, though.

On a related note, Swift uses trap instructions for safety checks, and although these instructions are uniqued for each trap reason and associated in DWARF with the source location of that reason, there's no more detailed accounting of why the trap fired. We've been discussing ways we might be able to record richer messages for these traps in debug info as well.

7 Likes

Let's put Darwin platforms to one side for now - I think the SSWG goal should be to improve the status quo on Linux.

I think we should consider creating something like Rust's "backtrace" crate. I like this approach because it provides two key features:

  1. A quick API to walk the current thread's stack - https://docs.rs/backtrace/0.3.30/backtrace/fn.trace.html
  2. An API to capture a backtrace for later inspection/logging etc. - https://docs.rs/backtrace/0.3.30/backtrace/struct.Backtrace.html#method.new

It also abstracts away the backend implementation so we could experiment with both backtrace() and libunwind and leave the door open to supporting other platforms in future.
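To make that concrete, here is a sketch of what such an API surface might look like in Swift. The names and shapes are illustrative only, loosely modeled on the Rust crate; they are not an existing library.

// Illustrative API shape only, loosely modeled on Rust's backtrace crate;
// the unwinding backend (backtrace(), libunwind, ...) would live behind trace().
public struct Backtrace {
    public struct Frame {
        public let returnAddress: UInt   // raw instruction pointer, symbolicated later
    }

    public let frames: [Frame]

    // Capture the current thread's stack for later inspection/logging.
    public init() {
        var captured: [Frame] = []
        trace { address in
            captured.append(Frame(returnAddress: address))
            return true                  // keep walking
        }
        self.frames = captured
    }
}

// Quick API to walk the current thread's stack; the closure returns false to
// stop early. The body is a stand-in for whichever backend we settle on.
public func trace(_ body: (UInt) -> Bool) {
    // backend-specific unwinding goes here
}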

The question of hooking the backtrace generator up to, for example, a SIGILL handler is a separate concern in my view.

Another separate question is the viability of running the Swift demangler in-process, including during a crash situation.

3 Likes

The demangler is already in the Swift runtime, so using it to pretty up backtraces should not be a problem. The issues I raised seem readily applicable to Linux as well as Darwin; in both environments, a traditional backtrace is going to miss out on inline frames, and having reason metadata for traps would allow crash reports to contain more descriptive and actionable information.

5 Likes

Could it make sense for the standard library to expose a public demangle function? The stdlib already tests that this is possible here: https://github.com/apple/swift/blob/b7daed7958956f5e91a66a9a3bc1c959f9471ef6/test/stdlib/Runtime.swift.gyb#L24

Implementation-wise it seems pretty trivial, but I assume there may be other considerations when adding this?
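For reference, here's a sketch of the kind of wrapper people use today. It calls the runtime's swift_demangle entry point through @_silgen_name, i.e. exactly the non-public-interface hackery mentioned earlier in the thread, so treat it as illustrative rather than something to ship.

// Sketch only: calls the runtime's swift_demangle via @_silgen_name, a
// non-public interface. A real stdlib API would make this unnecessary.
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

@_silgen_name("swift_demangle")
private func _swiftDemangle(
    mangledName: UnsafePointer<CChar>?,
    mangledNameLength: UInt,
    outputBuffer: UnsafeMutablePointer<CChar>?,
    outputBufferSize: UnsafeMutablePointer<UInt>?,
    flags: UInt32
) -> UnsafeMutablePointer<CChar>?

func demangle(_ mangledName: String) -> String {
    return mangledName.utf8CString.withUnsafeBufferPointer { (buffer) -> String in
        guard let base = buffer.baseAddress,
              let demangled = _swiftDemangle(mangledName: base,
                                             mangledNameLength: UInt(buffer.count - 1),
                                             outputBuffer: nil,
                                             outputBufferSize: nil,
                                             flags: 0)
        else {
            return mangledName          // not a Swift symbol; return it unchanged
        }
        defer { free(demangled) }
        return String(cString: demangled)
    }
}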

2 Likes

Exposing that makes a lot of sense, especially for in-process backtraces and things like that. My main concern would be people trying to parse the demangler output, when it isn't really designed to be a stable output format, but that's an existing problem with things like String(describing: T.self) that already generate demangled strings.

2 Likes

Excellent :+1: Very glad that this seems not too controversial.

The second part of the thread is where to get the backtraces from. We could invest in getting the information out of DWARF debug info if we think that's the way to go; we'll need to explore it a bit, but with a few hints here and there I hope we'd be able to pull it off. Or, as an MVP, we'd start out with the simple backtrace() and improve over time...

I agree that installing the signal handlers may not necessarily be part of the same discussion, but the library could perhaps provide a small function or pattern for runtimes (i.e. HTTP frameworks like Kitura / Vapor) so they could install those handlers for their users; end users would then not have to care "how" they got the better traces. We are also in a good position to collaborate with developers/users of the potential "nice backtraces" library, so overall I'm quite optimistic here.
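As a sketch of that "install for your users" pattern, something like the following could work. It uses signal(2) for brevity; a production library would use sigaction, cover more signals, and restrict itself to async-signal-safe work inside the handler.

// Minimal sketch of a handler-installation entry point a framework could call
// on its users' behalf. Names are illustrative, not an existing library.
#if os(Linux)
import Glibc

public enum CrashReporting {
    public static func install() {
        for sig in [SIGSEGV, SIGILL, SIGABRT, SIGFPE] {
            signal(sig) { signo in
                // Walk and print the stack here (e.g. the backtrace() sketch
                // earlier in the thread), then hand the signal back to the
                // default handler so the process still dies.
                signal(signo, SIG_DFL)
                raise(signo)
            }
        }
    }
}
#endif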

Hope to have more information once I'm back from traveling after WWDC :slight_smile:

1 Like

What might a stdlib demangle look like? The existing API is String -> String, but possibly we could match the existing print() and debugPrint() APIs, which have a pair of functions each:

// demangle to stdout
public func demangle(_ mangledName: String)

// demangle to the given output stream
public func demangle<Target>(
  _ mangledName: String,
  to target: inout Target
) where Target : TextOutputStream

I might make a separate thread discussing this addition, but to me it seems perfectly fine to just do this:

// demangle to stdout
print(demangle("$sSi"))

// demangle to the given output stream
print(demangle("$sSb"), to: &stream)

swift_demangle also supports writing to a buffer which could be incorporated somehow, but I'm not sure how useful that would be.

Yeah probably this belongs in the Evolution section, not here. Please do open a thread :slight_smile:

This might be useful in a crash scenario where you want to avoid allocating memory. A pre-allocated buffer could be passed in.
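Something along these lines, reusing the hypothetical _swiftDemangle shim sketched further up the thread, could demangle into memory the caller set aside before the crash. All the names here are illustrative, and the runtime's behaviour when the buffer is too small would need checking.

// Sketch: demangle into a pre-allocated buffer so the crash path doesn't
// allocate. Reuses the hypothetical _swiftDemangle shim declared above.
func demangle(_ mangledName: UnsafePointer<CChar>,
              length: UInt,
              into buffer: UnsafeMutableBufferPointer<CChar>) -> Bool {
    var capacity = UInt(buffer.count)
    return _swiftDemangle(mangledName: mangledName,
                          mangledNameLength: length,
                          outputBuffer: buffer.baseAddress,
                          outputBufferSize: &capacity,
                          flags: 0) != nil
}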

2 Likes

Another benefit of using a separate process for crash reporting would be that the crash-handling process doesn't need to live in an austere runtime environment because of a possibly corrupt host process. I was recently talking to some engineers about their work on the crash handler for Clang; they had struggled with the limitations of in-process handling for a while, but ultimately switched to forking a supervisor process, and that's what allowed Clang to report not only a simple backtrace but also to collect files from the filesystem and bundle up the inputs needed to reproduce the crash. That specific case might not be of much relevance to servers, but I can imagine servers wanting to collect more interesting information from their environment, such as logs, and bundle it into rich crash reports, and that becomes tricky if you have to work from an arbitrarily corrupted process state.
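For the server case, even a very small supervisor gets a lot of this. Here's a sketch using Foundation's Process; the executable path, arguments, and what actually goes into the report are placeholders.

// Rough sketch of an out-of-process supervisor using Foundation's Process.
// What to collect and where to send the report are left as placeholders.
import Foundation

func supervise(executable: String, arguments: [String] = []) throws {
    let child = Process()
    child.executableURL = URL(fileURLWithPath: executable)
    child.arguments = arguments

    try child.run()
    child.waitUntilExit()

    if child.terminationReason == .uncaughtSignal {
        // The supervisor's own memory is intact, so it can safely gather logs,
        // core files, environment details, etc. and assemble a crash report.
        print("child crashed with signal \(child.terminationStatus)")
    }
}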

4 Likes