[Pitch] Swift Backtracing API

al45tair · January 23, 2023, 10:01am

As part of my work on adding backtraces to Swift, I'd like to propose the addition of an API surface so that Swift programs are able to programmatically capture backtraces. This is often helpful when debugging a non-fatal problem that rarely occurs (since you can add code to detect the problem and emit a backtrace), and should be useful for testing frameworks and other utility packages.

I have a draft SE proposal here that shows what I'm presently working towards.

Comments or suggestions appreciated.

DevAndArtist · January 23, 2023, 10:16am

This looks very interesting, but I personally don't have enough experience in this area to provide valuable feedback. The only thing that immediately caught my attention was this property:

public var buildID: [UInt8]?

Is it possible to have a non-optional but empty buildID array? If not, why isn't the absence of an ID modeled via an empty array instead?

Ideally it would probably be something like NonEmpty<[UInt8]>?, but unfortunately we still don't have an official way to express expected collection sizes or numeric ranges.

al45tair · January 23, 2023, 10:30am

Is it possible to have a non-optional but empty buildID array? If not, why isn't the absence of an ID modeled via an empty array instead?

Interesting point. On Darwin platforms, no, it isn't. On ELF systems though, the build ID comes from an ELF note in a format that doesn't appear to have a proper formal specification anywhere; in principle, it could be an empty array in that case (though the utility of such a thing is highly questionable for obvious reasons).

Personally I like the optionality here because it makes explicit the fact that it might not exist; not every binary has a build ID. I think without it, users of the API might be lured into thinking that wasn't true, particularly if they're developing on Darwin platforms where binaries do all have build IDs.
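
To illustrate the point about optionality, here is a minimal sketch of how calling code might handle the property. ImageInfo and buildIDString(for:) are hypothetical names made up for this example; only the buildID property itself comes from the draft.

```swift
// Hypothetical stand-in for the proposal's image type. Only `buildID`
// is taken from the draft; the rest exists for illustration.
struct ImageInfo {
    var name: String
    var buildID: [UInt8]?
}

/// Renders a build ID as lowercase hex. Returns nil when there is nothing
/// usable: either no build ID at all, or a degenerate empty one.
func buildIDString(for image: ImageInfo) -> String? {
    guard let bytes = image.buildID, !bytes.isEmpty else { return nil }
    return bytes.map { byte -> String in
        let hex = String(byte, radix: 16)
        return hex.count == 1 ? "0" + hex : hex
    }.joined()
}

let withID = ImageInfo(name: "libfoo.so", buildID: [0xde, 0xad, 0x0b, 0xef])
let withoutID = ImageInfo(name: "libbar.so", buildID: nil)
print(buildIDString(for: withID) ?? "<no build ID>")
print(buildIDString(for: withoutID) ?? "<no build ID>")
```

Because the property is optional, the compiler forces the caller to decide what to do when the ID is missing, which is the behaviour argued for above.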

ktoso · January 23, 2023, 10:37am

Awesome, very exciting to see that we're considering an explicit API as well!

I'm wondering about testing and debugging tools that want to report something like "the thing that caused this crash was originally created at <information captured manually using this API, e.g. only in a special debugging mode>". For such tools it is often enough to get a frame or two above the point where I'm capturing.

So with that in mind:

is it worth adding an API to: capture just a few frames of a backtrace (limit: or similar),
and is it worth allowing to symbolicate a single Frame rather than only the entire Backtrace?

I'm thinking mostly about APIs like "you created a task/future/stream", where, when something related to it crashes, we're able to track down where that one came from. I have often had to reverse engineer such questions by manually printing every created object and then backtracking from a crash to where a thing was created. WDYT?

tevelee · January 23, 2023, 10:58am

Awesome proposal! Very much looking forward to this

Two questions:

1. In the unsymbolicated Backtrace struct, I don't see a relation between Frames and Images; in fact, images is a lazy var. Is this a performance decision? Do we need the symbolication step to know which module/framework a given frame address is related to?
2. In the SymbolicatedBacktrace struct, the Symbol struct contains var imageIndex: Int, which users of this API can look up in the images array. Why did you decide not to include the Image directly (lazy var image: Image)?

Thanks!

DevAndArtist · January 23, 2023, 10:58am

In addition to capturing and tracing the states, would it be possible to somehow determine if the async task that we walked through was cancelled or not?

Right now only root tasks can cancel, but I think this limitation could be lifted and we could see partial child tasks to be cancellable and discardable. Collecting some traces of whether a particular task was cancelled or not would be very valuable.

However I'm not 100% sure if this fits into this API at all or not.

hassila · January 23, 2023, 11:03am

Super happy to see this. Just wanted to +1 the separation of symbolication from the capture, gold decision - as it can be expensive, one might want to symbolicate on the tail end of an operation to minimize the impact of the operation in progress too.

al45tair · January 23, 2023, 11:35am

Possibly, yes. offset: might make sense too, so that you can ask to skip the current frame (for example). I was trying to keep the number of options to a minimum to avoid complicating things, but on reflection maybe these two would be useful enough to include.

I wonder whether limit: should have a "sane" default value (i.e. not nil) to protect against cases where someone has runaway recursion and then called something that tried to capture a backtrace?
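
As a rough illustration of what limit: and offset: could mean, here is a sketch built on Foundation's existing Thread.callStackReturnAddresses rather than on the proposed API. The function name captureAddresses and the default limit of 64 are assumptions made up for this example.

```swift
import Foundation

/// Captures return addresses from the current thread, skipping `offset`
/// frames and keeping at most `limit` of them (nil means no limit).
/// `captureAddresses`, `offset`, and `limit` are illustrative names only.
func captureAddresses(offset: Int = 0, limit: Int? = 64) -> [UInt] {
    // The first entry typically belongs to this helper itself, so it is
    // always skipped in addition to the caller-requested offset.
    let all = Thread.callStackReturnAddresses.dropFirst(1 + max(0, offset))
    // A non-nil default limit bounds the cost under runaway recursion.
    let selected = limit.map { all.prefix(max(0, $0)) } ?? all
    return selected.map { $0.uintValue }
}

let nearby = captureAddresses(limit: 3)
print("captured \(nearby.count) frame(s)")
```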

I'm less keen on that, I think. It's obviously possible in principle, though it will be more expensive than you'd expect I think (various costs are spread across all frames normally, but you'd be forced to pay them even to symbolicate one), and it complicates the API (I like the fact that, as currently designed, you either have a backtrace with no symbolication, or one that has been symbolicated — and that's expressed through the type system).
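
The "expressed through the type system" idea can be sketched with two distinct types and an explicit conversion between them. RawTrace, SymbolicatedTrace, and the lookup closure are hypothetical stand-ins for this example, not the draft's Backtrace and SymbolicatedBacktrace.

```swift
// Hypothetical types for illustration; not the draft's actual API.
struct SymbolicatedTrace {
    struct Frame {
        var address: UInt
        var symbolName: String?   // nil when the address could not be resolved
    }
    var frames: [Frame]
}

struct RawTrace {
    var addresses: [UInt]

    /// Symbolication is a separate, explicit, and potentially expensive step,
    /// so it can be deferred until the hot path has finished.
    func symbolicated(using lookup: (UInt) -> String?) -> SymbolicatedTrace {
        SymbolicatedTrace(frames: addresses.map { address in
            SymbolicatedTrace.Frame(address: address, symbolName: lookup(address))
        })
    }
}

// A toy symbol table standing in for real debug information.
let symbolTable: [UInt: String] = [0x1000: "main", 0x2000: "handleRequest"]

let raw = RawTrace(addresses: [0x2000, 0x1000, 0x3000])
let resolved = raw.symbolicated { symbolTable[$0] }
for frame in resolved.frames {
    print(String(frame.address, radix: 16), frame.symbolName ?? "<unknown>")
}
```

A function that accepts a SymbolicatedTrace can then rely on symbolication having been attempted, while one that accepts a RawTrace knows it is holding the cheap form.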

al45tair · January 23, 2023, 11:46am

Yes. Obtaining a list of images is an expensive operation, and it isn't necessarily needed in every case. For example, if you were capturing a backtrace using the fast unwinder when creating an Error of some kind, you wouldn't want to capture the image list at that point.

Multiple Symbols will likely resolve to the same image, and since Symbol doesn't have a back-reference to the SymbolicatedBacktrace, it can't easily look it up dynamically from an internal variable holding the index.

If Image were a class instead, we could then have var image: Image in Symbol without too much worry, albeit with a little refcounting overhead, but on balance having the image index seemed simplest, and also avoids worrying about how to compare Images (you might naïvely expect you could compare the build ID, and that would be the right comparison for some purposes, but actually you can load the same image multiple times into an address space, with different base addresses, so…)
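
A small sketch of the index-based lookup, and of why comparing build IDs alone can mislead. LoadedImage, ResolvedSymbol, and ResolvedTrace are hypothetical types for this example; only the idea of an imageIndex into an images array comes from the draft.

```swift
// Hypothetical types for illustration; not the draft's actual API.
struct LoadedImage {
    var name: String
    var buildID: [UInt8]?
    var baseAddress: UInt
}

struct ResolvedSymbol {
    var name: String
    var imageIndex: Int
}

struct ResolvedTrace {
    var images: [LoadedImage]
    var symbols: [ResolvedSymbol]

    /// Looks up the image for a symbol, returning nil for a stale index.
    func image(for symbol: ResolvedSymbol) -> LoadedImage? {
        images.indices.contains(symbol.imageIndex) ? images[symbol.imageIndex] : nil
    }
}

// The same binary loaded twice at different base addresses.
let trace = ResolvedTrace(
    images: [
        LoadedImage(name: "libplugin.so", buildID: [0x01, 0x02], baseAddress: 0x1000_0000),
        LoadedImage(name: "libplugin.so", buildID: [0x01, 0x02], baseAddress: 0x2000_0000),
    ],
    symbols: [
        ResolvedSymbol(name: "pluginInit", imageIndex: 0),
        ResolvedSymbol(name: "pluginInit", imageIndex: 1),
    ]
)

let first = trace.image(for: trace.symbols[0])!
let second = trace.image(for: trace.symbols[1])!
print(first.buildID == second.buildID)          // same build ID
print(first.baseAddress == second.baseAddress)  // but distinct loads
```

Several symbols can share one index without copying the image, and the index distinguishes two loads that a build ID comparison would treat as equal.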

lukasa · January 23, 2023, 12:29pm

This is very exciting! I don't have substantive feedback on the API surface, it seems reasonable to me, but I'm delighted to see the work on this functionality coming to fruition.

FranzBusch · January 23, 2023, 1:36pm

This looks great! Can't wait to use it. One question though: currently capture() only lets you get the backtrace for the current location on the current thread. I was wondering if we could expose something to get backtraces for all threads owned by the current process?

This would allow implementing an in-process profiler that captures the stack traces of all threads at periodic intervals.

Also does this work with Objective-C/C++/C code in the stack?

al45tair · January 23, 2023, 1:42pm

While there will be code to capture backtraces from other threads (because we need it when we're capturing backtraces for a crash), I'm not proposing to expose API for that at this point. We could potentially add something for that in the future, however.

Yes. That's absolutely a requirement here, and symbolication and demangling need to work for those cases too.

ktoso · January 23, 2023, 1:47pm

Sounds like a good idea, some arbitrary number you think would be good here and perhaps configurable via an ENV variable might be nice for this.

I see; I don't see a strong need for this -- primarily was thinking about the "limit / offset" dance to avoid collecting everything -- if I'd be able to just collect 3-4 frames and symbolicate those that makes sense for those small tools I was thinking about

Thank you for the work here! It's looking great.

--

Mini question: I know we've had backtraces through async frames for a while on Apple platforms, but I was wondering if you could explain a bit how to interpret a Frame that has isAsync = true. Do I understand correctly that these are always going to be a suspension point? (e.g. the function in which await hello() caused a call to hello(), which then would be a normal frame "next"). Or am I interpreting it the wrong way around?

tcldr · January 23, 2023, 1:50pm

This looks great, excited to see it come to fruition.

It immediately makes me think of a potential future direction that plays into the discussion around the elusive typed throws.

One of the issues in that discussion was that while most felt it had strong use cases around local control flow, there was some reticence around moving forward with the feature due to the potential of 'abuse', and a proliferation of overly rigorous Error types with deeply nested sub errors.

A quote from that discussion:

However, there was one fairly reasonable use case for nested errors: and that was to create a pseudo backtrace from where an Error was thrown to bubble up to the application layer and diagnose where/why an Error occurred.

It seems that the feature pitched in this thread would be really useful in helping programmers supersede that practice. If there were some way of enclosing throwing code in a block that would 'collect' a backtrace at the precise point an error was thrown (and/or rethrown), it would be a really powerful thing for production debugging.

Something along the lines of:

do {
  try callIntoDeeplyNestedLibraryWhichThrowsInVariousPlaces()
}
catch let error: KnownError {
  // deal with a _known_ error as usual
  ...
}
catch _, let backtraces {
  // gather backtraces and report via developer chosen, application level library/utility
  Log.nonFatalErrors(backtraces)
}

I'm not sure how performance intensive this would be, but if it could be done without too much performance impact (and conditionally on enclosure of a catch that included that second param), it might be a really nice feature to have.
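
Without any new catch syntax, something similar can be approximated today by capturing addresses when the error value is created. This is only a sketch: TracedError, ParseError, and parse(_:) are hypothetical names, and it uses Foundation's existing Thread.callStackReturnAddresses rather than the pitched API.

```swift
import Foundation

/// Wraps an error together with the return addresses captured at the point
/// where the wrapper was created. Hypothetical type for illustration.
struct TracedError: Error {
    let underlying: Error
    let returnAddresses: [UInt]

    init(_ underlying: Error) {
        self.underlying = underlying
        // Only raw addresses are captured here; symbolication is left for later.
        self.returnAddresses = Thread.callStackReturnAddresses.map { $0.uintValue }
    }
}

enum ParseError: Error {
    case emptyInput
}

func parse(_ text: String) throws -> Int {
    guard !text.isEmpty else { throw TracedError(ParseError.emptyInput) }
    return text.count
}

do {
    _ = try parse("")
} catch let error as TracedError {
    print("error:", error.underlying, "frames:", error.returnAddresses.count)
} catch {
    print("untraced error:", error)
}
```

The cost is paid on every throw whether or not anyone inspects the trace, which is the overhead the conditional-capture idea above is trying to avoid.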

And hopefully we'd finally see typed throws.

al45tair · January 23, 2023, 1:56pm

As I understand it, frames with isAsync = true represent continuations (resumption points, really, rather than suspension points per se). So the top one will be the continuation that invoked whatever non-async frames are above it in the backtrace. Subsequent async frames show you where asynchronous execution will resume next. The program counter values for (subsequent) async frames are always exact, rather than being return addresses, because they're called by the concurrency runtime when it's ready for them to execute.

ktoso · January 23, 2023, 2:06pm

Ah right, the addresses of where we'd resume, makes sense -- thank you for clarifying

mattie · January 23, 2023, 2:50pm

I see a few problems with the addition of an explicit UnwindAlgorithm. Allowing the user to think they are in control here could be an issue, since the available information can vary from frame to frame within a given trace. Also, I don't see Compact Unwind as a case, but that is a very important mechanism on Darwin platforms. And, because compact unwind will also potentially require DWARF, this feels like a can of worms.

I also think there's an issue assuming fast == frame pointers. On Android NDK, for example, there are no frame pointers by default (or at least there didn't use to be; I have not checked in a while). So, what happens in that, admittedly unusual, situation? I'd propose separating ABI from the API client's intention. Would you consider maybe auto, precise, fast? I think this could make it more portable, while also better capturing the client's intention without needing them to understand the ABI details.

Inlining means that one address will map to more than one symbol. You can see this with the atos -i flag, for example. The API currently cannot support this.

Swift does a lot of code generation, and today, it captures that by encoding the file with a line/column both 0. Users often find this confusing, and it is typically mistaken for a bug in the symbolication/backtracing system. To my knowledge, there isn't enough metadata in the Darwin DWARF info today to be more clear about this. But, I wanted to bring this up because addressing it complicates the SourceLocation struct.

Finally, I see a comment about adjusting the program counter, but it isn't 100% accurate. A return address working accurately for line-column information lookup is a special case. Address adjustments are, in general, required for all cases. I just wanted to point it out because I've spent too much of my life wrestling with this problem to let it slide without being annoyingly pedantic about it.

hassila · January 23, 2023, 3:39pm

+1 for that too as a future direction - can see a few cases when it’d be very helpful.

al45tair · January 23, 2023, 3:45pm

I think you're over-estimating the amount of control being provided here. The dwarf option will, on Darwin, also use compact unwind information (and I have no plans to provide a separate setting for "just DWARF" or "just compact unwind", because those make no sense), and will fall back to using the frame pointer if there's no DWARF data available for a given frame.

That said, you make an interesting point about maybe naming the options auto, precise and fast rather than giving more specific names in the API. That does seem like a good idea as it would get some system specifics out of the API that don't really need to be there.
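
One way to picture intent-based options is the sketch below. Only the three case names auto, precise, and fast come from this thread; UnwindStrategy, UnwindMechanism, mechanisms(for:framePointersAvailable:), and the fallback policy are all assumptions invented for this example.

```swift
// Hypothetical names throughout; only the case names `auto`, `precise`,
// and `fast` are suggested in the thread.
enum UnwindStrategy {
    case auto
    case precise
    case fast
}

// The mechanisms an implementation might try for a given frame.
enum UnwindMechanism: Equatable {
    case unwindTables    // compact unwind and/or DWARF, whatever the platform has
    case framePointers
}

/// Returns the order in which mechanisms would be tried for a frame.
/// The policy is an assumption for illustration, not the proposal's.
func mechanisms(for strategy: UnwindStrategy,
                framePointersAvailable: Bool) -> [UnwindMechanism] {
    switch strategy {
    case .precise:
        // Prefer unwind tables, falling back to frame pointers if present.
        return framePointersAvailable ? [.unwindTables, .framePointers] : [.unwindTables]
    case .fast, .auto:
        // Prefer frame pointers, but still work on ABIs that omit them.
        return framePointersAvailable ? [.framePointers, .unwindTables] : [.unwindTables]
    }
}

print(mechanisms(for: .fast, framePointersAvailable: false))
```

The client states what it wants (speed or accuracy), and the platform-specific choice of mechanism stays out of the API.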

I'd like to see an example of exactly what you have in mind here, but note also that to some extent what we return here may be up to platform APIs, so even if we were to alter the API to return e.g. [Symbol] instead of just Symbol, that doesn't mean that on any given platform you'd be able to get multiple results in practice.

I think that's probably a higher layer concern here; it's certainly true that, for instance, you may not want to display thunk functions and things to the user when you print a backtrace. This API is really just there to let you capture the information that exists in the binary, so yes, you will sometimes see line and column both 0.

I'm not sure quite what you mean by that. You wouldn't want to adjust the address for a program counter value that came from an async continuation or from a thread context that was captured somehow — those are accurate program counter values and should be used as-is. The point here is that the API will provide the necessary information to let you know whether or not a given frame's program counter needs adjustment, and will also provide an adjusted value for you so you don't need to worry about that if that's what you want.
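
The distinction described here can be sketched as a value that records where an address came from and exposes both the original and a lookup-adjusted form. CapturedAddress and its members are hypothetical names, and subtracting 1 from a return address is the common convention assumed for this example rather than something the draft specifies.

```swift
// Hypothetical model; case and property names are illustrative only.
enum CapturedAddress {
    case programCounter(UInt)     // exact PC, e.g. from a captured thread context
    case returnAddress(UInt)      // address of the instruction after a call
    case asyncResumePoint(UInt)   // exact address where async execution resumes

    /// The address exactly as captured.
    var originalValue: UInt {
        switch self {
        case .programCounter(let a), .returnAddress(let a), .asyncResumePoint(let a):
            return a
        }
    }

    /// The address to use for symbol and line lookup. Return addresses are
    /// moved back so they fall inside the call instruction; exact values
    /// are used as-is.
    var adjustedForLookup: UInt {
        switch self {
        case .returnAddress(let a):
            return a > 0 ? a - 1 : a
        case .programCounter(let a), .asyncResumePoint(let a):
            return a
        }
    }
}

let frames: [CapturedAddress] = [
    .programCounter(0x1004),
    .returnAddress(0x2008),
    .asyncResumePoint(0x3000),
]
for frame in frames {
    print(String(frame.originalValue, radix: 16), String(frame.adjustedForLookup, radix: 16))
}
```

Keeping both values available means a consumer that disagrees with the adjustment policy can still apply its own.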

mattie · January 23, 2023, 4:17pm

I'm just looking at the API, which offers a great deal of (implied) control. I think that, given how it is defined now, "just compact unwind" does indeed make sense. That option could produce a much higher-quality backtrace than frame pointers alone; it would be slower, but neither as slow nor as good as compact unwind + DWARF.

I'm glad you are into the idea of a more-abstract unwind strategy. Especially because it sounds like this is actually how it works under the hood.

If you can get source location information at all, I assume it means you have access to DWARF data? In that case, you'll have access to the inline info needed, and CoreSymbolication (which backs atos) will be able to iterate over all the data for a single address. I do not know how this works on non-Darwin platforms.

Inline support is a PITA, because it happens so rarely, and complicates so much. The inline support in the gSYM file format used by LLVM, for example, is very complex. But, I do want to stress that as-is this API cannot be used to symbolicate inlined functions.
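
To make the inlining concern concrete, here is a toy lookup that returns every function covering an address, innermost first. InlineAwareTable and its contents are invented for this example and do not reflect how CoreSymbolication, gSYM, or the draft API represent inline information.

```swift
// Toy model of inline-aware symbol lookup; all names are hypothetical.
struct InlineAwareTable {
    struct Entry {
        var range: Range<UInt>
        var functions: [String]   // innermost inlined function first
    }
    var entries: [Entry]

    /// Returns all functions covering `address`, or an empty array.
    func symbols(at address: UInt) -> [String] {
        entries.first { $0.range.contains(address) }?.functions ?? []
    }
}

let table = InlineAwareTable(entries: [
    // `clamp` was inlined into `render`, so both cover these addresses.
    .init(range: 0x1000..<0x1010, functions: ["clamp", "render"]),
    .init(range: 0x1010..<0x1040, functions: ["render"]),
])

print(table.symbols(at: 0x1008))
print(table.symbols(at: 0x1020))
```

An API that returns a single symbol per address has to pick one of the entries for 0x1008 and drop the other.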

I really dislike the "line 0" overload, and I would hate to see that perpetuated in a newly-designed API, even if there is no established other way to do this today. The binary could describe this, the DWARF spec supports it. It just isn't actually emitted by the swift compiler/linker today. But maybe one day it gets fixed, and if it ever did, this API would need to be revised to also support it.
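
One possible shape for avoiding the "line 0" overload is to give compiler-generated code its own case. The Location enum and classify function below are hypothetical, and treating line 0 and column 0 as "compiler-generated" simply mirrors the convention described earlier in this thread.

```swift
// Hypothetical alternative to overloading line 0; not part of the draft.
enum Location: Equatable {
    case source(path: String, line: Int, column: Int)
    case compilerGenerated(path: String)
    case unknown
}

/// Converts raw debug-info values into an explicit case, so that callers
/// never have to know that line 0 / column 0 is a sentinel.
func classify(path: String?, line: Int, column: Int) -> Location {
    guard let path = path else { return .unknown }
    if line == 0 && column == 0 {
        return .compilerGenerated(path: path)
    }
    return .source(path: path, line: line, column: column)
}

print(classify(path: "main.swift", line: 12, column: 5))
print(classify(path: "main.swift", line: 0, column: 0))
```

If the toolchain later emits richer metadata for generated code, a dedicated case gives it somewhere to go without changing how ordinary locations are represented.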

al45tair:

mattie:

Finally, I see a comment about adjusting the program counter, but it isn't 100% accurate. A return-address working accurately for line-column information look up is a special-case. Address adjustments are, in general, required for all cases.

I'm not sure quite what you mean by that. You wouldn't want to adjust the address for a program counter value that came from an async continuation or from a thread context that was captured somehow — those are accurate program counter values and should be used as-is. The point here is that the API will provide the necessary information to let you know whether or not a given frame's program counter needs adjustment, and will also provide an adjusted value for you so you don't need to worry about that if that's what you want.

Even for a continuation, I would expect the return address to be the instruction after the call. And, to reconstruct the calling function, just like with a normal call, that's not the address you need. It frequently works, because of how many addresses map to the same line of code. The current comment talks only about inline functions, which is definitely not the only place this happens. But, like I said, this is being pedantic. As long as the API returns the real return address in addition to any adjusted value I think it will work great.