Measuring, avoiding, and diagnosing a stack memory overflow crash

We recently saw an EXC_BAD_ACCESS crash in our app that was very difficult to diagnose from the stack trace. The top several frames all referenced the Swift demangler:

#0	0x0000000185874a14 in swift::Demangle::__runtime::Demangler::DemangleInitRAII::DemangleInitRAII(swift::Demangle::__runtime::Demangler&, __swift::__runtime::llvm::StringRef, std::__1::function<swift::Demangle::__runtime::Node* (swift::Demangle::__runtime::SymbolicReferenceKind, swift::Demangle::__runtime::Directness, int, void const*)>) ()
#1	0x0000000185873d0c in swift::Demangle::__runtime::Demangler::demangleType(__swift::__runtime::llvm::StringRef, std::__1::function<swift::Demangle::__runtime::Node* (swift::Demangle::__runtime::SymbolicReferenceKind, swift::Demangle::__runtime::Directness, int, void const*)>) ()
#2	0x0000000185853d10 in _findExtendedTypeContextDescriptor(swift::TargetContextDescriptor<swift::InProcess> const*, swift::Demangle::__runtime::Demangler&, swift::Demangle::__runtime::Node**) ()
#3	0x000000018585588c in swift::SubstGenericParametersFromMetadata::buildDescriptorPath(swift::TargetContextDescriptor<swift::InProcess> const*, swift::Demangle::__runtime::Demangler&) const ()
#4	0x00000001858558f4 in swift::SubstGenericParametersFromMetadata::buildDescriptorPath(swift::TargetContextDescriptor<swift::InProcess> const*, swift::Demangle::__runtime::Demangler&) const ()
#5	0x00000001858558f4 in swift::SubstGenericParametersFromMetadata::buildDescriptorPath(swift::TargetContextDescriptor<swift::InProcess> const*, swift::Demangle::__runtime::Demangler&) const ()
#6	0x0000000185856028 in swift::SubstGenericParametersFromMetadata::setup() const ()
#7	0x000000018586237c in std::__1::__function::__func<(anonymous 

After much trial and error, we eventually realized the crash stemmed from a stack overflow. We verified this by setting Other Linker Flags to -Wl,-stack_size,0x10000000, which gave us a larger stack and avoided the bad access at runtime. We also found the offending struct and changed it to a class, after which we could no longer reproduce the crash.

The stack trace also turned out to be a red herring, especially since repeated runs surfaced subtly different traces. We also never saw a typical stack overflow crash that references chkstk_darwin in the stack trace, so I'm not sure what to make of that.

The offending struct is rather large, yet it has existed in our app for years. The difference now is that we are adopting SwiftUI, and this struct is being used in @Environment objects. It's unclear to me whether @Environment is somehow copying the data on the stack and inflating it that way, or whether simply nesting a large struct type within another SwiftUI.View struct is causing the excess stack growth.

What's also alarming to me is that some devices saw this crash consistently while other devices never saw it. One colleague's iPhone 13 mini (running iOS 16.5) crashed 100% of the time, while my iPhone 13 mini (running iOS 16.1.1) never crashed. I was eventually able to force the crash by artificially inflating the struct with many unused properties.


I'm looking for a few ideas here: 1) on how we can mitigate this in the future, 2) how we can currently measure this and warn ourselves if our stack baseline gets too high, and 3) why we're seeing the stack size inconsistencies across devices.

  1. For my first question: we haven't yet been forced to adopt Copy-on-Write semantics, but we realize it would offer one solution to our problem. I don't want to outright ban large structs, thus throwing out the baby with the bathwater, but I'm curious if there's any guidance or community wisdom we can lean on here. There's already at least one great forum post that describes similar pains and strategies, but it's also 2 years old, so it's possible ideas and solutions have evolved since then.

I'd also be interested in seeing examples of Copy-on-Write wrappers, macros (for when we can move to Xcode 15), or any other solutions that are more ergonomic than the ones I've seen so far.

  2. In terms of measuring the stack, is this something that Instruments can already inform us of? We came across another post that has code to calculate the existing stack size. It doesn't seem trustworthy enough to serve as a measuring stick during long-running instrumentation, but it might be another data point in tracking down future mysterious crashes.

  3. Why would only some of our devices see the stack overflow crash (with it being 100% consistent on those devices)? Does anyone have documentation or anecdotal evidence suggesting that the stack size for apps can vary under different circumstances? What's really bad here is that none of us saw this issue until we went to TestFlight, which then surfaced the crash for 100% of our testers, and we eventually had to locate test devices that also experienced the crash.


I investigated a stack overflow issue earlier this year. In my case none of the existing mechanisms (macOS stack guard, address/thread sanitizers, etc.) were able to detect the issue. I summarized what happened under the hood here.

I found it very helpful to estimate stack size using vmmap. See my approach here. Applying it in practice, however, might be tricky. In my case, when the stack overflow occurred, the main thread silently used another thread's stack, so the vmmap output wasn't accurate. To avoid this, you can set a large custom stack size (as you mentioned) when using this approach.

There are many reasons (including compiler bugs) that could lead to a stack overflow. I think the key is to identify the code that caused the issue. The most straightforward approach is to look at the assembly code. This doesn't require familiarity with low-level details like register usage; you only need basic information such as which function it is and how much stack it allocates. In my case I used this approach to determine that the issue occurred in an enum accessor. EDIT: I'm aware that the code that crashed might not be the code that caused the issue. For example, in my case it's entirely possible that the main thread corrupted a background thread's stack and caused the background thread to crash. So this approach has its limitations, though it should usually be helpful.


One advantage of 64-bit CPUs is the huge address space. In principle we could use really large stacks, like 16GB, because the VM system would only commit what's actually used, rounded up to the VM page size (4 KB or 16 KB, depending on the platform).
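As a rough sketch of what a larger stack looks like in practice, the snippet below runs deep recursion on a Foundation `Thread` whose `stackSize` is set before it starts. `runWithStack`, `ResultBox`, and `sumTo` are hypothetical names made up for this illustration, and the 16 MB figure is an arbitrary assumption, not a recommendation.

```swift
import Foundation

/// Hypothetical holder for a result produced on another thread.
/// Access is serialized by the semaphore in `runWithStack`.
final class ResultBox<T>: @unchecked Sendable {
    var value: T?
}

/// Runs `work` on a new thread with a caller-chosen stack size and
/// blocks until it finishes. (Illustrative helper, not a library API.)
func runWithStack<T>(bytes: Int, _ work: @escaping @Sendable () -> T) -> T {
    let box = ResultBox<T>()
    let done = DispatchSemaphore(value: 0)
    let thread = Thread {
        box.value = work()
        done.signal()
    }
    thread.stackSize = bytes   // must be set before start()
    thread.start()
    done.wait()
    return box.value!
}

/// Deliberately non-tail-recursive so every call keeps a stack frame
/// (at least in unoptimized builds).
func sumTo(_ n: Int) -> Int {
    n == 0 ? 0 : n + sumTo(n - 1)
}

let total = runWithStack(bytes: 16 * 1024 * 1024) { sumTo(50_000) }
print(total)
```

This only helps for work you can move off the main thread; the main thread's stack size is fixed at link time (for example via the -stack_size linker flag mentioned earlier in the thread).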

I found the stack overflow sniffer unreliable, and when in doubt I prefer to use my hand-made stack overflow checker in key places (e.g. at the beginning of a recursive function). In principle you can make your key functions throwing and throw when you detect that the stack is about to overflow, provided you can handle the stack overflow condition sensibly.
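A hand-made checker of this kind could look roughly like the sketch below. It is not the poster's actual checker: `StackGuard`, `StackBudgetExceeded`, and `depth` are hypothetical names, the budget is an assumed number you would tune per thread, and the address of a local variable is only an approximation of the current stack position.

```swift
/// Hypothetical error thrown when the assumed stack budget is exceeded.
struct StackBudgetExceeded: Error {
    let bytesUsed: Int
}

/// Approximates the current stack position by taking the address of a local.
@inline(never)
func currentStackAddress() -> UInt {
    var marker = 0
    return withUnsafeMutablePointer(to: &marker) { UInt(bitPattern: $0) }
}

/// Remembers where the stack was when it was created and checks how far
/// the stack has grown since then.
struct StackGuard {
    let base: UInt    // stack position when the guard was created
    let budget: Int   // bytes allowed below `base` (an assumption to tune)

    init(budget: Int) {
        self.base = currentStackAddress()
        self.budget = budget
    }

    func check() throws {
        let now = currentStackAddress()
        // The stack grows downward on arm64 and x86_64, so depth = base - now.
        let used = base > now ? Int(base - now) : 0
        if used > budget {
            throw StackBudgetExceeded(bytesUsed: used)
        }
    }
}

/// Example recursive function that checks the guard on entry.
func depth(_ n: Int, using stackGuard: StackGuard) throws -> Int {
    try stackGuard.check()
    if n == 0 { return 0 }
    return try 1 + depth(n - 1, using: stackGuard)
}

do {
    let stackGuard = StackGuard(budget: 64 * 1024)
    _ = try depth(1_000_000, using: stackGuard)
} catch let error as StackBudgetExceeded {
    print("bailed out after roughly \(error.bytesUsed) bytes of stack")
} catch {
    print("unexpected error: \(error)")
}
```

The guard has to be created on the same thread that later calls `check()`, and an optimized build may inline or restructure the recursion, so treat the byte counts as estimates.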

The difference in stack requirements could be due to a difference in OS versions, all other things being equal. I'd measure the current stack requirement and, if it is, say, 70% of the default stack size, do something about it, either by reducing the stack requirement or by enlarging the stacks. Note that stack sizes for secondary threads are smaller by default; some APIs let you control them, but others don't.
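One way to get a rough number for that kind of check is sketched below. It assumes the Darwin-only calls `pthread_get_stackaddr_np` and `pthread_get_stacksize_np`; `stackUsageFraction` is a hypothetical helper, the 70% threshold is just the figure from the paragraph above, and the result should be treated as an estimate rather than a precise measurement.

```swift
import Foundation

/// Estimated fraction (0...1) of the current thread's stack that is in use,
/// or nil where the Darwin-only pthread calls are unavailable.
@inline(never)
func stackUsageFraction() -> Double? {
    #if canImport(Darwin)
    let thread = pthread_self()
    // On Darwin this is the high end of the stack; it grows downward from here.
    let top = UInt(bitPattern: pthread_get_stackaddr_np(thread))
    let size = pthread_get_stacksize_np(thread)
    // The address of a local approximates the current stack position.
    var marker = 0
    let here = withUnsafeMutablePointer(to: &marker) { UInt(bitPattern: $0) }
    guard size > 0, top >= here else { return nil }
    return Double(top - here) / Double(size)
    #else
    return nil
    #endif
}

if let fraction = stackUsageFraction() {
    if fraction > 0.7 {
        print("warning: roughly \(Int(fraction * 100))% of this thread's stack is in use")
    } else {
        print("stack usage is roughly \(Int(fraction * 100))%")
    }
} else {
    print("stack usage is not measurable on this platform with this sketch")
}
```

Calling this from the deepest point you care about (for example inside a view's body or a recursive helper) in a debug build gives a data point to log or assert on.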

COW is, strictly speaking, an orthogonal aspect: you can make your structs smaller with or without it, and you can still keep your structs as structs at the top level, with a reference field for the actual storage. Otherwise you can use other ways of reducing space (e.g. Int -> Int16, reordering fields for better alignment, packing several individual Bool fields together, etc.).
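A minimal sketch of the "struct on top, reference field for storage" idea, with copy-on-write added via `isKnownUniquelyReferenced`. `CoW`, `Storage`, and `Big` are hypothetical names for illustration; a real wrapper would likely also need read-only key path access, `Equatable` forwarding, and so on.

```swift
/// Heap box that holds the actual (large) value.
final class Storage<Value> {
    var value: Value
    init(_ value: Value) { self.value = value }
}

/// Value-semantic wrapper whose inline size is a single reference.
@dynamicMemberLookup
struct CoW<Value> {
    private var storage: Storage<Value>

    init(_ value: Value) {
        storage = Storage(value)
    }

    /// Read-only view of the whole wrapped value.
    var value: Value { storage.value }

    subscript<T>(dynamicMember keyPath: WritableKeyPath<Value, T>) -> T {
        get { storage.value[keyPath: keyPath] }
        set {
            // Copy the heap storage only if another CoW value shares it.
            if !isKnownUniquelyReferenced(&storage) {
                storage = Storage(storage.value)
            }
            storage.value[keyPath: keyPath] = newValue
        }
    }
}

/// Stand-in for a large struct: sixteen Ints plus a flag.
struct Big {
    var payload = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    var flag = false
}

var a = CoW(Big())
let b = a            // shares storage; no copy of Big yet
a.flag = true        // storage is shared, so it is copied before the write
print(a.flag, b.flag)
print(MemoryLayout<Big>.size, MemoryLayout<CoW<Big>>.size)
```

The wrapper occupies one pointer wherever it is embedded (for example as a property of a SwiftUI view), while the large payload lives on the heap. `MemoryLayout` is also a handy way to check the effect of the other suggestions, such as narrowing integer types or reordering fields.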
