Struct and enum accessors take a large amount of stack space

Accessing a simple enum value consumes about 16K additional stack space. See code below:

import Foundation

enum Foo {
    case x(Int)
}

func main() {
    sleep(2)
    print(Foo.x(1))
    sleep(2)
}

main()

To reproduce the issue, put the code in a command line project, run vmmap in a terminal to monitor the process stack size, then run the above code:

$ while true; do vmmap $(pgrep StackSizeTest) 2>/dev/null | grep "^Stack" | grep -v Guard ; echo "";  sleep 1; done

(Note: replace the pgrep command argument with your project name. Also make sure the project name is unique so that pgrep returns only one pid).

Below is the output in my terminal. The process starts with a 24K stack size, but grows to 40K when accessing a Foo instance. I'm sure the stack size change is not caused by print(), because if I print an integer instead of a Foo instance, the stack size doesn't change.

Stack                    7ff7bf700000-7ff7bff00000 [ 8192K    24K    24K     0K] rw-/rwx SM=PRV          thread 0
Stack                               8192K      24K      24K       0K       0K       0K       0K        1 

Stack                    7ff7bf700000-7ff7bff00000 [ 8192K    40K    40K     0K] rw-/rwx SM=PRV          thread 0
Stack                               8192K      40K      40K       0K       0K       0K       0K        1

My experiments show struct accessors have the same behavior. My environment: macOS 13.2.1, Xcode 14.2, Intel CPU.

I noticed this behavior because I'm investigating a stack overflow issue in my app. The enum values in my app have one associated value of about 1K in size. However, accessing these enum values caused a worker thread's stack size to grow gradually from 20K to more than 512K (and hence overflow). The enum accessor is compiled as an outlined function in the app. Below are two versions of it, each with a different frame size (there are more versions in the binary, and all of them are for the same enum type).

  1. This version creates a stack frame of about 84K size (0x15240 / 1024 = 84K).
acdbCNTests`outlined init with copy of FIDailyCEW:
->  0x13c18c070 <+0>:      pushq  %rbp
    0x13c18c071 <+1>:      movq   %rsp, %rbp
    0x13c18c074 <+4>:      subq   $0x15240, %rsp            ; imm = 0x15240 
    0x13c18c07b <+11>:     movq   %rdi, -0x50(%rbp)
    0x13c18c07f <+15>:     movq   %rsi, -0x48(%rbp)
    0x13c18c083 <+19>:     movq   %rsi, -0x40(%rbp)
    0x13c18c087 <+23>:     movq   %rdi, -0x38(%rbp)
    0x13c18c08b <+27>:     movq   %rsi, -0x30(%rbp)
    0x13c18c08f <+31>:     movq   %rdi, %rax
    0x13c18c092 <+34>:     movq   %rax, -0x28(%rbp)
    0x13c18c096 <+38>:     movq   %rdi, -0x20(%rbp)
    0x13c18c09a <+42>:     xorl   %eax, %eax
    0x13c18c09c <+44>:     movl   %eax, %edi
    0x13c18c09e <+46>:     callq  0x13f1273b0               ; type metadata accessor for acdbCN.FICEW at <compiler-generated>
  2. This version creates a stack frame of about 522K (0x82ab0 / 1024 ≈ 522K). It caused a stack overflow in a worker thread.
acdbCNTests`outlined assign with take of SomeDailyCEW?:
    0x13b582e20 <+0>:       pushq  %rbp
    0x13b582e21 <+1>:       movq   %rsp, %rbp
    0x13b582e24 <+4>:       pushq  %r14
    0x13b582e26 <+6>:       pushq  %rbx
    0x13b582e27 <+7>:       subq   $0x82ab0, %rsp            ; imm = 0x82AB0 
    0x13b582e2e <+14>:      movq   %rdi, -0x50(%rbp)
    0x13b582e32 <+18>:      movq   %rsi, -0x48(%rbp)
    0x13b582e36 <+22>:      movq   %rsi, %rax
    0x13b582e39 <+25>:      movq   %rax, -0x40(%rbp)
    0x13b582e3d <+29>:      movq   %rdi, -0x38(%rbp)
    0x13b582e41 <+33>:      movq   %rsi, -0x30(%rbp)
    0x13b582e45 <+37>:      xorl   %eax, %eax
    0x13b582e47 <+39>:      movl   %eax, %edi
->  0x13b582e49 <+41>:      callq  0x13b2c8950               ; type metadata accessor for acdbCN.SomeDailyCEW at <compiler-generated>

I wonder is this behavior by design or a bug? It can easily cause unexpected overflow. Is there some way or best practice to avoid this behavior?

3 Likes

I have no Intel machine to test this, but there is nothing wrong if

print(Foo.x(1))

requires more stack space temporarily than:

print(1)

Mapped (wired) stack space grows in page-size increments (likely 16K), so if before the "print" statement the used logical (virtual) stack size is, say, 4K, then the wired size is 16K. Then, if at the deepest point of "print(1)" execution the used stack size becomes, say, 14K, it is still within the currently mapped limits, so the wired size stays at 16K. And if at the deepest point of "print(Foo.x(1))" execution the used stack size becomes 17K, then a new 16K page is wired and the wired stack size becomes 32K. Once grown, it is never unwired, even when the logical (virtual) stack size shrinks back. In this regard, the way you are monitoring stack size is not very accurate. †
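The rounding described above can be sketched as follows (assuming 16K pages, as in the explanation; the function name is made up for illustration):

```swift
// Wired stack size is the peak used size, rounded up to a whole number of pages.
let pageSize = 16 * 1024

func wiredStackSize(usedBytes: Int) -> Int {
    ((usedBytes + pageSize - 1) / pageSize) * pageSize
}

print(wiredStackSize(usedBytes: 14 * 1024) / 1024) // 16: still within the first page
print(wiredStackSize(usedBytes: 17 * 1024) / 1024) // 32: a second page gets wired
```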

This is ok.

Just checking: is this a debug or a release build? Is it affected by Xcode diagnostic options? Some of them are known to cause lots of heap allocation to track things down; maybe they can also affect stack allocation?

I am not familiar with the "outlining" business... How many versions of the function are created?! Is there a way to prohibit this "outlining"? Similar to how we can prohibit inlining.

Those frame sizes are of course very suspicious. I don't know if the Swift compiler has any guidelines regarding stack usage, but generating code that allocates 522K on the stack doesn't sound wise. @eskimo to the rescue? Or, in this case, @John_McCall?

What you found here sounds quite dangerous, and we (as in the Swift community) would benefit if the root cause of this issue is found first, hence I can't recommend a workaround (yet). (In other words, it would be quite unfortunate for the rest of us if you, say, rewrote your app in a way that no longer triggers this bug; if this is a Swift compiler bug, more people would suffer from it down the road.)

† I've checked your mini sample above on godbolt and didn't find anything suspicious with regard to stack handling. As indicated previously, from experience resolving issues like this one, it can be hard to go from zero up to a minimal sample that reproduces what you are observing in the full app; oftentimes it is easier to go the other way around, from the full app down. I can help you with that if you want; obviously I'd need to see the sources, and I can sign an NDA if needed. Or try Apple dev support?

3 Likes

my hunch is this has to do with reflection, because you did not define a CustomStringConvertible conformance for the enum.

does this still occur if you implement description?
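a minimal sketch of such a conformance, using the Foo from the original post:

```swift
// Conforming to CustomStringConvertible lets print use `description`
// instead of falling back to reflective traversal of the value.
enum Foo: CustomStringConvertible {
    case x(Int)

    var description: String {
        switch self {
        case .x(let n): return "Foo.x(\(n))"
        }
    }
}

print(Foo.x(1)) // Foo.x(1)
```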

1 Like

I agree with taylorswift and the others in this thread that the minor increase in stack growth between print(1) and print(Foo.x(1)) is probably because of print's reflective traversal of the enum and nothing to do with value type codegen. But we definitely do have code size and stack size issues with very large value types, since they tend to get copied a lot and we don't always minimize copies that well. Since you're already using an enum, have you tried making the large payload cases indirect to reduce the inline size of the type?
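For illustration (the type names are hypothetical), marking a case `indirect` boxes the payload on the heap, so the enum itself stays pointer-sized:

```swift
// A struct with 128 bytes of inline storage (16 Doubles).
struct BigPayload {
    var data = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
}

enum Direct { case value(BigPayload) }          // payload stored inline
enum Boxed { indirect case value(BigPayload) }  // payload stored in a heap box

print(MemoryLayout<Direct>.size) // 128
print(MemoryLayout<Boxed>.size)  // 8
```

The trade-off is a heap allocation and reference count per value, in exchange for cheap copies and a small inline footprint.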

2 Likes

Another approach is using a simple reference box or a more complex copy-on-write box. These are common approaches in communities like the Composable Architecture's, which deal with large value types when modeling whole-app state. Is it possible Swift will ever deal with large values automatically?
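A sketch of the copy-on-write box pattern mentioned above (the type names are made up; this is the standard manual technique, not any particular library's API):

```swift
// Reference type that actually holds the value.
final class Ref<T> {
    var value: T
    init(_ value: T) { self.value = value }
}

// Value-type wrapper: copying the box copies only a pointer;
// the payload is copied lazily, on the first shared mutation.
struct CowBox<T> {
    private var ref: Ref<T>
    init(_ value: T) { ref = Ref(value) }

    var value: T {
        get { ref.value }
        set {
            if isKnownUniquelyReferenced(&ref) {
                ref.value = newValue          // sole owner: mutate in place
            } else {
                ref = Ref(newValue)           // shared: copy on write
            }
        }
    }
}

var a = CowBox([1, 2, 3])
let b = a            // copies only a pointer
a.value = [4, 5]     // storage was shared with b, so a gets fresh storage
print(a.value)       // [4, 5]
print(b.value)       // [1, 2, 3]
```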

1 Like

The most obvious thing to add would be direct support for indirect fields in structs. It would make sense to me to automatically indirect value types once they get big enough or have enough refcounted fields inside of them for that to make sense.

14 Likes

I would definitely want to be able to control whether any of my struct’s fields are indirect. One reason would be to conform to the BitwiseCopyable constraint being discussed in the other thread, but another would be to avoid ARC.

3 Likes

Sounds like you might have reproducible sample code for that or, perhaps, a failing unit test that shows the issue; if so, can you share it? I can think of a simple case involving recursion, but @rayx said there is no recursion in that part of the app.

:100:. Perhaps it could work just as with enums: an explicit "indirect" keyword on individual fields or on the whole struct.

It's possible that there's just something we could be doing in the frontend to make sure that LLVM re-uses stack slots more reliably. Even if we copy these types too many times, we shouldn't have to have all of those copies around on the stack at once.

Thanks for confirming the issue and for suggesting indirect. I didn't know indirect could be used for general purposes like this. Yes, it works well. The stack size is 56K now.

BTW, I did an experiment before I made the indirect change. I modified my code to remove the enum in question but kept the payload type, which is a struct of 1K size. I found that this also greatly reduced the stack size. It seems enums are more likely to have stack size issues than structs.

Yes, that's the reason.

I see your point. It's an estimation (though a very useful one). Chances are the initial 24K of stack memory is not completely used. That explains why print(1) doesn't cause the vmmap output to change. BTW, pagesize shows macOS uses a 4K page size on Intel CPUs.

I didn't know about outlined functions either. The concept is simple. See here.

Sorry about the "multiple versions" confusion. It was my hypothesis, based on the observation that the stack size grew gradually and finally overflowed. I tried to prove it and thought the two versions I posted were for the same enum, but they weren't. I investigated this today and finally figured out what happened. Since this is a known issue and we have solutions to avoid it, the following details are just for fun.

  • For a specific enum, there seems to be just one piece of code for its accessor. As the code in my original post shows, it may have a large frame size.

  • There is a 4K stack guard region immediately after each stack. Unfortunately, due to the enum accessor's big frame size and the way it works (while the accessor reserves a large amount of space, it doesn't necessarily use all of it), the stack guard fails to catch the overflow, because the accessor doesn't necessarily touch that 4K address range.

  • As a result, although there is an overflow, the code doesn't crash, because it is using another thread's stack! The code may continue to run until it finally tries to access an invalid virtual address (vmmap shows there are sometimes unallocated virtual address ranges between two stacks). Otherwise the code completes successfully (although it shouldn't). I believe this explains the random behavior I observed while testing the issue these days.

2 Likes

Good to know you figured it out and found the workaround.

I'd still like to see a small example that overflows the stack while working with 1K-sized enums/structs and not using recursion. IMHO this should be added to the test suite of the Swift compiler.

We recently found an issue where the compiler was failing to reuse stack space between switch cases, and allocating the stack space necessary for all of the enum payloads and cases' local state even though only one actually executes at a time. You might be running into the same problem.

Until we fix that issue, one workaround we've found for this issue is to wrap up each case block in an immediately-invoked closure, like:

switch foo {
case .bar:
  _ = {
    ...
  }()
case .bas:
  _ = {
    ...
  }()
}

If you see stack size issues even after adopting indirect cases, you might try that to see if it helps.
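A compilable sketch of that workaround (the enum and names are hypothetical; `log` stands in for real per-case work so the effect is observable):

```swift
enum Event {
    case bar(String)
    case bas(String)
}

var log: [String] = []

func handle(_ e: Event) {
    switch e {
    case .bar(let s):
        // Wrapping the case body in an immediately-invoked closure gives its
        // temporaries their own frame instead of inflating the switch's frame.
        _ = {
            log.append("bar: \(s)")
        }()
    case .bas(let s):
        _ = {
            log.append("bas: \(s)")
        }()
    }
}

handle(.bar("hello"))
print(log) // ["bar: hello"]
```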

4 Likes

Thank you. Based on this information I was able to create a minimal crashing app.

contrived example
import Foundation

struct A {
    var a = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
}
struct B {
    var b = (A(), A(), A(), A(), A(), A(), A(), A(), A(), A())
}
struct C {
    var b = (B(), B(), B(), B(), B(), B(), B(), B(), B(), B())
}

enum E {
    case a00(C), a01(C), a02(C), a03(C), a04(C), a05(C), a06(C), a07(C), a08(C), a09(C)
    case a10(C), a11(C), a12(C), a13(C), a14(C), a15(C), a16(C), a17(C), a18(C), a19(C)
    case a20(C), a21(C), a22(C), a23(C), a24(C), a25(C), a26(C), a27(C), a28(C), a29(C)
    case a30(C), a31(C), a32(C), a33(C), a34(C), a35(C), a36(C), a37(C), a38(C), a39(C)
    case a40(C), a41(C), a42(C), a43(C), a44(C), a45(C), a46(C), a47(C), a48(C), a49(C)
    case a50(C), a51(C), a52(C), a53(C), a54(C), a55(C), a56(C), a57(C), a58(C), a59(C)
    case a60(C), a61(C), a62(C), a63(C), a64(C), a65(C), a66(C), a67(C), a68(C), a69(C)
    case a70(C), a71(C), a72(C), a73(C), a74(C), a75(C), a76(C), a77(C), a78(C), a79(C)
    case a80(C), a81(C), a82(C), a83(C), a84(C), a85(C), a86(C), a87(C), a88(C), a89(C)
    case a90(C), a91(C), a92(C), a93(C), a94(C), a95(C), a96(C), a97(C), a98(C), a99(C)
}

func f(_ c: C) {
    print(c)
}

@inline(never)
func foo(_ e: E) {

// without indirect:
//    0x1000170ec <+32>:   sub    sp, sp, #0xc5, lsl #12    ; =0xc5000, 806912 (approx 100 x 8000)
    
// with indirect:
//    0x1000173cc <+32>:   sub    sp, sp, #0xc3, lsl #12    ; =0xc3000

// with {f(c)}()
//    0x100003904 <+92>:   sub    sp, sp, #0x1, lsl #12     ; =0x1000

    switch e {
    case .a00(let c): f(c); case .a01(let c): f(c); case .a02(let c): f(c); case .a03(let c): f(c); case .a04(let c): f(c)
    case .a05(let c): f(c); case .a06(let c): f(c); case .a07(let c): f(c); case .a08(let c): f(c); case .a09(let c): f(c)
    case .a10(let c): f(c); case .a11(let c): f(c); case .a12(let c): f(c); case .a13(let c): f(c); case .a14(let c): f(c)
    case .a15(let c): f(c); case .a16(let c): f(c); case .a17(let c): f(c); case .a18(let c): f(c); case .a19(let c): f(c)
    case .a20(let c): f(c); case .a21(let c): f(c); case .a22(let c): f(c); case .a23(let c): f(c); case .a24(let c): f(c)
    case .a25(let c): f(c); case .a26(let c): f(c); case .a27(let c): f(c); case .a28(let c): f(c); case .a29(let c): f(c)
    case .a30(let c): f(c); case .a31(let c): f(c); case .a32(let c): f(c); case .a33(let c): f(c); case .a34(let c): f(c)
    case .a35(let c): f(c); case .a36(let c): f(c); case .a37(let c): f(c); case .a38(let c): f(c); case .a39(let c): f(c)
    case .a40(let c): f(c); case .a41(let c): f(c); case .a42(let c): f(c); case .a43(let c): f(c); case .a44(let c): f(c)
    case .a45(let c): f(c); case .a46(let c): f(c); case .a47(let c): f(c); case .a48(let c): f(c); case .a49(let c): f(c)
    case .a50(let c): f(c); case .a51(let c): f(c); case .a52(let c): f(c); case .a53(let c): f(c); case .a54(let c): f(c)
    case .a55(let c): f(c); case .a56(let c): f(c); case .a57(let c): f(c); case .a58(let c): f(c); case .a59(let c): f(c)
    case .a60(let c): f(c); case .a61(let c): f(c); case .a62(let c): f(c); case .a63(let c): f(c); case .a64(let c): f(c)
    case .a65(let c): f(c); case .a66(let c): f(c); case .a67(let c): f(c); case .a68(let c): f(c); case .a69(let c): f(c)
    case .a70(let c): f(c); case .a71(let c): f(c); case .a72(let c): f(c); case .a73(let c): f(c); case .a74(let c): f(c)
    case .a75(let c): f(c); case .a76(let c): f(c); case .a77(let c): f(c); case .a78(let c): f(c); case .a79(let c): f(c)
    case .a80(let c): f(c); case .a81(let c): f(c); case .a82(let c): f(c); case .a83(let c): f(c); case .a84(let c): f(c)
    case .a85(let c): f(c); case .a86(let c): f(c); case .a87(let c): f(c); case .a88(let c): f(c); case .a89(let c): f(c)
    case .a90(let c): f(c); case .a91(let c): f(c); case .a92(let c): f(c); case .a93(let c): f(c); case .a94(let c): f(c)
    case .a95(let c): f(c); case .a96(let c): f(c); case .a97(let c): f(c); case .a98(let c): f(c); case .a99(let c): f(c)
    }
}

func test() {
    print(MemoryLayout<C>.size) // 8000
    let e = E.a00(C())

    let t = Thread {
        print("stack size: \(Thread.current.stackSize)") // stack size: 524288 (512K)
        foo(e)
        print("thread end")
    }
    t.start()
    sleep(10)
}

test()

a few notes:

  • The crash is typically within ___chkstk_darwin, so it does detect the overflow, at least in some cases.
  • Interestingly, indirect didn't help in this case.
  • _ = {...}() worked as a workaround.

Until we have a proper fix in the compiler (whatever it is), could the compiler generate a warning, something along the lines of:

:large_orange_diamond: Warning: generated code might overflow the stack, as the required stack size is too big (800K).

1 Like

I think LLVM does have an internal flag that can raise warnings when it lowers a function with too much stack allocated (but I don't recall what it is, sorry; I believe it's exposed through clang for C code though). It's difficult to pre-determine before LLVM unfortunately, since LLVM is ultimately the layer that decides what the stack looks like.

1 Like

Sorry for resurrecting an old thread. Do you know if this ever got fixed, or whether there is a bug ticket for it?

I think we're hitting this too, but I want to make sure it's still this bug and I'm not chasing ghosts.

1 Like

I'm wondering the same thing as well. We just hit this issue with a large switch statement over a large enum and had to work around it by wrapping each case block in closures.

1 Like