EXC_BAD_ACCESS using withTaskGroup depending on size of data

I am getting an EXC_BAD_ACCESS when using withTaskGroup in the following way, depending on how big the structure being built is. I would be thankful for any idea on how to solve this.

(The demo project is on GitHub in case anyone would like to try it; the path packagePath in the Swift file has to be adjusted. When started in debug mode from within Xcode, it shows the error; a release build run in the terminal gives zsh: bus error ./BadAccessDemo2.)

Thanks.

import Foundation
import SwiftXMLC

@main
struct Test {
    
    static func main() async throws {
        
        // !!! adjust path before running: !!!
        let packagePath = "/Users/stefan/Projekte/BadAccessDemo2"
        
        let paths = [
            // small example:
            "\(packagePath)/test1.xml",
        
            // same structure, but a little bigger:
            "\(packagePath)/test2.xml",
        ]
        
        await paths.forEachAsyncThrowing { path in // (forEachAsyncThrowing defined below; same result without it)
        
            // OK in both cases:
            await inner(path: path, i: 1)
            
            if #available(macOS 10.15, *) {
                await withTaskGroup(of: Void.self) { group in
                    
                    func outer() async {
                        
                        group.addTask {
                            // OK for smaller example, EXC_BAD_ACCESS for the larger example:
                            await inner(path: path, i: 2)
                        }
                    }
                    
                    await outer()
                    
                    for await _ in group {}
                }
            } else {
                print("wrong OS version")
            }
        }
    }
}

func inner(path: String, i: Int) async {
    let document = XDocument()

    do {
        let data = try Data(contentsOf: URL(fileURLWithPath: path))
        do {
            // building a structure:
            try XParser().parse(fromData: data, eventHandlers: [XParseBuilder(document: document)])
            
            // writing it back to another file as a test:
            let copyPath = "\(path).copy\(i).xml"
            document.write(toFile: copyPath)
            
            print("\(copyPath) written")
            print("press RETURN to continue..."); _ = readLine()
        }
        catch {
            print(error.localizedDescription)
        }
    }
    catch {
        print(error.localizedDescription)
    }
}

extension Sequence {
    func forEachAsyncThrowing (
        _ operation: (Element) async throws -> Void
    ) async rethrows {
        for element in self {
            try await operation(element)
        }
    }
}

Can you post the actual crash log, or at least the crashing backtrace?

I hope this is what you asked for:

-------------------------------------
Translated Report (Full Report Below)
-------------------------------------

Process:               BadAccessDemo2 [51166]
Path:                  /Users/USER/*/BadAccessDemo2
Identifier:            BadAccessDemo2
Version:               ???
Code Type:             ARM-64 (Native)
Parent Process:        zsh [47252]
Responsible:           Terminal [47250]
User ID:               501

Date/Time:             2022-01-07 21:31:04.0692 +0100
OS Version:            macOS 12.0.1 (21A559)
Report Version:        12
Anonymous UUID:        AC3DEC9D-06B9-3E64-2D4B-44D808CBB77B

Sleep/Wake UUID:       6E6113C5-9465-476E-A068-3B373C450C7B

Time Awake Since Boot: 470000 seconds
Time Since Wake:       19757 seconds

System Integrity Protection: enabled

Crashed Thread:        1  Dispatch queue: com.apple.root.user-initiated-qos.cooperative

Exception Type:        EXC_BAD_ACCESS (SIGBUS)
Exception Codes:       KERN_PROTECTION_FAILURE at 0x000000016fcdfff0
Exception Codes:       0x0000000000000002, 0x000000016fcdfff0
Exception Note:        EXC_CORPSE_NOTIFY

Termination Reason:    Namespace SIGNAL, Code 10 Bus error: 10
Terminating Process:   exc handler [51166]

VM Region Info: 0x16fcdfff0 is in 0x16fcdc000-0x16fce0000;  bytes after start: 16368  bytes before end: 15
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      Stack                       16f4e0000-16fcdc000    [ 8176K] rw-/rwx SM=PRV  thread 0
--->  STACK GUARD                 16fcdc000-16fce0000    [   16K] ---/rwx SM=NUL  ... for thread 1
      Stack                       16fce0000-16fd68000    [  544K] rw-/rwx SM=PRV  thread 1

Thread 0::  Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib        	       0x1bdfbd954 mach_msg_trap + 8
1   libsystem_kernel.dylib        	       0x1bdfbdd00 mach_msg + 76
2   CoreFoundation                	       0x1be0c4e38 __CFRunLoopServiceMachPort + 372
3   CoreFoundation                	       0x1be0c32f0 __CFRunLoopRun + 1212
4   CoreFoundation                	       0x1be0c2694 CFRunLoopRunSpecific + 600
5   CoreFoundation                	       0x1be14ec28 CFRunLoopRun + 64
6   libswift_Concurrency.dylib    	       0x23b5e8ee8 swift_task_asyncMainDrainQueueImpl() + 40
7   libswift_Concurrency.dylib    	       0x23b5e8ec0 swift_task_asyncMainDrainQueue + 100
8   BadAccessDemo2                	       0x100127d04 BadAccessDemo2_main + 84
9   dyld                          	       0x10059d0f4 start + 520

Thread 1 Crashed::  Dispatch queue: com.apple.root.user-initiated-qos.cooperative
0   libswiftCore.dylib            	       0x1cb3da1b4 swift_arrayDestroy + 4
1   libswiftCore.dylib            	       0x1cb125794 _DictionaryStorage.deinit + 268
2   libswiftCore.dylib            	       0x1cb1258c8 _DictionaryStorage.__deallocating_deinit + 16
3   libswiftCore.dylib            	       0x1cb3e7774 _swift_release_dealloc + 56
4   BadAccessDemo2                	       0x10015de14 XElement.deinit + 60
5   BadAccessDemo2                	       0x10015de48 XElement.__deallocating_deinit + 12
6   libswiftCore.dylib            	       0x1cb3e7774 _swift_release_dealloc + 56
7   libswiftCore.dylib            	       0x1cb3e8724 bool swift::HeapObjectSideTableEntry::decrementStrong<(swift::PerformDeinit)1>(unsigned int) + 292
8   BadAccessDemo2                	       0x10015de2c XElement.deinit + 84
9   BadAccessDemo2                	       0x10015de48 XElement.__deallocating_deinit + 12
10  libswiftCore.dylib            	       0x1cb3e7774 _swift_release_dealloc + 56
11  libswiftCore.dylib            	       0x1cb3e8724 bool swift::HeapObjectSideTableEntry::decrementStrong<(swift::PerformDeinit)1>(unsigned int) + 292
...

Yes, thanks. That should allow at least some investigation without necessarily having to run the example app. What version of Xcode are you running? (I'd also suggest updating your OS.)

Xcode 13.2.1, Swift 5.5.2 (swiftlang-1300.0.47.5 clang-1300.0.29.30), macOS 12.0.1 (21A559).

Thanks!

I updated to macOS 12.1, same problem.

Hey, I think I know what the problem is on this one:

extension Sequence {
    func forEachAsyncThrowing (
        _ operation: (Element) async throws -> Void
    ) async rethrows {
        for element in self {
            try await operation(element)
        }
    }
}

Since self is referenced inside the sequence in for element in self, and you call await withTaskGroup(of: Void.self) { group in on every iteration, my understanding is that each time a task is created nested inside another task, which is itself inside another task, it doesn't know which thread to go back to. I would try using unowned or weak self.

No, this is just for convenience here, the problem persists if you remove the usage of forEachAsyncThrowing. But thanks for the suggestion.

…And it is running on an M1 machine (sometimes the architecture makes a difference).

UPDATE: I posted the Swift bug SR-15680.

I wonder if it is valid to call group.addTask asynchronously, since the doc says:

Don’t use a task group from outside the task where you created it. In most cases, the Swift type system prevents a task group from escaping like that because adding a child task to a task group is a mutating operation, and mutation operations can’t be performed from a concurrent execution context like a child task.

OK, then I do not know how to control the parallelism (how many "work items" are created at once). I need to do this because each work item can be large.
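
One pattern that is sometimes used to cap the number of work items in flight, while keeping every addTask call directly inside the group's own body, is sketched below. processLimited and maxConcurrent are hypothetical names that are not part of the demo project, and this is only a sketch under the assumption that each work item is independent of the others:

```swift
// Hypothetical helper (not from the demo project): runs `work` for every item,
// keeping at most `maxConcurrent` child tasks in flight at any time.
func processLimited<Item: Sendable>(
    _ items: [Item],
    maxConcurrent: Int,
    work: @escaping @Sendable (Item) async -> Void
) async {
    await withTaskGroup(of: Void.self) { group in
        var remaining = items.makeIterator()

        // Start the first batch. addTask is called directly in the group's
        // body, never from a nested async function or from a child task.
        for _ in 0..<max(1, maxConcurrent) {
            guard let item = remaining.next() else { break }
            group.addTask { await work(item) }
        }

        // Each time one child task finishes, start the next pending item.
        while await group.next() != nil {
            if let item = remaining.next() {
                group.addTask { await work(item) }
            }
        }
    }
}
```

In the demo this might be called as await processLimited(paths, maxConcurrent: 2) { path in await inner(path: path, i: 2) }. It only limits parallelism; it does not by itself explain or fix the crash.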

But I have a much worse problem: after changing some code in my real application (not the demo app), I also get the EXC_BAD_ACCESS without using withTaskGroup, just by doing something that corresponds to the await inner(path: path, i: 1) call outside withTaskGroup. I could not reproduce this with my demo project. It seems to have something to do with async/await combined with complicated call chains and big data structures. It is really unsatisfactory, because everything works fine with smaller data structures and/or less complicated call chains, and the data structures and call chains are of the same kind in both cases. So I have some nested async/await calls inside my main() async throws, and every time I change some subtle little thing I may or may not get this crash as soon as the data gets too large.

I really do not see anything I am doing that might not be OK (besides maybe what you said, but as I said, I now get the error even without withTaskGroup). Either async/await is a very difficult thing that you really, really need to use "just the right way", or this feature and/or Swift 5.5 is "unstable" or behaves incorrectly. I am thinking about removing async/await completely from my application again and waiting until it gets more stable (if it is indeed "unstable"). This would be a pity, because it seemed to be an easy-to-use mechanism and I really would like to be able to process my work items asynchronously.

Without having actually run the sample (I'm not with my macOS 12 machine atm), the crash log makes me somewhat suspicious of a stack overflow, and looking at the sample data and SwiftXMLC source, these two stored variables in particular look like potential suspects.

Have you run your sample in a debugger and examined the full stack trace after the EXC_BAD_ACCESS?

Well, the structure gets big, but not too big when not using async/await: even with much bigger files I never get any error. So I know the data structure gets big; the point is that one gets the error only when adding async/await. How can this be? Or should one not use async/await with such large data structures? And the data structure is OK even in the async/await case where the error occurs: it is written to a file again, and the error occurs just when returning from that function.

The error says (see above):

VM Region Info: 0x16fcdfff0 is in 0x16fcdc000-0x16fce0000;  bytes after start: 16368  bytes before end: 15
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      Stack                       16f4e0000-16fcdc000    [ 8176K] rw-/rwx SM=PRV  thread 0
--->  STACK GUARD                 16fcdc000-16fce0000    [   16K] ---/rwx SM=NUL  ... for thread 1
      Stack                       16fce0000-16fd68000    [  544K] rw-/rwx SM=PRV  thread 1

I am currently replacing async/await with code using semaphores.

Yes. As you can see from even the abbreviated stack trace, the error occurs while XElements are being deallocated. To elaborate a bit on what I wrote above, I suspect that the variables I linked lead to the synthesized deinits calling each other recursively, which means that the call depth would be dictated directly by the length of the chain of elements with the same name.

I'd hazard a guess that the threads the default executor uses might have different stack size configurations, or maybe having a few more calls above your code is enough to make a difference.
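
One way to test this guess, as a rough sketch: Foundation's Thread has a stackSize property, so the same work can be run on a thread whose stack size is chosen explicitly. runOnThread is a hypothetical helper name, not something from the demo project. The crash report above lists 8176K for the thread 0 stack and 544K for the crashed thread 1, so sizes near those two values would be the interesting ones to compare.

```swift
import Foundation

// Hypothetical diagnostic helper: runs `body` on a new thread with an
// explicitly chosen stack size and blocks the caller until it has finished.
// Because it blocks, it is meant for synchronous test code, not for use
// inside an async function.
func runOnThread(stackSize: Int, _ body: @escaping @Sendable () -> Void) {
    let finished = DispatchSemaphore(value: 0)
    let thread = Thread {
        body()
        finished.signal()
    }
    thread.stackSize = stackSize   // in bytes; documented as a multiple of 4 KB
    thread.start()
    finished.wait()
}
```

For example, runOnThread(stackSize: 8 << 20) { ... } with the parse-and-release code in the closure, then again with 512 << 10. If the crash only shows up with the smaller size, stack depth is very likely the cause.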

Either way, I really recommend that you look at the full stack trace, as then you will know for certain whether recursion/stack depth is actually the problem.

If they are, async/await merely shone a light on an architecture problem that might have surfaced in a number of other ways (different stack size configurations depending on OS/device/...), and would also have a decent impact on performance of deallocating your models, even if it doesn't crash.

OK; thanks.

...Just one more thought: I am not implementing any deinits, so why would there be functions calling each other if it is just a matter of deallocating objects? Is it always the case that a chain of weakly referenced objects can give you a stack overflow like that?

You are not implementing a deinit, but the compiler synthesizes one that does the necessary ARC housekeeping. It releases the next node, which in turn calls that node's deinit, which releases the one after that, and so on. As far as I understand it, any linked list, weak or not, would run into this given sufficient length.

For reference, here you can see a deinit of a linked list node, written to avoid the same problem I suspect is at fault in your example. I wouldn't put too much trust in the linked list implementation itself, but the strategy used by the deinit should apply.
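
The strategy can be sketched roughly like this. Node is a hypothetical singly linked list node used only for illustration, not a type from SwiftXMLC, and isKnownUniquelyReferenced is, as far as I know, the current standard library spelling of what older code calls isUniquelyReferencedNonObjC. The deinit detaches the tail and releases it node by node in a loop, so the call depth no longer grows with the length of the list:

```swift
// Hypothetical node type for illustration; not taken from SwiftXMLC.
final class Node {
    let value: Int
    var next: Node?

    init(value: Int, next: Node? = nil) {
        self.value = value
        self.next = next
    }

    deinit {
        // Detach the tail so the synthesized release of `next` has nothing
        // left to recurse into.
        var cursor = next
        next = nil

        // Walk the tail for as long as `cursor` holds the only strong
        // reference to its node. The check returns false for nil and for
        // nodes that are still referenced elsewhere; those are left alone.
        while isKnownUniquelyReferenced(&cursor) {
            let following = cursor!.next
            cursor!.next = nil   // this node no longer owns a tail
            cursor = following   // releases the previous node; its own deinit finds no tail
        }
    }
}
```

An XElement-like class with more than one owning reference per node would need the same idea applied to each of them, which this sketch does not attempt.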

(Edit: I'll have a look at your sample when I'm back with my macOS 12 machine as well.)

Thanks again!

But what is that isUniquelyReferencedNonObjC in the code? It is not known.

Search is your friend: Remove isUniquelyReferenced or isUniquelyReferencedNonObjC?
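
In short, and as far as I know: isUniquelyReferencedNonObjC is the pre-Swift 3 name, and current toolchains spell it isKnownUniquelyReferenced. It takes a class reference as an inout argument and reports whether that is the only strong reference to the object. A small sketch (Payload and uniquenessDemo are made-up names):

```swift
// Made-up type and function names, purely for illustration.
final class Payload {}

func uniquenessDemo() -> (whenUnique: Bool, whenShared: Bool) {
    var first = Payload()
    // Only `first` refers to the object at this point.
    let whenUnique = isKnownUniquelyReferenced(&first)

    // A second strong reference to the same object.
    let second = first
    let whenShared = isKnownUniquelyReferenced(&first)

    // Keep `second` alive until after the check above.
    withExtendedLifetime(second) {}
    return (whenUnique, whenShared)
}
```

Calling uniquenessDemo() should give true for whenUnique and false for whenShared.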