Testing fatalError and friends

AlexanderM · September 27, 2023, 12:15am

The problem persists: any owned resources still won't get cleaned up. E.g. if you take a lock, crash, then try to resume, that lock will remain locked. After that, any other code that needs that lock will deadlock.

Joe_Groff · September 27, 2023, 12:25am

That's the least of your issues as far as unreclaimed resources go, since as others noted nothing else being actively held by the crashed thread will get cleaned up either, and killing the thread might leave things like locks in an inconsistent state. If you catch a trap on a dedicated thread, you should still probably arrange to restart your process as soon as possible to fully recover.

dmt · September 27, 2023, 12:43am

"others"

That's obvious. I was talking about a situation when you can't restart. But anyway I got your point.

Well, I mean, it was me who wrote this . No doubt. And leaving a dependent thread deadlocked is better than attempting to restore things, IMO, again hello Java.

AlexanderM · September 27, 2023, 12:44am

I know this is speculation, but what are the odds that this would be “mostly fine” in a test context?

I’ve been using the CwlPreconditionTesting for a while, and haven’t seen any odd bugs from it (yet?)

Joe_Groff · September 27, 2023, 12:47am

This would be my perspective too. My concern with your suggestion of trying to reclaim a thread is that, with some OS lock implementations, doing that might release locks being held by the crashed thread, even though the locking operation running on the crashed thread is notionally still going. If another still-running thread can successfully take the lock and proceed, that could lead to it observing broken invariants that were left broken by the crashed thread.

AlexanderM · September 27, 2023, 1:02am

Reading the room, I’m getting the understanding that the proper solution here is something using “idea 2” (to spawn a process and then supervise its exit reason).

My issue with that is that it’s really complex to set up, so nobody does it. The standard library is the only project I’ve seen that has implemented this.

Perhaps the macro-based nature of swift-testing could implement this transparently for the dev, though I’m wary of dropping the 9001st feature request. They seem to have enough of those for now haha

dmt · September 27, 2023, 1:02am

Do you mean something like a pthread mutex with ROBUST? (I don't know any other examples.) I guess I implicitly assumed that our signal handler would be notified before any such things. But if that's not the case then it's a valid point.

I asked but they didn't reply yet.

benlings · September 27, 2023, 6:07am

The original concurrency manifesto included the concept of reliable actors that could provide fault isolation at a smaller level of granularity than at a whole process level. Is that style of fault isolation still a future possibility?

ktoso · September 27, 2023, 6:09am

That’s what distributed actors are, and yeah you could write a simple system that spawns off a process for just that purpose of fault isolation (we’ve done so in the past as a PoC). So the building blocks are there and waiting for assembly of the LEGO fortress

Joe_Groff · September 27, 2023, 2:39pm

It would also be interesting in the fullness of time to maybe have an in-process host for distributed actors, to let one run within an isolated heap in the same process and allow it to contain failures in environments where spawning processes isn't available.

tera · September 27, 2023, 5:18pm

Could deadlock be detected reliably? I guess we could fatalError as well when deadlock happens.

Note that the same issue could still happen with normal error handling:

// let's assume we are not using `withLock` for some reason
// and also forgot releasing resources in `defer`
let p = malloc()
let file = open(....)
lock.lock()
try something()
lock.unlock()
close(file)
free(p)
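For contrast, here is a minimal sketch of the same sequence with `defer`, so the thrown-error path releases everything. The `something()` body, the `SomethingFailed` type and the `guarded()` wrapper are hypothetical stand-ins, and the file handle is left out for brevity. Note that `defer` only helps with thrown errors: a trap such as `fatalError` does not unwind the stack, so these blocks would not run.

```swift
import Foundation

struct SomethingFailed: Error {}

// Hypothetical stand-in for the `something()` call above; it always throws.
func something() throws { throw SomethingFailed() }

let lock = NSLock()

func guarded() throws {
    let p = malloc(16)
    defer { free(p) }          // runs when the scope exits, including via throw
    lock.lock()
    defer { lock.unlock() }    // runs first (defers execute in reverse order)
    try something()
}

do {
    try guarded()
} catch {
    print("caught: \(error)")
}

// The lock was released on the error path, so it can be taken again.
let reacquired = lock.`try`()
if reacquired { lock.unlock() }
print("lock reacquired: \(reacquired)")
```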

Joe_Groff · September 27, 2023, 5:22pm

Sure. If you're using a safe wrapper around lock/unlock, which uses ownership to otherwise safely ensure you only have exclusive access to some resource while a lock is being held, then trying to continue execution after a crashed thread has potentially released locks it's still holding would sabotage the otherwise structural safety of that interface.

wadetregaskis · September 27, 2023, 7:55pm

I suspect deadlock detection in a language as flexible as Swift (unsafe bits, it can call arbitrary C/C++, etc) is probably intractable. An inverse halting problem.

Breaking locks is very context-specific, as to if it can be done as well as how. See the whole field of distributed locking for gory details. I'm pessimistic that a general-purpose mechanism (such as for recoverable fatal errors with setjmp hacks) is possible. Though, a good-enough solution might be possible, at least within a restricted domain…?

That's why I'm more optimistic about the 'unchecked exceptions' sort of approach, where you at least follow a familiar pattern and can reuse existing language functionality to make code [more] exception-safe (e.g. defer, RAII mechanisms broadly, etc). Just blindly jumping over stack frames, or parking whole threads permanently, is going to be more prone to compounding errors at runtime (although so far as potential errors go, merely locking up is closer to the benign end of the scale).

That said, for the unit testing case specifically, all those options remain on the table. You can [practically] always spawn more threads inside your test driver, cheaply and easily, and the whole process is very transient anyway.

tera · November 10, 2023, 12:59am

I noticed that the above "Fatal error: Index out of range" is printed to console before the signal handler execution. Is it possible to grab that reason phrase, so the error I am creating is not just "signal #x received" but more detailed?

dmt · November 10, 2023, 7:42pm

This will be tricky. I'm not sure if you could patch swift::swift_reportError, which would be handy in this case.
But you may get lucky with gCRAnnotations.message.
The problem here is that there could be more than one gCRAnnotations global variable, as it may appear in several dylibs in their respective __crash_info sections. So you probably want to collect messages from all of them. In order to do so, you need to iterate over all loaded binaries with _dyld_get_image_header or something, and then get the address of __crash_info with getsectiondata(header, "__DATA", "__crash_info", ...)

lorentey · November 10, 2023, 9:53pm

The lack of a good way to exercise traps has been a constant pain for me too, while maintaining libraries outside of the stdlib. (It just never reached a point where it turned into a must-fix obstacle.)

I believe idea 2 (run tests in a separate process, and detect crashes) is the right direction, unless we want to entirely overhaul our trap facility.

We know this is possible to do; StdlibUnittest has been doing it for years. Doing it well and exposing it as a public facility should be within reach!

To make this work, support for trap expectations needs to be integrated directly into XCTest and swift-testing.
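The core of "run in a separate process and detect crashes" can be sketched with Foundation's `Process`. This is only an illustration of the supervision step, not how StdlibUnittest does it: `ChildOutcome` and `runAndSupervise` are hypothetical names, and a shell that signals itself stands in for a test binary that traps.

```swift
import Foundation

/// Outcome of running a child process to completion (hypothetical type).
struct ChildOutcome {
    let signaled: Bool   // true if the child died from an uncaught signal
    let status: Int32    // raw termination status reported by Process
}

/// Launches the executable, waits for it, and reports how it ended.
func runAndSupervise(_ executable: String, _ arguments: [String]) throws -> ChildOutcome {
    let child = Process()
    child.executableURL = URL(fileURLWithPath: executable)
    child.arguments = arguments
    try child.run()
    child.waitUntilExit()
    return ChildOutcome(signaled: child.terminationReason == .uncaughtSignal,
                        status: child.terminationStatus)
}

// Stand-in for a crashing test: the shell sends SIGABRT to itself.
let crashed = try runAndSupervise("/bin/sh", ["-c", "kill -s ABRT $$"])
// Stand-in for a passing test: the shell exits normally.
let clean = try runAndSupervise("/bin/sh", ["-c", "exit 0"])

print("crashing child signaled: \(crashed.signaled)")
print("clean child signaled: \(clean.signaled), status: \(clean.status)")
```

A trap expectation would then assert that `signaled` is true for the child that ran the code under test. The hard parts (choosing which test the child runs, passing results back) are what would need framework support.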

tera · November 10, 2023, 10:54pm

Thank you, excellent idea!

Based on this hint I was able to cook a working solution:

// DO NOT USE THIS CODE
func crashReason() -> String? {
    var size: UInt = 0 // the C parameter is `unsigned long *`, i.e. UnsafeMutablePointer<UInt>
    // 🔶 Warning: 'getsectdatafromFramework' was deprecated in macOS 13.0: No longer supported
    guard let ptr = getsectdatafromFramework("libswiftCore.dylib", "__DATA", "__crash_info", &size) else {
        return nil
    }
    return ptr.withMemoryRebound(to: (Int64, UnsafePointer<UInt8>?).self, capacity: 1) { pointer in
        guard let message = pointer.pointee.1 else { return nil }
        return String(cString: message)
    }
}

Example output:

There's a deprecation warning though for getsectdatafromFramework.

Yes, this method also works:

for i in 0 ..< _dyld_image_count() {
    let header = _dyld_get_image_header(i)
    var size: UInt = 0 // getsectiondata takes `unsigned long *`
    if let data = getsectiondata(header, "__DATA", "__crash_info", &size) {
        // grab message from data
        // then break or continue
    }
}

Do I just pick the first non-nil message from that list?
Is it the case that depending upon a particular error that message will live in a different header?

dmt · November 10, 2023, 11:33pm

It depends. I might be wrong here, but there are __crash_info sections in libswiftCore, CoreFoundation, libc. So, depending where a signal was raised from it may be in any of them. Most common sources are: C function abort(), [NSException raise] and swift_reportError (and there's another one in WebKit, but this isn't that common for those who don't work with it).
But basically one shouldn't assume that if there's a non-empty message in one of the __crash_info sections, there are none in the others. A function may save a message in one __crash_info, then pass control to another function, which will save another message in another __crash_info section.
Example:

__crash_info of libswiftCore.dylib:
Fatal error: Attempted to read an unowned reference but object 0x281559500 was already deallocated
__crash_info of libsystem_c.dylib:
abort() called
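To get output shaped like the example above, each message can be paired with the name of the image it came from. The following is a rough sketch, not a supported API: `crashMessagesByImage` is a hypothetical name, the code assumes a 64-bit Darwin process, and it assumes the message pointer sits 8 bytes into the `__crash_info` data (after a 64-bit version field). Other platforms simply get an empty list.

```swift
#if canImport(Darwin)
import Darwin
import MachO
#endif

/// Collects (image name, crash message) pairs from every loaded image
/// that has a non-empty `__crash_info` message. Hypothetical helper.
func crashMessagesByImage() -> [(image: String, message: String)] {
    var results: [(image: String, message: String)] = []
    #if canImport(Darwin)
    for i in 0 ..< _dyld_image_count() {
        guard let header = _dyld_get_image_header(i),
              let name = _dyld_get_image_name(i) else { continue }
        // Assumption: 64-bit process, so the header is really a mach_header_64.
        let header64 = UnsafeRawPointer(header).assumingMemoryBound(to: mach_header_64.self)
        var size: UInt = 0
        guard let data = getsectiondata(header64, "__DATA", "__crash_info", &size),
              size >= 16 else { continue }
        // Assumption: layout is { UInt64 version; UInt64 message; ... }.
        let address = UnsafeRawPointer(data).load(fromByteOffset: 8, as: UInt64.self)
        guard let message = UnsafePointer<CChar>(bitPattern: UInt(address)) else { continue }
        results.append((image: String(cString: name), message: String(cString: message)))
    }
    #endif
    return results
}

for entry in crashMessagesByImage() {
    print("__crash_info of \(entry.image):")
    print(entry.message)
}
```

Called from a signal handler this would not be async-signal-safe (it allocates), so it is better suited to experiments than to production crash reporting.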

tera · November 11, 2023, 12:25am

Good to know.

Better version that captures all crash reasons:

// MARK: DO NOT USE THIS CODE
func crashReasons() -> [String] {
    (0 ..< _dyld_image_count()).compactMap { i -> String? in
        return _dyld_get_image_header(i).withMemoryRebound(to: mach_header_64.self, capacity: 1) { header in
            var size: UInt = 0 // getsectiondata takes `unsigned long *`
            guard let ptr = getsectiondata(header, "__DATA", "__crash_info", &size) else {
                return nil
            }
            return (ptr + 8).withMemoryRebound(to: UnsafePointer<UInt8>?.self, capacity: 1) { pointer in
                guard let message = pointer.pointee else { return nil }
                return String(cString: message)
            }
        }
    }
}

I noticed that these guys do not write into any of the enumerated "__crash_info's":

let a = 1 as! String // Swift runtime failure: failed cast

var x: UInt8 = 0
x - 1 // Swift runtime failure: arithmetic overflow

unsafeBitCast(0, to: UnsafeMutablePointer<UInt8>.self).pointee = 42
// in xcode debugger: EXC_BAD_ACCESS
// in console: zsh: segmentation fault

Do you know why that could be?

A side gotcha with `mach_header` vs `mach_header_64`

_dyld_get_image_header(i) returns mach_header
but getsectiondata wants mach_header_64
hence the reinterpret cast.

These are the cases I tested so far:

fatalError("Hello")             // ✅ "App/main.swift:6: Fatal error: Hello\n"
abort()                         // ✅ "abort() called"
precondition(false, "hello")    // ✅ "App/main.swift:8: Precondition failed: hello\n"
assert(false, "hello")          // ✅ "App/main.swift:9: Assertion failed: hello\n" (debug)
[][0]                           // ✅ "Swift/ContiguousArrayBuffer.swift:600: Fatal error: Index out of range\n"
var d = [0:0, 0:0]              // ✅ "Swift/Dictionary.swift:830: Fatal error: Dictionary literal contains duplicate keys\n"
var x = 0
0 / x                           // ✅ "Swift/arm64e-apple-macos.swiftinterface:34494: Fatal error: Division by zero\n"
let e = NSException(name: NSExceptionName("42"), reason: "hello")
e.raise()                       // ✅ "*** Terminating app due to uncaught exception \'42\', reason: \'hello\'"
                                //    "terminating due to uncaught exception of type NSException"
                                //    "abort() called"
swift_reportError(42, "hello")  // ✅ "hello"

var ptr: Int?
print(ptr!)                     // ✅ "App/main.swift:20: Fatal error: Unexpectedly found nil while unwrapping an Optional value\n"

func failing() throws {
    throw NSError(domain: "domain", code: 42, userInfo: [NSLocalizedDescriptionKey: "hello"])
}
try! failing()                  // ✅ "App/main.swift:24: Fatal error: \'try!\' expression unexpectedly raised an error: Error Domain=domain Code=42 \"hello\" UserInfo={NSLocalizedDescription=hello}\n"

unsafeBitCast("1", to: Int.self)// ✅ "Swift/arm64e-apple-macos.swiftinterface:3119: Fatal error: Can\'t unsafeBitCast between types of different sizes\n"

let a = 1 as! String            // 🛑 (no __crash_info for some reason)

var y: UInt8 = 0
y - 1                           // 🛑 (no __crash_info for some reason)

let p = unsafeBitCast(0, to: UnsafeMutablePointer<UInt8>.self)
p.pointee = 42                  // 🛑 (no __crash_info for some reason)

enable_fp_exceptions()
print(sqrt(-1.0))               // 🛑 (no __crash_info for some reason)

Please shout if you want to test something else.

dmt · November 11, 2023, 5:27am

This is a bare attempt to write to an invalid memory address; there's just no code that could do something to save a message.

These two are transformed into the cond_fail SIL instruction, which is then lowered to assembly along the lines of this pseudocode:

if (!condition) {
 trap // "brk 1"/"ud2"
}

So this instruction doesn't save the message to __crash_info. And it seems like the message isn't preserved in the binary at all.
I don't know why it was implemented this way.
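If a message is needed for the overflow case, one possible workaround at call sites you control is to detect the overflow explicitly and fail through `precondition`, which was among the cases that did record a message in the tests above. A minimal sketch, where `checkedSubtract` is a hypothetical helper:

```swift
/// Subtracts two values, trapping with a descriptive message on overflow
/// instead of relying on the message-less overflow trap of `-`.
func checkedSubtract(_ lhs: UInt8, _ rhs: UInt8,
                     file: StaticString = #fileID, line: UInt = #line) -> UInt8 {
    // subtractingReportingOverflow never traps; it reports overflow as a flag.
    let (result, overflow) = lhs.subtractingReportingOverflow(rhs)
    precondition(!overflow, "arithmetic overflow in \(lhs) - \(rhs)", file: file, line: line)
    return result
}

let value = checkedSubtract(5, 1)
print(value)
// checkedSubtract(0, 1) would trap with the message above.
```

This only helps where the arithmetic is wrapped by hand; plain `y - 1` still compiles down to the bare trap.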