`swift-runtime: unable to suspend thread` when compiling in QEMU

Previous possibly related thread: Build crash when building in QEMU using new Swift 5.6 arm64 image

I am trying to build the following Dockerfile on a Linux x86_64 host:

FROM swift:5.9
RUN apt-get update
RUN apt-get install -y \
        build-essential \
        curl \
        libnghttp2-dev \
        libssl-dev \
        patchelf \
        unzip \
        zlib1g-dev
RUN mkdir -p /protoc-gen-swift && \
    mkdir -p /grpc-swift
RUN curl -sSL https://api.github.com/repos/grpc/grpc-swift/tarball/1.19.1 | tar xz --strip 1 -C /grpc-swift
WORKDIR /grpc-swift
RUN make
RUN make plugins

Running docker buildx build --platform linux/arm64 . results in the following error:

119.4 Creating working copy for https://github.com/apple/swift-nio.git
120.6 Working copy of https://github.com/apple/swift-nio.git resolved at 2.59.0
125.6 error: 'swift-nio-extras': Invalid manifest (compiled with: ["/usr/bin/swiftc", "-vfsoverlay", "/tmp/TemporaryDirectory.xvbRB2/vfs.yaml", "-L", "/usr/lib/swift/pm/ManifestAPI", "-lPackageDescription", "-Xlinker", "-rpath", "-Xlinker", "/usr/lib/swift/pm/ManifestAPI", "-swift-version", "5", "-I", "/usr/lib/swift/pm/ManifestAPI", "-package-description-version", "5.6.0", "/grpc-swift/.build/checkouts/swift-nio-extras/Package.swift", "-Xfrontend", "-disable-implicit-concurrency-module-import", "-Xfrontend", "-disable-implicit-string-processing-module-import", "-o", "/tmp/TemporaryDirectory.7Tvpa4/swift-nio-extras-manifest"])
125.6 swift-runtime: unable to suspend thread 2739
125.6 swift-runtime: unable to suspend thread 2739
140.0 make: *** [Makefile:25: all] Error 1

I'm unable to find any previous reference to this error on the internet. It is emitted from here: https://github.com/apple/swift/blob/main/stdlib/public/runtime/CrashHandlerLinux.cpp#L506

Attempting to run the build multiple times shows that the failure occurs at a range of different points during the compile process, but it always occurs at some point before completion. Watching my CPU load, I can see that it usually occurs soon after a large number of threads are started, putting load on multiple CPU cores rather than just one. Can anyone reproduce this or help me get it working? Would a cross compiler be an alternative option to running the compile in the QEMU emulator?

Edit: ping @al45tair I think you might know more about thread handling under Linux?

A couple more examples of this occurring at different points during compilation. Another one from my local dev machine running Fedora 39:

249.8 [20/503] Compiling bio.c
250.0 [21/503] Compiling CNIOWindows shim.c
250.1 [22/503] Compiling CNIOWindows WSAStartup.c
250.3 [23/503] Compiling CNIOLinux liburing_shims.c
250.5 [24/506] Compiling CNIODarwin shim.c
250.6 swift-runtime: unable to suspend thread 3768
250.6 swift-runtime: unable to suspend thread 3768
250.9 swift-runtime: unable to suspend thread 3775
250.9 swift-runtime: unable to suspend thread 3775
251.3 swift-runtime: unable to suspend thread 3766
251.3 swift-runtime: unable to suspend thread 3876
251.3 swift-runtime: unable to suspend thread 3878
251.3 swift-runtime: unable to suspend thread 3766
251.3 [25/513] Compiling CNIOLinux shim.c
251.4 swift-runtime: unable to suspend thread 3766
251.4 [26/513] Compiling CNIOLLHTTP c_nio_http.c
253.2 [27/533] Compiling tls_method.cc
253.2 [27/533] Compiling CNIOLLHTTP c_nio_llhttp.c
253.2 [27/533] Compiling CNIOLLHTTP c_nio_api.c
253.2 [27/533] Compiling tls13_enc.cc
253.2 [27/533] Compiling tls13_server.cc
253.2 [27/533] Compiling tls_record.cc
253.2 [27/533] Compiling CNIOBoringSSLShims shims.c
253.2 [27/533] Compiling tls13_both.cc
253.2 [27/533] Compiling tls13_client.cc
256.2 make: *** [Makefile:25: all] Error 1

And a couple from running in GitHub Actions CI:

 > [linux/arm64 grpc_swift  7/11] RUN make:
260.5 Working copy of https://github.com/apple/swift-atomics.git resolved at 1.2.0
261.2 Creating working copy for https://github.com/apple/swift-protobuf.git
264.2 Working copy of https://github.com/apple/swift-protobuf.git resolved at 1.24.0
311.6 error: 'swift-nio-extras': Invalid manifest (compiled with: ["/usr/bin/swiftc", "-vfsoverlay", "/tmp/TemporaryDirectory.0USNp1/vfs.yaml", "-L", "/usr/lib/swift/pm/ManifestAPI", "-lPackageDescription", "-Xlinker", "-rpath", "-Xlinker", "/usr/lib/swift/pm/ManifestAPI", "-swift-version", "5", "-I", "/usr/lib/swift/pm/ManifestAPI", "-package-description-version", "5.6.0", "/grpc-swift/.build/checkouts/swift-nio-extras/Package.swift", "-Xfrontend", "-disable-implicit-concurrency-module-import", "-Xfrontend", "-disable-implicit-string-processing-module-import", "-o", "/tmp/TemporaryDirectory.NdVGP2/swift-nio-extras-manifest"])
311.6 swift-runtime: unable to suspend thread 2939
311.6 swift-runtime: unable to suspend thread 2939
315.6 error: 'swift-docc-plugin': Invalid manifest (compiled with: ["/usr/bin/swiftc", "-vfsoverlay", "/tmp/TemporaryDirectory.A5Cqh0/vfs.yaml", "-L", "/usr/lib/swift/pm/ManifestAPI", "-lPackageDescription", "-Xlinker", "-rpath", "-Xlinker", "/usr/lib/swift/pm/ManifestAPI", "-swift-version", "5", "-I", "/usr/lib/swift/pm/ManifestAPI", "-package-description-version", "5.6.0", "/grpc-swift/.build/checkouts/swift-docc-plugin/Package.swift", "-Xfrontend", "-disable-implicit-concurrency-module-import", "-Xfrontend", "-disable-implicit-string-processing-module-import", "-o", "/tmp/TemporaryDirectory.yHKST1/swift-docc-plugin-manifest"])
315.6 swift-runtime: unable to suspend thread 2956
315.6 swift-runtime: unable to suspend thread 2956
354.4 make: *** [Makefile:25: all] Error 1
 > [linux/arm64 grpc_swift  7/11] RUN make:
610.9 [9/507] Emitting module _NIOBase64
615.6 [10/507] Emitting module _NIODataStructures
638.6 [11/508] Compiling _NIODataStructures PriorityQueue.swift
638.8 [12/508] Compiling _NIOBase64 Base64.swift
646.6 [15/532] Compiling _NIODataStructures Heap.swift
651.0 swift-runtime: unable to suspend thread 3805
651.0 swift-runtime: unable to suspend thread 3808
651.0 swift-runtime: unable to suspend thread 3805
651.0 swift-runtime: unable to suspend thread 3808
652.1 make: *** [Makefile:25: all] Error 1

I have no idea what QEMU issues cause those errors, but I can attest to cross-compilation working well. It would be pretty easy to cross-compile Swift packages to Linux AArch64 from one of the other supported platforms, as you only need four things:

  1. One of the prebuilt toolchains from swift.org, which you presumably already have
  2. A glibc sysroot for Linux AArch64, which can be downloaded from one of the distros
  3. A Swift resource directory for Linux AArch64, which you can get by downloading the closest AArch64 toolchain to your target distro from the swift.org download link in 1. and extracting the usr/lib/swift/ directory from it
  4. A cross-compilation destination file, which stitches together these three requirements into a cross-compilation toolchain: take a look at the one I use for Android AArch64 as an example (the sdk is the path to the glibc sysroot from 2. and the -resource-dir is the Swift resource directory from 3., no need for the -L link line I added in your case)

You will also need to modify the symbolic link usr/lib/swift/clang in the Swift resource directory to point at the clang resource directory in the Swift toolchain you are using, as shown in my Android README doc. Use something like the SwiftPM command shown in that doc to cross-compile Swift packages for Linux AArch64.
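A rough sketch of the last two steps, under stated assumptions: the helper names relink_clang and write_destination are made up for illustration, every path is a placeholder for your own layout, and the JSON keys follow the version 1 destination file format as I understand it, so compare against the Android example before relying on it.

```shell
#!/bin/sh
# Sketch only: all paths passed to these helpers are placeholders.

# Point <resource-dir>/clang at the host toolchain's clang resource directory.
# Assumes <resource-dir>/clang is a symlink (or absent), not a real directory.
relink_clang() {
    resource_dir="$1"    # extracted AArch64 usr/lib/swift directory
    host_clang_dir="$2"  # clang resource directory of the toolchain you run
    ln -sfn "$host_clang_dir" "$resource_dir/clang"
}

# Write a destination file that ties together the toolchain, the glibc
# sysroot and the AArch64 Swift resource directory.
write_destination() {
    out="$1"
    toolchain_bin="$2"
    sysroot="$3"
    resource_dir="$4"
    cat > "$out" <<EOF
{
  "version": 1,
  "target": "aarch64-unknown-linux-gnu",
  "toolchain-bin-dir": "$toolchain_bin",
  "sdk": "$sysroot",
  "extra-cc-flags": [],
  "extra-cpp-flags": [],
  "extra-swiftc-flags": ["-resource-dir", "$resource_dir"]
}
EOF
}
```

Call relink_clang once after extracting the resource directory, then write_destination with your four paths, and pass the resulting file to SwiftPM the way the linked doc does.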

When a program crashes, the crash handler (a signal handler) runs. The first thing it tries to do is stop any other threads it finds in the process: it installs a signal handler that gathers the thread's register state and then waits, and then it sends the corresponding signal to each thread it finds by reading from /proc. If that signal's delivery is somehow blocked, or the thread terminates between the crash handler finding it and sending the signal, the handler can't capture the register state. The thread then won't appear in any backtraces that might be generated, and that's when we generate the warning (to tell you that there is missing information).

TL;DR: The program crashed, and whatever thread is being complained about here was either in a state that didn't allow delivery of the signal we use to pause threads, or had terminated already by the time we tried.
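As a rough illustration of the thread-discovery step only (this is not the runtime's actual code, which also has to install the handler and signal each thread), Linux exposes a process's threads as directories under /proc/<pid>/task. The helper name list_threads is made up for this sketch.

```shell
#!/bin/sh
# Print the thread IDs the kernel currently lists for a process.
# A thread can exit between being listed here and being signalled later,
# which is one of the two cases that lead to the warning.
list_threads() {
    pid="$1"
    for task in /proc/"$pid"/task/*; do
        [ -d "$task" ] || continue   # skip entries that vanished or never matched
        basename "$task"
    done
}

# Example: list the threads of the current shell (normally just one).
list_threads "$$"
```

The thread numbers in the "unable to suspend thread" messages are IDs of this kind.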

Thanks both for your prompt replies! Both were unfortunately pretty far over my head, as I am more of a devops specialist working to maintain the CI and Dockerfile for the multi-arch protoc compiler plugins here and have never worked with Swift before. @Finagolfin I could give your approach a try, but it looks tricky and will likely push us beyond the disk space budget we have for builds in GitHub Actions CI. The swift:5.9-jammy base image is already the largest image we depend on, and downloading more distros to extract a sysroot and toolchain probably won't work in CI, since we simultaneously build these protobuf plugins for many other languages on both ARM64 and AMD64.

The last time I tried to enable grpc_swift on ARM64 with Swift 5.6, I also had problems with crashes at random points during the build. Are these issues related? Is it something that should be fixed in the Swift compiler, in QEMU, or in the grpc_swift source I am trying to build? Can the crash be avoided in the first place by modifying the source? Or is it something that can't be fixed for some Swift architectural or design reason?

For cases like the above, I wonder if the issue is that something is using fatalError() to report expected error messages to the user. The trouble with doing that is that fatalError() actually crashes the program. That would explain those ones. Quite why you're seeing crashes in other places I'm not so sure.

Just to clarify @al45tair, the “something” in question is not the package being compiled, right? Invalid manifest is a SwiftPM issue, and our package manifests should not be crashing at runtime to report errors.

I'd want one of the SwiftPM folks to weigh in on that I think. @Max_Desiatov @NeoNacho what does Invalid manifest mean here? We aren't using fatalError() to generate that message, are we? If so, is it something we might reasonably expect to happen, as opposed to an "assertion failed, abort the process" situation? If it really is some kind of abort situation, it probably needs looking into — otherwise we probably shouldn't use fatalError() for that purpose, because right now it triggers an abort, which will be caught by the crash handler and produce backtraces.

My theory from the information provided would be the opposite: the compiled manifest (which is a separate process) is crashing at runtime when we're trying to evaluate it (for unknown reasons), and that leads us to emit an invalidManifestFormat on the SwiftPM client side.
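When running a compiled manifest or any other build step by hand, the shell's exit status shows whether the process failed normally or was killed by a signal, since POSIX shells report a signal death as a status above 128. A minimal sketch, where run_and_classify is a made-up helper name:

```shell
#!/bin/sh
# Run a command and report whether it exited on its own or was killed by a
# signal (reported by the shell as 128 + signal number).
run_and_classify() {
    status=0
    "$@" || status=$?
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with status $status"
    fi
}
```

For example, run_and_classify followed by the path of a compiled manifest binary would print which of the two cases applied.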

Given that his build sometimes fails at the manifest and other times after that when executing the build, this is likely some general QEMU flakiness or at least some incompatibility with Swift. We can't single out SwiftPM's manifest generation because he reports it sometimes passing that initial step in the logs above.

Can we fix this message to make it clear that it's a secondary failure? It's pretty easy to see how someone reading a log with this message in it could misunderstand the nature of the problem. I would suggest something like "swift: failed to suspend one or more threads while processing a crash, backtraces will be missing information".

Seems like a good idea. I like the suggested wording too. I’ll raise a PR to change it.

Thanks for addressing the issue regarding confusing error messages, and thanks @lukasa (maintainer of grpc-swift) for weighing in as well. I'll forward your comment from the GitHub issue I opened here: "We cannot trigger this error message with our own package code, and we do not. The packages compile fine under Rosetta for Linux so it’s highly likely this is specifically an interaction between QEMU and either Swift or SwiftPM."

Given that I previously reported crashes at varying stages of the build under Swift 5.6, I suspect it is the same underlying incompatibility issue - the error message only changed due to the more recent changes to the Linux thread crash handling code. Since this is fairly easy to reproduce, can anyone here suggest a way forward to identify and resolve the underlying issue?

@NeoNacho is there a good way to observe the actual underlying behaviour of the package manifest compilation and execution?

Not really since we don't really anticipate manifests crashing at runtime. Any suggestions on what would be helpful?

If you delete all previous build artifacts with rm -rf .build ~/.cache/org.swift.* then run swift build --vv, you can see exactly what commands are used to compile and run the manifest and then run them manually yourself.
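Those steps could be wrapped up roughly as follows. The function names and the log file name are made up for this sketch, and it assumes the verbose log includes the -package-description-version flag that appears in the "compiled with" error above.

```shell
#!/bin/sh
# Remove earlier build artifacts and SwiftPM caches, then rebuild with very
# verbose output, keeping a copy of everything printed.
# Note: the pipeline's exit status is tee's, not swift build's.
clean_verbose_build() {
    log="${1:-build-verbose.log}"
    rm -rf .build ~/.cache/org.swift.*
    swift build --vv 2>&1 | tee "$log"
}

# Print only the logged lines that mention the manifest compile flag, so the
# commands can be copied out and re-run manually.
manifest_commands() {
    grep -e '-package-description-version' "$1"
}
```

Run clean_verbose_build from the package directory (for example /grpc-swift in the Dockerfile above), then manifest_commands build-verbose.log to pull out the manifest compile lines.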

I saw this same problem intermittently when running a CLI written using Swift concurrency in Docker. Unfortunately, I don't have enough info yet to understand what's going on.