Swift 5.7+ Docker Containers hang

I'm trying to run a more recent Swift on my Scaleway VPS, but starting w/ 5.7, calling the driver just seems to hang.

5.5 and 5.6 work:

docker run -it --rm --name ts swift:5.6 bash
root@b48676db35cc:/# swift --version
Swift version 5.6.3 (swift-5.6.3-RELEASE)
Target: x86_64-unknown-linux-gnu

Starting w/ Swift 5.7 (up to 6.0.1 and nightlies) this hangs:

docker run -it --rm --name ts swift:5.7 bash
root@b48676db35cc:/# swift --version
# hangs forever

The VPS is small, only 2GB of physical memory, but 8GB of swap, which doesn't get touched. Is there a minimum amount of memory required by Swift now? I played with -m 1g --memory-swap=8gb etc, that doesn't seem to help.

Starting the driver in GDB got me this suspicious thing, a call to _dispatch_temporary_resource_shortage (temporary resource shortage that isn't temporary?).

Tried a fresh instance w/ 4 CPU cores and that seems to work, so it looks like Swift 5.7+ doesn't work w/ just 2 cores?

Old instance:

root@scaleway:~# cat /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD EPYC 7281 16-Core Processor
stepping	: 2
microcode	: 0x800126e
cpu MHz		: 2096.062
cache size	: 512 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2

New instance:

root@9879d177b8c9:/# cat /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD EPYC 7281 16-Core Processor
stepping	: 2
microcode	: 0x800126e
cpu MHz		: 2096.060
cache size	: 512 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4

Note how it is the same exact CPU, just w/ 2 vs 4 cores.

Update: Notably 2 cores do work for @finestructure, on a Xeon VM.

Maybe this one is looping, but that doesn't look like it was touched since 5.6: Blaming swift-corelibs-libdispatch/src/queue.c at e85f6a0d5c9ea1f32f5013c3fa34e4fc146cd0eb · apple/swift-corelibs-libdispatch · GitHub

cc @al45tair & @ktoso

I wonder if it thinks there are more than two cores. I've seen problems in the past with things that detect the number of actual cores on the system rather than the number of cores assigned to the container.
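
One way to check that suspicion from inside the container is to compare the CPU counts the process sees with any CPU quota set on its cgroup. This is only a minimal sketch: it assumes cgroup v2 mounted at /sys/fs/cgroup, and the helper name `quota_cpus` is made up for illustration.

```shell
# Convert the contents of a cgroup v2 cpu.max file ("<quota> <period>" or
# "max <period>") into a whole number of CPUs, rounding up.
# Prints "unlimited" when no quota is set or the input is empty.
quota_cpus() (
  set -- $1            # split "<quota> <period>" into $1 and $2
  if [ "$1" = "max" ] || [ -z "$2" ]; then
    echo unlimited
  else
    echo $(( ($1 + $2 - 1) / $2 ))
  fi
)

# CPUs the kernel reports as online.
echo "online CPUs:   $(getconf _NPROCESSORS_ONLN)"
# CPUs this process is allowed to run on (respects the affinity mask).
echo "affinity CPUs: $(nproc)"
# CPU time quota imposed on the cgroup, if the file is there.
if [ -r /sys/fs/cgroup/cpu.max ]; then
  echo "cgroup quota:  $(quota_cpus "$(cat /sys/fs/cgroup/cpu.max)")"
fi
```

As far as I know, `docker run --cpus=...` shows up as a quota in cpu.max while `--cpuset-cpus=...` changes the affinity mask, so the three numbers can legitimately differ. If all of them agree, a miscounted core number is probably not the cause.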

@Helge_Hess1 could you try the following, on the system where things aren't working:

$ getconf _NPROCESSORS_CONF
<some number>
$ gcc -x c - -o getaffinity <<EOF
#define _GNU_SOURCE 1

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <pthread.h>

int main(void) {
  cpu_set_t cpuset;

  if (pthread_getaffinity_np(pthread_self(), sizeof(cpuset), &cpuset) != 0) {
    fprintf(stderr, "getaffinity: error %d\n", errno);
    return 1;
  }

  printf("%d\n", CPU_COUNT(&cpuset));

  return 0;
}
EOF
$ ./getaffinity
<some number>

and tell me what it says?

Could be the same issue: The first interaction with Swift CLI in ubuntu-latest GitHub workflow takes an extra 10 seconds. · Issue #76993 · swiftlang/swift · GitHub

The CPU itself has 16 cores as shown in the cpuinfo:
model name : AMD EPYC 7281 16-Core Processor

root@scaleway:~# getconf _NPROCESSORS_CONF
2
root@scaleway:~# ./getaffinity
2

As far as I can tell it hangs forever in my VM.

That's not the problem then. As far as I can see, Dispatch uses those numbers and both of them say 2, which is what we expect.

Do you have lldb or perf or something installed in your VM that you could use to get a backtrace? e.g. if you have lldb, you could do

$ lldb -- swiftc --version
(lldb) target create "swiftc"
(lldb) run
# wait a while to make sure it's settled down
<CTRL-C>
(lldb) bt all

and see what that says.

I posted that above already. This is w/ gdb in the swift:5.7 image:

root@6d7be96bf00f:/# gdb swiftc
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04.2) 12.1
(gdb) r --version
Starting program: /usr/bin/swiftc --version
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
process 401 is executing new program: /usr/bin/swift-driver
warning: Section .debug_names in /usr/bin/swift-driver length 5772 does not match section length 1577208, ignoring .debug_names.
warning: Could not find DWO CU /home/build-user/build/buildbot_linux/swiftdriver-linux-x86_64/x86_64-unknown-linux-gnu/release/ModuleCache/28OHVX4XUBPU4/CSwiftScan-P6X4JH2ZLZ8C.pcm(0x9abb811053751965) referenced by CU at offset 0x11f73 [in module /usr/bin/swift-driver]
...
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
^C
Program received signal SIGINT, Interrupt.
0x000076213c0a078a in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffd50215db0, rem=rem@entry=0x7ffd50215db0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
78	../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
(gdb) bt
#0  0x000076213c0a078a in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffd50215db0, rem=rem@entry=0x7ffd50215db0)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1  0x000076213c0a5677 in __GI___nanosleep (req=req@entry=0x7ffd50215db0, rem=rem@entry=0x7ffd50215db0) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x000076213c0a55ae in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x000076213ccc21eb in _dispatch_temporary_resource_shortage () from /usr/bin/../lib/swift/linux/libdispatch.so
#4  0x000076213ccd1265 in _dispatch_root_queue_poke_slow () from /usr/bin/../lib/swift/linux/libdispatch.so
#5  0x000076213ccdbffe in _dispatch_epoll_init () from /usr/bin/../lib/swift/linux/libdispatch.so
#6  0x000076213ccc8cbf in _dispatch_once_callout () from /usr/bin/../lib/swift/linux/libdispatch.so
#7  0x000076213ccdbf66 in _dispatch_event_loop_poke () from /usr/bin/../lib/swift/linux/libdispatch.so
#8  0x000076213cccc234 in _dispatch_lane_resume_activate () from /usr/bin/../lib/swift/linux/libdispatch.so
#9  0x000076213cd21d0d in $s8Dispatch0A6SourceCAA0aB8ProtocolA2aDP6resumeyyFTW () from /usr/bin/../lib/swift/linux/libswiftDispatch.so
#10 0x0000623db05929bb in swift_driver_main () at /home/build-user/swift-driver/Sources/swift-driver/main.swift:59
(gdb) info threads
  Id   Target Id                                      Frame 
* 1    Thread 0x762139ee27c0 (LWP 401) "swift-driver" 0x000076213c0a078a in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffd50215db0, 
    rem=rem@entry=0x7ffd50215db0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78

Sorry, hadn't spotted that. It isn't immediately obvious to me what's going wrong, but I'll ask some of our Dispatch experts to take a look and see what they think.

Was that the entire output from info threads?

If it was, then the issue seems to be to do with the pthreads implementation itself, since we should be able to create more than one thread. That makes me wonder what version of Linux you're using/which version of Glibc is installed on the machine that is going wrong.

Also it occurs to me that with Docker you're sharing the system with other processes. If you had a runaway process outside of the container that had made absolutely boatloads of threads, thread creation within the container might conceivably fail.
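
A rough way to rule that out on the host is to compare the number of tasks the kernel currently tracks with its global thread limit. The sketch below relies on the fourth field of /proc/loadavg, which has the form running/total and counts processes and threads together; the helper name `total_tasks` is made up for illustration.

```shell
# Extract the total number of kernel scheduling entities (processes plus
# threads) from a /proc/loadavg line such as "0.00 0.00 0.00 1/163 4711".
total_tasks() (
  set -- $1            # split the line into fields
  echo "${4#*/}"       # drop the "running/" prefix of the fourth field
)

if [ -r /proc/loadavg ] && [ -r /proc/sys/kernel/threads-max ]; then
  echo "tasks in use: $(total_tasks "$(cat /proc/loadavg)")"
  echo "threads-max:  $(cat /proc/sys/kernel/threads-max)"
fi
```

If the first number is anywhere near the second, thread creation could plausibly fail; if it is orders of magnitude smaller, a runaway process is unlikely to be the explanation.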

Yes, it was only one thread. As far as I can tell it is GCD very early in the bootstrapping phase?
Also remember: This affects 5.7+
5.5 and 5.6 images run just fine! So they seem to be able to create threads.

The system is under very low load, I can't imagine that there is a particular issue w/ creating two threads:

top - 16:15:26 up 2 days, 19:36,  2 users,  load average: 0.00, 0.00, 0.00
Tasks: 156 total,   1 running, 155 sleeping,   0 stopped,   0 zombie

It is the official swift:5.7 image.

root@17d76df72ec9:/lib# ldd --version
ldd (Ubuntu GLIBC 2.35-0ubuntu3.8) 2.35
root@17d76df72ec9:/lib# cat /etc/debian_version 
bookworm/sid

The host is

root@scaleway:~# uname -a
Linux scaleway 6.8.0-47-generic #47-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 27 21:40:26 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@scaleway:~# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 24.04.1 LTS
Release:	24.04
Codename:	noble

What does ulimit -a have to say? How about getconf PTHREAD_THREADS_MAX? Also cat /proc/sys/kernel/threads-max and cat /proc/sys/kernel/pid_max?

Host:

root@scaleway:/var/log# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 7601
max locked memory           (kbytes, -l) 251168
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 7601
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
root@scaleway:/var/log# getconf PTHREAD_THREADS_MAX
undefined
root@scaleway:/var/log# cat /proc/sys/kernel/threads-max
15203
root@scaleway:/var/log# cat /proc/sys/kernel/pid_max
4194304

Container:

root@scaleway:/var/log# docker run -it --rm swift:5.7 bash
root@9ee5b9851daf:/# getconf PTHREAD_THREADS_MAX
undefined
root@9ee5b9851daf:/# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 7601
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) unlimited
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
root@9ee5b9851daf:/# cat /proc/sys/kernel/threads-max
15203
root@9ee5b9851daf:/# cat /proc/sys/kernel/pid_max
4194304

The host does have a process limit set, but presumably ps -e | wc -l on the host says something much less than 7,601?
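
One caveat: `ps -e` lists processes, whereas on Linux the "max user processes" limit is, per the getrlimit(2) man page, counted in threads for the user. A sketch for getting the per-user thread count instead, assuming the procps version of `ps` (the helper name `threads_per_user` is made up for illustration):

```shell
# Count threads per user from "ps -eL -o user=" style input (one user name
# per line, one line per thread) and print "<count> <user>", busiest first.
threads_per_user() {
  sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# -e: all processes, -L: one line per thread, -o user=: user column, no header.
ps -eL -o user= | threads_per_user | head -n 5
```

The counts can then be compared against the `ulimit -u` value for the user in question.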

163, also shown in the top output above. I don't think it is a specific constraint, given that Swift 5.6 does run just fine.
I've also tried to find something in a log file, but couldn't so far.

I'll have to leave for today, maybe I could fire up a container somehow and give you SSH access if you want to explore it further.
The other thing I could try is see whether I can fire up a fresh Scaleway 2-CPU and check whether it has the same issue (this one is years old and upgraded forwards, I even did the upgrade to Noble just to see whether that would fix the issue).
Given that 2-CPU works for Sven on Xeons, maybe it is some weird AMD EPYC vs Xeon difference?