Hello!
I've been using the DistributedCluster package in a service of mine. Everything has worked well so far, except that I've run into an issue when running the service in a release configuration on Linux via Docker.
After some digging, I believe I've isolated the issue to the point where the cluster is initialized:
let clusterSystem = await ClusterSystem("TestCluster")
I have a reproduction of the issue with a simple main.swift:
import DistributedCluster
let clusterSystem = await ClusterSystem("TestRunCluster")
try await Task.sleep(for: .seconds(5))
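For completeness, the package manifest driving this repro is essentially just the DistributedCluster dependency. This is a sketch of what I'm using; the exact version requirement is whatever the repo pins (the `from:` value below is a placeholder, not the precise tag):

```swift
// swift-tools-version:5.8
import PackageDescription

let package = Package(
    name: "CrashingCluster",
    dependencies: [
        // DistributedCluster is a product of the swift-distributed-actors package;
        // the version requirement here is a placeholder.
        .package(url: "https://github.com/apple/swift-distributed-actors.git", from: "1.0.0-beta.1"),
    ],
    targets: [
        .executableTarget(
            name: "CrashingCluster",
            dependencies: [
                .product(name: "DistributedCluster", package: "swift-distributed-actors"),
            ]
        ),
    ]
)
```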
When running with the Backtrace package installed, I get the following:
Received signal 11. Backtrace:
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] ClusterSystem [TestRunCluster] initialized, listening on: sact://TestRunCluster@127.0.0.1:7337: _ActorRef<ClusterShell.Message>(/system/cluster)
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .autoLeaderElection: LeadershipSelectionSettings(underlying: DistributedCluster.ClusterSystemSettings.LeadershipSelectionSettings.(unknown context at $aaaad5a3b1dc)._LeadershipSelectionSettings.lowestReachable(minNumberOfMembers: 2))
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .downingStrategy: DowningStrategySettings(underlying: DistributedCluster.DowningStrategySettings.(unknown context at $aaaad5a3979c)._DowningStrategySettings.timeout(DistributedCluster.TimeoutBasedDowningStrategySettings(downUnreachableMembersAfter: 1.0 seconds)))
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .onDownAction: OnDownActionStrategySettings(underlying: DistributedCluster.OnDownActionStrategySettings.(unknown context at $aaaad5a3971c)._OnDownActionStrategySettings.gracefulShutdown(delay: 3.0 seconds))
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Binding to: [sact://TestRunCluster@127.0.0.1:7337]
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster/leadership cluster/node=sact://TestRunCluster@127.0.0.1:7337 leadership/election=DistributedCluster.Leadership.LowestReachableMember [DistributedCluster] Not enough members [1/2] to run election, members: [Member(sact://TestRunCluster:2481186327279040895@127.0.0.1:7337, status: joining, reachability: reachable)]
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Bound to [IPv4]127.0.0.1/127.0.0.1:7337
Since the backtrace only showed a signal 11 (SIGSEGV), I tried AddressSanitizer to see if I could get more information, which gave me this (the cluster's own log lines are interleaved with the ASan report):
AddressSanitizer:DEADLYSIGNAL
=================================================================
==1==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0xffff819de570 sp 0xffff819de560 T3)
==1==Hint: pc points to the zero page.
==1==The signal is caused by a READ memory access.
==1==Hint: address points to the zero page.
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] ClusterSystem [TestRunCluster] initialized, listening on: sact://TestRunCluster@127.0.0.1:7337: _ActorRef<ClusterShell.Message>(/system/cluster)
#0 0x0 (<unknown module>)
#1 0xaaaac2c62014 (/CrashingCluster+0x1e82014)
#2 0xaaaac2c62754 (/CrashingCluster+0x1e82754)
#3 0xaaaac2c2008c (/CrashingCluster+0x1e4008c)
#4 0xaaaac2c1fdf4 (/CrashingCluster+0x1e3fdf4)
#5 0xaaaac2c2c098 (/CrashingCluster+0x1e4c098)
#6 0xffff85f7d5c4 (/lib/aarch64-linux-gnu/libc.so.6+0x7d5c4) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
#7 0xffff85fe5d18 (/lib/aarch64-linux-gnu/libc.so.6+0xe5d18) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
AddressSanitizer can not provide additional info.
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .autoLeaderElection: LeadershipSelectionSettings(underlying: DistributedCluster.ClusterSystemSettings.LeadershipSelectionSettings.(unknown context at $aaaac374b1dc)._LeadershipSelectionSettings.lowestReachable(minNumberOfMembers: 2))
SUMMARY: AddressSanitizer: SEGV (<unknown module>)
Thread T3 created by T1 here:
#0 0xaaaac149fb68 (/CrashingCluster+0x6bfb68)
#1 0xaaaac2c28478 (/CrashingCluster+0x1e48478)
#2 0xaaaac2c2b694 (/CrashingCluster+0x1e4b694)
#3 0xaaaac2c24c04 (/CrashingCluster+0x1e44c04)
#4 0xaaaac2c2c098 (/CrashingCluster+0x1e4c098)
#5 0xffff85f7d5c4 (/lib/aarch64-linux-gnu/libc.so.6+0x7d5c4) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
#6 0xffff85fe5d18 (/lib/aarch64-linux-gnu/libc.so.6+0xe5d18) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
Thread T1 created by T0 here:
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .downingStrategy: DowningStrategySettings(underlying: DistributedCluster.DowningStrategySettings.(unknown context at $aaaac374979c)._DowningStrategySettings.timeout(DistributedCluster.TimeoutBasedDowningStrategySettings(downUnreachableMembersAfter: 1.0 seconds)))
#0 0xaaaac149fb68 (/CrashingCluster+0x6bfb68)
#1 0xaaaac2c28478 (/CrashingCluster+0x1e48478)
#2 0xaaaac2c634cc (/CrashingCluster+0x1e834cc)
#3 0xaaaac2c6293c (/CrashingCluster+0x1e8293c)
#4 0xaaaac2c62014 (/CrashingCluster+0x1e82014)
#5 0xaaaac2c62754 (/CrashingCluster+0x1e82754)
#6 0xaaaac18ce5b4 (/CrashingCluster+0xaee5b4)
#7 0xffff85f273f8 (/lib/aarch64-linux-gnu/libc.so.6+0x273f8) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
#8 0xffff85f274c8 (/lib/aarch64-linux-gnu/libc.so.6+0x274c8) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
#9 0xaaaac143efac (/CrashingCluster+0x65efac)
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .onDownAction: OnDownActionStrategySettings(underlying: DistributedCluster.OnDownActionStrategySettings.(unknown context at $aaaac374971c)._OnDownActionStrategySettings.gracefulShutdown(delay: 3.0 seconds))
==1==ABORTING
As far as I can tell, the problem only arises when running on Linux with this Dockerfile:
# ================================
# Build image
# ================================
FROM swift:5.8-jammy as builder
RUN mkdir /workspace
WORKDIR /workspace
COPY . /workspace
RUN swift build --sanitize=address -c release -Xswiftc -g --static-swift-stdlib
# ================================
# Run image
# ================================
FROM ubuntu:jammy
COPY --from=builder /workspace/.build/release/CrashingCluster /
EXPOSE 7337
ENTRYPOINT ["./CrashingCluster"]
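To reproduce, I build and run the image from the repo root roughly like this (image name and port mapping are just what I happen to use; the crash occurs shortly after startup, before the 5-second sleep elapses):

```shell
# Build the image from the repo root (Dockerfile above).
docker build -t crashing-cluster .

# Run it; 7337 is the cluster's default bind port, exposed in the Dockerfile.
docker run --rm -p 7337:7337 crashing-cluster
```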
This reproduction, along with the Dockerfile, can be found in this repo, if it helps.
Any thoughts as to the cause of this?
I'd appreciate any help!