Distributed Actors Cluster Crashing (5.8+linux+asan+static-stdlib)


I've been using the DistributedCluster package in one of my services. Everything has worked well so far, except that I've run into a crash when running the service in a release configuration on Linux via Docker.

After some digging, I believe I've isolated the issue to when the cluster is initialized:

let clusterSystem = await ClusterSystem("TestCluster")

I have a reproduction of the issue with a simple main.swift:

import DistributedCluster

let clusterSystem = await ClusterSystem("TestRunCluster")
try await Task.sleep(for: .seconds(5))

When running with Backtrace installed, I get the following:

Received signal 11. Backtrace:
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@ [DistributedCluster] ClusterSystem [TestRunCluster] initialized, listening on: sact://TestRunCluster@ _ActorRef<ClusterShell.Message>(/system/cluster)
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@ [DistributedCluster] Setting in effect: .autoLeaderElection: LeadershipSelectionSettings(underlying: DistributedCluster.ClusterSystemSettings.LeadershipSelectionSettings.(unknown context at $aaaad5a3b1dc)._LeadershipSelectionSettings.lowestReachable(minNumberOfMembers: 2))
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@ [DistributedCluster] Setting in effect: .downingStrategy: DowningStrategySettings(underlying: DistributedCluster.DowningStrategySettings.(unknown context at $aaaad5a3979c)._DowningStrategySettings.timeout(DistributedCluster.TimeoutBasedDowningStrategySettings(downUnreachableMembersAfter: 1.0 seconds)))
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@ [DistributedCluster] Setting in effect: .onDownAction: OnDownActionStrategySettings(underlying: DistributedCluster.OnDownActionStrategySettings.(unknown context at $aaaad5a3971c)._OnDownActionStrategySettings.gracefulShutdown(delay: 3.0 seconds))
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster cluster/node=sact://TestRunCluster@ [DistributedCluster] Binding to: [sact://TestRunCluster@]
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster/leadership cluster/node=sact://TestRunCluster@ leadership/election=DistributedCluster.Leadership.LowestReachableMember [DistributedCluster] Not enough members [1/2] to run election, members: [Member(sact://TestRunCluster:2481186327279040895@, status: joining, reachability: reachable)]
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster cluster/node=sact://TestRunCluster@ [DistributedCluster] Bound to [IPv4]

Since the backtrace only reported signal 11 (SIGSEGV), I tried AddressSanitizer to see if I could get more information, which gave me:

==1==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0xffff819de570 sp 0xffff819de560 T3)
==1==Hint: pc points to the zero page.
==1==The signal is caused by a READ memory access.
==1==Hint: address points to the zero page.
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@ [DistributedCluster] ClusterSystem [TestRunCluster] initialized, listening on: sact://TestRunCluster@ _ActorRef<ClusterShell.Message>(/system/cluster)
    #0 0x0  (<unknown module>)
    #1 0xaaaac2c62014  (/CrashingCluster+0x1e82014)
    #2 0xaaaac2c62754  (/CrashingCluster+0x1e82754)
    #3 0xaaaac2c2008c  (/CrashingCluster+0x1e4008c)
    #4 0xaaaac2c1fdf4  (/CrashingCluster+0x1e3fdf4)
    #5 0xaaaac2c2c098  (/CrashingCluster+0x1e4c098)
    #6 0xffff85f7d5c4  (/lib/aarch64-linux-gnu/libc.so.6+0x7d5c4) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #7 0xffff85fe5d18  (/lib/aarch64-linux-gnu/libc.so.6+0xe5d18) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)

AddressSanitizer can not provide additional info.
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@ [DistributedCluster] Setting in effect: .autoLeaderElection: LeadershipSelectionSettings(underlying: DistributedCluster.ClusterSystemSettings.LeadershipSelectionSettings.(unknown context at $aaaac374b1dc)._LeadershipSelectionSettings.lowestReachable(minNumberOfMembers: 2))
SUMMARY: AddressSanitizer: SEGV (<unknown module>)
Thread T3 created by T1 here:
    #0 0xaaaac149fb68  (/CrashingCluster+0x6bfb68)
    #1 0xaaaac2c28478  (/CrashingCluster+0x1e48478)
    #2 0xaaaac2c2b694  (/CrashingCluster+0x1e4b694)
    #3 0xaaaac2c24c04  (/CrashingCluster+0x1e44c04)
    #4 0xaaaac2c2c098  (/CrashingCluster+0x1e4c098)
    #5 0xffff85f7d5c4  (/lib/aarch64-linux-gnu/libc.so.6+0x7d5c4) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #6 0xffff85fe5d18  (/lib/aarch64-linux-gnu/libc.so.6+0xe5d18) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)

Thread T1 created by T0 here:
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@ [DistributedCluster] Setting in effect: .downingStrategy: DowningStrategySettings(underlying: DistributedCluster.DowningStrategySettings.(unknown context at $aaaac374979c)._DowningStrategySettings.timeout(DistributedCluster.TimeoutBasedDowningStrategySettings(downUnreachableMembersAfter: 1.0 seconds)))
    #0 0xaaaac149fb68  (/CrashingCluster+0x6bfb68)
    #1 0xaaaac2c28478  (/CrashingCluster+0x1e48478)
    #2 0xaaaac2c634cc  (/CrashingCluster+0x1e834cc)
    #3 0xaaaac2c6293c  (/CrashingCluster+0x1e8293c)
    #4 0xaaaac2c62014  (/CrashingCluster+0x1e82014)
    #5 0xaaaac2c62754  (/CrashingCluster+0x1e82754)
    #6 0xaaaac18ce5b4  (/CrashingCluster+0xaee5b4)
    #7 0xffff85f273f8  (/lib/aarch64-linux-gnu/libc.so.6+0x273f8) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #8 0xffff85f274c8  (/lib/aarch64-linux-gnu/libc.so.6+0x274c8) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #9 0xaaaac143efac  (/CrashingCluster+0x65efac)

2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@ [DistributedCluster] Setting in effect: .onDownAction: OnDownActionStrategySettings(underlying: DistributedCluster.OnDownActionStrategySettings.(unknown context at $aaaac374971c)._OnDownActionStrategySettings.gracefulShutdown(delay: 3.0 seconds))
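
The frames in the report above are raw module offsets, most likely because the ubuntu:jammy run image has no llvm-symbolizer for ASan to call. As a minimal sketch, assuming the report was saved to a file and that binutils' addr2line is available next to a copy of the binary built with -g, the offsets could be resolved offline. The function names and the asan.log file name are hypothetical and not part of the reproducer.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the reproducer repo.

# Read an ASan report on stdin and print "module offset" for every frame
# that names a module, e.g. "/CrashingCluster 0x1e82014".
# Frames without a module, such as "(<unknown module>)", are skipped.
extract_offsets() {
  sed -n 's|^ *#[0-9][0-9]* 0x[0-9a-f]*  *(\(/[^+]*\)+\(0x[0-9a-f]*\)).*$|\1 \2|p'
}

# Resolve the frames that belong to one binary.
#   $1: path to a copy of the binary; the report is read from stdin.
# Assumption: addr2line accepts the module-relative offsets that ASan prints.
symbolize() {
  binary=$1
  extract_offsets | while read -r module offset; do
    if [ "$(basename "$module")" = "$(basename "$binary")" ]; then
      addr2line -f -e "$binary" "$offset"
    fi
  done
}
```

Usage would look something like `symbolize .build/release/CrashingCluster < asan.log`. The function names it prints are mangled Swift symbols, so piping the result through `swift demangle` should make them readable.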

As far as I can tell, the problem only arises when running on Linux with this Dockerfile:

# ================================
# Build image
# ================================
FROM swift:5.8-jammy as builder

RUN mkdir /workspace
WORKDIR /workspace

COPY . /workspace

RUN swift build --sanitize=address -c release -Xswiftc -g --static-swift-stdlib

# ================================
# Run image
# ================================
FROM ubuntu:jammy

COPY --from=builder /workspace/.build/release/CrashingCluster /


ENTRYPOINT ["./CrashingCluster"]

This reproduction, along with the Dockerfile, can be found in this repo, if it helps.

Any thoughts as to the cause of this?

I'd appreciate any help!

cc @ktoso

Thanks for reporting, I'll have a look and try to reproduce.

The best place to report crashes is the GitHub repo, apple/swift-distributed-actors. Would you mind filing an issue there as well so we can track it properly?

I'm going to dive deeper into this (though reporting on GitHub would definitely help so we don't lose this).

I did notice a suspicious combination here, though: asan + static-stdlib + linux. We have seen various false positives on Linux with such combinations, and the 5.9 release without static-stdlib does not reproduce the crash (though it does report 2 leaks, which I'd still like to look into). Might I ask you to check whether your system runs okay without static-stdlib and the sanitizer for the time being?

Thanks for the reproducer, those help a lot.

Of course! Just filed it :)

For sure, I hope the reproducer is useful!

Yeah, it seems to work without the static-stdlib. For example, when I run with a bind mount to my filesystem via

docker run -v "$PWD:/code" -w /code swift:latest swift run -c release

both my original application, as well as the reproducer, appear to work.
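
To narrow down which of the two suspect flags matters, each could be toggled independently in the build step. This is a minimal sketch that uses only the flags from the Dockerfile above; the build_args helper and its argument names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints `swift build` arguments for one combination.
#   $1: "asan" to enable AddressSanitizer, anything else to leave it off
#   $2: "static" to link the Swift stdlib statically, anything else for dynamic
build_args() {
  args="-c release -Xswiftc -g"
  if [ "$1" = "asan" ]; then
    args="$args --sanitize=address"
  fi
  if [ "$2" = "static" ]; then
    args="$args --static-swift-stdlib"
  fi
  printf '%s\n' "$args"
}
```

For example, `swift build $(build_args noasan static)` would test static linking on its own. A dynamically linked build would also need the Swift runtime libraries in the run image, which plain ubuntu:jammy does not ship.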

I'll make a note of this in the GitHub issue as well.

Thanks for the quick reply!

Thanks, that's good to hear -- we have some reason to believe this is an effect of static stdlib being broken with swift concurrency...

And the fix is probably "[5.8] IRGen: Don't directly call async functions that have weak/linkonce_odr linkage" by aschwaighofer (Pull Request #65254, apple/swift), which is in 5.9 but wasn't in 5.8. That would explain why it does not happen in the nightly 5.9 container.