Random crashes when trying to look up values

The dictionary is protected with an NIOLock, but we’re still encountering random crashes when trying to look up values. I’m not sure why this is happening. Any advice or suggestions would be greatly appreciated.

0 0x00007ff67e22efa1 NativeDictionary.lookup(:) + 17 in libswiftCore.so
1 [ra] [thunk] 0x000055662a894932 partial apply for closure #1 in SafeDictionary.subscript.getter + 33 in Call at /build/.build/checkouts/nexus/Sources/Nexus/ConcurrencyHelper/SafeDictionary.swift:55:39
2 [ra] 0x000055662a4a3379 NIOLock.withLock(:) + 56 in Call at /build/.build/checkouts/swift-nio/Sources/NIOConcurrencyHelpers/NIOLock.swift:242:20
3 [ra] 0x000055662a893b41 SafeDictionary.subscript.getter + 80 in Call at /build/.build/checkouts/nexus/Sources/Nexus/ConcurrencyHelper/SafeDictionary.swift:54:23
4 [ra] 0x000055662accf80b ClusterStorage.lookup(
:) + 2746 in Call at /build/.build/checkouts/nexus-distribution/Sources/NexusDistribution/ClusterStorage.swift:506:17
5 [async] [system] 0x000055662acbe370 ClusterStorage.lookup(:) in Call at /build/
6 [async] 0x000055662acbb340 ClusterController.lookup
(:type:) in Call at /build/.build/checkouts/nexus-distribution/Sources/NexusDistribution/ClusterController.swift:163
7 [async] 0x000055662ace7160 Nexus.lookup
(_:) in Call at /build/.build/checkouts/nexus-distribution/Sources/NexusDistribution/Extensions/NexusDistributedMethods.swift:17
8 [async] 0x0000556629b14d30 closure #1 in CallActor.notifyDashboard(type:) in Call at /build/Sources/Call/Call/APIs/Internal/Dashboard/NotifyDashboard.swift:21
9 [async] [thunk] 0x000055662996ec20 partial apply for closure #1 in CallActor.notifyDashboard(type:) in Call at /build/
10 [async] [system] 0x00007ff67e043790 completeTaskWithClosure(swift::AsyncContext*, swift::SwiftError*) in libswift_Concurrency.so

Do you have a more complete error message? Either a crash report (on macOS) or a complete stack trace and exit code on Linux?

Here’s some context:

I’m using a distributed actor as a local storage mechanism for actors. The lookup method is supposed to check if the current ID is already present in the storage or in the active tasks. If it’s not found in either, it creates a new task and returns a future result.

Both the storage and activeTasks dictionaries are wrapped in a SafeDictionary, which uses an NIOLock for thread safety. This setup is running in a Linux Docker container with Swift 5.10.1 and the latest version of Swift NIO. I’ve attached the full backtrace file for reference. Any insights would be helpful.

Logs

public distributed actor ClusterStorage {

    public var storage: SafeDictionary<UInt64, any NexusDistributedActor> = .init()
    public var activeTasks: SafeDictionary<AnyHashable, Task<any NexusDistributedActor, Error>> = .init()


    
    distributed func lookup(_ identifier: ActorIdentity) async throws -> ID {
        let hash = MurmurHash3.x64_128.digest(identifier.stringValue).first!
        
           if let instance = self.storage[hash] {
                return instance.id
            } else {
                if let task = self.activeTasks[identifier.id] {
                    return try await task.value.id
                } else {
                    let task = Task<any NexusDistributedActor, Error> {
                        do {
                            let actor = try await self.controller.lookup(identifier, nexus: self.nexus, actorSystem: self.actorSystem)
                            self.storage.updateValue(actor, forKey: hash)
                            self.activeTasks.removeValue(forKey: identifier.id)
                            return actor
                        }
                        catch {
                            self.activeTasks.removeValue(forKey: identifier.id)
                            throw error
                        }
                    }
                
                    self.activeTasks.updateValue(task, forKey: identifier.id)
                    return try await task.value.id
                }
            }
    }
}


public struct SafeDictionary<V: Hashable,T>: Collection  {
    
    private var dictionary: [V: T]
    private var lock: NIOLock = .init()
    
    public init(dict: [V: T] = [V:T]()) {
        self.dictionary = dict
    }
    
    
    @discardableResult
    public mutating func updateValue(_ value: T, forKey key: V) -> T? {
        return self.lock.withLock {
            return self.dictionary.updateValue(value, forKey: key)
        }
    }
    
    
    public var startIndex: Dictionary<V, T>.Index {
        self.lock.withLock {
            return self.dictionary.startIndex
        }
    }
    
    public var endIndex: Dictionary<V, T>.Index {
        self.lock.withLock {
            return self.dictionary.endIndex
        }
    }
    
    public func index(after i: Dictionary<V, T>.Index) -> Dictionary<V, T>.Index {
        self.lock.withLock {
            return self.dictionary.index(after: i)
        }
    }
    
    public subscript(key: V) -> T? {
        set(newValue) {
            self.lock.withLock {
                self.dictionary[key] = newValue
            }
        }
        get {
            self.lock.withLock {
                return self.dictionary[key]
            }
        }
    }
    
    public subscript(index: Dictionary<V, T>.Index) -> Dictionary<V, T>.Element {
        self.lock.withLock {
            return self.dictionary[index]
        }
    }
    @discardableResult
    public mutating func removeValue(forKey key: V) -> T? {
        return self.lock.withLock {
            self.dictionary.removeValue(forKey: key)
        }
    }
    
    public mutating func removeAll() {
        self.lock.withLock {
            self.dictionary.removeAll()
        }
    }
    
    public var keys: Dictionary<V, T>.Keys {
        self.lock.withLock {
            return self.dictionary.keys
        }
    }
    
    public var values: Dictionary<V, T>.Values {
        self.lock.withLock {
            return self.dictionary.values
        }
    }
}

I'm not sure that SafeDictionary does what you want it to do, and as used here it is explicitly unsafe. The fact that you have mutating functions on this dictionary make it fairy clear that the values don't behave sensibly.

In this case, you have two options: either fix SafeDictionary, or avoid needing it. Given that you are using it from inside an actor, you may simply not need it. If you do, consider replacing SafeDictionary with NIOLockedValueBox, which should prevent these issues entirely.

I initially thought I wouldn’t need an additional lock within the actor, but I was experiencing these types of crashes more frequently before switching to SafeDictionary.

Try removing the lock and moving back to the regular dictionary, and then turning on strict concurrency checking.

I enabled strict concurrency checking, but it didn't provide any specific errors or warnings for the lookup logic. So, I wrapped it in NIOLockedValueBox to see if that makes a difference. I typically encounter 2-3 crashes in this particular area within a 24-hour period. I’ll update you tomorrow on whether this approach helped.

I don't think that's safe approach, e.g. consider this usage:

// will cause a race
if dictionary[key] == nil { // individually atomic
    dictionary[key] = value // individually atomic
}

Here the individual get and set are protected but not the whole sequence.

I'd structure it differently so that the whole sequence is atomic, for example:

withLock { // atomic
    if dictionary[key] == nil { // unprotected
        dictionary[key] = value // unprotected
    }
}

i.e. have the lock at a higher level. Beyond that it would be possible to atomically change several data structures (if needed).