Don't coroutines disallow yielding inside a closure?
It might be problematic to implement this pattern with Mutex/OSAllocatedUnfairLock, as they don't provide an unsafe API for acquiring a lock with types other than Void. And if we change the type to Void, we'll have to store the value separately (which isn't as bad as copying the dictionary, of course).
You should also implement the _modify accessor in your MutexCache, so you don't just overwrite all your existing data with a slightly modified copy every time.
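For illustration, a sketch of what that could look like (names like InPlaceCache and modifyValue are made up, and Foundation's NSLock stands in for Synchronization.Mutex so it runs on older toolchains):

```swift
import Foundation

// Sketch only: the point is mutating one entry in place under the lock
// instead of copying the whole dictionary out and writing a modified
// copy back.
final class InPlaceCache: @unchecked Sendable {
    private let lock = NSLock()
    private var storage: [Int: Int] = [:]

    func modifyValue(for key: Int, default defaultValue: Int,
                     _ body: (inout Int) -> Void) {
        lock.lock()
        defer { lock.unlock() }
        // Dictionary's subscript(_:default:) goes through a _modify
        // accessor here, so the entry is updated in place with a single
        // hash lookup.
        body(&storage[key, default: defaultValue])
    }

    func value(for key: Int) -> Int? {
        lock.lock()
        defer { lock.unlock() }
        return storage[key]
    }
}

let cache = InPlaceCache()
cache.modifyValue(for: 1, default: 0) { $0 += 5 }
cache.modifyValue(for: 1, default: 0) { $0 += 5 }
print(cache.value(for: 1) ?? 0)  // prints 10
```

The same shape works with Mutex's withLock; the closure receives the protected value inout, so the in-place mutation pattern carries over directly.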
I initially thought the mutex would be faster, since actors have some overhead from suspension and context switching. And in fact, while writing this response I checked your benchmark and saw that it is faster.
Yeah, that's unfortunate.
Ah, haven't thought about that!
Here is a slightly more optimized version where, instead of the whole dict, I just get/set individual values. I've also updated the actor solution. I checked against @dabrahams' benchmark and it does indeed give a performance boost.
actor Cache {
    var r: [Int: Int] = [:]

    func update(key: Int, with value: Int) {
        self.r[key] = value
    }

    func getValue(for key: Int) -> Int? {
        self.r[key]
    }
}

func compute(_ input: [Int]) async -> [Int: Int] {
    let cache = Cache()

    @discardableResult
    func fib(_ x: Int, cache: Cache) async -> Int {
        if let y = await cache.getValue(for: x) { return y }
        let y = await x < 2 ? 1 : fib(x - 1, cache: cache) + fib(x - 2, cache: cache)
        await cache.update(key: x, with: y)
        return y
    }

    await withTaskGroup(of: Void.self) { group in
        for z in input {
            group.addTask {
                await fib(z, cache: cache)
            }
        }
        await group.waitForAll()
    }
    return await cache.r
}
import Synchronization

final class MutexCache: Sendable {
    let r: Mutex<[Int: Int]> = Mutex([:])

    func update(key: Int, with value: Int) {
        self.r.withLock { $0[key] = value }
    }

    func getValue(for key: Int) -> Int? {
        self.r.withLock { $0[key] }
    }
}

func mutexCompute(_ input: [Int]) async -> [Int: Int] {
    let mutexCache = MutexCache()

    @discardableResult
    func fib(_ x: Int, cache: MutexCache) -> Int {
        if let y = cache.getValue(for: x) { return y }
        let y = x < 2 ? 1 : fib(x - 1, cache: cache) + fib(x - 2, cache: cache)
        cache.update(key: x, with: y)
        return y
    }

    await withTaskGroup(of: Void.self) { group in
        for z in input {
            group.addTask {
                fib(z, cache: mutexCache)
            }
        }
        await group.waitForAll()
    }
    return mutexCache.r.withLock { $0 }
}
Here are some results, if anyone is interested:
Instructions
| Test                          |   p0 |  p25 |  p50 |  p75 |  p90 |  p99 | p100 | Samples |
|-------------------------------|------|------|------|------|------|------|------|---------|
| ParallelDAG:Jaleel (K) *      |  405 |  466 |  470 |  479 |  504 | 1062 | 1444 |    6795 |
| ParallelDAG:Mutex (K) *       |  223 |  241 |  263 |  270 |  282 |  383 |  589 |    8152 |
| ParallelDAG:Operations (K) *  | 3256 | 4090 | 4399 | 4755 | 5100 | 5788 | 6577 |    1752 |
| ParallelDAG:Tasks (K) *       | 1899 | 2587 | 2744 | 2900 | 3043 | 3275 | 3843 |    2482 |

Malloc (total)
| Test                          |  p0 | p25 |  p50 |  p75 |  p90 |  p99 |  p100 | Samples |
|-------------------------------|-----|-----|------|------|------|------|-------|---------|
| ParallelDAG:Jaleel *          |  20 |  71 |   71 |   80 |   97 |  175 |   370 |    6795 |
| ParallelDAG:Mutex *           |   2 |  60 |   60 |   60 |   60 |   71 |    99 |    8152 |
| ParallelDAG:Operations *      | 961 | 987 | 1072 | 1093 | 1176 | 1273 | 10219 |    1752 |
| ParallelDAG:Tasks *           |  99 | 405 |  518 |  681 |  845 | 1259 |  1692 |    2482 |

Memory (resident peak)
| Test                          | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|-------------------------------|----|-----|-----|-----|-----|-----|------|---------|
| ParallelDAG:Jaleel (M)        | 10 |  13 |  13 |  13 |  13 |  13 |   13 |    6795 |
| ParallelDAG:Mutex (M)         |  9 |  13 |  13 |  13 |  13 |  13 |   13 |    8152 |
| ParallelDAG:Operations (M)    | 10 |  19 |  21 |  22 |  23 |  23 |   23 |    1752 |
| ParallelDAG:Tasks (M)         | 10 |  14 |  14 |  15 |  15 |  15 |   15 |    2482 |

Throughput (# / s)
| Test                          |   p0 |  p25 |  p50 |  p75 |  p90 |  p99 | p100 | Samples |
|-------------------------------|------|------|------|------|------|------|------|---------|
| ParallelDAG:Jaleel (K)        |   19 |   18 |   18 |   17 |   15 |    8 |    5 |    6795 |
| ParallelDAG:Mutex (K)         |   40 |   37 |   36 |   34 |   31 |   21 |    8 |    8152 |
| ParallelDAG:Operations (#)    | 3906 | 2943 | 2631 | 2307 | 2049 | 1608 | 1244 |    1752 |
| ParallelDAG:Tasks (#)         | 4783 | 3679 | 3475 | 3249 | 3017 | 2511 |  350 |    2482 |

Time (total CPU)
| Test                          |  p0 | p25 | p50 | p75 | p90 |  p99 | p100 | Samples |
|-------------------------------|-----|-----|-----|-----|-----|------|------|---------|
| ParallelDAG:Jaleel (μs) *     |  57 |  68 |  70 |  73 |  83 |  177 |  273 |    6795 |
| ParallelDAG:Mutex (μs) *      |  28 |  32 |  35 |  38 |  45 |   77 |  256 |    8152 |
| ParallelDAG:Operations (μs) * | 349 | 566 | 653 | 766 | 885 | 1172 | 1648 |    1752 |
| ParallelDAG:Tasks (μs) *      | 283 | 394 | 427 | 471 | 517 |  611 |  714 |    2482 |

Time (wall clock)
| Test                          |  p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|-------------------------------|-----|-----|-----|-----|-----|-----|------|---------|
| ParallelDAG:Jaleel (μs) *     |  54 |  56 |  57 |  59 |  66 | 129 |  215 |    6795 |
| ParallelDAG:Mutex (μs) *      |  25 |  27 |  28 |  29 |  32 |  48 |  125 |    8152 |
| ParallelDAG:Operations (μs) * | 256 | 340 | 380 | 434 | 488 | 622 |  804 |    1752 |
| ParallelDAG:Tasks (μs) *      | 209 | 272 | 288 | 308 | 332 | 398 | 2859 |    2482 |
I think I still prefer actors, precisely for scaling, and not only because of cache size. Of course, if you want to be more performant on a single machine, or need more low-level control, a mutex seems like the better solution here.
Yeah, I've just added the update and getValue functions.
Yes, but fortunately you don't need to use the closure form of locking.
The implementation in my repo is already avoiding those overheads, I'm pretty certain. LMK if I've got that wrong (in fact open a PR).
It's somewhat dangerous not to, though, if you're going to mix mutexes with async. Leaving a lock held while an async task is suspended risks starving the work pool by synchronously blocking other async tasks that try to hold the lock. New-generation lightweight locks like SRWLOCK/os_unfair_lock/futex can also be corrupted if they are locked on one thread but unlocked when the task resumes on a different kernel thread. This is why the standard library Mutex type only provides synchronous-closure-taking withLock APIs to access the locked state.
Sorry, this is a bit off topic, but I was curious why the identity-function-like closure { (x: inout Value) -> Value in x } is used here. I assume it has something to do with the semantics of inout; at a guess, does it avoid a copy of self?
AFAICT it forces Dictionary to go through this code path, i.e. the _modify accessor, managing to both:
- assign the default value to key if missing, and
- make only one hash lookup (for both the read and the write at once).

(I'm slightly aghast that the Dictionary API doesn't make that cleaner to express.)
Huh, as a relative beginner I find it surprising that calling that subscript on an inout parameter uses the _modify accessor. I really wouldn't expect dict[key, default: blah] to sometimes result in an in-place modification of dict when the key isn't present. I'm also somewhat surprised that there's no set accessor for that subscript, just get and _modify.
Edit: ah, never mind; it looks like providing a get & _modify pair for a property is pretty standard in the stdlib, and I guess I see why there's no set and _modify.
Yeah, I thought of that later… It's easy to transform the use of accessors to avoid that problem (done). But how does the closure form solve this problem? It's still blocking, and I thought blocking was in general incompatible with async (one reason, I assumed, that it's so hard to call async code from synchronous code). I guess the thread pool will spin up new threads as needed for these cases?
@pyrtsa understood correctly. Yes, we are both aghast that it isn't in the standard library. I'm pretty sure I filed a radar about this years ago.
When the key isn't present, it always results in a modification.
I guess I see why there's no set and _modify

Sometimes it's more efficient to have both set and _modify; for example, set can handle d[x] = y more efficiently when d is a dictionary.
There it is: [SR-9870] Dictionary needs an API suitable for memoization · Issue #52276 · swiftlang/swift · GitHub
It's not that you never want to block in async code so much as that you don't want to block indefinitely. That's a hazy distinction, but to me a good rule of thumb is not to do anything inside a lock guard that could theoretically also be async and create more work for the system (disk/network IO, waiting for UI events, etc.). Grabbing a lightweight lock to quickly update some state and then releasing it is generally fine, usually more efficient than any alternative synchronization mechanism, and doesn't risk leaving other threads blocked for an unpredictable amount of time waiting for external events. Swift doesn't have a strict type-system distinction between sync and synchrony-agnostic code, so it's true that withLock APIs still can't, on their own, strictly enforce that you hold locks correctly, though they can at least prevent the memory-safety problem of a lightweight lock being taken on one thread and released on another.
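A minimal sketch of that rule of thumb, assuming NSLock as the lightweight lock and a made-up expensiveCompute standing in for slow work:

```swift
import Foundation

// Do the potentially slow work outside the critical section, and hold
// the lock only for the quick state update.
let lock = NSLock()
var cache: [Int: Int] = [:]

func expensiveCompute(_ x: Int) -> Int {
    // Imagine disk/network IO or heavy computation here.
    return x * x
}

func store(_ x: Int) {
    let result = expensiveCompute(x)  // no lock held while computing
    lock.lock()                       // lock only around the state update
    cache[x] = result
    lock.unlock()
}

store(7)
print(cache[7] ?? 0)  // prints 49
```

The lock is never held across anything that could stall, so other threads wait at most for the duration of one dictionary update.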
When the key isn't present, it always results in a modification.
Waaait, ok, now I really am confused. Looking at the get accessor in Dictionary, I would assume this simply returns the default value without modifying _variant, since it just uses the ?? coalescing operator:

get {
    return _variant.lookup(key) ?? defaultValue()
}

So I guess I'm confused about when the get accessor is used and when _modify is used. Apparently subscripting an inout Dictionary is at least one case that uses _modify, even when the value doesn't get directly assigned, like x[key, default: 100] += 6.
Sorry, I hope I'm not derailing this thread too much. I know this is getting pretty heavily into minutiae and _modify isn't even public API. I appreciated your answer in any case, thank you.
Do they do that any better than careful use of lock()/unlock() pairs? I'm just trying to understand whether the changes I made to use the closure form made any real difference.
The closure-based form currently provides two useful invariants:
- Since the closure is not async, the code inside the lock guard is prevented from suspending the current task, so the unlock happens on the same OS thread as the lock (which is a hard requirement for many lightweight lock implementations).
- For the noncopyable Mutex standard library type, the closure runs while holding a borrow of the mutex, which ensures the value cannot be moved while the lock is held. This guarantee, combined with noncopyability, is what allows Mutex to use inline storage for the lock.

The latter could be superseded by the use of a nonescaping LockGuard type, whose lifetime could serve the purpose of keeping the lock borrowed while the lock is held without being strictly scoped. (And if you're already using out-of-line allocation for the lock storage, the borrow is less essential for correctness.)
_modify is used exactly when the subscript expression x[…] is passed to an inout parameter, or is the receiver of a mutating method. If Swift were more consistent about requiring the use of &, you could say _modify is used whenever you see &x[…], but when an operator like += or a mutating method is called, we omit the &.
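If it helps, this can be observed directly. A sketch, relying on the underscored _modify accessor, which compiles in user code on current toolchains but isn't supported public API:

```swift
// Log which accessor each use site picks.
var log: [String] = []

struct Counter {
    var storage: [Int: Int] = [:]
    subscript(key: Int) -> Int {
        get {
            log.append("get")
            return storage[key, default: 0]
        }
        _modify {
            log.append("_modify")
            yield &storage[key, default: 0]
        }
    }
}

var c = Counter()
_ = c[1]    // plain read: uses get (and does not insert anything)
c[1] += 5   // receiver of a mutating operator: uses _modify
print(log)  // prints ["get", "_modify"]
```

The += line never touches get: the single _modify coroutine yields the storage for both the read and the write.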
I just realized that none of the implementations in terms of TaskGroup are actually creating a new task for each fib invocation. It doesn't look like there's any way to use TaskGroup that way, where the dependency computations are discovered.

Maybe when we implement support for lifetime dependencies, we can return the task from addTask. We could work around it with withUnsafeContinuation and return a copy of the intermediate result. But that's not the end of the story: addTask is a mutating function, and we can't capture an inout TaskGroup in an escaping context. So we would have to take an unsafe mutable pointer to the task group to use from the tasks, and we'd also have to protect it with another mutex so we don't have overlapping mutating accesses to the group.

Why do you want to use TaskGroup anyway?
I really hope generalized lifetime dependencies never happen in Swift, and I don't see how they would solve any problems here. AFAICT returning the task would be no problem today, since Tasks have reference semantics and are safe to read from any thread. The problem is that there's no way for a task running in a task group to add more tasks to the group.
Why do you want to use TaskGroup anyway?

I don't; I was just blindly following the lead of one of the answers posted here. I fixed the examples to not do that anymore, and the timing differences are much less stark now, though OSAllocatedUnfairLock still wins. I also added benchmarks using GCD and pthread R/W locks, FWIW.
I believe when a task is spawned by a group it will be stack allocated. So you can't allow it to outlive the group. And without lifetime dependencies we can't express this constraint.
Yep, it's the second problem I mentioned.
Ah, okay then. I'm pretty sure you can win even more if you use a bare stack-allocated os_unfair_lock (or an alternative) without OSAllocatedUnfairLock. You can guarantee the lock won't escape your "compute" function, so you don't need all those extra retains/releases.
Something like
var lock = os_unfair_lock_s()
withUnsafeMutablePointer(to: &lock) { lockPtr in
    ...
    do {
        os_unfair_lock_lock(lockPtr)
        defer { os_unfair_lock_unlock(lockPtr) }
        ... // read/update the cache (probably also has to be passed as a pointer) ...
    }
}
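For what it's worth, here's a portable analogue of the same idea using pthread_mutex_t, since os_unfair_lock is Darwin-only (names are hypothetical; shown single-threaded for brevity, whereas a real use would take the lock from multiple tasks):

```swift
import Foundation

// The lock lives in the stack frame of `compute`, so no
// reference-counted wrapper object (and its retain/release traffic)
// is involved.
func compute(_ input: [Int]) -> [Int: Int] {
    var mutex = pthread_mutex_t()
    var cache: [Int: Int] = [:]
    withUnsafeMutablePointer(to: &mutex) { lockPtr in
        pthread_mutex_init(lockPtr, nil)
        defer { pthread_mutex_destroy(lockPtr) }
        for x in input {
            pthread_mutex_lock(lockPtr)
            defer { pthread_mutex_unlock(lockPtr) }
            // Critical section: count occurrences of x.
            cache[x, default: 0] += 1
        }
    }
    return cache
}

print(compute([1, 2, 1])[1] ?? 0)  // prints 2
```

All lock and unlock calls go through the stable pointer obtained from withUnsafeMutablePointer, which matters because pthread (like os_unfair_lock) requires the lock's address not to change between init and destroy.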