Don't coroutines disallow yielding inside a closure?
It might be problematic to implement this pattern with Mutex/OSAllocatedUnfairLock, as they don't provide an unsafe API for acquiring a lock with types other than Void. And if we change the type to Void, we'll have to store the value separately (which isn't as bad as copying the dictionary, of course).
You should also implement the _modify accessor in your MutexCache, so you don't just overwrite all your existing data with a slightly modified copy every time.
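For illustration, a sketch of what that could look like (names like InPlaceCache and modifyValue are made up, and Foundation's NSLock stands in for Synchronization.Mutex so it runs on older toolchains):

```swift
import Foundation

// Sketch only: the point is mutating one entry in place under the lock
// instead of copying the whole dictionary out and writing a modified
// copy back.
final class InPlaceCache: @unchecked Sendable {
    private let lock = NSLock()
    private var storage: [Int: Int] = [:]

    func modifyValue(for key: Int, default defaultValue: Int,
                     _ body: (inout Int) -> Void) {
        lock.lock()
        defer { lock.unlock() }
        // Dictionary's subscript(_:default:) goes through a _modify
        // accessor here, so the entry is updated in place with a single
        // hash lookup.
        body(&storage[key, default: defaultValue])
    }

    func value(for key: Int) -> Int? {
        lock.lock()
        defer { lock.unlock() }
        return storage[key]
    }
}

let cache = InPlaceCache()
cache.modifyValue(for: 1, default: 0) { $0 += 5 }
cache.modifyValue(for: 1, default: 0) { $0 += 5 }
print(cache.value(for: 1) ?? 0)  // prints 10
```

The same shape works with Mutex's withLock; the closure receives the protected value inout, so the in-place mutation pattern carries over directly.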
I initially thought the mutex would be faster, since actors have some overhead from suspension and context switching. And in fact, while writing this response I checked your benchmark and saw that it is faster.
Yeah, that's unfortunate.
Ah, haven't thought about that!
Here is a slightly more optimized version where, instead of the whole dict, I just get/set individual values. I've also updated the actor solution. I checked against @dabrahams' benchmark and it does indeed give a performance boost.
actor Cache {
    var r: [Int: Int] = [:]

    func update(key: Int, with value: Int) {
        self.r[key] = value
    }

    func getValue(for key: Int) -> Int? {
        self.r[key]
    }
}

func compute(_ input: [Int]) async -> [Int: Int] {
    let cache = Cache()

    @discardableResult
    func fib(_ x: Int, cache: Cache) async -> Int {
        if let y = await cache.getValue(for: x) { return y }
        let y = await x < 2 ? 1 : fib(x - 1, cache: cache) + fib(x - 2, cache: cache)
        await cache.update(key: x, with: y)
        return y
    }

    await withTaskGroup(of: Void.self) { group in
        for z in input {
            group.addTask {
                await fib(z, cache: cache)
            }
        }
        await group.waitForAll()
    }
    return await cache.r
}
import Synchronization

final class MutexCache: Sendable {
    let r: Mutex<[Int: Int]> = Mutex([:])

    func update(key: Int, with value: Int) {
        self.r.withLock { $0[key] = value }
    }

    func getValue(for key: Int) -> Int? {
        self.r.withLock { $0[key] }
    }
}

func mutexCompute(_ input: [Int]) async -> [Int: Int] {
    let mutexCache = MutexCache()

    @discardableResult
    func fib(_ x: Int, cache: MutexCache) -> Int {
        if let y = cache.getValue(for: x) { return y }
        let y = x < 2 ? 1 : fib(x - 1, cache: cache) + fib(x - 2, cache: cache)
        cache.update(key: x, with: y)
        return y
    }

    await withTaskGroup(of: Void.self) { group in
        for z in input {
            group.addTask {
                fib(z, cache: mutexCache)
            }
        }
        await group.waitForAll()
    }
    return mutexCache.r.withLock { $0 }
}
Here are some results, if anyone is interested:
Instructions
| Test                          |   p0 |  p25 |  p50 |  p75 |  p90 |  p99 | p100 | Samples |
|-------------------------------|------|------|------|------|------|------|------|---------|
| ParallelDAG:Jaleel (K) *      |  405 |  466 |  470 |  479 |  504 | 1062 | 1444 |    6795 |
| ParallelDAG:Mutex (K) *       |  223 |  241 |  263 |  270 |  282 |  383 |  589 |    8152 |
| ParallelDAG:Operations (K) *  | 3256 | 4090 | 4399 | 4755 | 5100 | 5788 | 6577 |    1752 |
| ParallelDAG:Tasks (K) *       | 1899 | 2587 | 2744 | 2900 | 3043 | 3275 | 3843 |    2482 |

Malloc (total)
| Test                          |  p0 | p25 |  p50 |  p75 |  p90 |  p99 |  p100 | Samples |
|-------------------------------|-----|-----|------|------|------|------|-------|---------|
| ParallelDAG:Jaleel *          |  20 |  71 |   71 |   80 |   97 |  175 |   370 |    6795 |
| ParallelDAG:Mutex *           |   2 |  60 |   60 |   60 |   60 |   71 |    99 |    8152 |
| ParallelDAG:Operations *      | 961 | 987 | 1072 | 1093 | 1176 | 1273 | 10219 |    1752 |
| ParallelDAG:Tasks *           |  99 | 405 |  518 |  681 |  845 | 1259 |  1692 |    2482 |

Memory (resident peak)
| Test                          | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|-------------------------------|----|-----|-----|-----|-----|-----|------|---------|
| ParallelDAG:Jaleel (M)        | 10 |  13 |  13 |  13 |  13 |  13 |   13 |    6795 |
| ParallelDAG:Mutex (M)         |  9 |  13 |  13 |  13 |  13 |  13 |   13 |    8152 |
| ParallelDAG:Operations (M)    | 10 |  19 |  21 |  22 |  23 |  23 |   23 |    1752 |
| ParallelDAG:Tasks (M)         | 10 |  14 |  14 |  15 |  15 |  15 |   15 |    2482 |

Throughput (# / s)
| Test                          |   p0 |  p25 |  p50 |  p75 |  p90 |  p99 | p100 | Samples |
|-------------------------------|------|------|------|------|------|------|------|---------|
| ParallelDAG:Jaleel (K)        |   19 |   18 |   18 |   17 |   15 |    8 |    5 |    6795 |
| ParallelDAG:Mutex (K)         |   40 |   37 |   36 |   34 |   31 |   21 |    8 |    8152 |
| ParallelDAG:Operations (#)    | 3906 | 2943 | 2631 | 2307 | 2049 | 1608 | 1244 |    1752 |
| ParallelDAG:Tasks (#)         | 4783 | 3679 | 3475 | 3249 | 3017 | 2511 |  350 |    2482 |

Time (total CPU)
| Test                          |  p0 | p25 | p50 | p75 | p90 |  p99 | p100 | Samples |
|-------------------------------|-----|-----|-----|-----|-----|------|------|---------|
| ParallelDAG:Jaleel (μs) *     |  57 |  68 |  70 |  73 |  83 |  177 |  273 |    6795 |
| ParallelDAG:Mutex (μs) *      |  28 |  32 |  35 |  38 |  45 |   77 |  256 |    8152 |
| ParallelDAG:Operations (μs) * | 349 | 566 | 653 | 766 | 885 | 1172 | 1648 |    1752 |
| ParallelDAG:Tasks (μs) *      | 283 | 394 | 427 | 471 | 517 |  611 |  714 |    2482 |

Time (wall clock)
| Test                          |  p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|-------------------------------|-----|-----|-----|-----|-----|-----|------|---------|
| ParallelDAG:Jaleel (μs) *     |  54 |  56 |  57 |  59 |  66 | 129 |  215 |    6795 |
| ParallelDAG:Mutex (μs) *      |  25 |  27 |  28 |  29 |  32 |  48 |  125 |    8152 |
| ParallelDAG:Operations (μs) * | 256 | 340 | 380 | 434 | 488 | 622 |  804 |    1752 |
| ParallelDAG:Tasks (μs) *      | 209 | 272 | 288 | 308 | 332 | 398 | 2859 |    2482 |
I think I still prefer actors, precisely for scaling, and not only because of cache size. Of course, if you want to be more performant on a single machine, or need more low-level control, a mutex seems like the better solution here.
Yeah, I've just added the update and getValue functions.
Yes, but fortunately you don't need to use the closure form of locking.
The implementation in my repo is already avoiding those overheads, I'm pretty certain. LMK if I've got that wrong (in fact open a PR).
It's somewhat dangerous not to, though, if you're going to mix mutexes with async. Leaving a lock held while an async task is suspended risks starving the work pool by synchronously blocking other async tasks that try to hold the lock. New-generation lightweight locks like SRWLOCK/os_unfair_lock/futex can also be corrupted if they are locked on one thread but unlocked when the task resumes on a different kernel thread. This is why the standard library Mutex type only provides synchronous-closure-taking withLock APIs to access the locked state.
Sorry, this is a bit off topic, but I was curious why the identity-function-like closure { (x: inout Value) -> Value in x } is used here. I assume it has something to do with the semantics of inout; at a guess, does it avoid a copy of self?
AFAICT it forces Dictionary to go through this code path, i.e. the _modify accessor, managing to both:
- assign the default value to key if missing, and
- make only one hash lookup (for both the read and the write at once).

(I'm slightly aghast that the Dictionary API doesn't make that cleaner to express.)
Huh, as a relative beginner I find it surprising that calling that subscript on an inout parameter uses the _modify accessor. I really wouldn't expect dict[key, default: blah] to sometimes result in an in-place modification of dict when the key isn't present. I'm also somewhat surprised that there's no set accessor for that subscript, just get and _modify.
Edit: ah, never mind; it looks like providing a get & _modify pair for a property is pretty standard in the stdlib, and I guess I see why there's no set and _modify.
Yeah, I thought of that later… It's easy to transform the use of accessors to avoid that problem (done). But how does the closure form solve this problem? It's still blocking, and I thought blocking was in general incompatible with async (one reason, I assumed, that it's so hard to call async code from synchronous code). I guess the thread pool will spin up new threads as needed for these cases?
@pyrtsa understood correctly. Yes, we are both aghast that it isn't in the standard library. I'm pretty sure I filed a radar about this years ago.
When the key isn't present, it always results in a modification.
I guess I see why there's no set and _modify

Sometimes it's more efficient to have both set and _modify; for example, set can handle d[x] = y more efficiently when d is a dictionary.
There it is: [SR-9870] Dictionary needs an API suitable for memoization · Issue #52276 · swiftlang/swift · GitHub
It's not that you never want to block in async code so much as that you don't want to block indefinitely. That's a hazy distinction, but to me a good rule of thumb is not to do anything inside a lock guard that could theoretically also be async and create more work for the system (disk/network IO, waiting for UI events, etc.). Grabbing a lightweight lock to quickly update some state and then releasing it is generally fine, usually more efficient than any alternative synchronization mechanism, and doesn't risk leaving other threads blocked for an unpredictable amount of time waiting for external events. Swift doesn't have a strict type-system distinction between sync and synchrony-agnostic code, so it's true that withLock APIs still can't, on their own, strictly enforce that you hold locks correctly, though they can at least prevent the memory-safety problem of a lightweight lock being taken on one thread and released on another.
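A minimal sketch of that rule of thumb, assuming NSLock as the lightweight lock and a made-up expensiveCompute standing in for slow work:

```swift
import Foundation

// Do the potentially slow work outside the critical section, and hold
// the lock only for the quick state update.
let lock = NSLock()
var cache: [Int: Int] = [:]

func expensiveCompute(_ x: Int) -> Int {
    // Imagine disk/network IO or heavy computation here.
    return x * x
}

func store(_ x: Int) {
    let result = expensiveCompute(x)  // no lock held while computing
    lock.lock()                       // lock only around the state update
    cache[x] = result
    lock.unlock()
}

store(7)
print(cache[7] ?? 0)  // prints 49
```

The lock is never held across anything that could stall, so other threads wait at most for the duration of one dictionary update.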
When the key isn't present, it always results in a modification.
Waaait, ok, now I really am confused. Looking at the get accessor in Dictionary, I would assume this simply returns the default value without modifying _variant, since it just uses the ?? coalescing operator:

get {
    return _variant.lookup(key) ?? defaultValue()
}

So I guess I'm confused about when the get accessor is used and when _modify is used. Apparently subscripting an inout Dictionary is at least one case that uses _modify, even when the value doesn't get directly assigned, like x[key, default: 100] += 6.
Sorry, I hope I'm not derailing this thread too much. I know this is getting pretty heavily into minutiae and _modify isn't even public API. I appreciated your answer in any case, thank you.
Do they do that any better than careful use of lock()/unlock() pairs? I'm just trying to understand whether the changes I made to use the closure form made any real difference.
The closure-based form currently provides two useful invariants:
- Since the closure is not async, the code inside the lock guard is prevented from suspending the current task, so the unlock happens on the same OS thread as the lock (which is a hard requirement for many lightweight lock implementations).
- For the noncopyable Mutex standard library type, the closure runs while holding a borrow of the mutex, which ensures the value cannot be moved while the lock is held. This guarantee, combined with noncopyability, is what allows Mutex to use inline storage for the lock.

The latter could be superseded by the use of a nonescaping LockGuard type, whose lifetime could serve the purpose of keeping the lock borrowed while the lock is held without being strictly scoped. (And if you're already using out-of-line allocation for the lock storage, the borrow is less essential for correctness.)
_modify is used exactly when the subscript expression x[…] is passed to an inout parameter, or is the receiver of a mutating method. If Swift were more consistent about requiring the use of &, you could say _modify is used whenever you see &x[…], but when an operator like += or a mutating method is called, we omit the &.
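If it helps, this can be observed directly. A sketch, relying on the underscored _modify accessor, which compiles in user code on current toolchains but isn't supported public API:

```swift
// Log which accessor each use site picks.
var log: [String] = []

struct Counter {
    var storage: [Int: Int] = [:]
    subscript(key: Int) -> Int {
        get {
            log.append("get")
            return storage[key, default: 0]
        }
        _modify {
            log.append("_modify")
            yield &storage[key, default: 0]
        }
    }
}

var c = Counter()
_ = c[1]    // plain read: uses get (and does not insert anything)
c[1] += 5   // receiver of a mutating operator: uses _modify
print(log)  // prints ["get", "_modify"]
```

The += line never touches get: the single _modify coroutine yields the storage for both the read and the write.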
I just realized that none of the implementations in terms of TaskGroup are actually creating a new task for each fib invocation. It doesn't look like there's any way to use TaskGroup that way, where the dependency computations are discovered.

Maybe when we implement support for lifetime dependencies, we can return the task from addTask. We could work around it with withUnsafeContinuation and return a copy of the intermediate result. But that's not the end of the story: addTask is a mutating function, and we can't capture an inout TaskGroup in an escaping context. So we would have to take an unsafe mutable pointer to the task group to use from the tasks, and we'd also have to protect it with another mutex so we don't have overlapping mutating accesses to the group.

Why do you want to use TaskGroup anyway?
I really hope generalized lifetime dependencies never happen in Swift, and I don't see how they would solve any problems here. AFAICT returning the task would be no problem today, since Tasks have reference semantics and are safe to read from any thread. The problem is that there's no way for a task running in a task group to add more tasks to the group.
Why do you want to use TaskGroup anyway?

I don't; I was just blindly following the lead of one of the answers posted here. I fixed the examples to not do that anymore, and the timing differences are much less stark now, though OSAllocatedUnfairLock still wins. I also added benchmarks using GCD and pthread R/W locks, FWIW.
I believe when a task is spawned by a group it will be stack allocated. So you can't allow it to outlive the group. And without lifetime dependencies we can't express this constraint.
Yep, it's the second problem I mentioned.
Ah, okay then. I'm pretty sure you can win even more if you use a bare stack-allocated os_unfair_lock (or an alternative) without OSAllocatedUnfairLock. You can guarantee the lock won't escape your "compute" function, so you don't need all those extra retains/releases.
Something like
var lock = os_unfair_lock_s()
withUnsafeMutablePointer(to: &lock) { lockPtr in
    ...
    do {
        os_unfair_lock_lock(lockPtr)
        defer { os_unfair_lock_unlock(lockPtr) }
        ... // read/update the cache (probably also has to be passed as a pointer) ...
    }
}
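For what it's worth, here's a portable analogue of the same idea using pthread_mutex_t, since os_unfair_lock is Darwin-only (names are hypothetical; shown single-threaded for brevity, whereas a real use would take the lock from multiple tasks):

```swift
import Foundation

// The lock lives in the stack frame of `compute`, so no
// reference-counted wrapper object (and its retain/release traffic)
// is involved.
func compute(_ input: [Int]) -> [Int: Int] {
    var mutex = pthread_mutex_t()
    var cache: [Int: Int] = [:]
    withUnsafeMutablePointer(to: &mutex) { lockPtr in
        pthread_mutex_init(lockPtr, nil)
        defer { pthread_mutex_destroy(lockPtr) }
        for x in input {
            pthread_mutex_lock(lockPtr)
            defer { pthread_mutex_unlock(lockPtr) }
            // Critical section: count occurrences of x.
            cache[x, default: 0] += 1
        }
    }
    return cache
}

print(compute([1, 2, 1])[1] ?? 0)  // prints 2
```

All lock and unlock calls go through the stable pointer obtained from withUnsafeMutablePointer, which matters because pthread (like os_unfair_lock) requires the lock's address not to change between init and destroy.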