Note: for the following I was generally using 10 million elements in the array, but sometimes 100 million or a billion - I found 1 million was far too few to be able to profile the app, as it completes in just a few milliseconds even before optimisation.
On face value…
Essentially all your time is in ARC traffic.
Pre-computing the entire `chunks` array is expensive - doing that lazily improves performance 3x, i.e.:
```swift
let chunks = (0..<numberOfChunks).lazy.map { …
```
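For reference, a minimal sketch of what the full lazy version might look like - assuming each chunk is the same interleaved stride of indices that the inlined loop below uses (that chunking is my assumption, not the original code):

```swift
// Hypothetical sketch of the lazy chunking. Each "chunk" is a lazy stride
// of interleaved message indices; nothing is materialised up front.
let chunks = (0..<numberOfChunks).lazy.map { phase in
    stride(from: phase, to: messages.count, by: numberOfChunks)
}
```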
But inlining that `chunks` calculation completely improves it 5x even just for the N=1 case, and 225x for the N=10 case (i.e. 10 million elements now takes ~8ms instead of 1.8s on my M2 MacBook Air), because it completely eliminates all ARC activity in the core loops (spoiler: this has broken your benchmark, as I'll show in a minute).
Now all that's left is the actual cost of the array iteration and construction [of the result].
```swift
static func workResult(in messages: [Message], numberOfChunks: Int) async -> [Record] {
    return await withTaskGroup(of: [Record].self) { group in
        for phase in 0..<numberOfChunks {
            group.addTask {
                var records = [Record]()
                let d = Date()

                for messageIndex in stride(from: phase, to: messages.count, by: numberOfChunks) {
                    let result = await Parser.parse(data: messages[messageIndex].data)
                    records.append(result)
                }

                print("Processed \(records.count) messages in \(Date().timeIntervalSince(d)) s.")
                return records
            }
        }

        var records = [Record]()

        for await result in group {
            records.append(contentsOf: result)
        }

        return records
    }
}
```
With this optimisation, even a billion elements completes in ~2.5s (N=1) and 0.7s (N=10). Larger magnitudes hit the memory limits of my M2 MacBook Air (and of course performance craters at that point).
…but actually…
For varying magnitudes (within my RAM capacity) I see the N=10 version being 3x to 4x faster than the N=1 version. So not a great speed-up, for an 8-core M2, but it might merely be because it's over so fast that it doesn't really have time to spin up all the cores: I see CPU utilisation hitting about 3 cores-worth at best and only for a split second, but if I add an actual computation load to `parse(data:)` I see all cores fully utilised.
```swift
@_optimize(none)
func blackHole<T>(_ x: T) {}

struct Parser {
    // Decode the payload of a message into a Record
    static func parse(data: Data) async -> Record {
        var hash: UInt8 = 0

        data.withUnsafeBytes {
            for byte in $0 {
                hash &+= byte
            }
        }

        blackHole(hash)
        return Record()
    }
}
```
Note however that now the N=10 version is about four times slower than N=1, again. This is because the ARC traffic is back (which hurts the parallelised version more than the serialised one, for whatever reason - maybe cache-thrashing centred on those retain counters).
I think that previously `parse(data:)` was basically being optimised away to just the actual `Record.init()`, and consequently all the code for pulling out the relevant `Data` (and retain-releasing it) was optimised away too.
Benchmarking is hard.
This seems to be confirmed by returning to the original no-work version but forcing the optimiser to leave it as-is:
```swift
struct Parser {
    // Decode the payload of a message into a Record
    @_optimize(none)
    static func parse(data: Data) async -> Record {
        return Record()
    }
}
```
Cache ping-pong (because of ARC)?
What's telling is that the last two tasks run relatively quickly, because they don't launch until basically the first eight are done (8 CPU cores on my machine):
```
Task 1 started at 2024-05-20 15:59:42 +0000…
Task 0 started at 2024-05-20 15:59:42 +0000…
Task 2 started at 2024-05-20 15:59:42 +0000…
Task 3 started at 2024-05-20 15:59:42 +0000…
Task 4 started at 2024-05-20 15:59:42 +0000…
Task 5 started at 2024-05-20 15:59:42 +0000…
Task 6 started at 2024-05-20 15:59:42 +0000…
Task 7 started at 2024-05-20 15:59:42 +0000…
Processed 1000000 messages in 2.2029730081558228 s.
Task 8 started at 2024-05-20 15:59:44 +0000…
Processed 1000000 messages in 2.202566981315613 s.
Task 9 started at 2024-05-20 15:59:44 +0000…
Processed 1000000 messages in 2.2136240005493164 s.
Processed 1000000 messages in 2.2249550819396973 s.
Processed 1000000 messages in 2.2283610105514526 s.
Processed 1000000 messages in 2.230591058731079 s.
Processed 1000000 messages in 2.234092950820923 s.
Processed 1000000 messages in 2.23491895198822 s.
Processed 1000000 messages in 0.12816309928894043 s.
Processed 1000000 messages in 0.13089799880981445 s.
```
Now the performance sucks again, and has negative scaling with parallelism, seemingly because of all the ARC traffic (it's essentially the only thing in the time profiles, with `swift_retain` and `swift_release` accounting for 99% of samples).
I'm not sure how to eliminate that ARC traffic. I'm not even certain what's being retain-released (the Allocations tool in Instruments apparently doesn't work for Swift retains and releases, only Objective-C ones). From stepping through the disassembly in Xcode it appears that at least some of the ARC traffic is around `Foundation.__DataStorage`, which is the underlying storage for `Data`. Unfortunately the copy-on-write nature of `Data` is hurting you here (pity Swift doesn't offer an immutable version, like Objective-C did).
There's also some traffic around managing the `Record`s, of course - e.g. retaining them as they're appended to the `records` `Array`. And some technically redundant traffic when concatenating those arrays together at the end (because `Array` doesn't have a consuming version of `append(contentsOf:)` that could just 'steal' the existing +1 from the appended collection).
I don't think you can eliminate that in a real-world program if you genuinely need to accumulate the `Record`s, but if you don't need to - e.g. if you can pass them off immediately for processing, as they're instantiated - you can avoid at least some of that ARC traffic and general buffering overhead.
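As a sketch of that hand-off approach - where `process(_:)` is a hypothetical stand-in for whatever downstream processing you have - each chunk's array gets passed along (and released) as soon as its task finishes, skipping the final concatenation entirely:

```swift
// Sketch: hand each chunk's records off as it completes, instead of
// accumulating everything into one big result array. `process(_:)` is
// a placeholder for your real downstream consumer.
static func streamResults(in messages: [Message],
                          numberOfChunks: Int,
                          process: @Sendable ([Record]) -> Void) async {
    await withTaskGroup(of: [Record].self) { group in
        for phase in 0..<numberOfChunks {
            group.addTask {
                var records = [Record]()

                for messageIndex in stride(from: phase, to: messages.count, by: numberOfChunks) {
                    records.append(await Parser.parse(data: messages[messageIndex].data))
                }

                return records
            }
        }

        for await chunk in group {
            process(chunk)  // Handed off immediately; no append(contentsOf:).
        }
    }
}
```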
That all said, it's still a bit of a mystery to me why it's so much faster to just do it serially, as I haven't found any actual shared references of consequence between your tasks that would support a cache ping-ponging scenario. Yet your benchmark is basically nothing but ARC activity, so I don't know what else it could be.
Sidenote: integrated processing pipelines are way more efficient
Intermediaries (e.g. the results of non-lazy `map` and `filter` operations) aren't just inefficient because of the temporary memory they allocate; they have truly horrible effects on the efficiency of the overall processing. Doing one type of operation at a time, across all the data, is fundamentally inefficient. CPUs just don't like it (nor do GPUs or NPUs, although they're more tolerant of it).
Performance is basically dominated by power / heat, which is largely about moving electrons, so you want to move as few electrons as possible, over as short a distance as possible. That is: load each piece of data only once, do the complete algorithm on it, and emit only the final result back to memory (or the network, or wherever). Don't go any further from the processing units than you have to (registers > L0 > L1 > L2 > RAM > etc).
This is not easy to do in languages like Swift because (a) their libraries tend to discourage it, e.g. `map` being eager by default, and (b) the language relies very heavily on the imperfect optimiser to actually generate good machine code even if you do the right thing (like use `lazy`). But you can get a long way just by trying to avoid any explicit collection intermediaries in your code (some bounded cases, like SIMD types, aside).
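To make that concrete, here's an illustrative sketch (over a hypothetical `values` array) of the difference between an eager chain and a single fused loop:

```swift
let values = Array(0..<10_000_000)

// Eager chain: `map` materialises a full intermediate array, writing every
// element out to memory just so `filter` can read it all back in again.
let eagerResult = values.map { $0 &* 3 }.filter { $0.isMultiple(of: 2) }

// Fused loop: each element is loaded once, the whole computation runs on
// it, and only the final results are emitted. No intermediary collection.
var fusedResult = [Int]()
for value in values {
    let tripled = value &* 3
    if tripled.isMultiple(of: 2) {
        fusedResult.append(tripled)
    }
}
```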
Sidenote: `async` functions are slow[er]
As presented, you're declaring `parse(data:)` as `async`. Maybe it needs to be in your real program? But if not, making it sync makes this simple benchmark run five times faster (the impact is probably lower in your real-world code, once intermingled with actual work). It appears this is because of the overhead of allocating task state in order to call the async function - state which is never really used, because your so-called async function never actually hits a suspension point.
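Concretely, the sync variant is just the earlier hash-summing version with the keyword removed (and the `await` dropped at the call site); a sketch:

```swift
struct Parser {
    // Decode the payload of a message into a Record.
    // Identical to the async version, minus `async`: with no suspension
    // point there's no task state to allocate per call.
    static func parse(data: Data) -> Record {
        var hash: UInt8 = 0

        data.withUnsafeBytes {
            for byte in $0 {
                hash &+= byte
            }
        }

        blackHole(hash)
        return Record()
    }
}
```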
Sidenote: `Data` is slow
Using `[UInt8]` instead of `Data` makes the benchmark about twice as fast (when actually accessing the bytes), because `Data` is fundamentally slow; you cannot directly access its contents. Either:
- You go through a bunch of function call overhead for every byte, which is really slow, or
- You use `withUnsafeBytes`, but that duplicates the underlying malloc allocation and does a `memcpy` to populate it, which, while surprisingly faster than any other method, is of course still really inefficient. And it will be horrific if your data is large and you run out of readily free RAM.
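As an illustrative sketch - assuming `Message` held its payload as `[UInt8]` instead - the checksum can then iterate the array's contiguous storage directly:

```swift
// Sketch: the same checksum over [UInt8] instead of Data. The array's
// elements are read in place - no per-byte call overhead, and no duplicate
// allocation or memcpy just to get at the bytes.
static func parse(bytes: [UInt8]) -> Record {
    var hash: UInt8 = 0

    for byte in bytes {
        hash &+= byte
    }

    blackHole(hash)
    return Record()
}
```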
See URLSession performance for reading a byte stream as another example.