Note: for the following I was generally using 10 million elements in the array, but sometimes 100 million or a billion - I found 1 million was far too few to be able to profile the app, as it completes in just a few milliseconds even before optimisation.
On face value…
Essentially all your time is in ARC traffic.
Pre-computing the entire chunks array is expensive - computing it lazily instead improves performance 3x, i.e.:
let chunks = (0..<numberOfChunks).lazy.map { …
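For concreteness, here's a minimal sketch of the difference - the original chunking code isn't shown, so the exact shape (slices of messages computed from a chunk size) is a guess on my part:

    // Hypothetical reconstruction - the eager version materialises every chunk slice up front:
    let chunkSize = (messages.count + numberOfChunks - 1) / numberOfChunks
    let eagerChunks = (0..<numberOfChunks).map { chunk -> ArraySlice<Message> in
        let start = min(chunk * chunkSize, messages.count)
        return messages[start..<min(start + chunkSize, messages.count)]
    }

    // The lazy version is identical apart from the `.lazy`, and computes each slice
    // only when a task actually asks for it:
    let lazyChunks = (0..<numberOfChunks).lazy.map { chunk -> ArraySlice<Message> in
        let start = min(chunk * chunkSize, messages.count)
        return messages[start..<min(start + chunkSize, messages.count)]
    }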
But inlining that chunks calculation completely improves it 5x even just for the N=1 case, and 225x for the N=10 case (i.e. 10 million elements now takes ~8ms instead of 1.8s on my M2 MacBook Air), because it completely eliminates all ARC activity in the core loops (spoiler: this has broken your benchmark, as I'll show in a minute).
Now all that's left is the actual cost of the array iteration and construction [of the result].
static func workResult(in messages: [Message], numberOfChunks: Int) async -> [Record] {
    return await withTaskGroup(of: [Record].self) { group in
        // Interleave the work: task `phase` handles every numberOfChunks-th message.
        for phase in 0..<numberOfChunks {
            group.addTask {
                var records = [Record]()
                let d = Date()
                for messageIndex in stride(from: phase, to: messages.count, by: numberOfChunks) {
                    let result = await Parser.parse(data: messages[messageIndex].data)
                    records.append(result)
                }
                print("Processed \(records.count) messages in \(Date().timeIntervalSince(d)) s.")
                return records
            }
        }

        // Concatenate each task's results as it finishes.
        var records = [Record]()
        for await result in group {
            records.append(contentsOf: result)
        }
        return records
    }
}
Even with a billion elements this simple benchmark completes in ~2.5s (N=1) and 0.7s (N=10), with this optimisation. Larger magnitudes hit the memory limits of my M2 MacBook Air (and of course performance craters at that point).
…but actually…
For varying magnitudes (within my RAM capacity) I see the N=10 version being 3x to 4x faster than the N=1 version. That's not a great speed-up for an 8-core M2, but it might simply be because the benchmark is over so fast that it never really has time to spin up all the cores: I see CPU utilisation hitting about three cores' worth at best, and only for a split second. But if I add an actual computational load to parse(data:), I see all cores fully utilised:
@_optimize(none)
func blackHole<T>(_ x: T) {} // Keeps the optimiser from eliminating the computation of its argument.

struct Parser {
    // Decode the payload of a message into a Record
    static func parse(data: Data) async -> Record {
        var hash: UInt8 = 0
        data.withUnsafeBytes {
            for byte in $0 {
                hash &+= byte
            }
        }
        blackHole(hash)
        return Record()
    }
}
Note, however, that the N=10 version is now again about four times slower than N=1. This is because the ARC traffic is back (and it hurts the parallelised version more than the serialised one, for whatever reason - maybe cache-thrashing centred on the reference counts). I think that previously parse(data:) was basically being optimised down to just the Record.init() call, and consequently all the code for pulling out the relevant Data (and retaining/releasing it) was optimised away too.
Benchmarking is hard. 
This also seems to be borne out by returning to the original no-work version but forcing the optimiser to leave it as-is:
struct Parser {
    // Decode the payload of a message into a Record
    @_optimize(none)
    static func parse(data: Data) async -> Record {
        return Record()
    }
}
Cache ping-pong (because of ARC)?
What's telling is that the last two tasks run relatively quickly, because they don't launch until basically the first eight are done (8 CPU cores on my machine):
Task 1 started at 2024-05-20 15:59:42 +0000…
Task 0 started at 2024-05-20 15:59:42 +0000…
Task 2 started at 2024-05-20 15:59:42 +0000…
Task 3 started at 2024-05-20 15:59:42 +0000…
Task 4 started at 2024-05-20 15:59:42 +0000…
Task 5 started at 2024-05-20 15:59:42 +0000…
Task 6 started at 2024-05-20 15:59:42 +0000…
Task 7 started at 2024-05-20 15:59:42 +0000…
Processed 1000000 messages in 2.2029730081558228 s.
Task 8 started at 2024-05-20 15:59:44 +0000…
Processed 1000000 messages in 2.202566981315613 s.
Task 9 started at 2024-05-20 15:59:44 +0000…
Processed 1000000 messages in 2.2136240005493164 s.
Processed 1000000 messages in 2.2249550819396973 s.
Processed 1000000 messages in 2.2283610105514526 s.
Processed 1000000 messages in 2.230591058731079 s.
Processed 1000000 messages in 2.234092950820923 s.
Processed 1000000 messages in 2.23491895198822 s.
Processed 1000000 messages in 0.12816309928894043 s.
Processed 1000000 messages in 0.13089799880981445 s.
Now the performance sucks again, and has negative scaling with parallelism, seemingly because of all the ARC traffic (it's essentially the only thing in the time profiles, with swift_retain and swift_release accounting for 99% of samples).
I'm not sure how to eliminate that ARC traffic. I'm not even certain what's being retained and released (the Allocations tool in Instruments apparently doesn't track Swift retains and releases, only Objective-C ones). From stepping through the disassembly in Xcode it appears that at least some of the ARC traffic is around Foundation.__DataStorage, which is the underlying storage for Data. Unfortunately the copy-on-write nature of Data is hurting you here (it's a pity Swift doesn't offer an immutable variant, the way Objective-C does with NSData).
There's also some traffic around managing the Records, of course - e.g. retaining them as they're appended to the records Array. And some technically redundant traffic when concatenating those arrays together at the end (because Array doesn't have a consuming version of append(contentsOf:) that could just 'steal' the existing +1 from the appended collection).
I don't think you can eliminate that in a real-world program if you genuinely need to accumulate the Records, but if you don't need to - e.g. if you can pass them off immediately for processing, as they're instantiated - you can avoid at least some of that ARC traffic and general buffering overhead.
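A rough sketch of that alternative - this is not your original code; `consume` is a hypothetical stand-in for whatever downstream processing you'd do, and Message and Record are assumed to be Sendable:

    // Sketch only: hand each Record off as it's produced, instead of accumulating
    // per-task arrays and concatenating them at the end.
    static func workStreaming(in messages: [Message],
                              numberOfChunks: Int,
                              consume: @escaping @Sendable (Record) -> Void) async {
        await withTaskGroup(of: Void.self) { group in
            for phase in 0..<numberOfChunks {
                group.addTask {
                    for messageIndex in stride(from: phase, to: messages.count, by: numberOfChunks) {
                        // `consume` may be called concurrently from multiple tasks,
                        // so it needs to be thread-safe.
                        consume(await Parser.parse(data: messages[messageIndex].data))
                    }
                }
            }
        }
    }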
That all said, it's still a bit of a mystery to me why it's so much faster to just do it serially, as I haven't found any actual shared references of consequence between your tasks, to support a cache ping-ponging scenario. Yet your benchmark is basically nothing but ARC activity, so I don't know what else it could be.
Sidenote: integrated processing pipelines are way more efficient
Intermediaries (e.g. the results of non-lazy map and filter operations) aren't just inefficient because of the temporary memory they allocate; they have truly horrible effects on the efficiency of the overall processing. Doing one type of operation at a time, across all the data, is fundamentally inefficient - CPUs just don't like it (nor do GPUs or NPUs, although they're more tolerant of it).
Performance is basically dominated by power / heat, which is largely about moving electrons, so you want to move as few electrons and as short a distance as possible. i.e. load each piece of data only once and do the complete algorithm on it, and emit only the final result back to memory (or the network, or wherever). Don't go any further from the processing units than you have to (registers > L0 > L1 > L2 > RAM > etc).
This is not easy to do in languages like Swift because (a) their libraries tend to discourage it (e.g. map being eager by default), and (b) the language relies very heavily on the imperfect optimiser to actually generate good machine code even when you do the right thing (like use lazy). But you can get a long way just by trying to avoid any explicit collection intermediaries in your code (some bounded cases, like SIMD types, aside).
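As a trivial, contrived illustration of the difference (not taken from your code):

    let values = Array(0..<10_000_000)

    // Eager pipeline: map materialises a full intermediate array that filter then
    // re-reads, so every element makes an extra round trip through memory.
    let eager = values.map { $0 &* 3 }.filter { $0 % 2 == 0 }

    // Fused loop: each element is loaded once, the whole computation is applied to it,
    // and only the surviving results are written out.
    var fused = [Int]()
    fused.reserveCapacity(values.count)
    for value in values {
        let tripled = value &* 3
        if tripled % 2 == 0 {
            fused.append(tripled)
        }
    }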
Sidenote: async functions are slow[er]
As presented, you're declaring parse(data:) as async. Maybe it needs to be in your real program? But if not, making it synchronous makes this simple benchmark run five times faster (the impact is probably lower in your real-world code, once it's intermingled with actual work). It appears this is because of the overhead of allocating task state in order to call the async function - state which is never really used, because your so-called async function never actually hits a suspension point.
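i.e. something like the following - assuming your real parse(data:) never actually needs to suspend, and with the corresponding await removed at the call sites:

    struct Parser {
        // Identical to the byte-summing version above, just synchronous - there's no
        // suspension point, so there's no async task state to allocate per call.
        static func parse(data: Data) -> Record {
            var hash: UInt8 = 0
            data.withUnsafeBytes {
                for byte in $0 {
                    hash &+= byte
                }
            }
            blackHole(hash)
            return Record()
        }
    }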
Sidenote: Data is slow
Using [UInt8] instead of Data makes the benchmark about twice as fast (when actually accessing the bytes), because Data is fundamentally slow; you cannot directly access its contents. Either:
- You go through a bunch of function call overhead for every byte, which is really slow, or
- You use withUnsafeBytes, but that duplicates the underlying malloc allocation and does a memcpy to populate it - which, while surprisingly faster than any other method, is of course still really inefficient, and will be horrific if your data is large and you run out of readily-free RAM.
See URLSession performance for reading a byte stream as another example.
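For illustration, here's a standalone micro-example (not your benchmark) of the same byte-summing loop over a [UInt8] versus over a Data:

    import Foundation

    let bytes = [UInt8](repeating: 1, count: 10_000_000)
    let data = Data(bytes)

    // Direct iteration over the array's contiguous storage.
    var arrayHash: UInt8 = 0
    for byte in bytes {
        arrayHash &+= byte
    }

    // The same loop over Data, via withUnsafeBytes.
    var dataHash: UInt8 = 0
    data.withUnsafeBytes { buffer in
        for byte in buffer {
            dataHash &+= byte
        }
    }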