Note: for the following I was generally using 10 million elements in the array, but sometimes 100 million or a billion - I found 1 million was far too few to be able to profile the app, as it completes in just a few milliseconds even before optimisation.
On face value…
Essentially all your time is in ARC traffic.
Pre-computing the entire `chunks` array is expensive - doing that lazily improves performance 3x, i.e.:
```swift
let chunks = (0..<numberOfChunks).lazy.map { …
```
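For reference, a minimal sketch of what the full lazy version might look like - assuming each chunk is the same interleaved stride of indices that the inlined loop below uses (that chunking is my assumption, not the original code):

```swift
// Hypothetical sketch of the lazy chunking. Each "chunk" is a lazy stride
// of interleaved message indices; nothing is materialised up front.
let chunks = (0..<numberOfChunks).lazy.map { phase in
    stride(from: phase, to: messages.count, by: numberOfChunks)
}
```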
But inlining that `chunks` calculation completely improves it 5x even just for the N=1 case, and 225x for the N=10 case (i.e. 10 million elements now takes ~8ms instead of 1.8s on my M2 MacBook Air), because it completely eliminates all ARC activity in the core loops (spoiler: this has broken your benchmark, as I'll show in a minute).
Now all that's left is the actual cost of the array iteration and construction [of the result].
```swift
static func workResult(in messages: [Message], numberOfChunks: Int) async -> [Record] {
    return await withTaskGroup(of: [Record].self) { group in
        for phase in 0..<numberOfChunks {
            group.addTask {
                var records = [Record]()
                let d = Date()

                for messageIndex in stride(from: phase, to: messages.count, by: numberOfChunks) {
                    let result = await Parser.parse(data: messages[messageIndex].data)
                    records.append(result)
                }

                print("Processed \(records.count) messages in \(Date().timeIntervalSince(d)) s.")
                return records
            }
        }

        var records = [Record]()

        for await result in group {
            records.append(contentsOf: result)
        }

        return records
    }
}
```
With this optimisation, even a billion elements completes in ~2.5s (N=1) and 0.7s (N=10). Larger magnitudes hit the memory limits of my M2 MacBook Air (and of course performance craters at that point).
…but actually…
For varying magnitudes (within my RAM capacity) I see the N=10 version being 3x to 4x faster than the N=1 version. So not a great speed-up, for an 8-core M2, but it might merely be because it's over so fast that it doesn't really have time to spin up all the cores: I see CPU utilisation hitting about 3 cores-worth at best and only for a split second, but if I add an actual computation load to `parse(data:)` I see all cores fully utilised.
```swift
@_optimize(none)
func blackHole<T>(_ x: T) {}

struct Parser {
    // Decode the payload of a message into a Record
    static func parse(data: Data) async -> Record {
        var hash: UInt8 = 0

        data.withUnsafeBytes {
            for byte in $0 {
                hash &+= byte
            }
        }

        blackHole(hash)
        return Record()
    }
}
```
Note however that now the N=10 version is about four times slower than N=1, again. This is because the ARC traffic is back (which hurts the parallelised version more than the serialised one, for whatever reason - maybe cache-thrashing centred on those retain counters).
I think that previously `parse(data:)` was basically being optimised away to just the actual `Record.init()`, and consequently all the code for pulling out the relevant `Data` (and retain-releasing it) was optimised away too.
Benchmarking is hard.
This seems to be confirmed by returning to the original no-work version but forcing the optimiser to leave it as-is:
```swift
struct Parser {
    // Decode the payload of a message into a Record
    @_optimize(none)
    static func parse(data: Data) async -> Record {
        return Record()
    }
}
```
Cache ping-pong (because of ARC)?
What's telling is that the last two tasks run relatively quickly, because they don't launch until basically the first eight are done (8 CPU cores on my machine):
```
Task 1 started at 2024-05-20 15:59:42 +0000…
Task 0 started at 2024-05-20 15:59:42 +0000…
Task 2 started at 2024-05-20 15:59:42 +0000…
Task 3 started at 2024-05-20 15:59:42 +0000…
Task 4 started at 2024-05-20 15:59:42 +0000…
Task 5 started at 2024-05-20 15:59:42 +0000…
Task 6 started at 2024-05-20 15:59:42 +0000…
Task 7 started at 2024-05-20 15:59:42 +0000…
Processed 1000000 messages in 2.2029730081558228 s.
Task 8 started at 2024-05-20 15:59:44 +0000…
Processed 1000000 messages in 2.202566981315613 s.
Task 9 started at 2024-05-20 15:59:44 +0000…
Processed 1000000 messages in 2.2136240005493164 s.
Processed 1000000 messages in 2.2249550819396973 s.
Processed 1000000 messages in 2.2283610105514526 s.
Processed 1000000 messages in 2.230591058731079 s.
Processed 1000000 messages in 2.234092950820923 s.
Processed 1000000 messages in 2.23491895198822 s.
Processed 1000000 messages in 0.12816309928894043 s.
Processed 1000000 messages in 0.13089799880981445 s.
```
Now the performance sucks again, and has negative scaling with parallelism, seemingly because of all the ARC traffic (it's essentially the only thing in the time profiles, with `swift_retain` and `swift_release` accounting for 99% of samples).
I'm not sure how to eliminate that ARC traffic. I'm not even certain what's being retain-released (the Allocations tool in Instruments apparently doesn't work for Swift retains and releases, only Objective-C ones). From stepping through the disassembly in Xcode it appears that at least some of the ARC traffic is around `Foundation.__DataStorage`, which is the underlying storage for `Data`. Unfortunately the copy-on-write nature of `Data` is hurting you here (pity Swift doesn't offer an immutable version, like Objective-C did).
There's also some traffic around managing the `Record`s, of course - e.g. retaining them as they're appended to the `records` `Array`. And some technically redundant traffic when concatenating those arrays together at the end (because `Array` doesn't have a consuming version of `append(contentsOf:)` that could just 'steal' the existing +1 from the appended collection).
I don't think you can eliminate that in a real-world program if you genuinely need to accumulate the `Record`s, but if you don't need to - e.g. if you can pass them off immediately for processing, as they're instantiated - you can avoid at least some of that ARC traffic and general buffering overhead.
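As a sketch of that hand-off approach - where `process(_:)` is a hypothetical stand-in for whatever downstream processing you have - each chunk's array gets passed along (and released) as soon as its task finishes, skipping the final concatenation entirely:

```swift
// Sketch: hand each chunk's records off as it completes, instead of
// accumulating everything into one big result array. `process(_:)` is
// a placeholder for your real downstream consumer.
static func streamResults(in messages: [Message],
                          numberOfChunks: Int,
                          process: @Sendable ([Record]) -> Void) async {
    await withTaskGroup(of: [Record].self) { group in
        for phase in 0..<numberOfChunks {
            group.addTask {
                var records = [Record]()

                for messageIndex in stride(from: phase, to: messages.count, by: numberOfChunks) {
                    records.append(await Parser.parse(data: messages[messageIndex].data))
                }

                return records
            }
        }

        for await chunk in group {
            process(chunk)  // Handed off immediately; no append(contentsOf:).
        }
    }
}
```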
That all said, it's still a bit of a mystery to me why it's so much faster to just do it serially, as I haven't found any actual shared references of consequence between your tasks that would support a cache ping-ponging scenario. Yet your benchmark is basically nothing but ARC activity, so I don't know what else it could be.
Sidenote: integrated processing pipelines are way more efficient
Intermediaries (e.g. the results of non-lazy `map` and `filter` operations) aren't just inefficient because of the temporary memory they allocate; they have truly horrible effects on the efficiency of the overall processing. Doing one type of operation at a time, across all the data, is fundamentally inefficient. CPUs just don't like it (nor do GPUs or NPUs, although they're more tolerant of it).
Performance is basically dominated by power / heat, which is largely about moving electrons, so you want to move as few electrons as possible, over as short a distance as possible. That is: load each piece of data only once, do the complete algorithm on it, and emit only the final result back to memory (or the network, or wherever). Don't go any further from the processing units than you have to (registers > L0 > L1 > L2 > RAM > etc).
This is not easy to do in languages like Swift because (a) their libraries tend to discourage it, e.g. `map` being eager by default, and (b) the language relies very heavily on the imperfect optimiser to actually generate good machine code even if you do the right thing (like use `lazy`). But you can get a long way just by trying to avoid any explicit collection intermediaries in your code (some bounded cases, like SIMD types, aside).
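To make that concrete, here's an illustrative sketch (over a hypothetical `values` array) of the difference between an eager chain and a single fused loop:

```swift
let values = Array(0..<10_000_000)

// Eager chain: `map` materialises a full intermediate array, writing every
// element out to memory just so `filter` can read it all back in again.
let eagerResult = values.map { $0 &* 3 }.filter { $0.isMultiple(of: 2) }

// Fused loop: each element is loaded once, the whole computation runs on
// it, and only the final results are emitted. No intermediary collection.
var fusedResult = [Int]()
for value in values {
    let tripled = value &* 3
    if tripled.isMultiple(of: 2) {
        fusedResult.append(tripled)
    }
}
```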
Sidenote: `async` functions are slow[er]
As presented, you're declaring `parse(data:)` as `async`. Maybe it needs to be in your real program? But if not, making it sync makes this simple benchmark run five times faster (the impact is probably lower in your real-world code, once intermingled with actual work). It appears this is because of the overhead of allocating task state in order to call the async function - state which is never really used, because your so-called async function never actually hits a suspension point.
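Concretely, the sync variant is just the earlier hash-summing version with the keyword removed (and the `await` dropped at the call site); a sketch:

```swift
struct Parser {
    // Decode the payload of a message into a Record.
    // Identical to the async version, minus `async`: with no suspension
    // point there's no task state to allocate per call.
    static func parse(data: Data) -> Record {
        var hash: UInt8 = 0

        data.withUnsafeBytes {
            for byte in $0 {
                hash &+= byte
            }
        }

        blackHole(hash)
        return Record()
    }
}
```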
Sidenote: `Data` is slow
Using `[UInt8]` instead of `Data` makes the benchmark about twice as fast (when actually accessing the bytes), because `Data` is fundamentally slow; you cannot directly access its contents. Either:
- You go through a bunch of function call overhead for every byte, which is really slow, or
- You use `withUnsafeBytes`, but that duplicates the underlying malloc allocation and does a `memcpy` to populate it, which, while surprisingly faster than any other method, is of course still really inefficient. And it will be horrific if your data is large and you run out of readily free RAM.
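As an illustrative sketch - assuming `Message` held its payload as `[UInt8]` instead - the checksum can then iterate the array's contiguous storage directly:

```swift
// Sketch: the same checksum over [UInt8] instead of Data. The array's
// elements are read in place - no per-byte call overhead, and no duplicate
// allocation or memcpy just to get at the bytes.
static func parse(bytes: [UInt8]) -> Record {
    var hash: UInt8 = 0

    for byte in bytes {
        hash &+= byte
    }

    blackHole(hash)
    return Record()
}
```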
See URLSession performance for reading a byte stream as another example.