__CFSafelyReallocate does not work well with concurrency? (mach_msg2_trap)

When I run structured concurrency code that extensively uses NSData, the system gets stuck: most threads soon end up in mach_msg2_trap(), and system performance becomes very, very low. I have built a demo for it: https://github.com/hibernat/concurrency-with-data-bug

This is related to: https://developer.apple.com/forums/thread/748253?page=1#782485022

Originally I was using the Gzip package, not Foundation's .lzfse, and the behavior was the same, so the issue is NOT the decompression; the trouble seems to be the concurrent memory reallocations.

Tested, and reproducible, on macOS 14.4, Xcode 15.3 (15E204a)

The example loads files from the file system, which is a blocking operation and will block the thread. Swift's concurrency pool is fixed to the number of logical cores the system has and never spawns more threads. It's best to move this kind of blocking work off the concurrency pool. I don't really have a good answer for that today other than falling back to DispatchQueues.
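A minimal sketch of that fallback, assuming a hypothetical `readFileOffPool` helper (not part of the demo repo): the blocking `Data(contentsOf:)` call runs on a plain dispatch queue, and a checked continuation bridges the result back into async code, so no cooperative-pool thread is blocked.

```swift
import Foundation

// Hypothetical helper: perform the blocking read on a dedicated
// dispatch queue instead of the Swift concurrency cooperative pool.
let fileIOQueue = DispatchQueue(label: "file-io", attributes: .concurrent)

func readFileOffPool(at url: URL) async throws -> Data {
    try await withCheckedThrowingContinuation { continuation in
        fileIOQueue.async {
            do {
                // Blocks a dispatch-queue thread, not a cooperative-pool thread.
                let data = try Data(contentsOf: url)
                continuation.resume(returning: data)
            } catch {
                continuation.resume(throwing: error)
            }
        }
    }
}
```

Callers simply write `let data = try await readFileOffPool(at: url)`; the cooperative pool stays free for CPU-bound work like decompression.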


OK, so, if I understand it right:

  • a new task is created to load the file, but it runs on some thread, which then becomes blocked by the filesystem operation
  • this happens multiple times, so all threads become blocked by filesystem operations; when I pause, the threads are in mach_msg2_trap() because they are waiting for the filesystem
  • Swift concurrency does not create more threads than there are physical CPU cores

Makes sense. Thanks a lot!

Yes, that's correct.

Alternatively, you can use NIOFileSystem, which offloads file I/O to a separate thread pool that you can size yourself. However, it vends ByteBuffers rather than Data.

This will still block a thread under the hood. No Apple OS provides a truly non-blocking file I/O API today, so you will always block threads. It is therefore a good idea to still limit the number of concurrent file I/O operations in flight.
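One way to cap in-flight operations, sketched with a task group (the `maxInFlight` limit and the helper name are illustrative, not an existing API): seed the group with a fixed number of reads and start a new one only as each finishes.

```swift
import Foundation

// Sketch: read many files, but keep at most `maxInFlight` blocking
// reads running at once. Results are collected in completion order,
// not input order.
func readAll(urls: [URL], maxInFlight: Int = 4) async throws -> [Data] {
    var results: [Data] = []
    try await withThrowingTaskGroup(of: Data.self) { group in
        var iterator = urls.makeIterator()
        // Seed the group with the first `maxInFlight` reads…
        for _ in 0..<maxInFlight {
            guard let url = iterator.next() else { break }
            group.addTask { try Data(contentsOf: url) }
        }
        // …and start a new read only when one finishes.
        while let data = try await group.next() {
            results.append(data)
            if let url = iterator.next() {
                group.addTask { try Data(contentsOf: url) }
            }
        }
    }
    return results
}
```

Note this still blocks cooperative-pool threads during each read; the point of the sketch is only the bounded in-flight count.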

Hmm, it seems to me that things are significantly more complex. I put all filesystem operations into an actor, so there is truly just one filesystem operation at a time.

Unfortunately, the major performance degradation is still there. After a while, most threads are waiting on something in mach_msg2_trap(), but it cannot be the filesystem blocking them, because no filesystem operation starts before the previous one finishes.
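The setup described above looks roughly like this (an illustrative sketch, not the repo's exact code): an actor serializes the reads, while the decompression still runs concurrently outside the actor.

```swift
import Foundation

// Illustrative actor: its methods are mutually exclusive, so at most
// one filesystem read runs at a time. Note the read still blocks
// whatever thread the actor happens to be running on.
actor FileLoader {
    func load(_ url: URL) throws -> Data {
        try Data(contentsOf: url)
    }
}
```

Usage: `let data = try await loader.load(url)`, then decompress the returned data outside the actor, so only the I/O is serialized.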

If you check the latest commit in the repo https://github.com/hibernat/concurrency-with-data-bug, things become clearer.

Hot topics:

  • once you change .decompressed(using: .lzfse) to .compressed(using: .lzfse), the performance degradation does not appear. It just works!
  • when there is no compression/decompression at all, performance is heavily limited by the actor; and when I put some dummy computation there (instead of the compression/decompression), there is no performance degradation at all!
  • when I add let _ = try (data as NSData).compressed(using: .lzfse) just below let _ = try (data as NSData).decompressed(using: .lzfse), there is no performance degradation.

It truly seems to me that the issue is NOT the filesystem, but memory management under Swift concurrency.
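For reference, the workaround from the last bullet looks roughly like this (an observed workaround from my tests on macOS 14.4, not a documented fix; the function name is illustrative):

```swift
import Foundation

// Sketch of the workaround: performing a throwaway compression next to
// the decompression avoided the slowdown in my tests.
func decompressWithWorkaround(_ compressedData: Data) throws -> Data {
    let ns = compressedData as NSData
    let result = try ns.decompressed(using: .lzfse) as Data
    _ = try ns.compressed(using: .lzfse)  // dummy compression; result is discarded
    return result
}
```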

DispatchIO isn’t truly non-blocking? Elsewhere I’ve seen it described as being implemented on kqueue, at least once upon a time.

Data can map files into memory and therefore delay the actual file I/O until access. I'm not sure whether this is actually happening here. IIRC, Data uses some heuristics and doesn't always memory-map files, so it may or may not happen serially.
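For completeness, the mapping behavior can be made explicit rather than left to those heuristics (a sketch; `.mappedIfSafe` is the documented Data.ReadingOptions value, the function names are illustrative):

```swift
import Foundation

// Sketch: choose the mapping behavior explicitly.
func loadMapped(_ url: URL) throws -> Data {
    // Memory-maps the file; real I/O is deferred to page faults on access.
    try Data(contentsOf: url, options: .mappedIfSafe)
}

func loadEager(_ url: URL) throws -> Data {
    // Reads the whole file up front.
    try Data(contentsOf: url)
}
```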

How long are you running the test?
I can't really reproduce the issue locally. The CPU is busy doing decompression. There are a lot of system calls, but that's expected since we are doing blocking I/O and potentially allocating memory during decompression.


(The trace is too big to be attached)

I have rewritten the example using Dispatch with a very similar trace:

DispatchQueue.concurrentPerform(iterations: 1_000_000) { _ in
    let url = URL(filePath: "/Users/davidnadoba/Downloads/concurrency-with-data-bug-master/file.lzfse")
    // concurrentPerform's closure is non-throwing, so errors must be handled here
    let data = try! Data(contentsOf: url)
    let _ = try! (data as NSData).decompressed(using: .lzfse)
}


(again, too big to attach)

AFAIK all file I/O is blocking. Only Linux can do truly non-blocking file I/O, via io_uring. @georgebarnett might be able to elaborate more on this, as I can't recall the details.

AFAIK all file I/O is blocking.

Correct. On Apple platforms the only way to transfer bytes between a file and memory is via the synchronous read and write system calls [1]. Anything that claims to be async (including aio_* and Dispatch I/O) is just a wrapper with careful (or not :-) thread management.

Having said that, file system I/O [2] is quite fast and the underlying hardware has limited parallelism, so you can make a lot of headway with careful thread management (-:

If you’re waiting in mach_msg2_trap you’re not waiting for the file system, because the file system is part of BSD which has its own system calls that don’t go through Mach messaging. The backtrace in hibernat’s DevForums thread looks like this:

#0	… mach_msg2_trap ()
#1	… mach_msg2_internal ()
#2	… vm_copy ()
#3	… szone_realloc ()
#4	… _malloc_zone_realloc ()
#5	… _realloc ()
#6	… __CFSafelyReallocate ()
#7	… _NSMutableDataGrowBytes ()
#8	… -[NSConcreteMutableData appendBytes:length:] ()
#9	… -[_NSDataCompressor processBytes:size:flags:] ()
#10	… -[NSData(NSDataCompression) _produceDataWithCompressionOperation:algorithm:handler:] ()
#11	… -[NSData(NSDataCompression) _decompressedDataUsingCompressionAlgorithm:error:] ()

and that’s all about NSData moving memory about. Notably, vm_copy is a Mach routine, and hence the mach_msg2_trap.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

[1] And their friends, most notably pread and pwrite.

[2] I’m talking about transferring bytes to and from memory. The file system also supports a bunch of metadata operations, like traversing directory hierarchies, and that’s a different story.


The latest commit in https://github.com/hibernat/concurrency-with-data-bug includes the DispatchQueue.concurrentPerform alternative code.

I did some rough time measurements of how many compressions or decompressions my M3 Max (14 cores) can do in 1 minute:

  • DispatchQueue compressions: 750
  • DispatchQueue decompressions: 1500
  • Swift concurrency compressions: 800
  • Swift concurrency decompressions: 270

All measured with Xcode debug builds.

And here the major performance issue is visible: Swift concurrency decompressions are heavily affected by mach_msg2_trap, and not all CPU cores are busy. The Instruments screenshots above are missing the important CPU chart, ideally broken down per core.

Until you hear the fans, it is not performing well... :slight_smile:

The performance of the demo code in the repo degrades quickly; within a minute the system is so slow that you literally wait for any new task to finish in Swift concurrency.

I discovered the issue in my other code, where it runs for 3-10 minutes before it becomes sluggish.

I ran the test with Swift concurrency decompressions again, and it finished 800 tasks in the first minute (seems OK), but only 200 in the second minute.

Yesterday, when I experimented with the demo code, I put a long text into the Swift code and converted it to Data. Then I first compressed the data and then decompressed it again. This works fine.

Also, with the code in the repository, when I compress the file first (yes, compressing already-compressed data) and then decompress, it works fine. And the performance is almost identical to compressions alone: approx. 800 tasks finished in a minute.