Huge performance hit from exclusive memory access checks

Hi there,

tl;dr Writing to an array in a tight loop gets really slow because of memory exclusivity checks. What's the recommended way to avoid that?

I implemented a decoder/encoder for an image format called QOI.
My naive implementation just used an array of UInt8 to store information about each pixel colour (one array slot = one pixel channel value):

public class Image {
    ...
    public var pixels : [UInt8]
    ...

Decoder performance was around one order of magnitude slower (10x) than the reference C implementation, running on the same machine (a MacBook Pro M1 with 8 CPU cores).

Since I didn't expect such a huge slowdown, I ran Instruments on the benchmark tool to see where the problem was.

In the profiler, I noticed that decoder time was dominated by calls to swift_beginAccess and swift_endAccess.

I found a post on the Swift blog explaining that those runtime calls are enabled by default in release builds as of Swift 5.

After I disabled the exclusivity checks (--enforce-exclusivity=none), decoder performance was significantly better and in the realm of what I expected from Swift (2x slower than the reference C implementation).


What is the recommended way of dealing with these situations?

I'm leaning towards allocating a Data object with enough space to store the pixel information and doing the writes directly via an UnsafeMutableRawPointer. It's not as convenient as writing directly to the array of pixels, but I'm guessing it will be faster.
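
A minimal sketch of that idea, assuming a hypothetical helper named makeGreyPixels that is not part of SwiftQOI. It allocates a Data of the right size and writes every channel through the UnsafeMutableRawBufferPointer that Data.withUnsafeMutableBytes hands to the closure, so the hot loop never goes through an array stored in a class:

```swift
import Foundation

// Hypothetical helper (not from SwiftQOI): builds a uniformly grey image
// by writing through a raw buffer pointer instead of an array subscript.
func makeGreyPixels(width: Int, height: Int, channels: Int, value: UInt8) -> Data {
    var data = Data(count: width * height * channels)
    data.withUnsafeMutableBytes { (raw: UnsafeMutableRawBufferPointer) in
        var pos = 0
        while pos < raw.count {
            for channel in 0..<channels {
                // Treat a fourth channel as alpha and keep it fully opaque.
                raw[pos + channel] = (channel == 3) ? 255 : value
            }
            pos += channels
        }
    }
    return data
}

let grey = makeGreyPixels(width: 4, height: 2, channels: 4, value: 128)
print(grey.count)
```

Whether this beats a plain local [UInt8] is something to measure; the pointer is only valid inside the closure, so all writes have to happen there.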


How to replicate the measurements taken

All the code can be found on GitHub in the track-5/SwiftQOI repository (QOI image decoder/encoder written in Swift).

Clone the repo and check out the branch benchmark-repro.

Create a release build with the command below:

swift build --product SwiftQOIBenchmark -c release

You can turn off the exclusivity checks by editing the Package.swift file and adding the flags that are commented out there.


Do you find any difference when Image is a struct?

I can only add to the chorus; Swift has a performance problem. It's often unpredictable and unexpected and I deal with it, as you did, through profiling. I also disable exclusivity checks for release (though I test with them on from time to time).

I'm optimistic about Swift performance. The recent ARC roadmap will go a long way, as will things like performance annotations in hot paths [1, 2]. It's unclear to me whether the ARC roadmap will improve exclusive memory access checks like these.


Putting an array inside a class is indeed going to pessimize your performance. You may find the following thread and the links there to be helpful:


@Yup_Lucas I just profiled your project and I came across something a little strange. You were writing to the image's pixels via image.pixels[index] = ... from your static decode method. The exclusivity checks are necessary because you were accessing the pixels belonging to the image from outside the image. I modified your code to write to its own pixels array inside your static decode method, then that gets passed into Image's init.

Here are my changes.

    public static func decode(_ data: Data) throws -> Image {
        return try data.withUnsafeBytes { rawBufferPointer in
            ...
            
            var pixels = [UInt8](repeating: 255, count: channels * width * height)

            while pixelPos < pixelsLen {
                ...
                pixels[pixelPos] = ...
            }

            ...
            return Image(
                width: width,
                height: height,
                channels: channels,
                colorspace: colorspace,
                pixels: pixels
            )
        }
    } 

Good catch! I didn't realize I was getting hit by accessing the array from the static method.
I've applied the changes you suggested (avoid accessing class members from outside) and re-ran the benchmarks.

From a quick skim of the results, I'm getting something much closer to the reference implementation. I'll analyze it further and update this thread, but it seems the problem was as @timdecode pointed out.


I did some more investigation and followed @timdecode's suggestion to switch to a struct instead of class.

Here you can see a small repro using a class: Compiler Explorer
And here the same repro but using a struct: Compiler Explorer

The point is to compare the assembly code generated for each approach.
The code generated for the struct sample is much simpler than the code generated for the class, regardless of whether the member variables are accessed from inside or outside the data object.

Using a struct generates code pretty close to what you'd get from C.
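
For readers without the Compiler Explorer links at hand, here is a rough sketch of the kind of pair being compared. ClassImage, StructImage and fill are hypothetical names, not the code behind the links. The class version writes through a stored property of a class instance on every iteration, which is the kind of access Swift may have to check dynamically; the struct version is mutated through inout, where exclusivity can generally be verified at compile time:

```swift
// Hypothetical stand-ins for the class and struct variants.
final class ClassImage {
    var pixels: [UInt8]
    init(count: Int) { pixels = [UInt8](repeating: 0, count: count) }
}

struct StructImage {
    var pixels: [UInt8]
    init(count: Int) { pixels = [UInt8](repeating: 0, count: count) }
}

// Every write goes through a class stored property from outside the class.
func fill(_ image: ClassImage) {
    for i in image.pixels.indices {
        image.pixels[i] = UInt8(truncatingIfNeeded: i)
    }
}

// The struct is passed inout, so the whole loop runs under one access
// that the caller already holds.
func fill(_ image: inout StructImage) {
    for i in image.pixels.indices {
        image.pixels[i] = UInt8(truncatingIfNeeded: i)
    }
}

let classImage = ClassImage(count: 1024)
fill(classImage)

var structImage = StructImage(count: 1024)
fill(&structImage)
```

Both variants produce the same bytes; the interesting part is the generated code, and the optimizer may remove some of the checks in the class version depending on what it can prove.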

Did some more experimentation and came to the conclusion there's no big difference between using a struct vs constructing the pixels array locally and passing it to the class constructor.

Here's some data (lower numbers = less time in milliseconds taken):

I'd say the takeaway here is: Whenever constructing a value used by a class, ensure it is either constructed locally and provided once or ensure the construction happens inside the class.
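
Both halves of that takeaway can be sketched in a few lines. GreyImage, makeGradient and invert are hypothetical names used only for illustration. The static factory builds the array in a local variable and hands it to the initializer once; the instance method mutates the stored array through withUnsafeMutableBufferPointer, so the property is accessed once for the whole loop rather than once per element:

```swift
// Hypothetical single-channel image class, for illustration only.
final class GreyImage {
    let width: Int
    let height: Int
    private(set) var pixels: [UInt8]

    init(width: Int, height: Int, pixels: [UInt8]) {
        self.width = width
        self.height = height
        self.pixels = pixels
    }

    // Option 1: construct locally, provide once.
    static func makeGradient(width: Int, height: Int) -> GreyImage {
        var pixels = [UInt8](repeating: 0, count: width * height)
        for i in pixels.indices {
            pixels[i] = UInt8(truncatingIfNeeded: i)
        }
        return GreyImage(width: width, height: height, pixels: pixels)
    }

    // Option 2: mutate inside the class under a single access to `pixels`.
    func invert() {
        pixels.withUnsafeMutableBufferPointer { buffer in
            for i in buffer.indices {
                buffer[i] = 255 - buffer[i]
            }
        }
    }
}

let image = GreyImage.makeGradient(width: 16, height: 16)
image.invert()
```

This is a sketch under those assumptions rather than a measured recommendation; it is worth profiling both options against the benchmark above.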

That said, using struct by default instead of class would've performed as expected out of the box.


Could you add another column for C? That might be useful to have.
