Huge performance hit from exclusive memory access checks

Hi there,

tl;dr Writing to an array in a tight loop gets really slow because of memory exclusivity checks. What's the recommended way to avoid that?

I implemented a decoder/encoder for an image format called QOI.
My naive implementation just used an array of UInt8 to store information about each pixel colour (one array slot = one pixel channel value):

public class Image {
    ...
    public var pixels : [UInt8]
    ...

Decoder performance was around one order of magnitude slower (10x) than the reference C implementation, running on the same machine (a MacBook Pro M1 with 8 CPU cores).

Since I didn't expect such a huge slowdown, I ran Instruments on the benchmark tool to see where the problem was.

In the profiler, I noticed most of the decoder time was dominated by calls to swift_beginAccess and swift_endAccess.

I found a post on the Swift blog explaining that those runtime calls were added by default to release builds in Swift 5.

After I disabled the exclusivity checks (--enforce-exclusivity=none), decoder performance was significantly better and in the realm of what I expected from Swift (2x slower than the reference C implementation).


What is the recommended way of dealing with these situations?

I'm leaning towards allocating a Data object with enough space to store the pixel information and doing the writes directly via an UnsafeMutableRawPointer. It's not as convenient as directly writing to the array of pixels, but I'm guessing it will be faster.
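
For illustration, here is a minimal sketch of the pointer-based idea, assuming a plain [UInt8] instead of Data. The helper fillPixels and its parameters are hypothetical names and are not taken from SwiftQOI. Array's withUnsafeMutableBufferPointer hands the closure the array's storage once, so the loop body indexes a buffer pointer rather than the array. Whether this helps depends on where the checks actually come from.

```swift
// Hypothetical helper: fills an RGBA buffer with a solid colour by writing
// through a buffer pointer instead of subscripting the array in the loop.
func fillPixels(width: Int, height: Int, rgba: (UInt8, UInt8, UInt8, UInt8)) -> [UInt8] {
    let channels = 4
    var pixels = [UInt8](repeating: 0, count: channels * width * height)

    // The array's storage is exposed once for the whole loop.
    pixels.withUnsafeMutableBufferPointer { buffer in
        var pos = 0
        while pos < buffer.count {
            buffer[pos] = rgba.0
            buffer[pos + 1] = rgba.1
            buffer[pos + 2] = rgba.2
            buffer[pos + 3] = rgba.3
            pos += channels
        }
    }
    return pixels
}

let solid = fillPixels(width: 2, height: 1, rgba: (10, 20, 30, 255))
```

Buffer-pointer subscripts are only bounds-checked in debug builds, so the loop bounds have to be right.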


How to replicate the measurements taken

All the code can be found at GitHub - track-5/SwiftQOI: QOI image decoder/encoder written in Swift

Clone the repo and check out the branch benchmark-repro.

Create a release build with the command below:

swift build --product SwiftQOIBenchmark -c release

You can turn off the exclusivity checks by editing the Package.swift file and adding the flags that are commented out there.
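
As a rough idea of what such a manifest change can look like, here is a hypothetical Package.swift fragment. The package layout and target name are placeholders and may not match the repo. The flag shown is the one documented in the Swift 5 exclusivity enforcement blog post (-enforce-exclusivity=unchecked); the flags commented in the repo may be spelled differently.

```swift
// swift-tools-version:5.5
import PackageDescription

// Hypothetical manifest fragment; package and target names are placeholders.
let package = Package(
    name: "SwiftQOI",
    targets: [
        .executableTarget(
            name: "SwiftQOIBenchmark",
            swiftSettings: [
                // Turns off the run-time exclusivity checks for this target only.
                .unsafeFlags(["-enforce-exclusivity=unchecked"])
            ]
        )
    ]
)
```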


Do you find any difference when Image is a struct?

I can only add to the chorus; Swift has a performance problem. It's often unpredictable and unexpected and I deal with it, as you did, through profiling. I also disable exclusivity checks for release (though I test with them on from time to time).

I'm optimistic about Swift performance. The recent ARC roadmap will go a long way, as will things like performance annotations in hot paths [1, 2]. It's unclear to me whether the ARC roadmap will improve exclusive memory access checks like these.


Putting an array inside a class is indeed going to pessimize your performance. You may find the following thread and the links there to be helpful:


@Yup_Lucas I just profiled your project and I came across something a little strange. You were writing to the image's pixels via image.pixels[index] = ... from your static decode method. The exclusivity checks are necessary because you were accessing the pixels belonging to the image from outside the image. I modified your code so that the static decode method writes to its own local pixels array, which then gets passed into Image's init.

Here are my changes.

    public static func decode(_ data: Data) throws -> Image {
        return try data.withUnsafeBytes { rawBufferPointer in
            ...
            
            var pixels = [UInt8](repeating: 255, count: channels * width * height)

            while pixelPos < pixelsLen {
                ...
                pixels[pixelPos] = ...
            }

            ...
            return Image(
                width: width,
                height: height,
                channels: channels,
                colorspace: colorspace,
                pixels: pixels
            )
        }
    } 

Good catch! I didn't realize I was getting hit by accessing the array from the static method.
I've applied the changes you suggested (avoid accessing class members from outside) and re-ran the benchmarks.

Quickly skimming through the results, I'm getting something much closer to the reference implementation. I will analyze it further and update this thread, but it seems like the problem was as @timdecode pointed out.


I did some more investigation and followed @timdecode's suggestion to switch to a struct instead of class.

Here you can see a small repro using a class: Compiler Explorer
And here the same repro but using a struct: Compiler Explorer

The point is to compare the assembly code generated for each approach.
The code generated for the struct sample is much simpler than the code generated for the class, regardless of whether member variables are accessed from inside or outside the data object.

Using a struct generates code pretty close to what you'd get from C.
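
A minimal sketch of the struct variant, not the code behind the Compiler Explorer links; PixelBuffer, invert and demoInvert are hypothetical names. The writes happen outside the struct through an inout parameter, and the caller's single access to the value covers every write in the loop.

```swift
// Hypothetical value-type image; not the type from SwiftQOI.
struct PixelBuffer {
    var pixels: [UInt8]
}

// Writes from outside the struct through an inout parameter. Exclusivity of
// `buffer` inside this function follows from the inout access at the call site.
func invert(_ buffer: inout PixelBuffer) {
    for index in buffer.pixels.indices {
        buffer.pixels[index] = 255 - buffer.pixels[index]
    }
}

// Small driver that keeps the value in a local variable.
func demoInvert() -> [UInt8] {
    var buffer = PixelBuffer(pixels: [0, 100, 255])
    invert(&buffer)
    return buffer.pixels
}
```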

Did some more experimentation and came to the conclusion there's no big difference between using a struct vs constructing the pixels array locally and passing it to the class constructor.

Here's some data (lower numbers = less time in milliseconds taken):

I'd say the takeaway here is: Whenever constructing a value used by a class, ensure it is either constructed locally and provided once or ensure the construction happens inside the class.

That said, using struct by default instead of class would've performed as expected out of the box.
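
The takeaway can be sketched as follows, under the assumption of a simplified image class; PixelImage, fillThroughProperty and makeFilledImage are hypothetical names, not SwiftQOI API. The first function writes through the class property on every iteration, while the second builds the array locally and hands it to the object once.

```swift
// Hypothetical reference-type image; names are illustrative only.
class PixelImage {
    var pixels: [UInt8]
    init(pixels: [UInt8]) { self.pixels = pixels }
}

// Writes through the class property on every iteration. Each write is a
// separate access to `image.pixels` that may need a run-time exclusivity check.
func fillThroughProperty(_ image: PixelImage, with value: UInt8) {
    for index in image.pixels.indices {
        image.pixels[index] = value
    }
}

// Builds the array in a local variable and provides it to the object once.
func makeFilledImage(count: Int, value: UInt8) -> PixelImage {
    var pixels = [UInt8](repeating: 0, count: count)
    for index in pixels.indices {
        pixels[index] = value
    }
    return PixelImage(pixels: pixels)
}
```

Both functions produce the same pixels; only the number of accesses to the class property differs.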


Could you add another column for C? That might be useful to have.


I actually just found this thread. The "Improve memory usage and performance with Swift" talk from @nnnnnnnn also references time spent in swift_beginAccess in a QOI parser.

It looks like migrating from struct to class is also the approach from the talk. There seems to be at least one more clue in the exclusivity enforcement announcement to improve perf:[1]

As a general guideline, avoid performing class property access within the most performance critical loops, particularly on different objects in each loop iteration. If that isn’t possible, making the class properties private or internal can help the compiler prove that no other code accesses the same property inside the loop.

My understanding is that it sounds like migrating from struct to class will improve performance… but migrating from public to internal might improve performance.

In the example from WWDC… the class properties already were internal when the performance of swift_beginAccess was measured.

Does anyone have more details how to learn when access control can and cannot be used to improve perf in place of migrating from class to struct?


  1. Swift.org - Swift 5 Exclusivity Enforcement

Isn’t it the opposite? The code migrates from state held in an internal class to direct properties of the struct so it no longer has to pay those costs.


Access control can have the effect of eliminating getter/modify/setter overhead by making a field effectively final, which allows the compiler to see accesses as being direct to storage. If appropriate, making the field explicitly final can get you this benefit more consistently. Access control might also help with effect analysis, which is where the compiler analyzes and propagates what mutable state is accessed by each method, but that's a more nonlocal benefit and harder to predict.
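
A small sketch of what that can look like in practice; HitCounter is a hypothetical type. The class is final and its storage is private, so no subclass can override the property and only code in this declaration (and its same-file extensions) can reach the storage. Whether the optimizer actually removes any checks still depends on the build and the surrounding code.

```swift
// Hypothetical counter with explicitly final, private storage.
final class HitCounter {
    // Private and effectively final: accesses can be treated as direct to storage.
    private var hits = 0

    // All mutation of `hits` happens inside the class.
    func record(_ count: Int) {
        for _ in 0..<count {
            hits += 1
        }
    }

    // Read-only view for callers.
    var total: Int { hits }
}
```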

Generally, I would expect code using mutable classes to always be subject to more dynamic exclusivity checks, since struct field accesses can always be statically derived from the access to the containing struct.

This guidance here:

As a general guideline, avoid performing class property access within the most performance critical loops, particularly on different objects in each loop iteration.

is more about reducing the number of accesses by hoisting them out of loops. If you have:

for ... {
  mutate(&object.field)
  doOtherStuff()
}

then object.field will undergo a separate exclusive access on every call to mutate inside of the loop, whereas something like:

func mutateLoop(field: inout Field) {
  for ... {
    mutate(&field)
    doOtherStuff()
  }
}

mutateLoop(field: &object.field)

will do only one exclusivity check on object.field, and maintain that exclusivity assertion for the duration of mutateLoop. (Whether that transformation is valid or not hinges on whether doOtherStuff can legitimately access object.field via another reference to object in between calls to mutate, which is where effect analysis may or may not help the optimizer.)
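
A runnable version of the same idea, as a sketch with hypothetical names (Histogram, tallyPerIteration, tally). The first function accesses the class property inside the loop; the second receives the property as an inout parameter, so the access to histogram.counts begins once at the call site and is held for the whole call.

```swift
// Hypothetical class holding mutable state.
final class Histogram {
    var counts = [Int](repeating: 0, count: 256)
}

// Per-iteration version: every `histogram.counts[...] += 1` begins and ends
// its own access to the class property.
func tallyPerIteration(_ bytes: [UInt8], into histogram: Histogram) {
    for byte in bytes {
        histogram.counts[Int(byte)] += 1
    }
}

// Hoisted version: the loop only sees an inout array.
func tally(_ bytes: [UInt8], into counts: inout [Int]) {
    for byte in bytes {
        counts[Int(byte)] += 1
    }
}

let histogram = Histogram()
// One access to `histogram.counts` covers the entire call.
tally([1, 1, 2], into: &histogram.counts)
```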


Hmm… the WWDC talk starts with something like this:

struct S1 {
  class C {
    var x = 0
    var y = 0
  }
  
  var c = C()
  
  func increment() {
    c.x += 1
    c.y += 1
  }
}

And migrates to something like this:

struct S2 {
  var x = 0
  var y = 0
  
  mutating func increment() {
    x += 1
    y += 1
  }
}

This reduces the calls to swift_beginAccess. My question was more along these lines: if we wanted to preserve the nested class as a reference type, do we have any other options to help? Something like:

struct S3 {
  private final class C {
    var x = 0
    var y = 0
  }
  
  private let c = C()
  
  func increment() {
    c.x += 1
    c.y += 1
  }
}

Do we know to what extent private and final could help the compiler optimize out those swift_beginAccess calls?

If x and y do not need to be independently mutable, one thing you could do is put them together in a single struct, and use the inout parameter technique again to combine both writes in increment into a single exclusive operation:

struct S3 {
  private final class C {
    struct CS {
      var x = 0
      var y = 0

      mutating func increment() {
        x += 1
        y += 1
      }
    }
    var value = CS()
  }
  
  private let c = C()
  
  func increment() {
    // Will do one exclusivity check instead of two
    c.value.increment()
  }
}

That's interesting you bring up modify… that was going to be my next question. From reading through SE-0474 it does sound like we can expect the coroutine accessor to mutate storage exclusively:

This means that the yielding mutate accessor has exclusive access to self, and that exclusive access extends for the duration of the accessor's execution, including its suspension after yield-ing.

If we had something like:

struct S4 {
  private final class C {
    var x = 0
    var y = 0
  }
  
  private let c = C()
  
  var x: Int {
    // TODO: coroutine accessors on c.x
  }
  var y: Int {
    // TODO: coroutine accessors on c.y
  }
}

Do we know what the effects on swift_beginAccess would be from that?

Non-final stored properties will have a modify accessor generated for them implicitly, which will generally look something like this:

class C {
  private var _x: Int // the storage
  var x: Int {
    yielding mutate {
      // implicit swift_beginAccess(&_x)
      yield &_x
      // implicit swift_endAccess(&_x)
    }
  }
}

so you generally wouldn't get much benefit from writing one yourself. The issue for exclusivity check optimization is that those access checks are compiled into the coroutine itself, so they cannot be optimized away unless the coroutine can be inlined first, which provides an additional barrier for optimizing accesses.
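
For reference, a hand-written equivalent can be sketched with the underscored _modify accessor, which current toolchains accept as an unofficial spelling; this is an assumption about the toolchain in use, and Box is a hypothetical type. The run-time access checks are inserted by the compiler and do not appear in the source.

```swift
// Hypothetical class with an explicit coroutine accessor.
final class Box {
    private var _x = 0   // the storage

    var x: Int {
        get { _x }
        // Yields the storage so callers can mutate it in place.
        _modify { yield &_x }
    }
}

let box = Box()
box.x += 1   // runs the _modify coroutine around the in-place increment
```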


TBH… I do not have much experience with final stored properties. Here is all I found from TSPL:[1]

Apply this modifier to a class or to a property, method, or subscript member of a class. It’s applied to a class to indicate that the class can’t be subclassed. It’s applied to a property, method, or subscript of a class to indicate that a class member can’t be overridden in any subclass.

And:[2]

Any attempt to override a final method, property, or subscript in a subclass is reported as a compile-time error. […] You can mark an entire class as final by writing the final modifier before the class keyword in its class definition (final class). Any attempt to subclass a final class is reported as a compile-time error.

I think this means that a final class implies final stored properties. Does that sound correct?

In this example:

struct S5 {
  private final class C {
    final var x = 0
    final var y = 0
  }
  
  private let c = C()
  
  func increment() {
    c.x += 1
    c.y += 1
  }
}

I believe the final var declaration is equivalent to just var because the class itself is already final. Does that sound correct?

In this example S3.C is final… but we expect that the mutation in increment still needs a call to swift_beginAccess? Is that correct? Do we know if there are any other simple declaration modifiers that could help the compiler optimize out that remaining call to swift_beginAccess while still preserving the S3.C class reference?


  1. Documentation

  2. Documentation

Yeah, a final class should naturally have all final members.

Yeah, even if it's final, I would still generally expect there to be exclusivity checks around the call to increment. The only way we could really avoid that is if we know beforehand that c.value is already undergoing an exclusive access, which given the nature of objects having shared ownership is not generally possible to know. We don't really have any memory-safe production-ready annotations you can use to avoid this fundamental exclusivity check yet.
