Low-level swift optimization tips

taylorswift · June 25, 2020, 10:50pm

The past few days i’ve been working on a pure-Swift implementation of lz77 inflate and getting its performance on-par with that of the system-installed libz, so that i can remove the system dependency from Swift PNG.

i learned a lot about what works and what doesn’t when it comes to optimizing algorithms implemented in swift, that is specific to the language and that i figured could help other people working on similar frameworks so i wrote up some of my findings for reference. i’m not really a ~blog person~, so for now the writeup lives in a README in the Swift PNG repository.

Hacksaw · June 26, 2020, 12:47am

Goes along nicely with this: swift/OptimizationTips.rst at main · apple/swift · GitHub

Thanks!

Karl · June 26, 2020, 12:52am

This one is quite interesting:

The reason for this is that the overflow check that comes with + always takes the same branch (the non-trapping path), so this branch is effectively free. At the same time, unwise usage of &+ can actually inhibit other compiler optimizations. For example, adding two positive Int s with + can never produce a negative result, but adding two positive Int s with &+ can. This means that, for example, a subsequent Array bounds check could have skipped its routine “is this index negative” sanity check if you had stuck with + instead of using &+ .

The overflow check is basically free because it is easily predictable, but causes another check in Array - but wouldn’t this new check be just as predictable as the one you removed?

Do you have benchmarks or some other evidence like smaller codegen backing this up?

Also - stuff like this is why I want better control over where -Ounchecked optimisations can be enabled. Some of my functions are in a tight-knit group and validate things in advance that the compiler seems unable to fully prove and remember. I’d like to mark those individual functions as being suitable for safety-check removal even if the enclosing package is only compiled with -O

taylorswift · June 26, 2020, 1:55am

you’re right, but as i said in one of the other paragraphs, the bounds check overhead doesn’t really come from the bounds check itself, it comes from querying Array.count

Karl · June 26, 2020, 2:03am

The check that index < count will happen either way; it’s only the check that index >= 0 that might be avoided by using non-wrapping arithmetic on two known-positive integers.

taylorswift · June 26, 2020, 2:16am

okay, bad example then. but i have noticed from examining the stack traces in instruments that the compiler seems to emit more checks when i use &+ and &- and that things tend to run a little slower

i don’t know what would constitute “backing this up” since anyone can just make up numbers and discuss them but my completely unverifiable evidence is that before i made these changes, the benchmark target that decodes this test image was running in well over 1s whereas afterwards it was running in about ~550 ms, when the same version of the library built with zlib takes about ~475 ms. (the lz77 implementation used is determined by a build flag SWIFT_INFLATE which you can add to the Package.swift to get it to build in xcode.) In both builds, the lz77 inflate accounts for almost all of the execution time, with the png defiltering stage coming in a distant second.

Jens · June 26, 2020, 7:36am

A commonly-perscribed remedy is to wrap the buffer pointer in a class , which lets you do cleanup using the class’s deinit method. But this has the performance drawback of placing the buffer’s storage behind two layers of heap indirection (one for the class , and one for the unsafe buffer.)

A better solution is to use UnmanagedBuffer<Header, Element> , which allows you to store buffer elements inline in the allocation of the class .

I assume you mean ManagedBuffer (unmanaged vs managed is a big difference!).

For larger storage, won't using ManagedBuffer imply two layers of indirection too? (One for the pointer to the ManagedBuffer class instance, and one for the pointer to its heap allocated storage for the elements.)

EDIT: To be clear, I ask because I have no idea how ManagedBuffer is implemented. So please explain why the answer is (I assume) "no".

lukasa · June 26, 2020, 10:01am

Nope. ManagedBuffer looks like this:

┌─────────────────┬──────────────────┬──────────────────┐
│Swift Class Stuff│      Header      │      Values      │
│  (Do not open)  │                  │                  │
└─────────────────┴──────────────────┴──────────────────┘

The offsets of Header and the start of Values are known at compile time because the "Swift Class Stuff" is of known-size, as is the Header. Thus, to get the first element of a ManagedBuffer Values section, you just do *(ptr + offsetof(Values)). This only dereferences one pointer.

A double-indirection occurs when you store a pointer in the class itself. Then you need one indirection to load the pointer itself (*(classPtr + offsetof(StoredPointer))) and then a subsequent dereference of that loaded pointer to get its first element. In C, that'd be: **(classPtr + offsetof(StoredPointer)).

This is much worse for performance and very hard for the branch predictor.

Jens · June 26, 2020, 10:13am

Thanks! I had some vague idea about it using its own storage up to some fixed size, and if necessary (if the elements required more storage than the fixed size of its own storage) allocated a separate heap storage.

Is ManagedBuffer implemented via some special compiler support or would it be possible for a (Swift) user to write a class like ManagedBuffer (ie, one that has control over the byte size of its own allocation)?

lukasa · June 26, 2020, 10:18am

It's magic compiler support. The intention is that if this is something you want, you'll compose with ManagedBuffer or ManagedBufferPointer to achieve your goal. As a worked example, swift-crypto provides a type called SecureBytes that is a self-erasing CoW struct. The backing storage uses ManagedBuffer. This is the currently-expected way to handle this particular use-case.

Jens · June 26, 2020, 11:15am

At the risk of getting slightly off-topic, I happen to have a (very rough, non-production ready) low level type called Table that I first thought might benefit from using ManagedBuffer or ManagedBufferPointer instead of its current ManagedTableStorage.

But now when I looked at it, it seems like it is an example of a type that doesn't suffer from any double indirection, despite using a class for its (optionally) managed storage, because the Table struct has its own data pointer property (which might or might not point into the data allocated by its optional ManagedTableStorage class instance).

Here's the code, should anyone be interested.

/// A low level two dimensional table. A table's storage can be managed or
/// unmanaged, and is *not* copied on write (since that is usually not
/// what we want when using this type in practice). Tables can be used for
/// eg representing pixel images, see Table+CoreGraphics.swift.

struct Table<Value> {

  let data: UnsafeMutableRawPointer
  let width: Int
  let height: Int
  let bytesPerRow: Int
  let managedStorage: ManagedTableStorage?

  static var bytesPerValue: Int { MemoryLayout<Value>.stride }
  var size: SIMD2<Int> { return [width, height] }


  /// If `managedStorage == nil` then you are responsible for deallocating
  /// the memory of `data`.
  /// If `managedStorage != nil` then it must contain the memory block
  /// described by `data`, `width`, `height` and `bytesPerRow`.
  init(
    data: UnsafeMutableRawPointer,
    width: Int,
    height: Int,
    bytesPerRow: Int,
    managedStorage: ManagedTableStorage?
  )
  {
    precondition(width > 0 && height > 0)
    precondition(bytesPerRow >= Self.bytesPerValue * width)
    if let managedStorage = managedStorage {
      let diff = data - managedStorage.storage.baseAddress!
      precondition(diff >= 0)
      precondition(diff + height * bytesPerRow <= managedStorage.storage.count)
    }
    self.data = data
    self.width = width
    self.height = height
    self.bytesPerRow = bytesPerRow
    self.managedStorage = managedStorage
  }


  /// Creates a table instance with a newly allocated storage.
  /// If `managedStorage` is `false`, you are responsible for deallocating the
  /// memory of `data`, for example by `defer { table.data.deallocate  }`.
  /// If `repeatedValue` is nil then the data is left uninitialized.
  init(
    width: Int,
    height: Int,
    repeating repeatedValue: Value?,
    rowByteAlignment: Int = Self.bytesPerValue,
    managedStorage: Bool = true)
  {
    precondition(width > 0 && height > 0)
    precondition(rowByteAlignment > 0)
    precondition(rowByteAlignment.isMultiple(of: Self.bytesPerValue))
    let bytesPerRow = (width * Self.bytesPerValue)
      .roundedUp(toMultipleOf: rowByteAlignment)
    let byteCount = bytesPerRow * height
    self.width = width
    self.height = height
    self.bytesPerRow = bytesPerRow
    let alignment = max(16, MemoryLayout<Value>.alignment)
    if managedStorage {
      self.managedStorage = ManagedTableStorage(byteCount: byteCount,
                                                alignment: alignment)
      self.data = self.managedStorage!.storage.baseAddress!
    } else {
      self.managedStorage = nil
      self.data = UnsafeMutableRawPointer.allocate(byteCount: byteCount,
                                                   alignment: alignment)
    }
    if let repeatedValue = repeatedValue {
      // NOTE: We could use self.data.initializeMemory(as: repeating: count:)
      // on each row, or for the whole thing (but that would mean computing
      // the count taking rowByteAlignment into accout), so for simplicity, we
      // do it like this instead:
      withPositions { self[unchecked: $0] = repeatedValue }
    }
  }

}



// MARK: - Table subscripts and positions


extension Table {

  subscript(unchecked position: SIMD2<Int>) -> Value {
    get {
      let o = MemoryLayout<Value>.stride &* position.x
        &+ bytesPerRow &* position.y
      return data.load(fromByteOffset: o, as: Value.self)
    }
    nonmutating set {
      let o = MemoryLayout<Value>.stride &* position.x
        &+ bytesPerRow &* position.y
      data.storeBytes(of: newValue, toByteOffset: o, as: Value.self)
    }
  }

  subscript(_ position: SIMD2<Int>) -> Value {
    get {
      precondition(position.x >= 0 && position.x < width
        && position.y >= 0 && position.y < height)
      return self[unchecked: position]
    }
    nonmutating set {
      precondition(position.x >= 0 && position.x < width
        && position.y >= 0 && position.y < height)
      self[unchecked: position] = newValue
    }
  }

  func withPositions(_ block: (SIMD2<Int>) -> Void) {
    for y in 0 ..< height {
      for x in 0 ..< width {
        block([x, y])
      }
    }
  }

}



// MARK: - Table Subtables etc.

// ... lots of stuff not relevant to this discussion ...


// MARK: - ManagedTableStorage


// NOTE: I guess one could perhaps use ManagedBuffer or ManagedBufferPointer
// instead of this, see:
// https://forums.swift.org/t/low-level-swift-optimization-tips/37917/8
// But in this scenario, there will not be any double indirection, since the
// Table struct has its own data-pointer (which might point into the data
// allocated to the storage-pointer of a `ManagedTableStorage`).

final class ManagedTableStorage {
  static var debug: Bool { true }

  let storage: UnsafeMutableRawBufferPointer
  init(byteCount: Int, alignment: Int) {
    self.storage = UnsafeMutableRawBufferPointer.allocate(byteCount: byteCount,
                                                          alignment: alignment)
  }
  deinit {
    self.storage.deallocate()
  }
}


// MARK: - An Irrelevant Dependency

extension FixedWidthInteger {

  @inline(__always)
  func roundedUpToPowerOfTwo() -> Self {
    precondition(self > 0)
    let shifts = bitWidth &- leadingZeroBitCount
    return nonzeroBitCount == 1 ? self : 1 &<< shifts
  }

  @inline(__always)
  func roundedTowardZero(toMultipleOf m: Self) -> Self {
    return self - (self % m)
  }

  @inline(__always)
  func roundedAwayFromZero(toMultipleOf m: Self) -> Self {
    let x = self.roundedTowardZero(toMultipleOf: m)
    if x == self { return x }
    return (m.signum() == self.signum()) ? (x + m) : (x - m)
  }

  @inline(__always)
  func roundedDown(toMultipleOf m: Self) -> Self {
    return self < 0
      ? self.roundedAwayFromZero(toMultipleOf: m)
      : self.roundedTowardZero(toMultipleOf: m)
  }

  @inline(__always)
  func roundedUp(toMultipleOf m: Self) -> Self {
    return self >= 0
      ? self.roundedAwayFromZero(toMultipleOf: m)
      : self.roundedTowardZero(toMultipleOf: m)
  }

}

Happy to receive constructive criticism.

EDIT: Added a dependency to make it compile as is.

lukasa · June 26, 2020, 11:38am

If accesses go via the extracted pointer, you're correct, there will be no double-indirection on data access.

taylorswift · June 26, 2020, 5:51pm

your code looks good, but it seems a little over-complicated. Is the reason you’re trying to support the externally-allocated buffer case because you’re getting the table data from a C API?

Jens:

  subscript(unchecked position: SIMD2<Int>) -> Value {
    get {
      let o = MemoryLayout<Value>.stride &* position.x
        &+ bytesPerRow &* position.y
      return data.load(fromByteOffset: o, as: Value.self)
    }
    nonmutating set {
      let o = MemoryLayout<Value>.stride &* position.x
        &+ bytesPerRow &* position.y
      data.storeBytes(of: newValue, toByteOffset: o, as: Value.self)
    }
  }

i would not model an index type using SIMD2<Int>, since you are not doing any SIMD operations with them. you’re using them as (x:Int, y:Int) tuples, which is what you should model them as.

scanon · June 26, 2020, 6:08pm

Agreed. Semantically you should use SIMD types when you expect the normal usage pattern to be performing homogeneous operations on the lanes of the vector, not just because you have two or more of a thing.

Jens · June 26, 2020, 7:14pm

The table data can come from anywhere really, it can be some existing, eg yuv videostream or shared Metal resource, but it can also create its own managed data, loaded from an image file, or be part of another table, etc. It's just a simple but useful data type that I have added functionality to as needed, wrapping functionality from vImage, drawing on it via a CGContext etc.

Yes, but it is very rare that I have the need for a separate x and y, and I don't like to always break them up into that or convert them into a tuple, just to access a pixel.

I'm kind of aware of this and hence I've asked questions like this here before.

The reason for why I use them here is just because:

There are so many places where we need some kind of vector like data types, eg a Point, a Vector, the origin and size of a Rect, etc, etc. And it's crazy that I shouldn't just be able to eg iterate through all the Vector2<Int> positions of a Rect2D<Int> bounds of a pixel image and access a pixel via subscripting it with that Vector2<Int> point directly. (The Table code i posted here is simplified/stripped, it has dependency of my custom Vector types, Rect<Int>, etc). So I've experimented a lot with various generic Vector types over the years but always ended up in the time consuming rabbit hole of exploring (the optimizability, limits and rough areas of) Swift's generics and protocols.
So, after having spent way to much time during these years trying to build a reasonably nice generic Vector (static array) type with type level Count/Index etc, ever since Swift came out actually, never being entirely satisfied, always thinking that it takes too much time, always frustrated by the awkwardness of using CGPoint, CGRect, et al, I sometimes kind of just give up and use the SIMDN types for everything, just to get something done. Then I see what the project really needs and replace the SIMDN types with the simplest possible custom generic Vector variation I can come up with for that project (trying not to begin exploring again).
In the Table code here, I have replaced my own Vector2<Int> types with SIMD2<Int> to easily strip it from its most complicated dependency.

So, my current approach is to use some such custom generic Vector types, instead of the SIMDN types (except where using SIMDN types are motivated), with convenient conversions between these, the SIMDN types and CGPoint, CGRect or whatever I have to interface with at some places.