What are the benefits of minimizing the MemoryLayout size/stride of a struct?

As I understand things, Swift's structs are laid out in memory in the order that their stored properties are declared. Thus, you can play a fun little mini-game trying to minimize MemoryLayout.size and MemoryLayout.stride by reordering these declarations. I'm somewhat intrinsically drawn to doing this, but I wonder: what are the concrete benefits? Other things being equal, it seems like having a minimally sized in-memory representation is desirable... but are there particular size/stride thresholds that are more beneficial than others?
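For concreteness, here's the sort of thing I mean (made-up types; numbers are for a 64-bit platform):

struct Padded {
  var a: Bool   // offset 0, then 7 bytes of padding so `b` can be 8-byte aligned
  var b: Int64  // offset 8
  var c: Bool   // offset 16
}
// MemoryLayout<Padded>.size == 17, .stride == 24

struct Packed {
  var b: Int64  // offset 0
  var a: Bool   // offset 8
  var c: Bool   // offset 9, no interior padding
}
// MemoryLayout<Packed>.size == 10, .stride == 16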

1 Like

Swift does have some thresholds where it treats your type differently:

Swift has an optimization for existentials (Any or any P): it can store a value inline in the existential's buffer if the value is 3 machine words or less (24 bytes on 64-bit machines). If you're putting such types in existentials often, then it may be beneficial for you to optimize their layout to fit at or under that limit, so that you reap the benefit of not forcing a heap allocation every time.
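For instance (protocol and type names made up):

protocol Drawable {}

struct ThreeWords: Drawable {  // 24 bytes: fits in the existential's inline buffer
  var a, b, c: Int
}

struct FourWords: Drawable {   // 32 bytes: gets boxed on the heap instead
  var a, b, c, d: Int
}

let inline: any Drawable = ThreeWords(a: 1, b: 2, c: 3)      // no allocation
let boxed: any Drawable = FourWords(a: 1, b: 2, c: 3, d: 4)  // allocates a box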

There's also another limit Swift has: if a type is larger than 5 machine words (iirc, so 40 bytes on 64-bit machines), it's treated as "large"/"big", and instead of preferably passing it through registers, the compiler will stack-allocate it and pass a pointer to callees. This by itself isn't a bad thing, but constantly copying such a value means constantly loading from memory and shuffling things around for callees, which can add up and hurt performance.
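As a rough illustration on a 64-bit platform (made-up types, taking the 40-byte figure at face value):

struct FitsInRegisters {   // 32 bytes: can be passed directly in registers
  var a, b, c, d: Int
}

struct PassedIndirectly {  // 48 bytes: over the limit, so the compiler passes its address
  var a, b, c, d, e, f: Int
}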

Other folks may be more interested at the assembly level, where representing a type in 2 registers vs. 4 registers matters for things like code size (very, very minuscule) and performance (very, very minuscule). These are real concerns for some folks, but the general developer shouldn't worry about these problems at all.

9 Likes

There was a presentation by Andrew Kelley, creator and lead developer of the Zig programming language, about size optimisations he's applying within the Zig compiler. He goes further than just reordering fields in order to achieve even greater gains, but it's all about shrinking the size of values and reducing padding.

At 38:15 he gives some numbers after applying these techniques to reduce the size of tokens and AST nodes:

  • 15% fewer cache misses
  • 28% fewer total instructions executed
  • 26% fewer total CPU cycles
  • 22% faster wall clock time

And then at 42:30 he gives some numbers after reducing the size of Zig IR instruction values:

  • 39% reduction in Wall Clock Time
  • 40% reduction in Peak Memory Usage
  • 53% reduction in Cache Misses
  • 23% reduction in Instructions
  • 41% reduction in CPU Cycles

(these are in addition to the previous gains)

Your results will of course depend on the specifics of your application.

9 Likes

I still want to see Swift auto-reorder fields like Rust does, except for fragile structs in libraries with binary evolution turned on. But Swift has de facto used the stable declaration ordering for long enough that at this point we’d probably need an opt-out. :-(

8 Likes

Actually, I think I’d prefer an opt-in that controlled not only reordering but also bit-packing. I’d be willing to explicitly grant the compiler permission to squeeze a Bool into the unused top bit of another property, knowing that the tradeoff is the inability to call MemoryLayout<T>.offset(of:) for that property.
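You can get a hand-rolled version of that today; here's a sketch (the type and the choice of the top bit are made up), which also shows why the compiler would have to be careful: every setter has to preserve the other field's bits.

struct PackedEntry {
  // A Bool squeezed into the top bit of a 64-bit offset,
  // assuming real offsets never use bit 63.
  private var storage: UInt64

  init(offset: UInt64, flag: Bool) {
    storage = (offset & ~(1 << 63)) | (flag ? (1 << 63) : 0)
  }

  var offset: UInt64 {
    get { storage & ~(1 << 63) }
    set { storage = (storage & (1 << 63)) | (newValue & ~(1 << 63)) }
  }

  var flag: Bool {
    get { storage >> 63 == 1 }
    set { storage = newValue ? storage | (1 << 63) : storage & ~(1 << 63) }
  }
}
// MemoryLayout<PackedEntry>.size == 8, where the unpacked
// (UInt64, Bool) pair would be size 9 and stride 16.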

5 Likes

I'm not sure this is a sensible default in all situations. For large structs, you sometimes want the exact opposite: for the order of fields to match the order written in source, so that you can organize the fields in such a way that those accessed together frequently are more likely to be in the same cache line.

8 Likes

You could get that effect by tagging a sub-struct as non-reorderable:

struct MyBigOlStruct {
  @DontReorder
  struct CombinedState {
    // these properties are accessed together frequently
    var foo: Int
    var bar: Int
  }
  // ...
}
2 Likes

The exact rule here is more complex than just a size threshold: it's based on how many "scalars" the type breaks up into. Optimizing the layout of the type can still help it fit under the limit, though, because Swift will pass multiple small-integer fields that fall within the same word-aligned span of the layout as a single word.
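For example, under that rule a struct like this (made-up example) would break into two scalars rather than five:

struct Compact {
  // These four 2-byte fields share one 8-byte word-aligned span,
  // so they get passed together as a single word...
  var a: UInt16
  var b: UInt16
  var c: UInt16
  var d: UInt16
  // ...and this Int is the second scalar.
  var e: Int
}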

7 Likes

My favorite unintuitive-at-a-glance example of this was how, some years ago, a type in Darwin that was laid out essentially like this

struct Thingy {
  var headMutex: Mutex
  var tailMutex: Mutex  // adjacent to headMutex, so the two locks share a cache line
  var name: FixedSizeInlineArray
}

was changed to be like this

struct Thingy {
  var headMutex: Mutex
  var name: FixedSizeInlineArray  // now separates the two locks
  var tailMutex: Mutex
}

plus increasing the minimum size of the name, to make sure the two locks ended up on different cache lines. Quite a significant speedup!

Another classic is getting a 4x speedup by reordering the elements in a binary search so that fewer cache lines are touched: Eytzinger Binary Search - Algorithmica

13 Likes

Some colleagues and I discussed this maybe a year ago, and the point was made that it would effectively mean that @frozen can become a performance pessimization because it means fields get reordered only for non-fragile types.

My horse for Int63 (and the compiler understanding that it means there's an uninhabited bit).

2 Likes

“Non-fragile” isn’t quite right because that excludes structs without ABI concerns, but for types in existing libraries that’s true. Perhaps this change would come with a way to opt into reordering even with @frozen, and then we could make it opt out instead in a future Swift version (with the ABI compatibility checker aware of the difference, of course). @frozen(fieldOrderIsSignificant: true), perhaps, as a long-winded starting point.

(I see the sub-field bitpacking / padding-packing as a complementary change, but a tricky one. It’s a more explicit trade of less memory for more code size, and it needs to be implemented carefully, since it means assigning into any of the fields packed together has to preserve the rest of the fields. We had a bug around this for enum payloads way back: you’d have something like an Optional<Optional<Foo>>, where both layers of optional were using spare bits, and set a value of type Optional<Foo> without remembering to zero out the other bits. Oops, the outer Optional is now nil because one of those inner spare bits was a 1! I’m confident we wouldn’t make that mistake again, but it does illustrate the unexpected complexity of this kind of transformation.)

1 Like

As with all performance tuning, sometimes it matters a lot and sometimes not at all.

This thread has outlined a few cases where it can matter a lot:

  • Fitting objects into a fixed-size buffer (for example, the inline buffer of an existential) that would otherwise require a heap allocation
  • Fitting data into cache lines (or, as with the two-mutex example above, keeping it out of shared ones)

Another general observation is that it can often depend on how many of that object reside in memory and/or how often those values are accessed. If you have an array with a million elements, then reducing the size of the elements reduces the total amount of memory that has to be scanned and can have a huge impact.
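Back-of-the-envelope, with made-up strides:

// A million elements at a 24-byte stride vs. a 16-byte stride:
// 1_000_000 × 24 bytes = 24 MB of element storage to scan
// 1_000_000 × 16 bytes = 16 MB, a third less memory traffic for the same data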

Another case to be wary of occurs with Swift enums. Remember that an enum is at least as large as its largest payload, so making the first case indirect here can save a lot of memory:

enum Foo {
  indirect case a(ReallyBig)  // boxed behind a pointer, so Foo no longer reserves inline space for ReallyBig
  case b(Small)
  case c(Smaller)
  case d(Smallest)
  case e(AlsoTiny)
}
4 Likes

I'd like seeing further exploration in this field. Or Int64X approach (bikeshed name) that uses "64 bits minus one value": Int64.min+1...Int64.max. This would allow Optional<Int64X> size and stride being unchanged compared to Int64 that is 8 by using Int64.min bit pattern to represent .none. To extent we could simulate this with optional doubles today by abusing a single nan bit pattern (out of about 10¹⁶ of those) to represent .none.