Parameter pack loop performance worsens the more "nested" it is

Here's the setup: I have a basic Pack type that holds a parameter pack of arbitrary values. I also have a Packs type that holds a parameter pack of other Packs. Everything is static:

/// Conformers can statically walk the sizes of the value types they hold.
protocol ValueHolding {
    
    static func forEachValueTypeSize(body: (Int) -> Void)
    
    static func forEachOptionalValueTypeSize(body: (Int) -> Void)
    
}

/// A flat pack of arbitrary value types.
struct Pack<each Value>: ValueHolding {
    
    static func size<T>(of type: T.Type) -> Int {
        MemoryLayout<T>.size
    }
    
    static func forEachValueTypeSize(body: (Int) -> Void) {
        for valueType in repeat (each Value).self {
            body(size(of: valueType))
        }
    }
    
    static func forEachOptionalValueTypeSize(body: (Int) -> Void) {
        for optionalValueType in repeat Optional<each Value>.self {
            body(size(of: optionalValueType))
        }
    }
    
}

/// A pack of other ValueHolding types, adding one level of generic nesting.
struct Packs<each Value: ValueHolding>: ValueHolding {
    
    static func forEachValueTypeSize(body: (Int) -> Void) {
        for valueType in repeat (each Value).self {
            valueType.forEachValueTypeSize(body: body)
        }
    }
    
    static func forEachOptionalValueTypeSize(body: (Int) -> Void) {
        for valueType in repeat (each Value).self {
            valueType.forEachOptionalValueTypeSize(body: body)
        }
    }
    
}
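
As a quick sanity check of what these walk (a usage sketch, not part of the benchmark below; the sizes assume a 64-bit platform, where Int is 8 bytes and String is 16):

Pack<Int, String, Int, String>.forEachValueTypeSize { print($0) }
// Prints 8, 16, 8, 16.

Packs<
    Pack<Int, String, Int, String>,
    Pack<Int, String, Int, String>
>.forEachValueTypeSize { print($0) }
// Prints the same sequence twice, once per inner Pack.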

Generally speaking, there seem to be two factors that influence the performance of these loops:

  1. What the bodies of those forEach functions do: some of them loop over the pack types as-is, and others wrap each type in an Optional first.
  2. How deeply a Pack is nested in generics: the deeper the nesting, the worse the loop's performance.

Some test code and benchmarks from my M3 Macbook Pro:

import os

let log = OSLog(subsystem: "com.example.packs", category: .pointsOfInterest)
let signposter = OSSignposter(logHandle: log)

// Runs body 500,000 times inside a single signpost interval for Instruments.
func profile(_ name: StaticString, body: () -> Void) {
    let staticSignpostID = signposter.makeSignpostID()
    let state = signposter.beginInterval(name, id: staticSignpostID)
    
    for _ in 0..<500_000 {
        body()
    }
    
    signposter.endInterval(name, state)
}

var totalSize = 0

// Fast, but slower than it should be: 324.33 µs

profile("Pack.forEachOptionalValueTypeSize x 2") {
    totalSize = 0
    
    let pack = Pack<Int, String, Int, String>.self
    
    pack.forEachOptionalValueTypeSize {
        totalSize += $0
    }
    
    pack.forEachOptionalValueTypeSize {
        totalSize += $0
    }
}

// Extremely slow, relatively: 128.44 ms

profile("Packs.forEachOptionalValueTypeSize") {
    totalSize = 0
    
    let packs = Packs<
        Pack<Int, String, Int, String>,
        Pack<Int, String, Int, String>
    >.self
    
    packs.forEachOptionalValueTypeSize {
        totalSize += $0
    }
}

// Fast! 11.29 µs

profile("Pack.forEachValueTypeSize x 2") {
    totalSize = 0
    
    let pack = Pack<Int, String, Int, String>.self
    
    pack.forEachValueTypeSize {
        totalSize += $0
    }
    
    pack.forEachValueTypeSize {
        totalSize += $0
    }
}

// Pretty fast, but still slower than it should be: 8.52 ms

profile("Packs.forEachValueTypeSize") {
    totalSize = 0
    
    let packs = Packs<
        Pack<Int, String, Int, String>,
        Pack<Int, String, Int, String>
    >.self
    
    packs.forEachValueTypeSize {
        totalSize += $0
    }
}

print(totalSize)

Ranked, fastest to slowest:

Test                                      Duration
Pack.forEachValueTypeSize x 2             11.29 µs
Pack.forEachOptionalValueTypeSize x 2     324.33 µs
Packs.forEachValueTypeSize                8.52 ms
Packs.forEachOptionalValueTypeSize        128.44 ms

Looping over the unmodified types is fastest, but generally the "further" we get from the raw pack, the slower the loop gets, with the nested, wrapped packs performing worst.

Profiling these suggests to me that at a certain point Swift stops being able to fully statically generate and specialize the types and calls involved, and starts leaning on runtime metadata lookups. It gets especially bad when you combine nested types that carry parameter packs with wrapping of the original element types.

Anyway, I consider this a bug (or at least an opportunity for improvement), so I plan to file one, but I'm also curious in a more general sense, to refine my mental model: is there any principle that should cause this code to progressively slow down the way it does?


Are you able to call the body closures without for-in?

struct Pack<each Value>: ValueHolding {
  
  static func size<T>(of type: T.Type) -> Int {
    MemoryLayout<T>.size
  }
  
  static func forEachValueTypeSize(body: (Int) -> Void) {
    repeat body(size(of: ((each Value).self)))
  }
  
  static func forEachOptionalValueTypeSize(body: (Int) -> Void) {
    repeat body(size(of: (Optional<(each Value)>.self)))
  }
  
}

struct Packs<each Value: ValueHolding>: ValueHolding {
  
  static func forEachValueTypeSize(body: (Int) -> Void) {
    repeat ((each Value).self).forEachValueTypeSize(body: body)
  }
  
  static func forEachOptionalValueTypeSize(body: (Int) -> Void) {
    repeat ((each Value).self).forEachOptionalValueTypeSize(body: body)
  }
  
}

I would also recommend Ordo One benchmarks if you are looking for an alternative way to run performance benchmarks from the command line. I don't have much experience with OSSignposter outside of testing UI flows in Instruments.

I actually tried your exact code just after I posted this! It works, but it still has the same performance. Also, as far as I know, for _ in repeat ... is mostly just sugar for plain repeat; they generate the same code.
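
To make that comparison concrete, here are both spellings side by side in one hypothetical type (mirroring Pack above; either body should lower to the same element-wise expansion):

struct Demo<each Value> {
    
    static func size<T>(of type: T.Type) -> Int {
        MemoryLayout<T>.size
    }
    
    // Sugared: for-in over the expanded pack (SE-0408 pack iteration).
    static func sugared(body: (Int) -> Void) {
        for valueType in repeat (each Value).self {
            body(size(of: valueType))
        }
    }
    
    // Desugared: the pattern expression is evaluated once per pack element.
    static func desugared(body: (Int) -> Void) {
        repeat body(size(of: (each Value).self))
    }
    
}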

TIL. Looks pretty neat, thanks for sharing.


is there any principle that should cause this code to progressively slow down the way it does?

Abstraction!

godbolt link

This is really interesting, but unsurprising when you read the generated code. With a flat pack the codegen is essentially what you'd expect, if a bit hilarious*:

  • Load the type metadata for Int and String
  • GEP into them and load the slot for their runtime size
  • Add it to the global variable for totalSize

Now for the nested case:

godbolt link

The codegen here is demonstrably worse. You are eating a full pack metadata instantiation at the top of main, plus four calls to Pack.forEachOptionalValueTypeSize and Pack.forEachValueTypeSize, each of which does a load and then a jump, all just to add a number it has to pull out of yet more type metadata and store in a global variable. All of that said, you can really get a feel for how you jump from microseconds (four adds) to milliseconds (full metadata instantiation, loads, stores, calls to closures) just by eyeballing the two side by side.

We can try to bring the codegen of the second case closer to the first by doing the compiler's job for it:

godbolt link

By noticing that Pack<Int, String, Int, String> appears multiple times and can be specialized, aggressively turning up inlining, and hand-writing the formerly variadic codegen in our specialized functions, we can demolish the abstraction barriers imposed here, albeit at the cost of also demolishing the reason you expressed this with variadic generics in the first place. In that light, I definitely agree that filing an optimizer bug is warranted.
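
For reference, the hand-specialization is roughly this shape (a sketch; the exact code is in the godbolt link above, and PackIntStringIntString is a hypothetical stand-in for the generic Pack):

// With the pack unrolled by hand, there is no pack metadata to
// instantiate at runtime; each size is a plain MemoryLayout query.
enum PackIntStringIntString {
    
    @inline(__always)
    static func forEachValueTypeSize(body: (Int) -> Void) {
        body(MemoryLayout<Int>.size)
        body(MemoryLayout<String>.size)
        body(MemoryLayout<Int>.size)
        body(MemoryLayout<String>.size)
    }
    
    @inline(__always)
    static func forEachOptionalValueTypeSize(body: (Int) -> Void) {
        body(MemoryLayout<Int?>.size)
        body(MemoryLayout<String?>.size)
        body(MemoryLayout<Int?>.size)
        body(MemoryLayout<String?>.size)
    }
    
}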

*A bit disappointed that there are loads and repeated overflow checks here at all. We should be able to inline away these metadata lookups, and constant propagation and folding should take care of this annoying repeated "add, overflow check" gadget LLVM cooked up:

        adds    x9, x19, x8
        b.vs    .LBB0_10
        adds    x9, x9, x20
        b.vs    .LBB0_11
        adds    x9, x9, x8
        b.vs    .LBB0_12
        adds    x9, x9, x20
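
If the traps themselves are unwelcome, one source-side tweak (assuming wrap-on-overflow is acceptable for a benchmark like this) is wrapping addition in the closure:

// &+= is wrapping addition, so no overflow trap is emitted after each
// add. Only appropriate when overflow can't happen or wrapping is fine.
pack.forEachOptionalValueTypeSize {
    totalSize &+= $0
}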

The SIL optimizer’s support for optimizing iteration over packs is limited (read: essentially nonexistent). It should be possible to always iterate over a concrete pack without instantiating the pack, but that’s not the case today.
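
Until that improves, one possible mitigation (a sketch, not from the code above; it assumes the pack's contents are fixed for the life of the program) is to walk the pack once and cache the result:

// Pay the pack instantiation once, at load time.
let cachedOptionalSizes: [Int] = {
    var sizes: [Int] = []
    Packs<
        Pack<Int, String, Int, String>,
        Pack<Int, String, Int, String>
    >.forEachOptionalValueTypeSize { sizes.append($0) }
    return sizes
}()

// The hot loop then only traverses an array, with no metadata lookups.
totalSize = cachedOptionalSizes.reduce(0, +)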

4 Likes