Understanding @_transparent: Why is my @_transparent Swift operator faster than with simd?

So I'm doing some performance optimization on my GameMath package and I'm having trouble understanding compiler hints.

I'm not sure what @_transparent is doing, because adding it to this operator makes it way faster; so much so that it's significantly faster than using simd on macOS x86_64 when using the operator in a performance test.

This is the test:
func testMultiplicationPerformance() {
    let m1 = Transform3<Float>(position: Position3(128, 128, 128),
                               rotation: Quaternion(Degrees(128), axis: .up),
                               scale: .one).createMatrix()
    let m2 = Transform3<Float>(position: Position3(0, 1, 2),
                               rotation: Quaternion(Degrees(90), axis: .up),
                               scale: Size3(-100, -15, -1)).createMatrix()
    let m3 = Transform3<Float>(position: Position3(-128, -128, -128),
                               rotation: Quaternion(Degrees(-65), axis: .up),
                               scale: Size3(100, 15, 1)).createMatrix()

    func doMath() {
        var mtx: Matrix4x4<Float> = .identity
        mtx *= m1 * m2 * m3
        mtx *= m1 * m2 * m3
        mtx *= m1 * m2 * m3
        mtx *= m1 * m2 * m3
        func more() -> Matrix4x4<Float> {
            return m1 * m2 * m3
        }
        for _ in 1 ..< 5000 {
            mtx *= more()
            mtx *= more()
        }
        for _ in 1 ..< 5000 {
            mtx *= more()
        }
    }
    measure {
        doMath()
    }
}

I thought @_transparent was a heavy-handed @inlinable, but it makes my raw Swift operator nearly 10x faster than using the same code with @inlinable / @inline(__always). It's faster in both unoptimized and optimized builds.

Is @_transparent somehow making the compiler do better vectorization? Or is the speed coming from something else?

Also worth noting, using @_transparent on the simd implementation of the operator had no significant performance change. Using raw Swift with @_transparent was the fastest in all of my tests.
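For context, here's roughly what the raw Swift path looks like with the attribute applied. This is a minimal sketch using a hypothetical two-component Vec2 in place of the actual Matrix4x4<Float>, just to show where @_transparent sits; it is not the GameMath code:

```swift
// Minimal sketch: a hypothetical Vec2 standing in for Matrix4x4<Float>.
// Not the actual GameMath implementation.
struct Vec2 {
    var x: Float
    var y: Float

    // @_transparent forces the body to be inlined at the SIL level,
    // even in unoptimized (-Onone) builds.
    @_transparent
    static func *= (lhs: inout Self, rhs: Self) {
        lhs.x *= rhs.x
        lhs.y *= rhs.y
    }
}

var v = Vec2(x: 2, y: 3)
v *= Vec2(x: 4, y: 5)
print(v.x, v.y) // prints 8.0 15.0
```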

This is the SIMD operator implementation:
static func *=(lhs: inout Self, rhs: Self) {
    let r = simd_mul(simd_float4x4(lhs.storage[0], lhs.storage[1], lhs.storage[2], lhs.storage[3]),
                     simd_float4x4(rhs.storage[0], rhs.storage[1], rhs.storage[2], rhs.storage[3]))
    lhs.storage[0] = r[0]
    lhs.storage[1] = r[1]
    lhs.storage[2] = r[2]
    lhs.storage[3] = r[3]
}

How did you measure the performance between the two?

I kept both implementations and switched between them with a swiftSetting, running the XCTest both from Xcode and with swift test -Xswiftc -O; the results were similar either way.
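For anyone curious how the swiftSetting switch might look, here's a sketch of a Package.swift target gating the two implementations behind a compilation condition. The USE_SIMD name is assumed for illustration; the actual condition name in GameMath may differ:

```swift
// Hypothetical Package.swift excerpt (condition name assumed).
// The operator source would then wrap each implementation in
// #if USE_SIMD ... #else ... #endif.
.target(
    name: "GameMath",
    swiftSettings: [
        .define("USE_SIMD")
    ]
)
```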

This is the actual XCTest implementation.

I'd recommend using Xcode Instruments for this, or any other CPU profiling tools you have lying around. I have very little experience with SIMD instructions, but if Apple implemented this correctly (assuming you're using the Accelerate framework), it should definitely be faster than the plain compiled version.

What I'm worried is happening is that the instruction pipeline is stalling due to a faulty implementation in Accelerate, but if that were truly the case, plenty of other apps would also noticeably suffer.

Which CPU and platform are you running this on?

— edit:

Never mind, I just re-read your first post and saw you mentioned macOS on x86_64. It's been a while since I had to profile anything that low-level, but it'd be interesting to see how many cycles the CPU spends in each of those two routines. Try running this through Instruments -> Counters: once in Instruments, go to Recording Options, sample the events (Cycles, and also add CYCLES_ACTIVITY.STALLS_TOTAL), and hit record. If you run the two functions sequentially, you'll see the cycles spent on each in the call tree.

Thanks for the suggestions! Those would all help a great deal with performance work.

My question was actually specifically about @_transparent though, not just performance.
For anyone who comes across this thread later: @_transparent is a compiler hint that isn't meant for use by non-core packages. There's some documentation here but it's rather abstract. Because it's not meant for common use as part of the language, its effects could change from Swift release to Swift release.
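The key semantic difference, as I understand it, is that @_transparent inlining is mandatory and happens early in SIL generation, even in -Onone builds, whereas @inlinable only makes a function's body available across module boundaries and leaves the inlining decision to the optimizer. A minimal sketch of the two attributes side by side (the function names here are made up for illustration):

```swift
// @inlinable exposes the body across module boundaries; the optimizer
// *may* inline it, and in -Onone builds it typically remains a call.
@inlinable
public func addInlinable(_ a: Int, _ b: Int) -> Int {
    return a + b
}

// @_transparent is mandatory inlining applied early at the SIL level,
// even at -Onone, which is why it can change unoptimized timings so much.
@_transparent
public func addTransparent(_ a: Int, _ b: Int) -> Int {
    return a + b
}

print(addInlinable(1, 2), addTransparent(1, 2)) // prints 3 3
```

This would be consistent with the observation above that the raw Swift operator sped up in both unoptimized and optimized builds, though it doesn't by itself explain a 10x gap in -O.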

I'm just exploring it to see if the benefits are worth the future restrictions on the package. I ran into the example in the question while doing so, and the result was so unreasonable it doesn't seem possible.

I'll have to learn some assembly to figure it out. I was hoping someone might just have a reasonable explanation, but I'll work it out myself. It's too radical a difference to use blindly, in case it regresses in a future Swift release.