Generic SIMD function produces non-SIMD code unless you explicitly constrain it

I needed to "zip"/"interleave" two SIMD8<UInt8> vectors into a single SIMD16<UInt8>. The Standard Library's SIMD types don't have an API for this, so I made my own:

InterleaveSIMD16.swift
extension SIMD16
// Uncomment to specialize this to UInt8, and get a `punpcklbw` instruction
// where Scalar == UInt8
{
	/// Pairs up each byte of `highHalves` with a byte from `lowHalves`.
	/// E.g.
	/// ```
	/// let result = SIMD16.interleavedFrom(
	///     highHalves:  SIMD8<UInt8>(0, 2, 4, 6, 8, 0xA, 0xC, 0xE),
	///     lowHalves:   SIMD8<UInt8>(1, 3, 5, 7, 9, 0xB, 0xD, 0xF)
	/// )
	/// // produces a result like:
	/// result == SIMD16<UInt8>(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF)
	/// ```
	@inlinable
	static func interleavedFrom(highHalves: SIMD8<Scalar>, lowHalves: SIMD8<Scalar>) -> SIMD16<Scalar> {
		// When used for UInt8, lowers to `punpcklbw` on x86_64
		return SIMD16<Scalar>(
			highHalves[0], lowHalves[0],
			highHalves[1], lowHalves[1],
			highHalves[2], lowHalves[2],
			highHalves[3], lowHalves[3],
			highHalves[4], lowHalves[4],
			highHalves[5], lowHalves[5],
			highHalves[6], lowHalves[6],
			highHalves[7], lowHalves[7]
		)
	}
}

// Use randoms to trick the optimizer and prevent constant folding
let lowHalves = unsafeBitCast(UInt64.random(in: 0...UInt64.max), to: SIMD8<UInt8>.self)
let highHalves = unsafeBitCast(UInt64.random(in: 0...UInt64.max), to: SIMD8<UInt8>.self)
let interleaved = SIMD16.interleavedFrom(highHalves: highHalves, lowHalves: lowHalves)

print(lowHalves)
print(highHalves)
print(interleaved)

If I constrain my extension to where Scalar == UInt8, then it'll produce the correct punpcklbw instruction on x86_64. But if I don't, it produces 16 separate calls to Swift.SIMDStorage.subscript.

Since the caller is known to use only a single type with this generic function, shouldn't the optimizer be able to specialize it for SIMD16<UInt8>?

On a potentially related note, I couldn't get this to produce SIMD instructions on ARM at all (I think VZIP would be the correct instruction, but I'm not sure), regardless of constraints. I would just get 16 immediate loads.

It is specializing in the example you gave (you can search for punpcklbw in the output). It's just also exporting the generic entry point, because you made the function @inlinable internal. If you change it to fileprivate, or internal plus -whole-module-optimization, it gets dead-stripped, but @inlinable internal means it might be called from an @inlinable public function that doesn't get fully inlined and so it's public at the IR level.

(Actually, an @inlinable internal function that isn't referenced by a public entry point theoretically ought to still be dead-strippable with -whole-module-optimization. But that's an edge case separate from what you're testing here.)
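For concreteness, here is a sketch of the fileprivate variant mentioned above; the element-wise body is unchanged, only the attribute and access level differ, and it assumes the function is only needed within this one file:

extension SIMD16 {
	fileprivate static func interleavedFrom(highHalves: SIMD8<Scalar>, lowHalves: SIMD8<Scalar>) -> SIMD16<Scalar> {
		// No @inlinable here: nothing outside this file can reference the
		// generic entry point, so only the specialized copy needs to survive.
		return SIMD16<Scalar>(
			highHalves[0], lowHalves[0], highHalves[1], lowHalves[1],
			highHalves[2], lowHalves[2], highHalves[3], lowHalves[3],
			highHalves[4], lowHalves[4], highHalves[5], lowHalves[5],
			highHalves[6], lowHalves[6], highHalves[7], lowHalves[7]
		)
	}
}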

Hey Jordan, thanks for explaining this!

but @inlinable internal means it might be called from an @inlinable public function that doesn't get fully inlined and so it's public at the IR level.

Ahhhh that makes perfect sense!

Got any advice on how to investigate why this doesn't produce a zip instruction on ARM?

xcrun swiftc --target=arm64-apple-macos12 -emit-assembly -O - includes zip1.8b v1, v0, v3 for me. So maybe it's that your particular target doesn't have SVE enabled by default?

Ooo that's a really handy command, thanks for sharing it.

It's good to know that this optimization is possible in principle. I must have some issue with my SPM or Xcode config. I'll investigate that, thanks!

The standard library has evenHalf and oddHalf properties, but they're implemented with for loops.

There isn't an init(evenHalf:oddHalf:) initializer, which for concrete types might be implemented with a Builtin.shufflevector instruction?
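For illustration only: init(evenHalf:oddHalf:) is not standard library API, but a generic sketch of it could look like the following. It's written with a scalar loop, so it carries the same codegen caveats discussed in this thread; a real stdlib version would presumably go through Builtin.shufflevector for concrete types.

extension SIMD16 {
	/// Hypothetical inverse of the stdlib's `evenHalf`/`oddHalf` properties:
	/// even lanes are taken from `evenHalf`, odd lanes from `oddHalf`.
	init(evenHalf: SIMD8<Scalar>, oddHalf: SIMD8<Scalar>) {
		self.init()
		for i in 0..<8 {
			self[2 * i] = evenHalf[i]
			self[2 * i + 1] = oddHalf[i]
		}
	}
}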

The standard library has evenHalf and oddHalf properties, but they're implemented with for loops.

Oh neat. Yeah what I'm doing is basically the inverse of this, to stitch those two back together.

Ah I found my issue: there wasn't really one :smiley:

The compiler was inlining my function and using load instructions to move the values directly into their eventual destination, skipping the intermediate zip. If I mark my function @inline(never), I can confirm that it compiles to a zip1.8b. Though it also emits some moves after that, whose purpose I haven't figured out yet:

	.p2align	2
_$ss6SIMD16V12minimal_demos5UInt8VRszrlE15interleavedFrom33_4C7D1AC566DCCCCC5E73C08167DEB363LL10highHalves03lowO0AByAEGs5SIMD8VyAEG_AMtFZTf4nnd_n:
	zip1.8b	v2, v0, v1
	dup.8b	v3, v0[4]
	mov.d	v2[1], v3[0]
	mov.b	v2[9], v1[4]
	mov.b	v2[10], v0[5]
	mov.b	v2[11], v1[5]
	mov.b	v2[12], v0[6]
	mov.b	v2[13], v1[6]
	mov.b	v2[14], v0[7]
	mov.b	v2[15], v1[7]
	mov.16b	v0, v2
	ret

It looks like zip1.8b only fills in the lower half of v2, and that these moves fill in the rest. I guess there's no vectorized instruction that handles a 128-bit output? Fair enough.

What I don’t understand is why there wasn’t a second zip instruction for the upper half, instead of the separate moves.

The dup.8b v3, v0[4] also looks out of place :thinking:

A while back I asked how to do something similar, and @scanon recommended writing a widen function like this:

private func widen(_ x: SIMD8<UInt8>) -> SIMD8<UInt16> {
    // Zero-extends each UInt8 lane into the corresponding UInt16 lane.
    SIMD8<UInt16>(truncatingIfNeeded: x)
}

Then you can interleave vectors like this:

(widen(a) &<< 8) | widen(b)

I believe this also produces efficient machine instructions.
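In case it's useful, here is a rough sketch of one way to get back to a byte vector from that result. It reuses the widen(_:) helper above; the bitcast step and the particular values of a and b are my own additions, and it assumes a little-endian target (x86_64, arm64):

let a = SIMD8<UInt8>(1, 3, 5, 7, 9, 0xB, 0xD, 0xF)
let b = SIMD8<UInt8>(0, 2, 4, 6, 8, 0xA, 0xC, 0xE)

// Each UInt16 lane holds a[i] in its high byte and b[i] in its low byte.
let packed: SIMD8<UInt16> = (widen(a) &<< 8) | widen(b)

// On a little-endian target the low byte comes first in memory, so viewing the
// lanes as bytes gives b[i] in the even positions and a[i] in the odd ones.
// Swap a and b above if you need the opposite pairing.
let interleaved = unsafeBitCast(packed, to: SIMD16<UInt8>.self)
// interleaved == SIMD16<UInt8>(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF)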
