I needed to "zip"/"interleave" two SIMD8<UInt8> vectors into a single SIMD16<UInt8>. The Standard Library's SIMD types don't have an API for this, so I made my own:
InterleaveSIMD16.swift
extension SIMD16
    // Uncomment to specialize this to UInt8, and get a `punpcklbw` instruction
    // where Scalar == UInt8
{
    /// Pairs up each byte of `highHalves` with a byte from `lowHalves`.
    /// E.g.
    /// ```
    /// let result = SIMD16.interleavedFrom(
    ///     highHalves: SIMD8<UInt8>(0, 2, 4, 6, 8, 0xA, 0xC, 0xE),
    ///     lowHalves:  SIMD8<UInt8>(1, 3, 5, 7, 9, 0xB, 0xD, 0xF)
    /// )
    /// // produces a result like:
    /// result == SIMD16<UInt8>(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF)
    /// ```
    @inlinable
    static func interleavedFrom(highHalves: SIMD8<Scalar>, lowHalves: SIMD8<Scalar>) -> SIMD16<Scalar> {
        // When used for UInt8, lowers to `punpcklbw` on x86_64
        return SIMD16<Scalar>(
            highHalves[0], lowHalves[0],
            highHalves[1], lowHalves[1],
            highHalves[2], lowHalves[2],
            highHalves[3], lowHalves[3],
            highHalves[4], lowHalves[4],
            highHalves[5], lowHalves[5],
            highHalves[6], lowHalves[6],
            highHalves[7], lowHalves[7]
        )
    }
}
// Use randoms to trick the optimizer and prevent constant folding
let lowHalves = unsafeBitCast(UInt64.random(in: 0...UInt64.max), to: SIMD8<UInt8>.self)
let highHalves = unsafeBitCast(UInt64.random(in: 0...UInt64.max), to: SIMD8<UInt8>.self)
let interleaved = SIMD16.interleavedFrom(highHalves: highHalves, lowHalves: lowHalves)
print(lowHalves)
print(highHalves)
print(interleaved)
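As a quick sanity check on the lane order, here is a minimal sketch that round-trips an interleave through the standard library's `evenHalf` and `oddHalf` properties. The free function `interleaveBytes` is a hypothetical stand-in for the extension method above, fixed to `UInt8` so the block is self-contained. It says nothing about which instructions the compiler emits.

```swift
// Hypothetical stand-in for the extension method above, fixed to UInt8.
func interleaveBytes(_ high: SIMD8<UInt8>, _ low: SIMD8<UInt8>) -> SIMD16<UInt8> {
    return SIMD16<UInt8>(
        high[0], low[0], high[1], low[1],
        high[2], low[2], high[3], low[3],
        high[4], low[4], high[5], low[5],
        high[6], low[6], high[7], low[7]
    )
}

let evens = SIMD8<UInt8>(0, 2, 4, 6, 8, 10, 12, 14)
let odds = SIMD8<UInt8>(1, 3, 5, 7, 9, 11, 13, 15)
let zipped = interleaveBytes(evens, odds)

// `high` lands in the even lanes and `low` in the odd lanes,
// so splitting by lane parity should recover both inputs.
precondition(zipped.evenHalf == evens)
precondition(zipped.oddHalf == odds)
print(zipped)
```

If either precondition fails, the two arguments are paired in the wrong order.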
If I constrain my extension to where Scalar == UInt8, then it'll produce the correct punpcklbw instruction on x86_64. But if I don't, it produces 16 separate calls to Swift.SIMDStorage.subscript.
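For reference, the constrained form described here would look roughly like the sketch below. The method name `interleavedBytes` is hypothetical, picked only so it does not clash with the generic version. The body is the same element-by-element initializer, and the code generation is as observed above rather than something this sketch guarantees.

```swift
extension SIMD16 where Scalar == UInt8 {
    /// Same interleave as the generic version, but only available for UInt8.
    /// (`interleavedBytes` is a hypothetical name used for illustration.)
    @inlinable
    static func interleavedBytes(highHalves: SIMD8<UInt8>, lowHalves: SIMD8<UInt8>) -> SIMD16<UInt8> {
        return SIMD16<UInt8>(
            highHalves[0], lowHalves[0],
            highHalves[1], lowHalves[1],
            highHalves[2], lowHalves[2],
            highHalves[3], lowHalves[3],
            highHalves[4], lowHalves[4],
            highHalves[5], lowHalves[5],
            highHalves[6], lowHalves[6],
            highHalves[7], lowHalves[7]
        )
    }
}

// Usage: the concrete element type is fixed by the constraint.
let sample = SIMD16<UInt8>.interleavedBytes(
    highHalves: SIMD8<UInt8>(repeating: 0xAA),
    lowHalves: SIMD8<UInt8>(repeating: 0x55)
)
print(sample)
```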
Since the caller is known to use only a single type with this generic function, shouldn't the optimizer be able to specialize it for SIMD16<UInt8>?
On a potentially related note, I couldn't get this to produce SIMD instructions on ARM at all (I think VZIP would be the correct instruction, but I'm not sure), regardless of constraints. I would just get 16 immediate loads.