Recommendations for SIMD intrinsic fallbacks?

As was suggested in this thread, I implemented a compare-and-movemask operation by importing the x86 _mm_cmpeq_epi8 and _mm_movemask_epi8 intrinsics.

#if arch(x86_64)

import _Builtin_intrinsics.intel

extension SIMD16 where Scalar == UInt8 
{
    // returns a bitmask with bit i set wherever self[i] == scalar
    // (_mm_movemask_epi8 packs lane 0 into the least-significant bit)
    func find(_ scalar:UInt8) -> UInt16
    {
        let repeated:Self       = .init(repeating: scalar)
        // the intrinsics traffic in SIMD2<Int64>, which is how __m128i is imported
        let mask:SIMD2<Int64>   = _mm_cmpeq_epi8(
            unsafeBitCast(self,     to: SIMD2<Int64>.self), 
            unsafeBitCast(repeated, to: SIMD2<Int64>.self))
        return .init(truncatingIfNeeded: _mm_movemask_epi8(mask))
    }
}

#endif

Now I’m looking to make this work on all Swift platforms, not just where #if arch(x86_64) evaluates to true. Does anyone have any recommendations for implementing efficient fallbacks for this operation?

I have generic Swift SIMD{N} code here:

// calls body(i) once for every lane index i where vector[i] == key
func find(_ key:UInt8, in vector:SIMD16<UInt8>, _ body:(Int) -> ())
{
    // (key: 5, vector: (1, 5, 1, 1, 5, 5, 1, 1, 1, 1, 1, 1, 5, 1, 1, 5))
    let places:SIMD16<UInt8>    = 
        .init(128, 64, 32, 16, 8, 4, 2, 1, 128, 64, 32, 16, 8, 4, 2, 1),
        match:SIMD16<UInt8>     = places.replacing(with: 0, where: vector .!= key)
    // match: ( 0, 64,  0,  0,  8,  4,  0,  0,  0,  0,  0,  0,  8,  0,  0,  1)
    let r8:SIMD8<UInt8> =    match.evenHalf |    match.oddHalf, 
        r4:SIMD4<UInt8> =       r8.evenHalf |       r8.oddHalf,
        r2:SIMD2<UInt8> =       r4.evenHalf |       r4.oddHalf
    let r:UInt16        = .init(r2.x) << 8  | .init(r2.y)
    // r: 0b0100_1100_0000_1001
    // walk the set bits of r from most significant to least significant;
    // lane i lives at bit position 15 - i, so the leading-zero count of r
    // is the index of the first matching lane
    var i:Int = r.leadingZeroBitCount
    while i < 16
    {
        body(i)
        // drop the bit just visited, then jump ahead to the next set bit
        i += 1 + ((r << i) << 1).leadingZeroBitCount
    }
}

but it generates very inefficient assembly.

Do you want to find every match, or does finding the first match actually suffice for your intended use case?

If the latter, on arm64 the usual idiom is to OR-NOT the compare mask into 0, 1, 2, 3, ..., n and then use the UMINV instruction to get the index of the first match (or 0b111...111 if no match).
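
In portable Swift that might look something like the sketch below (firstMatch is a made-up name; I'm leaning on the standard library's horizontal min() reduction, and whether the compare, select, and reduction actually lower to CMEQ, ORN, and UMINV is up to the compiler):

func firstMatch(of key:UInt8, in vector:SIMD16<UInt8>) -> Int?
{
    // lane indices 0 ... 15, with non-matching lanes forced to 0xff
    let indices:SIMD16<UInt8>  = .init(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
        selected:SIMD16<UInt8> = indices.replacing(with: .max, where: vector .!= key)
    // horizontal minimum: the index of the first match, or 0xff if there is none
    let index:UInt8 = selected.min()
    return index == .max ? nil : Int(index)
}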

If the former, on arm64, and-not mask into 1,2,4,...128,1,2,4,...128, break into two SIMD8 halves, ADDV across each half, then assemble as a | b << 8. On arm32, it's the same, but you don't have ADDV and have to use a sequence of ADDP instead (this is what your open-coded version is doing, but you may get better codegen via the intrinsic).
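
In portable Swift that works out to essentially the select you already wrote, just with the reduction spelled as a per-half sum (a sketch; matchMask is a made-up name, and whether wrappedSum() over a SIMD8 half actually lowers to ADDV depends on the compiler):

func matchMask(of key:UInt8, in vector:SIMD16<UInt8>) -> UInt16
{
    let places:SIMD16<UInt8> = .init(128, 64, 32, 16, 8, 4, 2, 1, 128, 64, 32, 16, 8, 4, 2, 1),
        match:SIMD16<UInt8>  = places.replacing(with: 0, where: vector .!= key)
    // each half contains every power of two at most once, so summing a half
    // reassembles one byte of the mask (lane 0 ends up in the most-significant bit)
    let a:UInt8 = match.lowHalf.wrappedSum(),
        b:UInt8 = match.highHalf.wrappedSum()
    return .init(a) << 8 | .init(b)
}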

Note that on some arm implementations, moving a result from NEON back to GPR can be somewhat slow (5-10 cycles), and NEON is under-provisioned relative to GPR instructions on most non-Apple designs (it is a pretty common design to have only two NEON pipes, but 4-6 GPR pipes); if you're using this in a loop to scan for matches, it's generally better to use SWAR on UInt64 fields instead of NEON on an unspecified arm core.
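
For the SWAR version, the usual shape is the zero-byte trick on a UInt64, roughly like the sketch below (swarMatches is a made-up name, and it assumes you already have 8 bytes of input packed into a UInt64):

func swarMatches(of key:UInt8, in word:UInt64) -> UInt64
{
    let lanes:UInt64 = 0x0101_0101_0101_0101,
        highs:UInt64 = 0x8080_8080_8080_8080
    // a byte of x is zero exactly where the corresponding byte of word equals key
    let x:UInt64 = word ^ (lanes &* UInt64(key))
    // exact zero-byte detection: sets the high bit of every zero byte of x and
    // nothing else (no carries cross byte boundaries, unlike the shorter
    // (x &- lanes) & ~x & highs variant, which has false positives)
    return ~(((x & ~highs) &+ ~highs) | x | ~highs)
}

Each matching byte contributes one set bit, so result.nonzeroBitCount counts the matches and result.trailingZeroBitCount / 8 gives the offset of the first match, if any (with byte 0 sitting in the low-order bits of the word).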

thank you!!

In theory I need to find every match, but in the vast majority of cases the first match succeeds (when doing the full-length key comparison), so subsequent matches can be ignored.

Is this the same as doing places.replacing(with: 0, where: vector .!= key), except with two SIMD8s instead of one SIMD16?

How do I use SWAR on UInt64 fields? I’m not very familiar with ARM, tbh.

Also, is there a list of all the #if arch(...) values that Swift supports? Or should I keep the standard library version of the find function around in an #else block?

Finally, this is only semi-related, but do you have tips on how to build and test the architecture-specific implementations? I only have access to an x86_64 platform, both on my local machine and on the GitHub Actions runner.