How to efficiently set one element of an SIMD value in memory?

i have code that clears a byte in an SIMD16<UInt8> value given an element index 0 ...< 14. However, no matter how I spell the assignment (do the assignment in a local var and then write the whole SIMD16<UInt8> value back into the array, or do the assignment through a pointer lvalue), the generated assembly seems to do silly things like copy the entire vector onto the stack, clear the byte in memory, then reload the entire vector back into a register, then write the entire vector back into memory again. For example, this pointer lvalue-based spelling generates this:

let destination:UnsafeMutablePointer<SIMD16<UInt8>> = 
    (buffer + base).assumingMemoryBound(to: SIMD16<UInt8>.self)
destination.pointee[i - 1] = 0
// index calculation
        addq    $-1, %rbx
        cmpq    $15, %rbx
        ja      .LBB4_23
// write vector to stack (vector was previously loaded into xmm1)
        movdqa  %xmm1, -32(%rbp)
// completely redundant index clamp
        andl    $15, %ebx
// clear the relevant byte in the vector in the stack
        movb    $0, -32(%rbp,%rbx)
// load the vector back into a register
        movdqa  -32(%rbp), %xmm0
// write the vector to its proper memory location
        movdqa  %xmm0, (%rdx,%r14)

the version written with read-mutate-writeback semantics generates the exact same assembly.

var tags:SIMD16<UInt8> = 
    buffer.load(fromByteOffset: base, as: SIMD16<UInt8>.self)
tags[i - 1] = 0
buffer.storeBytes(of: tags, toByteOffset: base, as: SIMD16<UInt8>.self)

how can i get Swift to do the simple thing and just assign the relevant byte in memory where it already lives?

note this is not about trying to assign an element of an SIMD value in a register, i already know that’s not efficient.

From what I've experimented with in the past, the SIMD types in the stdlib can give some unexpected assembly, and it can be a bit troublesome to get it to do exactly what you think it would do (at least, as of Swift 5.2.4).

But to address what you're asking for... try this:

extension SIMD16 {
    subscript(index: Int) -> Scalar {
        mutating get {
            withUnsafeBytes(of: &self) {
                $0.baseAddress!.assumingMemoryBound(to: Scalar.self)[index]
            }
        }
        set {
            withUnsafeMutableBytes(of: &self) {
                $0.baseAddress!.assumingMemoryBound(to: Scalar.self)[index] = newValue
            }
        }
    }
}

The mutating get doesn't actually mutate the vector, but it forces the compiler to reference self when accessing the scalar element.

Also, it looks like you may be compiling using -O. Try using -Ounchecked instead. It can help with better vectorization and assembly generation.

2 Likes

I'm slightly confused; if you want to operate on bytes in memory, operate on bytes in memory. Obfuscating that operation by dressing the access up in operations on some other type (whether that type is SIMD vectors or integers or something else) can only possibly hurt performance. Ideally the optimizer would eliminate the extra traffic, but why even introduce the possibility of something else happening?

1 Like

I agree. Sums up my thoughts exactly.

It is much more efficient to write simple, highly vectorizable code than to attempt to micro-optimize for milliseconds of gain.

Take this snippet, for instance:

func sum(buffer: UnsafeBufferPointer<Int>) -> Int {
    buffer.reduce(0, &+)
}

The compiler already generates efficient, vectorized code with minimal effort. No worrying about alignment, bounds checking, architecture, or hardware support. It's fully portable, and works on x86-64 and ARM.

Oftentimes, just writing the simplest version of your code is the best way to go.

Edit: changed example.

okay i omitted crucial context in my example, the reason i need to set the element of the vector in memory is because the vector gets interpreted as an SIMD16<UInt8> value when it gets read, so that the application can do 16 byte comparisons at once. but only one of the bytes ever gets written to at a time. i have no idea what the internal memory layout of the SIMD16<UInt8> value is, so i don’t know if i can just operate on the byte in memory by rebinding the value to UnsafePointer<UInt8>. Is element 0 always at the lowest memory address?

for context, the use case is implementing a specialized version of F14 hash maps.

1 Like

My usecase is doing parallel comparisons against a vector of byte-tags. unfortunately, this “simple” searching code does not vectorize on its own.

    func find(_ tag:UInt8, in buffer:UnsafePointer<UInt8>) -> UInt16 
    {
        (0 ..< 16).reduce(0){ $0 | (buffer[$1] == tag ? 1 << $1 : 0) }
    }
output.find(_: Swift.UInt8, in: Swift.UnsafePointer<Swift.UInt8>) 
-> Swift.UInt16:
        pushq   %rbp
        movq    %rsp, %rbp
        xorl    %edx, %edx
        movl    $1, %r8d
        xorl    %r9d, %r9d
        jmp     .LBB1_1
.LBB1_9:
        movl    %ecx, %eax
        andb    $15, %al
        shlxl   %eax, %r8d, %eax
.LBB1_10:
        orl     %r9d, %eax
        leaq    1(%rcx), %rdx
        movl    %eax, %r9d
        cmpq    $15, %rcx
        je      .LBB1_7
.LBB1_1:
        movq    %rdx, %rcx
.LBB1_2:
        leaq    1(%rcx), %rax
        cmpb    %dil, (%rsi,%rcx)
        jne     .LBB1_11
        testq   %rcx, %rcx
        js      .LBB1_4
        cmpq    $15, %rcx
        jg      .LBB1_12
        jmp     .LBB1_9
.LBB1_4:
        cmpq    $-15, %rcx
        jge     .LBB1_5
.LBB1_11:
        cmpq    $15, %rcx
.LBB1_12:
        movq    %rax, %rcx
        jne     .LBB1_2
        jmp     .LBB1_6
.LBB1_5:
        movl    %ecx, %eax
        negb    %al
        andb    $15, %al
        shrxl   %eax, %r8d, %eax
        jmp     .LBB1_10
.LBB1_6:
        movl    %r9d, %eax
.LBB1_7:
        popq    %rbp
        retq

so far, the only way i’ve been able to get this to vectorize is to use the builtin intrinsics as @scanon suggested in a different thread:

#if arch(x86_64)

import _Builtin_intrinsics.intel

extension SIMD16 where Scalar == UInt8 
{
    func find(_ scalar:UInt8) -> UInt16 
    {
        let repeated:Self       = .init(repeating: scalar)
        let mask:SIMD2<Int64>   = _mm_cmpeq_epi8(
            unsafeBitCast(self,     to: SIMD2<Int64>.self), 
            unsafeBitCast(repeated, to: SIMD2<Int64>.self))
        return .init(truncatingIfNeeded: _mm_movemask_epi8(mask))
    }
}

#endif
(extension in output):Swift.SIMD16< where A == Swift.UInt8>.find(Swift.UInt8) 
-> Swift.UInt16:
        pushq   %rbp
        movq    %rsp, %rbp

        vpbroadcastb    %edi, %xmm1
        vpcmpeqb        %xmm0, %xmm1, %k0

        kmovd   %k0, %eax
        popq    %rbp
        retq

that constitutes one half of my use case. now i’m trying to figure out how to set the byte values in the vector to be compared against.

this works! again though, can we assume the layout of the SIMD elements in memory?

Ah, in that case, going with @scanon's suggestion of builtin intrinsics is the way to go.

There are limitations to what LLVM's autovectorizer can do, and equality comparisons can be a bit tricky; you'd need to use some vectorizable SWAR logic. But at that point, might as well use intrinsics if you want to micro-optimize.

If memory layout guarantees are your concern, just forgo the stdlib SIMD types and write your own types that encapsulate the builtins. There technically isn't any official documentation on the memory layout of the stdlib SIMD vectors afaik.

1 Like

Yes.

2 Likes

perfect!