Any good documentation/examples on SIMD?

I have no experience with SIMD types, but I changed a Mandelbrot Swift benchmark to use SIMD, and it is in fact faster now.
Now I'm seeing that the highly optimized C++ and Rust versions of the benchmark use a special SIMD intrinsic ( _mm512_cmp_pd_mask ) to fold a SIMD8 double comparison mask into a byte.

I'm currently using an OR expression that checks each lane of the SIMD mask, like this:

var cmpresult: SIMDMask<SIMD8<Double.SIMDMaskScalar>>

cmpresult = Tr + Ti .< thresholds

let byte: UInt8 = ( cmpresult[0] ? 128 : 0 ) |
                  ( cmpresult[1] ? 64 : 0 ) |
                  ( cmpresult[2] ? 32 : 0 ) |
                  ( cmpresult[3] ? 16 : 0 ) |
                  ( cmpresult[4] ? 8 : 0 ) |
                  ( cmpresult[5] ? 4 : 0 ) |
                  ( cmpresult[6] ? 2 : 0 ) |
                  ( cmpresult[7] ? 1 : 0 )

Does Swift have a SIMD operation to get this in one step, like _mm512_cmp_pd_mask? Or can it be achieved faster?

Another annoying thing I found out is that the following code does not compile; the compiler takes too long to infer the type of 1<<7:

let byte:UInt8 =    ( cmpresult[0] != false ? 1<<7 : 0 ) |
                    ( cmpresult[1] != false ? 1<<6 : 0 ) |
                    ( cmpresult[2] != false ? 1<<5 : 0 ) |
                    ( cmpresult[3] != false ? 1<<4 : 0 ) |
                    ( cmpresult[4] != false ? 1<<3 : 0 ) |
                    ( cmpresult[5] != false ? 1<<2 : 0 ) |
                    ( cmpresult[6] != false ? 1<<1 : 0 ) |
                    ( cmpresult[7] != false ? 1<<0 : 0 )
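
As an aside, one workaround that should sidestep the type-checker blow-up, if the goal is just to avoid spelling the constants out by hand, is to give the shifted literals an explicit type once and fold the lanes in a loop. A minimal sketch with a made-up sample mask (whether it optimizes as well as the single expression is a separate question):

let mask: SIMDMask<SIMD8<Int64>> = [true, false, true, false, false, false, false, true]
let bits: [UInt8] = [1 << 7, 1 << 6, 1 << 5, 1 << 4, 1 << 3, 1 << 2, 1 << 1, 1 << 0]

var byte: UInt8 = 0
for lane in 0..<8 where mask[lane] {
    byte |= bits[lane]       // set the bit for every true lane
}
// byte == 0b1010_0001 == 161 for the sample mask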

In case somebody would like to help Swift get faster benchmark results, you can optimize the mandelbrot entry at Which programming language is fastest? | Computer Language Benchmarks Game.

Patrick @jollyjinx

One thing you could do to eliminate the need for all these bit shifts is to convert your mask of doubles to a mask of 8-bit integers (SIMDMask needs a signed scalar, so Int8):

let reducedMask: SIMDMask<SIMD8<Int8>> = [
  cmpresult[0],
  cmpresult[1],
  cmpresult[2],
  cmpresult[3],
  cmpresult[4],
  cmpresult[5],
  cmpresult[6],
  cmpresult[7]
]

This reduces the size of your mask from 64 bits to 8 bits. Then rebind the memory of your mask to a UInt8 and read out that value.

let byte = withUnsafeBytes(of: reducedMask) { $0.bindMemory(to: UInt8.self).baseAddress!.pointee }

I'm not sure that this will result in the performance gains you are looking for, but I hope it's already better than the bitshifting method.

Side note
My way of converting the 64-bit mask to 8-bit using the array literal still seems a bit like a hack. Maybe there is a better way to do this. Recently, @taylorswift also asked a question about this in: How to convert between SIMD mask types?
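
For reference, a hypothetical helper that does the same narrowing using only the SIMD element subscript, so it does not rely on any dedicated mask-conversion initializer (the function name is made up):

func narrowed(_ wide: SIMDMask<SIMD8<Int64>>) -> SIMDMask<SIMD8<Int8>> {
    var narrow = SIMDMask<SIMD8<Int8>>()   // all lanes start out false
    for i in 0..<wide.scalarCount {
        narrow[i] = wide[i]                // copy each Bool lane
    }
    return narrow
}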

Maybe I misunderstand what you wrote, but no bit shifting is actually done - the resulting code I had just ORs the cmpresult values into one byte.
I can't seem to get your code idea to work (even if I flip the order of the bits). But I had another idea: if I can't get the mask converted to a byte with one SIMD operation, maybe I can get the mask into a SIMD8 vector, multiply it by a fixed vector (128, 64, ..., 1), and sum up the result to get the byte value (no idea whether that's faster, though).

something like

cmpresult (true, false, true, false, false, false, false, true)
->
result8 (1, 0, 1, 0, 0, 0, 0, 1) * fixed (128, 64, 32, 16, 8, 4, 2, 1)
->
result8fixed (128, 0, 32, 0, 0, 0, 0, 1)
result8fixed.sum() == 161

but I have no idea how to get from a simd mask to a simd vector again.
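
For what it's worth, a minimal sketch of that multiply-and-sum idea, assuming the standard library's replace(with:where:) and wrappedSum() (the sample mask is made up):

let mask: SIMDMask<SIMD8<Int64>> = [true, false, true, false, false, false, false, true]
let weights: SIMD8<Int64> = [128, 64, 32, 16, 8, 4, 2, 1]

var ones = SIMD8<Int64>()                          // starts as all zeros
ones.replace(with: 1, where: mask)                 // 1 in every lane where the mask is true
let byte = UInt8((ones &* weights).wrappedSum())   // 128 + 32 + 1 == 161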

I'm now using the following:

cmpresult = Tr + Ti .< thresholds

let reduced: SIMD8<Int64> = unsafeBitCast(cmpresult, to: SIMD8<Int64>.self)
let summask: SIMD8<Int64> = ramp & reduced
let byte  =   summask[0] +
              summask[1] +
              summask[2] +
              summask[3] +
              summask[4] +
              summask[5] +
              summask[6] +
              summask[7]

return UInt8(byte)

which is marginally faster. It seems that the addition of all the numbers gets replaced by the corresponding SIMD instruction.
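
For what it's worth, the standard library also has wrappedSum(), which should lower to the same horizontal add as the eight explicit additions. A minimal sketch of the same reduction, with the ramp and a made-up sample mask defined inline:

let ramp: SIMD8<Int64> = [128, 64, 32, 16, 8, 4, 2, 1]
let mask: SIMDMask<SIMD8<Int64>> = [true, false, true, false, false, false, false, true]

// A true lane is stored as all ones (-1), so AND-ing with the ramp keeps the weight or zero.
let reduced = unsafeBitCast(mask, to: SIMD8<Int64>.self)
let byte = UInt8((ramp & reduced).wrappedSum())    // 128 + 32 + 1 == 161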


Well, by bit shifting I meant all your 1<<7, 1<<6, 1<<5, ... expressions.

Hmmm, indeed I was mistaken: my reduced mask was 8 bytes (instead of 8 bits) long, so it reduced the 512-bit mask to a 64-bit mask. And with the unsafe rebinding, I was only reading the first 8 bits of that 64-bit value, so it didn't work.
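
For concreteness, the Int8-lane mask occupies one 0x00/0xFF byte per lane (8 bytes total), so reading a single UInt8 only ever sees lane 0. A quick way to see this, with a made-up sample mask:

let narrow: SIMDMask<SIMD8<Int8>> = [true, false, true, false, false, false, false, true]
withUnsafeBytes(of: narrow) { raw in
    print(raw.count)     // 8 -- one byte per lane, not one bit
    print(Array(raw))    // [255, 0, 255, 0, 0, 0, 0, 255] -- true lanes are all ones
}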

Your unsafeBitCast is also cleaner and safer than my withUnsafeBytes method.

Ah, I see. Compilers have constant-folded bit shifts into values since the '90s, so I never bothered writing the values out before; except that I can't do it that way in Swift, as the compiler then fails to compile it.

I forgot to attach the ramp SIMD definition, which is:

let ramp: SIMD8<Int64> = [128, 64, 32, 16, 8, 4, 2, 1]

The whole program is now about 1.8 times slower than the C++ version, but the C++ version is only fast on Intel processors with the required SIMD support, whereas the Swift version is fast on the M1 as well ;-). A Mac mini M1 is not quite as fast as an MBP i9 (4.9 seconds M1 Swift vs. 4.5 seconds i9 Swift), but the MBP is WAY louder ;-D
