I have no experience with SIMD types, but I changed a Mandelbrot Swift benchmark to use SIMD, and it is in fact faster now.
Now I see that the highly optimized C++ and Rust versions of the benchmark use a special SIMD operation (_mm512_cmp_pd_mask) to fold a SIMD8 Double mask into a byte.
I'm currently using an OR operation over each lane of the SIMD, like this:
This reduces the size of your mask from 64 bits to 8 bits. Then rebind the memory of your mask to a UInt8 and read out that value.
let byte = withUnsafeBytes(of: reducedMask) { $0.bindMemory(to: UInt8.self).baseAddress!.pointee }
I'm not sure this will give you the performance gains you are looking for, but I hope it's already better than the bit-shifting method.
Side note
My way of converting the 64-bit mask to 8 bits using the array literal still seems a bit like a hack. Maybe there is a better way to do this. Recently, @taylorswift also wrote a question about this in: How to convert between SIMD mask types?
Maybe I misunderstand what you wrote, but the bit shifting is actually not done: the resulting code I had was just OR'ing the cmpresult values into one byte.
I can't seem to get your code idea to work (even if I flip the order of bits). But I had another idea: if I can't get the mask converted to a byte with one SIMD operation, maybe I could get the mask into a SIMD8 vector, multiply each lane by a fixed vector (128, 64, 8), and sum up the result to get the byte value as well (no idea if that's faster, though).
Well, by bit shifting I meant all your 1<<7, 1<<6, 1<<5, ... statements.
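For readers following along, a lane-by-lane OR fold of the kind being discussed could be sketched roughly as below. This is only an illustration, not the code from the benchmark: the function name `foldByOr` and the choice of lane 0 as the most significant bit are assumptions.

```swift
// Hypothetical helper (not the benchmark's code): folds an 8-lane mask
// into one byte, with lane 0 ending up in the most significant bit.
func foldByOr(_ mask: SIMDMask<SIMD8<Int64>>) -> UInt8 {
    var byte: UInt8 = 0
    for lane in 0..<8 where mask[lane] {
        // Lane 0 sets 1<<7, lane 1 sets 1<<6, ..., lane 7 sets 1<<0.
        byte |= 1 << (7 - lane)
    }
    return byte
}

// Comparing a SIMD8<Double> yields a mask whose storage is SIMD8<Int64>.
let sampleValues = SIMD8<Double>(5, 1, 5, 1, 1, 1, 1, 5)
let sampleByte = foldByOr(sampleValues .<= 4.0)
```

Each lane costs one test and one OR here, which is the per-lane work a single mask instruction would avoid.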
Hmm, indeed I was mistaken: my reduced mask was 8 bytes (instead of 8 bits) long. So it reduced the 512-bit mask to a 64-bit mask. And with the unsafe rebinding, I was only reading the first 8 bits of the 64-bit value, so that didn't work.
Your unsafeBitCast is also cleaner and safer than my withUnsafeBytes method.
Ah, I see. Compilers have resolved constant bit shifts to plain values since the '90s, so I never bothered writing the values out by hand before. In Swift I can't use the shifts, though, because the compiler can't compile the expression then.
I forgot to attach the ramp SIMD definition which is:
let ramp: SIMD8<Int64> = [128, 64, 32, 16, 8, 4, 2, 1]
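To make the ramp idea concrete, here is a minimal sketch of how a mask could be folded with it. It is not the exact code from the benchmark: the helper name `foldByRamp` and the constant name `rampWeights` are made up for illustration, and it uses `replacing(with:where:)` to select the weights rather than the `unsafeBitCast` mentioned above. I have not measured whether it compiles to the same instructions.

```swift
// Per-lane weights: lane 0 maps to the most significant bit of the byte.
// Same values as the ramp in this thread, under a hypothetical name.
let rampWeights: SIMD8<Int64> = [128, 64, 32, 16, 8, 4, 2, 1]

// Hypothetical helper: keeps the weight in every lane whose mask is true,
// leaves zero in the others, then adds the lanes up. The weights are
// distinct powers of two, so the sum equals the OR of the selected bits.
func foldByRamp(_ mask: SIMDMask<SIMD8<Int64>>) -> UInt8 {
    let selected = SIMD8<Int64>.zero.replacing(with: rampWeights, where: mask)
    // The sum is at most 255, so truncating to UInt8 loses nothing.
    return UInt8(truncatingIfNeeded: selected.wrappedSum())
}

// A comparison on SIMD8<Double> produces exactly this mask type.
let rampInput = SIMD8<Double>(5, 1, 5, 1, 1, 1, 1, 5)
let rampByte = foldByRamp(rampInput .<= 4.0)
```

The select and the horizontal sum both work on the whole vector at once, so there is no per-lane branching in the source, though what the optimizer emits depends on the target.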
The whole program is now about 1.8 times slower than the C++ version, but the C++ version is only fast on Intel processors with SIMD, whereas the Swift version is fast on M1 as well ;-). A Mac mini M1 is not quite as fast as an MBP i9 (4.9 seconds M1 Swift vs. 4.5 seconds i9 Swift), but the MBP is WAY louder ;-D