I've been playing with SIMD decoding of base32 for the past month. I initially implemented base32 decoding in pure Swift using SIMD widths up to x64 (512 bits). Once I realized my CPU doesn't support x64, I scaled it back to x32.
This led me to discover that using anything wider than SIMD8 drastically reduces performance: even x16 is slightly slower than x8.
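For context, here's a simplified sketch of the kind of SIMD8 decode step I mean (not the exact code from the sources linked below; names are placeholders, and there is no input validation or padding handling):

```swift
// Simplified sketch of one SIMD8 base32 decode step (RFC 4648 alphabet,
// no validation or padding handling). Names here are illustrative only.
func decodeBase32Group(_ ascii: SIMD8<UInt8>) -> [UInt8] {
    // 'A'...'Z' (65...90) -> 0...25, '2'...'7' (50...55) -> 26...31
    let letters = ascii &- SIMD8<UInt8>(repeating: 65)
    let digits  = ascii &- SIMD8<UInt8>(repeating: 24)
    let isDigit = ascii .< SIMD8<UInt8>(repeating: 65)
    let values  = letters.replacing(with: digits, where: isDigit)

    // Pack the eight 5-bit values into 40 bits, then emit 5 output bytes.
    var bits: UInt64 = 0
    for i in 0..<8 { bits = (bits << 5) | UInt64(values[i]) }
    return (0..<5).map { UInt8(truncatingIfNeeded: bits >> (32 - 8 * $0)) }
}
```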
While doing this I found that Swift's performance isn't at all comparable to that of a pure C implementation. I implemented both x16 and x32 in pure C; the x32 C version gave up to a 25% performance increase, whereas in Swift the x32 version was significantly slower.
In my testing, pure C decoded a block of around 200 KB over 1024 iterations in 80 ms in x16 mode.
Pure Swift does the same work in about 1.6 s in x16 mode and ~1.1 s in x8 mode, around 20x slower.
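A minimal harness along these lines should reproduce the measurement (a sketch with placeholder names, not the code from the linked sources; `decode` stands in for whichever implementation is under test):

```swift
import Dispatch

// Minimal benchmark sketch: decode a ~200 KB buffer 1024 times and report
// the elapsed time in seconds. `decode` is whichever implementation is
// being measured (placeholder, not from the linked sources).
func benchmark(_ decode: ([UInt8]) -> [UInt8]) -> Double {
    let input = [UInt8](repeating: UInt8(ascii: "A"), count: 200_000)
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<1024 {
        _ = decode(input)
    }
    let end = DispatchTime.now().uptimeNanoseconds
    return Double(end - start) / 1_000_000_000
}
```

This sketch assumes an optimized (-O) build; in a debug build Swift's generic SIMD operations aren't specialized, so everything is much slower and the gap would be exaggerated.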
Here are some sources that demonstrate the issue described above:
The sources are not of the best quality (since this is a prototype) and the pure Swift implementation is quite unsafe (oh well), but this is the fastest I could get it to run.
Is this a bug, or is my code somehow flawed?
Is this level of difference expected, or am I doing something wrong here?
I built this with the current Xcode 11.4 betas (1, 2, and 3).