Swift's SIMD from std appears to be broken (?)

kozlowsqi · March 12, 2020, 5:33pm

I've been playing with SIMD decoding of base32 for the past month. I initially implemented base32 decoding using SIMD up to x64 in pure Swift. When I realized that my CPU doesn't support x64 (512 bits), I downgraded it to x32.

This led me to discover that using anything but SIMD8 drastically reduces performance. Even x16 is slightly slower than x8.

While doing this I found out that Swift's performance isn't at all comparable to that of a pure C implementation. I implemented both x16 and x32 in pure C. With the x32 version I saw up to a 25% increase in performance (Swift performance decreased significantly in x32).

In my testing, pure C decoded a block of around 200k over 1024 iterations in 80ms in x16 mode.
Pure Swift does the same work in 1.6s in x16 mode and in ~1.1s in x8 mode (around 20x slower).

Here are some sources that reflect the issue described:
Sources

The sources are not of the best quality (since this is a prototype) and the pure Swift implementation is quite unsafe (oh well), but this is the fastest I could get it to run.

Is this a bug, or is my code somehow flawed?
Is this expected or am I doing something wrong here?

I built this using current Xcode 11.4 beta 1, 2 and 3

scanon · March 12, 2020, 6:05pm

Not broken, but:

Swift doesn't yet have a good model for enabling CPU features.
There's a bunch of work to be done to lower the SIMD API directly to LLVM SIMD builtins, rather than depending on the optimizer to do the work.

I.e. this is basically expected, and will get some fixes when we have time for it. In the short term, note that you can use the Swift types with the normal C intrinsics (i.e. basically use C code as-is) and match C codegen. This isn't as nice as it will eventually be, but it works.

Also, the whole reason we put the SIMD types into the standard library before the optimization work was finished is to enable interop with C functions that traffic in SIMD types. So if you have a working C implementation of something, you can simply use it directly from Swift.

kozlowsqi · March 12, 2020, 6:11pm

What about the slowdown from x8 to x16 to x32 to x64?

I would expect Swift to maintain its performance in any case. But going from x8 to x64 is a significant slowdown (1.1s to ~15 seconds). Is this also expected?

scanon · March 12, 2020, 6:19pm

That's an expected consequence of the two issues I called out. Because there's no convenient way to target specific CPU features, the baseline instruction set is used. That means that even if optimization produces SIMD codegen, wider-than-16B vectors will frequently not produce faster code, and it's less likely that the optimizer will produce good vector code at all.
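To make the "wider-than-16B" point concrete, here is a minimal sketch in plain standard-library Swift. It only illustrates the data layout: a `SIMD32<UInt8>` is 32 bytes, so on a target limited to 128-bit registers the same addition has to be carried out as (at best) two 16-byte operations. The variable names are made up for the illustration, and the code says nothing about what the optimizer actually emits.

```swift
// A SIMD32<UInt8> holds 32 bytes, twice the width of a 128-bit register.
let a = SIMD32<UInt8>(repeating: 3)
let b = SIMD32<UInt8>(repeating: 4)

// One 32-byte wrapping addition, written with the wide type.
let wide = a &+ b

// The same work written as two 16-byte additions, which is roughly the
// best a 128-bit-only target can do for the wide version.
let split = SIMD32<UInt8>(
    lowHalf: a.lowHalf &+ b.lowHalf,
    highHalf: a.highHalf &+ b.highHalf
)

print(MemoryLayout<SIMD16<UInt8>>.size)  // 16
print(MemoryLayout<SIMD32<UInt8>>.size)  // 32
print(wide == split)                     // true
```

In other words, without wider registers enabled, moving from x16 to x32 cannot reduce the number of vector operations, it can only add overhead.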

There's an umbrella bug for a CPU features model here: [SR-11660] Umbrella: function multiversioning and dispatch on CPU features · Issue #54069 · apple/swift · GitHub. I haven't had a chance to do any real work on it, but it has some thoughts about what sorts of things we want to support in a design.

I'm hoping to have some time to work on SIMD codegen issues (bypassing the optimizer and lowering directly to LLVM vectors) in the near future.

porterchild · March 14, 2020, 12:03am

Not super confident I'm following everything here, but it looks like this is a prototype of what you're talking about:
https://github.com/markuswntr/simdx
In a quick test of basic math operations, I'm seeing 6x speedup compared to regular Swift SIMD.
Maybe some ideas for implementation there.

I'm excited about SIMD's being so usable now, thanks for the great work!

Jens · March 14, 2020, 9:22am

I haven't looked closely at your code, but I'm just wondering about the purpose of the Box class?

Also, if you drill into the time profiling results, what (exactly) is most time spent on?

kozlowsqi · March 14, 2020, 10:42am

On the decode method itself. In an optimized and unchecked build, most time is spent (assuming the time profiler isn't lying here) on the witness table jmp for Equatable.== (that is in assembly). When built with symbols, this translates to .< or .>

Box is a really really bad hack to convert memory at UnsafeMemoryPointer<UInt8> to SIMD8<UInt8> and SIMD16<UInt8> respectively. This is to make sure the initializer isn't getting called. That is the fastest I could get it. The same could probably be accomplished by casting the pointer and accessing the first member (without actually doing unsafeBitCast).
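For readers wondering what a Box-free load could look like: the sketch below reads 16 bytes from a byte buffer directly as a `SIMD16<UInt8>`. It assumes Swift 5.7 or later, where `loadUnaligned(fromByteOffset:as:)` is available on raw buffer pointers; that API postdates this thread, and `loadVector` is a hypothetical helper name, not something from the linked sources.

```swift
// Hypothetical helper: read 16 bytes starting at `offset` as one vector.
// Uses loadUnaligned(fromByteOffset:as:), which needs Swift 5.7 or later.
func loadVector(from bytes: [UInt8], at offset: Int) -> SIMD16<UInt8> {
    precondition(offset >= 0 && offset + 16 <= bytes.count, "out of bounds")
    return bytes.withUnsafeBytes { raw in
        // No alignment requirement, and no per-lane initializer involved.
        raw.loadUnaligned(fromByteOffset: offset, as: SIMD16<UInt8>.self)
    }
}

let input = Array("ABCDEFGHIJKLMNOPQRSTUVWXYZ234567".utf8)
let v = loadVector(from: input, at: 1)
print(v[0], v[15])  // 66 81, i.e. "B" and "Q"
```

Whether this is as fast as the Box hack would have to be measured; the point is only that the reinterpretation can be expressed without a wrapper class.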

Note here that initializers are a huge cost. In particular, SIMD16<UInt8>(repeating:) needs to be initialized outside of the function, because otherwise it gets called every time the function is called even though it is constant. When I put the tables inside the decode(...) function, the biggest factor in profiling was the initializer. I suspect that most of the runtime cost comes from using non-SIMD instructions to do SIMD things. I know that .replacing(..) on SIMD types is implemented using a simple loop, and the compiler probably doesn't optimize this to a pblendw instruction (from what I can tell by looking at the disassembly..?).
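As an illustration of the pattern described here (constants hoisted out of the hot function, then a pointwise comparison feeding `replacing(with:where:)`), here is a small sketch. It is not the code from the linked sources: the `Tables` enum and `decodeValues` are hypothetical names, the mapping covers only the RFC 4648 alphabet `A`-`Z` and `2`-`7`, and there is no validation of invalid input.

```swift
// Hypothetical constant tables, hoisted out of the hot function so they are
// not rebuilt on every call.
enum Tables {
    static let upperA = SIMD16<UInt8>(repeating: 65)       // ASCII "A"
    static let digitOffset = SIMD16<UInt8>(repeating: 24)  // "2" (50) maps to 26
}

/// Maps 16 base32 characters to their 5-bit values (no input validation).
func decodeValues(_ ascii: SIMD16<UInt8>) -> SIMD16<UInt8> {
    let isLetter = ascii .>= Tables.upperA        // pointwise comparison mask
    let fromLetters = ascii &- Tables.upperA      // "A"..."Z" -> 0...25
    let fromDigits = ascii &- Tables.digitOffset  // "2"..."7" -> 26...31
    // Lane-wise select: take fromLetters where the mask is true.
    return fromDigits.replacing(with: fromLetters, where: isLetter)
}

let chars = SIMD16<UInt8>(Array("AZ27AZ27AZ27AZ27".utf8))
let values = decodeValues(chars)
print(values[0], values[1], values[2], values[3])  // 0 25 26 31
```

This is exactly the kind of code where the concern applies: whether `.>=` and `replacing(with:where:)` compile down to a compare and a blend depends on the optimizer, so it is worth checking the disassembly.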

This is still quite slow. My current implementation of the base32 decoder in C can do 25GB/s (on an i9 MBP from 2018) (actually 20x faster I think..?), and from Swift I would expect at least 20GB/s, allowing for the added overhead from safety checks. My friend implemented this in Rust and got similar numbers, so Swift should have no problems once this is properly implemented.

I also think that the Swift compiler should warn you when you use things unsupported by your architecture. By this I mean that a fairly large part of the operations exposed by the SIMD protocol and the SIMD... structs will have to be done serially or with unexpected performance. For example, there is no << or >> for SIMD8<UInt8> (on x86_64 with default CPU features), and .< isn't available for some types on x86_64 with default CPU features either. I would expect a warning at compile time when an implementation for my architecture isn't available, is slower than expected, or requires extra features that I haven't enabled, so that I can explicitly implement a more efficient version and/or enable CPU features.

scanon · March 14, 2020, 7:21pm

This is not what the SIMD API is for. It exists to provide a uniform programming model (even supporting HW with no SIMD capabilities at all), so that higher-level functionality can be built on top of it without worrying about these issues. It sounds like you really want to be writing intrinsics, which is fine--you can do that in Swift today.

kozlowsqi · March 14, 2020, 8:35pm

Well yes and no. I want to be able to write generic code that supports any architecture and supports any size ...but at the same time some parts might be incompatible and this should be known at compile-time.

For example, SSE registers are 128 bits wide, so I would expect the compiler to emit a warning when I attempt to compile code using SIMD32<UInt8> without enabling AVX2 for that function. The same goes for .< and >>. It's not something that would make the build fail, but it would enable more correct code.

I don't think this would interfere with the uniformity of the model since enabling/adding that feature should silence the warning and the swift compiler could provide implementation for incompatible platforms then. I think that you also should be able to provide your own implementation for incompatible platforms. That would ensure that swift never breaks the promise of vectorizing SIMD types.
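Until something like that exists in the compiler, one rough way to approximate per-platform choices by hand is conditional compilation. The sketch below is only an assumption about how that might look: `NativeBytes` and `countOccurrences` are hypothetical names, the 16-byte choice reflects the 128-bit baseline registers of x86_64 (SSE2) and arm64 (NEON), and the 8-byte fallback for other architectures is arbitrary. It is written for clarity, not speed.

```swift
// Pick a vector width per architecture at compile time.
#if arch(x86_64) || arch(arm64)
typealias NativeBytes = SIMD16<UInt8>  // 128-bit baseline registers
#else
typealias NativeBytes = SIMD8<UInt8>   // arbitrary fallback (assumption)
#endif

/// Hypothetical example: count bytes equal to `needle`, one chunk at a time.
func countOccurrences(of needle: UInt8, in data: [UInt8]) -> Int {
    let width = NativeBytes.scalarCount
    let pattern = NativeBytes(repeating: needle)
    var total = 0
    var index = 0
    while index + width <= data.count {
        let chunk = NativeBytes(data[index..<index + width])
        let matches = chunk .== pattern  // pointwise equality mask
        for lane in 0..<width where matches[lane] { total += 1 }
        index += width
    }
    // Scalar loop for the tail that does not fill a whole vector.
    for byte in data[index...] where byte == needle { total += 1 }
    return total
}

print(countOccurrences(of: 65, in: Array("ABRACADABRA, ABRACADABRA".utf8)))  // 10
```

This only selects a width; it does not enable any CPU features, so it is no substitute for the multiversioning work tracked in SR-11660.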