Using SIMD types for vector addition

In the example below I use SIMD and Accelerate to add two large vectors. The SIMD4 and SIMD64 versions are much slower than the Accelerate version. Maybe Accelerate is doing something under the hood to optimize this problem. Anyway, I have never used SIMD types in Swift before, so can someone comment on my use of SIMD here and whether I used it correctly?

import Accelerate

func runSimd4(_ a: [Double], _ b: [Double]) {
    var c: [Double] = Array(repeating: 0.0, count: a.count)
    let tic = Date.now

    // Define the chunk size based on the SIMD vector type (e.g., SIMD4<Double>)
    let chunkSize = 4

    // Process the arrays in chunks
    let remainderStartIndex = a.count - (a.count % chunkSize)

    for i in stride(from: 0, to: remainderStartIndex, by: chunkSize) {
        // Create SIMD vectors for the current chunk
        let simdA = SIMD4<Double>(a[i], a[i+1], a[i+2], a[i+3])
        let simdB = SIMD4<Double>(b[i], b[i+1], b[i+2], b[i+3])

        // Perform element-wise addition using SIMD
        let simdResult = simdA + simdB

        // Store the result back into the result array
        c[i] = simdResult[0]
        c[i+1] = simdResult[1]
        c[i+2] = simdResult[2]
        c[i+3] = simdResult[3]
    }

    // Handle the remainder elements that don't fit into SIMD chunks
    for i in remainderStartIndex..<a.count {
        c[i] = a[i] + b[i]
    }

    let toc = tic.timeIntervalSinceNow.magnitude

    print("SIMD4 elapsed time ........ \(toc) s")
    //print("SIMD4 result ........", c)
}

func runSimd64(_ a: [Double], _ b: [Double]) {
    var c: [Double] = Array(repeating: 0.0, count: a.count)
    let tic = Date.now

    // Define the chunk size based on the SIMD vector type (e.g., SIMD64<Double>)
    let chunkSize = 64

    // Process the arrays in chunks
    let remainderStartIndex = a.count - (a.count % chunkSize)

    for i in stride(from: 0, to: remainderStartIndex, by: chunkSize) {
        // Create SIMD vectors for the current chunk
        let simdA = SIMD64<Double>(a[i..<i+chunkSize])
        let simdB = SIMD64<Double>(b[i..<i+chunkSize])

        // Perform element-wise addition using SIMD
        let simdResult = simdA + simdB

        // Store the result back into the result array
        for j in 0..<chunkSize {
            c[i + j] = simdResult[j]
        }
    }

    // Handle the remainder elements that don't fit into SIMD chunks
    for i in remainderStartIndex..<a.count {
        c[i] = a[i] + b[i]
    }

    let toc = tic.timeIntervalSinceNow.magnitude

    print("SIMD64 elapsed time ....... \(toc) s")
    //print("SIMD64 result ........", c)
}

func runAccelerate(_ a: [Double], _ b: [Double]) {
    let tic = Date.now
    let c = vDSP.add(a, b)
    let toc = tic.timeIntervalSinceNow.magnitude

    print("Accelerate elapsed time ... \(toc) s")
    //print("Accelerate result ...", c)
}

// --- Run examples ---

//let a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10.0]
//let b = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11.0]

let n = 100_000
let a = (0..<n).map { _ in Double.random(in: 0...1) }
let b = (0..<n).map { _ in Double.random(in: 0...1) }

runAccelerate(a, b)
runSimd4(a, b)
runSimd64(a, b)

The elapsed times from an Intel Mac are shown below. I did not run the code with any optimizations enabled.

Accelerate elapsed time ... 0.0004929 s
SIMD4 elapsed time ........ 0.01182 s
SIMD64 elapsed time ....... 0.09705 s
1 Like

> I did not run the code with any optimizations enabled.

Well, there it is.

3 Likes

It seems unfortunate that unoptimized code that calls Accelerate would be so much faster than unoptimized code that performs equivalent SIMD operations. Are SIMD operations reliant on optimizer heroics?

Unoptimized code will not be competitive with an optimized library, no matter how it is written; this has nothing to do with optimizer heroics or lack thereof.

The benchmark in question is also critically broken in a few ways; I'll expand on that later today when I have a few minutes.

5 Likes

I think this assertion relies on the assumption that the unoptimized code and the optimized library are implementing the same technique. But even then, it doesn’t mean that “optimizer heroics” aren’t involved: it could be that the compiler generates a whole bunch of crappy unoptimized memory operations around the SIMD operations that are cleaned up by a SIMD-aware optimizer pass. We have seen this exact situation with excessive refcounting; that discussion is where I cribbed “optimizer heroics” from.

Let me be more precise then: unoptimized code will not be competitive with a library whose entire reason for existence is to provide the exact same operations in hand-optimized as-fast-as-possible form.

6 Likes

I didn't think it would matter whether optimizations were turned on, because I thought SIMD types would already be optimized. Anyway, I compiled and ran the examples with `swiftc main.swift -Ounchecked` and it does make a difference. Below are the elapsed time results. The SIMD4 example is now faster than the Accelerate example.

Accelerate elapsed time ... 0.00036001 s
SIMD4 elapsed time ........ 0.00006700 s
SIMD64 elapsed time ....... 0.00674701 s

This still assumes that the library implements a technique at least roughly similar to the unoptimized code. I have seen cases where unoptimized CPU code that dispatches to unoptimized GPU shaders beats hand-optimized, battle-tested CPU-only code.

You obviously know that this is a valid comparison between vDSP and Gavin’s code, but Gavin didn’t know that, which is why he asked whether vDSP was employing some alternative approach that is inherently faster than his naïve SIMD implementation.

1 Like

There seem to be two documentation pages for SIMD: one from the Swift docs and one from the Accelerate docs. Are these different implementations of SIMD, or are they the same thing?

SIMD is a protocol abstracting over types defined by the Swift standard library.

simd is a module, corresponding to the /usr/include/simd/ headers on Apple platforms, that defines a set of operations extending those types.
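
For example (a quick sketch; the + operator comes from the standard library, while length(_:) is one of the simd module's free functions):

import simd  // Apple platforms only; surfaces the /usr/include/simd/ operations

let v = SIMD4<Double>(1, 2, 3, 4)  // SIMD4<Double> is a standard-library type
let w = v + v                      // elementwise + from the standard library
let len = simd.length(v)           // length(_:) from the simd module
print(w, len)                      // SIMD4(2.0, 4.0, 6.0, 8.0) 5.477...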

Ok, so what do I not like about the benchmark?

  1. It was built without optimizations; that's essentially never what you want to do. You already fixed that, and now SIMD4 is faster than Accelerate! So we're done, right? Well...

  2. I don't like that it reports the time of a single run. For long-running tasks that get to something like steady state, this can make sense, but for small primitive operations you generally want to report the minimum time over many runs¹ (a harness along these lines is sketched after this list). After making this change, I see:

    Accelerate elapsed time ... 3.993511199951172e-05 s
    SIMD4 elapsed time ........ 8.690357208251953e-05 s
    

    (I'm going to leave SIMD64 out from here on, but I'll come back to it later.)
    Now Accelerate is more than 2x faster! What's going on?!

  3. I also don't love the vector length of 100000; this means that your working set is 100000 elements/vector * 3 vectors * 8 bytes/element = 2.4 MB, too large to reap the full benefit of caching. So a fully-optimized implementation is bottlenecked on the memory hierarchy rather than compute. Let's trim the vector length down to 1024 and do 100 iterations for each measurement to compensate:

    Accelerate elapsed time ... 2.9921531677246094e-05 s
    SIMD4 elapsed time ........ 7.18832015991211e-05 s
    

    That didn't speed things up as much as I thought it would; why not?

  4. Ah, we're timing vDSP.add(a, b). That has to allocate c on every call, but we're interested in making the arithmetic fast. We can switch to using vDSP.add(a, b, result: &c) instead to re-use the result storage. Likewise, we can hoist the allocation of c out of the timing loop for the Swift implementation (both changes appear in the sketch after this list):

    Accelerate elapsed time ... 1.1920928955078125e-05 s
    SIMD4 elapsed time ........ 5.4955482482910156e-05 s
    

    Ok, now these are timings I can believe. 1.19e-5 seconds to do 100k additions is 0.12ns / addition, or around 3 adds per cycle, which is at least in the right ballpark. So why is SIMD4 slow?

  5. First, a diversion: always try to write scalar code first and see what the optimizer can make of it. If I do a scalar implementation:

     for i in a.indices { c[i] = a[i] + b[i] }
    

    and just let the optimizer do its work, I'm already there:

    Scalar elapsed time ........ 1.2993812561035156e-05 s
    

    so there's really not much reason to go to explicit SIMD for this problem (or any problem where the loop is non-conditional and everything stays nicely in-lane).

  6. Setting that aside, the SIMD arithmetic itself would be fine, but how you construct the SIMD vectors (and get scalars out of the result) is making it slow (more precisely, it causes the optimizer to not generate SIMD instructions at all because of knock-on effects). Initializing them with SIMD4<Double>(a[i ..< i+4]) gets you halfway there:

    SIMD4 elapsed time ........ 3.993511199951172e-05 s
    

    but unpacking the vectors to the array of scalars is a little bit fussier to do right. I've been meaning to make some utilities public that make this sort of "take a SIMD building-block and apply it to arrays" operation easier to write, but, as noted above, that should often be written in scalar code anyway, so it has less value than it seems like at first. Anyway, I'll take this as a hint it might be useful to some folks. (One way to do the unpacking is sketched just below.)
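
    Here's a hedged sketch of one way to do that unpacking (a sketch, not those utilities; it uses the raw-buffer loadUnaligned/storeBytes from SE-0349, so it needs Swift 5.7, to move whole 4-lane chunks instead of going through four scalar subscripts):

    func addSimd4(_ a: [Double], _ b: [Double], result: inout [Double]) {
        let chunkBytes = MemoryLayout<SIMD4<Double>>.stride  // 32 bytes per 4-lane chunk
        let vectorized = (a.count / 4) * 4
        let vectorizedBytes = vectorized * MemoryLayout<Double>.stride
        a.withUnsafeBytes { ra in
            b.withUnsafeBytes { rb in
                result.withUnsafeMutableBytes { rc in
                    var offset = 0
                    while offset < vectorizedBytes {
                        // Load and store whole SIMD4 chunks rather than individual scalars.
                        let va = ra.loadUnaligned(fromByteOffset: offset, as: SIMD4<Double>.self)
                        let vb = rb.loadUnaligned(fromByteOffset: offset, as: SIMD4<Double>.self)
                        rc.storeBytes(of: va + vb, toByteOffset: offset, as: SIMD4<Double>.self)
                        offset += chunkBytes
                    }
                }
            }
        }
        // Scalar remainder for counts that aren't a multiple of 4.
        for i in vectorized..<a.count { result[i] = a[i] + b[i] }
    }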


¹ Why the minimum? Because, roughly speaking, nothing can happen on a computer to make it go faster, but lots of things are happening that cause little transient slowdowns. So timing measurements at small scales look like "real time + Poisson noise", and taking the minimum recovers the best estimate of the "real time".
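
For reference, here's a minimal sketch of the measurement harness points 2 and 4 describe (the helper name and run count are mine): allocate the result once, then report the minimum time over many runs.

import Accelerate

// Report the minimum elapsed time over `runs` invocations of `body`.
func minimumTime(runs: Int = 100, _ body: () -> Void) -> Double {
    var best = Double.infinity
    for _ in 0..<runs {
        let tic = Date.now
        body()
        best = min(best, tic.timeIntervalSinceNow.magnitude)
    }
    return best
}

let x = (0..<1024).map { _ in Double.random(in: 0...1) }
let y = (0..<1024).map { _ in Double.random(in: 0...1) }
var z = [Double](repeating: 0.0, count: x.count)  // allocated once, outside the timed region

// vDSP.add(_:_:result:) writes into z's existing storage instead of allocating per call.
print("Accelerate minimum time ...", minimumTime { vDSP.add(x, y, result: &z) })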

24 Likes

The plain Swift code `for i in a.indices { c[i] = a[i] + b[i] }` is basically the same speed as `vDSP.add` in Accelerate. So what's the point of using Accelerate for vector addition if plain Swift is just as fast?

It's something like ~8% faster, which I suppose might matter in some cases. But yeah, I was wondering the same thing. I was also wondering whether any Swift implementations of matrix operations might benefit enough from compiler optimizations to be within the same ballpark.

I know that a lot of this is designated as future work for Swift numerics, but this has me wanting it even more than before. Accelerate is great for Apple's platforms, but things get messy as soon as you want to also work on Linux/Windows ... if we're getting this close with plain Swift, I wonder how close we are to cross-platform linear algebra just being written in Swift?

The vDSP library is about 30 years old (and its direct predecessors were a decade or two older than that); optimizing compilers were not always as good as they are today. If someone were writing a vDSP-like library from scratch today, they might not include some of these operations (or they might--a function call is still a little bit tidier than a for loop, depending on your taste).

There's also an aspect of future-proofing. If Apple launches the M3000 processor tomorrow, with a special "add 1000 doubles at once" instruction, Accelerate will be updated to take advantage of that and your app will benefit automatically.¹ Autovectorizer work in the compiler sometimes lags somewhat behind what hand-tuned libraries can provide, and you would at least have to recompile your app to benefit, plus handle detecting the instruction and falling back to a different implementation.


¹ This example is contrived, but only a little. If, when Apple shipped the first 64-bit Intel MacBook Pro in 2007, you compiled an app that used vDSP_vaddD (the C equivalent of your vDSP.add() call), and then never updated it, it would have gone twice as fast (cycle for cycle) when Haswell shipped in 2013, and then twice as fast again on an AVX-512 machine, without you ever needing to recompile the binary, without needing to select between different implementations for different micro-architectures or even be aware that it was possible, and without users needing to do anything. That has real value.

11 Likes

I rewrote my examples based on your comments (see revised code below). The benchmark function performs 10 iterations to calculate total elapsed time and average elapsed time.

/*
Compare Swift, Accelerate, and SIMD4 vector addition.

Compile with `swiftc main.swift -Ounchecked`
Run with `./main`
*/

import Accelerate

func benchmark(operation: () -> ()) -> (Double, Double) {
    var totalTime = 0.0

    for _ in 1...10 {
        let tic = Date.now
        operation()
        let toc = tic.timeIntervalSinceNow.magnitude
        totalTime += toc
    }

    let averageTime = totalTime / 10.0
    return (totalTime, averageTime)
}

// --- Functions to benchmark ---

func add(_ a: [Double], _ b: [Double], result: inout [Double]) {
    for i in a.indices {
        result[i] = a[i] + b[i]
    }
}

func addSimd4(_ a: [Double], _ b: [Double], result: inout [Double]) {

    // Define the chunk size based on the SIMD vector type (e.g. SIMD4)
    let chunkSize = 4

    // Process the arrays in chunks
    let remainderStartIndex = a.count - (a.count % chunkSize)

    for i in stride(from: 0, to: remainderStartIndex, by: chunkSize) {
        // Create SIMD vectors for the current chunk
        let simdA = SIMD4<Double>(a[i ..< i+chunkSize])
        let simdB = SIMD4<Double>(b[i ..< i+chunkSize])

        // Perform element-wise addition using SIMD
        let simdResult = simdA + simdB

        // Store the result back into the result array
        result[i] = simdResult[0]
        result[i+1] = simdResult[1]
        result[i+2] = simdResult[2]
        result[i+3] = simdResult[3]
    }

    // Handle the remainder elements that don't fit into SIMD chunks
    for i in remainderStartIndex..<a.count {
        result[i] = a[i] + b[i]
    }
}

// --- Run checks ---

let a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10.0]
let b = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11.0]

var checkSwift = [Double](repeating: 0.0, count: a.count)
add(a, b, result: &checkSwift)

var checkAccel = [Double](repeating: 0.0, count: a.count)
vDSP.add(a, b, result: &checkAccel)

var checkSimd4 = [Double](repeating: 0.0, count: a.count)
addSimd4(a, b, result: &checkSimd4)

print("Check results")
print("Swift ........ \(checkSwift)")
print("Accelerate ... \(checkAccel)")
print("SIMD4 ........ \(checkSimd4)")

// --- Run benchmarks ---

let n = 100_000
let c = (0..<n).map { _ in Double.random(in: 0...1) }
let d = (0..<n).map { _ in Double.random(in: 0...1) }

var benchSwift = [Double](repeating: 0.0, count: c.count)
let (totalSwift, avgSwift) = benchmark { add(c, d, result: &benchSwift) }

var benchAccel = [Double](repeating: 0.0, count: c.count)
let (totalAccel, avgAccel) = benchmark { vDSP.add(c, d, result: &benchAccel) }

var benchSimd4 = [Double](repeating: 0.0, count: c.count)
let (totalSimd4, avgSimd4) = benchmark { addSimd4(c, d, result: &benchSimd4) }

print("\nTotal elapsed time in seconds for 10 iterations, vector size \(n)")
print("Swift ........ \(String(format: "%.8f", totalSwift))")
print("Accelerate ... \(String(format: "%.8f", totalAccel))")
print("SIMD4 ........ \(String(format: "%.8f", totalSimd4))")

print("\nAverage elapsed time in seconds for 10 iterations, vector size \(n)")
print("Swift ........ \(String(format: "%.8f", avgSwift))")
print("Accelerate ... \(String(format: "%.8f", avgAccel))")
print("SIMD4 ........ \(String(format: "%.8f", avgSimd4))")

Running this gives me the results shown below. Performing the addition with plain Swift is definitely slower than Accelerate and SIMD4.

Check results
Swift ........ [3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0]
Accelerate ... [3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0]
SIMD4 ........ [3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0]

Total elapsed time in seconds for 10 iterations, vector size 100000
Swift ........ 0.00250185
Accelerate ... 0.00040710
SIMD4 ........ 0.00042605

Average elapsed time in seconds for 10 iterations, vector size 100000
Swift ........ 0.00025018
Accelerate ... 0.00004071
SIMD4 ........ 0.00004261

The Swift version is doing overflow checking. What happens if you change a[i] + b[i] to a[i] &+ b[i]?

Not for Double. Overflow produces infinity per IEEE-754 rules, no checks required.

(However, for integer arithmetic, you'd absolutely want to do that to unlock vectorization)
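
For illustration, a minimal sketch with integer arrays (Int32 here is just an example type):

func addWrapping(_ a: [Int32], _ b: [Int32], result: inout [Int32]) {
    for i in a.indices {
        // &+ wraps on overflow instead of trapping, so no check blocks vectorization.
        result[i] = a[i] &+ b[i]
    }
}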

6 Likes

If I use &+ like this

func add(_ a: [Double], _ b: [Double], result: inout [Double]) {
    for i in a.indices {
        result[i] = a[i] &+ b[i]
    }
}

then I get a compile-time error

main.swift:28:26: error: binary operator '&+' cannot be applied to two 'Double' operands
        result[i] = a[i] &+ b[i]
                    ~~~~ ^  ~~~~
error: fatalError

When I compile and run without using -Ounchecked I get the following:

Average elapsed time in seconds for 10 iterations, vector size 100000
Swift ........ 0.04840972 -
Accelerate ... 0.00003947 1226.49x faster
SIMD4 ........ 0.09849452 0.49x faster

And with -Ounchecked I get:

Average elapsed time in seconds for 10 iterations, vector size 100000
Swift ........ 0.00024961 -
Accelerate ... 0.00003650 6.84x faster
SIMD4 ........ 0.00004189 5.96x faster

The Accelerate code has similar performance whether I use unchecked optimizations or not. How does Accelerate accomplish this without using optimizations?

Accelerate is a precompiled library included with the OS, so your optimization settings don't matter.

Honestly, I expected the SIMD types to be similarly pre-optimized, since the operators are defined in the standard library (well, the simd module), so I'm a bit confused about why those types need optimization. Other Swift overhead, I guess?

3 Likes