Convert arrays of Int32 to arrays of Int for integer arithmetic

The function shown below performs element-wise addition of two vectors. The vectors are arrays containing integer values of type Int. The printed output of the result is [5, 7, 9, 11, 13]. I am only showing addition here, but I also have similar functions for subtraction, multiplication, and division of integer arrays.

func add(_ a: [Int], _ b: [Int]) -> [Int] {
    var vec = [Int](repeating: 0, count: a.count)
    for i in 0..<a.count {
        vec[i] = a[i] + b[i]
    }
    return vec
}

let a = [1, 2, 3, 4, 5]
let b = [4, 5, 6, 7, 8]
let c = add(a, b)
print(c)

I noticed that Accelerate has some functions for performing integer arithmetic (see here). However, these functions operate on Int32 values instead of Int. So I have to convert the integer arrays to Int32 before I can use the Accelerate functions. See below for an example.

import Accelerate

func add(_ a: [Int], _ b: [Int]) -> [Int] {
    let x = a.map { Int32($0) }
    let y = b.map { Int32($0) }
    var z = [Int32](repeating: 0, count: x.count)
    vDSP_vaddi(x, 1, y, 1, &z, 1, vDSP_Length(x.count))

    let vec = z.map { Int($0) }
    return vec
}

let a = [1, 2, 3, 4, 5]
let b = [4, 5, 6, 7, 8]
let c = add(a, b)
print(c)

I have several questions related to this:

  1. Is there a better way to convert the Int arrays to be compatible with Accelerate functions? Or should I just stick with my original function and do the integer arithmetic without Accelerate?
  2. Why doesn't Accelerate have functions that support other integer types like Int, Int16, and Int128?
  3. Instead of doing integer arithmetic, should I just do these operations with floating point values such as Float or Double then convert the final result to integer values?
1 Like

Quoting @scanon from your previous thread:

If you’re just doing 4 native-integer additions, you probably don’t need to use vDSP. CPUs have already figured out how to issue multiple integer operations without requiring code to be rewritten.

But you will want to use overflow-disabled operators to ensure optimal codegen:
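The snippet was elided from the quote above; here is a minimal sketch (my own illustration, not the original poster's code) of what the wrapping operators look like:

```swift
// `&+` is Swift's wrapping (overflow-masking) addition: it never traps,
// so the compiler can omit per-element overflow checks in hot loops.
let x = [1, 2, 3, 4]
let y = [5, 6, 7, 8]

var sums = [Int]()
sums.reserveCapacity(x.count)
for i in x.indices {
    sums.append(x[i] &+ y[i])
}
print(sums) // [6, 8, 10, 12]
```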

3 Likes

Int is really big, so there wouldn't be much to optimize here, on Arm64 at least. The vector registers are 128 bits wide, so you could only process 2 Ints at a time. Intel's AVX2 instructions have 256-bit vectors, which can do 4 Ints at a time, but that's still not many. (Pour one out for AVX-512.) I suppose the GPU could do a better job, if you had enough integers to justify the communication overhead.

Btw you can use zip(_:_:) and map to simplify your for loop:

func add(_ a: [Int], _ b: [Int]) -> [Int] {
    zip(a, b).map(+)
}

Not only is it shorter/simpler, but it has the added benefit of removing the need to fill in all the 0s to initialize the resulting array.

2 Likes

Absolutely do not convert to Int32 and then call vDSP_vaddi.

This sort of operation is performance-limited by load-store operations, not the arithmetic itself. So if you want to go fast, the #1 consideration is that you should only load or store each piece of data one time. Making multiple passes over the data is always slower than making a single pass, when there is so little "real work" to be done.

You really want to write a for loop and let the autovectorizer handle this (the same goes for any straightforward vector operation that only touches each element once and has no conditional operations). This can be pretty simple:

func add(_ a: [Int], _ b: [Int]) -> [Int] {
    precondition(a.count == b.count)
    return [Int](unsafeUninitializedCapacity: a.count) { buffer, count in
        for i in a.indices { buffer[i] = a[i] &+ b[i] }
        count = a.count
    }
}

this produces the following core loop for generic x86:

0:  movdqu  xmm0, xmmword ptr [rbp + 8*rdx + 32]
    movdqu  xmm1, xmmword ptr [rbp + 8*rdx + 48]
    movdqu  xmm2, xmmword ptr [r13 + 8*rdx + 32]
    paddq   xmm2, xmm0
    movdqu  xmm0, xmmword ptr [r13 + 8*rdx + 48]
    paddq   xmm0, xmm1
    movdqu  xmmword ptr [r14 + 8*rdx + 32], xmm2
    movdqu  xmmword ptr [r14 + 8*rdx + 48], xmm0
    add     rdx, 4
    cmp     rcx, rdx
    jne     0b

and similar code for ARM64. If we turn on AVX512, we'll get AVX512 instructions:

0:  vmovdqu64       zmm0, zmmword ptr [r13 + 8*rdx + 32]
    vmovdqu64       zmm1, zmmword ptr [r13 + 8*rdx + 96]
    vmovdqu64       zmm2, zmmword ptr [r13 + 8*rdx + 160]
    vmovdqu64       zmm3, zmmword ptr [r13 + 8*rdx + 224]
    vpaddq  zmm0, zmm0, zmmword ptr [rbp + 8*rdx + 32]
    vpaddq  zmm1, zmm1, zmmword ptr [rbp + 8*rdx + 96]
    vpaddq  zmm2, zmm2, zmmword ptr [rbp + 8*rdx + 160]
    vpaddq  zmm3, zmm3, zmmword ptr [rbp + 8*rdx + 224]
    vmovdqu64       zmmword ptr [r14 + 8*rdx + 32], zmm0
    vmovdqu64       zmmword ptr [r14 + 8*rdx + 96], zmm1
    vmovdqu64       zmmword ptr [r14 + 8*rdx + 160], zmm2
    vmovdqu64       zmmword ptr [r14 + 8*rdx + 224], zmm3
    add     rdx, 32
    cmp     rcx, rdx
    jne     0

These run exactly as fast as the loads and stores do¹, and therefore are much, much preferred to anything that involves conversions or multiple passes over the data. Those can only slow you down.


¹ It would be nice if LLVM didn't unroll them quite so much; it's unnecessary and makes them slightly less efficient than they could be, in both code size and in performance for ragged lengths, but it's still "fine," and certainly much better than doing extra work.

11 Likes

Not sure about performance, but I just wanted to comment on using existing functions as building blocks for idiomatic algorithms.

Swift comes with zip, which turns a tuple of sequences into a sequence of tuples. It also comes with map, which (given a binary operation) turns a sequence of tuples into a sequence of operation results.

Combining the two you get very succinct and readable code, using simple primitives:

let a = [1, 2, 3, 4, 5]
let b = [4, 5, 6, 7, 8]

let sums = zip(a, b).map(+)
// [5, 7, 9, 11, 13]
let products = zip(a, b).map(*)
// [4, 10, 18, 28, 40]

Not sure how effective the compiler is at optimizing these into performant code, but once you learn the primitives, the code is glanceable.

1 Like

Sadly, even if you pass in &+, zip(a, b).map doesn't seem to vectorize.

public func add(_ a: [Int], _ b: [Int]) -> [Int] {
    [Int](count: min(a.count, b.count)) {
        a[$0] &+ b[$0]
    }
}

extension Array {
    @inlinable
    public init(count: Int, generating element: (Int) throws -> Element) rethrows {
        try self.init(unsafeUninitializedCapacity: count) { buffer, initializedCount in
            for i in 0..<count {
                buffer.initializeElement(at: i, to: try element(i))
                initializedCount &+= 1
            }
        }
    }
}

Really seems to be the best you can do.

1 Like

First question: how are you getting this LLVM code (I think that is what it's called)? Is it from Xcode? Are you generating it from the Swift compiler in the terminal? Or from somewhere else? And how do you know, by looking at this output, that the code is performant?

1 Like

How can you tell that the zip(a, b).map code isn't getting vectorized?

1 Like

I ran it through godbolt, so I could look at the assembly. I'm not at my computer right now, but I will provide a link as soon as I'm able.

1 Like

Ah, yes, Compiler Explorer (a.k.a. godbolt). https://godbolt.org

1 Like

It's (x86) assembly, not LLVM. The LLVM IR is also accessible (Compiler Explorer; see vector.body for the IR that eventually becomes the x86 loop I posted).

You can get the generated assembly from godbolt, tell Swift to generate it from the command line (-emit-assembly), or disassemble the resulting binary (on Mac, I use otool -tvV [binary] to disassemble the text section).
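For example (a sketch of the two command-line routes; assumes a Swift toolchain on macOS and a file named main.swift):

```shell
# Ask the compiler for optimized assembly directly
swiftc -O -emit-assembly main.swift -o main.s

# Or compile, then disassemble the text section of the binary
swiftc -O main.swift -o main
otool -tvV main
```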

Knowing that it's performant is a bit of an art and a science. I wrote vector code by hand for over a decade, so for simple loops like this a quick glance is enough to know. If you haven't done that, a few obvious things to check:

  1. There are no unnecessary loop-carried dependencies (i.e. registers whose value on this loop iteration was computed in the previous loop iteration). These introduce latency chains that prevent HW from extracting all the available ILP.

  2. There are no scalar instructions other than incrementing the induction variable and maybe bounds checks. Vector code should be all vector code, as much as possible. Here having some familiarity with the ISA helps. movdqu is a SIMD load or store. paddq is a SIMD add of 2xInt64.

  3. No unnecessary loads or stores.

  4. Enough unrolling to mostly amortize the loop accounting overhead. The first loop is unrolled 2x (i.e. two SIMD vectors of results are generated on each iteration). In each iteration, the CPU needs to do 4 x load, 2 x SIMD ALU operation, 2 x store, plus three GPR operations (add, compare, branch). This makes 11 operations in total. Exact micro-architectural width of CPUs varies, but 2 loads per cycle and 1 store per cycle is fairly typical. That leaves 4 arithmetic operations and a branch every two cycles, which is achievable on anything recent. So this loop runs approximately at the speed of just the necessary loads and stores; you can't go faster no matter what you do (other than allowing newer SIMD ISA features to get wider loads and stores).

    On older x86, this could also be limited by the retirement cap (the maximum number of raw instructions or [fused] macro-ops that can be retired per cycle); we need to retire 10 or 11 things per two cycles, which wouldn't necessarily be attainable, but even on an older 4-per-cycle machine this loop will run at 8/10 = 80% of peak in steady-state, which is totally acceptable for most uses. Getting to 90% for large vectors by unrolling another 2x generally isn't worth the codesize increase and additional cleanup code, unless you know that all your vectors are multiple-of-some-power-of-two length.

Obviously, there's a bunch of other stuff to learn to do this sort of analysis, but those are the high level points that allow me to glance at the loop and say "yup, looks good, there's no benefit to further optimization work here."

15 Likes

Both of these functions appear to use SIMD movdqu and paddq according to Compiler Explorer. So why use the [Int](unsafeUninitializedCapacity:) initializer instead of the [Int](repeating:) initializer?

func add(_ a: [Int], _ b: [Int]) -> [Int] {
    precondition(a.count == b.count)
    var vec = [Int](repeating: 0, count: a.count)
    for i in 0..<a.count {
        vec[i] = a[i] &+ b[i]
    }
    return vec
}
func add(_ a: [Int], _ b: [Int]) -> [Int] {
    precondition(a.count == b.count)
    let vec = [Int](unsafeUninitializedCapacity: a.count) { buffer, initializedCount in
        for i in a.indices {
            buffer[i] = a[i] &+ b[i]
        }
        initializedCount = a.count
    }
    return vec
}

Both use SIMD operations to do the addition, but the (repeating: 0, count: a.count) version also calls memset to set the result vector to zero first, resulting in two passes over the destination buffer.

Memset is very fast, but this will cut your speed roughly in half some of the time. Still better would be to change the API (or add a parallel API) to take an existing buffer to write the results into instead of creating a new array.
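A minimal sketch of what such a destination-buffer API could look like (the `add(_:_:into:)` signature is my own invention, not something proposed in this thread):

```swift
// Writes element-wise sums into a caller-provided array, so repeated
// calls reuse one allocation instead of creating a new array each time.
func add(_ a: [Int], _ b: [Int], into out: inout [Int]) {
    precondition(a.count == b.count && a.count == out.count)
    out.withUnsafeMutableBufferPointer { buffer in
        for i in a.indices {
            buffer[i] = a[i] &+ b[i]
        }
    }
}

var result = [Int](repeating: 0, count: 5)
add([1, 2, 3, 4, 5], [4, 5, 6, 7, 8], into: &result)
print(result) // [5, 7, 9, 11, 13]
```

The one-time zero-fill when `result` is created is then amortized across however many calls reuse the same buffer.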

3 Likes

I couldn't resist doing a benchmark of the different methods discussed in this topic. Below is the code I used to run the benchmark along with the results. The benchmark does 5 iterations where each iteration adds two integer arrays and each array contains 40,000,000 elements. I had to crank up the array size to see any noticeable difference between the different methods. As expected, the [Int](unsafeUninitializedCapacity:) approach was faster than the other methods.

import Accelerate

// --- Benchmark function ---

func benchmark(operation: () -> ()) {
    let n = 5
    var totalTime = 0.0

    for _ in 1...n {

        let tic = Date.now
        operation()
        let toc = tic.timeIntervalSinceNow.magnitude

        totalTime += toc
    }

    let averageTime = totalTime / Double(n)

    print("Number of iterations     \(n)")
    print("Total time (seconds)     \(String(format: "%.8f", totalTime))")
    print("Average time (seconds)   \(String(format: "%.8f", averageTime))")
}

// --- Integer addition functions ---

func add(_ a: [Int], _ b: [Int]) -> [Int] {
    var vec = [Int](repeating: 0, count: a.count)
    for i in 0..<a.count {
        vec[i] = a[i] + b[i]
    }
    return vec
}

func addOverflow(_ a: [Int], _ b: [Int]) -> [Int] {
    precondition(a.count == b.count)
    var vec = [Int](repeating: 0, count: a.count)
    for i in 0..<a.count {
        vec[i] = a[i] &+ b[i]
    }
    return vec
}

func addDspVaddi(_ a: [Int], _ b: [Int]) -> [Int] {
    let x = a.map { Int32($0) }
    let y = b.map { Int32($0) }
    var z = [Int32](repeating: 0, count: x.count)
    vDSP_vaddi(x, 1, y, 1, &z, 1, vDSP_Length(x.count))

    let vec = z.map { Int($0) }
    return vec
}

func addUnsafe(_ a: [Int], _ b: [Int]) -> [Int] {
    precondition(a.count == b.count)
    let vec = [Int](unsafeUninitializedCapacity: a.count) { buffer, initializedCount in
        for i in a.indices {
            buffer[i] = a[i] &+ b[i]
        }
        initializedCount = a.count
    }
    return vec
}

// --- Run checks ---

let x = [1, 2, 3, 4, 5]
let y = [4, 5, 6, 7, 8]

let checkOutput = """
Check function output
add()           \(add(x, y))
addOverflow()   \(addOverflow(x, y))
addDspVaddi()   \(addDspVaddi(x, y))
addUnsafe()     \(addUnsafe(x, y))
"""

print(checkOutput)

// --- Run benchmark ---

let n = 40_000_000
let a = (0..<n).map { _ in Int.random(in: 0...20) }
let b = (0..<n).map { _ in Int.random(in: 0...20) }

print("\nBenchmark add()")
benchmark { let _ = add(a, b) }

print("\nBenchmark addOverflow()")
benchmark { let _ = addOverflow(a, b) }

print("\nBenchmark addDspVaddi()")
benchmark { let _ = addDspVaddi(a, b) }

print("\nBenchmark addUnsafe()")
benchmark { let _ = addUnsafe(a, b) }

And here are the results from my ancient Intel MacBook Pro using swiftc main.swift -Ounchecked to compile the code.

Check function output
add()           [5, 7, 9, 11, 13]
addOverflow()   [5, 7, 9, 11, 13]
addDspVaddi()   [5, 7, 9, 11, 13]
addUnsafe()     [5, 7, 9, 11, 13]

Benchmark add()
Number of iterations     5
Total time (seconds)     0.55623209
Average time (seconds)   0.11124642

Benchmark addOverflow()
Number of iterations     5
Total time (seconds)     0.45895708
Average time (seconds)   0.09179142

Benchmark addDspVaddi()
Number of iterations     5
Total time (seconds)     1.41264689
Average time (seconds)   0.28252938

Benchmark addUnsafe()
Number of iterations     5
Total time (seconds)     0.38857019
Average time (seconds)   0.07771404
2 Likes

Is there a reason add and addDspVaddi don't have precondition(a.count == b.count)?

Oops, I forgot to put the precondition statement in those functions. I added the statements and ran the benchmarks again and got similar results as before.