Unexpected performance using SIMD + OperationQueues

I've been playing around with SIMD in Swift. I wrote a very simple program that takes in an array of UInt32s and iterates over it, doubling the values.

Here is the first version of that code:

import simd  // provides simd_uint16 (Apple platforms)

func grow(array: inout ContiguousArray<UInt32>) {
    for i in 0..<array.count {
        array[i] *= 2
    }
}

func simd_grow(array: inout ContiguousArray<UInt32>) {
    let ops = array.count/16
    array.withUnsafeMutableBufferPointer {
        $0.baseAddress!.withMemoryRebound(to: simd_uint16.self, capacity: ops) { simd_array in
            for t in 0..<ops {
                simd_array[t] &*= 2
            }
        }
    }
}
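One caveat worth flagging about simd_grow: integer division truncates, so any trailing elements beyond the last full vector of 16 are silently left untouched. A sketch of one way to handle that, using the standard library's SIMD16<UInt32> (same layout as simd_uint16, but needs no Apple-only import); the scalar tail loop and the function name are my own additions:

```swift
// Doubles every element, including the tail that doesn't fill a full
// 16-lane vector. The SIMD body mirrors the original simd_grow.
func simd_grow_with_tail(array: inout ContiguousArray<UInt32>) {
    let ops = array.count / 16
    array.withUnsafeMutableBufferPointer { buf in
        buf.baseAddress!.withMemoryRebound(to: SIMD16<UInt32>.self, capacity: ops) { simd_array in
            for t in 0..<ops {
                simd_array[t] &*= 2
            }
        }
        // Scalar cleanup for the last array.count % 16 elements.
        for i in (ops * 16)..<buf.count {
            buf[i] &*= 2
        }
    }
}
```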

Here's some sample output that reflects rough performance numbers:

Buffer len 16000000000
SIMD start
SIMD time elapsed: 9.82564401626587
LOOP start
LOOP time elapsed: 20.030798077583313

This all seemed reasonable to me. The SIMD version of the function is definitely faster than the simple loop. Perhaps not quite as fast as I would have expected, but I figured the compiler was optimizing the loop implementation (this was a release build).

I then wanted to try distributing this work across multiple threads, which led me to write this function:

func simd_queue_grow(array: ContiguousArray<UInt32>) -> ContiguousArray<UInt32> {
    var array = array
    let queue = OperationQueue()
    let thread_count = 10
    let simd_width = 16
    queue.qualityOfService = .userInitiated
    queue.maxConcurrentOperationCount = thread_count
    let simd_capacity = array.count / simd_width
    let chunk_size = simd_capacity / thread_count

    array.withUnsafeMutableBufferPointer { array_mem in
        array_mem.baseAddress!.withMemoryRebound(to: simd_uint16.self, capacity: simd_capacity) { simd_array in
            for thread_num in 0..<thread_count {
                queue.addOperation {
                    let chunk_base = thread_num * chunk_size
                    for t in chunk_base..<(chunk_base + chunk_size) {
                        simd_array[t] &*= 2
                    }
                }
            }
            // Wait inside the closure so the rebound pointer stays valid
            // while the operations are still running.
            queue.waitUntilAllOperationsAreFinished()
        }
    }
    return array
}

With 10 "threads", I imagined this would be roughly an order of magnitude faster, but the results were actually far worse:

SIMDTHREAD start
SIMDTHREAD time elapsed: 30.01075792312622

I figure there is either something about OperationQueues or the semantics of the code I have written that I don't quite understand. Due to the escaping Operations, I am unable to use an inout ContiguousArray, but I didn't think that would account for such a significant degradation of performance. Was hoping for a little help understanding what I might be missing. Thanks!

I think I'm just stupid. Here's a version that is quite a bit faster:

func simd_queue_grow(array: inout ContiguousArray<UInt32>) {
    let queue = OperationQueue()
    let thread_count = 10
    let simd_width = 16
    queue.qualityOfService = .userInitiated
    queue.maxConcurrentOperationCount = thread_count
    let simd_capacity = array.count / simd_width
    let chunk_size = simd_capacity / thread_count

    array.withUnsafeMutableBufferPointer { array_mem in
        array_mem.baseAddress!.withMemoryRebound(to: simd_uint16.self, capacity: simd_capacity) { simd_array in
            for thread_num in 0..<thread_count {
                queue.addOperation {
                    let chunk_base = thread_num * chunk_size
                    for t in chunk_base..<(chunk_base + chunk_size) {
                        simd_array[t] &*= 2
                    }
                }
            }
            // Wait inside the closure so the rebound pointer stays valid
            // while the operations are still running.
            queue.waitUntilAllOperationsAreFinished()
        }
    }
}

This clocks in at ~2x the performance of the SIMD loop (5 seconds for the buffer size and machine I am using).

I would appreciate any other recommendations that could improve performance!

Could you please share how you generate the test array and how you are measuring the execution time?

Here's my test code. It's part of a Swift package, compiled via swift run -c release:

import Foundation  // for CFAbsoluteTimeGetCurrent

print("Create buffer")
var populations = ContiguousArray<UInt32>.init(repeating: 1, count: 1000000000 * 16)  // var, since it's passed inout below
print("Buffer len \(populations.count)")


print("SIMD start")
let simd_start = CFAbsoluteTimeGetCurrent()
simd_grow(array: &populations)
let simd_elapsed = CFAbsoluteTimeGetCurrent() - simd_start
print("SIMD time elapsed: \(simd_elapsed)")

print("LOOP start")
let loop_start = CFAbsoluteTimeGetCurrent()
grow(array: &populations)
let loop_elapsed = CFAbsoluteTimeGetCurrent() - loop_start
print("LOOP time elapsed: \(loop_elapsed)")


print("SIMDTHREAD start")
let sq_start = CFAbsoluteTimeGetCurrent()
simd_queue_grow(array: &populations)
let sq_elapsed = CFAbsoluteTimeGetCurrent() - sq_start
print("SIMDTHREAD time elapsed: \(sq_elapsed)")

Ah ok. So one other interesting thing to note: I was unfamiliar with the semantics of &*=. Replacing the *= operation in the simple loop with &*= produces performance comparable to the simd implementation. So is the compiler automatically SIMDing this for me? I'm unable to import simd in Compiler Explorer to find out.

Those are called overflow operators. They wrap around on overflow instead of trapping, which skips the runtime overflow check and speeds up calculations, at the risk of silently wrong results if you're not careful.

Have a look here: Documentation
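A minimal illustration of the difference (my own toy example): the overflow variants wrap modulo the type's width instead of trapping, and skipping that per-operation check is also what makes them friendlier to vectorization.

```swift
let x: UInt8 = 200
// &* wraps modulo 2^8 instead of trapping:
let wrapped = x &* 2      // 400 % 256 == 144
print(wrapped)
// `x * 2` would instead crash at runtime with an overflow error,
// since 400 doesn't fit in UInt8.
```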

FWIW for “fanning out” an operation like this I would reach for DispatchQueue.concurrentPerform. Operation queues are more useful for situations where you have a dependency graph of fairly large high level operations.
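For reference, the basic shape of that API (a toy example, not the SIMD code from this thread): concurrentPerform runs its closure the given number of times, potentially in parallel, and blocks the caller until every iteration has returned.

```swift
import Dispatch

var squares = [Int](repeating: 0, count: 4)
squares.withUnsafeMutableBufferPointer { buf in
    let base = buf.baseAddress!
    // Blocks here until all 4 iterations have finished, so the pointer
    // never outlives the buffer. Each iteration writes only to its own
    // slot, so no locking is needed.
    DispatchQueue.concurrentPerform(iterations: 4) { i in
        base[i] = i * i
    }
}
print(squares)  // [0, 1, 4, 9]
```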


You are not doing several operations at once, are you?

I don't know if simd allows several integer multiplications at once or not. Assuming you need to multiply all 32-bit integers by two (which is equivalent to shifting left by 1), you could shift two back-to-back values at once by treating them as a single 64-bit value.

Edit: that's what the simd-based implementation is doing inside:

@inline(never)
func foobarbaz(_ v: simd_uint16) -> simd_uint16 {
    v &* 2
}
0x1000034b0 <+0>:  shl.4s v0, v0, #0x1
0x1000034b4 <+4>:  shl.4s v1, v1, #0x1
0x1000034b8 <+8>:  shl.4s v2, v2, #0x1
0x1000034bc <+12>: shl.4s v3, v3, #0x1
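In plain Swift you can check that equivalence directly (each shl.4s above shifts four 32-bit lanes, so the four instructions cover all 16 lanes):

```swift
// Doubling an unsigned value is a left shift by one (with wraparound),
// which is exactly what the compiler emitted above.
let v = SIMD16<UInt32>(repeating: 21)
let doubled = v &* 2
let shifted = v &<< 1
print(doubled == shifted)  // true
```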

I played around a bit more and this was the fastest implementation I could put together:

Thanks @David_Smith for the convenient DispatchQueue.concurrentPerform.

func other_simd_concurrent(array: inout ContiguousArray<UInt32>) {
    let threads = 32
    let simd_count = array.count / 8
    let chunk_size = simd_count / threads
    array.withUnsafeMutableBufferPointer { array_mem in
        array_mem.baseAddress!.withMemoryRebound(to: SIMD8<UInt32>.self, capacity: simd_count) { simd_array in
            DispatchQueue.concurrentPerform(iterations: threads) { thread in
                for simdi in thread*chunk_size..<(thread+1)*chunk_size {
                    simd_array[simdi] &*= 2
                }
            }
        }
    }
}

On my machine, ~32 threads proved to be close to optimal, with some wiggle room in either direction.

Happy to be shown a faster version though (not counting something written in Metal 🙂).


Your functions simd_grow and other_simd_concurrent don't actually double the values in the array. The input array stays the same.