Multithreading concurrent SIMD tasks

Dev1an · March 16, 2019, 10:35am

I am stumbling upon an issue with concurrency and SIMD. To reproduce the issue I have simplified my code to the following fragment:

import Dispatch

let queue = DispatchQueue(label: "Concurrent threads", qos: .userInitiated, attributes: .concurrent)
let group = DispatchGroup()

let threadCount = 4
let size = 1_000
var pixels = [SIMD3<Float>](repeating: .init(repeating: 0), count: threadCount*size)

for thread in 0..<threadCount {
  queue.async(group: group) {
    for number in thread*size ..< (thread+1)*size {
      let floating = Float(number)
      pixels[number] = SIMD3<Float>(floating, floating, floating)
    }
  }
}

print("waiting")
group.wait()
print("Finished")

When I execute this in debug mode using Xcode Version 10.2 beta 4 (10P107d) it always crashes with an error like:

Multithread(15095,0x700008d63000) malloc: *** error for object 0x104812200: pointer being freed was not allocated
Multithread(15095,0x700008d63000) malloc: *** set a breakpoint in malloc_error_break to debug

I have the feeling that it is some bug in the compiler because when I run the code in release mode it runs just fine. Or am I just doing something wrong?

Torust · March 16, 2019, 11:14am

This to me looks like the Law of Exclusivity in effect, albeit with bad diagnostics. You're concurrently modifying pixels from multiple threads, which is prohibited.

To work around this, use pixels.withUnsafeMutableBufferPointer outside of the async, and make sure the wait() is within the withUnsafeMutableBufferPointer scope:

pixels.withUnsafeMutableBufferPointer { pixels in
  for thread in 0..<threadCount {
    queue.async(group: group) { ... }
  }
  group.wait()
}

Dev1an · March 16, 2019, 12:06pm

Hey @Torust, thanks for your quick response!
I tried your suggestion like this:

pixels.withUnsafeMutableBufferPointer { unsafePixels in
  for thread in 0..<threadCount {
    queue.async(group: group) {
      for number in thread*size ..< (thread+1)*size {
        let floating = Float(number)
        unsafePixels[number] = SIMD3<Float>(floating, floating, floating)
      }
    }
  }
  group.wait()
}

But it has one problem: accessing unsafePixels inside the async block does not compile. The compiler reports the following error: "Escaping closures can only capture inout parameters explicitly by value".

I don't think it is related to the runtime exclusive access checks (the ones mentioned in the blog post you linked), because I get the same errors when I disable them in the Swift Compiler - Code Generation settings. I would also like to note that I am not concurrently modifying the same variables from different threads: each thread operates on a different, non-overlapping part of the array.

Dev1an · March 16, 2019, 1:47pm

Thanks to Rob Napier who has solved the problem, I found out that I was missing the capture list [unsafePixels] in the queue.async block. Everything works fine now!
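For later readers, here is a sketch of what the combined fix might look like: the original fragment, Torust's withUnsafeMutableBufferPointer suggestion, and the [unsafePixels] capture list put together. It assumes the Swift 5 language mode used in this thread; stricter concurrency checking in newer language modes may reject or warn about capturing an unsafe pointer in a Dispatch closure.

```swift
import Dispatch

let queue = DispatchQueue(label: "Concurrent threads", qos: .userInitiated, attributes: .concurrent)
let group = DispatchGroup()

let threadCount = 4
let size = 1_000
var pixels = [SIMD3<Float>](repeating: .init(repeating: 0), count: threadCount * size)

pixels.withUnsafeMutableBufferPointer { unsafePixels in
  for thread in 0..<threadCount {
    // The capture list copies the buffer pointer by value, so the escaping
    // closure no longer tries to capture the inout parameter itself.
    queue.async(group: group) { [unsafePixels] in
      // Each work item writes only to its own non-overlapping slice.
      for number in thread * size ..< (thread + 1) * size {
        let floating = Float(number)
        unsafePixels[number] = SIMD3<Float>(floating, floating, floating)
      }
    }
  }
  // Wait inside the scope: the pointer is only valid until this closure returns.
  group.wait()
}
```

The group.wait() has to stay inside the withUnsafeMutableBufferPointer scope, as Torust noted, because the work items must finish before the buffer pointer goes out of scope.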

scanon · March 16, 2019, 2:54pm

Separate from the actual crash: I get that this is a simplification, but I want to point out that this example does not do nearly enough work per thread to benefit from using Dispatch. The main thread will finish filling the entire buffer long before any threading abstraction would manage to create a second work item. Even if you have billions of elements to write, this sort of store-dominated workload is almost always a poor candidate for multithreading, unless your entire application is bottlenecked on it for some reason, because just a few cores can saturate the bandwidth to memory on most systems.

My conservative rough rule of thumb is:

1. Only consider threading if you have more computational work than memory traffic.
2. Choose the number of work items so you spend at least a few hundred thousand cycles in each.

There are cases where you still benefit from threading that fall outside of these, but you start to get "this specific task got faster, but everything else got slower" pretty quickly as you move outside of these constraints.
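As a rough illustration of the rule of thumb above, the sketch below splits the work into a small number of coarse chunks rather than one work item per element. Nothing here comes from the thread: DispatchQueue.concurrentPerform is swapped in for the queue/group pattern, expensiveValue is a hypothetical stand-in for compute-heavy per-element work, and the chunk count of 4 is an arbitrary assumption. Whether each chunk really reaches a few hundred thousand cycles depends on the actual workload and should be measured.

```swift
import Dispatch

// Hypothetical stand-in for compute-heavy work (many arithmetic operations
// per element, only a single store at the end).
func expensiveValue(_ index: Int) -> Float {
  var acc = Float(index)
  for _ in 0..<1_000 {
    acc = (acc * 1.0001).squareRoot() + 1
  }
  return acc
}

let count = 4_000
let chunkCount = 4  // assumption: a few coarse work items, not one per element
let chunkSize = (count + chunkCount - 1) / chunkCount
var results = [Float](repeating: 0, count: count)

results.withUnsafeMutableBufferPointer { buffer in
  let out = buffer  // copy the pointer by value for use in the work items
  // concurrentPerform runs the closure once per chunk, potentially in
  // parallel, and returns only after every iteration has finished.
  DispatchQueue.concurrentPerform(iterations: chunkCount) { chunk in
    let start = chunk * chunkSize
    let end = min(start + chunkSize, count)
    for i in start..<end {
      out[i] = expensiveValue(i)  // chunks never overlap
    }
  }
}
```

Because concurrentPerform blocks until all iterations complete, no separate DispatchGroup is needed, and the buffer pointer never outlives its scope.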

Dev1an · March 16, 2019, 3:06pm

@scanon thanks for pointing this out. I am writing a ray tracer where, for each pixel, I need to calculate intersections with millions of objects (instead of just initialising the pixel with a dummy value as in the simplified code above). I am seeing a substantial performance gain (more than 3x faster) with the multithreaded code, so I think this is actually a good case for multithreading.