If I clean your code up slightly to remove obstacles to vectorization, by factoring the conditionals that enforce the periodic boundary conditions out of the main loop, and simply remove the use of concurrency entirely:
let r = k * dt / (dx * dx)
for t in 0 ..< nt {
    let src = space[t % 2]
    let dst = space[(t + 1) % 2]
    // Periodic boundary cells handled once per step, outside the main loop.
    dst[0]    = src[0]    + r*(src[nx-1] - 2*src[0]    + src[1])
    dst[nx-1] = src[nx-1] + r*(src[nx-2] - 2*src[nx-1] + src[0])
    // Interior points: no wrap-around, so the loop vectorizes cleanly.
    for i in 1 ..< nx-1 {
        dst[i] = src[i] + r*(src[i-1] - 2*src[i] + src[i+1])
    }
}
then the performance is only a little bit slower on my system. I would be curious to see your Python and Rust code, since the assembly generated by this isn't quite optimal, but it's within spitting distance:
+0x408 ldur q2, [x2, #-0x8]
+0x40c ldur q3, [x2, #0x8]
+0x410 ldp q4, q5, [x2, #-0x10]
+0x414 fadd.2d v6, v2, v2
+0x418 fadd.2d v7, v3, v3
+0x41c fsub.2d v4, v4, v6
+0x420 fsub.2d v6, v5, v7
+0x424 ldr q7, [x2, #0x10]
+0x428 fadd.2d v4, v4, v5
+0x42c fadd.2d v5, v6, v7
+0x430 fmul.2d v4, v4, v1
+0x434 fmul.2d v5, v5, v1
+0x438 fadd.2d v2, v2, v4
+0x43c fadd.2d v3, v3, v5
+0x440 stp q2, q3, [x1, #-0x10]
+0x444 add x1, x1, #0x20
+0x448 add x2, x2, #0x20
+0x44c subs x3, x3, #0x4
+0x450 b.ne "main+0x408"
This could be improved somewhat by re-using loaded data across iterations (and by using ext instead of loading from half-vector offsets), but those transforms are hard to get compilers to do in any language. It could also be improved somewhat on newer x86 and on Apple Silicon by explicitly using FMA via src[i].addingProduct(r, ...) instead of a separate multiply and add. However, neither of these would make much real difference as used, because of the first point below.
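For illustration, a minimal sketch of that FMA variant of the interior loop, assuming the same src/dst buffers and r as above; addingProduct(_:_:) computes self + lhs*rhs and compiles to a fused multiply-add where the hardware supports it:
for i in 1 ..< nx-1 {
    dst[i] = src[i].addingProduct(r, src[i-1] - 2*src[i] + src[i+1])
}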
This sort of problem is generally a very poor case for naive concurrency or multithreading, for three reasons:
- Your working set is too large. At 2 * 10000000 * MemoryLayout<Double>.size bytes, you have a working set of 160MB, which doesn't fit into the caches, so you're actually limited by the speed of memory rather than the speed of compute, and memory bandwidth does not scale nearly as well with additional resources.
- If you solved that with smaller working sets, you'd get bitten by data locality. You really want to pin each worker to a fixed portion of the data, so it will be in the caches associated with the worker at the next time step.
- If you solved that, you'd get bitten by fine-grained concurrency with synchronization boundaries: at each time step, every worker has to wait for all the other workers to finish. Because your system is doing other things, one or two of the work items will be delayed, and all of your resources have to sit idle until the straggler is done and you can kick off the next group (see the sketch after this list).
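To make that last point concrete, here is a minimal sketch of the kind of naive chunked parallelization being described (names like chunks are illustrative, and space is assumed to hold mutable buffers as in the snippet above); concurrentPerform does not return until every chunk has finished, so each time step is a full barrier and the fast workers idle until the straggler completes:
import Dispatch

for t in 0 ..< nt {
    let src = space[t % 2]
    let dst = space[(t + 1) % 2]
    dst[0]    = src[0]    + r*(src[nx-1] - 2*src[0]    + src[1])
    dst[nx-1] = src[nx-1] + r*(src[nx-2] - 2*src[nx-1] + src[0])
    // Each chunk updates a disjoint slice of the interior; the call returns
    // only when all chunks are done, which is the per-step barrier.
    DispatchQueue.concurrentPerform(iterations: chunks) { c in
        let lo = 1 + c * (nx - 2) / chunks
        let hi = 1 + (c + 1) * (nx - 2) / chunks
        for i in lo ..< hi {
            dst[i] = src[i] + r*(src[i-1] - 2*src[i] + src[i+1])
        }
    }
}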
Now, since you aren't solving the physical heat equation, but rather a finite-difference approximation, the propagation speed is finite, and so the contributions of neighboring work items matter almost not at all (with only 10 time steps, only the 10 elements on the boundary between blocks can possibly effect the neighbor). So if you just wanted to make this as fast as possible you would use a wave-front approach that does all the time steps in one pass in each block, and then you fill in the ten elements at the boundaries afterwards. This allows you to make it so you only miss your caches twice for each element of the solution in total, instead of once for each time-space step.
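As a rough, hypothetical illustration of that idea, here is a single-threaded sketch of an overlapping-halo variant (each block copies its cells plus nt halo cells per side, advances all nt steps on the local copy, and writes back only the interior; this recomputes the boundary cells redundantly rather than patching them afterwards, but has the same memory-traffic property, assuming nt is much smaller than the block size). The function name and blockSize parameter are made up for the example:
func blockedHeatSweep(u: inout [Double], nx: Int, nt: Int, r: Double, blockSize: Int) {
    var result = u
    var start = 0
    while start < nx {
        let end = min(start + blockSize, nx)
        let padded = end - start + 2 * nt
        // Gather the block plus nt halo cells on each side, wrapping periodically.
        var a = [Double](repeating: 0, count: padded)
        var b = a
        for j in 0 ..< padded {
            a[j] = u[((start - nt + j) % nx + nx) % nx]
        }
        // Advance all nt time steps within the local copy; after step t only
        // indices t+1 ..< padded-t-1 are valid, which is why the halo shrinks.
        for t in 0 ..< nt {
            for j in (t + 1) ..< (padded - t - 1) {
                b[j] = a[j] + r * (a[j-1] - 2*a[j] + a[j+1])
            }
            swap(&a, &b)
        }
        // Only the halo-stripped interior is exact; copy it back.
        for j in 0 ..< (end - start) {
            result[start + j] = a[nt + j]
        }
        start = end
    }
    u = result
}
In a threaded version each block becomes an independent work item writing a disjoint slice of the result, so the only synchronization left is at the very end rather than at every time step.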
And, of course, you're solving the heat equation with periodic boundary conditions, so if going as fast as possible over longer times were the goal, you would actually use a Fourier transform to solve the physical equation directly.
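(For reference, that works because with periodic boundary conditions each Fourier mode decays independently: if û_m(0) are the Fourier coefficients of the initial data on a domain of length L, the exact solution is û_m(t) = û_m(0) · exp(−k·(2πm/L)²·t), so one forward FFT, a pointwise scaling, and one inverse FFT gives the solution at any time t directly.)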