If I clean your code up slightly to remove obstacles to vectorization, by factoring the conditionals that enforce the periodic boundary conditions out of the main loop, and simply remove the use of concurrency entirely:
let r = k * dt / (dx * dx)
for t in 0 ..< nt {
  let src = space[t % 2]
  let dst = space[(t+1) % 2]
  
  // Periodic boundary cells, handled outside the main loop.
  dst[0] = src[0] + r*(src[nx-1] - 2*src[0] + src[1])
  dst[nx-1] = src[nx-1] + r*(src[nx-2] - 2*src[nx-1] + src[0])
  
  // Interior cells only, so the loop body has no wraparound branches to block vectorization.
  for i in 1 ..< nx-1 {
    dst[i] = src[i] + r*(src[i-1] - 2*src[i] + src[i+1])
  }
}
then the performance is only a little bit slower on my system. I would be curious to see your Python and Rust code, since the assembly generated by this isn't quite optimal, but it's within spitting distance:
+0x408	ldur                q2, [x2, #-0x8]
+0x40c	ldur                q3, [x2, #0x8]
+0x410	ldp                 q4, q5, [x2, #-0x10]
+0x414	fadd.2d             v6, v2, v2
+0x418	fadd.2d             v7, v3, v3
+0x41c	fsub.2d             v4, v4, v6
+0x420	fsub.2d             v6, v5, v7
+0x424	ldr                 q7, [x2, #0x10]
+0x428	fadd.2d             v4, v4, v5
+0x42c	fadd.2d             v5, v6, v7
+0x430	fmul.2d             v4, v4, v1
+0x434	fmul.2d             v5, v5, v1
+0x438	fadd.2d             v2, v2, v4
+0x43c	fadd.2d             v3, v3, v5
+0x440	stp                 q2, q3, [x1, #-0x10]
+0x444	add                 x1, x1, #0x20
+0x448	add                 x2, x2, #0x20
+0x44c	subs                x3, x3, #0x4
+0x450	b.ne                "main+0x408"
This could be improved somewhat by re-using loaded data across iterations (and using ext instead of loading from half-vector offsets), but those transforms are hard to get compilers to do in any language. It could also be improved somewhat on newer x86 and on Apple Silicon by explicitly using FMA via src[i].addingProduct(r, ...) instead of a separate multiply and add. However, neither of these would make much real difference here because of the first point in the list below.
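For concreteness, the FMA form of the inner loop would look something like the following (src[i].addingProduct(r, x) computes src[i] + r*x without intermediate rounding, which the compiler can lower to a hardware fma):

// Same update as before, expressed as a fused multiply-add.
for i in 1 ..< nx-1 {
  dst[i] = src[i].addingProduct(r, src[i-1] - 2*src[i] + src[i+1])
}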
This sort of problem is generally a very poor case for naive concurrency or multithreading, for three reasons:
- Your working set is too large. At 2 * 10000000 * MemoryLayout<Double>.size, you have a working set of 160 MB, which doesn't fit into the caches, so you're actually limited by the speed of memory rather than the speed of compute, and memory bandwidth scales much less well with additional resources than compute does (the arithmetic is spelled out after this list).
- If you solved that with smaller working sets, you'd get bitten by data locality. You really want to pin each worker to a fixed portion of the data, so it will be in the caches associated with the worker at the next time step.
- If you solved that, you'd get bitten by fine-grained concurrency with synchronization boundaries: at each time step, every worker has to wait for all the other workers to finish. Because your system is doing other stuff, one or two of the work items will be delayed, and all your resources have to sit idle until the straggler is done and you can kick off the next group.
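For the first point, the working-set arithmetic is just the following (a tiny sketch; nx is taken from the 2 * 10000000 in your post):

let nx = 10_000_000
// Two buffers of nx Doubles are live at once (src and dst).
let workingSetBytes = 2 * nx * MemoryLayout<Double>.size
// 160_000_000 bytes, i.e. 160 MB.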
Now, since you aren't solving the physical heat equation, but rather a finite-difference approximation of it, the propagation speed is finite, so the contributions of neighboring work items hardly matter at all (with only 10 time steps, only the 10 elements nearest a block boundary can possibly affect the neighboring block). So if you just wanted to make this as fast as possible, you would use a wave-front approach that does all the time steps in one pass over each block, and then fill in the ten elements at the boundaries afterwards. That lets you miss the caches only twice per element of the solution in total, instead of once per element per time step.
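Here's a minimal single-threaded sketch of that wave-front idea. Instead of a separate fix-up pass for the boundary elements, each block is copied together with a halo of nt cells on each side so that all the time steps can be done locally, which is equivalent in spirit; the function name and parameters are mine, not from your code, and it assumes nt is small relative to the block size:

func timeBlockedHeat(_ u: [Double], nt: Int, r: Double, blockSize: Int) -> [Double] {
  let nx = u.count
  var out = [Double](repeating: 0, count: nx)
  let halo = nt  // the dependency cone grows by one cell per time step
  var start = 0
  while start < nx {
    let end = min(start + blockSize, nx)
    // Copy the block plus its halo into a local buffer, wrapping periodically.
    let localCount = (end - start) + 2 * halo
    var cur = (0 ..< localCount).map { u[(start - halo + $0 + nx) % nx] }
    var next = cur
    for step in 0 ..< nt {
      // Only cells whose full stencil is still valid get updated; the valid
      // region shrinks by one cell from each side per step.
      for i in (step + 1) ..< (localCount - step - 1) {
        next[i] = cur[i] + r*(cur[i-1] - 2*cur[i] + cur[i+1])
      }
      swap(&cur, &next)
    }
    // The central block is far enough from the buffer edges to be exact.
    for i in 0 ..< (end - start) {
      out[start + i] = cur[halo + i]
    }
    start = end
  }
  return out
}

Each block only reads its own halo copy, so the blocks are independent and could be handed to separate workers; the nt-cell halos are the only data that crosses a block boundary, and each element of the solution is touched in one pass instead of once per time step.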
And, of course, you're solving the heat equation with periodic boundary conditions, so if going as fast as possible over longer time spans were the goal, you would actually use a Fourier transform and solve the physical equation directly.
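For completeness, a sketch of what that looks like: with periodic boundary conditions every Fourier mode evolves independently, decaying as exp(-k (2πq/L)² t), so you transform once, scale, and transform back. The naive O(n²) DFT below is only for clarity (for real sizes you would use an FFT, e.g. Accelerate's vDSP); the function name, signature, and the domain length L are my own assumptions:

import Foundation

// Solves u_t = k * u_xx on a periodic domain of length L exactly in time:
// each Fourier mode q is scaled by exp(-k * (2*pi*q/L)^2 * t).
func spectralHeat(_ u0: [Double], k: Double, L: Double, t: Double) -> [Double] {
  let n = u0.count
  let nD = Double(n)
  var re = [Double](repeating: 0, count: n)
  var im = [Double](repeating: 0, count: n)
  // Forward DFT (naive, O(n^2); use an FFT in practice).
  for q in 0 ..< n {
    for j in 0 ..< n {
      let angle = -2 * Double.pi * Double(q) * Double(j) / nD
      re[q] += u0[j] * cos(angle)
      im[q] += u0[j] * sin(angle)
    }
  }
  // Scale each mode by its exact decay factor; indices above n/2 are negative frequencies.
  for q in 0 ..< n {
    let qSigned = q <= n / 2 ? q : q - n
    let kappa = 2 * Double.pi * Double(qSigned) / L
    let decay = exp(-k * kappa * kappa * t)
    re[q] *= decay
    im[q] *= decay
  }
  // Inverse DFT, keeping the real part.
  var out = [Double](repeating: 0, count: n)
  for j in 0 ..< n {
    for q in 0 ..< n {
      let angle = 2 * Double.pi * Double(q) * Double(j) / nD
      out[j] += (re[q] * cos(angle) - im[q] * sin(angle)) / nD
    }
  }
  return out
}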