Really bad performance with concurrency and how to optimize the code


We are working on a comparison of concurrency in various programming languages and implemented a 1D heat equation solver in Swift. Compared to Python, Rust, and C++, the performance of Swift does not look very good.

First, the code only scales up to three cores; see below:

Cores,Total time

Second, the number of floating-point operations per second is rather low; for example, on three cores we get 500 floating-point operations per second. Compared to the other languages, that is not much.

Since we want to publish the results, I would like to ask for some help; I assume our implementation in Swift is just not good.

I would appreciate any feedback on the code that helps us get better FLOPS. You can find the code here




At a glance, the use of an actor to collect the results seems unnecessary and inefficient. This bit here:

      for await result in group {
        var i = 0
        for e in result {
          await future.set_value(i, e)
          i = i + 1
        }
      }

      current = await future.get_values()

is going to switch context into and out of the actor on every call to set_value in order to copy a single element into its array, once for each element, and I can imagine that likely overwhelms the cost of the computation itself. And since this is at the top level of the task group, there is no concurrency to be concerned about at all, and the use of an actor should not be necessary. You could append the arrays together directly to a local variable in the top level of the withTaskGroup block.
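A minimal sketch of that suggestion, with a made-up chunking scheme and a placeholder per-chunk computation (the function name `gather` and the chunk bounds are assumptions, not the original code): each child task returns `(offset, chunk)`, and the top level of `withTaskGroup` writes into a plain local array, with no actor involved.

```swift
func gather(nx: Int, nWorkers: Int) async -> [Double] {
    await withTaskGroup(of: (Int, [Double]).self) { group in
        let chunk = (nx + nWorkers - 1) / nWorkers
        for w in 0 ..< nWorkers {
            let lo = w * chunk
            let hi = min(lo + chunk, nx)
            group.addTask {
                // placeholder per-chunk computation
                (lo, (lo ..< hi).map(Double.init))
            }
        }
        var result = [Double](repeating: 0, count: nx)
        for await (offset, values) in group {
            // plain local mutation is fine here: this loop runs at the top
            // level of the task group, outside any child task
            result.replaceSubrange(offset ..< offset + values.count, with: values)
        }
        return result
    }
}
```

Because only the top-level code touches `result`, there is no shared mutable state and nothing to synchronize.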

Ideally the code would be structured in such a way that the result array from each subtask doesn't need to be copied at all. Maybe you could allocate the full array at the top level, use withUnsafeMutableBufferPointer to get a pointer to its contents, and slice the buffer pointer so each task in the group can directly fill out its part of the array.
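A sketch of this zero-copy idea, under the assumption that the work splits into disjoint index ranges (the function name `fillInPlace` and the placeholder computation are mine): one buffer is allocated up front and each task fills its own slice in place. Note that capturing an unsafe pointer in child tasks sidesteps `Sendable` checking; it is only sound here because the slices never overlap.

```swift
func fillInPlace(nx: Int, nWorkers: Int) async -> [Double] {
    let buffer = UnsafeMutableBufferPointer<Double>.allocate(capacity: nx)
    defer { buffer.deallocate() }
    let base = buffer.baseAddress!
    await withTaskGroup(of: Void.self) { group in
        let chunk = (nx + nWorkers - 1) / nWorkers
        for w in 0 ..< nWorkers {
            let lo = w * chunk
            let hi = min(lo + chunk, nx)
            group.addTask {
                // each task writes only to its own disjoint [lo, hi) slice
                for i in lo ..< hi {
                    base[i] = Double(i)   // placeholder computation
                }
            }
        }
    }
    return Array(buffer)
}
```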


Something that is often overlooked is that Swift by default compiles with debug configuration, which can have a large performance impact. In case you didn’t, make sure to compile for release:

swift build -c release

I was running

swiftc -O heat.swift

Is that correct?

Yes, that works as well.

Hi @diehlpk, I was looking at the heat_2 code in your repo that implements Joe's suggestion.

It may be unnecessary to yield the thread with await at each iteration of the i loop inside the task. The buffer pointers themselves are not mutated so you can declare the space array as a let constant, which entirely avoids the need for await. Otherwise you are being killed by task suspension.
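A small sketch of the `let` point (the setup here is invented for illustration): the two time-step buffers never change identity, only their contents, so the array holding them can be a `let`. Mutating through an `UnsafeMutableBufferPointer` stored in a `let` is allowed, and reading the array then needs no actor and no `await`.

```swift
func demo() -> Double {
    let nx = 8
    // two time-step buffers; their identity never changes
    let space = [
        UnsafeMutableBufferPointer<Double>.allocate(capacity: nx),
        UnsafeMutableBufferPointer<Double>.allocate(capacity: nx),
    ]
    defer { space.forEach { $0.deallocate() } }
    for buf in space { buf.initialize(repeating: 0) }
    // `space` is a `let`, yet the buffer contents are freely mutable,
    // so readers never need an actor or an `await` to reach the data
    space[0][3] = 1.0
    space[1][3] = space[0][3]
    return space[1][3]
}
```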


The linked files are giving a 404 error in GitHub for me.

Thanks, I fixed that and the performance got much better.


However, it is still far away from Python or Rust.

The link in the header post 404's for me. Please correct it.
In my experience Python is like 100 times slower than Swift 🙂

With the current code, I am 50% faster than in Python.

If I clean your code up slightly to remove obstacles to vectorization (by factoring the conditionals enforcing the periodic boundary conditions out of the main loop) and simply remove the use of concurrency entirely:

let r = k * dt / (dx * dx)
for t in 0 ..< nt {
  let src = space[t % 2]
  let dst = space[(t + 1) % 2]
  // periodic boundary conditions, factored out of the main loop
  dst[0] = src[0] + r*(src[nx-1] - 2*src[0] + src[1])
  dst[nx-1] = src[nx-1] + r*(src[nx-2] - 2*src[nx-1] + src[0])
  // interior points only, so i-1 and i+1 are always in bounds
  for i in 1 ..< nx-1 {
    dst[i] = src[i] + r*(src[i-1] - 2*src[i] + src[i+1])
  }
}

then the performance is only a little bit slower on my system. I would be curious to see your Python and Rust code, since the assembly generated by this isn't quite optimal, but it's within spitting distance:

+0x408	ldur                q2, [x2, #-0x8]
+0x40c	ldur                q3, [x2, #0x8]
+0x410	ldp                 q4, q5, [x2, #-0x10]
+0x414	fadd.2d             v6, v2, v2
+0x418	fadd.2d             v7, v3, v3
+0x41c	fsub.2d             v4, v4, v6
+0x420	fsub.2d             v6, v5, v7
+0x424	ldr                 q7, [x2, #0x10]
+0x428	fadd.2d             v4, v4, v5
+0x42c	fadd.2d             v5, v6, v7
+0x430	fmul.2d             v4, v4, v1
+0x434	fmul.2d             v5, v5, v1
+0x438	fadd.2d             v2, v2, v4
+0x43c	fadd.2d             v3, v3, v5
+0x440	stp                 q2, q3, [x1, #-0x10]
+0x444	add                 x1, x1, #0x20
+0x448	add                 x2, x2, #0x20
+0x44c	subs                x3, x3, #0x4
+0x450	b.ne                "main+0x408"

This could be improved somewhat by re-using loaded data across iterations (and using ext instead of loading from half-vector offsets), but those transforms are hard to get compilers to do in any language. It could also be improved somewhat on newer x86 and Apple Silicon by explicitly using FMA via src[i].addingProduct(r, ...) instead of a separate multiply and add. However, neither of these would make much real difference as used, because of point (1) below.
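A tiny illustration of the FMA suggestion, with made-up values: the two expressions compute the same stencil update, but `addingProduct` lowers to a single fused multiply-add instruction where the hardware supports it.

```swift
let r = 0.25
let left = 1.0, mid = 2.0, right = 4.0
// separate multiply and add
let separate = mid + r * (left - 2 * mid + right)
// fused multiply-add: mid + r * (...) in one instruction
let fused = mid.addingProduct(r, left - 2 * mid + right)
// both yield 2.25 here (the values are exact in binary, so == is safe)
```

In general the fused form can round differently from the separate form; with this stencil that difference is far below the discretization error.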

This sort of problem is generally a very poor case for naive concurrency or multithreading, for three reasons:

  1. Your working set is too large. At 2 * 10000000 * MemoryLayout<Double>.size bytes, you have a working set of 160MB, which doesn't fit into the caches, so you're actually limited by the speed of memory rather than the speed of compute, and memory does not scale nearly as well with additional resources.
  2. If you solved that with smaller working sets, you'd get bitten by data locality. You really want to pin each worker to a fixed portion of the data, so it will be in the caches associated with the worker at the next time step.
  3. If you solved that, you'd get bitten by fine-grained concurrency with synchronization boundaries: at each time step, every worker has to wait for all the other workers to finish. Because your system is doing other stuff, one or two of the work items will be delayed, and all your resources have to sit idle until the straggler is done and you can kick off the next group.
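The arithmetic behind point 1 above, spelled out:

```swift
// two time-step buffers of 10,000,000 doubles each
let nx = 10_000_000
let workingSetBytes = 2 * nx * MemoryLayout<Double>.size
// 2 * 10_000_000 * 8 = 160_000_000 bytes, i.e. 160 MB: far larger than any
// CPU cache, so the sweep is bound by memory bandwidth, not compute
```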

Now, since you aren't solving the physical heat equation, but rather a finite-difference approximation, the propagation speed is finite, and so the contributions of neighboring work items barely matter at all (with only 10 time steps, only the 10 elements on the boundary between blocks can possibly affect the neighbor). So if you just wanted to make this as fast as possible, you would use a wave-front approach that does all the time steps in one pass in each block, and then fills in the ten elements at the boundaries afterwards. This allows you to miss your caches only twice for each element of the solution in total, instead of once for each time-space step.

And, of course, you're solving the heat equation with periodic boundary, so you would actually use a Fourier transform to solve the physical equation directly if going as fast as possible across longer times was the goal.
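For reference, this is the standard separation-of-variables result: with periodic boundaries the Fourier modes of $u_t = k\,u_{xx}$ decouple, so each mode simply decays,

$$\hat{u}_q(t) = \hat{u}_q(0)\, e^{-k q^2 t},$$

where $q$ is the wavenumber. Transforming, scaling each mode, and transforming back solves the physical equation directly for any $t$, with no time stepping at all.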


Thanks for the response. Our idea was to compare concurrency in various languages. We want to investigate lines of code, performance, and the programming model for concurrency. So we decided on the stencil approach, since this code is easy to understand and implement. We do not want the fastest implementation; rather, we want to compare this specific implementation.

Python code: async_heat_equation/ at main · diehlpk/async_heat_equation · GitHub

Rust code: async_heat_equation/ at main · diehlpk/async_heat_equation · GitHub

My main language is C++. I wanted to learn Swift and was just wondering whether I was doing something wrong, given my first performance numbers.


Ah, so just looking at the Python code, it operates completely differently from the Swift code--you are pinning a thread to each segment of the work buffers, and all the arithmetic is explicitly vectorized via numpy. It's not terribly surprising that this works pretty well (it would work really well if the working set was smaller!), but that's hardly an apples-to-apples comparison.


Yes, I agree, but can I do the same in Swift?

Alternatively, just remove numpy from the Python version to make it a more apples-to-apples comparison.


I illustrated how to get vectorization to happen automatically in my post above. I suspect that someone with some free time can illustrate how to achieve better data locality.

I've been looking at the Python code.

  1. The ghosts in the send_ghosts method are taken from data[0] and data[-1]. Actually I think they should come from data[1] and data[-2] for reasons that become obvious when you stare at it.

  2. In Python, the leftmost worker and the rightmost worker are not connected. In the Swift version, the space is modelled as a circular array. I think this means that Python is modelling a rod, while Swift is modelling a torus. This has implications for approaches to data localisation along the lines of rod-easy torus-hard.

  3. The timer in Python is not started until the workers and their data have been initialised. In Swift the timer is started immediately so it includes the initialisation of data structures.

  4. The for await loop at line 79 of the Swift code is not necessary and it is just destroying performance.

I re-tested after moving the Python timing code to measure the whole runtime -- given that the objective is to compare language concurrency features then it seems reasonable to include this as overhead. I also removed all of the threading code in a version of the Python so it's just vectorising through NumPy.

For a symmetrical workload at 10 10000 10000:

  • Swift Task-based = 0.38 secs.
  • Swift Vectorised = 0.04 secs.
  • Python Threading = 4.6 secs.
  • Python NumPy = 0.17 secs.

Thanks. Meanwhile, we changed the Python code to use the same structure and initial conditions.

Would you mind sharing your code?

I am working on a new version and would appreciate a look at the code.

Sure, I've added the other versions to the Gist: Async Heat Equation · GitHub

The change in the Python code is mostly adding and then commenting-out print statements! I enjoyed looking at how it works.