How to parallelize across CPU cores?

Instead of a busy-waiting loop, there are pause instructions typically intended for this purpose, e.g. for x86:

(random quick googled reference, assuming similar exists for ARM)

Not worth raising the version just for Mutex, IMO.

I was able to make this work with a ManagedAtomic based on @tera's suggestion, which drops the requirement down to macOS 10.15 (the minimum for withTaskGroup(...)).

Demo using ManagedAtomic
import Foundation // For Thread, DispatchTime, etc.
import Atomics

/// A dummy input.
struct Item {
	/// An ID for human readability.
	/// For demo purposes, higher IDs are more likely to take longer to process than lower IDs.
	let id: Int
}

/// A dummy result.
final class ItemGroup: Sendable, AtomicReference {
	let cost: Double

	init(cost: Double) { self.cost = cost }
}

@available(macOS 10.15, *)
final class GroupFinder: Sendable {
	private let bestGroupYetAtomic: ManagedAtomic<ItemGroup?> = .init(nil)

	public func findBestGroup(items: [Item]) async -> ItemGroup? {
		if items.isEmpty { return nil }

		return await withTaskGroup(of: ItemGroup.self) { taskGroup in
			for item in items { // Queue up all the work
				taskGroup.addTask { @Sendable in
					await self.findBestGroup(startingWith: item)
				}
			}

			// Once everything is queued up, start waiting for the results.
			// The cooperative thread pool will run the tasks in parallel, with 1 thread per core.
			await taskGroup.waitForAll()

			return bestGroupYetAtomic.load(ordering: .sequentiallyConsistent)
		}
	}

	private func findBestGroup(startingWith item: Item) async -> ItemGroup {
		// The `lowestCostYet` is just an optimization for culling computations that can't possibly
		// be better than what's already been seen so far. If we load a value and it ends up being stale,
		// we might end up culling less and doing more redundant computation. So it's a trade-off between:
		// 1. `.sequentiallyConsistent` (more culling, but higher contention) and
		// 2. `.relaxed` (less contention, but less culling and potentially more redundant computation).
		let lowestCostYet = bestGroupYetAtomic.load(ordering: .relaxed)?.cost ?? Double.infinity

		let candidate = await findBestGroup(startingWith: item, lowestCostYet: lowestCostYet)

		// Update our `bestGroupYet` ASAP, so its value can be used by other starting tasks.
		updateIfLowerCost(candidate)

		return candidate
	}

	private func updateIfLowerCost(_ candidate: ItemGroup) {
		var exchanged = false

		while !exchanged {
			let current = bestGroupYetAtomic.load(ordering: .sequentiallyConsistent)

			if candidate.cost < current?.cost ?? Double.infinity {
				exchanged = bestGroupYetAtomic.compareExchange(
					expected: current,
					desired: candidate,
					ordering: .sequentiallyConsistent
				).exchanged
			} else {
				return // The existing cost was lower. Don't change it.
			}
		}

		print("New best item: \(candidate.cost)")
	}

	private func findBestGroup(startingWith item: Item, lowestCostYet: Double) async -> ItemGroup {
		// Replace this with the real algorithm.
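		// (The dummy below ignores `lowestCostYet`; the real algorithm would use it to cull
		// work that can't beat the best cost seen so far.)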
		let resultCost = Double.random(in: 0...Double(item.id * item.id)) / 1000
		print("Item \(item.id): Started... will take \(resultCost)s.")
		hogCPU(forSeconds: resultCost)
		print("Item \(item.id): done.")
		return ItemGroup(cost: resultCost)
	}

	private func hogCPU(forSeconds seconds: TimeInterval) {
		// Intentionally block the thread using `Thread.sleep` (e.g. instead of `Task.sleep`), to simulate
		// a CPU-intensive computation (as opposed to I/O like a network request that can yield the thread).
		Thread.sleep(forTimeInterval: seconds)

		// Alternatively, we can use a busy-wait and really use the CPU:
//		let deadline = DispatchTime.now() + seconds
//		while DispatchTime.now() < deadline {}
	}
}

if #available(macOS 10.15, *) {
	// We base the computation time on the item IDs. For demonstration, we start large and get smaller,
	// so we're more likely to find new lowest costs as computation progresses.
	let dummyItems: Array<Item> = (1..<100).map(Item.init).reversed()
	let result = await GroupFinder().findBestGroup(items: dummyItems)
	print("Best result overall: \(result!.cost)")
}
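
For reference, the Atomics import above comes from the swift-atomics package, so running this demo needs a dependency along these lines in Package.swift (the target name here is a hypothetical placeholder):

dependencies: [
  .package(url: "https://github.com/apple/swift-atomics.git", from: "1.1.0"),
],
targets: [
  .executableTarget(
    name: "GroupFinderDemo", // hypothetical target name
    dependencies: [.product(name: "Atomics", package: "swift-atomics")]
  ),
]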

Good catch.

True. rwlocks also have other issues (they degrade to mutexes under surprisingly light write loads due to reader-reader contention while writers are pending, and they break priority donation). Those may or may not be important for your usage though.

Oh that’s great, thanks again!

How about something like this? No mutex needed:

struct Solution {...}
struct Input {...}

func doSomething(_ input: Input, _ bestSolution: Solution?) -> Solution { ... }

...

await withTaskGroup(of: Solution.self) { group in
  var numTasks = 0
  var bestSolution: Solution? = nil
  func startTask() {
    if let nextInput = ... {
      group.addTask { () -> Solution in
        return doSomething(nextInput, bestSolution)
      }
      numTasks += 1
    }
  }

  func completeTask(_ solution: Solution) {
    if bestSolution == nil || solution < bestSolution! {
      bestSolution = solution
    }
    numTasks -= 1
  }

  for _ in 0 ..< 32 {
    startTask()
  }

  for await solution in group {
    startTask()
    completeTask(solution)
  }
}
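
Filling in the blanks, a minimal runnable version of that sketch might look like the following. The dummy types stand in for the ... placeholders, with one adjustment: an addTask closure is @Sendable, so it can't capture the mutated bestSolution var directly; a let snapshot is captured instead.

import Foundation

struct Input: Sendable { let id: Int }

struct Solution: Comparable, Sendable {
  let cost: Double
  static func < (lhs: Solution, rhs: Solution) -> Bool { lhs.cost < rhs.cost }
}

func doSomething(_ input: Input, _ bestSolution: Solution?) -> Solution {
  // Dummy work; a real implementation would use `bestSolution` for culling.
  Solution(cost: Double.random(in: 0...Double(input.id)))
}

await withTaskGroup(of: Solution.self) { group in
  var remaining = (1...100).map { Input(id: $0) }
  var numTasks = 0
  var bestSolution: Solution? = nil

  func startTask() {
    if let nextInput = remaining.popLast() {
      let best = bestSolution // snapshot; @Sendable closures can't capture a mutated var
      group.addTask { () -> Solution in
        return doSomething(nextInput, best)
      }
      numTasks += 1
    }
  }

  func completeTask(_ solution: Solution) {
    if bestSolution == nil || solution < bestSolution! {
      bestSolution = solution
    }
    numTasks -= 1
  }

  for _ in 0 ..< 32 {
    startTask()
  }

  for await solution in group {
    startTask()
    completeTask(solution)
  }

  print("Best solution: \(bestSolution?.cost ?? .nan)")
}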

If I understand correctly, this would mean that the first 32 tasks all start from nil, and no more than 32 tasks ever run at the same time.

On a device with only 4 cores, 28 tasks would unnecessarily start from nil when they “could have” gotten a head start.

And on a device with 96 cores, 64 of them would remain idle the whole time.

That's right, but of course you can change the parameter. However, yeah, the other problem you pointed out is inherent to my approach and requires shared state to solve.

After doing a bit of reading, I think this should use weakCompareExchange since it’s in a loop, because that might be more efficient on some systems. Also, it can use the return value of the compare-exchange to update current, so there’s only one atomic access within the loop:

private func updateIfLowerCost(_ candidate: ItemGroup) {
  var exchanged = false
  var current = bestGroupYetAtomic.load(ordering: .sequentiallyConsistent)
  
  while !exchanged {
    if candidate.cost < current?.cost ?? Double.infinity {
      (exchanged, current) = bestGroupYetAtomic.weakCompareExchange(
        expected: current,
        desired: candidate,
        ordering: .sequentiallyConsistent
      )
    } else {
      return // The existing cost was lower. Don't change it.
    }
  }

  print("New best item: \(candidate.cost)")
}

Does that sound right?

And… I don’t know enough about atomic memory orderings to judge for myself, but would this still work correctly if the initial load (before the loop) used .relaxed ordering? Would there be any benefit?

I assume the actual exchange must be sequentially consistent, but I’m not entirely sure what could go wrong if it weren’t.

Nice findings. It all seems plausibly correct, but I don't know :smiley:

Can someone please explain the reasoning behind the weak vs. strong compareExchange functions? The docs mention:

This compare-exchange variant is allowed to spuriously fail ... In this weak form, transient conditions may cause the original == expected check to sometimes return false when the two values are in fact the same.

I'm sure there's some benefit to this odd behaviour, perhaps it lowers to some different instruction that affords the hardware some clever trick to do something better. What is it, exactly?

Re. memory orderings: The Swift docs on memory ordering refer to the C++ std::memory_order counterparts. But the docs for those were clearly designed for lawyers, and I'm a mere software engineer, so I'm a bit out of my league lol

Actually, now that I think about it, this could work really well just by letting the number of tasks increase:

  startTask()
  
  for await solution in group {
    completeTask(solution)
    for _ in 0..<3 {
      startTask()
    }
  }

Now only one task starts from nil, and each time one finishes, several more begin. So this should quickly saturate the available processors.

It doesn’t need locks or atomics, though it does mean the first task runs alone until it completes. Obviously I could launch more than one task at the outset, but starting from nil is kind of a big slowdown so I’d rather only do it once.

An improved design might launch one task, wait until it finds the first valid group, and then start launching more tasks. That level of optimization may be more than I need to worry about though.
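
That staged version might look something like this inside the same task group, assuming the startTask/completeTask helpers from above and Foundation's ProcessInfo for a core count:

  startTask()

  if let firstSolution = await group.next() {
    completeTask(firstSolution) // every later task now starts from a non-nil bestSolution

    // Ramp up to roughly one task per core, then continue one-for-one.
    for _ in 0 ..< ProcessInfo.processInfo.activeProcessorCount {
      startTask()
    }
    for await solution in group {
      startTask()
      completeTask(solution)
    }
  }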

Another possibility is to keep a count of pending tasks together with a count of started tasks. That needs only a single atomic, since it's just the started-task counter that has to be incremented inside the task closure itself. Then you can keep starting new tasks until the number of pending tasks exceeds the number of started tasks, wait for one to complete, and so on.
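
A sketch of that counting idea, reusing the Input/Solution/doSomething stand-ins from the filled-in version above; the backlog threshold and the Task.yield() call are guesses at the unstated details:

import Atomics

await withTaskGroup(of: Solution.self) { group in
  // `started` is incremented inside each task closure, so it must be atomic.
  // `added`, `remaining`, and `bestSolution` are only touched in this body.
  let started = ManagedAtomic<Int>(0)
  var remaining = (1...100).map { Input(id: $0) }
  var added = 0
  var bestSolution: Solution? = nil

  func startTask() -> Bool {
    guard let nextInput = remaining.popLast() else { return false }
    let best = bestSolution
    group.addTask {
      started.wrappingIncrement(ordering: .relaxed)
      return doSomething(nextInput, best)
    }
    added += 1
    return true
  }

  // Keep adding tasks while everything added so far has actually begun
  // running; a backlog of not-yet-started tasks means the pool is saturated.
  func rampUp() async {
    while added - started.load(ordering: .relaxed) <= 1 {
      guard startTask() else { break }
      await Task.yield() // give the pool a chance to pick up the new task
    }
  }

  await rampUp()
  while let solution = await group.next() {
    if bestSolution == nil || solution < bestSolution! {
      bestSolution = solution
    }
    await rampUp()
  }

  print("Best solution: \(bestSolution?.cost ?? .nan)")
}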

…is there a good way to do this?

That is, inside the withTaskGroup closure I want to start one task. At some point in that task, it decides “Okay, now it’s time to add more tasks to the group.”

Can I call @Slava_Pestov’s startTask function from within the already-running task, in order to add more tasks to the same task group?