I kept attacking this challenge from multiple angles, until I remembered and relocated a blog post about a scalability issue. Its message: no matter how tight its implementation, an atomic counter (a semaphore's, say) can always become a scalability bottleneck, by mere virtue of requiring its cache line to bounce between the L1D caches attached to the respective cores whenever more than one of them manipulates it concurrently; and that can easily, and sometimes implicitly, occur with reference counting.
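To make that failure mode concrete, here is a minimal sketch of my own (not taken from the post in question) where the only state anyone ever writes to is the hidden reference count of an otherwise immutable object; every copy of the reference in the concurrent loops is a swift_retain/swift_release pair, i.e. an atomic read-modify-write on the object header:

import Foundation;

final class Shared: @unchecked Sendable {
    // Immutable payload: the reference count in the object header is
    // the only thing anyone ever writes to.
    let payload = 42;
}

func hammer(_ shared: Shared) async {
    await withTaskGroup(of: Void.self) { group in
        for _ in 0..<8 {
            group.addTask {
                for _ in 0..<1_000_000 {
                    // Copying the reference is a swift_retain/swift_release
                    // pair: atomic read-modify-writes on the object header,
                    // whose cache line then bounces between the cores.
                    // (An optimized build may elide part of this traffic.)
                    let local = shared;
                    _ = local.payload;
                }
            }
        }
    };
}

Call await hammer(Shared()) from any async context: run on more than one core, the counter never changes value in any externally visible way, yet its cache line ping-pongs all the same.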
So I went looking for such a situation in my scenario. But even though I have some suspects, I first need data to know how much of that potential saving can actually be had, and that is where I came up short on this Mac mini M2: I failed to find a hardware counter that would help. None of them measure the case where a core requests a cache line that cannot be served from anywhere in its cache hierarchy, common RAM included, at least not until another core has written it back. Instead, I had to fall back to my meager Intel-based MacBook with only two physical cores, the bare minimum for this to happen at all, as that machine does have counters to that effect. Do you hear me, Apple silicon guys? I had to go back to an Intel processor (now gone from the Mac hardware lineup) in a 7-year-old Mac in order to try and improve software scalability on your brand spanking new hardware, because the latter is utterly unhelpful here.
Using MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM (retired loads that had to be served by a cross-core snoop hitting a modified line in another core's cache) in the CPU Counters Instruments template in event mode (one sample every 1000 events), which allows samples to be taken at or close to where such events happen, I do notice swift_release and swift_retain at the top of the inverted call tree profile, especially when compared with the same profile for the sequential run; and unfolding their callers reveals that most of them come directly from my code.
Unfortunately, Instruments was not able to point out which lines (even in the assembly) corresponded to such calls: it either complained of being unable to locate the file in question, or pointed me to the first line of the file, etc. But there was really only one candidate for data shared between concurrent tasks: the closures that correspond to the chain of operands of a given operation. Indeed, I had expressed that data as closures, that being the most straightforward and economical way to make it available to all the computation trees that may need access to it, since this data is immutable once captured. Everything else is mutable and copied: if not exactly at the point of the explicit copy, then at worst immediately afterwards, when it is first modified and COW kicks in, which occurs at most once per micro-batch and so is not worth worrying about.
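To illustrate the two regimes side by side, here is a simplified sketch of mine (Walker and makeWalker are stand-ins, not the project's real types): the closure shares its captured, immutable operands by reference with every task that copies it, while the mutable state is a value type whose copy only materializes at the first write:

typealias Walker = @Sendable (_ action: (Int) -> Void) -> Void;

// Immutable once captured: every task that copies the returned closure
// retains the same heap-allocated context holding the operands.
func makeWalker(_ operands: [Int]) -> Walker {
    return { action in
        for operand in operands {
            action(operand);
        }
    };
}

let walk = makeWalker([1, 2, 3]);

// Mutable state, by contrast, is a value type: the assignment below is
// cheap and shares storage, and COW materializes an actual copy at the
// first mutation, at most once.
var available = [10, 20, 30];
var childAvailable = available;
childAvailable.append(40);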
Unfortunately, Swift won't allow me to "fully copy" a closure, even in a shallow way; such an operation would probably be incompatible with its semantics anyway. So when spawning a new task, instead of such a copy I replace each reference to such a closure with a new closure whose only purpose is to store that reference and forward control to it when invoked:
// First, wrap the closures into a no-op closure that will serve as a cap
let childWalkOperands = { @Sendable action in
    anchor_func();
    return walkOperands(action);
};
This is the code for the closure of the operation currently being built; similar code replaces the similar reference found in the non-primitive ReadyValues of the available list (the array known as l, which is copied when a new task is spawned). void anchor_func(void) is a no-op assembly function, imported into Swift as a foreign C declaration, whose only purpose is to ensure the optimizer won't deem the new closure equal to the one it wraps and merge the two as a result (which would defeat my work); its source code is:
    .globl _anchor_func
    .p2align 2
_anchor_func:
    ret
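Put together, the spawning side of the pattern looks like the following sketch (Walker and spawnChild are illustrative stand-ins again, and anchor_func is assumed imported through the bridging header, as in the attached patch). Each child task gets its own wrapper object, so the retain/release traffic the child generates lands on the wrapper's reference count, on a cache line no other core needs, rather than on that of the shared original:

typealias Walker = @Sendable (_ action: (Int) -> Void) -> Void;

func spawnChild(_ walkOperands: @escaping Walker) -> Task<Void, Never> {
    // Cap the shared closure: this wrapper is exclusively owned by the
    // child task about to be spawned.
    let childWalkOperands: Walker = { action in
        anchor_func(); // keeps the optimizer from folding the wrapper away
        return walkOperands(action);
    };

    return Task {
        // The child only ever retains/releases childWalkOperands; the
        // shared closure itself was retained once, at wrapper creation.
        childWalkOperands({ operand in
            _ = operand; // consume each operand
        });
    };
}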
After a bit of additional code to support this switcheroo, I profile my code again.
Swift with closure cap
 N |     T | factor
 1 | 100   |   ref
 2 |  69.3 | x1.44
 3 |  62.0 | x1.61
 4 |  59.3 | x1.69
 5 |  48.7 | x2.05
 6 |  42.1 | x2.37
 7 |  38.6 | x2.59
 8 |  36.1 | x2.77
 9 |  36.2 | x2.76
10 |  35.9 | x2.78
11 |  36.0 | x2.77
12 |  35.9 | x2.79
(the methodology is identical to that of "Swift without dictionaries", plus the attached patch)
discussion
…well, the absolute performance did improve, and so did the consistency of the results (neither of which is shown here), which is why I will keep this change anyway. Running this code with the CPU Counters instrument again does show the event in question being triggered by swift_release and swift_retain less often, and the remaining cases appear to be only indirectly related to my code.
But this did not move the needle on scalability one bit. Back to the drawing board…
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncAPI.swift Sun Sep 17 12:12:20 2023 +0200
@@ -20,13 +20,13 @@
_ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
_ kind: Op,
_ startingFrom: Int,
- _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws -> Void;
+ _ walkOperands: @escaping WalkerFuncNonGCD) async throws -> Void;
typealias DispatcherPseudoReasync = (_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
_ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
_ kind: Op,
_ startingFrom: Int,
- _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
+ _ walkOperands: @escaping WalkerFuncNonGCD) throws -> Void;
// Despite https://github.com/apple/swift-evolution/blob/main/proposals/0309-unlock-existential-types-for-all-protocols.md
// I am unable to do the following, even with Xcode 13.2.1: I get the error that:
@@ -235,9 +235,29 @@
_ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
_ kind: Op,
_ startingFrom: Int,
- _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
+ _ walkOperands: @escaping WalkerFuncNonGCD) async throws {
+ // First, wrap the closures into a no-op closure that will serve as a cap
+ let childWalkOperands = { @Sendable action in
+ anchor_func();
+ return walkOperands(action);
+ };
+ var interResolveCoreCtx = resolveCoreCtx;
+ interResolveCoreCtx.l = resolveCoreCtx.l.map({e in
+ return ReadyValueNonGCD(template: e,
+ overridingExpr: e.expr.map({ innerE in
+ var innerR = innerE;
+ let innerOperandsBrowser = innerE.operatorsBrowser;
+ innerR.operatorsBrowser = { action in
+ anchor_func();
+ return innerOperandsBrowser(action);
+ };
+ return innerR;
+ })
+ );
+ });
+
// reseat this divergence over a copy of the whole state
- let paramResolveCoreCtx = resolveCoreCtx;
+ let paramResolveCoreCtx = interResolveCoreCtx;
let paramExploreAdditionOfRightNodeCtx = exploreAdditionOfRightNodeCtx;
let task = Task {
@@ -246,7 +266,7 @@
_ innerExploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
_ innerKind: Op,
_ innerStartingFrom: Int,
- _ innerWalkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) {
+ _ innerWalkOperands: @escaping WalkerFuncNonGCD) {
iteratePossibleLeftNodesPseudoReasync(&innerResolveCoreCtx,
&innerExploreAdditionOfRightNodeCtx,
{ result in innerResultsScopingGroup.addTask { await resultMediator.yield(result);}},
@@ -271,7 +291,7 @@
&childExploreAdditionOfRightNodeCtx,
kind,
startingFrom,
- walkOperands);
+ childWalkOperands);
os_signpost(.end, log: LogWrapper.gObservationHandle, name: "micro-batch", signpostID: OSSignpostID(log: LogWrapper.gObservationHandle, object: Thread.current));
}
@@ -291,7 +311,7 @@
_ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
_ kind: Op,
_ startingFrom: Int,
- _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
+ _ walkOperands: @escaping WalkerFuncNonGCD) async throws {
/* 1: Peek a bit: if we find out from startingFrom that no further left node is
going to be able to be added anyway (and remember: closing the composition as
it stands is performed in our caller), don't bother spawning a micro-batch.
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncCountdown-Bridging-Header.h
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/AsyncCountdown/AsyncCountdown-Bridging-Header.h Sun Sep 17 12:12:20 2023 +0200
@@ -0,0 +1,13 @@
+//
+// AsyncCountdown-Bridging-Header.h
+// AsyncCountdown
+//
+// Created by Pierre Lebeaupin on 15/09/2023.
+//
+
+#ifndef AsyncCountdown_Bridging_Header_h
+#define AsyncCountdown_Bridging_Header_h
+
+#include "anchor_func.h"
+
+#endif /* AsyncCountdown_Bridging_Header_h */
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncCountdown.swift
--- a/AsyncCountdown/AsyncCountdown.swift Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncCountdown.swift Sun Sep 17 12:12:20 2023 +0200
@@ -15,14 +15,15 @@
import Foundation;
+typealias WalkerFuncNonGCD = @Sendable (_ action: (_ value: ReadyValueNonGCD,
+ _ reverse: Bool) -> Void?
+ ) -> Void?;
+
@available(macOS 12.0.0, iOS 15.0, *)
struct ReadyValueNonGCD:Sendable {
struct Expr {
var op: Op;
- var operatorsBrowser: @Sendable (_ action: (_ value: ReadyValueNonGCD,
- _ reverse: Bool)
- -> Void?)
- -> Void?;
+ var operatorsBrowser: WalkerFuncNonGCD;
}
let value: UInt32;
@@ -38,6 +39,11 @@
expr = inExpr;
}
+ init(template intemplate: Self, overridingExpr inexpr: Expr?) {
+ value = intemplate.value;
+ expr = inexpr;
+ }
+
func description() -> String {
guard let localExpr = expr else {
return value.description;
@@ -47,7 +53,7 @@
var isFirst = true;
var numRev = 0;
- localExpr.operatorsBrowser({(_ subReadyValue: ReadyValueNonGCD,
+ localExpr.operatorsBrowser({(_ subReadyValue: ReadyValueNonGCD,
_ freverse: Bool) -> Void? in
guard !freverse else {
numRev += 1;
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncExplorers.swift Sun Sep 17 12:12:20 2023 +0200
@@ -46,10 +46,10 @@
// Downstream ones are those created while this one is active
try await dispatcher(&resolveCoreCtx,
- &reference,
- kind,
- /* startingFrom: */ skip,
- { _ in return nil;});
+ &reference,
+ kind,
+ /* startingFrom: */ skip,
+ { _ in return nil;});
}
/*
@@ -95,7 +95,7 @@
_ startGoal: UInt32,
_ kind: Op,
_ startingFrom: Int,
- _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?,
+ _ walkOperands: @escaping WalkerFuncNonGCD,
_ dispatcher: DispatcherAsync) async rethrows {
let emptyComposition = (walkOperands({_,_ in return ();}) == nil)
@@ -137,10 +137,10 @@
};
try await innerDispatcher(&innerResolveCoreCtx,
- &innerExploreAdditionOfRightNodeCtx,
- kind,
- /* startingFrom: */ candidateReverseOffset,
- selfNode);
+ &innerExploreAdditionOfRightNodeCtx,
+ kind,
+ /* startingFrom: */ candidateReverseOffset,
+ selfNode);
// close current composition
var num = 0;
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/DispatchAPI.swift
--- a/AsyncCountdown/DispatchAPI.swift Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/DispatchAPI.swift Sun Sep 17 12:12:20 2023 +0200
@@ -20,7 +20,7 @@
_ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
_ kind: Op,
_ startingFrom: Int,
- _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
+ _ walkOperands: @escaping WalkerFunc) throws -> Void;
@preconcurrency class ResolveObject {
@@ -40,7 +40,7 @@
_ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
_ kind: Op,
_ startingFrom: Int,
- _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?) {
+ _ walkOperands: @escaping WalkerFunc) {
iteratePossibleLeftNodes(&resolveCoreCtx,
&exploreAdditionOfRightNodeCtx,
resultReceiver,
@@ -68,7 +68,7 @@
_ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
_ kind: Op,
_ startingFrom: Int,
- _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?) {
+ _ walkOperands: @escaping WalkerFunc) {
/* 1: Peek a bit: if we find out from startingFrom that no further left node is
going to be able to be added anyway (and remember: closing the composition as
it stands is performed in our caller), don't bother spawning a micro-batch.
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/anchor_func.h
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/AsyncCountdown/anchor_func.h Sun Sep 17 12:12:20 2023 +0200
@@ -0,0 +1,4 @@
+#ifndef ANCHOR_FUNC
+#define ANCHOR_FUNC
+void anchor_func(void);
+#endif // ANCHOR_FUNC
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/anchor_func.s
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/AsyncCountdown/anchor_func.s Sun Sep 17 12:12:20 2023 +0200
@@ -0,0 +1,9 @@
+ .section __TEXT,__text,regular,pure_instructions
+ .build_version macos, 13, 0 sdk_version 13, 3
+ .globl _anchor_func
+ .p2align 2
+_anchor_func:
+ .cfi_startproc
+ ret
+ .cfi_endproc
+.subsections_via_symbols