Improving Swift code for scalability

I recently acquired a Mac mini M2 (4 performance cores, 4 efficiency cores); even though I did not get the M2 Pro (much less the SMP monsters featured in the Mac Studio), I am very interested in seeing how well I can exploit these CPUs using thread-level parallelism. Especially with Swift, which was released at a time when all Macs and iOS devices were already multi-core and seemed tailor-made for such a multi-core reality, a potential that was finally unlocked with the concurrency features introduced in Swift 5.5.

So imagine my disappointment when I ran my experimentation project on it and obtained the following run times:

Naive Swift

 N|  T  | factor
 1|100  | ref
 2| 83.5|x1.20
 3| 69.5|x1.44
 4| 79.5|x1.26
 5| 60.8|x1.64
 6| 50.0|x2.00
 7| 47.2|x2.12
 8| 39.8|x2.51
 9| 45.2|x2.21
10| 45.1|x2.22
11| 45.3|x2.21
12| 39.2|x2.56

The first column is the number of tasks that my code spawns before requiring that a task complete as a condition for an additional task to be spawned; the code runs as micro-batches whose size is independent of both the number of processors and this parameter. I went up to 12 as a sanity check.
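This spawning pattern can be sketched as follows (a minimal sketch, not the actual project code; `batches` and `process` are hypothetical stand-ins for the micro-batch production and processing):

```swift
import Foundation

/* At most `width` tasks are in flight at any time; past that, one must
 complete before an additional one is spawned. */
func run(width: Int,
         batches: [[Int]],
         process: @escaping @Sendable ([Int]) -> Void) async {
    await withTaskGroup(of: Void.self) { group in
        var outstandingTasks = 0
        for batch in batches {
            if outstandingTasks >= width {
                // Wait for one completion before spawning more.
                _ = await group.next()
                outstandingTasks -= 1
            }
            group.addTask { process(batch) }
            outstandingTasks += 1
        }
        // Drain the remaining tasks.
        await group.waitForAll()
    }
}
```

The real version (see the patch in the footnote) defaults `width` to `ProcessInfo.processInfo.activeProcessorCount` when no override is supplied.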

The second column is the time taken in each scenario, normalized so that the sequential case has a duration of 100 (i.e. where the above parameter is 1; that code uses the async APIs too, so the comparison is fair): I am not interested in absolute performance here, only in whether it scales as expected. Smaller numbers are better.

The third column is how much better that scenario performs compared to the first, expressed as a performance multiplier. It is the inverse of the second column, presented for convenience. Higher numbers are better.

Methodology: I opened my project (which you can find as the attachment to the May 17th comment at It's time for `reasync` · Issue #65475 · apple/swift · GitHub) with Xcode 15.0 beta 8 and profiled it with the Time Profiler template. For each run, I took the difference between the end time and the initialization time, both reported by Instruments. After an initial run, whose results were discarded, so as to warm the caches, each scenario was run 3 times, and the median time was kept as the scenario outcome. Parallelism was controlled through an environment variable, which requires the following patch to be applied over the project version from May 17th¹; I confirmed that the required parallelism was actually achieved by inspecting the Time Profiler instrument (except for parallelism values above 8). The complexity of the computation was chosen so that each scenario completes in a handful of seconds, even the reference (longest-duration) scenario, so as to avoid thermal throttling effects.

Discussion

Right off the bat, there are obviously some issues with these results: you can see performance regress when going from 3-way to 4-way parallelism, and again when going from 8-way to 9-way (before recovering at 12); there appear to be bimodal behaviors at work. But the general trend is clear: scalability is awful. I double-checked the Time Profiler instrument for obvious misbehavior, such as a single task completing much later than the others and delaying the end of processing, but no such luck. For sanity, I also checked on the machine I was relying on before (2016 13" MacBook Pro, no Touch Bar), but I saw similar results when going from 1 to 2 parallel tasks.

When I tried the dedicated Swift Concurrency template, it failed to provide much insight: to begin with, it appears to be overwhelmed by the number of tasks that are spawned (~5000), even though there are never more than N alive at any given time, and my processing slows to a crawl under it.

Looking at the profiling results, I started to have ideas for improvements, but first I wanted to establish what kind of scalability I could legitimately expect, so as to know what to demand of my Swift code. So I reimplemented my processing in C, with concurrency obtained through libdispatch.

Fast-forward a few weeks…

C with libdispatch

 N|  T  | factor
 1|100  | ref
 2| 55.9|x1.79
 3| 36.5|x2.74
 4| 28.8|x3.47
 5| 28.3|x3.53
 6| 27.5|x3.64
 7| 26.4|x3.79
 8| 24.6|x4.07
 9| 24.0|x4.16
10| 23.7|x4.23
11| 22.8|x4.38
12| 22.8|x4.38

Methodology is similar; the only difference from the Swift version is the computation load, which was adjusted so that the total duration in the reference scenario is in the same ballpark, in order to obtain comparable thermal throttling effects, if any.

Discussion

These results are much better behaved, for a start (this is confirmed by the individual runs, which group tightly), and while scaling is not linear from 1 to 4, it is close enough to it that I don't feel computational resources are wasted. As expected, going from 4 to 8 improves performance much less as we get into the efficiency cores, but it still improves the situation. Performance does still improve beyond 8, possibly because we afford the system more opportunities for scheduling useful work, but we clearly hit diminishing returns by that point.

Coming back to the Swift version, one peculiarity I noticed in the profile is some symbols related to dictionaries (hashing-related infrastructure, in particular) jumping from nowhere near the top in the reference runs to the top of the symbols (sorted by descending weight in the inverted call tree) in all other runs, even when parallelism is barely 2. So I worked on eliminating dictionaries from my naive Swift code, and repeated the experiment.
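For illustration, the kind of transformation involved (echoing the attached patch, with names simplified) replaces a small `[Bool: UInt32]` dictionary, whose every access goes through the hashing machinery, with a plain struct whose fields are accessed directly:

```swift
// Before: a two-entry dictionary keyed by a Bool;
// every read/write hashes the key and probes the table.
var decomposedValue: [Bool: UInt32] = [false: 1, true: 1]
decomposedValue[false]! *= 3

// After: a trivial struct; field accesses compile down to
// direct loads/stores, with no hashing or heap storage involved.
struct DecomposedValue {
    var forward: UInt32
    var reverse: UInt32
}
var ctx = DecomposedValue(forward: 1, reverse: 1)
ctx.forward *= 3
```

A Dictionary also carries copy-on-write heap storage, so copying one across task boundaries involves reference counting that the struct version avoids entirely.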

Swift without dictionaries

 N|  T  | factor
 1|100  | ref
 2| 69.1|x1.45
 3| 62.0|x1.61
 4| 58.6|x1.71
 5| 48.3|x2.07
 6| 41.9|x2.39
 7| 37.9|x2.64
 8| 36.1|x2.77
 9| 36.1|x2.77
10| 36.1|x2.77
11| 35.9|x2.78
12| 36.0|x2.77

Methodology is the same as the naive Swift table; the only difference is the improvement patch, attached here² (I am not happy with some of the compromises that I had to make in order to remove usage of dictionaries; I welcome suggestions for improvements to better handle the situation).

Discussion

This code is already much better behaved than the naive Swift version (all the more obvious in the individual runs). Scalability is much better at the earlier levels, though the behavior is somewhat erratic (a much bigger improvement going from 4 to 5 than from 3 to 4 parallel tasks), and the factor caps at a level not much better than before.

At any rate, we are still far from scalability that could legitimately be expected of this algorithm.

Further improvements

The improvements I obtained by removing dictionaries suggest that the scalability issues I am having here may not be related solely to the way I use the Swift concurrency APIs, but also to how the Swift code proper behaves, so I think this should be a significant avenue for investigation.

Unfortunately, I have failed to extract any further leads from what Instruments or Xcode can tell me, so I am looking for ways to investigate the remaining scalability impediments: tooling, experiments to perform, known issues, best practices, etc.


1:

diff -r 330582a717f6 -r d2ef6283be0b AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift     Wed May 17 01:41:53 2023 +0200
+++ b/AsyncCountdown/AsyncAPI.swift     Fri Jul 14 22:58:55 2023 +0200
@@ -223,7 +223,11 @@
                                  source: DispatchSource.makeSignalSource(signal: SIGINT,
                                              queue: DispatchQueue.global(qos:.userInitiated)),
                                  cancelHandler: { Task { await cancellator.raise();}});
-        let _ = Computer(exec: { () -> Channel in
+        var param: UInt?;
+        if let config = getenv("net_wanderingcoder_cpucount"), config[0] != 0 {
+            param = UInt(strlen(config));
+        }
+        let _ = Computer(cpuCount: param, exec: { () -> Channel in
             return await Channel(exec: { (sink) in
                 /* While the subtasks added to this group never throw themselves,
                  this is a throwing group so that it can be traversed by a thrown
diff -r 330582a717f6 -r d2ef6283be0b AsyncCountdown/PullAPI.swift
--- a/AsyncCountdown/PullAPI.swift      Wed May 17 01:41:53 2023 +0200
+++ b/AsyncCountdown/PullAPI.swift      Fri Jul 14 22:58:55 2023 +0200
@@ -21,14 +21,18 @@
 }
 
 @available(macOS 12.0, iOS 15.0, *) struct Computer<BatchIterator: BatchIteratorProtocol> {
-    init(exec: @Sendable @escaping () async -> BatchIterator) {
+    init(cpuCount: UInt? = nil, exec: @Sendable @escaping () async -> BatchIterator) {
+        guard cpuCount != 0 else
+        {
+            fatalError("Computer instantiated with 0 CPU!");
+        }
         Task {
             await withTaskGroup(of: Void.self) { group in
                 var outstandingTasks = UInt(0);
                 var iterator = await exec();
 
                 while (true) {
-                    guard outstandingTasks < ProcessInfo.processInfo.activeProcessorCount else {
+                    guard outstandingTasks < cpuCount ?? UInt(ProcessInfo.processInfo.activeProcessorCount) else {
                         _ = await group.next();
                         outstandingTasks -= 1;
                         continue;

2:

diff -r d2ef6283be0b -r cdcaaddad141 AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift     Fri Jul 14 22:58:55 2023 +0200
+++ b/AsyncCountdown/AsyncAPI.swift     Sun Jul 02 19:01:50 2023 +0200
@@ -16,20 +16,14 @@
 import Foundation;
 import os;
 
-typealias DispatcherAsync = (_ l: inout [ReadyValueNonGCD],
-                        _ composedCount: inout Int,
-                        _ otherComposedCount: inout Int,
-                        _ decomposedValue: inout [Bool:UInt32],
-                        _ downstreamSiblingsRecruitmentSkip: inout Int,
+typealias DispatcherAsync = (_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                        _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                         _ kind: Op,
                         _ startingFrom: Int,
                         _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws -> Void;
 
-typealias DispatcherPseudoReasync = (_ l: inout [ReadyValueNonGCD],
-                        _ composedCount: inout Int,
-                        _ otherComposedCount: inout Int,
-                        _ decomposedValue: inout [Bool:UInt32],
-                        _ downstreamSiblingsRecruitmentSkip: inout Int,
+typealias DispatcherPseudoReasync = (_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                     _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                         _ kind: Op,
                         _ startingFrom: Int,
                         _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
@@ -237,62 +231,29 @@
                  */
                 do {
                     try await withThrowingTaskGroup(of: Void.self) { resultsScopingGroup in
-                        func iteratePossibleLeftNodesDispatch(_ l: inout [ReadyValueNonGCD],
-                                                              _ composedCount: inout Int,
-                                                              _ otherComposedCount: inout Int,
-                                                              _ decomposedValue: inout [Bool:UInt32],
-                                                              _ downstreamSiblingsRecruitmentSkip: inout Int,
+                        func iteratePossibleLeftNodesDispatch(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                                              _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                               _ kind: Op,
                                                               _ startingFrom: Int,
                                                               _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
                             // reseat this divergence over a copy of the whole state
-                            let paramL = l;
-                            let paramComposedCount = composedCount;
-                            let paramOtherComposedCount = otherComposedCount;
-                            let paramDecomposedValue = decomposedValue;
-                            let paramDownstreamSiblingsRecruitmentSkip = downstreamSiblingsRecruitmentSkip;
+                            let paramResolveCoreCtx = resolveCoreCtx;
+                            let paramExploreAdditionOfRightNodeCtx = exploreAdditionOfRightNodeCtx;
                             
                             let task = Task {
                                 await withTaskGroup(of: Void.self) { innerResultsScopingGroup in
-                                    /*func iteratePossibleLeftNodesFakeDispatchAsync(_ l: inout [ReadyValueNonGCD],
-                                     _ composedCount: inout Int,
-                                     _ otherComposedCount: inout Int,
-                                     _ decomposedValue: inout [Bool:UInt32],
-                                     _ downstreamSiblingsRecruitmentSkip: inout Int,
-                                     _ kind: Op,
-                                     _ startingFrom: Int,
-                                     _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async {
-                                     await iteratePossibleLeftNodesAsync(&l,
-                                     &composedCount,
-                                     &otherComposedCount,
-                                     &decomposedValue,
-                                     &downstreamSiblingsRecruitmentSkip,
-                                     resultReceiver,
-                                     startGoal,
-                                     kind,
-                                     startingFrom,
-                                     walkOperands,
-                                     iteratePossibleLeftNodesFakeDispatchAsync);
-                                     }*/
-                                    
-                                    func iteratePossibleLeftNodesFakeDispatchPseudoReasync(_ l: inout [ReadyValueNonGCD],
-                                                                                           _ composedCount: inout Int,
-                                                                                           _ otherComposedCount: inout Int,
-                                                                                           _ decomposedValue: inout [Bool:UInt32],
-                                                                                           _ downstreamSiblingsRecruitmentSkip: inout Int,
-                                                                                           _ kind: Op,
-                                                                                           _ startingFrom: Int,
-                                                                                           _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) {
-                                        iteratePossibleLeftNodesPseudoReasync(&l,
-                                                                              &composedCount,
-                                                                              &otherComposedCount,
-                                                                              &decomposedValue,
-                                                                              &downstreamSiblingsRecruitmentSkip,
+                                    func iteratePossibleLeftNodesFakeDispatchPseudoReasync(_ innerResolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                                                                           _ innerExploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
+                                                                                           _ innerKind: Op,
+                                                                                           _ innerStartingFrom: Int,
+                                                                                           _ innerWalkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) {
+                                        iteratePossibleLeftNodesPseudoReasync(&innerResolveCoreCtx,
+                                                                              &innerExploreAdditionOfRightNodeCtx,
                                                                               { result in innerResultsScopingGroup.addTask { await resultMediator.yield(result);}},
                                                                               startGoal,
-                                                                              kind,
-                                                                              startingFrom,
-                                                                              walkOperands,
+                                                                              innerKind,
+                                                                              innerStartingFrom,
+                                                                              innerWalkOperands,
                                                                               iteratePossibleLeftNodesFakeDispatchPseudoReasync);
                                     }
                                     
@@ -303,17 +264,11 @@
                                      partial functions on the same thread either. */
                                     os_signpost(.begin, log: LogWrapper.gObservationHandle, name: "micro-batch", signpostID: OSSignpostID(log: LogWrapper.gObservationHandle, object: Thread.current));
                                     
-                                    var childL = paramL;
-                                    var childComposedCount = paramComposedCount;
-                                    var childOtherComposedCount = paramOtherComposedCount;
-                                    var childDecomposedValue = paramDecomposedValue;
-                                    var childDownstreamSiblingsRecruitmentSkip = paramDownstreamSiblingsRecruitmentSkip;
+                                    var childResolveCoreCtx = paramResolveCoreCtx;
+                                    var childExploreAdditionOfRightNodeCtx = paramExploreAdditionOfRightNodeCtx;
                                     
-                                    iteratePossibleLeftNodesFakeDispatchPseudoReasync(&childL,
-                                                                                      &childComposedCount,
-                                                                                      &childOtherComposedCount,
-                                                                                      &childDecomposedValue,
-                                                                                      &childDownstreamSiblingsRecruitmentSkip,
+                                    iteratePossibleLeftNodesFakeDispatchPseudoReasync(&childResolveCoreCtx,
+                                                                                      &childExploreAdditionOfRightNodeCtx,
                                                                                       kind,
                                                                                       startingFrom,
                                                                                       walkOperands);
@@ -332,11 +287,8 @@
                             try await cancellator.checkRaised();
                         }
                         
-                        func iteratePossibleLeftNodesDispatchLoop(_ l: inout [ReadyValueNonGCD],
-                                                                  _ composedCount: inout Int,
-                                                                  _ otherComposedCount: inout Int,
-                                                                  _ decomposedValue: inout [Bool:UInt32],
-                                                                  _ downstreamSiblingsRecruitmentSkip: inout Int,
+                        func iteratePossibleLeftNodesDispatchLoop(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                                                  _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                                   _ kind: Op,
                                                                   _ startingFrom: Int,
                                                                   _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
@@ -349,13 +301,10 @@
                              - 4 already means too few possibilities to explore
                              - 3 is right out
                              Note that among other properties, this estimator value is monotonic. */
-                            guard startingFrom < (l.count - composedCount),
-                                  l.count + (walkOperands({_,_ in return ();}) != nil ? 1 : 0) == 5 else {
-                                try await iteratePossibleLeftNodesAsync(&l,
-                                                                        &composedCount,
-                                                                        &otherComposedCount,
-                                                                        &decomposedValue,
-                                                                        &downstreamSiblingsRecruitmentSkip,
+                            guard startingFrom < (resolveCoreCtx.l.count - resolveCoreCtx.composedCount),
+                                  resolveCoreCtx.l.count + (walkOperands({_,_ in return ();}) != nil ? 1 : 0) == 5 else {
+                                try await iteratePossibleLeftNodesAsync(&resolveCoreCtx,
+                                                                        &exploreAdditionOfRightNodeCtx,
                                                                         { result in resultsScopingGroup.addTask { await resultMediator.yield(result);}},
                                                                         startGoal,
                                                                         kind,
@@ -365,11 +314,8 @@
                                 return;
                             }
                             
-                            try await iteratePossibleLeftNodesDispatch(&l,
-                                                                       &composedCount,
-                                                                       &otherComposedCount,
-                                                                       &decomposedValue,
-                                                                       &downstreamSiblingsRecruitmentSkip,
+                            try await iteratePossibleLeftNodesDispatch(&resolveCoreCtx,
+                                                                       &exploreAdditionOfRightNodeCtx,
                                                                        kind,
                                                                        startingFrom,
                                                                        walkOperands);
diff -r d2ef6283be0b -r cdcaaddad141 AsyncCountdown/AsyncCountdown.swift
--- a/AsyncCountdown/AsyncCountdown.swift       Fri Jul 14 22:58:55 2023 +0200
+++ b/AsyncCountdown/AsyncCountdown.swift       Sun Jul 02 19:01:50 2023 +0200
@@ -101,24 +101,32 @@
 }
 
 @available(macOS 12.0.0, iOS 15.0, *)
+struct ResolveCoreCtxNonGCD:Sendable {
+    var l = [ReadyValueNonGCD]();
+    var composedCount = 0;
+    var otherComposedCount = 0;
+};
+
+@available(macOS 12.0.0, iOS 15.0, *)
+struct ExploreAdditionOfRightNodeCtxNonGCD:Sendable {
+    var forwardValue: UInt32;
+    var reverseValue: UInt32;
+    var downstreamSiblingsRecruitmentSkip: Int;
+}
+
+
+@available(macOS 12.0.0, iOS 15.0, *)
 func resolveCoreNonGCD(_ injectedDispatcher: DispatcherAsync, _ primitives: [UInt32]) async throws {
-    var referenceL = [ReadyValueNonGCD]();
-    var referenceComposedCount = 0;
-    var referenceOtherComposedCount = 0;
-            
+    var reference = ResolveCoreCtxNonGCD();
             
     for element in primitives {
-        referenceL.append(.init(element));
+        reference.l.append(.init(element));
     }
         
-    try await exploreAdditionOfFloorAsync(&referenceL,
-                                     &referenceComposedCount,
-                                     &referenceOtherComposedCount,
+    try await exploreAdditionOfFloorAsync(&reference,
                                      injectedDispatcher,
                                      kind: .additive);
-    try await exploreAdditionOfFloorAsync(&referenceL,
-                                     &referenceComposedCount,
-                                     &referenceOtherComposedCount,
+    try await exploreAdditionOfFloorAsync(&reference,
                                      injectedDispatcher,
                                      kind: .multiplicative);
 }
diff -r d2ef6283be0b -r cdcaaddad141 AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift       Fri Jul 14 22:58:55 2023 +0200
+++ b/AsyncCountdown/AsyncExplorers.swift       Sun Jul 02 19:01:50 2023 +0200
@@ -16,45 +16,37 @@
 import Foundation;
             
             
-func exploreAdditionOfFloorAsync(_ l: inout [ReadyValueNonGCD],
-                                               _ composedCount: inout Int,
-                                               _ otherComposedCount: inout Int,
+func exploreAdditionOfFloorAsync(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
                                                _ dispatcher: DispatcherAsync,
                                            kind: Op) async rethrows {
-    guard otherComposedCount == 0 else {
+    guard resolveCoreCtx.otherComposedCount == 0 else {
         return;
     }
     
-    otherComposedCount = composedCount;
-    composedCount = 0;
+    resolveCoreCtx.otherComposedCount = resolveCoreCtx.composedCount;
+    resolveCoreCtx.composedCount = 0;
     
-    try await exploreAdditionOfRightNodeAsync(&l,
-                                          &composedCount,
-                                          &otherComposedCount,
+    try await exploreAdditionOfRightNodeAsync(&resolveCoreCtx,
                                           dispatcher,
                                           kind: kind,
                                           skip: 0);
 
-    composedCount = otherComposedCount;
-    otherComposedCount = 0;
+    resolveCoreCtx.composedCount = resolveCoreCtx.otherComposedCount;
+    resolveCoreCtx.otherComposedCount = 0;
 }
 
-func exploreAdditionOfRightNodeAsync(_ l: inout [ReadyValueNonGCD],
-                                               _ composedCount: inout Int,
-                                               _ otherComposedCount: inout Int,
+func exploreAdditionOfRightNodeAsync(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
                                                _ dispatcher: DispatcherAsync,
                                                kind: Op,
                                                skip: Int) async rethrows {
-    var referenceDecomposedValue = [false: kind.neutralValue(), true: kind.neutralValue()];
+    var reference = ExploreAdditionOfRightNodeCtxNonGCD(forwardValue: kind.neutralValue(),
+                                                        reverseValue: kind.neutralValue(),
+                                                        downstreamSiblingsRecruitmentSkip: Int(0)) // placeholder
     // Siblings are the other right nodes on the same floor
     // Downstream ones are those created while this one is active
-    var downstreamSiblingsRecruitmentSkip = Int(0); // placeholder
     
-    try await dispatcher(&l,
-                     &composedCount,
-                     &otherComposedCount,
-                     &referenceDecomposedValue,
-                     &downstreamSiblingsRecruitmentSkip,
+    try await dispatcher(&resolveCoreCtx,
+                     &reference,
                      kind,
                      /* startingFrom: */ skip,
                      { _ in return nil;});
@@ -97,11 +89,8 @@
 the same floor) recovering candidates in a way that results in a composition
 that we ourselves explored earlier, but that is the basic principle.
 */
-func iteratePossibleLeftNodesAsync(_ l: inout [ReadyValueNonGCD],
-                                             _ composedCount: inout Int,
-                                             _ otherComposedCount: inout Int,
-                                             _ decomposedValue: inout [Bool:UInt32],
-                                             _ downstreamSiblingsRecruitmentSkip: inout Int,
+func iteratePossibleLeftNodesAsync(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                             _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                              _ resultReceiver: (_ result: String) -> Void,
                                              _ startGoal: UInt32,
                                              _ kind: Op,
@@ -117,35 +106,38 @@
     would never reach 0 no matter how many siblings we'd add, and so we'd never be
     able to close the floor anyway. So it'd be pointless to keep going. */
     for candidateReverseOffset in
-            startingFrom ..< ((emptyComposition && (otherComposedCount > 0)) ? startingFrom+1 : (l.count - composedCount) ) {
+            startingFrom ..< ((emptyComposition && (resolveCoreCtx.otherComposedCount > 0)) ? startingFrom+1 : (resolveCoreCtx.l.count - resolveCoreCtx.composedCount) ) {
         if emptyComposition {
-            downstreamSiblingsRecruitmentSkip = candidateReverseOffset;
+            exploreAdditionOfRightNodeCtx.downstreamSiblingsRecruitmentSkip = candidateReverseOffset;
         }
 
         /* Note: new compositions are added at the end of the array, i.e. at the high offsets,
          while primitives (if any left) will be at the low offsets. As a result, the candidates
          to be recruited as a first priority are found at the high offsets first, though not so
          high that we would end up recruiting from the same floor (forbidden). */
-        let rightChild = l.remove(at: (l.count - composedCount - 1) - candidateReverseOffset);
+        let rightChild = resolveCoreCtx.l.remove(at: (resolveCoreCtx.l.count - resolveCoreCtx.composedCount - 1) - candidateReverseOffset);
         if let _ = rightChild.expr {
-            otherComposedCount -= 1;
+            resolveCoreCtx.otherComposedCount -= 1;
         }
         
-        for phase in 0...1 {
-            let reverse = (phase == 1);
-            { (_ valueComponent: inout UInt32) in
-                valueComponent = kind.combine(valueComponent, rightChild.value);
-            }(&decomposedValue[reverse]!);
-            
+        func combine(_ total: inout UInt32) {
+            total = kind.combine(total, rightChild.value);
+        }
+
+        func uncombine(_ total: inout UInt32) {
+            total = kind.uncombine(total, rightChild.value);
+        }
+        
+        func commonAsync(_ innerResolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                         _ innerExploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
+                         isReverse: Bool,
+                         _ innerDispatcher: DispatcherAsync) async rethrows {
             let selfNode = {@Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void? in
-                return action(rightChild, reverse) ?? walkOperands(action);
+                return action(rightChild, isReverse) ?? walkOperands(action);
             };
             
-            try await dispatcher(&l,
-                             &composedCount,
-                             &otherComposedCount,
-                             &decomposedValue,
-                             &downstreamSiblingsRecruitmentSkip,
+            try await innerDispatcher(&innerResolveCoreCtx,
+                             &innerExploreAdditionOfRightNodeCtx,
                              kind,
                              /* startingFrom: */ candidateReverseOffset,
                              selfNode);
@@ -153,49 +145,49 @@
             // close current composition
             var num = 0;
             if (selfNode({_,_ in guard num == 0 else {return ();}; num += 1; return nil;}) != nil)
-                    && ( (kind == .additive) ? decomposedValue[false]! > decomposedValue[true]! :
-                            ((decomposedValue[false]! % decomposedValue[true]!) == 0) ) {
+                && ( (kind == .additive) ? innerExploreAdditionOfRightNodeCtx.forwardValue > innerExploreAdditionOfRightNodeCtx.reverseValue :
+                            ((innerExploreAdditionOfRightNodeCtx.forwardValue % innerExploreAdditionOfRightNodeCtx.reverseValue) == 0) ) {
                 
-                let realizedValue = kind.uncombine(decomposedValue[false]!, decomposedValue[true]!);
+                let realizedValue = kind.uncombine(innerExploreAdditionOfRightNodeCtx.forwardValue, innerExploreAdditionOfRightNodeCtx.reverseValue);
                 
                 let readyValue = ReadyValueNonGCD(value: realizedValue,
                                                   expr: .init(op: kind,
                                                        operatorsBrowser: selfNode))
                 
-                if l.count == 0 {
+                if innerResolveCoreCtx.l.count == 0 {
                     if realizedValue == startGoal {
                         resultReceiver(readyValue.description());
                     }
                 } else {
-                    composedCount += 1;
-                    l.append(readyValue);
+                    innerResolveCoreCtx.composedCount += 1;
+                    innerResolveCoreCtx.l.append(readyValue);
                     
-                    try await exploreAdditionOfRightNodeAsync(&l,
-                                                          &composedCount,
-                                                          &otherComposedCount,
-                                                          dispatcher,
+                    try await exploreAdditionOfRightNodeAsync(&innerResolveCoreCtx,
+                                                              innerDispatcher,
                                                           kind: kind,
-                                                          skip: downstreamSiblingsRecruitmentSkip);
-                    try await exploreAdditionOfFloorAsync(&l,
-                                                      &composedCount,
-                                                      &otherComposedCount,
-                                                      dispatcher,
+                                                              skip: innerExploreAdditionOfRightNodeCtx.downstreamSiblingsRecruitmentSkip);
+                    try await exploreAdditionOfFloorAsync(&innerResolveCoreCtx,
+                                                          innerDispatcher,
                                                       kind: ((kind == .additive) ? .multiplicative : .additive));
                     
-                    l.remove(at: l.count - 1);
-                    composedCount -= 1;
+                    innerResolveCoreCtx.l.remove(at: innerResolveCoreCtx.l.count - 1);
+                    innerResolveCoreCtx.composedCount -= 1;
                 }
             } // end of operation validity test
+        } // end of commonAsync
+        
+        combine(&exploreAdditionOfRightNodeCtx.forwardValue);
+        try await commonAsync(&resolveCoreCtx, &exploreAdditionOfRightNodeCtx, isReverse: false, dispatcher);
+        uncombine(&exploreAdditionOfRightNodeCtx.forwardValue);
 
-            { (_ valueComponent: inout UInt32) in
-                valueComponent = kind.uncombine(valueComponent, rightChild.value);
-            }(&decomposedValue[reverse]!);
-        } // end of phase loop
-        
+        combine(&exploreAdditionOfRightNodeCtx.reverseValue);
+        try await commonAsync(&resolveCoreCtx, &exploreAdditionOfRightNodeCtx, isReverse: true, dispatcher);
+        uncombine(&exploreAdditionOfRightNodeCtx.reverseValue);
+
         if let _ = rightChild.expr {
-            otherComposedCount += 1;
+            resolveCoreCtx.otherComposedCount += 1;
         }
-        l.insert(rightChild, at: ((l.count+1) - composedCount - 1) - candidateReverseOffset);
+        resolveCoreCtx.l.insert(rightChild, at: ((resolveCoreCtx.l.count+1) - resolveCoreCtx.composedCount - 1) - candidateReverseOffset);
     } // end of candidate iteration loop
 }
 
diff -r d2ef6283be0b -r cdcaaddad141 AsyncCountdown/DispatchAPI.swift
--- a/AsyncCountdown/DispatchAPI.swift  Fri Jul 14 22:58:55 2023 +0200
+++ b/AsyncCountdown/DispatchAPI.swift  Sun Jul 02 19:01:50 2023 +0200
@@ -16,11 +16,8 @@
 @preconcurrency import Foundation
 import os;
 
-typealias Dispatcher = (_ l: inout [ReadyValue],
-                        _ composedCount: inout Int,
-                        _ otherComposedCount: inout Int,
-                        _ decomposedValue: inout [Bool:UInt32],
-                        _ downstreamSiblingsRecruitmentSkip: inout Int,
+typealias Dispatcher = (_ resolveCoreCtx: inout ResolveCoreCtx,
+                        _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                         _ kind: Op,
                         _ startingFrom: Int,
                         _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
@@ -39,19 +36,13 @@
          _ primitives: UInt32...) {
         let resultReceiver = { result in inqueue.async { incb(result); };};
         
-        func iteratePossibleLeftNodesFakeDispatch(_ l: inout [ReadyValue],
-                                                                _ composedCount: inout Int,
-                                                                _ otherComposedCount: inout Int,
-                                                                _ decomposedValue: inout [Bool:UInt32],
-                                                                _ downstreamSiblingsRecruitmentSkip: inout Int,
+        func iteratePossibleLeftNodesFakeDispatch(_ resolveCoreCtx: inout ResolveCoreCtx,
+                                                  _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                                                                 _ kind: Op,
                                                                 _ startingFrom: Int,
                                                                 _ walkOperands: @escaping  (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?)  {
-            iteratePossibleLeftNodes(&l,
-                                  &composedCount,
-                                  &otherComposedCount,
-                                  &decomposedValue,
-                                  &downstreamSiblingsRecruitmentSkip,
+            iteratePossibleLeftNodes(&resolveCoreCtx,
+                                  &exploreAdditionOfRightNodeCtx,
                                   resultReceiver,
                                   startGoal,
                                   kind,
@@ -73,11 +64,8 @@
             completion.notify(queue: inqueue,
                               execute: completionCb);
 
-            func iteratePossibleLeftNodesDispatch(_ l: inout [ReadyValue],
-                                                  _ composedCount: inout Int,
-                                                  _ otherComposedCount: inout Int,
-                                                  _ decomposedValue: inout [Bool:UInt32],
-                                                  _ downstreamSiblingsRecruitmentSkip: inout Int,
+            func iteratePossibleLeftNodesDispatch(_ resolveCoreCtx: inout ResolveCoreCtx,
+                                                  _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                                                   _ kind: Op,
                                                   _ startingFrom: Int,
                                                   _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?)  {
@@ -90,13 +78,10 @@
                  - 4 already means too few possibilities to explore
                  - 3 is right out
                  Note that among other properties, this estimator value is monotonic. */
-                guard startingFrom < (l.count - composedCount),
-                      l.count + (walkOperands({_,_ in return ();}) != nil ? 1 : 0) == 5 else {
-                    iteratePossibleLeftNodes(&l,
-                                             &composedCount,
-                                             &otherComposedCount,
-                                             &decomposedValue,
-                                             &downstreamSiblingsRecruitmentSkip,
+                guard startingFrom < (resolveCoreCtx.l.count - resolveCoreCtx.composedCount),
+                      resolveCoreCtx.l.count + (walkOperands({_,_ in return ();}) != nil ? 1 : 0) == 5 else {
+                    iteratePossibleLeftNodes(&resolveCoreCtx,
+                                             &exploreAdditionOfRightNodeCtx,
                                              resultReceiver,
                                              startGoal,
                                              kind,
@@ -107,26 +92,17 @@
                 }
                 
                 // reseat this divergence over a copy of the whole state
-                let childL = l;
-                let childComposedCount = composedCount;
-                let childOtherComposedCount = otherComposedCount;
-                let childDecomposedValue = decomposedValue;
-                let childDownstreamSiblingsRecruitmentSkip = downstreamSiblingsRecruitmentSkip;
-                
+                let paramResolveCoreCtx = resolveCoreCtx;
+                let paramExploreAdditionOfRightNodeCtx = exploreAdditionOfRightNodeCtx;
+
                 DispatchQueue.global(qos:.userInitiated).async(group: completion) {
                     os_signpost(.begin, log: LogWrapper.gObservationHandle, name: "micro-batch", signpostID: OSSignpostID(log: LogWrapper.gObservationHandle, object: Thread.current));
 
-                    var mutableL = childL;
-                    var mutableComposedCount = childComposedCount;
-                    var mutableOtherComposedCount = childOtherComposedCount;
-                    var mutableDecomposedValue = childDecomposedValue;
-                    var mutableDownstreamSiblingsRecruitmentSkip = childDownstreamSiblingsRecruitmentSkip;
+                    var childResolveCoreCtx = paramResolveCoreCtx;
+                    var childExploreAdditionOfRightNodeCtx = paramExploreAdditionOfRightNodeCtx;
 
-                    iteratePossibleLeftNodes(&mutableL,
-                                             &mutableComposedCount,
-                                             &mutableOtherComposedCount,
-                                             &mutableDecomposedValue,
-                                             &mutableDownstreamSiblingsRecruitmentSkip,
+                    iteratePossibleLeftNodes(&childResolveCoreCtx,
+                                             &childExploreAdditionOfRightNodeCtx,
                                              resultReceiver,
                                              startGoal,
                                              kind,

I kept attacking this challenge from multiple angles, until I remembered and relocated a blog post about a scalability issue. Its message: regardless of how tight its implementation is, an atomic counter such as a semaphore's can always become a scalability issue, by mere virtue of requiring its cache line to bounce between the L1D caches attached to the respective cores whenever more than one of them manipulates it concurrently; and that is a situation which reference counting can easily, and sometimes implicitly, create.
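To make the mechanism concrete, here is a minimal, hypothetical Swift sketch (none of these names are from my project) of how a single reference-counted object shared across concurrent tasks produces exactly this traffic: every task that captures the object bumps its atomic reference count, so the cache line holding the object header ping-pongs between the cores' L1D caches.

```swift
import Foundation

// Hypothetical illustration: one shared class instance, many concurrent tasks.
final class SharedTable: Sendable {
    let values: [Int];
    init(values: [Int]) { self.values = values; }
}

func hammer(_ table: SharedTable) async -> Int {
    return await withTaskGroup(of: Int.self) { group in
        for _ in 0 ..< 8 {
            // Each capture of `table` implies a swift_retain when the task is
            // spawned and a swift_release when it completes: atomic operations
            // on the same object header, hence the same cache line, from
            // potentially every core at once.
            group.addTask { return table.values.reduce(0, +); }
        }
        return await group.reduce(0, +);
    }
}
```

Note that the contended counter here is not one the programmer wrote: it is the object header's reference count, which is why the cost can appear implicitly.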

So I went looking for such a situation in my scenario; but even though I have some suspects, I first need data to know how much of that potential saving is really there. And that is where I came up short on this Mac mini M2, as I failed to find a hardware counter that would help: none of them measure the case where a core requests a cache line that cannot be loaded from anywhere in its cache hierarchy, not even the shared RAM, until another core has written it back. Instead, I had to go back to my meager Intel-based MacBook with only two physical cores, the bare minimum for this to happen at all, as that machine does have counters to that effect. Do you hear me, Apple silicon guys? I had to go back to an Intel processor (now gone from the Mac hardware lineup) in a 7-year-old Mac in order to try and improve software scalability on your brand spanking new hardware, since the latter is utterly unhelpful here.

Using MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM in the CPU Counters Instruments template in event mode (one sample every 1000 events), which allows samples to be taken at, or close to, where such events happen, I do notice swift_release and swift_retain at the top of the inverted call tree profile, particularly in comparison with the same profile for the sequential run; and unfolding their callers reveals that most of them come directly from my code.

Unfortunately, Instruments could not point out which lines (even in the assembly) these calls corresponded to (it either complained of being unable to locate the file in question, or pointed me at the first line of the file, etc.), but there was really only one possibility for data that I share between concurrent tasks: the closures that make up the chain of operands of a given operation. Indeed, I had expressed that data as closures, these being the most straightforward and economical way to make it available to all the computation trees that may need access to it, since it is immutable once captured. Everything else is mutable and copied: if not exactly at the point of the explicit copy, then at worst immediately afterwards, when it is first modified, due to copy-on-write; and that occurs at most once per micro-batch, so it is not worth worrying about.
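As a reminder of why that mutable state is cheap to hand off to a child task, here is a tiny illustrative sketch (variable names hypothetical) of Swift's copy-on-write behavior for arrays:

```swift
var l: [Int] = [1, 2, 3];
var child = l;      // no element copy yet: both variables share the same storage
child.append(4);    // the storage is actually duplicated here, exactly once
// From this point on, `l` is still [1, 2, 3] and `child` is [1, 2, 3, 4],
// and further mutations of `child` no longer trigger copies.
```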

However, Swift won't allow me to "fully copy" a closure, even shallowly; such an operation would probably be incompatible with its semantics anyway. So when spawning a new task, instead of such a copy, I replace each reference to such a closure with a new closure whose only purpose is to store that reference and forward control to it when invoked:

// First, wrap the closures into a no-op closure that will serve as a cap
let childWalkOperands = { @Sendable action in
    anchor_func();
    return walkOperands(action);
};

This is the code for the closure of the operation currently being built; similar code replaces the analogous reference found in the non-primitive ReadyValues of the available list (the array known as l, which is copied whenever a new task is spawned). void anchor_func(void) is a no-op assembly function, which Swift imports as a foreign C declaration, whose only purpose is to ensure the optimizer won't deem the new closure equal to the one it wraps and merge the two as a result (which would defeat my work); its source code is:

	.globl	_anchor_func
	.p2align	2
_anchor_func:
	ret

After a bit of additional code to support this switcheroo, I profile my code again.

Swift with closure cap

 N|  T  | factor
 1|100  | ref
 2| 69.3|x1.44
 3| 62.0|x1.61
 4| 59.3|x1.69
 5| 48.7|x2.05
 6| 42.1|x2.37
 7| 38.6|x2.59
 8| 36.1|x2.77
 9| 36.2|x2.76
10| 35.9|x2.78
11| 36.0|x2.77
12| 35.9|x2.79

(the methodology is identical to that of the "Swift without dictionaries" scenario, plus the patch attached below)

Discussion

…well, the absolute performance did improve, as did the consistency of the results (neither of which is shown here), which is why I will keep this change anyway; running this with the CPU Counters instrument does show the event in question being triggered by swift_release and swift_retain less often, and the remaining cases appear to be only indirectly related to my code.

But this did not move the needle on scalability one bit. Back to the drawing board…

diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift     Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncAPI.swift     Sun Sep 17 12:12:20 2023 +0200
@@ -20,13 +20,13 @@
                         _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                         _ kind: Op,
                         _ startingFrom: Int,
-                        _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws -> Void;
+                        _ walkOperands: @escaping WalkerFuncNonGCD) async throws -> Void;
 
 typealias DispatcherPseudoReasync = (_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
                                      _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                         _ kind: Op,
                         _ startingFrom: Int,
-                        _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
+                        _ walkOperands: @escaping WalkerFuncNonGCD) throws -> Void;
 
 // Despite https://github.com/apple/swift-evolution/blob/main/proposals/0309-unlock-existential-types-for-all-protocols.md
 // I am unable to do the following, even with Xcode 13.2.1: I get the error that:
@@ -235,9 +235,29 @@
                                                               _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                               _ kind: Op,
                                                               _ startingFrom: Int,
-                                                              _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
+                                                              _ walkOperands: @escaping WalkerFuncNonGCD) async throws {
+                            // First, wrap the closures into a no-op closure that will serve as a cap
+                            let childWalkOperands = { @Sendable action in
+                                anchor_func();
+                                return walkOperands(action);
+                            };
+                            var interResolveCoreCtx = resolveCoreCtx;
+                            interResolveCoreCtx.l = resolveCoreCtx.l.map({e in
+                                return ReadyValueNonGCD(template: e,
+                                                        overridingExpr: e.expr.map({ innerE in
+                                        var innerR = innerE;
+                                        let innerOperandsBrowser = innerE.operatorsBrowser;
+                                        innerR.operatorsBrowser = { action in
+                                            anchor_func();
+                                            return innerOperandsBrowser(action);
+                                        };
+                                        return innerR;
+                                    })
+                                );
+                            });
+                            
                             // reseat this divergence over a copy of the whole state
-                            let paramResolveCoreCtx = resolveCoreCtx;
+                            let paramResolveCoreCtx = interResolveCoreCtx;
                             let paramExploreAdditionOfRightNodeCtx = exploreAdditionOfRightNodeCtx;
                             
                             let task = Task {
@@ -246,7 +266,7 @@
                                                                                            _ innerExploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                                                            _ innerKind: Op,
                                                                                            _ innerStartingFrom: Int,
-                                                                                           _ innerWalkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) {
+                                                                                           _ innerWalkOperands: @escaping WalkerFuncNonGCD) {
                                         iteratePossibleLeftNodesPseudoReasync(&innerResolveCoreCtx,
                                                                               &innerExploreAdditionOfRightNodeCtx,
                                                                               { result in innerResultsScopingGroup.addTask { await resultMediator.yield(result);}},
@@ -271,7 +291,7 @@
                                                                                       &childExploreAdditionOfRightNodeCtx,
                                                                                       kind,
                                                                                       startingFrom,
-                                                                                      walkOperands);
+                                                                                      childWalkOperands);
                                     
                                     os_signpost(.end, log: LogWrapper.gObservationHandle, name: "micro-batch", signpostID: OSSignpostID(log: LogWrapper.gObservationHandle, object: Thread.current));
                                 }
@@ -291,7 +311,7 @@
                                                                   _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                                   _ kind: Op,
                                                                   _ startingFrom: Int,
-                                                                  _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
+                                                                  _ walkOperands: @escaping WalkerFuncNonGCD) async throws {
                             /* 1: Peek a bit: if we find out from startingFrom that no further left node is
                              going to be able to be added anyway (and remember: closing the composition as
                              it stands is performed in our caller), don't bother spawning a micro-batch.
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncCountdown-Bridging-Header.h
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/AsyncCountdown/AsyncCountdown-Bridging-Header.h   Sun Sep 17 12:12:20 2023 +0200
@@ -0,0 +1,13 @@
+//
+//  AsyncCountdown-Bridging-Header.h
+//  AsyncCountdown
+//
+//  Created by Pierre Lebeaupin on 15/09/2023.
+//
+
+#ifndef AsyncCountdown_Bridging_Header_h
+#define AsyncCountdown_Bridging_Header_h
+
+#include "anchor_func.h"
+
+#endif /* AsyncCountdown_Bridging_Header_h */
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncCountdown.swift
--- a/AsyncCountdown/AsyncCountdown.swift       Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncCountdown.swift       Sun Sep 17 12:12:20 2023 +0200
@@ -15,14 +15,15 @@
 
 import Foundation;
 
+typealias WalkerFuncNonGCD = @Sendable (_ action: (_ value: ReadyValueNonGCD,
+                                             _ reverse: Bool) -> Void?
+                                  ) -> Void?;
+   
 @available(macOS 12.0.0, iOS 15.0, *)
 struct ReadyValueNonGCD:Sendable {
     struct Expr {
         var op: Op;
-        var operatorsBrowser: @Sendable (_ action: (_ value: ReadyValueNonGCD,
-                                                    _ reverse: Bool)
-                                         -> Void?)
-        -> Void?;
+        var operatorsBrowser: WalkerFuncNonGCD;
     }
 
     let value: UInt32;
@@ -38,6 +39,11 @@
         expr = inExpr;
     }
     
+    init(template intemplate: Self, overridingExpr inexpr: Expr?) {
+        value = intemplate.value;
+        expr = inexpr;
+    }
+    
     func description() -> String {
          guard let localExpr = expr else {
              return value.description;
@@ -47,7 +53,7 @@
          var isFirst = true;
          var numRev = 0;
          
-         localExpr.operatorsBrowser({(_ subReadyValue: ReadyValueNonGCD,
+        localExpr.operatorsBrowser({(_ subReadyValue: ReadyValueNonGCD,
                     _ freverse: Bool) -> Void? in
              guard !freverse else {
                  numRev += 1;
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift       Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncExplorers.swift       Sun Sep 17 12:12:20 2023 +0200
@@ -46,10 +46,10 @@
     // Downstream ones are those created while this one is active
     
     try await dispatcher(&resolveCoreCtx,
-                     &reference,
-                     kind,
-                     /* startingFrom: */ skip,
-                     { _ in return nil;});
+                         &reference,
+                         kind,
+                         /* startingFrom: */ skip,
+                         { _ in return nil;});
 }
 
 /*
@@ -95,7 +95,7 @@
                                              _ startGoal: UInt32,
                                              _ kind: Op,
                                              _ startingFrom: Int,
-                                             _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?,
+                                             _ walkOperands: @escaping WalkerFuncNonGCD,
                                              _ dispatcher: DispatcherAsync) async rethrows {
     let emptyComposition = (walkOperands({_,_ in return ();}) == nil)
 
@@ -137,10 +137,10 @@
             };
             
             try await innerDispatcher(&innerResolveCoreCtx,
-                             &innerExploreAdditionOfRightNodeCtx,
-                             kind,
-                             /* startingFrom: */ candidateReverseOffset,
-                             selfNode);
+                                      &innerExploreAdditionOfRightNodeCtx,
+                                      kind,
+                                      /* startingFrom: */ candidateReverseOffset,
+                                      selfNode);
             
             // close current composition
             var num = 0;
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/DispatchAPI.swift
--- a/AsyncCountdown/DispatchAPI.swift  Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/DispatchAPI.swift  Sun Sep 17 12:12:20 2023 +0200
@@ -20,7 +20,7 @@
                         _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                         _ kind: Op,
                         _ startingFrom: Int,
-                        _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
+                        _ walkOperands: @escaping WalkerFunc) throws -> Void;
 
 
 @preconcurrency class ResolveObject {
@@ -40,7 +40,7 @@
                                                   _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                                                                 _ kind: Op,
                                                                 _ startingFrom: Int,
-                                                                _ walkOperands: @escaping  (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?)  {
+                                                                _ walkOperands: @escaping WalkerFunc)  {
             iteratePossibleLeftNodes(&resolveCoreCtx,
                                   &exploreAdditionOfRightNodeCtx,
                                   resultReceiver,
@@ -68,7 +68,7 @@
                                                   _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                                                   _ kind: Op,
                                                   _ startingFrom: Int,
-                                                  _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?)  {
+                                                  _ walkOperands: @escaping WalkerFunc)  {
                 /* 1: Peek a bit: if we find out from startingFrom that no further left node is
                  going to be able to be added anyway (and remember: closing the composition as
                  it stands is performed in our caller), don't bother spawning a micro-batch.
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/anchor_func.h
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/AsyncCountdown/anchor_func.h      Sun Sep 17 12:12:20 2023 +0200
@@ -0,0 +1,4 @@
+#ifndef ANCHOR_FUNC
+#define ANCHOR_FUNC
+void anchor_func(void);
+#endif // ANCHOR_FUNC
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/anchor_func.s
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/AsyncCountdown/anchor_func.s      Sun Sep 17 12:12:20 2023 +0200
@@ -0,0 +1,9 @@
+       .section        __TEXT,__text,regular,pure_instructions
+       .build_version macos, 13, 0     sdk_version 13, 3
+       .globl  _anchor_func
+       .p2align        2
+_anchor_func:
+       .cfi_startproc
+       ret
+       .cfi_endproc
+.subsections_via_symbols

Small update: involving a (potentially target-dependent) assembly file and a C import, just because I didn't know how else to make code off-limits to the Swift optimizer, didn't sit well with me, so I made another attempt at a swiftier solution. I have observed that the compiler, both in its internal operation and presumably in how it hands off its representation to LLVM (since the latter will also remove abstractions if they are presented to it the right way), is particularly respectful of class instances and does not appear to simplify code across class boundaries (e.g. propagate constraints, inline code, etc.). So I needed to involve a class somehow, while still ending up with a closure, because I need to transparently insert that cap precisely where needed. This is how I accomplished it:

final class ThreadLocalCapGutsNonGCD:Sendable {
    let closure: WalkerFuncNonGCD;
    init(closure inclosure: @escaping WalkerFuncNonGCD) {
        closure = inclosure;
    }
    
    class func cap(_ inclosure: @escaping WalkerFuncNonGCD) -> WalkerFuncNonGCD
    {
        // Put the closure in a wrapper…
        let wrapper = Self(closure: inclosure);
        // …so a new closure can capture the wrapper…
        return { @Sendable action in
            // …and recover the original closure from that wrapper
            return wrapper.closure(action);
            // This ought to ensure the wrapping isn't optimized away.
        };
    }
}

After profiling with Instruments and the MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM counter in event mode, I confirmed that this is sufficient to obtain the same reduction in cache-line bouncing as I previously obtained with the anchor_func()-based solution, and that there is no performance regression; so now I can get rid of the bridging header and the assembly source file.

The patch:

diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift     Sun Sep 17 12:12:20 2023 +0200
+++ b/AsyncCountdown/AsyncAPI.swift     Tue Sep 19 22:41:45 2023 +0200
@@ -236,21 +236,13 @@
                                                               _ kind: Op,
                                                               _ startingFrom: Int,
                                                               _ walkOperands: @escaping WalkerFuncNonGCD) async throws {
-                            // First, wrap the closures into a no-op closure that will serve as a cap
-                            let childWalkOperands = { @Sendable action in
-                                anchor_func();
-                                return walkOperands(action);
-                            };
+                            let childWalkOperands = ThreadLocalCapGutsNonGCD.cap(walkOperands);
                             var interResolveCoreCtx = resolveCoreCtx;
                             interResolveCoreCtx.l = resolveCoreCtx.l.map({e in
                                 return ReadyValueNonGCD(template: e,
                                                         overridingExpr: e.expr.map({ innerE in
                                         var innerR = innerE;
-                                        let innerOperandsBrowser = innerE.operatorsBrowser;
-                                        innerR.operatorsBrowser = { action in
-                                            anchor_func();
-                                            return innerOperandsBrowser(action);
-                                        };
+                                        innerR.operatorsBrowser = ThreadLocalCapGutsNonGCD.cap(innerE.operatorsBrowser);
                                         return innerR;
                                     })
                                 );
diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/AsyncCountdown-Bridging-Header.h
--- a/AsyncCountdown/AsyncCountdown-Bridging-Header.h   Sun Sep 17 12:12:20 2023 +0200
+++ /dev/null   Thu Jan 01 00:00:00 1970 +0000
@@ -1,13 +0,0 @@
-//
-//  AsyncCountdown-Bridging-Header.h
-//  AsyncCountdown
-//
-//  Created by Pierre Lebeaupin on 15/09/2023.
-//
-
-#ifndef AsyncCountdown_Bridging_Header_h
-#define AsyncCountdown_Bridging_Header_h
-
-#include "anchor_func.h"
-
-#endif /* AsyncCountdown_Bridging_Header_h */
diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/AsyncCountdown.swift
--- a/AsyncCountdown/AsyncCountdown.swift       Sun Sep 17 12:12:20 2023 +0200
+++ b/AsyncCountdown/AsyncCountdown.swift       Tue Sep 19 22:41:45 2023 +0200
@@ -18,6 +18,33 @@
 typealias WalkerFuncNonGCD = @Sendable (_ action: (_ value: ReadyValueNonGCD,
                                              _ reverse: Bool) -> Void?
                                   ) -> Void?;
+
+/* The aim of this class is to provide each thread (or task, async block, etc.)
+ with its own reference that will get retain/release traffic, rather than all
+ threads sharing a single reference directly and causing the cache line
+ containing the reference count to bounce between cores.
+ 
+ In order to fit in structure fields and parameters that actually expect a
+ WalkerFuncNonGCD, this class needs to be wrapped by a closure of that type. */
+
+final class ThreadLocalCapGutsNonGCD:Sendable {
+    let closure: WalkerFuncNonGCD;
+    init(closure inclosure: @escaping WalkerFuncNonGCD) {
+        closure = inclosure;
+    }
+    
+    class func cap(_ inclosure: @escaping WalkerFuncNonGCD) -> WalkerFuncNonGCD
+    {
+        // Put the closure in a wrapper…
+        let wrapper = Self(closure: inclosure);
+        // …so a new closure can capture the wrapper…
+        return { @Sendable action in
+            // …and recover the original closure from that wrapper
+            return wrapper.closure(action);
+            // This ought to ensure the wrapping isn't optimized away.
+        };
+    }
+}
    
 @available(macOS 12.0.0, iOS 15.0, *)
 struct ReadyValueNonGCD:Sendable {
diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/anchor_func.h
--- a/AsyncCountdown/anchor_func.h      Sun Sep 17 12:12:20 2023 +0200
+++ /dev/null   Thu Jan 01 00:00:00 1970 +0000
@@ -1,4 +0,0 @@
-#ifndef ANCHOR_FUNC
-#define ANCHOR_FUNC
-void anchor_func(void);
-#endif // ANCHOR_FUNC
diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/anchor_func.s
--- a/AsyncCountdown/anchor_func.s      Sun Sep 17 12:12:20 2023 +0200
+++ /dev/null   Thu Jan 01 00:00:00 1970 +0000
@@ -1,9 +0,0 @@
-       .section        __TEXT,__text,regular,pure_instructions
-       .build_version macos, 13, 0     sdk_version 13, 3
-       .globl  _anchor_func
-       .p2align        2
-_anchor_func:
-       .cfi_startproc
-       ret
-       .cfi_endproc
-.subsections_via_symbols

OK, how much impact do you think this patch could possibly have on the matter at hand?

diff -r 631593d9f4fa -r 07055be695f9 AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift       Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncExplorers.swift       Sat Sep 23 14:19:11 2023 +0200
@@ -143,8 +143,7 @@
                              selfNode);
             
             // close current composition
-            var num = 0;
-            if (selfNode({_,_ in guard num == 0 else {return ();}; num += 1; return nil;}) != nil)
+            if /* parent is */ (!emptyComposition)
                 && ( (kind == .additive) ? innerExploreAdditionOfRightNodeCtx.forwardValue > innerExploreAdditionOfRightNodeCtx.reverseValue :
                             ((innerExploreAdditionOfRightNodeCtx.forwardValue % innerExploreAdditionOfRightNodeCtx.reverseValue) == 0) ) {
                 

I did not think it could have much of one, myself, as I performed this change while in the middle of more significant rearrangements: those involved some changes to the signature of the selfNode parameter, and I considered it would be just as well to reuse a cached result here rather than keep tinkering with the signature of an anonymous closure that does not use its own parameters anyway, just to keep it in sync (if you don't feel up to reading how it works, just know this code makes sure an operation has at least two operands before we try to obtain its result). For reasons that will eventually become clear, I tested this patch in isolation, applied directly to the "without dictionaries" case. What's going to be the outcome?
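For readers who'd rather see the mechanism in isolation, here is a hedged, self-contained sketch of what the removed closure did (hypothetical simplified types, not the project's actual ReadyValue machinery): the walker calls the action once per operand, the action returning non-nil terminates the walk early, and that non-nil value is propagated as the walker's own result, so a non-nil result from the walker means at least two operands were seen.

```swift
// Simplified stand-in for the project's walker protocol: the walker invokes
// `action` once per operand; if `action` returns non-nil, the walk stops
// early and that value is propagated as the walker's own result.
typealias Walker = ((_ value: Int, _ reverse: Bool) -> Void?) -> Void?

// A non-nil result means the action asked to stop, i.e. a second operand exists.
func hasAtLeastTwoOperands(_ walk: Walker) -> Bool {
    var num = 0
    return walk({ _, _ in
        guard num == 0 else { return () }  // second call: terminate the walk
        num += 1
        return nil                         // first call: keep walking
    }) != nil
}

// A walker over a single operand, and one over two.
let oneOperand: Walker = { action in action(7, false) }
let twoOperands: Walker = { action in
    if let r = action(7, false) { return r }
    return action(3, true)
}
```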

Swift without dictionaries and with reduced walker calls

 N|  T  | factor
 1|100  | ref
 2| 62.8|x1.59
 3| 52.1|x1.92
 4| 48.1|x2.07
 5| 41.6|x2.40
 6| 37.2|x2.69
 7| 34.3|x2.91
 8| 32.3|x3.10
 9| 32.2|x3.11
10| 32.1|x3.12
11| 32.1|x3.12
12| 31.7|x3.15

(No change in methodology)

discussion

For the first time I break x2 with just the performance cores, and x3 when adding the efficiency cores: now that is a significant improvement (the absolute performance of the reference scenario also improved, in case you're curious). Now I have a pretty good idea of why that is, but I would like to hear your theories before I reveal it. Why did this small change have such an effect?

Bonus: in a further patch I also removed the call to walkOperands that sets emptyComposition (using instead specially added state that tracks the number of operands, so as to obtain the same result), but that change did not have a significant impact on scalability. Why is that, in your opinion?
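That bonus change can be sketched as follows (a hedged sketch, with a hypothetical simplified Composition type in place of the project's actual state): instead of invoking the walker just to learn whether the composition is empty, the composition carries an operand count that is updated as operands are appended, so the check becomes a plain field read with no closure call involved.

```swift
// Hypothetical simplified composition: the count is maintained alongside the
// operands, so emptiness is answered without walking anything.
struct Composition {
    private(set) var operands: [Int] = []
    private(set) var operandCount = 0

    var isEmpty: Bool { operandCount == 0 }

    mutating func append(_ operand: Int) {
        operands.append(operand)
        operandCount += 1
    }
}

var composition = Composition()
let wasEmpty = composition.isEmpty    // true: nothing appended yet
composition.append(25)
let stillEmpty = composition.isEmpty  // false: one operand present
```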


So, about that performance and scalability change…

Remember that earlier on, I made changes to prevent different CPUs from simultaneously manipulating the same reference counts, by having different threads manipulate different references. These changes were ultimately unsuccessful at improving scalability, but I kept them nevertheless; a common theme among them was making manipulation of the references in question more performant at the expense of actually visiting the data structures. So when I noticed the scalability improvement from the tiny patch above, I first wondered whether it wasn't related to avoiding the increased expense of performing those extra calls. That is why I tested the patch as applied to the "without dictionaries" code: to test that hypothesis.

Given that the scalability improvements from the tiny patch were preserved even in that case, I deduced that these improvements had nothing to do with the cost of recovering the wrapped closure from its wrapper, or the like. So when I made the post above, I thought it might have to do with accessing the data guarded by walkOperands: even if we discard the copy provided when we merely use the fact we're being called in order to increment num, that could still cause scalability trouble, such as when we generate accesses to a ReadyValue that (as rightChild) may be simultaneously accessed by other tasks in a similar manner.¹

So I set out to determine what was going on. But my hopes of finding that out with a well-chosen performance monitoring counter were soon dashed: none of them (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM, L2_RQSTS.RFO_MISS, nor a few others I tried on a lark) told me anything I did not already know from the time profiler, even with the benefit of being able to compare with and without the improvement (as opposed to only being able to profile the unimproved version while looking for such an improvement). In fact, in some cases the profile of the unimproved version showed a lower prevalence of the potential suspect functions (including in absolute terms), while a longer tail of smaller functions took more CPU time – a telltale sign of victim code, without my being able to identify the perpetrating phenomenon.

Out of ideas, I tried looking for weaker signals, notably how the time spent in certain functions, possibly called in a certain way, varied between the two scenarios, while ignoring or at least de-emphasizing other information.

To do that, seeing the limits of Instruments, I turned to flame graphs.

After some tinkering, I finally came up with these:

(before the improvement, then after. Fun fact: in order to come up with these, I had to hack up support for generating flame graphs from inverted call trees. Isn't Perl fun?)

While there are some shifts here and there, the major change is the number of swift_release calls made by common(PseudoReasync). After the improvement, they're gone! Now, I have on occasion noticed the compiler shifting responsibilities between some of my own functions without necessarily inlining them, sometimes even assigning those responsibilities to thunks², so I checked whether these calls had instead moved to other functions of mine, but no (with the possible exception of destructors, but that cannot be the effect of a shift of responsibility introduced by the compiler, can it?)³.

So, while I previously did not look too closely at where in my code the calls to the Swift runtime occur, this time I ended up looking at the disassembly of my code⁴ and noticed that, before the improvement, there is a call to swift_release in the main epilogue of common (along with an "outlined consume of ReadyValue.Expr?"), while after the improvement there is no equivalent call in common, including elsewhere in the same function; and if there were such an equivalent call elsewhere in my code (e.g. responsibility for calling it shifting to the caller of common somehow), it would show up in the flame graph, but it doesn't.

At this point, I still don't know for sure which reference it is that gets the extra swift_release (and swift_retain, presumably, though in a way that is much less performance-impacting), but I know enough.


The whole point of this exploration was to see how well Swift can scale by itself: even with assembly language and lock-free primitives, you're always bound to discover surprising scaling behaviour on your first try, so the important part is the process by which you discover and resolve those issues; here I wanted to see how far I could take Swift on that same road: when discovering issues, I wanted to see how far I could get by resolving them with more Swift, not less (and preferably, keeping the code as idiomatic as possible: it is said you can write FORTRAN in any language…)

Concurrent Swift has a lot going for it, notably a Task API that gets more right than any I've previously seen. For instance, when I wrote the concurrent C version of this code, I implemented recycling of allocations so a task could reuse allocations made by the one that just completed; but the libdispatch API does not support the notion of recovering a "result" from an async block, so I had to create an ad-hoc system with a pair of semaphores that surely introduces bottlenecks of its own.
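To illustrate the contrast (a minimal sketch, assuming a context where top-level await is available, e.g. a main.swift built with Swift 5.7 or later): a Swift Task hands its result back directly through its value property, whereas the libdispatch equivalent needs out-of-band plumbing such as the semaphore pair described above.

```swift
// A child task computes a value; the parent recovers the result simply by
// awaiting `value`: no semaphores, no ad-hoc shared result slot needed.
let task = Task { () -> Int in
    (1...1_000).reduce(0, +)
}
let sum = await task.value
print(sum)  // 500500
```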

The Swift Task API supports that out of the box, and this is the reason why I will keep tinkering with async Swift. But getting such a variation in behaviour, comparable to that obtained by a non-trivial patch (the one which removed the dictionaries), as a side effect (not even the main effect, since the extra swift_release calls did not come from the removed invocations proper) of a one-line change means I am always at the mercy of seemingly random performance variations that would drown out any intended improvement, with little help from the diagnostics or profiling tools available to me. In particular, the more elaborate the technique to help scalability (such as some system to recycle Swift allocations), the more likely the compiler would be to simply fail at the reference-counting and other runtime-impacting optimisations it is succeeding at right now, leaving me little way to reason about what I am doing. Right now, to me, trying to make pure Swift code scale well up to 100% on all CPUs is a fool's errand.


1: Even if, as you've probably noticed, that enumeration supports early termination and so will not stray too far away in that particular invocation.

2: Ah, our friend partial apply for iteratePossibleLeftNodesFakeDispatchPseudoReasync #1 (_:_:_:_:_:) in closure #1 in closure #2 in iteratePossibleLeftNodesDispatch #1 (_:_:_:_:_:) in closure #1 in closure #1 in closure #2 in closure #1 in resolveAsync(_:_:)…

3: and I took into account the fact that _swift_release_dealloc is in fact tail-called by swift_release when looking for other culprits. Speaking of which, there is an attribution error in the flame graph: due to the use of "Flatten AsyncCountdown to boundary frames", frames that include a destructor assign the responsibility of the caller of _swift_release_dealloc directly to the bottom of the stack, since the destructor is considered the boundary frame; this does not happen in frames where _swift_release_dealloc is not calling any of my destructors. I can't remove this data mining setting, or I end up with illegible flame graphs that weigh multiple megabytes due to the use of indirect recursion in my code.

4: With the help of Instruments, which this time was somewhat cooperative, for reasons I have not entirely explained: I worked differently in some respects, notably off saved copies of the executables instead of having them reside directly in the Xcode build directory, so as to be able to repeat experiments without recompiling back and forth between the version without improvements and the one with.


I keep worrying that nobody replying will send the wrong signal on this. I for one have found this interesting and useful to follow along with, even though I haven't contributed; so you have at least one interested reader (and I suspect more)! :slightly_smiling_face:

One thought I've had a few times is that I might be inclined to take a look at your example myself, when I get some free time, but that's hindered by not having a simple way to obtain your full code. Perhaps you could put it on GitHub or similar?


Agreed. I have been following this thread with great interest. I find it worrisome that no Apple engineer replied.


That is a fair remark; I haven't had a public code portfolio since Bitbucket sunsetted their Mercurial hosting¹, so for this need I finally bit the bullet and evaluated my options. I ended up settling on SourceHut: ~zpedro/AsyncCountdown - Countdown numbers round solver written in async Swift - sourcehut hg

In particular, I added tags corresponding to the various profiled versions:

Don't forget to run RegenDerived.sh from the AsyncCountdown subfolder once you have downloaded/cloned the project (and to do so again after each update, etc.): this is needed until reasync is available.


1: I prefer Mercurial to git so far, in large part because of the Evolve extension, which is invaluable for refining a work in progress that needs to bounce between different machines.
