Improving Swift code for scalability

I recently acquired a Mac mini M2 (4 performance cores, 4 efficiency cores); even though I did not get the M2 Pro (much less the SMP monsters featured in the Mac Studio), I am very interested in seeing how well I can exploit these CPUs using thread-level parallelism. Particularly so with Swift, which was released at a time when all Macs and iOS devices were already multi-core and seemed tailor-made for such a multi-core reality, a potential that was finally unveiled with the release of concurrency features starting with Swift 5.5.

So imagine my disappointment when I ran my experimentation project on it and obtained the following run times:

Naive Swift

 N|  T  | factor
 1|100  | ref
 2| 83.5|x1.20
 3| 69.5|x1.44
 4| 79.5|x1.26
 5| 60.8|x1.64
 6| 50.0|x2
 7| 47.2|x2.12
 8| 39.8|x2.51
 9| 45.2|x2.21
10| 45.1|x2.22
11| 45.3|x2.21
12| 39.2|x2.56

The first column is the number of tasks that my code spawns before requiring that a task complete as a condition for an additional task to be spawned; the code is run as micro-batches whose size is independent of the number of processors and of this parameter. I went up to 12 as a sanity check. The throttling scheme itself is sketched below.
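For illustration, the heart of that throttling scheme looks roughly like the following minimal sketch (names here are illustrative; the real code lives in the project's Computer type, see the patch in footnote 1):

// Minimal sketch of the throttling scheme (names are illustrative):
// at most maxOutstanding micro-batches are in flight at any given time;
// spawning one more first requires waiting for one of them to complete.
func runBatches(_ batches: [@Sendable () async -> Void],
                maxOutstanding: UInt) async {
    await withTaskGroup(of: Void.self) { group in
        var outstandingTasks = UInt(0);
        for batch in batches {
            if outstandingTasks >= maxOutstanding {
                _ = await group.next();  // a micro-batch completed
                outstandingTasks -= 1;
            }
            group.addTask { await batch(); }
            outstandingTasks += 1;
        }
        // the group implicitly awaits any remaining micro-batches on exit
    }
}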

The second column is the time taken in each scenario, normalized to a duration of 100 for the sequential case (i.e. where the above parameter is 1; that code uses the async APIs too anyway, so the comparison is fair): I am not interested in the absolute performance here, only in whether it scales as expected. Smaller numbers are better.

The third column is how much better that scenario performs compared to the first, expressed as a performance multiplier. It is the inverse of the second column, presented for convenience. Higher numbers are better.

Methodology: I opened my project (which you can find as the attachment to the May 17th comment at It's time for `reasync` · Issue #65475 · apple/swift · GitHub) with Xcode 15.0 beta 8 and profiled it with the time profiler template. For each run, I took the difference between the end time and the initialization time, both as reported by Instruments. After an initial run to warm the caches, whose results were discarded, each scenario was run 3 times, and the median time was kept as the scenario outcome. Parallelism was controlled through an environment variable, which requires the following patch¹ to be applied over the project version from May 17th; that the requested parallelism was actually obtained was confirmed by inspection of the time profiler instrument (except for parallelism values above 8). The complexity of the computation was chosen so that a scenario completes in a handful of seconds, even in the case of the reference (longest-duration) scenario, so as to avoid thermal throttling effects.

discussion

Right off the bat, there are obviously some issues with these results: you can see performance regress when going from 3-way to 4-way parallelism, and again when going from 8-way to 9-way (before recovering at 12); there appear to be bimodal behaviors at work. But the general trend is clear: scalability is awful. I double-checked the time profiler instrument for obvious misbehavior, such as a single task completing much later than the others and delaying the end of the processing, but no such luck. For sanity, I also checked on the machine I was relying on before (2016 13-inch MacBook Pro, no Touch Bar), but I saw similar results when going from 1 to 2 parallel tasks.

When I tried the dedicated Swift concurrency template, it failed to provide much insight: to begin with, it appears to be overloaded by the number of tasks that are spawned (~5000), even though there are never more than N alive at any given time, resulting in my processing slowing to a crawl.

Looking at the profiling results, I started to have ideas for improvements, but first I wanted to see what kind of scalability I could legitimately expect, so as to know what I could demand of my Swift code. So I reimplemented my processing in C, with concurrency obtained through libdispatch.
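That C reimplementation is not reproduced here; its throttle has roughly the following shape (a minimal sketch with illustrative names, expressed in Swift against the same libdispatch API the C version uses):

import Dispatch;

// Minimal sketch (illustrative names): a counting semaphore caps the number
// of micro-batches in flight, and a dispatch group tracks overall completion.
func runBatches(_ batches: [@Sendable () -> Void], maxOutstanding: Int) {
    let slots = DispatchSemaphore(value: maxOutstanding);
    let completion = DispatchGroup();
    for batch in batches {
        slots.wait();  // block until a slot frees up
        DispatchQueue.global(qos: .userInitiated).async(group: completion) {
            batch();
            slots.signal();
        }
    }
    completion.wait();  // drain the last micro-batches
}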

Fast-forward a few weeks…

C with libdispatch

 N|  T  | factor
 1|100  | ref
 2| 55.9|x1.79
 3| 36.5|x2.74
 4| 28.8|x3.47
 5| 28.3|x3.53
 6| 27.5|x3.64
 7| 26.4|x3.79
 8| 24.6|x4.07
 9| 24.0|x4.16
10| 23.7|x4.23
11| 22.8|x4.38
12| 22.8|x4.38

Methodology is similar; the only difference from the Swift version is the computation load, which was adjusted so that the total duration in the reference scenario is in the same ballpark, so as to obtain comparable thermal throttling effects, if any.

discussion

These results are much better behaved, for a start (this is confirmed by the individual runs, which group tightly), and while scaling is not linear from 1 to 4, it is close enough that I don't feel computational resources are wasted. As expected, going from 4 to 8 improves performance much less, as we get into the efficiency cores, but it still improves the situation. Performance does still improve beyond 8, possibly because we afford the system more opportunities for scheduling useful work, but we clearly hit diminishing returns by that point.

Coming back to the Swift version, one peculiarity I noticed in the profile is that some symbols related to dictionaries (hashing-related infrastructure, in particular) jump from nowhere near the top in the reference runs to the top of the symbols (sorted by descending weight in the inverted call tree) for all other runs, even when parallelism is barely 2. So I worked on eliminating dictionaries from my naive Swift code, and repeated the experiment.
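The gist of the change, in simplified form (the actual patch is attached below²): the hot path kept a pair of running values in a two-entry dictionary keyed by Bool, which drags hashing infrastructure and heap traffic into every access; a plain struct with two stored properties compiles down to direct loads and stores instead. The names in this illustration are not the ones from the patch:

// Before (simplified): every access such as decomposedValue[reverse]! goes
// through Dictionary's hashing machinery.
// var decomposedValue = [false: kind.neutralValue(), true: kind.neutralValue()];

// After (simplified): a struct with two stored properties; the subscript
// mirrors the old access-by-Bool pattern without any hashing.
struct DecomposedValue {
    var forward: UInt32;
    var reverse: UInt32;

    subscript(isReverse: Bool) -> UInt32 {
        get { return isReverse ? reverse : forward; }
        set { if isReverse { reverse = newValue; } else { forward = newValue; } }
    }
}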

Swift without dictionaries

 N|  T  | factor
 1|100  | ref
 2| 69.1|x1.45
 3| 62.0|x1.61
 4| 58.6|x1.71
 5| 48.3|x2.07
 6| 41.9|x2.39
 7| 37.9|x2.64
 8| 36.1|x2.77
 9| 36.1|x2.77
10| 36.1|x2.77
11| 35.9|x2.78
12| 36.0|x2.77

Methodology is the same as for the naive Swift table; the only difference is the improvement patch, attached here² (I am not happy with some of the compromises I had to make in order to remove the usage of dictionaries; I welcome suggestions for improvements to better handle the situation).

discussion

This code is already much better behaved than the naive Swift version (this is all the more obvious in the individual runs). Scalability is much better at the earlier levels, though the behavior is somewhat erratic (a much bigger improvement going from 4 to 5 than from 3 to 4 parallel tasks), and the factor caps at a level not much better than it did earlier.

At any rate, we are still far from the scalability that could legitimately be expected of this algorithm.

Further improvements

The improvements I obtained by removing dictionaries suggest that the issues I am having with scalability here may not be related solely to the way I use the Swift concurrency APIs, but also to how the Swift code proper behaves, so I think this should be a significant avenue for investigation.

Unfortunately, I have failed to find any further leads in what Instruments or Xcode could tell me, so I am looking for means to investigate the remaining scalability impediments: tooling, experiments to perform, known issues, best practices, etc.


1:

diff -r 330582a717f6 -r d2ef6283be0b AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift     Wed May 17 01:41:53 2023 +0200
+++ b/AsyncCountdown/AsyncAPI.swift     Fri Jul 14 22:58:55 2023 +0200
@@ -223,7 +223,11 @@
                                  source: DispatchSource.makeSignalSource(signal: SIGINT,
                                              queue: DispatchQueue.global(qos:.userInitiated)),
                                  cancelHandler: { Task { await cancellator.raise();}});
-        let _ = Computer(exec: { () -> Channel in
+        var param: UInt?;
+        if let config = getenv("net_wanderingcoder_cpucount"), config[0] != 0 {
+            param = UInt(strlen(config));
+        }
+        let _ = Computer(cpuCount: param, exec: { () -> Channel in
             return await Channel(exec: { (sink) in
                 /* While the subtasks added to this group never throw themselves,
                  this is a throwing group so that it can be traversed by a thrown
diff -r 330582a717f6 -r d2ef6283be0b AsyncCountdown/PullAPI.swift
--- a/AsyncCountdown/PullAPI.swift      Wed May 17 01:41:53 2023 +0200
+++ b/AsyncCountdown/PullAPI.swift      Fri Jul 14 22:58:55 2023 +0200
@@ -21,14 +21,18 @@
 }
 
 @available(macOS 12.0, iOS 15.0, *) struct Computer<BatchIterator: BatchIteratorProtocol> {
-    init(exec: @Sendable @escaping () async -> BatchIterator) {
+    init(cpuCount: UInt? = nil, exec: @Sendable @escaping () async -> BatchIterator) {
+        guard cpuCount != 0 else
+        {
+            fatalError("Computer instantiated with 0 CPU!");
+        }
         Task {
             await withTaskGroup(of: Void.self) { group in
                 var outstandingTasks = UInt(0);
                 var iterator = await exec();
 
                 while (true) {
-                    guard outstandingTasks < ProcessInfo.processInfo.activeProcessorCount else {
+                    guard outstandingTasks < cpuCount ?? UInt(ProcessInfo.processInfo.activeProcessorCount) else {
                         _ = await group.next();
                         outstandingTasks -= 1;
                         continue;

2:

diff -r d2ef6283be0b -r cdcaaddad141 AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift     Fri Jul 14 22:58:55 2023 +0200
+++ b/AsyncCountdown/AsyncAPI.swift     Sun Jul 02 19:01:50 2023 +0200
@@ -16,20 +16,14 @@
 import Foundation;
 import os;
 
-typealias DispatcherAsync = (_ l: inout [ReadyValueNonGCD],
-                        _ composedCount: inout Int,
-                        _ otherComposedCount: inout Int,
-                        _ decomposedValue: inout [Bool:UInt32],
-                        _ downstreamSiblingsRecruitmentSkip: inout Int,
+typealias DispatcherAsync = (_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                        _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                         _ kind: Op,
                         _ startingFrom: Int,
                         _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws -> Void;
 
-typealias DispatcherPseudoReasync = (_ l: inout [ReadyValueNonGCD],
-                        _ composedCount: inout Int,
-                        _ otherComposedCount: inout Int,
-                        _ decomposedValue: inout [Bool:UInt32],
-                        _ downstreamSiblingsRecruitmentSkip: inout Int,
+typealias DispatcherPseudoReasync = (_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                     _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                         _ kind: Op,
                         _ startingFrom: Int,
                         _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
@@ -237,62 +231,29 @@
                  */
                 do {
                     try await withThrowingTaskGroup(of: Void.self) { resultsScopingGroup in
-                        func iteratePossibleLeftNodesDispatch(_ l: inout [ReadyValueNonGCD],
-                                                              _ composedCount: inout Int,
-                                                              _ otherComposedCount: inout Int,
-                                                              _ decomposedValue: inout [Bool:UInt32],
-                                                              _ downstreamSiblingsRecruitmentSkip: inout Int,
+                        func iteratePossibleLeftNodesDispatch(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                                              _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                               _ kind: Op,
                                                               _ startingFrom: Int,
                                                               _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
                             // reseat this divergence over a copy of the whole state
-                            let paramL = l;
-                            let paramComposedCount = composedCount;
-                            let paramOtherComposedCount = otherComposedCount;
-                            let paramDecomposedValue = decomposedValue;
-                            let paramDownstreamSiblingsRecruitmentSkip = downstreamSiblingsRecruitmentSkip;
+                            let paramResolveCoreCtx = resolveCoreCtx;
+                            let paramExploreAdditionOfRightNodeCtx = exploreAdditionOfRightNodeCtx;
                             
                             let task = Task {
                                 await withTaskGroup(of: Void.self) { innerResultsScopingGroup in
-                                    /*func iteratePossibleLeftNodesFakeDispatchAsync(_ l: inout [ReadyValueNonGCD],
-                                     _ composedCount: inout Int,
-                                     _ otherComposedCount: inout Int,
-                                     _ decomposedValue: inout [Bool:UInt32],
-                                     _ downstreamSiblingsRecruitmentSkip: inout Int,
-                                     _ kind: Op,
-                                     _ startingFrom: Int,
-                                     _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async {
-                                     await iteratePossibleLeftNodesAsync(&l,
-                                     &composedCount,
-                                     &otherComposedCount,
-                                     &decomposedValue,
-                                     &downstreamSiblingsRecruitmentSkip,
-                                     resultReceiver,
-                                     startGoal,
-                                     kind,
-                                     startingFrom,
-                                     walkOperands,
-                                     iteratePossibleLeftNodesFakeDispatchAsync);
-                                     }*/
-                                    
-                                    func iteratePossibleLeftNodesFakeDispatchPseudoReasync(_ l: inout [ReadyValueNonGCD],
-                                                                                           _ composedCount: inout Int,
-                                                                                           _ otherComposedCount: inout Int,
-                                                                                           _ decomposedValue: inout [Bool:UInt32],
-                                                                                           _ downstreamSiblingsRecruitmentSkip: inout Int,
-                                                                                           _ kind: Op,
-                                                                                           _ startingFrom: Int,
-                                                                                           _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) {
-                                        iteratePossibleLeftNodesPseudoReasync(&l,
-                                                                              &composedCount,
-                                                                              &otherComposedCount,
-                                                                              &decomposedValue,
-                                                                              &downstreamSiblingsRecruitmentSkip,
+                                    func iteratePossibleLeftNodesFakeDispatchPseudoReasync(_ innerResolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                                                                           _ innerExploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
+                                                                                           _ innerKind: Op,
+                                                                                           _ innerStartingFrom: Int,
+                                                                                           _ innerWalkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) {
+                                        iteratePossibleLeftNodesPseudoReasync(&innerResolveCoreCtx,
+                                                                              &innerExploreAdditionOfRightNodeCtx,
                                                                               { result in innerResultsScopingGroup.addTask { await resultMediator.yield(result);}},
                                                                               startGoal,
-                                                                              kind,
-                                                                              startingFrom,
-                                                                              walkOperands,
+                                                                              innerKind,
+                                                                              innerStartingFrom,
+                                                                              innerWalkOperands,
                                                                               iteratePossibleLeftNodesFakeDispatchPseudoReasync);
                                     }
                                     
@@ -303,17 +264,11 @@
                                      partial functions on the same thread either. */
                                     os_signpost(.begin, log: LogWrapper.gObservationHandle, name: "micro-batch", signpostID: OSSignpostID(log: LogWrapper.gObservationHandle, object: Thread.current));
                                     
-                                    var childL = paramL;
-                                    var childComposedCount = paramComposedCount;
-                                    var childOtherComposedCount = paramOtherComposedCount;
-                                    var childDecomposedValue = paramDecomposedValue;
-                                    var childDownstreamSiblingsRecruitmentSkip = paramDownstreamSiblingsRecruitmentSkip;
+                                    var childResolveCoreCtx = paramResolveCoreCtx;
+                                    var childExploreAdditionOfRightNodeCtx = paramExploreAdditionOfRightNodeCtx;
                                     
-                                    iteratePossibleLeftNodesFakeDispatchPseudoReasync(&childL,
-                                                                                      &childComposedCount,
-                                                                                      &childOtherComposedCount,
-                                                                                      &childDecomposedValue,
-                                                                                      &childDownstreamSiblingsRecruitmentSkip,
+                                    iteratePossibleLeftNodesFakeDispatchPseudoReasync(&childResolveCoreCtx,
+                                                                                      &childExploreAdditionOfRightNodeCtx,
                                                                                       kind,
                                                                                       startingFrom,
                                                                                       walkOperands);
@@ -332,11 +287,8 @@
                             try await cancellator.checkRaised();
                         }
                         
-                        func iteratePossibleLeftNodesDispatchLoop(_ l: inout [ReadyValueNonGCD],
-                                                                  _ composedCount: inout Int,
-                                                                  _ otherComposedCount: inout Int,
-                                                                  _ decomposedValue: inout [Bool:UInt32],
-                                                                  _ downstreamSiblingsRecruitmentSkip: inout Int,
+                        func iteratePossibleLeftNodesDispatchLoop(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                                                  _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                                   _ kind: Op,
                                                                   _ startingFrom: Int,
                                                                   _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
@@ -349,13 +301,10 @@
                              - 4 already means too few possibilities to explore
                              - 3 is right out
                              Note that among other properties, this estimator value is monotonic. */
-                            guard startingFrom < (l.count - composedCount),
-                                  l.count + (walkOperands({_,_ in return ();}) != nil ? 1 : 0) == 5 else {
-                                try await iteratePossibleLeftNodesAsync(&l,
-                                                                        &composedCount,
-                                                                        &otherComposedCount,
-                                                                        &decomposedValue,
-                                                                        &downstreamSiblingsRecruitmentSkip,
+                            guard startingFrom < (resolveCoreCtx.l.count - resolveCoreCtx.composedCount),
+                                  resolveCoreCtx.l.count + (walkOperands({_,_ in return ();}) != nil ? 1 : 0) == 5 else {
+                                try await iteratePossibleLeftNodesAsync(&resolveCoreCtx,
+                                                                        &exploreAdditionOfRightNodeCtx,
                                                                         { result in resultsScopingGroup.addTask { await resultMediator.yield(result);}},
                                                                         startGoal,
                                                                         kind,
@@ -365,11 +314,8 @@
                                 return;
                             }
                             
-                            try await iteratePossibleLeftNodesDispatch(&l,
-                                                                       &composedCount,
-                                                                       &otherComposedCount,
-                                                                       &decomposedValue,
-                                                                       &downstreamSiblingsRecruitmentSkip,
+                            try await iteratePossibleLeftNodesDispatch(&resolveCoreCtx,
+                                                                       &exploreAdditionOfRightNodeCtx,
                                                                        kind,
                                                                        startingFrom,
                                                                        walkOperands);
diff -r d2ef6283be0b -r cdcaaddad141 AsyncCountdown/AsyncCountdown.swift
--- a/AsyncCountdown/AsyncCountdown.swift       Fri Jul 14 22:58:55 2023 +0200
+++ b/AsyncCountdown/AsyncCountdown.swift       Sun Jul 02 19:01:50 2023 +0200
@@ -101,24 +101,32 @@
 }
 
 @available(macOS 12.0.0, iOS 15.0, *)
+struct ResolveCoreCtxNonGCD:Sendable {
+    var l = [ReadyValueNonGCD]();
+    var composedCount = 0;
+    var otherComposedCount = 0;
+};
+
+@available(macOS 12.0.0, iOS 15.0, *)
+struct ExploreAdditionOfRightNodeCtxNonGCD:Sendable {
+    var forwardValue: UInt32;
+    var reverseValue: UInt32;
+    var downstreamSiblingsRecruitmentSkip: Int;
+}
+
+
+@available(macOS 12.0.0, iOS 15.0, *)
 func resolveCoreNonGCD(_ injectedDispatcher: DispatcherAsync, _ primitives: [UInt32]) async throws {
-    var referenceL = [ReadyValueNonGCD]();
-    var referenceComposedCount = 0;
-    var referenceOtherComposedCount = 0;
-            
+    var reference = ResolveCoreCtxNonGCD();
             
     for element in primitives {
-        referenceL.append(.init(element));
+        reference.l.append(.init(element));
     }
         
-    try await exploreAdditionOfFloorAsync(&referenceL,
-                                     &referenceComposedCount,
-                                     &referenceOtherComposedCount,
+    try await exploreAdditionOfFloorAsync(&reference,
                                      injectedDispatcher,
                                      kind: .additive);
-    try await exploreAdditionOfFloorAsync(&referenceL,
-                                     &referenceComposedCount,
-                                     &referenceOtherComposedCount,
+    try await exploreAdditionOfFloorAsync(&reference,
                                      injectedDispatcher,
                                      kind: .multiplicative);
 }
diff -r d2ef6283be0b -r cdcaaddad141 AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift       Fri Jul 14 22:58:55 2023 +0200
+++ b/AsyncCountdown/AsyncExplorers.swift       Sun Jul 02 19:01:50 2023 +0200
@@ -16,45 +16,37 @@
 import Foundation;
             
             
-func exploreAdditionOfFloorAsync(_ l: inout [ReadyValueNonGCD],
-                                               _ composedCount: inout Int,
-                                               _ otherComposedCount: inout Int,
+func exploreAdditionOfFloorAsync(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
                                                _ dispatcher: DispatcherAsync,
                                            kind: Op) async rethrows {
-    guard otherComposedCount == 0 else {
+    guard resolveCoreCtx.otherComposedCount == 0 else {
         return;
     }
     
-    otherComposedCount = composedCount;
-    composedCount = 0;
+    resolveCoreCtx.otherComposedCount = resolveCoreCtx.composedCount;
+    resolveCoreCtx.composedCount = 0;
     
-    try await exploreAdditionOfRightNodeAsync(&l,
-                                          &composedCount,
-                                          &otherComposedCount,
+    try await exploreAdditionOfRightNodeAsync(&resolveCoreCtx,
                                           dispatcher,
                                           kind: kind,
                                           skip: 0);
 
-    composedCount = otherComposedCount;
-    otherComposedCount = 0;
+    resolveCoreCtx.composedCount = resolveCoreCtx.otherComposedCount;
+    resolveCoreCtx.otherComposedCount = 0;
 }
 
-func exploreAdditionOfRightNodeAsync(_ l: inout [ReadyValueNonGCD],
-                                               _ composedCount: inout Int,
-                                               _ otherComposedCount: inout Int,
+func exploreAdditionOfRightNodeAsync(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
                                                _ dispatcher: DispatcherAsync,
                                                kind: Op,
                                                skip: Int) async rethrows {
-    var referenceDecomposedValue = [false: kind.neutralValue(), true: kind.neutralValue()];
+    var reference = ExploreAdditionOfRightNodeCtxNonGCD(forwardValue: kind.neutralValue(),
+                                                        reverseValue: kind.neutralValue(),
+                                                        downstreamSiblingsRecruitmentSkip: Int(0)) // placeholder
     // Siblings are the other right nodes on the same floor
     // Downstream ones are those created while this one is active
-    var downstreamSiblingsRecruitmentSkip = Int(0); // placeholder
     
-    try await dispatcher(&l,
-                     &composedCount,
-                     &otherComposedCount,
-                     &referenceDecomposedValue,
-                     &downstreamSiblingsRecruitmentSkip,
+    try await dispatcher(&resolveCoreCtx,
+                     &reference,
                      kind,
                      /* startingFrom: */ skip,
                      { _ in return nil;});
@@ -97,11 +89,8 @@
 the same floor) recovering candidates in a way that results in a composition
 that we ourselves explored earlier, but that is the basic principle.
 */
-func iteratePossibleLeftNodesAsync(_ l: inout [ReadyValueNonGCD],
-                                             _ composedCount: inout Int,
-                                             _ otherComposedCount: inout Int,
-                                             _ decomposedValue: inout [Bool:UInt32],
-                                             _ downstreamSiblingsRecruitmentSkip: inout Int,
+func iteratePossibleLeftNodesAsync(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                             _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                              _ resultReceiver: (_ result: String) -> Void,
                                              _ startGoal: UInt32,
                                              _ kind: Op,
@@ -117,35 +106,38 @@
     would never reach 0 no matter how many siblings we'd add, and so we'd never be
     able to close the floor anyway. So it'd be pointless to keep going. */
     for candidateReverseOffset in
-            startingFrom ..< ((emptyComposition && (otherComposedCount > 0)) ? startingFrom+1 : (l.count - composedCount) ) {
+            startingFrom ..< ((emptyComposition && (resolveCoreCtx.otherComposedCount > 0)) ? startingFrom+1 : (resolveCoreCtx.l.count - resolveCoreCtx.composedCount) ) {
         if emptyComposition {
-            downstreamSiblingsRecruitmentSkip = candidateReverseOffset;
+            exploreAdditionOfRightNodeCtx.downstreamSiblingsRecruitmentSkip = candidateReverseOffset;
         }
 
         /* Note: new compositions are added at the end of the array, i.e. at the high offsets,
          while primitives (if any left) will be at the low offsets. As a result, the candidates
          to be recruited as a first priority are found at the high offsets first, though not so
          high that we would end up recruiting from the same floor (forbidden). */
-        let rightChild = l.remove(at: (l.count - composedCount - 1) - candidateReverseOffset);
+        let rightChild = resolveCoreCtx.l.remove(at: (resolveCoreCtx.l.count - resolveCoreCtx.composedCount - 1) - candidateReverseOffset);
         if let _ = rightChild.expr {
-            otherComposedCount -= 1;
+            resolveCoreCtx.otherComposedCount -= 1;
         }
         
-        for phase in 0...1 {
-            let reverse = (phase == 1);
-            { (_ valueComponent: inout UInt32) in
-                valueComponent = kind.combine(valueComponent, rightChild.value);
-            }(&decomposedValue[reverse]!);
-            
+        func combine(_ total: inout UInt32) {
+            total = kind.combine(total, rightChild.value);
+        }
+
+        func uncombine(_ total: inout UInt32) {
+            total = kind.uncombine(total, rightChild.value);
+        }
+        
+        func commonAsync(_ innerResolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                         _ innerExploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
+                         isReverse: Bool,
+                         _ innerDispatcher: DispatcherAsync) async rethrows {
             let selfNode = {@Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void? in
-                return action(rightChild, reverse) ?? walkOperands(action);
+                return action(rightChild, isReverse) ?? walkOperands(action);
             };
             
-            try await dispatcher(&l,
-                             &composedCount,
-                             &otherComposedCount,
-                             &decomposedValue,
-                             &downstreamSiblingsRecruitmentSkip,
+            try await innerDispatcher(&innerResolveCoreCtx,
+                             &innerExploreAdditionOfRightNodeCtx,
                              kind,
                              /* startingFrom: */ candidateReverseOffset,
                              selfNode);
@@ -153,49 +145,49 @@
             // close current composition
             var num = 0;
             if (selfNode({_,_ in guard num == 0 else {return ();}; num += 1; return nil;}) != nil)
-                    && ( (kind == .additive) ? decomposedValue[false]! > decomposedValue[true]! :
-                            ((decomposedValue[false]! % decomposedValue[true]!) == 0) ) {
+                && ( (kind == .additive) ? innerExploreAdditionOfRightNodeCtx.forwardValue > innerExploreAdditionOfRightNodeCtx.reverseValue :
+                            ((innerExploreAdditionOfRightNodeCtx.forwardValue % innerExploreAdditionOfRightNodeCtx.reverseValue) == 0) ) {
                 
-                let realizedValue = kind.uncombine(decomposedValue[false]!, decomposedValue[true]!);
+                let realizedValue = kind.uncombine(innerExploreAdditionOfRightNodeCtx.forwardValue, innerExploreAdditionOfRightNodeCtx.reverseValue);
                 
                 let readyValue = ReadyValueNonGCD(value: realizedValue,
                                                   expr: .init(op: kind,
                                                        operatorsBrowser: selfNode))
                 
-                if l.count == 0 {
+                if innerResolveCoreCtx.l.count == 0 {
                     if realizedValue == startGoal {
                         resultReceiver(readyValue.description());
                     }
                 } else {
-                    composedCount += 1;
-                    l.append(readyValue);
+                    innerResolveCoreCtx.composedCount += 1;
+                    innerResolveCoreCtx.l.append(readyValue);
                     
-                    try await exploreAdditionOfRightNodeAsync(&l,
-                                                          &composedCount,
-                                                          &otherComposedCount,
-                                                          dispatcher,
+                    try await exploreAdditionOfRightNodeAsync(&innerResolveCoreCtx,
+                                                              innerDispatcher,
                                                           kind: kind,
-                                                          skip: downstreamSiblingsRecruitmentSkip);
-                    try await exploreAdditionOfFloorAsync(&l,
-                                                      &composedCount,
-                                                      &otherComposedCount,
-                                                      dispatcher,
+                                                              skip: innerExploreAdditionOfRightNodeCtx.downstreamSiblingsRecruitmentSkip);
+                    try await exploreAdditionOfFloorAsync(&innerResolveCoreCtx,
+                                                          innerDispatcher,
                                                       kind: ((kind == .additive) ? .multiplicative : .additive));
                     
-                    l.remove(at: l.count - 1);
-                    composedCount -= 1;
+                    innerResolveCoreCtx.l.remove(at: innerResolveCoreCtx.l.count - 1);
+                    innerResolveCoreCtx.composedCount -= 1;
                 }
             } // end of operation validity test
+        } // end of commonAsync
+        
+        combine(&exploreAdditionOfRightNodeCtx.forwardValue);
+        try await commonAsync(&resolveCoreCtx, &exploreAdditionOfRightNodeCtx, isReverse: false, dispatcher);
+        uncombine(&exploreAdditionOfRightNodeCtx.forwardValue);
 
-            { (_ valueComponent: inout UInt32) in
-                valueComponent = kind.uncombine(valueComponent, rightChild.value);
-            }(&decomposedValue[reverse]!);
-        } // end of phase loop
-        
+        combine(&exploreAdditionOfRightNodeCtx.reverseValue);
+        try await commonAsync(&resolveCoreCtx, &exploreAdditionOfRightNodeCtx, isReverse: true, dispatcher);
+        uncombine(&exploreAdditionOfRightNodeCtx.reverseValue);
+
         if let _ = rightChild.expr {
-            otherComposedCount += 1;
+            resolveCoreCtx.otherComposedCount += 1;
         }
-        l.insert(rightChild, at: ((l.count+1) - composedCount - 1) - candidateReverseOffset);
+        resolveCoreCtx.l.insert(rightChild, at: ((resolveCoreCtx.l.count+1) - resolveCoreCtx.composedCount - 1) - candidateReverseOffset);
     } // end of candidate iteration loop
 }
 
diff -r d2ef6283be0b -r cdcaaddad141 AsyncCountdown/DispatchAPI.swift
--- a/AsyncCountdown/DispatchAPI.swift  Fri Jul 14 22:58:55 2023 +0200
+++ b/AsyncCountdown/DispatchAPI.swift  Sun Jul 02 19:01:50 2023 +0200
@@ -16,11 +16,8 @@
 @preconcurrency import Foundation
 import os;
 
-typealias Dispatcher = (_ l: inout [ReadyValue],
-                        _ composedCount: inout Int,
-                        _ otherComposedCount: inout Int,
-                        _ decomposedValue: inout [Bool:UInt32],
-                        _ downstreamSiblingsRecruitmentSkip: inout Int,
+typealias Dispatcher = (_ resolveCoreCtx: inout ResolveCoreCtx,
+                        _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                         _ kind: Op,
                         _ startingFrom: Int,
                         _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
@@ -39,19 +36,13 @@
          _ primitives: UInt32...) {
         let resultReceiver = { result in inqueue.async { incb(result); };};
         
-        func iteratePossibleLeftNodesFakeDispatch(_ l: inout [ReadyValue],
-                                                                _ composedCount: inout Int,
-                                                                _ otherComposedCount: inout Int,
-                                                                _ decomposedValue: inout [Bool:UInt32],
-                                                                _ downstreamSiblingsRecruitmentSkip: inout Int,
+        func iteratePossibleLeftNodesFakeDispatch(_ resolveCoreCtx: inout ResolveCoreCtx,
+                                                  _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                                                                 _ kind: Op,
                                                                 _ startingFrom: Int,
                                                                 _ walkOperands: @escaping  (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?)  {
-            iteratePossibleLeftNodes(&l,
-                                  &composedCount,
-                                  &otherComposedCount,
-                                  &decomposedValue,
-                                  &downstreamSiblingsRecruitmentSkip,
+            iteratePossibleLeftNodes(&resolveCoreCtx,
+                                  &exploreAdditionOfRightNodeCtx,
                                   resultReceiver,
                                   startGoal,
                                   kind,
@@ -73,11 +64,8 @@
             completion.notify(queue: inqueue,
                               execute: completionCb);
 
-            func iteratePossibleLeftNodesDispatch(_ l: inout [ReadyValue],
-                                                  _ composedCount: inout Int,
-                                                  _ otherComposedCount: inout Int,
-                                                  _ decomposedValue: inout [Bool:UInt32],
-                                                  _ downstreamSiblingsRecruitmentSkip: inout Int,
+            func iteratePossibleLeftNodesDispatch(_ resolveCoreCtx: inout ResolveCoreCtx,
+                                                  _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                                                   _ kind: Op,
                                                   _ startingFrom: Int,
                                                   _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?)  {
@@ -90,13 +78,10 @@
                  - 4 already means too few possibilities to explore
                  - 3 is right out
                  Note that among other properties, this estimator value is monotonic. */
-                guard startingFrom < (l.count - composedCount),
-                      l.count + (walkOperands({_,_ in return ();}) != nil ? 1 : 0) == 5 else {
-                    iteratePossibleLeftNodes(&l,
-                                             &composedCount,
-                                             &otherComposedCount,
-                                             &decomposedValue,
-                                             &downstreamSiblingsRecruitmentSkip,
+                guard startingFrom < (resolveCoreCtx.l.count - resolveCoreCtx.composedCount),
+                      resolveCoreCtx.l.count + (walkOperands({_,_ in return ();}) != nil ? 1 : 0) == 5 else {
+                    iteratePossibleLeftNodes(&resolveCoreCtx,
+                                             &exploreAdditionOfRightNodeCtx,
                                              resultReceiver,
                                              startGoal,
                                              kind,
@@ -107,26 +92,17 @@
                 }
                 
                 // reseat this divergence over a copy of the whole state
-                let childL = l;
-                let childComposedCount = composedCount;
-                let childOtherComposedCount = otherComposedCount;
-                let childDecomposedValue = decomposedValue;
-                let childDownstreamSiblingsRecruitmentSkip = downstreamSiblingsRecruitmentSkip;
-                
+                let paramResolveCoreCtx = resolveCoreCtx;
+                let paramExploreAdditionOfRightNodeCtx = exploreAdditionOfRightNodeCtx;
+
                 DispatchQueue.global(qos:.userInitiated).async(group: completion) {
                     os_signpost(.begin, log: LogWrapper.gObservationHandle, name: "micro-batch", signpostID: OSSignpostID(log: LogWrapper.gObservationHandle, object: Thread.current));
 
-                    var mutableL = childL;
-                    var mutableComposedCount = childComposedCount;
-                    var mutableOtherComposedCount = childOtherComposedCount;
-                    var mutableDecomposedValue = childDecomposedValue;
-                    var mutableDownstreamSiblingsRecruitmentSkip = childDownstreamSiblingsRecruitmentSkip;
+                    var childResolveCoreCtx = paramResolveCoreCtx;
+                    var childExploreAdditionOfRightNodeCtx = paramExploreAdditionOfRightNodeCtx;
 
-                    iteratePossibleLeftNodes(&mutableL,
-                                             &mutableComposedCount,
-                                             &mutableOtherComposedCount,
-                                             &mutableDecomposedValue,
-                                             &mutableDownstreamSiblingsRecruitmentSkip,
+                    iteratePossibleLeftNodes(&childResolveCoreCtx,
+                                             &childExploreAdditionOfRightNodeCtx,
                                              resultReceiver,
                                              startGoal,
                                              kind,

I kept attacking this challenge from multiple angles, until I remembered and located again a blog post about a scalability issue, whose message is: regardless of how tight its implementation is, an atomic counter such as a semaphore's can always become a scalability issue, by mere virtue of requiring its cache line to bounce between the L1D caches attached to the respective cores whenever more than one of them manipulates it concurrently; and this can easily, and sometimes implicitly, occur with reference counting.
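To make the connection with reference counting concrete, here is a contrived sketch (not taken from my project): every child task that captures a reference to the same object atomically increments and decrements that object's reference count, so the counter's cache line ping-pongs between cores even though the payload itself is only ever read.

final class SharedOperands: Sendable {
    let values: [UInt32];
    init(values invalues: [UInt32]) { values = invalues; }
}

// Contrived sketch: each child task implicitly retains and releases `shared`
// on entry and exit, so the reference count (an atomic counter in the object
// header) bounces between the cores' L1D caches, even though `values` is
// immutable and only ever read.
func fanOut(_ shared: SharedOperands) async -> UInt32 {
    return await withTaskGroup(of: UInt32.self, returning: UInt32.self) { group in
        for _ in 0..<8 {
            group.addTask { return shared.values.reduce(0, &+); }
        }
        return await group.reduce(0, &+);
    }
}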

So I went looking for such a situation in my scenario, but even though I have some suspects, I first need data to know how much of such potential savings can really be had. And that is where I came up short on this Mac mini M2, as I failed to find a hardware counter that would be of help: none of them measure the case where a core requests a cache line but cannot load it from anywhere in its cache hierarchy, including the common RAM, at least not until another core has written it back. Instead, I had to get back to my meager Intel-based MacBook Pro with only two physical cores, the bare minimum for this situation to arise, as that one does have counters to that effect. Do you hear me, Apple silicon guys? I had to go back to an Intel processor (now gone from the Mac hardware lineup) in a 7-year-old Mac in order to try and improve software scalability on your brand spanking new hardware, since the latter is utterly unhelpful here.

Using MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM in the CPU Counters Instruments template in event mode (sampling once every 1000 events), which allows samples to be taken at, or close to, the places where such events happen, I do notice swift_release and swift_retain at the top of the inverted call tree profile, in particular when compared with the same profile for the sequential run; unfolding their callers reveals that most of them come directly from my code.

Unfortunately, Instruments was not able to point out which lines (even in the assembly) corresponded to such calls (it either complained about being unable to locate the file in question, or pointed me to the first line of the file, etc.), but there was really only one possibility for data that I share between concurrent tasks: the closures that correspond to the chain of operands for a given operation. Indeed, I had expressed that data as closures, as the most straightforward and economical way to make it available to all the computation trees that may need access to it, given that it is immutable once captured. Everything else is mutable and copied (if not exactly at the point of the explicit copy, then at worst immediately afterwards, when it is first modified, due to COW; that occurs at most once per micro-batch, so it is not worth worrying about).

Unfortunately, Swift won't allow me to "fully copy" a closure, even in a shallow way: such an operation would probably be incompatible with its semantics anyway. So when spawning a new task, instead of such a copy, I replace each reference to such a closure with a new closure whose only purpose is to store that reference and forward control to it when invoked:

// First, wrap the closures into a no-op closure that will serve as a cap
let childWalkOperands = { @Sendable action in
    anchor_func();
    return walkOperands(action);
};

This is the code for the closure of the operation currently being built; similar code replaces the corresponding reference found in the non-primitive ReadyValues in the available list (the array known as l, which is copied when a new task is spawned). void anchor_func(void) is a no-op assembly function, which Swift imports as a foreign C declaration, and whose only purpose is to ensure the optimizer won't deem the new closure equal to the one it wraps and merge the two as a result (which would defeat my work); its source code is:

	.globl	_anchor_func
	.p2align	2
_anchor_func:
	ret

After a bit of additional code to support this switcheroo, I profile my code again.

Swift with closure cap

 N|  T  | factor
 1|100  | ref
 2| 69.3|x1.44
 3| 62.0|x1.61
 4| 59.3|x1.69
 5| 48.7|x2.05
 6| 42.1|x2.37
 7| 38.6|x2.59
 8| 36.1|x2.77
 9| 36.2|x2.76
10| 35.9|x2.78
11| 36.0|x2.77
12| 35.9|x2.79

(methodology is identical to that of Swift without dictionaries, with the additional patch attached below)

discussion

…well, the absolute performance did improve, as did the consistency of the results (neither of which is shown here), which is why I will keep this change anyway; running this with the CPU Counters instrument does show the event in question being triggered by swift_release and swift_retain less often, and the remaining cases appear to be only indirectly related to my code.

But this did not move the needle on scalability one bit. Back to the drawing board…

diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift     Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncAPI.swift     Sun Sep 17 12:12:20 2023 +0200
@@ -20,13 +20,13 @@
                         _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                         _ kind: Op,
                         _ startingFrom: Int,
-                        _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws -> Void;
+                        _ walkOperands: @escaping WalkerFuncNonGCD) async throws -> Void;
 
 typealias DispatcherPseudoReasync = (_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
                                      _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                         _ kind: Op,
                         _ startingFrom: Int,
-                        _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
+                        _ walkOperands: @escaping WalkerFuncNonGCD) throws -> Void;
 
 // Despite https://github.com/apple/swift-evolution/blob/main/proposals/0309-unlock-existential-types-for-all-protocols.md
 // I am unable to do the following, even with Xcode 13.2.1: I get the error that:
@@ -235,9 +235,29 @@
                                                               _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                               _ kind: Op,
                                                               _ startingFrom: Int,
-                                                              _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
+                                                              _ walkOperands: @escaping WalkerFuncNonGCD) async throws {
+                            // First, wrap the closures into a no-op closure that will serve as a cap
+                            let childWalkOperands = { @Sendable action in
+                                anchor_func();
+                                return walkOperands(action);
+                            };
+                            var interResolveCoreCtx = resolveCoreCtx;
+                            interResolveCoreCtx.l = resolveCoreCtx.l.map({e in
+                                return ReadyValueNonGCD(template: e,
+                                                        overridingExpr: e.expr.map({ innerE in
+                                        var innerR = innerE;
+                                        let innerOperandsBrowser = innerE.operatorsBrowser;
+                                        innerR.operatorsBrowser = { action in
+                                            anchor_func();
+                                            return innerOperandsBrowser(action);
+                                        };
+                                        return innerR;
+                                    })
+                                );
+                            });
+                            
                             // reseat this divergence over a copy of the whole state
-                            let paramResolveCoreCtx = resolveCoreCtx;
+                            let paramResolveCoreCtx = interResolveCoreCtx;
                             let paramExploreAdditionOfRightNodeCtx = exploreAdditionOfRightNodeCtx;
                             
                             let task = Task {
@@ -246,7 +266,7 @@
                                                                                            _ innerExploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                                                            _ innerKind: Op,
                                                                                            _ innerStartingFrom: Int,
-                                                                                           _ innerWalkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) {
+                                                                                           _ innerWalkOperands: @escaping WalkerFuncNonGCD) {
                                         iteratePossibleLeftNodesPseudoReasync(&innerResolveCoreCtx,
                                                                               &innerExploreAdditionOfRightNodeCtx,
                                                                               { result in innerResultsScopingGroup.addTask { await resultMediator.yield(result);}},
@@ -271,7 +291,7 @@
                                                                                       &childExploreAdditionOfRightNodeCtx,
                                                                                       kind,
                                                                                       startingFrom,
-                                                                                      walkOperands);
+                                                                                      childWalkOperands);
                                     
                                     os_signpost(.end, log: LogWrapper.gObservationHandle, name: "micro-batch", signpostID: OSSignpostID(log: LogWrapper.gObservationHandle, object: Thread.current));
                                 }
@@ -291,7 +311,7 @@
                                                                   _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                                   _ kind: Op,
                                                                   _ startingFrom: Int,
-                                                                  _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
+                                                                  _ walkOperands: @escaping WalkerFuncNonGCD) async throws {
                             /* 1: Peek a bit: if we find out from startingFrom that no further left node is
                              going to be able to be added anyway (and remember: closing the composition as
                              it stands is performed in our caller), don't bother spawning a micro-batch.
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncCountdown-Bridging-Header.h
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/AsyncCountdown/AsyncCountdown-Bridging-Header.h   Sun Sep 17 12:12:20 2023 +0200
@@ -0,0 +1,13 @@
+//
+//  AsyncCountdown-Bridging-Header.h
+//  AsyncCountdown
+//
+//  Created by Pierre Lebeaupin on 15/09/2023.
+//
+
+#ifndef AsyncCountdown_Bridging_Header_h
+#define AsyncCountdown_Bridging_Header_h
+
+#include "anchor_func.h"
+
+#endif /* AsyncCountdown_Bridging_Header_h */
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncCountdown.swift
--- a/AsyncCountdown/AsyncCountdown.swift       Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncCountdown.swift       Sun Sep 17 12:12:20 2023 +0200
@@ -15,14 +15,15 @@
 
 import Foundation;
 
+typealias WalkerFuncNonGCD = @Sendable (_ action: (_ value: ReadyValueNonGCD,
+                                             _ reverse: Bool) -> Void?
+                                  ) -> Void?;
+   
 @available(macOS 12.0.0, iOS 15.0, *)
 struct ReadyValueNonGCD:Sendable {
     struct Expr {
         var op: Op;
-        var operatorsBrowser: @Sendable (_ action: (_ value: ReadyValueNonGCD,
-                                                    _ reverse: Bool)
-                                         -> Void?)
-        -> Void?;
+        var operatorsBrowser: WalkerFuncNonGCD;
     }
 
     let value: UInt32;
@@ -38,6 +39,11 @@
         expr = inExpr;
     }
     
+    init(template intemplate: Self, overridingExpr inexpr: Expr?) {
+        value = intemplate.value;
+        expr = inexpr;
+    }
+    
     func description() -> String {
          guard let localExpr = expr else {
              return value.description;
@@ -47,7 +53,7 @@
          var isFirst = true;
          var numRev = 0;
          
-         localExpr.operatorsBrowser({(_ subReadyValue: ReadyValueNonGCD,
+        localExpr.operatorsBrowser({(_ subReadyValue: ReadyValueNonGCD,
                     _ freverse: Bool) -> Void? in
              guard !freverse else {
                  numRev += 1;
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift       Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncExplorers.swift       Sun Sep 17 12:12:20 2023 +0200
@@ -46,10 +46,10 @@
     // Downstream ones are those created while this one is active
     
     try await dispatcher(&resolveCoreCtx,
-                     &reference,
-                     kind,
-                     /* startingFrom: */ skip,
-                     { _ in return nil;});
+                         &reference,
+                         kind,
+                         /* startingFrom: */ skip,
+                         { _ in return nil;});
 }
 
 /*
@@ -95,7 +95,7 @@
                                              _ startGoal: UInt32,
                                              _ kind: Op,
                                              _ startingFrom: Int,
-                                             _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?,
+                                             _ walkOperands: @escaping WalkerFuncNonGCD,
                                              _ dispatcher: DispatcherAsync) async rethrows {
     let emptyComposition = (walkOperands({_,_ in return ();}) == nil)
 
@@ -137,10 +137,10 @@
             };
             
             try await innerDispatcher(&innerResolveCoreCtx,
-                             &innerExploreAdditionOfRightNodeCtx,
-                             kind,
-                             /* startingFrom: */ candidateReverseOffset,
-                             selfNode);
+                                      &innerExploreAdditionOfRightNodeCtx,
+                                      kind,
+                                      /* startingFrom: */ candidateReverseOffset,
+                                      selfNode);
             
             // close current composition
             var num = 0;
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/DispatchAPI.swift
--- a/AsyncCountdown/DispatchAPI.swift  Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/DispatchAPI.swift  Sun Sep 17 12:12:20 2023 +0200
@@ -20,7 +20,7 @@
                         _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                         _ kind: Op,
                         _ startingFrom: Int,
-                        _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
+                        _ walkOperands: @escaping WalkerFunc) throws -> Void;
 
 
 @preconcurrency class ResolveObject {
@@ -40,7 +40,7 @@
                                                   _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                                                                 _ kind: Op,
                                                                 _ startingFrom: Int,
-                                                                _ walkOperands: @escaping  (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?)  {
+                                                                _ walkOperands: @escaping WalkerFunc)  {
             iteratePossibleLeftNodes(&resolveCoreCtx,
                                   &exploreAdditionOfRightNodeCtx,
                                   resultReceiver,
@@ -68,7 +68,7 @@
                                                   _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                                                   _ kind: Op,
                                                   _ startingFrom: Int,
-                                                  _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?)  {
+                                                  _ walkOperands: @escaping WalkerFunc)  {
                 /* 1: Peek a bit: if we find out from startingFrom that no further left node is
                  going to be able to be added anyway (and remember: closing the composition as
                  it stands is performed in our caller), don't bother spawning a micro-batch.
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/anchor_func.h
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/AsyncCountdown/anchor_func.h      Sun Sep 17 12:12:20 2023 +0200
@@ -0,0 +1,4 @@
+#ifndef ANCHOR_FUNC
+#define ANCHOR_FUNC
+void anchor_func(void);
+#endif // ANCHOR_FUNC
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/anchor_func.s
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/AsyncCountdown/anchor_func.s      Sun Sep 17 12:12:20 2023 +0200
@@ -0,0 +1,9 @@
+       .section        __TEXT,__text,regular,pure_instructions
+       .build_version macos, 13, 0     sdk_version 13, 3
+       .globl  _anchor_func
+       .p2align        2
+_anchor_func:
+       .cfi_startproc
+       ret
+       .cfi_endproc
+.subsections_via_symbols

Small update: involving a (potentially target-dependent) assembly file and a C import, just because I didn't know how else to put code off-limits to the Swift optimizer, didn't sit well with me, so I made another attempt at a swiftier solution. I have observed that the compiler, both in its internal operation and presumably in how it hands its representation off to LLVM (since the latter will also remove abstractions if they are presented to it the right way), is particularly respectful of class instances and does not appear to simplify code across class boundaries (e.g. propagate constraints, inline code, etc.). So I need to involve a class somehow, but still end up with a closure, because I need to insert the wrapping transparently, precisely where needed. This is how I accomplished it:

final class ThreadLocalCapGutsNonGCD:Sendable {
    let closure: WalkerFuncNonGCD;
    init(closure inclosure: @escaping WalkerFuncNonGCD) {
        closure = inclosure;
    }
    
    class func cap(_ inclosure: @escaping WalkerFuncNonGCD) -> WalkerFuncNonGCD
    {
        // Put the closure in a wrapper…
        let wrapper = Self(closure: inclosure);
        // …so a new closure can capture the wrapper…
        return { @Sendable action in
            // …and recover the original closure from that wrapper
            return wrapper.closure(action);
            // This ought to ensure the wrapping isn't optimized away.
        };
    }
}

After profiling with Instruments and the MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM counter in event mode, I confirm that this is sufficient to obtain the same effect of reduction of cache line bouncing as I previously did with the anchor_func()-based solution, and that there is no performance regression, so now I can get rid of the bridging header and the assembly source file.

The patch:

diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift     Sun Sep 17 12:12:20 2023 +0200
+++ b/AsyncCountdown/AsyncAPI.swift     Tue Sep 19 22:41:45 2023 +0200
@@ -236,21 +236,13 @@
                                                               _ kind: Op,
                                                               _ startingFrom: Int,
                                                               _ walkOperands: @escaping WalkerFuncNonGCD) async throws {
-                            // First, wrap the closures into a no-op closure that will serve as a cap
-                            let childWalkOperands = { @Sendable action in
-                                anchor_func();
-                                return walkOperands(action);
-                            };
+                            let childWalkOperands = ThreadLocalCapGutsNonGCD.cap(walkOperands);
                             var interResolveCoreCtx = resolveCoreCtx;
                             interResolveCoreCtx.l = resolveCoreCtx.l.map({e in
                                 return ReadyValueNonGCD(template: e,
                                                         overridingExpr: e.expr.map({ innerE in
                                         var innerR = innerE;
-                                        let innerOperandsBrowser = innerE.operatorsBrowser;
-                                        innerR.operatorsBrowser = { action in
-                                            anchor_func();
-                                            return innerOperandsBrowser(action);
-                                        };
+                                        innerR.operatorsBrowser = ThreadLocalCapGutsNonGCD.cap(innerE.operatorsBrowser);
                                         return innerR;
                                     })
                                 );
diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/AsyncCountdown-Bridging-Header.h
--- a/AsyncCountdown/AsyncCountdown-Bridging-Header.h   Sun Sep 17 12:12:20 2023 +0200
+++ /dev/null   Thu Jan 01 00:00:00 1970 +0000
@@ -1,13 +0,0 @@
-//
-//  AsyncCountdown-Bridging-Header.h
-//  AsyncCountdown
-//
-//  Created by Pierre Lebeaupin on 15/09/2023.
-//
-
-#ifndef AsyncCountdown_Bridging_Header_h
-#define AsyncCountdown_Bridging_Header_h
-
-#include "anchor_func.h"
-
-#endif /* AsyncCountdown_Bridging_Header_h */
diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/AsyncCountdown.swift
--- a/AsyncCountdown/AsyncCountdown.swift       Sun Sep 17 12:12:20 2023 +0200
+++ b/AsyncCountdown/AsyncCountdown.swift       Tue Sep 19 22:41:45 2023 +0200
@@ -18,6 +18,33 @@
 typealias WalkerFuncNonGCD = @Sendable (_ action: (_ value: ReadyValueNonGCD,
                                              _ reverse: Bool) -> Void?
                                   ) -> Void?;
+
+/* The aim of this class is to provide each thread (or task, async block, etc.)
+ with its own reference that will get retain/release traffic, rather than all
+ threads sharing a single reference directly and causing the cache line
+ containing the reference count to bounce between cores.
+ 
+ In order to fit in structure fields and parameters that actually expect a
+ WalkerFuncNonGCD, this class needs to be wrapped by a closure of that type. */
+
+final class ThreadLocalCapGutsNonGCD:Sendable {
+    let closure: WalkerFuncNonGCD;
+    init(closure inclosure: @escaping WalkerFuncNonGCD) {
+        closure = inclosure;
+    }
+    
+    class func cap(_ inclosure: @escaping WalkerFuncNonGCD) -> WalkerFuncNonGCD
+    {
+        // Put the closure in a wrapper…
+        let wrapper = Self(closure: inclosure);
+        // …so a new closure can capture the wrapper…
+        return { @Sendable action in
+            // …and recover the original closure from that wrapper
+            return wrapper.closure(action);
+            // This ought to ensure the wrapping isn't optimized away.
+        };
+    }
+}
    
 @available(macOS 12.0.0, iOS 15.0, *)
 struct ReadyValueNonGCD:Sendable {
diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/anchor_func.h
--- a/AsyncCountdown/anchor_func.h      Sun Sep 17 12:12:20 2023 +0200
+++ /dev/null   Thu Jan 01 00:00:00 1970 +0000
@@ -1,4 +0,0 @@
-#ifndef ANCHOR_FUNC
-#define ANCHOR_FUNC
-void anchor_func(void);
-#endif // ANCHOR_FUNC
diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/anchor_func.s
--- a/AsyncCountdown/anchor_func.s      Sun Sep 17 12:12:20 2023 +0200
+++ /dev/null   Thu Jan 01 00:00:00 1970 +0000
@@ -1,9 +0,0 @@
-       .section        __TEXT,__text,regular,pure_instructions
-       .build_version macos, 13, 0     sdk_version 13, 3
-       .globl  _anchor_func
-       .p2align        2
-_anchor_func:
-       .cfi_startproc
-       ret
-       .cfi_endproc
-.subsections_via_symbols

OK, how much impact do you think this patch could possibly have on our matter?

diff -r 631593d9f4fa -r 07055be695f9 AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift       Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncExplorers.swift       Sat Sep 23 14:19:11 2023 +0200
@@ -143,8 +143,7 @@
                              selfNode);
             
             // close current composition
-            var num = 0;
-            if (selfNode({_,_ in guard num == 0 else {return ();}; num += 1; return nil;}) != nil)
+            if /* parent is */ (!emptyComposition)
                 && ( (kind == .additive) ? innerExploreAdditionOfRightNodeCtx.forwardValue > innerExploreAdditionOfRightNodeCtx.reverseValue :
                             ((innerExploreAdditionOfRightNodeCtx.forwardValue % innerExploreAdditionOfRightNodeCtx.reverseValue) == 0) ) {
                 

I did not think it could have much impact, myself, as I performed this change while in the middle of more significant rearrangements: the latter involved some changes to the signature of the selfNode parameter, and I figured it would be just as well to reuse a cached result here, rather than keep tinkering with the signature of an anonymous closure that does not even use its own parameters, just to keep it in sync (if you don't feel up to reading how it works, just know this code makes sure an operation has at least two operands before we try to obtain its result). For reasons that will eventually become clear, I tested this patch in isolation, applied directly to the "without dictionaries" case. What's going to be the outcome?

Swift without dictionaries and with reduced walker calls

 N|  T  | factor
 1|100  | ref
 2| 62.8|x1.59
 3| 52.1|x1.92
 4| 48.1|x2.07
 5| 41.6|x2.40
 6| 37.2|x2.69
 7| 34.3|x2.91
 8| 32.3|x3.10
 9| 32.2|x3.11
10| 32.1|x3.12
11| 32.1|x3.12
12| 31.7|x3.15

(No change in methodology)

discussion

For the first time I break x2 with just the performance cores, and x3 when adding the efficiency cores: now that is a significant improvement (the absolute performance of the reference scenario also improved, in case you're curious). Now I have a pretty good idea of why that is, but I would like to hear your theories before I expose it. Why did this small change have such an effect?

Bonus: in a further patch I also removed the call to walkOperands that sets emptyComposition (using instead specially added state to track the number of operands so as to obtain the same result), but that change did not have a significant impact on scalability. Why is that, in your opinion?


So, about that performance and scalability change…

Remember that earlier on I made changes to avoid different CPUs simultaneously manipulating the same reference counts, by having different threads manipulate different references. These changes were ultimately unsuccessful at improving scalability, but I kept them nevertheless, and a common theme of these changes was to make manipulation of the references in question more performant at the expense of actually visiting the data structures. So when I noticed the scalability improvement of the tiny patch above, I first wondered whether it came from avoiding the increased expense of performing the extra calls. This is why I tested the patch as applied to the "without dictionaries" code: to check that hypothesis.

Given that the scalability improvements from the tiny patch were conserved even in that case, I deduced that these improvements had nothing to do with the cost of recovering the wrapped closure from its wrapper or anything of the sort. So when I made the post above, I thought it might have to do with accessing the data guarded by walkOperands: even if we discard the value provided and merely use the fact that we're called in order to increment num, that could still result in scalability troubles, such as when we generate accesses to a ReadyValue that (as rightChild) may be simultaneously accessed by other tasks in a similar manner.¹

So I set out to determine what was going on here. But my hopes of finding that out with a well-chosen performance monitoring counter were soon dashed: here, none of them (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM, L2_RQSTS.RFO_MISS, nor a few others I tried on a lark) told me anything I did not already know from the time profiler, even with the benefit of being able to compare runs with and without the improvement (as opposed to only profiling the unimproved version in the hope of spotting such an improvement). In fact, in some cases I saw the profile of the unimproved version have a lower prevalence of potential suspect functions (including in absolute terms), while a longer tail of smaller functions was taking more CPU time – a telltale sign of victim code, without my being able to identify the perpetrating phenomenon.

Out of ideas, I tried to look for weaker signals, notably how the time in certain functions, possibly called in a certain way, would vary between the two scenarios, while ignoring or at least defocusing some other information.

To do that, seeing the limits of Instruments, I turned to flame graphs.

After some tinkering, I finally came up with these:

(before the improvement, then after. Fun fact: in order to come up with these I had to hack up support for generating flame graphs from inverted call trees. Isn't Perl fun?)

While there are some shifts here and there, the major change is the number of swift_release calls by common(PseudoReasync). After the improvement, they're gone! Now, I did notice on occasion the compiler shifting responsibilities between some of my own functions without necessarily inlining them, including sometimes assigning these responsibilities to thunks², so I checked whether these calls had instead moved to other functions of mine, but no (with the possible exception of destructors, but that cannot be the effect of a shifting of responsibility coming from the compiler, can it?)³.

So, while I previously did not look too closely at where in my code the calls to the Swift runtime occur, here I ended up looking at the disassembly of my code⁴ and noticed that, before the improvement, there is a call to swift_release in the main epilogue of common (along with an "outlined consume of ReadyValue.Expr?"), while after the improvement there is no equivalent call in common, there or anywhere else in that function; and if there were such an equivalent call elsewhere in my code (e.g. with the responsibility for calling it shifting to the caller of common somehow), it would show up in the flame graph, but it doesn't.

At this point, I still don't know for sure what reference it is that gets an extra swift_release (and swift_retain, presumably, though in a way that is much less performance-impacting), but I know enough.


The whole point of this exploration was to see how well Swift can scale, by itself. Even with assembly language and lock-free primitives, you're bound to discover surprising scaling behaviour on your first try, so the important part is the process by which you discover and resolve those surprises; here I wanted to see how far I could take Swift on the same road: when discovering issues, I wanted to see how far I could get resolving them by using more Swift, not less (and preferably, keeping the code as idiomatic as possible: it is said you can write FORTRAN in any language…)

Concurrent Swift has a lot going for it, notably a Task API that gets more right than any I've previously seen. For instance, when I wrote the concurrent C version of this code, I implemented recycling of allocations so a task could reuse allocations made by the one that just completed; but the libdispatch API does not support the notion of recovering a "result" from an async block, so I had to create an ad-hoc system with a pair of semaphores that surely introduces bottlenecks of its own.
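Here is what that looks like with the Task API, as a minimal sketch (the function name and the buffer it recycles are made up for illustration):

func recycleDemo() async {
    // Spawn work that produces a value; no ad-hoc semaphore pair is needed
    // to get that value back.
    let task = Task { () -> [UInt32] in
        [UInt32](repeating: 0, count: 1024) // stand-in for a recyclable buffer
    }
    // The "result" of the async block is recovered directly…
    var buffer = await task.value
    // …and its storage can then be handed to the next task for reuse.
    buffer[0] = 42
    _ = buffer
}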

The Swift Task API supports that out of the box, and this is the reason why I will keep tinkering with async Swift. But getting such a variation in behaviour, comparable to that obtained by a non-trivial patch (the one which removed the dictionaries), as a side effect of a one-line change (not even its main effect, since the extra swift_release calls did not come from the removed invocations proper!) means I am always at the mercy of seemingly random performance variations that would drown out any intended improvement, without much help from the diagnostics or profiling tools available to me. In particular, the more elaborate the technique meant to help scalability (such as a system to somehow recycle Swift allocations), the more likely the compiler would be to fail at the reference counting and other runtime-related optimisations it is succeeding at right now, leaving me with little way to reason about what I am doing. Right now, to me, trying to make pure Swift code scale well up to 100% on all CPUs is a fool's errand.


1: Even if, as you've probably noticed, that enumeration supports early termination and so will not stray too far away in that particular invocation.

2: Ah, our friend partial apply for iteratePossibleLeftNodesFakeDispatchPseudoReasync #1 (_:_:_:_:_:) in closure #1 in closure #2 in iteratePossibleLeftNodesDispatch #1 (_:_:_:_:_:) in closure #1 in closure #1 in closure #2 in closure #1 in resolveAsync(_:_:)…

3: and I took into account the fact that _swift_release_dealloc is in fact tail-called by swift_release when looking for other culprits. Speaking of which, there is an attribution error in the flame graph: due to the use of "Flatten AsyncCountdown to boundary frames", those frames that include a destructor assign the responsibility of the caller of _swift_release_dealloc directly to the bottom of the stack, since the destructor is considered the boundary frame; this does not happen in frames where _swift_release_dealloc is not calling any of my destructors. I can't remove this data mining setting, or I end up with illegible flame graphs that weigh multiple megabytes, due to the use of indirect recursion in my code.

4: With the help of Instruments, which this time was somewhat cooperative, for reasons I have not entirely elucidated: I worked differently in some respects, notably off saved executables rather than executables residing directly in the Xcode build directory, so as to be able to repeat experiments without recompiling back and forth between the version without improvements and the one with.


I keep worrying that the absence of replies will send the wrong signal on this. I for one have found this interesting and useful to follow along with, even though I haven't contributed. So you have at least one interested reader (and I suspect more)! 🙂

One thought I've had a few times is that I might be inclined to take a look at your example myself, when I get some free time, but that's hindered by not having a simple way to obtain your full code. Perhaps you could put it on GitHub or similar?


Agreed. I have been following this thread with great interest. I find it worrisome that no Apple engineer replied.


That is a fair remark; I haven't had a public code portfolio since BitBucket sunsetted their Mercurial hosting¹, so for this need I finally bit the bullet and evaluated my options. I ended up settling on SourceHut: ~zpedro/AsyncCountdown - Countdown numbers round solver written in async Swift - sourcehut hg

In particular, I added tags corresponding to the various profiled versions:

Don't forget to run RegenDerived.sh from the AsyncCountdown subfolder once downloaded/cloned (and re-run it after each update, etc.): this is needed until reasync is available.


1: I prefer Mercurial to git so far, in large part because of the Evolve extension, which is invaluable for refining a work in progress that needs to bounce between different machines.


For various reasons I have resumed my exploration of this matter, and I have new results to show. However, for this evening I wanted to start by summing up the work so far as a basis for the forthcoming results, which will be visualized in the same way.

Indeed, how should these results be represented visually? You could plot which performance was observed for which degree of parallelism, but that is not necessarily the focus: it could be useful where additional parallelism causes additional memory consumption, say, such that you would be interested in finding a tradeoff; but here the additional memory consumption is a rounding error.

Instead, I represent the scalability improvement as a vertical bar, a bit as if you were looking at such a curve sideways. That also allows the different versions to be represented next to one another.

Tomorrow I will show these new results and graph them along with the previously obtained ones.


As some of you may know, I've been reading Doug Gregor's series on introducing Swift to people coming from C++; I don't know if I'm exactly the kind of C++ programmer targeted here: my professional practice is in the embedded space, where the standards are rather more conservative than the C++17+ features taken as reference there, but I learned a few nuggets nevertheless.

But what most interested me was this one: withoutActuallyEscaping(_:do:). For you see, one of the difficulties with my algorithm is that I need to keep references to variables in active frames so that lower frames can use them if they are ever needed (for printing a solution when it is found, mostly). Thankfully, these variables can be captured immutably, so the references in question are easy to handle (e.g. they are trivially sendable); nevertheless, these captures need to happen often, need to be passed around even more often, and are invoked very rarely. But I also need to expose these references uniformly with other, previously captured references, in practice by stuffing the closure (which holds a reference to the capture) in a data structure, and you know what that means: it becomes escaping, even though it won't actually escape; but I can't prove that to the compiler!

This is not just a problem in Swift: the Rust port of the algorithm has the exact same issue; the penalty might be lessened in a garbage-collected language that scans memory (I haven't tried), but it would still be there. I don't think I could express the demonstration that the capture doesn't escape in anything short of a dedicated proof-writing language like Coq (or to my fellow humans, of course), so I had resigned myself to the penalty of escaping closures and their heap allocation as the price to pay for operating in a memory-safe language.

So let's try and see whether withoutActuallyEscaping improves the situation! Adding it was straightforward, except that I needed to apply it in a second place in order to avoid the dreaded "Passing a non-escaping function parameter 'action' to a call to a non-escaping function parameter can allow re-entrant modification of a variable" diagnostic… and, oh yes, I also needed to further distinguish the async parts from the non-async ones, as the captures made in the async part (the main wagon) may actually escape, since the trainees may cause them to be kept alive after the main wagon has already backtracked from that branch of the bifurcation (but remember, CPU utilization of the main wagon is a rounding error).
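For reference, the shape of the API, reduced to a minimal sketch (visitAll and its parameters are made up for illustration):

// Store a non-escaping closure for the duration of a call, without
// committing to a real escape.
func visitAll(_ values: [Int], _ action: (Int) -> Void) {
    withoutActuallyEscaping(action) { escapableAction in
        // escapableAction is @escaping as far as the type system is concerned,
        // so it can be stuffed into a data structure…
        let handlers = [escapableAction]
        for value in values {
            handlers.forEach { $0(value) }
        }
        // …but a runtime check verifies, on exit, that no copy outlived the call.
    }
}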

Let's look at the results:

 N|  T  | factor
 1|100  | ref
 2| 75.0|x1.33
 3| 67.1|x1.49
 4| 64.5|x1.55
 5| 51.9|x1.93
 6| 45.1|x2.22
 7| 42.8|x2.33
 8| 39.3|x2.54
 9| 39.3|x2.54
10| 39.8|x2.52
11| 39.4|x2.54
12| 39.1|x2.56

…ouch. That hurt. It's fortunate I did not spend too much time on this, because we are back at the scalability of the naive Swift version. It looks like the runtime checking of withoutActuallyEscaping is having an impact, though I can't tell anything beyond that: the changes in code have caused quite a shift in the captured stacks, making the flame graphs (not shown here) hard to compare. Suffice it to say I did not see any obvious inefficiency to remove, or any way to salvage the effort.

In the end, I reverted the change.

To be continued…


Small intermezzo: additional false starts.

First, as a way to eliminate, or at least reduce, reference-counting traffic on the memory bus, I investigated the borrowing and consuming keywords presented in the "move-only types" proposal, since they also apply to references and were stabilized as a result of that proposal.

The goal being, of course, to more tightly control when and whether "retain" and "release" are called on walkOperands instances.

But first, I needed to understand the existing situation, and in particular when those would be applied implicitly: mostly, consuming is implicit for parameters to constructors, and borrowing is implicit otherwise, unless of course the parameter is already marked mutating/inout. And when I looked at the transmissions and uses of walkOperands instances, some I could eliminate, which left us with two structural transmissions:

after which no further use of that variable occurs in the function! This means that, due to the order already chosen (for unrelated reasons), my code was already following the scenario best suited for ownership transfer, and sprinkling additional keywords here would amount, at best, to cargo cult programming.
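To make the keywords concrete, here is a hedged sketch of what they control (Wrapper and both functions are hypothetical, for illustration only):

final class Wrapper {
    let payload: Int
    init(_ payload: Int) { self.payload = payload }
}

// borrowing (the implicit default for most parameters): the callee uses the
// reference without taking ownership, so no retain/release pair is required.
func inspect(_ wrapper: borrowing Wrapper) -> Int {
    return wrapper.payload
}

// consuming (implicit for initializer parameters): ownership transfers to the
// callee, so a caller whose last use of the reference is this very call does
// not need to retain it either.
func stash(_ wrapper: consuming Wrapper, into store: inout [Wrapper]) {
    store.append(wrapper)
}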

But, wait, surely there are additional, more subtle sources of retain/release activity… and yes, there are:

  • the ReadyValue thus created, by causing the reference to the closure to escape, indirectly causes the parent closure walkOperands to escape too, requiring iteratePossibleLeftNodes to hold on to the parameter it had so far merely been borrowing
  • and the ReadyValue being removed from the ready list and assigned to rightChild is itself captured by the same closure, causing a copy of that structure as a result of the closure being escaping, a copy which causes a retain (and an eventual release) of its own expr.operatorsBrowser (see the sketch below).
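That second source, reduced to a sketch with hypothetical simplified types:

struct MiniReadyValue {
    var value: UInt32
    var operatorsBrowser: () -> Void // a reference-counted closure inside a struct
}

func explore(_ rightChild: MiniReadyValue,
             _ enqueue: (@escaping () -> Void) -> Void) {
    enqueue {
        // Capturing rightChild copies the struct into the heap-allocated
        // context of this escaping closure; that copy retains
        // operatorsBrowser, and destroying the context eventually releases it.
        _ = rightChild.value
    }
}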

This more implicit retain/release churn is entirely related to captures, and cannot be solved by these keywords, which apply to parameters. And as seen yesterday, making that closure non-escaping in order to solve all those issues at once did not provide the expected results.


Second, the release of the Apple Silicon CPU Optimization Guide confirms what was discussed earlier: while that guide does discuss lock contention patterns (false sharing, in particular) and how to avoid them, these sections are conspicuously devoid of any reference to a relevant performance monitoring counter. These simply don't exist on Apple Silicon CPUs as of this writing.

As hinted before, after considering the problem from various angles I came to the conclusion that the only way forward was to forgo reference types entirely.

Indeed, this algorithm requires a lot of tracking of references. In fact, in the purest version of the algorithm, what is the "l" array in the Swift version is itself another linked list, except one where elements can be added or removed in the middle, using references that point within the list. But even with "l" being an array, that still leaves many references to track, here by counting them using atomic instructions (or atomic-like ones, such as load-acquire/store-release pairs).

In particular, I suspect I may not be penalized so much by contention over specific cache lines as by the sheer need to broadcast all these atomic memory operations over the core cluster hierarchy: even when these operations concern different cache lines, the announcement channel still has to carry the announcements themselves, whether that channel is a common bus or a more structured setup.

If we can't use reference types, we'll have to standardize on everything being represented with value types. And I do mean everything, including the solution tree built so far (that with which we can print a solution when it is found). As a result, I considered actually representing it as a value-type tree, using enum and indirect, but I found the task daunting, not to mention a little risky: how would I know there are no intermediate reference counts in there? In addition, there was the risk of it not interacting well with the features being developed in parallel. Finally, my inability to prove to the compiler that my borrows can't escape was likely to come back to bite me. In the end, handling arbitrarily-large data structures as value types, without the benefit of COW (which is a library feature, not a language one), did not sit well with me.

So I turned to an old technique, where data structures never contain addresses/pointers/references/etc., and instead only contain offsets or indexes into a well-known array (a technique most used on machines where the data bus was narrower than the address one, such that storing full addresses in memory was not necessarily practical or efficient¹). This does come with a runtime penalty (besides Swift's bounds checking, of course): even on architectures like ARM that support indexed addressing, some operations that could exploit the various addressing modes in pointer-based code will have to find another way in index-based code. Additionally, on ARM the indexed addressing mode can only scale by the transfer size, while here the scale will need to be a structure size (and the compiler won't be able to pre-scale the index in most cases, for reasons I will expose). And to top it off, the previously-linked Apple Silicon CPU Optimization Guide mentions dedicated forwarding from one load operation to the base operand of the next, for pointer chasing, which won't apply to the indexed operand. But remember that I am most interested in scalability here, not absolute performance.

And here I encountered my first issue, as I do still need a value like nil, this time as an index rather than a reference, to mark the end of linked lists and whatnot. But I know from experience that UIntN? is very inefficient, whether stored in a struct or as a local variable; in particular, they tend to result in dynamic allocations, which we are trying to avoid as well (EDIT: cf post 17). The solution came from Julia, where the pattern is to steal a valid value for use as a marker instead. I'm afraid I'm repeating the billion-dollar mistake again, since there is no way to treat variables that may turn out to hold this marker value differently from variables that can't, but I'll take it.
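In code, the stolen-value pattern boils down to something like this (a sketch; the type and names are made up, not the project's):

// UInt32.max is a valid index stolen to stand in for nil: same size as the
// index itself, no Optional tag byte, no padding.
struct MiniIndex {
    static let none = MiniIndex(raw: UInt32.max)
    var raw: UInt32
    var isNone: Bool { raw == UInt32.max }
}

var head = MiniIndex.none      // empty list
head = MiniIndex(raw: 3)       // now designates element 3 of the well-known array
precondition(!head.isNone)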

Beyond that, nothing more than additional crimes against proper data representation (e.g. I add the operation being built to "l", now a linked list, while the operation is still in an incomplete state, as I need a placeholder so I can uniformly retrieve any other element, whether it is just below or deeper in the linked list). At least I was able to add type checking so that I can't just index any array through any random index variable. Surprisingly few changes were needed in the work dispatch scaffolding, which was mostly able to rely on existing abstractions.
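That kind of type checking can be obtained with a phantom type parameter on the index; one possible shape (assumed here, not necessarily the project's exact code):

// The phantom parameter ties an index to arrays of one element type.
struct TypedIndex<Element> {
    var raw: UInt32
}

extension Array {
    subscript(_ index: TypedIndex<Element>) -> Element {
        get { self[Int(index.raw)] }
        set { self[Int(index.raw)] = newValue }
    }
}

struct Contribution { var value: UInt32 }

let contributions = [Contribution(value: 42)]
let ref = TypedIndex<Contribution>(raw: 0)
_ = contributions[ref]                  // compiles
// let bad = TypedIndex<String>(raw: 0)
// _ = contributions[bad]               // would be rejected at compile time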

So, were these transgressions worth it?

 N|  T  | factor
 1|100  | ref
 2| 57.4|x1.74
 3| 39.3|x2.55
 4| 34.8|x2.87
 5| 31.5|x3.18
 6| 28.7|x3.49
 7| 26.9|x3.73
 8| 26.0|x3.84
 9| 24.7|x4.06
10| 24.0|x4.17
11| 24.8|x4.04
12| 24.9|x4.02

Oh, yeah, now we're cooking with gas. In particular, we're now within spitting distance of C. And all it took was a return to the original structured programming practices!

Okay, I jest: Swift does provide me with more memory safety than any language of that era did. Nevertheless, since I need to recycle instances, I lose the benefits of immutability (recycling implies mutation) and of definite initialization (what I recover is basically garbage). And while I like structs/aggregations/records/product types just as much as anyone, they prevent the compiler from performing a full lifetime analysis of the fields therein, so the optimizer will not be able to apply some optimizations, such as pre-scaling of array indexes, that it could apply if these fields had been local variables instead, even ones captured by closures.

Next step: merging that with the features developed in parallel.


1: though everything old is new again, as CHERI (I recommend Gankra's explainer) could again make pointers unwieldy at 16 bytes…

This won’t result in extra independent allocations in Swift; it’s closer to a C struct with a “valid” flag attached. It’ll waste a bunch of space in your array because of alignment though, so having a sentinel is probably still better.
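(That alignment cost is easy to observe with MemoryLayout; the numbers below are for a 64-bit platform:)

print(MemoryLayout<UInt>.stride)  // 8
print(MemoryLayout<UInt?>.size)   // 9: one tag byte appended to the payload
print(MemoryLayout<UInt?>.stride) // 16: padded back to 8-byte alignment in an array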

(Conversely, indirect is always refcounted, so that wouldn’t have helped in the end.)

Tree traversal is definitely one of the weaker areas in Swift’s performance toolbox. There are ideas on how to improve it, but restructuring to indexed storage will probably still beat most of them if you’re willing to pay the ergonomics cost. And as you noted, at least you can use types to keep from making obvious mistakes there.


I may have misremembered; however, testing that old project again (an earlier experiments project, not published) and profiling it, I end up with instances of "outlined init with take of UInt?", themselves calling __swift_instantiateConcreteTypeFromMangledName. You're probably right: since those don't call anything, much less any allocation function, this is not actual memory allocation. Nevertheless, they show that the penalty is not just in the array itself: that array was preallocated at start, so those inits were of something other than array elements.

This is a moot point anyway: there is no doubt that an, ahem, particularized sentinel value is preferable, as profiling the following revision of that unpublished project, where I switched to using them, confirmed that performance improved: by 40%…

This is a bit of an issue, as algorithms involving trees are where I'm most likely to want to extract performance out of Swift. Indeed, if I'm looking to optimize algorithms with more predictable execution paths, maybe I will write the first version in Swift, but then I am going to want to target a programmable GPU, since everyone has one of those and they do well on branchless number crunching; and Swift is not going to help me with that (even counting the differentiable Swift experiments, since those were aimed at gradient descent, i.e. ML, rather than general-purpose computation).

The merge? Oh, it’s long done; it was just not significant enough to be worth writing home about.

The only real work was translating the state dumping code, which was tedious rather than hard. However, that work did convince me that I had taken on massive ergonomics debt by turning pointers into array indexes, and that servicing that debt would become crippling enough that I needed to pay it down ASAP. But how? To be continued…

Restoration of an ergonomic environment, possibly one as ergonomic as “regular Swift”, will of course be accomplished by introducing abstractions. There are some already, but as of the merge IndexedReference is a very thin one over UInt32: it tells the future programmer that this particular variable/property is meant to index the well-known array, but she has to spell out the indexing, and everything else, herself.

But since I'm entering territory I'm not well-versed in, I have to check my assumptions, and in particular the possibility that some of the abstractions I'm about to add may not be zero-cost; and the best way to do that is to watch for regressions.

And that is where I hit a first problem: while I have succeeded in getting rid of most retain/release traffic, I still have a mysterious 32-byte allocation in the hot code. And while this does not hamper scalability (as far as I can tell), it does provide cover for future regressions to hide within: I most definitely don't want anything I add to be allocating memory, so I need to make sure any such allocation is clearly caught in the profile. So now I have a new goal:

Zero allocation in hot code

And since this is close enough to the embedded Swift goals, I am going to make sure my code conforms to embedded Swift while I'm at it.

Now I want you to picture my situation: while I knew there were allocations, no tool could provide me more information about them, starting with the Allocations instrument, for which these allocations were anonymous and not tied to any object. I definitely don't have any explicit allocation call, or class instantiation, and the patterns did not seem to correspond to instances of Node<ReadyValue.Contribution>.

Without any lead, I started dismantling the few abstractions I did have, and you can see these commits in the log. Without success.

Out of ideas, I ended up disassembling the offending code; that of course did not directly tell me what these allocations were for, but by following the instructions, I hoped to figure out what kind of information would be stored in these buffers, and from there the reason for them.

And that worked: I figured out the allocations were meant for storing the closure context for common (the forward/reverse loop body inside iteratePossibleLeftNodes).

Wait, what? common does not escape, so why would it need an allocation for its closure context? Indeed, if this allocation were meant for an escaping closure, i.e. a reference type, it would result in at least release traffic; but there is none, and I had made sure of that.

And in particular, I have since watched John McCall's session, and at no point does the possibility of an allocation for storing the context of a non-escaping closure come up.

But I still needed to confirm the theory, which I did in the most straightforward way: by turning common into a top-level function. I then ceased to see any allocation in the hot code. Victory!

I did not keep that as the definitive solution: it made the code unwieldy to work with. Instead, I eliminated common and folded the code back into iteratePossibleLeftNodes, returning to the loop system that I had prior to eliminating dictionaries; in order to avoid reintroducing those, I pick the bin (forward or reverse) using an if branch, hoping the optimizer will notice the commonalities between the two alternatives and fold them together, because I don't want one more unpredictable branch.
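For the record, the restructuring that confirmed the theory, reduced to a hypothetical miniature (this shows the two arrangements; it does not, by itself, reproduce the allocation):

// Before (shape only): a local function capturing locals of the enclosing
// function; in my case, the compiler materialized its context on the heap.
func iterateNested(_ values: [Int]) -> Int {
    var currentLoc = 0
    func common(_ direction: Int) -> Int {
        currentLoc += direction   // capture of a mutable local
        return values[currentLoc] // capture of a parameter
    }
    return common(1) + common(1)
}

// After: the same body as a top-level function, with the would-be captures
// passed explicitly, leaving nothing to put in a closure context.
func commonTopLevel(_ values: [Int], _ currentLoc: inout Int, _ direction: Int) -> Int {
    currentLoc += direction
    return values[currentLoc]
}

func iterateFlat(_ values: [Int]) -> Int {
    var currentLoc = 0
    let first = commonTopLevel(values, &currentLoc, 1)
    let second = commonTopLevel(values, &currentLoc, 1)
    return first + second
}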

And now I'm ready to explore the actual ergonomics improvements…

Addendum: the analysis of the assembly code which led me to deduce that the allocation was a closure context; my annotations of the disassembly are in italics (this is arm64 assembly, where the target register goes first: most of the time this is the destination register, but for movements to memory, that is stores, this first register is the source)

$ otool -tvV AsyncCountdown | xcrun swift-demangle | less
AsyncCountdown.iteratePossibleLeftNodesPseudoReasync(inout AsyncCountdown.ResolveCoreCtxNonGCD, inout AsyncCountdown.ExploreAdditionOfRightNodeCtxNonGCD, (Swift.String) -> (), Swift.UInt32, AsyncCountdown.Op, AsyncCountdown.IndexedReferenceNonGCD<AsyncCountdown.ReadyValueNonGCD.Contribution>, (inout AsyncCountdown.ResolveCoreCtxNonGCD, inout AsyncCountdown.ExploreAdditionOfRightNodeCtxNonGCD, AsyncCountdown.Op, AsyncCountdown.IndexedReferenceNonGCD<AsyncCountdown.ReadyValueNonGCD.Contribution>) throws -> ()) throws -> ():
0000000100003ef8 sub sp, sp, #0xe0
0000000100003efc str x28, [sp, #0x80]
0000000100003f00 stp x27, x26, [sp, #0x90]
0000000100003f04 stp x25, x24, [sp, #0xa0]
0000000100003f08 stp x23, x22, [sp, #0xb0]
0000000100003f0c stp x20, x19, [sp, #0xc0]
0000000100003f10 stp x29, x30, [sp, #0xd0]
0000000100003f14 add x29, sp, #0xd0
0000000100003f18 str x7, [sp, #0x28]
0000000100003f1c stur w4, [x29, #-0x48]
0000000100003f20 str x3, [sp, #0x20]
0000000100003f24 str x2, [sp, #0x40]
0000000100003f28 ldp x8, x9, [x1]
0000000100003f2c adds x8, x8, x9
0000000100003f30 str x8, [sp, #0x50]
0000000100003f34 b.hs 0x1000043bc
0000000100003f38 mov x23, x6 ; <- this parameter gets moved to x23 (unsure which at this point: I only know that parameters start at x0, and that some parameters may require two registers)
0000000100003f3c mov x24, x5
0000000100003f40 mov x26, x1
0000000100003f44 mov x27, x0
0000000100003f48 str x21, [sp, #0x48]
0000000100003f4c adrp x0, 25 ; 0x10001c000
0000000100003f50 add x0, x0, #0x900
0000000100003f54 mov w1, #0x14 ; 20, probably the allocation size (rounded up to 32 by the time it gets traced by Instruments)
0000000100003f58 mov w2, #0x7
0000000100003f5c bl 0x10001726c ; symbol stub for: _swift_allocObject
0000000100003f60 mov x19, x0 ; <- address of the allocation gets saved to x19
0000000100003f64 mov x25, x0 ; ...and x25
0000000100003f68 str w23, [x25, #0x10]! ; and this 32-bit value, which we know to be a parameter, gets stuffed inside at offset 16, i.e. the end (after some built-in overhead?)
0000000100003f6c add x20, x27, #0x8
0000000100003f70 mov x0, x27
0000000100003f74 mov x1, x23 ; and that same 32-bit parameter now gets used as a parameter to popFromWithin(): it must be currentLoc, initialized from startingFrom
0000000100003f78 str x20, [sp, #0x58]
0000000100003f7c bl generic specialization <AsyncCountdown.ReadyValueNonGCD.Contribution> of AsyncCountdown.SimplyLinkedListNonGCD.popFromWithin(_: inout [AsyncCountdown.NodeNonGCD<A>], after: AsyncCountdown.IndexedReferenceNonGCD<A>) -> AsyncCountdown.IndexedReferenceNonGCD<A>
0000000100003f80 cmn w0, #0x1
0000000100003f84 b.eq 0x100004324
0000000100003f88 mov x22, x0
0000000100003f8c ldr x8, [x29, #0x10]
0000000100003f90 str x8, [sp, #0x38]
0000000100003f94 and w8, w24, #0x1
0000000100003f98 stur w8, [x29, #-0x44]
0000000100003f9c add x1, sp, #0x68
0000000100003fa0 mov x0, x25
0000000100003fa4 mov w2, #0x1
0000000100003fa8 mov x3, #0x0
0000000100003fac bl 0x10001729c ; symbol stub for: _swift_beginAccess
0000000100003fb0 str x19, [sp, #0x30] ; the address of our allocation gets spilled to the stack
0000000100003fb4 str x25, [sp, #0x18]
0000000100003fb8 b 0x100003fe0
0000000100003fbc str w22, [x25]
0000000100003fc0 mov x0, x27
0000000100003fc4 mov x1, x22
0000000100003fc8 ldr x20, [sp, #0x58]
0000000100003fcc bl generic specialization <AsyncCountdown.ReadyValueNonGCD.Contribution> of AsyncCountdown.SimplyLinkedListNonGCD.popFromWithin(_: inout [AsyncCountdown.NodeNonGCD<A>], after: AsyncCountdown.IndexedReferenceNonGCD<A>) -> AsyncCountdown.IndexedReferenceNonGCD<A>
0000000100003fd0 mov x23, x22
0000000100003fd4 mov x22, x0
0000000100003fd8 cmn w0, #0x1
0000000100003fdc b.eq 0x100004324
0000000100003fe0 ldr x8, [sp, #0x50]
0000000100003fe4 cbnz x8, 0x100003fec
0000000100003fe8 str w23, [x26, #0x18]
0000000100003fec mov w28, w22
0000000100003ff0 ldr x24, [x27]
0000000100003ff4 ldr x9, [x24, #0x10]
0000000100003ff8 cmp x9, x28
0000000100003ffc b.ls 0x10000435c
0000000100004000 add x8, x24, #0x20
0000000100004004 lsl x10, x28, #4
0000000100004008 str x10, [sp, #0x60]
000000010000400c add x10, x8, x10
0000000100004010 ldrb w10, [x10, #0xc]
0000000100004014 cmp w10, #0x2
0000000100004018 b.eq 0x10000402c
000000010000401c ldr x10, [x27, #0x30]
0000000100004020 subs x10, x10, #0x1
0000000100004024 b.vs 0x100004394
0000000100004028 str x10, [x27, #0x30]
000000010000402c mov x20, x23
0000000100004030 ldr x10, [sp, #0x58]
0000000100004034 ldr w23, [x10]
0000000100004038 cmp x9, x23
000000010000403c b.ls 0x100004360
0000000100004040 lsl x25, x23, #4
0000000100004044 add x8, x8, x25
0000000100004048 ldr w19, [x8, #0x8]
000000010000404c mov x0, x24
0000000100004050 bl 0x10001741c ; symbol stub for: _swift_isUniquelyReferenced_nonNull_native
0000000100004054 tbz w0, #0x0, 0x100004304
0000000100004058 ldr x9, [x24, #0x10]
000000010000405c cmp x9, x28
0000000100004060 ldr x21, [sp, #0x48]
0000000100004064 b.ls 0x100004364
0000000100004068 add x10, x24, #0x20
000000010000406c ldr x8, [sp, #0x60]
0000000100004070 add x8, x10, x8
0000000100004074 str w19, [x8]
0000000100004078 cmp x9, x23
000000010000407c b.ls 0x100004368
0000000100004080 add x9, x10, x25
0000000100004084 str w22, [x9, #0x8]
0000000100004088 str x24, [x27]
000000010000408c ldr w9, [x26, #0x10]
0000000100004090 ldr w8, [x8, #0x4]
0000000100004094 ldur w10, [x29, #-0x44]
0000000100004098 cbnz w10, 0x1000040b4
000000010000409c adds w8, w9, w8
00000001000040a0 ldp x24, x19, [sp, #0x20]
00000001000040a4 ldur w23, [x29, #-0x48]
00000001000040a8 ldr x25, [sp, #0x30] ; the address of our allocation is recovered from the stack
00000001000040ac b.lo 0x1000040cc ; assume this branch is taken...
00000001000040b0 b 0x100004398 ; ...because this one skips the loop
00000001000040b4 umull x8, w9, w8
00000001000040b8 tst x8, #0xffffffff00000000
00000001000040bc ldp x24, x19, [sp, #0x20]
00000001000040c0 ldur w23, [x29, #-0x48]
00000001000040c4 ldr x25, [sp, #0x30]
00000001000040c8 b.ne 0x10000439c
00000001000040cc str w8, [x26, #0x10]
00000001000040d0 ldr x8, [x26]
00000001000040d4 adds x8, x8, #0x1
00000001000040d8 b.hs 0x10000436c
00000001000040dc ldr x9, [sp, #0x50]
00000001000040e0 cmp x9, #0x0
00000001000040e4 cset w9, eq
00000001000040e8 str x8, [x26]
00000001000040ec ldp x5, x8, [sp, #0x38]
00000001000040f0 stp x8, x24, [sp, #0x8]
00000001000040f4 str w23, [sp, #0x4]
00000001000040f8 strb w9, [sp]
00000001000040fc mov x0, x27
0000000100004100 mov x1, x26
0000000100004104 mov w2, #0x0
0000000100004108 mov x3, x22
000000010000410c mov x4, x19
0000000100004110 ldur w6, [x29, #-0x44]
0000000100004114 mov x7, x25 ; finally, our allocation address is passed as some sort of parameter to common, which along with its contents (initially startingFrom, but it in fact tracks the value of currentLoc in later loop iterations) makes it clear it is a dynamically allocated closure context
0000000100004118 bl commonPseudoReasync #1 (_: inout AsyncCountdown.ResolveCoreCtxNonGCD, _: inout AsyncCountdown.ExploreAdditionOfRightNodeCtxNonGCD, direction: AsyncCountdown.Direction) throws -> () in AsyncCountdown.iteratePossibleLeftNodesPseudoReasync(inout AsyncCountdown.ResolveCoreCtxNonGCD, inout AsyncCountdown.ExploreAdditionOfRightNodeCtxNonGCD, (Swift.String) -> (), Swift.UInt32, AsyncCountdown.Op, AsyncCountdown.IndexedReferenceNonGCD<AsyncCountdown.ReadyValueNonGCD.Contribution>, (inout AsyncCountdown.ResolveCoreCtxNonGCD, inout AsyncCountdown.ExploreAdditionOfRightNodeCtxNonGCD, AsyncCountdown.Op, AsyncCountdown.IndexedReferenceNonGCD<AsyncCountdown.ReadyValueNonGCD.Contribution>) throws -> ()) throws -> ()
000000010000411c cbnz x21, 0x10000432c
0000000100004120 ldr x8, [x26]
0000000100004124 subs x8, x8, #0x1
0000000100004128 b.lo 0x100004370
000000010000412c str x8, [x26]


I will present the techniques I used one by one, with examples, but first I need to expose an important principle:

Principle 1: Setting up ergonomic abstractions must be ergonomic itself

I consider the addition of access control to be the most significant contribution of C++ over “plain old C”: private/protected/public, the extended use of const; above even classes or templates. However, while its realization in C++ is undoubtedly useful for setting up interfaces that channel the user towards using the API in a maintainable way, performing that actual setup requires superhuman planning skills: for instance, which member functions ought to be marked virtual, and which could afford not to be? Will virtual inheritance turn out to be necessary, or overkill? And what’s the tradeoff between non-public inheritance and ordinary composition? Etc. While it’s not unsolvable, those always felt a bit “raw”.

Swift allows us to be ergonomic even when setting up abstractions, and I intend to make full use of that. In particular, these abstractions themselves may need to evolve, which in some cases means their definitions will benefit from being “templated” themselves, one way or another.

This will unfold in the next installments (or if you’re impatient, you can already see the satisfyingly ergonomic state of the code: it’s already uploaded to SourceHut).