Improving Swift code for scalability

I recently acquired a Mac mini M2 (4 performance cores, 4 efficiency cores); even though I did not get the M2 Pro (much less the SMP monsters featured in the Mac Studio), I am very interested in seeing how well I can exploit these CPUs using thread-level parallelism. Particularly so with Swift, which was released at a time when all Macs and iOS devices were already multi-core and seemed tailored to that multi-core reality, a potential finally unveiled with the concurrency features introduced in Swift 5.5.

So imagine my disappointment when I ran my experimentation project on it and obtained the following run times:

Naive Swift

 N|  T  | factor
 1|100  | ref
 2| 83.5|x1.20
 3| 69.5|x1.44
 4| 79.5|x1.26
 5| 60.8|x1.64
 6| 50.0|x2
 7| 47.2|x2.12
 8| 39.8|x2.51
 9| 45.2|x2.21
10| 45.1|x2.22
11| 45.3|x2.21
12| 39.2|x2.56

The first column is the number of tasks that my code spawns before it requires that one task complete as a condition for an additional task to be spawned; the code runs as micro-batches whose size is independent of both the number of processors and this parameter. I went up to 12 as a sanity check.

The second column is the time taken in each scenario, normalized to a duration of 100 for the sequential case (i.e. where the above parameter is 1; that code uses the async APIs too anyway, so the comparison is fair): I am not interested in the absolute performance here, only in whether it scales as expected. Smaller numbers are better.

The third column is how much better that scenario performs compared to the reference, expressed as a performance multiplier. It is the inverse of the second column, presented for convenience. Higher numbers are better.
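The spawning policy measured by the first column can be sketched as follows; this is an illustrative reconstruction, not the project's actual code (the names and the micro-batch payload are made up, and the real loop is visible in the patch in footnote 1):

```swift
import Foundation

// Illustrative stand-in for one micro-batch of real work.
func compute(_ batch: Int) { _ = (0..<1_000).reduce(0, &+) }

// Sketch of the spawning policy described above: up to `width` tasks are
// outstanding at any time; spawning an additional one requires waiting for
// a previous task to complete first.
func process(batches: [Int], width: Int) async {
    await withTaskGroup(of: Void.self) { group in
        var outstanding = 0
        for batch in batches {
            if outstanding >= width {
                _ = await group.next()  // a task must complete…
                outstanding -= 1        // …before another may be spawned
            }
            group.addTask { compute(batch) }
            outstanding += 1
        }
    }
}
```

Note that the batch size stays fixed regardless of `width`, which is what makes the run times across rows comparable.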

Methodology: I opened my project (which you can find attached to the May 17th comment at It's time for `reasync` · Issue #65475 · apple/swift · GitHub) with Xcode 15.0 beta 8 and profiled it with the time profiler template. For each run, I took the difference between the end time and the initialization time, both reported by Instruments. After an initial run, whose results were discarded, to warm the caches, each scenario was run 3 times, and the median time was kept as the scenario outcome. Parallelism was controlled through an environment variable, which requires the following patch to be applied over the project version from May 17th¹; that the required parallelism was actually obtained was confirmed by inspection in the time profiler instrument (except for parallelism values above 8). The complexity of the computation was chosen so that the scenario completes in a handful of seconds, even in the reference (longest-duration) scenario, so as to avoid thermal throttling effects.

Discussion

Right off the bat, there are obviously some issues with these results: you can see performance regress when going from 3-way to 4-way parallelism, and again when going from 8-way to 9-way (before recovering at 12); there appear to be bimodal behaviors at work. But the general trend is clear: scalability is awful. I double-checked the time profiler instrument for obvious misbehavior, such as a single task completing much later than the others and making the processing finish much later than expected, but no such luck. For sanity, I also checked on the machine I was relying on before (2016 13″ MacBook Pro, no Touch Bar), but I saw similar results when going from 1 to 2 parallel tasks.

When I tried the dedicated Swift concurrency template, it failed to provide much insight: to begin with, it appears to be overwhelmed by the number of tasks that are spawned (~5000), even though there are never more than N in flight at any given time, resulting in my processing slowing to a crawl.

Looking at the profiling results, I started to have ideas for improvements, but first I wanted to establish what kind of scalability I could legitimately expect, so as to know what I could demand of my Swift code. So I reimplemented my processing in C, with concurrency obtained through libdispatch.
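The C reimplementation itself is not reproduced in this post, but the bounded-width libdispatch pattern it relies on can be sketched as below (expressed here in Swift against the same Dispatch API; the C spelling goes through dispatch_semaphore_create, dispatch_group_async, etc. — the function and payload names are illustrative):

```swift
import Dispatch

// Illustrative stand-in for one micro-batch of real work.
func compute(_ batch: Int) { _ = (0..<1_000).reduce(0, &+) }

// Bounded-width concurrency with libdispatch: a counting semaphore caps
// the number of in-flight work items at `width`, and a group lets the
// caller wait for all of them to drain.
func process(batches: [Int], width: Int) {
    let slots = DispatchSemaphore(value: width)
    let group = DispatchGroup()
    let queue = DispatchQueue.global(qos: .userInitiated)
    for batch in batches {
        slots.wait() // blocks until one of the `width` slots frees up
        queue.async(group: group) {
            defer { slots.signal() } // release the slot when the batch is done
            compute(batch)
        }
    }
    group.wait() // all batches completed
}
```

This is the baseline against which the Swift concurrency version is compared: same policy, different runtime underneath.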

Fast-forward a few weeks…

C with libdispatch

 N|  T  | factor
 1|100  | ref
 2| 55.9|x1.79
 3| 36.5|x2.74
 4| 28.8|x3.47
 5| 28.3|x3.53
 6| 27.5|x3.64
 7| 26.4|x3.79
 8| 24.6|x4.07
 9| 24.0|x4.16
10| 23.7|x4.23
11| 22.8|x4.38
12| 22.8|x4.38

Methodology is similar; the only difference from the Swift version is the computation load, which was adjusted so that the total duration in the reference scenario is in the same ballpark, so as to obtain comparable thermal throttling effects, if any.

Discussion

These results are much better behaved, for a start (this is confirmed by the individual runs, which group satisfyingly tightly), and while scaling is not linear from 1 to 4, it is close enough that I don't feel computational resources are wasted. As expected, going from 4 to 8 improves the performance much less as we get into the efficiency cores, but it still improves the situation. Performance does still improve beyond 8, possibly because we afford the system more opportunities for scheduling useful work, but we clearly hit diminishing returns by that point.

Coming back to the Swift version, one peculiarity I noticed in the profile is some symbols related to dictionaries (hashing-related infrastructure, in particular) jumping from nowhere near the top in the reference runs to the top of the symbols (sorted by descending weight in the inverted call tree) for all other runs, even when parallelism is barely 2. So I worked on eliminating dictionaries from my naive Swift code, and repeated the experiment.
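The core of the dictionary elimination (visible in the patch in footnote 2) is replacing the `[Bool: UInt32]` decomposed value with a plain struct, one stored property per former key, so that every access becomes a direct field load instead of a hash lookup; a minimal sketch of the transformation, with `DecomposedValue` as an illustrative name:

```swift
// Before: each access hashes the Bool key, goes through bridging-free but
// still non-trivial dictionary machinery, and the heap-allocated storage
// is subject to reference counting and uniqueness checks.
//     var decomposedValue = [false: kind.neutralValue(),
//                            true:  kind.neutralValue()]
//     decomposedValue[reverse]! = kind.combine(decomposedValue[reverse]!, v)

// After: a trivially copyable value type, mirroring the
// ExploreAdditionOfRightNodeCtxNonGCD struct introduced by the patch.
struct DecomposedValue: Sendable {
    var forwardValue: UInt32  // was decomposedValue[false]
    var reverseValue: UInt32  // was decomposedValue[true]
}
```

Since the set of keys was fixed and known at compile time (`false` and `true`), nothing is lost by the struct representation, and copies of the context are cheap bitwise copies.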

Swift without dictionaries

 N|  T  | factor
 1|100  | ref
 2| 69.1|x1.45
 3| 62.0|x1.61
 4| 58.6|x1.71
 5| 48.3|x2.07
 6| 41.9|x2.39
 7| 37.9|x2.64
 8| 36.1|x2.77
 9| 36.1|x2.77
10| 36.1|x2.77
11| 35.9|x2.78
12| 36.0|x2.77

Methodology is the same as for the naive Swift table; the only difference is the improvement patch, attached here² (I am not happy with some of the compromises that I had to make in order to remove usage of dictionaries; I welcome suggestions for better ways to handle the situation).

Discussion

This code is already much better behaved than the naive Swift version (all the more obvious in the individual runs). Scalability is much better at the earlier levels, though behavior is somewhat erratic (a much bigger improvement going from 4 to 5 than from 3 to 4 parallel tasks), and the factor caps at a level not much better than it did earlier.

At any rate, we are still far from scalability that could legitimately be expected of this algorithm.

Further improvements

The improvements I obtained by removing dictionaries suggest that the scalability issues I am having here may not be related solely to the way I use the Swift concurrency APIs, but also to how the Swift code proper behaves, so I think this should be a significant avenue for investigation.

Unfortunately, I have failed to extract any further leads from what Instruments or Xcode can tell me, so I am looking for means to investigate the remaining scalability impediments: tooling, experiments to perform, known issues, best practices, etc.


1:

diff -r 330582a717f6 -r d2ef6283be0b AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift     Wed May 17 01:41:53 2023 +0200
+++ b/AsyncCountdown/AsyncAPI.swift     Fri Jul 14 22:58:55 2023 +0200
@@ -223,7 +223,11 @@
                                  source: DispatchSource.makeSignalSource(signal: SIGINT,
                                              queue: DispatchQueue.global(qos:.userInitiated)),
                                  cancelHandler: { Task { await cancellator.raise();}});
-        let _ = Computer(exec: { () -> Channel in
+        var param: UInt?;
+        if let config = getenv("net_wanderingcoder_cpucount"), config[0] != 0 {
+            param = UInt(strlen(config));
+        }
+        let _ = Computer(cpuCount: param, exec: { () -> Channel in
             return await Channel(exec: { (sink) in
                 /* While the subtasks added to this group never throw themselves,
                  this is a throwing group so that it can be traversed by a thrown
diff -r 330582a717f6 -r d2ef6283be0b AsyncCountdown/PullAPI.swift
--- a/AsyncCountdown/PullAPI.swift      Wed May 17 01:41:53 2023 +0200
+++ b/AsyncCountdown/PullAPI.swift      Fri Jul 14 22:58:55 2023 +0200
@@ -21,14 +21,18 @@
 }
 
 @available(macOS 12.0, iOS 15.0, *) struct Computer<BatchIterator: BatchIteratorProtocol> {
-    init(exec: @Sendable @escaping () async -> BatchIterator) {
+    init(cpuCount: UInt? = nil, exec: @Sendable @escaping () async -> BatchIterator) {
+        guard cpuCount != 0 else
+        {
+            fatalError("Computer instantiated with 0 CPU!");
+        }
         Task {
             await withTaskGroup(of: Void.self) { group in
                 var outstandingTasks = UInt(0);
                 var iterator = await exec();
 
                 while (true) {
-                    guard outstandingTasks < ProcessInfo.processInfo.activeProcessorCount else {
+                    guard outstandingTasks < cpuCount ?? UInt(ProcessInfo.processInfo.activeProcessorCount) else {
                         _ = await group.next();
                         outstandingTasks -= 1;
                         continue;

2:

diff -r d2ef6283be0b -r cdcaaddad141 AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift     Fri Jul 14 22:58:55 2023 +0200
+++ b/AsyncCountdown/AsyncAPI.swift     Sun Jul 02 19:01:50 2023 +0200
@@ -16,20 +16,14 @@
 import Foundation;
 import os;
 
-typealias DispatcherAsync = (_ l: inout [ReadyValueNonGCD],
-                        _ composedCount: inout Int,
-                        _ otherComposedCount: inout Int,
-                        _ decomposedValue: inout [Bool:UInt32],
-                        _ downstreamSiblingsRecruitmentSkip: inout Int,
+typealias DispatcherAsync = (_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                        _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                         _ kind: Op,
                         _ startingFrom: Int,
                         _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws -> Void;
 
-typealias DispatcherPseudoReasync = (_ l: inout [ReadyValueNonGCD],
-                        _ composedCount: inout Int,
-                        _ otherComposedCount: inout Int,
-                        _ decomposedValue: inout [Bool:UInt32],
-                        _ downstreamSiblingsRecruitmentSkip: inout Int,
+typealias DispatcherPseudoReasync = (_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                     _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                         _ kind: Op,
                         _ startingFrom: Int,
                         _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
@@ -237,62 +231,29 @@
                  */
                 do {
                     try await withThrowingTaskGroup(of: Void.self) { resultsScopingGroup in
-                        func iteratePossibleLeftNodesDispatch(_ l: inout [ReadyValueNonGCD],
-                                                              _ composedCount: inout Int,
-                                                              _ otherComposedCount: inout Int,
-                                                              _ decomposedValue: inout [Bool:UInt32],
-                                                              _ downstreamSiblingsRecruitmentSkip: inout Int,
+                        func iteratePossibleLeftNodesDispatch(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                                              _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                               _ kind: Op,
                                                               _ startingFrom: Int,
                                                               _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
                             // reseat this divergence over a copy of the whole state
-                            let paramL = l;
-                            let paramComposedCount = composedCount;
-                            let paramOtherComposedCount = otherComposedCount;
-                            let paramDecomposedValue = decomposedValue;
-                            let paramDownstreamSiblingsRecruitmentSkip = downstreamSiblingsRecruitmentSkip;
+                            let paramResolveCoreCtx = resolveCoreCtx;
+                            let paramExploreAdditionOfRightNodeCtx = exploreAdditionOfRightNodeCtx;
                             
                             let task = Task {
                                 await withTaskGroup(of: Void.self) { innerResultsScopingGroup in
-                                    /*func iteratePossibleLeftNodesFakeDispatchAsync(_ l: inout [ReadyValueNonGCD],
-                                     _ composedCount: inout Int,
-                                     _ otherComposedCount: inout Int,
-                                     _ decomposedValue: inout [Bool:UInt32],
-                                     _ downstreamSiblingsRecruitmentSkip: inout Int,
-                                     _ kind: Op,
-                                     _ startingFrom: Int,
-                                     _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async {
-                                     await iteratePossibleLeftNodesAsync(&l,
-                                     &composedCount,
-                                     &otherComposedCount,
-                                     &decomposedValue,
-                                     &downstreamSiblingsRecruitmentSkip,
-                                     resultReceiver,
-                                     startGoal,
-                                     kind,
-                                     startingFrom,
-                                     walkOperands,
-                                     iteratePossibleLeftNodesFakeDispatchAsync);
-                                     }*/
-                                    
-                                    func iteratePossibleLeftNodesFakeDispatchPseudoReasync(_ l: inout [ReadyValueNonGCD],
-                                                                                           _ composedCount: inout Int,
-                                                                                           _ otherComposedCount: inout Int,
-                                                                                           _ decomposedValue: inout [Bool:UInt32],
-                                                                                           _ downstreamSiblingsRecruitmentSkip: inout Int,
-                                                                                           _ kind: Op,
-                                                                                           _ startingFrom: Int,
-                                                                                           _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) {
-                                        iteratePossibleLeftNodesPseudoReasync(&l,
-                                                                              &composedCount,
-                                                                              &otherComposedCount,
-                                                                              &decomposedValue,
-                                                                              &downstreamSiblingsRecruitmentSkip,
+                                    func iteratePossibleLeftNodesFakeDispatchPseudoReasync(_ innerResolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                                                                           _ innerExploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
+                                                                                           _ innerKind: Op,
+                                                                                           _ innerStartingFrom: Int,
+                                                                                           _ innerWalkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) {
+                                        iteratePossibleLeftNodesPseudoReasync(&innerResolveCoreCtx,
+                                                                              &innerExploreAdditionOfRightNodeCtx,
                                                                               { result in innerResultsScopingGroup.addTask { await resultMediator.yield(result);}},
                                                                               startGoal,
-                                                                              kind,
-                                                                              startingFrom,
-                                                                              walkOperands,
+                                                                              innerKind,
+                                                                              innerStartingFrom,
+                                                                              innerWalkOperands,
                                                                               iteratePossibleLeftNodesFakeDispatchPseudoReasync);
                                     }
                                     
@@ -303,17 +264,11 @@
                                      partial functions on the same thread either. */
                                     os_signpost(.begin, log: LogWrapper.gObservationHandle, name: "micro-batch", signpostID: OSSignpostID(log: LogWrapper.gObservationHandle, object: Thread.current));
                                     
-                                    var childL = paramL;
-                                    var childComposedCount = paramComposedCount;
-                                    var childOtherComposedCount = paramOtherComposedCount;
-                                    var childDecomposedValue = paramDecomposedValue;
-                                    var childDownstreamSiblingsRecruitmentSkip = paramDownstreamSiblingsRecruitmentSkip;
+                                    var childResolveCoreCtx = paramResolveCoreCtx;
+                                    var childExploreAdditionOfRightNodeCtx = paramExploreAdditionOfRightNodeCtx;
                                     
-                                    iteratePossibleLeftNodesFakeDispatchPseudoReasync(&childL,
-                                                                                      &childComposedCount,
-                                                                                      &childOtherComposedCount,
-                                                                                      &childDecomposedValue,
-                                                                                      &childDownstreamSiblingsRecruitmentSkip,
+                                    iteratePossibleLeftNodesFakeDispatchPseudoReasync(&childResolveCoreCtx,
+                                                                                      &childExploreAdditionOfRightNodeCtx,
                                                                                       kind,
                                                                                       startingFrom,
                                                                                       walkOperands);
@@ -332,11 +287,8 @@
                             try await cancellator.checkRaised();
                         }
                         
-                        func iteratePossibleLeftNodesDispatchLoop(_ l: inout [ReadyValueNonGCD],
-                                                                  _ composedCount: inout Int,
-                                                                  _ otherComposedCount: inout Int,
-                                                                  _ decomposedValue: inout [Bool:UInt32],
-                                                                  _ downstreamSiblingsRecruitmentSkip: inout Int,
+                        func iteratePossibleLeftNodesDispatchLoop(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                                                  _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                                   _ kind: Op,
                                                                   _ startingFrom: Int,
                                                                   _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
@@ -349,13 +301,10 @@
                              - 4 already means too few possibilities to explore
                              - 3 is right out
                              Note that among other properties, this estimator value is monotonic. */
-                            guard startingFrom < (l.count - composedCount),
-                                  l.count + (walkOperands({_,_ in return ();}) != nil ? 1 : 0) == 5 else {
-                                try await iteratePossibleLeftNodesAsync(&l,
-                                                                        &composedCount,
-                                                                        &otherComposedCount,
-                                                                        &decomposedValue,
-                                                                        &downstreamSiblingsRecruitmentSkip,
+                            guard startingFrom < (resolveCoreCtx.l.count - resolveCoreCtx.composedCount),
+                                  resolveCoreCtx.l.count + (walkOperands({_,_ in return ();}) != nil ? 1 : 0) == 5 else {
+                                try await iteratePossibleLeftNodesAsync(&resolveCoreCtx,
+                                                                        &exploreAdditionOfRightNodeCtx,
                                                                         { result in resultsScopingGroup.addTask { await resultMediator.yield(result);}},
                                                                         startGoal,
                                                                         kind,
@@ -365,11 +314,8 @@
                                 return;
                             }
                             
-                            try await iteratePossibleLeftNodesDispatch(&l,
-                                                                       &composedCount,
-                                                                       &otherComposedCount,
-                                                                       &decomposedValue,
-                                                                       &downstreamSiblingsRecruitmentSkip,
+                            try await iteratePossibleLeftNodesDispatch(&resolveCoreCtx,
+                                                                       &exploreAdditionOfRightNodeCtx,
                                                                        kind,
                                                                        startingFrom,
                                                                        walkOperands);
diff -r d2ef6283be0b -r cdcaaddad141 AsyncCountdown/AsyncCountdown.swift
--- a/AsyncCountdown/AsyncCountdown.swift       Fri Jul 14 22:58:55 2023 +0200
+++ b/AsyncCountdown/AsyncCountdown.swift       Sun Jul 02 19:01:50 2023 +0200
@@ -101,24 +101,32 @@
 }
 
 @available(macOS 12.0.0, iOS 15.0, *)
+struct ResolveCoreCtxNonGCD:Sendable {
+    var l = [ReadyValueNonGCD]();
+    var composedCount = 0;
+    var otherComposedCount = 0;
+};
+
+@available(macOS 12.0.0, iOS 15.0, *)
+struct ExploreAdditionOfRightNodeCtxNonGCD:Sendable {
+    var forwardValue: UInt32;
+    var reverseValue: UInt32;
+    var downstreamSiblingsRecruitmentSkip: Int;
+}
+
+
+@available(macOS 12.0.0, iOS 15.0, *)
 func resolveCoreNonGCD(_ injectedDispatcher: DispatcherAsync, _ primitives: [UInt32]) async throws {
-    var referenceL = [ReadyValueNonGCD]();
-    var referenceComposedCount = 0;
-    var referenceOtherComposedCount = 0;
-            
+    var reference = ResolveCoreCtxNonGCD();
             
     for element in primitives {
-        referenceL.append(.init(element));
+        reference.l.append(.init(element));
     }
         
-    try await exploreAdditionOfFloorAsync(&referenceL,
-                                     &referenceComposedCount,
-                                     &referenceOtherComposedCount,
+    try await exploreAdditionOfFloorAsync(&reference,
                                      injectedDispatcher,
                                      kind: .additive);
-    try await exploreAdditionOfFloorAsync(&referenceL,
-                                     &referenceComposedCount,
-                                     &referenceOtherComposedCount,
+    try await exploreAdditionOfFloorAsync(&reference,
                                      injectedDispatcher,
                                      kind: .multiplicative);
 }
diff -r d2ef6283be0b -r cdcaaddad141 AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift       Fri Jul 14 22:58:55 2023 +0200
+++ b/AsyncCountdown/AsyncExplorers.swift       Sun Jul 02 19:01:50 2023 +0200
@@ -16,45 +16,37 @@
 import Foundation;
             
             
-func exploreAdditionOfFloorAsync(_ l: inout [ReadyValueNonGCD],
-                                               _ composedCount: inout Int,
-                                               _ otherComposedCount: inout Int,
+func exploreAdditionOfFloorAsync(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
                                                _ dispatcher: DispatcherAsync,
                                            kind: Op) async rethrows {
-    guard otherComposedCount == 0 else {
+    guard resolveCoreCtx.otherComposedCount == 0 else {
         return;
     }
     
-    otherComposedCount = composedCount;
-    composedCount = 0;
+    resolveCoreCtx.otherComposedCount = resolveCoreCtx.composedCount;
+    resolveCoreCtx.composedCount = 0;
     
-    try await exploreAdditionOfRightNodeAsync(&l,
-                                          &composedCount,
-                                          &otherComposedCount,
+    try await exploreAdditionOfRightNodeAsync(&resolveCoreCtx,
                                           dispatcher,
                                           kind: kind,
                                           skip: 0);
 
-    composedCount = otherComposedCount;
-    otherComposedCount = 0;
+    resolveCoreCtx.composedCount = resolveCoreCtx.otherComposedCount;
+    resolveCoreCtx.otherComposedCount = 0;
 }
 
-func exploreAdditionOfRightNodeAsync(_ l: inout [ReadyValueNonGCD],
-                                               _ composedCount: inout Int,
-                                               _ otherComposedCount: inout Int,
+func exploreAdditionOfRightNodeAsync(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
                                                _ dispatcher: DispatcherAsync,
                                                kind: Op,
                                                skip: Int) async rethrows {
-    var referenceDecomposedValue = [false: kind.neutralValue(), true: kind.neutralValue()];
+    var reference = ExploreAdditionOfRightNodeCtxNonGCD(forwardValue: kind.neutralValue(),
+                                                        reverseValue: kind.neutralValue(),
+                                                        downstreamSiblingsRecruitmentSkip: Int(0)) // placeholder
     // Siblings are the other right nodes on the same floor
     // Downstream ones are those created while this one is active
-    var downstreamSiblingsRecruitmentSkip = Int(0); // placeholder
     
-    try await dispatcher(&l,
-                     &composedCount,
-                     &otherComposedCount,
-                     &referenceDecomposedValue,
-                     &downstreamSiblingsRecruitmentSkip,
+    try await dispatcher(&resolveCoreCtx,
+                     &reference,
                      kind,
                      /* startingFrom: */ skip,
                      { _ in return nil;});
@@ -97,11 +89,8 @@
 the same floor) recovering candidates in a way that results in a composition
 that we ourselves explored earlier, but that is the basic principle.
 */
-func iteratePossibleLeftNodesAsync(_ l: inout [ReadyValueNonGCD],
-                                             _ composedCount: inout Int,
-                                             _ otherComposedCount: inout Int,
-                                             _ decomposedValue: inout [Bool:UInt32],
-                                             _ downstreamSiblingsRecruitmentSkip: inout Int,
+func iteratePossibleLeftNodesAsync(_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                                             _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                              _ resultReceiver: (_ result: String) -> Void,
                                              _ startGoal: UInt32,
                                              _ kind: Op,
@@ -117,35 +106,38 @@
     would never reach 0 no matter how many siblings we'd add, and so we'd never be
     able to close the floor anyway. So it'd be pointless to keep going. */
     for candidateReverseOffset in
-            startingFrom ..< ((emptyComposition && (otherComposedCount > 0)) ? startingFrom+1 : (l.count - composedCount) ) {
+            startingFrom ..< ((emptyComposition && (resolveCoreCtx.otherComposedCount > 0)) ? startingFrom+1 : (resolveCoreCtx.l.count - resolveCoreCtx.composedCount) ) {
         if emptyComposition {
-            downstreamSiblingsRecruitmentSkip = candidateReverseOffset;
+            exploreAdditionOfRightNodeCtx.downstreamSiblingsRecruitmentSkip = candidateReverseOffset;
         }
 
         /* Note: new compositions are added at the end of the array, i.e. at the high offsets,
          while primitives (if any left) will be at the low offsets. As a result, the candidates
          to be recruited as a first priority are found at the high offsets first, though not so
          high that we would end up recruiting from the same floor (forbidden). */
-        let rightChild = l.remove(at: (l.count - composedCount - 1) - candidateReverseOffset);
+        let rightChild = resolveCoreCtx.l.remove(at: (resolveCoreCtx.l.count - resolveCoreCtx.composedCount - 1) - candidateReverseOffset);
         if let _ = rightChild.expr {
-            otherComposedCount -= 1;
+            resolveCoreCtx.otherComposedCount -= 1;
         }
         
-        for phase in 0...1 {
-            let reverse = (phase == 1);
-            { (_ valueComponent: inout UInt32) in
-                valueComponent = kind.combine(valueComponent, rightChild.value);
-            }(&decomposedValue[reverse]!);
-            
+        func combine(_ total: inout UInt32) {
+            total = kind.combine(total, rightChild.value);
+        }
+
+        func uncombine(_ total: inout UInt32) {
+            total = kind.uncombine(total, rightChild.value);
+        }
+        
+        func commonAsync(_ innerResolveCoreCtx: inout ResolveCoreCtxNonGCD,
+                         _ innerExploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
+                         isReverse: Bool,
+                         _ innerDispatcher: DispatcherAsync) async rethrows {
             let selfNode = {@Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void? in
-                return action(rightChild, reverse) ?? walkOperands(action);
+                return action(rightChild, isReverse) ?? walkOperands(action);
             };
             
-            try await dispatcher(&l,
-                             &composedCount,
-                             &otherComposedCount,
-                             &decomposedValue,
-                             &downstreamSiblingsRecruitmentSkip,
+            try await innerDispatcher(&innerResolveCoreCtx,
+                             &innerExploreAdditionOfRightNodeCtx,
                              kind,
                              /* startingFrom: */ candidateReverseOffset,
                              selfNode);
@@ -153,49 +145,49 @@
             // close current composition
             var num = 0;
             if (selfNode({_,_ in guard num == 0 else {return ();}; num += 1; return nil;}) != nil)
-                    && ( (kind == .additive) ? decomposedValue[false]! > decomposedValue[true]! :
-                            ((decomposedValue[false]! % decomposedValue[true]!) == 0) ) {
+                && ( (kind == .additive) ? innerExploreAdditionOfRightNodeCtx.forwardValue > innerExploreAdditionOfRightNodeCtx.reverseValue :
+                            ((innerExploreAdditionOfRightNodeCtx.forwardValue % innerExploreAdditionOfRightNodeCtx.reverseValue) == 0) ) {
                 
-                let realizedValue = kind.uncombine(decomposedValue[false]!, decomposedValue[true]!);
+                let realizedValue = kind.uncombine(innerExploreAdditionOfRightNodeCtx.forwardValue, innerExploreAdditionOfRightNodeCtx.reverseValue);
                 
                 let readyValue = ReadyValueNonGCD(value: realizedValue,
                                                   expr: .init(op: kind,
                                                        operatorsBrowser: selfNode))
                 
-                if l.count == 0 {
+                if innerResolveCoreCtx.l.count == 0 {
                     if realizedValue == startGoal {
                         resultReceiver(readyValue.description());
                     }
                 } else {
-                    composedCount += 1;
-                    l.append(readyValue);
+                    innerResolveCoreCtx.composedCount += 1;
+                    innerResolveCoreCtx.l.append(readyValue);
                     
-                    try await exploreAdditionOfRightNodeAsync(&l,
-                                                          &composedCount,
-                                                          &otherComposedCount,
-                                                          dispatcher,
+                    try await exploreAdditionOfRightNodeAsync(&innerResolveCoreCtx,
+                                                              innerDispatcher,
                                                           kind: kind,
-                                                          skip: downstreamSiblingsRecruitmentSkip);
-                    try await exploreAdditionOfFloorAsync(&l,
-                                                      &composedCount,
-                                                      &otherComposedCount,
-                                                      dispatcher,
+                                                              skip: innerExploreAdditionOfRightNodeCtx.downstreamSiblingsRecruitmentSkip);
+                    try await exploreAdditionOfFloorAsync(&innerResolveCoreCtx,
+                                                          innerDispatcher,
                                                       kind: ((kind == .additive) ? .multiplicative : .additive));
                     
-                    l.remove(at: l.count - 1);
-                    composedCount -= 1;
+                    innerResolveCoreCtx.l.remove(at: innerResolveCoreCtx.l.count - 1);
+                    innerResolveCoreCtx.composedCount -= 1;
                 }
             } // end of operation validity test
+        } // end of commonAsync
+        
+        combine(&exploreAdditionOfRightNodeCtx.forwardValue);
+        try await commonAsync(&resolveCoreCtx, &exploreAdditionOfRightNodeCtx, isReverse: false, dispatcher);
+        uncombine(&exploreAdditionOfRightNodeCtx.forwardValue);
 
-            { (_ valueComponent: inout UInt32) in
-                valueComponent = kind.uncombine(valueComponent, rightChild.value);
-            }(&decomposedValue[reverse]!);
-        } // end of phase loop
-        
+        combine(&exploreAdditionOfRightNodeCtx.reverseValue);
+        try await commonAsync(&resolveCoreCtx, &exploreAdditionOfRightNodeCtx, isReverse: true, dispatcher);
+        uncombine(&exploreAdditionOfRightNodeCtx.reverseValue);
+
         if let _ = rightChild.expr {
-            otherComposedCount += 1;
+            resolveCoreCtx.otherComposedCount += 1;
         }
-        l.insert(rightChild, at: ((l.count+1) - composedCount - 1) - candidateReverseOffset);
+        resolveCoreCtx.l.insert(rightChild, at: ((resolveCoreCtx.l.count+1) - resolveCoreCtx.composedCount - 1) - candidateReverseOffset);
     } // end of candidate iteration loop
 }
 
diff -r d2ef6283be0b -r cdcaaddad141 AsyncCountdown/DispatchAPI.swift
--- a/AsyncCountdown/DispatchAPI.swift  Fri Jul 14 22:58:55 2023 +0200
+++ b/AsyncCountdown/DispatchAPI.swift  Sun Jul 02 19:01:50 2023 +0200
@@ -16,11 +16,8 @@
 @preconcurrency import Foundation
 import os;
 
-typealias Dispatcher = (_ l: inout [ReadyValue],
-                        _ composedCount: inout Int,
-                        _ otherComposedCount: inout Int,
-                        _ decomposedValue: inout [Bool:UInt32],
-                        _ downstreamSiblingsRecruitmentSkip: inout Int,
+typealias Dispatcher = (_ resolveCoreCtx: inout ResolveCoreCtx,
+                        _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                         _ kind: Op,
                         _ startingFrom: Int,
                         _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
@@ -39,19 +36,13 @@
          _ primitives: UInt32...) {
         let resultReceiver = { result in inqueue.async { incb(result); };};
         
-        func iteratePossibleLeftNodesFakeDispatch(_ l: inout [ReadyValue],
-                                                                _ composedCount: inout Int,
-                                                                _ otherComposedCount: inout Int,
-                                                                _ decomposedValue: inout [Bool:UInt32],
-                                                                _ downstreamSiblingsRecruitmentSkip: inout Int,
+        func iteratePossibleLeftNodesFakeDispatch(_ resolveCoreCtx: inout ResolveCoreCtx,
+                                                  _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                                                                 _ kind: Op,
                                                                 _ startingFrom: Int,
                                                                 _ walkOperands: @escaping  (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?)  {
-            iteratePossibleLeftNodes(&l,
-                                  &composedCount,
-                                  &otherComposedCount,
-                                  &decomposedValue,
-                                  &downstreamSiblingsRecruitmentSkip,
+            iteratePossibleLeftNodes(&resolveCoreCtx,
+                                  &exploreAdditionOfRightNodeCtx,
                                   resultReceiver,
                                   startGoal,
                                   kind,
@@ -73,11 +64,8 @@
             completion.notify(queue: inqueue,
                               execute: completionCb);
 
-            func iteratePossibleLeftNodesDispatch(_ l: inout [ReadyValue],
-                                                  _ composedCount: inout Int,
-                                                  _ otherComposedCount: inout Int,
-                                                  _ decomposedValue: inout [Bool:UInt32],
-                                                  _ downstreamSiblingsRecruitmentSkip: inout Int,
+            func iteratePossibleLeftNodesDispatch(_ resolveCoreCtx: inout ResolveCoreCtx,
+                                                  _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                                                   _ kind: Op,
                                                   _ startingFrom: Int,
                                                   _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?)  {
@@ -90,13 +78,10 @@
                  - 4 already means too few possibilities to explore
                  - 3 is right out
                  Note that among other properties, this estimator value is monotonic. */
-                guard startingFrom < (l.count - composedCount),
-                      l.count + (walkOperands({_,_ in return ();}) != nil ? 1 : 0) == 5 else {
-                    iteratePossibleLeftNodes(&l,
-                                             &composedCount,
-                                             &otherComposedCount,
-                                             &decomposedValue,
-                                             &downstreamSiblingsRecruitmentSkip,
+                guard startingFrom < (resolveCoreCtx.l.count - resolveCoreCtx.composedCount),
+                      resolveCoreCtx.l.count + (walkOperands({_,_ in return ();}) != nil ? 1 : 0) == 5 else {
+                    iteratePossibleLeftNodes(&resolveCoreCtx,
+                                             &exploreAdditionOfRightNodeCtx,
                                              resultReceiver,
                                              startGoal,
                                              kind,
@@ -107,26 +92,17 @@
                 }
                 
                 // reseat this divergence over a copy of the whole state
-                let childL = l;
-                let childComposedCount = composedCount;
-                let childOtherComposedCount = otherComposedCount;
-                let childDecomposedValue = decomposedValue;
-                let childDownstreamSiblingsRecruitmentSkip = downstreamSiblingsRecruitmentSkip;
-                
+                let paramResolveCoreCtx = resolveCoreCtx;
+                let paramExploreAdditionOfRightNodeCtx = exploreAdditionOfRightNodeCtx;
+
                 DispatchQueue.global(qos:.userInitiated).async(group: completion) {
                     os_signpost(.begin, log: LogWrapper.gObservationHandle, name: "micro-batch", signpostID: OSSignpostID(log: LogWrapper.gObservationHandle, object: Thread.current));
 
-                    var mutableL = childL;
-                    var mutableComposedCount = childComposedCount;
-                    var mutableOtherComposedCount = childOtherComposedCount;
-                    var mutableDecomposedValue = childDecomposedValue;
-                    var mutableDownstreamSiblingsRecruitmentSkip = childDownstreamSiblingsRecruitmentSkip;
+                    var childResolveCoreCtx = paramResolveCoreCtx;
+                    var childExploreAdditionOfRightNodeCtx = paramExploreAdditionOfRightNodeCtx;
 
-                    iteratePossibleLeftNodes(&mutableL,
-                                             &mutableComposedCount,
-                                             &mutableOtherComposedCount,
-                                             &mutableDecomposedValue,
-                                             &mutableDownstreamSiblingsRecruitmentSkip,
+                    iteratePossibleLeftNodes(&childResolveCoreCtx,
+                                             &childExploreAdditionOfRightNodeCtx,
                                              resultReceiver,
                                              startGoal,
                                              kind,

I kept attacking this challenge from multiple angles, until I remembered and relocated a blog post about a scalability issue. Its message: no matter how tight its implementation, an atomic counter can always become a scalability bottleneck by the mere fact of requiring its cache line to bounce between the L1D caches attached to the respective cores whenever more than one of them manipulates it concurrently; and this can easily, and sometimes implicitly, occur with reference counting.
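To make the mechanism concrete, here is a minimal, hypothetical Swift sketch (the names are mine, not the project's). A closure that captures a class instance is itself a reference-counted heap object, so every copy handed to another task triggers swift_retain/swift_release, i.e. atomic read-modify-writes on a refcount word that all the cores then contend for:

```swift
// Hypothetical illustration, not project code. A closure capturing a class
// instance is a reference-counted heap object; each copy passed to another
// task implies swift_retain/swift_release, i.e. atomic read-modify-writes
// on the same refcount word, whose cache line bounces between cores.
final class SharedOperands {
    let values: [Int]
    init(values: [Int]) { self.values = values }
}

func makeWalker(_ operands: SharedOperands) -> () -> Int {
    // Every copy of the returned closure retains `operands`: under heavy
    // task spawning, those retains/releases all hit the same cache line.
    return { operands.values.reduce(0, +) }
}

let walker = makeWalker(SharedOperands(values: [1, 2, 3]))
print(walker()) // prints 6
```

The data itself is immutable here; it is purely the reference-count traffic, not the payload, that causes the cache line to ping-pong.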

So I went looking for such a situation in my scenario; but even though I have some suspects, I first need data to know how much of these potential savings can really be had. And that is where I came up short on this Mac mini M2, as I failed to find a hardware counter that would help: none of them measure the case where a core requests a cache line that cannot be loaded from anywhere in its cache hierarchy, including the common RAM, until another core has written it back. Instead, I had to go back to my meager Intel-based MacBook with only two physical cores, the bare minimum where this can happen, since that machine does have counters to that effect. Do you hear me, Apple silicon guys? I had to fall back to an Intel processor (now gone from the Mac hardware lineup) in a 7-year-old Mac in order to try and improve software scalability on your brand spanking new hardware, because the latter is utterly unhelpful here.

Using MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM in the CPU counters Instruments template in event mode (once every 1000 events), which allows samples to be taken at or close to where such events happen, I do notice swift_release and swift_retain at the top of the inverted call tree profile, especially compared to the same profile for the sequential run; unfolding their callers reveals that most of them are invoked directly from my code.

Unfortunately, Instruments was unable to point out which lines (even in the assembly) corresponded to such calls: it either complained of being unable to locate the file in question, or pointed me to the first line in the file, etc. But there was really only one possibility for data that I share between concurrent tasks: the closures that represent the chain of operands of a given operation. I had expressed that data as closures because that was the most straightforward and economical way to make it available to all the computation trees that might need access to it, as the data is immutable once captured. Everything else is mutable and copied (if not exactly at the point of the explicit copy, then at worst immediately afterwards, when it is first modified and COW kicks in; that occurs at most once per micro-batch, so it is not worth worrying about).

However, Swift won't allow me to "fully copy" a closure, even shallowly; such an operation would probably be incompatible with closure semantics anyway. So when spawning a new task, instead of such a copy I replace each reference to such a closure with a new closure whose only purpose is to store that reference and forward control to it when invoked:

// First, wrap the closures into a no-op closure that will serve as a cap
let childWalkOperands = { @Sendable action in
    anchor_func();
    return walkOperands(action);
};

Here, this is the code for the closure of the operation currently being built; similar code replaces the corresponding reference found in the non-primitive ReadyValues of the available list (the array known as l, which is copied when a new task is spawned). void anchor_func(void) is a no-op assembly function, imported into Swift as a foreign C declaration, whose only purpose is to ensure the optimizer won't deem the new closure equal to the one it wraps and merge the two as a result (which would defeat my work). Its source code is:

	.globl	_anchor_func
	.p2align	2
_anchor_func:
	ret

After a bit of additional code to support this switcheroo, I profile my code again.

Swift with closure cap

 N|  T  | factor
 1|100  | ref
 2| 69.3|x1.44
 3| 62.0|x1.61
 4| 59.3|x1.69
 5| 48.7|x2.05
 6| 42.1|x2.37
 7| 38.6|x2.59
 8| 36.1|x2.77
 9| 36.2|x2.76
10| 35.9|x2.78
11| 36.0|x2.77
12| 35.9|x2.79

(methodology is identical to Swift without dictionaries, plus an attached patch)

Discussion

…well, the absolute performance did improve, as did the consistency of the results (neither of which is shown here), which is why I will keep this change anyway; profiling with the CPU counters instrument does show the event in question being triggered by swift_release and swift_retain less often, and the remaining cases appear to be only indirectly related to my code.

But this did not move the needle on scalability one bit. Back to the drawing board…

diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift     Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncAPI.swift     Sun Sep 17 12:12:20 2023 +0200
@@ -20,13 +20,13 @@
                         _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                         _ kind: Op,
                         _ startingFrom: Int,
-                        _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws -> Void;
+                        _ walkOperands: @escaping WalkerFuncNonGCD) async throws -> Void;
 
 typealias DispatcherPseudoReasync = (_ resolveCoreCtx: inout ResolveCoreCtxNonGCD,
                                      _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                         _ kind: Op,
                         _ startingFrom: Int,
-                        _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
+                        _ walkOperands: @escaping WalkerFuncNonGCD) throws -> Void;
 
 // Despite https://github.com/apple/swift-evolution/blob/main/proposals/0309-unlock-existential-types-for-all-protocols.md
 // I am unable to do the following, even with Xcode 13.2.1: I get the error that:
@@ -235,9 +235,29 @@
                                                               _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                               _ kind: Op,
                                                               _ startingFrom: Int,
-                                                              _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
+                                                              _ walkOperands: @escaping WalkerFuncNonGCD) async throws {
+                            // First, wrap the closures into a no-op closure that will serve as a cap
+                            let childWalkOperands = { @Sendable action in
+                                anchor_func();
+                                return walkOperands(action);
+                            };
+                            var interResolveCoreCtx = resolveCoreCtx;
+                            interResolveCoreCtx.l = resolveCoreCtx.l.map({e in
+                                return ReadyValueNonGCD(template: e,
+                                                        overridingExpr: e.expr.map({ innerE in
+                                        var innerR = innerE;
+                                        let innerOperandsBrowser = innerE.operatorsBrowser;
+                                        innerR.operatorsBrowser = { action in
+                                            anchor_func();
+                                            return innerOperandsBrowser(action);
+                                        };
+                                        return innerR;
+                                    })
+                                );
+                            });
+                            
                             // reseat this divergence over a copy of the whole state
-                            let paramResolveCoreCtx = resolveCoreCtx;
+                            let paramResolveCoreCtx = interResolveCoreCtx;
                             let paramExploreAdditionOfRightNodeCtx = exploreAdditionOfRightNodeCtx;
                             
                             let task = Task {
@@ -246,7 +266,7 @@
                                                                                            _ innerExploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                                                            _ innerKind: Op,
                                                                                            _ innerStartingFrom: Int,
-                                                                                           _ innerWalkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) {
+                                                                                           _ innerWalkOperands: @escaping WalkerFuncNonGCD) {
                                         iteratePossibleLeftNodesPseudoReasync(&innerResolveCoreCtx,
                                                                               &innerExploreAdditionOfRightNodeCtx,
                                                                               { result in innerResultsScopingGroup.addTask { await resultMediator.yield(result);}},
@@ -271,7 +291,7 @@
                                                                                       &childExploreAdditionOfRightNodeCtx,
                                                                                       kind,
                                                                                       startingFrom,
-                                                                                      walkOperands);
+                                                                                      childWalkOperands);
                                     
                                     os_signpost(.end, log: LogWrapper.gObservationHandle, name: "micro-batch", signpostID: OSSignpostID(log: LogWrapper.gObservationHandle, object: Thread.current));
                                 }
@@ -291,7 +311,7 @@
                                                                   _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                                                   _ kind: Op,
                                                                   _ startingFrom: Int,
-                                                                  _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?) async throws {
+                                                                  _ walkOperands: @escaping WalkerFuncNonGCD) async throws {
                             /* 1: Peek a bit: if we find out from startingFrom that no further left node is
                              going to be able to be added anyway (and remember: closing the composition as
                              it stands is performed in our caller), don't bother spawning a micro-batch.
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncCountdown-Bridging-Header.h
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/AsyncCountdown/AsyncCountdown-Bridging-Header.h   Sun Sep 17 12:12:20 2023 +0200
@@ -0,0 +1,13 @@
+//
+//  AsyncCountdown-Bridging-Header.h
+//  AsyncCountdown
+//
+//  Created by Pierre Lebeaupin on 15/09/2023.
+//
+
+#ifndef AsyncCountdown_Bridging_Header_h
+#define AsyncCountdown_Bridging_Header_h
+
+#include "anchor_func.h"
+
+#endif /* AsyncCountdown_Bridging_Header_h */
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncCountdown.swift
--- a/AsyncCountdown/AsyncCountdown.swift       Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncCountdown.swift       Sun Sep 17 12:12:20 2023 +0200
@@ -15,14 +15,15 @@
 
 import Foundation;
 
+typealias WalkerFuncNonGCD = @Sendable (_ action: (_ value: ReadyValueNonGCD,
+                                             _ reverse: Bool) -> Void?
+                                  ) -> Void?;
+   
 @available(macOS 12.0.0, iOS 15.0, *)
 struct ReadyValueNonGCD:Sendable {
     struct Expr {
         var op: Op;
-        var operatorsBrowser: @Sendable (_ action: (_ value: ReadyValueNonGCD,
-                                                    _ reverse: Bool)
-                                         -> Void?)
-        -> Void?;
+        var operatorsBrowser: WalkerFuncNonGCD;
     }
 
     let value: UInt32;
@@ -38,6 +39,11 @@
         expr = inExpr;
     }
     
+    init(template intemplate: Self, overridingExpr inexpr: Expr?) {
+        value = intemplate.value;
+        expr = inexpr;
+    }
+    
     func description() -> String {
          guard let localExpr = expr else {
              return value.description;
@@ -47,7 +53,7 @@
          var isFirst = true;
          var numRev = 0;
          
-         localExpr.operatorsBrowser({(_ subReadyValue: ReadyValueNonGCD,
+        localExpr.operatorsBrowser({(_ subReadyValue: ReadyValueNonGCD,
                     _ freverse: Bool) -> Void? in
              guard !freverse else {
                  numRev += 1;
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift       Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncExplorers.swift       Sun Sep 17 12:12:20 2023 +0200
@@ -46,10 +46,10 @@
     // Downstream ones are those created while this one is active
     
     try await dispatcher(&resolveCoreCtx,
-                     &reference,
-                     kind,
-                     /* startingFrom: */ skip,
-                     { _ in return nil;});
+                         &reference,
+                         kind,
+                         /* startingFrom: */ skip,
+                         { _ in return nil;});
 }
 
 /*
@@ -95,7 +95,7 @@
                                              _ startGoal: UInt32,
                                              _ kind: Op,
                                              _ startingFrom: Int,
-                                             _ walkOperands: @escaping @Sendable (_ action: (_ value: ReadyValueNonGCD, _ reverse: Bool) -> Void?) -> Void?,
+                                             _ walkOperands: @escaping WalkerFuncNonGCD,
                                              _ dispatcher: DispatcherAsync) async rethrows {
     let emptyComposition = (walkOperands({_,_ in return ();}) == nil)
 
@@ -137,10 +137,10 @@
             };
             
             try await innerDispatcher(&innerResolveCoreCtx,
-                             &innerExploreAdditionOfRightNodeCtx,
-                             kind,
-                             /* startingFrom: */ candidateReverseOffset,
-                             selfNode);
+                                      &innerExploreAdditionOfRightNodeCtx,
+                                      kind,
+                                      /* startingFrom: */ candidateReverseOffset,
+                                      selfNode);
             
             // close current composition
             var num = 0;
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/DispatchAPI.swift
--- a/AsyncCountdown/DispatchAPI.swift  Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/DispatchAPI.swift  Sun Sep 17 12:12:20 2023 +0200
@@ -20,7 +20,7 @@
                         _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                         _ kind: Op,
                         _ startingFrom: Int,
-                        _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?) throws -> Void;
+                        _ walkOperands: @escaping WalkerFunc) throws -> Void;
 
 
 @preconcurrency class ResolveObject {
@@ -40,7 +40,7 @@
                                                   _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                                                                 _ kind: Op,
                                                                 _ startingFrom: Int,
-                                                                _ walkOperands: @escaping  (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?)  {
+                                                                _ walkOperands: @escaping WalkerFunc)  {
             iteratePossibleLeftNodes(&resolveCoreCtx,
                                   &exploreAdditionOfRightNodeCtx,
                                   resultReceiver,
@@ -68,7 +68,7 @@
                                                   _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtx,
                                                   _ kind: Op,
                                                   _ startingFrom: Int,
-                                                  _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void?) -> Void?)  {
+                                                  _ walkOperands: @escaping WalkerFunc)  {
                 /* 1: Peek a bit: if we find out from startingFrom that no further left node is
                  going to be able to be added anyway (and remember: closing the composition as
                  it stands is performed in our caller), don't bother spawning a micro-batch.
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/anchor_func.h
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/AsyncCountdown/anchor_func.h      Sun Sep 17 12:12:20 2023 +0200
@@ -0,0 +1,4 @@
+#ifndef ANCHOR_FUNC
+#define ANCHOR_FUNC
+void anchor_func(void);
+#endif // ANCHOR_FUNC
diff -r 631593d9f4fa -r 7d85ed77952d AsyncCountdown/anchor_func.s
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/AsyncCountdown/anchor_func.s      Sun Sep 17 12:12:20 2023 +0200
@@ -0,0 +1,9 @@
+       .section        __TEXT,__text,regular,pure_instructions
+       .build_version macos, 13, 0     sdk_version 13, 3
+       .globl  _anchor_func
+       .p2align        2
+_anchor_func:
+       .cfi_startproc
+       ret
+       .cfi_endproc
+.subsections_via_symbols

Small update: relying on a (potentially target-dependent) assembly file and a C import, just because I didn't know how else to make code off-limits to the Swift optimizer, didn't sit well with me, so I made another attempt at a swiftier solution. I have observed that the compiler, both in its internal operation and presumably in how it hands its representation off to LLVM (since the latter will also remove abstractions if they are presented to it the right way), is particularly respectful of class instances and does not appear to simplify code across class boundaries (e.g. propagate constraints, inline code, etc.). So I need to involve a class somehow, yet still end up with a closure, because I need to insert this transparently, precisely where needed. This is how I accomplished it:

final class ThreadLocalCapGutsNonGCD:Sendable {
    let closure: WalkerFuncNonGCD;
    init(closure inclosure: @escaping WalkerFuncNonGCD) {
        closure = inclosure;
    }
    
    class func cap(_ inclosure: @escaping WalkerFuncNonGCD) -> WalkerFuncNonGCD
    {
        // Put the closure in a wrapper…
        let wrapper = Self(closure: inclosure);
        // …so a new closure can capture the wrapper…
        return { @Sendable action in
            // …and recover the original closure from that wrapper
            return wrapper.closure(action);
            // This ought to ensure the wrapping isn't optimized away.
        };
    }
}

After profiling with Instruments and the MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM counter in event mode, I confirm that this is sufficient to obtain the same effect of reduction of cache line bouncing as I previously did with the anchor_func()-based solution, and that there is no performance regression, so now I can get rid of the bridging header and the assembly source file.
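To make the mechanism concrete, here is a hypothetical, simplified sketch (all names here are stand-ins, not the project's types) of why the capping helps: each child task retains and releases its own wrapper object, so the refcount traffic is per-task instead of every task bouncing the cache line that holds the shared closure's reference count.

```swift
// Simplified stand-in for the project's walker type.
typealias Walker = @Sendable () -> Int

final class Cap: Sendable {
    let closure: Walker
    init(_ closure: @escaping Walker) { self.closure = closure }

    static func cap(_ closure: @escaping Walker) -> Walker {
        let wrapper = Cap(closure)
        // The returned closure captures the wrapper, not the original
        // closure, so retain/release traffic lands on the wrapper.
        return { @Sendable in wrapper.closure() }
    }
}

func fanOut(_ shared: @escaping Walker) async -> Int {
    await withTaskGroup(of: Int.self) { group in
        for _ in 0..<4 {
            let perTask = Cap.cap(shared)  // fresh wrapper per child task
            group.addTask { perTask() }
        }
        var total = 0
        for await result in group { total += result }
        return total
    }
}
```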

The patch:

diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift     Sun Sep 17 12:12:20 2023 +0200
+++ b/AsyncCountdown/AsyncAPI.swift     Tue Sep 19 22:41:45 2023 +0200
@@ -236,21 +236,13 @@
                                                               _ kind: Op,
                                                               _ startingFrom: Int,
                                                               _ walkOperands: @escaping WalkerFuncNonGCD) async throws {
-                            // First, wrap the closures into a no-op closure that will serve as a cap
-                            let childWalkOperands = { @Sendable action in
-                                anchor_func();
-                                return walkOperands(action);
-                            };
+                            let childWalkOperands = ThreadLocalCapGutsNonGCD.cap(walkOperands);
                             var interResolveCoreCtx = resolveCoreCtx;
                             interResolveCoreCtx.l = resolveCoreCtx.l.map({e in
                                 return ReadyValueNonGCD(template: e,
                                                         overridingExpr: e.expr.map({ innerE in
                                         var innerR = innerE;
-                                        let innerOperandsBrowser = innerE.operatorsBrowser;
-                                        innerR.operatorsBrowser = { action in
-                                            anchor_func();
-                                            return innerOperandsBrowser(action);
-                                        };
+                                        innerR.operatorsBrowser = ThreadLocalCapGutsNonGCD.cap(innerE.operatorsBrowser);
                                         return innerR;
                                     })
                                 );
diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/AsyncCountdown-Bridging-Header.h
--- a/AsyncCountdown/AsyncCountdown-Bridging-Header.h   Sun Sep 17 12:12:20 2023 +0200
+++ /dev/null   Thu Jan 01 00:00:00 1970 +0000
@@ -1,13 +0,0 @@
-//
-//  AsyncCountdown-Bridging-Header.h
-//  AsyncCountdown
-//
-//  Created by Pierre Lebeaupin on 15/09/2023.
-//
-
-#ifndef AsyncCountdown_Bridging_Header_h
-#define AsyncCountdown_Bridging_Header_h
-
-#include "anchor_func.h"
-
-#endif /* AsyncCountdown_Bridging_Header_h */
diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/AsyncCountdown.swift
--- a/AsyncCountdown/AsyncCountdown.swift       Sun Sep 17 12:12:20 2023 +0200
+++ b/AsyncCountdown/AsyncCountdown.swift       Tue Sep 19 22:41:45 2023 +0200
@@ -18,6 +18,33 @@
 typealias WalkerFuncNonGCD = @Sendable (_ action: (_ value: ReadyValueNonGCD,
                                              _ reverse: Bool) -> Void?
                                   ) -> Void?;
+
+/* The aim of this class is to provide each thread (or task, async block, etc.)
+ with its own reference that will get retain/release traffic, rather than all
+ threads sharing a single reference directly and causing the cache line
+ containing the reference count to bounce between cores.
+ 
+ In order to fit in structure fields and parameters that actually expect a
+ WalkerFuncNonGCD, this class needs to be wrapped by a closure of that type. */
+
+final class ThreadLocalCapGutsNonGCD:Sendable {
+    let closure: WalkerFuncNonGCD;
+    init(closure inclosure: @escaping WalkerFuncNonGCD) {
+        closure = inclosure;
+    }
+    
+    class func cap(_ inclosure: @escaping WalkerFuncNonGCD) -> WalkerFuncNonGCD
+    {
+        // Put the closure in a wrapper…
+        let wrapper = Self(closure: inclosure);
+        // …so a new closure can capture the wrapper…
+        return { @Sendable action in
+            // …and recover the original closure from that wrapper
+            return wrapper.closure(action);
+            // This ought to ensure the wrapping isn't optimized away.
+        };
+    }
+}
    
 @available(macOS 12.0.0, iOS 15.0, *)
 struct ReadyValueNonGCD:Sendable {
diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/anchor_func.h
--- a/AsyncCountdown/anchor_func.h      Sun Sep 17 12:12:20 2023 +0200
+++ /dev/null   Thu Jan 01 00:00:00 1970 +0000
@@ -1,4 +0,0 @@
-#ifndef ANCHOR_FUNC
-#define ANCHOR_FUNC
-void anchor_func(void);
-#endif // ANCHOR_FUNC
diff -r 7d85ed77952d -r b0dd9436c367 AsyncCountdown/anchor_func.s
--- a/AsyncCountdown/anchor_func.s      Sun Sep 17 12:12:20 2023 +0200
+++ /dev/null   Thu Jan 01 00:00:00 1970 +0000
@@ -1,9 +0,0 @@
-       .section        __TEXT,__text,regular,pure_instructions
-       .build_version macos, 13, 0     sdk_version 13, 3
-       .globl  _anchor_func
-       .p2align        2
-_anchor_func:
-       .cfi_startproc
-       ret
-       .cfi_endproc
-.subsections_via_symbols

OK, how much impact do you think this patch could possibly have on our matter?

diff -r 631593d9f4fa -r 07055be695f9 AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift       Mon Jul 10 22:51:12 2023 +0200
+++ b/AsyncCountdown/AsyncExplorers.swift       Sat Sep 23 14:19:11 2023 +0200
@@ -143,8 +143,7 @@
                              selfNode);
             
             // close current composition
-            var num = 0;
-            if (selfNode({_,_ in guard num == 0 else {return ();}; num += 1; return nil;}) != nil)
+            if /* parent is */ (!emptyComposition)
                 && ( (kind == .additive) ? innerExploreAdditionOfRightNodeCtx.forwardValue > innerExploreAdditionOfRightNodeCtx.reverseValue :
                             ((innerExploreAdditionOfRightNodeCtx.forwardValue % innerExploreAdditionOfRightNodeCtx.reverseValue) == 0) ) {
                 

I did not think it could have much of an impact myself, as I performed this change while in the middle of more significant rearrangements: the latter involved some changes to the signature of the selfNode parameter, and I considered it would be just as well to reuse a cached result here rather than have to keep tinkering with the signature of an anonymous closure that does not use its own parameters anyway, just to keep it in sync (if you don't feel up to reading how it works, just know this code makes sure an operation has at least two operands before we try to obtain its result). For reasons that will eventually become clear, I tested this patch in isolation, as directly applied to the "without dictionaries" case. What's going to be the outcome?
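To illustrate what the removed check was doing, here is a self-contained sketch with simplified stand-in types (not the project's): the walker calls an action once per operand and supports early termination, and the "at least two operands" test works by counting callbacks and stopping after the second.

```swift
// Simplified walker type: calls `action` once per operand; a non-nil
// return from `action` stops the enumeration early, and the walker then
// itself returns non-nil.
typealias Walker = ((_ value: Int, _ reverse: Bool) -> Void?) -> Void?

// Detect "at least two operands" by counting walker callbacks, stopping
// as soon as the second one is seen (as in the pre-patch code).
func hasAtLeastTwoOperands(_ walk: Walker) -> Bool {
    var num = 0
    return walk({ _, _ in
        guard num == 0 else { return () }  // second callback: stop early
        num += 1
        return nil                         // first callback: keep going
    }) != nil
}

// Example walker over a fixed operand list.
func makeWalker(_ operands: [Int]) -> Walker {
    return { action in
        for v in operands {
            if action(v, false) != nil { return () }
        }
        return nil
    }
}
```

The patch replaces this counting enumeration with a cached flag computed earlier, which is equivalent as long as the flag is kept in sync.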

Swift without dictionaries and with reduced walker calls

 N|  T  | factor
 1|100  | ref
 2| 62.8|x1.59
 3| 52.1|x1.92
 4| 48.1|x2.07
 5| 41.6|x2.40
 6| 37.2|x2.69
 7| 34.3|x2.91
 8| 32.3|x3.10
 9| 32.2|x3.11
10| 32.1|x3.12
11| 32.1|x3.12
12| 31.7|x3.15

(No change in methodology)

Discussion

For the first time I break x2 with just the performance cores, and x3 when adding the efficiency cores: now that is a significant improvement (the absolute performance of the reference also improved, in case you're curious). Now I have a pretty good idea of why that is, but I would like to hear your theories before I reveal it. Why did this small change have such an effect?

Bonus: in a further patch I also removed the call to walkOperands that sets emptyComposition (using instead specially added state to track the number of operands so as to obtain the same result), but that change did not have a significant impact on scalability. Why is that, in your opinion?

1 Like

So, about that performance and scalability change…

Remember earlier on that I made changes to avoid different CPUs simultaneously manipulating the same reference counts, by having different threads manipulate different references. These changes were ultimately unsuccessful at improving scalability, but I kept them nevertheless, and a common theme of these changes was to make manipulation of the references in question more performant at the expense of actually visiting these data structures. So when I noticed the scalability improvement of the tiny patch above, I first wondered whether that wasn't related to avoiding the increased expense of performing the extra calls. This is why I tested this patch when applied on the "without dictionaries" code: so as to test that hypothesis.

Given that the scalability improvements from the tiny patch were conserved even in that case, I deduced that these improvements had nothing to do with the cost of recovering the wrapped closure from its wrapper or whatever. So when I made the post above, I thought it might have to do with accessing the data guarded by walkOperands: even if we discard the copy provided when we just use the fact we're called in order to increment num, that could still result in scalability troubles, such as when we generate accesses to a ReadyValue that (as rightChild) may be simultaneously accessed by other tasks in a similar manner.¹

So I set out to determine what was going on here. But my hopes of finding that out with a well-chosen performance monitoring counter were soon dashed: here, none of them (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM, L2_RQSTS.RFO_MISS, or a few others I tried on a lark) told me anything I did not already know from the time profiler, even with the benefit of being able to compare between with and without the improvement (as opposed to just being able to profile the version without the improvement in order to look for such an improvement). In fact, in some cases I saw the profile of the unimproved version show a lower prevalence of potential suspect functions (including in absolute terms), while a longer tail of smaller functions was taking more CPU time – a telltale sign of victim code, without me being able to identify the perpetrating phenomenon.

Out of ideas, I tried to look for weaker signals, notably how the time in certain functions, possibly called in a certain way, would vary between the two scenarios, while ignoring or at least defocusing some other information.

To do that, seeing the limits of Instruments, I turned to flame graphs.

After some tinkering, I finally came up with these:

(before the improvement, then after. Fun fact: in order to come up with these I had to hack up support for generating flame graphs from inverted call trees. Isn't Perl fun?)

While there are some shifts here and there, the major change is the amount of swift_release calls by common(PseudoReasync). After the improvement, they're gone! Now, I did notice on occasion the compiler shifting responsibilities between some of my own functions without necessarily inlining them, including sometimes assigning these responsibilities to thunks², so I checked whether these calls had instead moved to other functions of mine, but no (with the possible exception of destructors, but that cannot be the effect of responsibility shifting coming from the compiler, can it?)³.

So, while I previously did not look too closely at where in my code the calls to the Swift runtime occur, here I ended up looking at the disassembly of my code⁴ and noticing that before the improvement, there is a call to swift_release in the main epilogue of common (along with an "outlined consume of ReadyValue.Expr?"), and there isn't any equivalent call in common after the improvement, including elsewhere in the same function; and if there were such an equivalent call elsewhere in my code (e.g. responsibility for calling it shifting to the caller of common somehow), it would show up in the flame graph, but it doesn't.

At this point, I still don't know for sure what reference it is that gets an extra swift_release (and swift_retain, presumably, though in a way that is much less performance-impacting), but I know enough.


The whole point of this exploration was to see how well Swift can scale by itself: even with assembly language and lock-free primitives, you're always bound to discover surprising scaling behaviour on your first try, so the important part is the process by which you discover and resolve those issues; here I wanted to see how far I could take Swift on that same road: when discovering issues, I wanted to see how far I could get resolving them by using more Swift, not less (and preferably keeping the code as idiomatic as possible: it is said you can write FORTRAN in any language…)

Concurrent Swift has a lot going for it, notably a Task API that gets more right than any I've previously seen. For instance, when I wrote the concurrent C version of this code, I implemented recycling of allocations so a task could reuse allocations made by the one that just completed; but the libdispatch API does not support the notion of recovering a "result" from an async block, so I had to create an ad-hoc system with a pair of semaphores that surely introduces bottlenecks of its own.
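By contrast, a child task handing its result back to whoever spawned it is built into Swift's task groups; a minimal sketch with a hypothetical workload (not the project's code):

```swift
// Toy stand-in for a unit of work; squares its input.
func compute(_ n: Int) async -> Int {
    return n * n
}

func runBatch() async -> Int {
    await withTaskGroup(of: Int.self) { group in
        for n in 1...4 {
            group.addTask { await compute(n) }
        }
        // Each completed child task hands its result back directly
        // through the group; this is the point where, say, allocations
        // could be recycled from a completed task to a new one, without
        // any ad-hoc semaphore machinery.
        var total = 0
        for await result in group { total += result }
        return total
    }
}
```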

The Swift Task API supports that out of the box, and this is the reason why I will keep tinkering with async Swift. But getting such variation in behaviour, comparable to that obtained by a non-trivial patch (the one which removed the dictionaries), as a side effect (not even the main effect, since the extra swift_release calls were not from the removed invocations proper) of a one-line change means I am always at the mercy of seemingly random performance variations that would drown out any intended improvement, without much help from the diagnostics or profiling tools available to me. In particular, the more elaborate the technique to help scalability (such as a system to recycle Swift allocations somehow), the more likely it would be for the compiler to fail at the reference counting and other optimisations affecting requests to the runtime that it is succeeding at right now, such that there would be little way for me to reason about what I am doing. Right now, to me, trying to make pure Swift code scale well up to 100% on all CPUs is a fool's errand.


1: Even if, as you've probably noticed, that enumeration supports early termination and so will not stray too far away in that particular invocation.

2: Ah, our friend partial apply for iteratePossibleLeftNodesFakeDispatchPseudoReasync #1 (_:_:_:_:_:) in closure #1 in closure #2 in iteratePossibleLeftNodesDispatch #1 (_:_:_:_:_:) in closure #1 in closure #1 in closure #2 in closure #1 in resolveAsync(_:_:)…

3: and I took into account the fact that _swift_release_dealloc is in fact tail-called by swift_release when looking for other culprits. Speaking of which, there is an attribution error in the flame graph: due to the use of "Flatten AsyncCountdown to boundary frames", those frames that include a destructor assign responsibility for the caller of _swift_release_dealloc directly to the bottom of the stack, since the destructor is considered the boundary frame; this does not happen in frames where _swift_release_dealloc is not calling any of my destructors. I can't remove this data mining setting, or I end up with illegible flame graphs that weigh multiple megabytes due to the use of indirect recursion in my code.

4: With the help of Instruments, which this time was somewhat cooperative for reasons I have not entirely explained: I worked differently in some aspects, notably working off saved executables instead of having them directly reside in the Xcode build directory, so as to be able to repeat experiments without recompiling back and forth between the version without improvements and the one with.

3 Likes

I keep worrying that nobody replying will send the wrong signal on this. I for one have found this interesting and useful to follow along with, even though I haven't contributed. So you have at least one interested reader (and I suspect more)! :slightly_smiling_face:

One thought I've had a few times is that I might be inclined to take a look at your example myself, when I get some free time, but that's hindered by not having a simple way to obtain your full code. Perhaps you could put it on GitHub or similar?

8 Likes

Agreed. I have been following this thread with great interest. I find it worrisome that no Apple engineer replied.

2 Likes

That is a fair remark; I haven't had a public code portfolio since BitBucket sunset their Mercurial hosting¹, so for this need I finally bit the bullet and evaluated my options. I ended up settling on SourceHut: ~zpedro/AsyncCountdown - Countdown numbers round solver written in async Swift - sourcehut hg

In particular, I added tags corresponding to the various profiled versions:

Don't forget to run RegenDerived.sh from the AsyncCountdown subfolder once downloaded/cloned (and redo so after each update, etc.): this is needed until reasync is available.


1: I prefer Mercurial to git so far in large part because of the Evolve extension, which is invaluable to refine a work in progress that needs to bounce between different machines

3 Likes

For various reasons I have resumed my exploration of this matter, and I have new results to show. However, for this evening I wanted to start by summing up the work so far as a basis for the forthcoming results, which will be visualized in the same way.

Indeed, how best to represent these results visually? You could plot the performance observed for each level of parallelism, but that is not necessarily the focus: such a plot could be useful where additional parallelism causes additional memory consumption, such that a tradeoff would be of interest, for instance, but here the additional memory consumption is a rounding error.

Instead, I represent the scalability improvement as a vertical bar, a bit as if you'd see such a curve sideways. That also allows representing the different versions next to one another.

Tomorrow I will show these new results and graph them along with the previously obtained ones.

2 Likes

As some of you may know, I've been reading Doug Gregor's series on introducing Swift to people coming from C++; I don't know if I'm exactly the kind of C++ programmer targeted here: my professional practice is in the embedded space, where the standards are rather more conservative than the C++17+ features taken as reference there, but I learned a few nuggets nevertheless.

But what most interested me was this one: withoutActuallyEscaping(_:do:). For you see, one of the difficulties with my algorithm is that I need to keep references to variables in active frames so lower frames can use them in case they are ever needed (for printing a solution when it is found, mostly). Thankfully, these variables can be captured immutably, so the references in question are easy to handle (e.g. they are trivially sendable); nevertheless, these captures need to happen often, need to be passed around even more often, and are invoked very rarely. But I also need to expose these references uniformly with other previously captured references, in practice here by stuffing the closure (which has a reference to the capture) in a data structure, and you know what that means: it becomes escaping, even though it won't actually escape; but I can't prove that to the compiler!

This is not just a problem in Swift, the Rust port of the algorithm has the exact same issue; the penalty might be lessened in a garbage-collected language that scans memory (I haven't tried), but it would still be there. I don't think I could expose the demonstration that the capture doesn't escape to anything short of a dedicated proof-writing language like Coq (or my fellow humans, of course), so I resigned myself to the penalty of escaping closures and their heap allocation as the price to pay for operating in a memory-safe language.

So let's try and see whether withoutActuallyEscaping improves the situation! Addition of this functionality was straightforward, except I needed to add it to a second place in order to avoid the dreaded Passing a non-escaping function parameter 'action' to a call to a non-escaping function parameter can allow re-entrant modification of a variable diagnostic… and, oh, yeah, I also needed to further distinguish the async parts from the non-async ones, as the captures made in the async part (the main wagon) may actually escape, since the trainees may cause them to be kept alive while the main wagon has already backtracked from that branch of the bifurcation (but remember CPU utilization of the main wagon is a rounding error).
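The pattern itself can be sketched as follows; the names (`useStored`, `sumTwice`) are hypothetical stand-ins, with the array standing in for the data structure that forces the closure's type to be escaping even though, dynamically, it never outlives the call.

```swift
// Stand-in for the data structure: array elements of function type are
// @escaping by nature, so anything stored here must be escaping-typed.
func useStored(_ stored: [() -> Int]) -> Int {
    return stored.reduce(0) { $0 + $1() }
}

func sumTwice(_ body: () -> Int) -> Int {
    // `body` is non-escaping; wrap it so it can transit through the array.
    return withoutActuallyEscaping(body) { escapable in
        let stored = [escapable, escapable]
        return useStored(stored)
        // On exit, the runtime verifies that `escapable` did not in fact
        // escape (the array died before we returned); this runtime check
        // is a plausible source of overhead.
    }
}
```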

Let's look at the results:

 N|  T  | factor
 1|100  | ref
 2| 75.0|x1.33
 3| 67.1|x1.49
 4| 64.5|x1.55
 5| 51.9|x1.93
 6| 45.1|x2.22
 7| 42.8|x2.33
 8| 39.3|x2.54
 9| 39.3|x2.54
10| 39.8|x2.52
11| 39.4|x2.54
12| 39.1|x2.56

…ouch. That hurt. It's fortunate I did not spend too much time on this, because we are back at the scalability of the naive Swift version. It looks like the runtime checking of withoutActuallyEscaping is having an impact, though I can't tell anything beyond that: the changes in code have caused quite a shift in the captured stacks, resulting in the flame graphs (not shown here) being hard to compare. Suffice it to say I did not see any obvious inefficiency to remove, or any way to salvage the effort.

In the end, I reverted the change.

To be continued…

3 Likes

Small intermezzo: additional false starts.

First, as a way to eliminate, or at least reduce, reference counting traffic on the memory bus, I have investigated the keywords presented in the "move-only types" proposal, borrowing and consuming, since they also apply to references and were stabilized as a result of this proposal.

The goal being, of course, to more tightly control when and whether "retain" and "release" are called on walkOperands instances.

But first, I needed to understand the existing situation and in particular to understand when those would be applied implicitly: mostly, consuming is implicit in parameters to constructors, and borrowing is implicit otherwise, unless of course the parameter is already marked mutating/inout. And when I looked at transmissions or uses of walkOperands instances, some I could eliminate, which left us with two structural transmissions:

after which no further use of that variable occurs in the function! Which means that, due to the order already chosen (for unrelated reasons), my code was already following a scenario best suited for ownership transfer, and sprinkling it here with additional keywords would amount, at best, to cargo cult programming.
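As a toy illustration of the implicit defaults just described (hypothetical types, not the project's; requires a Swift 5.9+ toolchain):

```swift
final class Payload { let n: Int; init(n: Int) { self.n = n } }

struct Holder {
    let payload: Payload
    // Initializer parameters are implicitly `consuming`: when the call
    // site is the argument's last use, ownership can be transferred
    // without a retain/release pair.
    init(payload: consuming Payload) { self.payload = payload }
}

// Non-initializer parameters default to `borrowing`: the callee promises
// not to keep the reference beyond the call, so no refcount traffic is
// needed. (Writing the keywords out, as here, just restates the default.)
func inspect(_ payload: borrowing Payload) -> Int { payload.n }

func demo() -> Int {
    let p = Payload(n: 7)
    let seen = inspect(p)       // borrow: no retain
    let h = Holder(payload: p)  // last use of p: ownership transferred
    return seen + h.payload.n
}
```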

But, wait, surely there are additional, more subtle sources of retain/release activity… and yes there is:

  • the ReadyValue thus created, by causing the reference to the closure to escape, indirectly causes the parent closure walkOperands to escape too, requiring iteratePossibleLeftNodes to hold on to the parameter it had just been borrowing so far
  • and the ReadyValue being removed from the ready list and assigned to rightChild is itself captured by the same closure, causing a copy of that structure as a result of the closure being escaping, copy which causes a retain (and an eventual release) of its own expr.operatorsBrowser.

This more implicit retain/release churn is entirely related to captures and cannot be solved by these keywords, which apply to parameters. And as seen yesterday, making that closure non-escaping in order to solve all those issues did not provide the expected results.


Second, the release of the Apple Silicon CPU Optimization Guide confirms what was discussed earlier: while that guide does discuss lock contention patterns (false sharing, in particular) and how to avoid them, these sections are conspicuously devoid of any reference to a relevant performance monitoring counter. These simply don't exist on Apple Silicon CPUs as of this writing.

As hinted before, after considering the problem from various angles I came to the conclusion the only way forward was to forego reference types entirely.

Indeed, this algorithm requires a lot of tracking of references. In fact, in the purest version of the algorithm, what is the "l" array in the Swift version is another linked list, except one where elements can be added or removed in the middle, using references that point within the list. But even with "l" being an array, that is still many references to track, here by counting them using atomic instructions (or atomic-like ones, such as load-acquire/store-release pairs).

In particular, I suspect I may not necessarily be penalized by contention over specific cache lines, but in fact just by the need to broadcast all these atomic memory operations over the core cluster hierarchy: even when these operations concern different cache lines, the announcement channel still has to deal with the announcements themselves anyway, whether that channel is a common bus or a more structured setup.

If we can't use reference types, we'll have to standardize on everything being represented with value types. And I do mean everything, so including the solution tree being built so far (that with which we can print a solution when it is found). As a result, I considered actually representing it as a value-type tree, using enum and indirect, but I found the task daunting, not to mention a little bit risky: how do I know there are no intermediate reference counts in there? In addition, there was the risk of it not interacting well with the features being developed in parallel. Finally, my inability to prove to the compiler that my borrows can't escape is likely to come back to bite me. In the end, handling arbitrarily large data structures as value types, without the benefit of COW (which is a library feature, not a language one), did not sit well with me.

So I turned to an old technique, where data structures never contain addresses/pointers/references/etc., and instead only contain offsets or indexes into a well-known array (a technique most used on machines where the data bus was narrower than the address one, such that storing full addresses in memory was not necessarily practical or efficient¹). This does result in a runtime penalty (besides the bounds checking of Swift, of course): even on architectures like ARM that support indexed addressing, some purposes which could make use of the various addressing modes in pointer-based code will have to find another way in index-based code. Additionally, on ARM the indexed addressing mode can only scale by the transfer size, while here the scale will need to be by a structure size (and the compiler won't be able to pre-scale the index in most cases, for reasons I will expose). And to top it off, the previously-linked Apple Silicon CPU Optimization Guide mentions having dedicated forwarding from one load operation to the base operand of the next, for pointer chasing, which won't apply to the indexed operand. But remember I am most interested in scalability here, not absolute performance.

And here I encounter my first issue, as I do still need a value like nil, this time as an index rather than a reference, to mark the end of linked lists and whatnot. But I know from experience that UIntN? is very inefficient, whether stored in a struct or as a local variable; they tend to result in dynamic allocations in particular, which we are trying to avoid as well (EDIT: cf post 17). The solution came from Julia, where the pattern is to steal a valid value for use as a marker instead. I'm afraid I'm repeating the billion-dollar mistake again, since there is no way to treat variables that are meant to possibly turn out to have this marker value differently from variables that can't, but I'll take it.

Beyond that, nothing more than additional crimes against proper data representation (e.g. I add the operation being built to "l", now a linked list, while the operation is still in an incomplete state, as I need a placeholder so I can uniformly retrieve any other element, whether it is just below or deeper in the linked list). At least I was able to add type checking so I can't just index any array through any random index variable. Surprisingly few changes to the work dispatch scaffolding, which was mostly able to rely on existing abstractions.
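A minimal sketch of the index-based representation (all names hypothetical, not the project's): nodes live in a preallocated array and link to each other by index, with a stolen sentinel value standing in for nil.

```swift
struct Node {
    static let noIndex = UInt32.max  // stolen sentinel, never a real index
    var value: Int
    var next: UInt32                 // index of next node, or Node.noIndex
}

struct NodePool {
    var storage: [Node] = []         // the well-known array
    var head: UInt32 = Node.noIndex

    mutating func push(_ value: Int) {
        storage.append(Node(value: value, next: head))
        head = UInt32(storage.count - 1)
    }

    func sum() -> Int {
        // Index chasing instead of pointer chasing: every hop goes
        // through `storage`, paying scaling and bounds checks, but no
        // reference counts are touched anywhere.
        var total = 0
        var i = head
        while i != Node.noIndex {
            total += storage[Int(i)].value
            i = storage[Int(i)].next
        }
        return total
    }
}
```

Wrapping the raw `UInt32` in a dedicated index type per array is what provides the type checking mentioned below.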

So, were these transgressions worth it?

 N|  T  | factor
 1|100  | ref
 2| 57.4|x1.74
 3| 39.3|x2.55
 4| 34.8|x2.87
 5| 31.5|x3.18
 6| 28.7|x3.49
 7| 26.9|x3.73
 8| 26.0|x3.84
 9| 24.7|x4.06
10| 24.0|x4.17
11| 24.8|x4.04
12| 24.9|x4.02

Oh, yeah, now we're cooking with gas. In particular, we're now within spitting distance of C. And all it took was a return to the original structured programming practices!

Okay, I jest: Swift does provide me with more memory safety than any language of the time did. Nevertheless, since I need to recycle instances, I miss the benefits of immutability (recycling implies mutation) and definite initialization (what I recover is basically garbage). And while I like structs/aggregations/records/product types just as much as anyone, they prevent the compiler from performing a full lifetime analysis of the fields therein, so the optimizer will not be able to apply some optimizations such as pre-scaling of array indexes, while it could if these fields had been local variables instead, even ones captured by closures.

Next step: merging that with the features developed in parallel.


1: though everything old is new again, as CHERI (I recommend Gankra's explainer) could again make pointers unwieldily large at 16 bytes…

This won’t result in extra independent allocations in Swift; it’s closer to a C struct with a “valid” flag attached. It’ll waste a bunch of space in your array because of alignment though, so having a sentinel is probably still better.
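The space cost can be observed directly with MemoryLayout (the exact figures are an implementation detail, but these hold on current 64-bit Swift toolchains): the Optional adds a validity byte, which alignment then pads out to a full extra word of stride per array element.

```swift
// A plain UInt32 occupies 4 bytes with a 4-byte stride.
print(MemoryLayout<UInt32>.size, MemoryLayout<UInt32>.stride)
// UInt32? needs an extra tag byte (size 5), which 4-byte alignment
// rounds up to a stride of 8: double the space per element.
print(MemoryLayout<UInt32?>.size, MemoryLayout<UInt32?>.stride)
```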

(Conversely, indirect is always refcounted, so that wouldn’t have helped in the end.)

Tree traversal is definitely one of the weaker areas in Swift’s performance toolbox. There are ideas on how to improve it, but restructuring to indexed storage will probably still beat most of them if you’re willing to pay the ergonomics cost. And as you noted, at least you can use types to keep from making obvious mistakes there.

6 Likes

I may have misremembered; however, testing that old project again (an earlier experiments project, not published) and profiling it, I end up with instances of outlined init with take of UInt?, themselves calling __swift_instantiateConcreteTypeFromMangledName. You're probably right: since those don't call anything, much less any allocation function, this is not actual memory allocation. Nevertheless, those show that the penalty is not just in the array itself: that array was preallocated at start, so those init were of something other than array elements.

This is a moot point anyway: there is no doubt that an, ahem, particularized sentinel value is preferable, as profiling the following revision of that unpublished project, where I switched to using them, confirmed that performance improved: by 40%…

This is a bit of an issue, as algorithms involving trees are where I'm most likely to want to extract performance out of Swift. Indeed, if I'm looking to optimize algorithms with more predictable execution paths, maybe I will write the first version in Swift, but then I am going to want to target a programmable GPU, since everyone has one of those and they do well on branchless number crunching; and Swift is not going to help me with that (even counting the differentiable Swift experiments, since those were aimed at gradient descent, i.e. ML, rather than general-purpose computations).