Not-so-small intermezzo: to what exactly can we attribute the lack of scalability, back when we still used references?
Earlier in this exploration, most if not all impediments to scalability were removed when the algorithm was rewritten to eschew references, which is certainly a strong lead towards indicting them. However, I thought it would be useful to better characterize the effect of references; in particular, I was wondering whether their frequent use could cause bottlenecks not only within our process, but also across processes: what if my computational load could cause other processing on the system to slow down through memory bus activity, before we even saturate the cores?
When developing a mental model of scalability, I find it useful to think of the computing system as a collection of diverse resources; any scalability issue can then be modeled as one resource or another being (at least at some point) the bottleneck. Sometimes the resources in question are obvious, e.g. a core to which our thread can be allocated, but sometimes less so, for instance write bandwidth to a specific filesystem device. Keep in mind that a cache line at a particular address can be one such resource, for instance when it is used as a semaphore in some manner (e.g. it contains a mutex, or atomic memory operations directly target an address within it, etc.), so it is best to think of these resources as being defined in software, even if their semantics end up being enforced in hardware.
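To make that last point concrete, here is a minimal sketch of a software-defined resource becoming the bottleneck; it is mine rather than from the project, and assumes the swift-atomics package is available. No lock is ever held, yet every increment needs exclusive ownership of the cache line holding the counter, so adding threads stops helping well before the cores are saturated:

import Atomics
import Dispatch

// Nothing here is a lock, yet the cache line holding `sharedCounter`
// serializes all eight workers: each atomic increment requires
// exclusive ownership of that line before it can proceed.
let sharedCounter = ManagedAtomic<Int>(0)

DispatchQueue.concurrentPerform(iterations: 8) { _ in
    for _ in 0..<1_000_000 {
        sharedCounter.wrappingIncrement(ordering: .relaxed)
    }
}
print(sharedCounter.load(ordering: .relaxed)) // always 8000000, just slowly

Giving each worker its own counter and only merging at the end would dissolve this particular bottleneck; whether such per-thread segregation actually holds for refcounted objects is exactly what the evaluations below probe.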
evaluation 1: scalability when the code treats different "requests" in different processes
Let us get back to the revision immediately prior to the no-allocations rewrite (equivalent to swift-without-dictionaries-and-with-reduced-walker-calls), and run that code with a "cpu count" scenario of 1… but 100 instances of it, with parallelism controlled by xargs (-n 1 passes one argument per invocation, -P sets how many invocations run concurrently); first, the reference scenario (multiple runs were made, but they all had consistent results):
echo {0..99} | net_wanderingcoder_cpucount=C /usr/bin/time xargs -n 1 -P 1 /Users/pierretemp2/Library/Developer/Xcode/DerivedData/AsyncCountdown-bpbnpedmqnnszzeqogzfyrevtzpv/Build/Products/Release/AsyncCountdown
# ...
99,72 real 99,72 user 4,06 sys
Good, let us crank inter-process parallelism to 4 (again, multiple runs but so consistent I am only showing one):
echo {0..99} | net_wanderingcoder_cpucount=C /usr/bin/time xargs -n 1 -P 4 /Users/pierretemp2/Library/Developer/Xcode/DerivedData/AsyncCountdown-bpbnpedmqnnszzeqogzfyrevtzpv/Build/Products/Release/AsyncCountdown
# ...
27,42 real 109,82 user 7,79 sys
99.72/27.42 is x3.637; in other words, better scalability than what we got from C, with a scalability overhead of less than 10%. The 8-way parallelism scenario confirms that (multiple runs but so consistent I am only showing one):
echo {0..99} | net_wanderingcoder_cpucount=C /usr/bin/time xargs -n 1 -P 8 /Users/pierretemp2/Library/Developer/Xcode/DerivedData/AsyncCountdown-bpbnpedmqnnszzeqogzfyrevtzpv/Build/Products/Release/AsyncCountdown
# ...
18,13 real 132,86 user 2,85 sys
99.72/18.13 = x5.500 is obviously less than x8 as we spill from the performance cores to the efficiency ones, but what matters is that it is infinitely better than the best intra-process scalability we ever obtained here. And that was with the C version: the code we're benchmarking here in fact managed x3.15 at best, when the parallelism was internal to the process.
With these figures, I assert there is no need to look at all intermediate parallelism values or further characterize how this scales: it does so as linearly as we could hope for.
So the first conclusion here is: whatever the scalability issues discussed here, they are unlikely to cause two separate processes to slow each other down before core saturation comes into play.
evaluation 2: scalability when the code treats different "requests" in a single process
So, are we good here? Of course not: we have just shown that the same code that had poor scalability when we tried to run it on multiple tasks within the same process has no trouble scaling with itself when run in separate processes, disproving the theory that the bottleneck was a global hardware resource¹. But we had also made sure that different refcounted objects were used in different threads, and that did not improve the situation any, so the bottleneck resource cannot be any of those objects either; what's left?
But wait, did we really make sure of that? How about we try this: take the same 100 "requests", but now run them within the same process. That will allow us, for instance, to have the same number of instances of resolveCore() running as in the previous evaluation, whether in total or at any given time, while varying a single variable: how these instances share processes. Not at all, in evaluation 1; all in the same process, in evaluation 2.
But to do that, we can no longer rely on xargs; we will have to provide that code, which I eventually tagged as parallelism-experiments-evaluation-2. Here is a minimal sketch of what such a driver could look like; this is mine, not the tagged code: runOneRequest() stands in for the actual per-request entry point, and only the convention of encoding the parallel width in the length of the environment variable value is taken from the actual runs below (the case of each letter plausibly selects a core type, which this sketch ignores).
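import Foundation

// Parallel width is encoded in the length of the variable's value,
// matching the runs below: C is 1, CCCC is 4, CCCCcccc is 8.
let width = ProcessInfo.processInfo
    .environment["net_wanderingcoder_parallelcount"]?.count ?? 1

// Hypothetical stand-in for the real per-request work.
func runOneRequest(_ request: Int) async {
    // ... solve one countdown instance here ...
}

await withTaskGroup(of: Void.self) { group in
    for request in 0..<100 {
        // Once all slots are taken, wait for a request to complete
        // before launching the next one, keeping `width` in flight.
        if request >= width {
            _ = await group.next()
        }
        group.addTask { await runOneRequest(request) }
    }
}

Let us run the reference scenario (multiple runs but so consistent I am only showing one):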
net_wanderingcoder_parallelcount=C net_wanderingcoder_cpucount=C /usr/bin/time /Users/pierretemp2/Library/Developer/Xcode/DerivedData/AsyncCountdown-bpbnpedmqnnszzeqogzfyrevtzpv/Build/Products/Release/AsyncCountdown
# ...
83,92 real 112,71 user 3,56 sys
Timing is lower (better) than the reference scenario of evaluation 1, as I also ported a few performance improvements in between. Now with 4-way parallelism (multiple runs but so consistent I am only showing one):
net_wanderingcoder_parallelcount=CCCC net_wanderingcoder_cpucount=C /usr/bin/time /Users/pierretemp2/Library/Developer/Xcode/DerivedData/AsyncCountdown-bpbnpedmqnnszzeqogzfyrevtzpv/Build/Products/Release/AsyncCountdown
# ...
45,14 real 219,05 user 7,34 sys
83.92/45.14 is x1.859. This is completely comparable to the scalability observed when we adjusted net_wanderingcoder_cpucount rather than net_wanderingcoder_parallelcount. Let us look at 8-way… (multiple runs but so consistent I am only showing one)
net_wanderingcoder_parallelcount=CCCCcccc net_wanderingcoder_cpucount=C /usr/bin/time /Users/pierretemp2/Library/Developer/Xcode/DerivedData/AsyncCountdown-bpbnpedmqnnszzeqogzfyrevtzpv/Build/Products/Release/AsyncCountdown
# ...
33,16 real 254,78 user 0,57 sys
83.92/33.16 = x2.531. This is in fact bad enough that we need to have a closer look². To the flame graphs!
There is clearly something going on with (partial apply for) iteratePossibleLeftNodesFakeDispatchPseudoReasync(), which (as a reminder) is a function that does not do much at all: it pretty much forwards control to the next recursion, because we're already in a subtask and not allowed to further subdivide the batch we've been tasked with.
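For illustration, here is a loudly hypothetical sketch of the shape of such a function; the actual signature in the project differs, and ExplorationState is a stand-in of my own. The point is that when a function like this is passed along as a closure value, the compiler materializes it through a "partial apply" thunk, and any reference-typed context captured along the way incurs retain/release traffic at every recursion level, which would be consistent with it showing up in the flame graph despite doing no real work:

// Stand-in for the solver's state; the real type differs.
struct ExplorationState {
    var depth = 0
}

// Already inside a subtask, we are not allowed to subdivide the batch
// any further, so the function merely hands control to the next level
// of the recursion.
func iteratePossibleLeftNodesFakeDispatchPseudoReasync(
    _ state: ExplorationState,
    _ recurse: (ExplorationState) -> Void
) {
    recurse(state)
}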
The flame graph for 8-way parallelism is harder to analyze than the 4-way one: since we have spilled over to the efficiency cores as well, the profile obtained from those CPUs is necessarily affected, and additional phenomena come into play.
Unfortunately, this is where I come up short in my analysis: I have no idea where to go next, whether from the CPU documentation or from anything in the Xcode toolchain. I can only conclude this:
While I can't apportion responsibility between the Apple Silicon hardware and the Swift runtime, know that manipulating references causes scalability issues when attempting to scale within a single address space, even when the objects being referred to have been segregated between different threads to seemingly the largest extent possible.
1: worse, that discovery does not advance the initial goal, as the multi-process approach only allows throughput improvements (i.e. treating more requests per second), without improving the processing latency of a single request, which is what we had been working on so far.
2: In fact, an earlier version of the code had a bug where the parallelism was off by one, i.e. the "reference" scenario already benefitted from 2-way parallelism, which obviously skewed the whole thing; after fixing it, I double-checked all further builds with Instruments.