I recently acquired a Mac mini M2 (4 performance cores, 4 efficiency cores); even though I did not get the M2 Pro (much less the SMP monsters featured in the Mac Studio), I am very interested in seeing how well I can exploit these CPUs using thread-level parallelism. Particularly so with Swift, which was released at a time when all Macs and iOS devices were already multi-core and seemed tailored to that multi-core reality, a potential finally unlocked by the concurrency features introduced in Swift 5.5.
So imagine my disappointment when I ran my experimentation project on it and obtained the following run times:
Naive Swift
  N |  time | speedup
  1 | 100.0 | ref
  2 |  83.5 | x1.20
  3 |  69.5 | x1.44
  4 |  79.5 | x1.26
  5 |  60.8 | x1.64
  6 |  50.0 | x2.00
  7 |  47.2 | x2.12
  8 |  39.8 | x2.51
  9 |  45.2 | x2.21
 10 |  45.1 | x2.22
 11 |  45.3 | x2.21
 12 |  39.2 | x2.56
The first column is the number of tasks that my code spawns before requiring that a task complete as a condition for an additional task to be spawned; the work is processed as micro-batches whose size is independent of both the number of processors and this parameter. I went up to 12 as a sanity check. (A sketch of this spawning policy follows the column descriptions.)
The second column is the time taken in each scenario, normalized so that the sequential case (i.e. where the above parameter is 1; that code goes through the async APIs too, so the comparison is fair) takes 100: I am not interested in absolute performance here, only in whether it scales as expected. Smaller numbers are better.
The third column is how much better each scenario performs than the reference, expressed as a speedup factor; it is simply 100 divided by the second column, presented for convenience. Higher numbers are better.
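To make that policy concrete, here is a minimal sketch of the kind of windowed spawning loop involved (names and signatures are hypothetical, not the project's actual code): at most `width` tasks are in flight, and each completion admits one more micro-batch into the task group.

```swift
/// Minimal sketch of the spawning policy (hypothetical names, assumes width >= 1):
/// at most `width` tasks are in flight; each completion admits one more.
func processAll<T: Sendable, R: Sendable>(
    _ items: [T],
    width: Int,
    _ work: @escaping @Sendable (T) async -> R
) async -> [R] {
    await withTaskGroup(of: R.self) { group in
        var results: [R] = []
        results.reserveCapacity(items.count)
        var iterator = items.makeIterator()

        // Prime the group with up to `width` concurrent tasks.
        for _ in 0..<width {
            guard let item = iterator.next() else { break }
            group.addTask { await work(item) }
        }

        // Require a completion before each additional spawn; this is the
        // knob reported as N in the tables. Results arrive in completion
        // order, not submission order.
        while let result = await group.next() {
            results.append(result)
            if let item = iterator.next() {
                group.addTask { await work(item) }
            }
        }
        return results
    }
}
```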
Methodology: I opened my project, which you can find as the attachment to the May 17th comment at It's time for `reasync` · Issue #65475 · apple/swift · GitHub, with Xcode 15.0 beta 8 and profiled it with the time profiler template. For each run, I took the difference between the end time and the initialization time, both as reported by Instruments. After an initial run, whose results were discarded so as to warm the caches, each scenario was run 3 times, and the median time was kept as the scenario outcome. Parallelism was controlled through an environment variable, which requires the following patch to be applied over the project version from May 17th¹; that the requested parallelism was actually reached was confirmed by inspecting the time profiler instrument (except for parallelism values above 8). The complexity of the computation was chosen so that each scenario completes in a handful of seconds, even the reference (longest-duration) scenario, so as to avoid thermal throttling effects.
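As an illustration of that control mechanism (the variable name here is hypothetical; the actual change is the patch in footnote 1), the width can be read once at startup:

```swift
import Foundation

// Hypothetical illustration; the actual change is the patch in footnote 1.
// The spawning width is read from the environment, defaulting to 1
// (the sequential reference scenario).
let width = ProcessInfo.processInfo.environment["MAX_PARALLELISM"]
    .flatMap(Int.init) ?? 1
```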
Discussion
Right off the bat, there are obviously some issues with these results: you can see performance regress when going from 3-way to 4-way parallelism, and again when going from 8-way to 9-way (before recovering at 12); there appear to be bimodal behaviors at work. But the general trend is clear: scalability is awful. I double-checked the time profiler instrument for obvious misbehavior, such as a single task completing much later than the others and delaying the end of processing, but no such luck. For sanity, I also checked on the machine I was relying on before (2016 13-inch MacBook Pro, no Touch Bar), but I saw similar results when going from 1 to 2 parallel tasks.
When I tried the dedicated Swift concurrency template, it failed to provide much insight: to begin with, it appears to be overwhelmed by the number of tasks spawned (~5000), even though there are never more than N live at any given time, and my processing slows to a crawl as a result.
Looking at the profiling results, I started to have ideas for improvements, but first I wanted to establish what kind of scalability I could legitimately expect, so as to know what to demand of my Swift code. So I reimplemented my processing in C, with concurrency obtained through libdispatch.
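The bounded-width pattern maps naturally onto libdispatch; here is a sketch of one common way to express it (in Swift syntax for consistency with the rest of this post; the actual reimplementation is plain C and may be structured differently):

```swift
import Dispatch

// Sketch only: a counting semaphore with `width` slots gates how many
// work items are in flight on a concurrent queue at any time.
func processAllDispatch(_ items: [Int], width: Int, work: @escaping (Int) -> Void) {
    let queue = DispatchQueue(label: "processing", attributes: .concurrent)
    let inFlight = DispatchSemaphore(value: width)
    let group = DispatchGroup()

    for item in items {
        inFlight.wait()              // block until a slot frees up
        queue.async(group: group) {
            work(item)
            inFlight.signal()        // completing an item frees a slot
        }
    }
    group.wait()                     // drain the remaining work items
}
```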
Fast-forward a few weeks…
C with libdispatch
  N |  time | speedup
  1 | 100.0 | ref
  2 |  55.9 | x1.79
  3 |  36.5 | x2.74
  4 |  28.8 | x3.47
  5 |  28.3 | x3.53
  6 |  27.5 | x3.64
  7 |  26.4 | x3.79
  8 |  24.6 | x4.07
  9 |  24.0 | x4.16
 10 |  23.7 | x4.23
 11 |  22.8 | x4.38
 12 |  22.8 | x4.38
Methodology is similar; the only difference from the Swift version is the computational load, which was adjusted so that the total duration of the reference scenario is in the same ballpark, so as to obtain comparable thermal throttling effects, if any.
Discussion
These results are much better behaved, for a start (this is confirmed by the individual runs, whose groupings are tight), and while scaling is not linear from 1 to 4, it is close enough that I don't feel computational resources are being wasted. As expected, going from 4 to 8 improves performance much less, as we get into the efficiency cores, but it still improves the situation. Performance keeps improving beyond 8, possibly because we afford the system more opportunities for scheduling useful work, but we have clearly hit diminishing returns by that point.
Coming back to the Swift version, one peculiarity I noticed in the profile is that some symbols related to dictionaries (hashing-related infrastructure, in particular) jump from nowhere near the top in the reference runs to the top of the symbol list (sorted by descending weight in the inverted call tree) for all other runs, even when parallelism is barely 2. So I worked on eliminating dictionaries from my naive Swift code and repeated the experiment.
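To illustrate the kind of transformation this entailed (a simplified sketch only; the actual changes are in the patch attached below², and the key space in the real code is not this regular): when keys are dense integers in a known range, an Int-keyed dictionary can be replaced by a directly indexed array, which removes the hashing work entirely.

```swift
// Simplified illustration only; see the attached patch² for the real changes.

// Before: every access hashes the key.
var histogram: [Int: Int] = [:]
func record(_ key: Int) {
    histogram[key, default: 0] += 1
}

// After: plain indexing, no hashing (assumes 0 <= key < keySpace).
let keySpace = 1 << 16
var counts = [Int](repeating: 0, count: keySpace)
func recordFast(_ key: Int) {
    counts[key] += 1
}
```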
Swift without dictionaries
  N |  time | speedup
  1 | 100.0 | ref
  2 |  69.1 | x1.45
  3 |  62.0 | x1.61
  4 |  58.6 | x1.71
  5 |  48.3 | x2.07
  6 |  41.9 | x2.39
  7 |  37.9 | x2.64
  8 |  36.1 | x2.77
  9 |  36.1 | x2.77
 10 |  36.1 | x2.77
 11 |  35.9 | x2.78
 12 |  36.0 | x2.77
Methodology is the same as for the naive Swift table; the only difference is the improvement patch, attached here² (I am not happy with some of the compromises I had to make in order to remove the dictionary usage; I welcome suggestions for handling the situation better).
Discussion
This code is already much better behaved than the naive Swift version (this is all the more obvious in the individual runs). Scalability is much better at the earlier levels, though the behavior is somewhat erratic (a much better improvement going from 4 to 5 than from 3 to 4 parallel tasks), and the speedup caps out at a level not much better than before.
At any rate, we are still far from the scalability that could legitimately be expected of this algorithm.
Further improvements
The improvements I obtained by removing dictionaries suggest that the scalability issues I am having may not be related solely to the way I use the Swift concurrency APIs, but also to how the Swift code proper behaves, so I think this should be a significant avenue for investigation.
Unfortunately, I have failed to extract any further leads from what Instruments or Xcode can tell me, so I am looking for ways to investigate the remaining scalability impediments: tooling, experiments to perform, known issues, best practices, etc.