As hinted before, after considering the problem from various angles I came to the conclusion that the only way forward was to forego reference types entirely.
Indeed, this algorithm requires a lot of reference tracking. In fact, in the purest version of the algorithm, what is the "l" array in Swift is another linked list, except one where elements can be added or removed in the middle, using references that point within the list. But even with "l" being an array, that still leaves many references to track, here by counting them using atomic instructions (or atomic-like ones, such as load-acquire/store-release pairs).
In particular, I suspect I may not necessarily be penalized by contention over specific cache lines, but rather by the sheer need to broadcast all these atomic memory operations over the core cluster hierarchy: even when these operations concern different cache lines, the announcement channel still has to carry the announcements themselves, whether that channel is a common bus or a more structured setup.
If we can't use reference types, we'll have to standardize on representing everything with value types. And I do mean everything, including the solution tree built so far (the one with which we can print a solution when it is found). As a result, I considered actually representing it as a value-type tree, using enum and indirect, but I found the task daunting, not to mention a little risky: how do I know there are no intermediate reference counts in there? In addition, there was the risk of it not interacting well with the features being developed in parallel. Finally, my inability to prove to the compiler that my borrows can't escape was likely to come back to bite me. In the end, handling arbitrarily large data structures as value types, without the benefit of COW (which is a library feature, not a language one), did not sit well with me.
So I turned to an old technique, where data structures never contain addresses/pointers/references/etc., and instead only contain offsets or indexes into a well-known array (a technique most used on machines where the data bus was narrower than the address bus, such that storing full addresses in memory was not necessarily practical or efficient¹). This does incur a runtime penalty (besides Swift's bounds checking, of course): even on architectures like ARM that support indexed addressing, some operations that could exploit the various addressing modes in pointer-based code will have to find another way in index-based code. Additionally, on ARM the indexed addressing mode can only scale by the transfer size, while here the scale will need to be a structure size (and the compiler won't be able to pre-scale the index in most cases, for reasons I will explain). And to top it off, the previously-linked Apple Silicon CPU Optimization Guide mentions dedicated forwarding from one load operation to the base operand of the next, for pointer chasing, which won't apply to an indexed operand. But remember, I am most interested in scalability here, not absolute performance.
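To illustrate the general shape of the technique (this is a made-up minimal sketch, not my actual solver code; the `Node` and `IndexedList` names are hypothetical), here is a linked list whose nodes all live in one flat array and link to each other by index instead of by reference:

```swift
// A node links to the next one by index into the backing array,
// not by reference, so the struct is trivial: no retain/release.
struct Node {
    var value: Int
    var next: Int   // index of the next node in `storage`, or -1 for "none"
}

struct IndexedList {
    var storage: [Node] = []
    var head: Int = -1

    // Prepend by appending a node to the backing array and pointing
    // `head` at its index; no reference counting is involved anywhere.
    mutating func push(_ value: Int) {
        storage.append(Node(value: value, next: head))
        head = storage.count - 1
    }

    // Walk the list by chasing indexes rather than pointers.
    func toArray() -> [Int] {
        var result: [Int] = []
        var i = head
        while i != -1 {
            result.append(storage[i].value)
            i = storage[i].next
        }
        return result
    }
}
```

Note that traversal now loads `storage[i].next` with an index computed from `i`, rather than dereferencing a pointer directly, which is exactly where the addressing-mode and load-forwarding costs described above come in.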
And here I encountered my first issue, as I still need a value like nil, this time as an index rather than a reference, to mark the end of linked lists and whatnot. But I know from experience that UIntN? is very inefficient, whether stored in a struct or as a local variable; in particular, such optionals tend to result in dynamic allocations, which we are trying to avoid as well (EDIT: cf post 17). The solution came from Julia, where the pattern is to steal a valid value for use as a marker instead. I'm afraid I'm repeating the billion-dollar mistake, since there is no way to treat variables that may legitimately hold this marker value differently from variables that can't, but I'll take it.
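A minimal sketch of that stolen-marker pattern, with hypothetical names: the marker is just one reserved bit pattern of an ordinary trivially-copyable integer, so the field stays a plain `UInt32` with no optional overhead — and, as noted, nothing stops you from accidentally using the marker as a real index:

```swift
// "Steal" one valid bit pattern (UInt32.max here, by assumption) to
// stand in for nil, instead of paying for a UInt32? representation.
struct NodeIndex: Equatable {
    var raw: UInt32

    static let none = NodeIndex(raw: .max)  // the stolen marker value

    var isNone: Bool { self == .none }
}
```

Unlike an Optional, the compiler won't force you to check `isNone` before use; that discipline is back on the programmer, which is the billion-dollar-mistake trade-off mentioned above.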
Beyond that, nothing more than additional crimes against proper data representation (e.g. I add the operation being built to "l", now a linked list, while the operation is still in an incomplete state, as I need a placeholder so I can uniformly retrieve any other element, whether it sits just below or deeper in the linked list). At least I was able to add type checking so that I can't just index any array through any random index variable. Surprisingly, the work dispatch scaffolding needed few changes, as it was mostly able to rely on existing abstractions.
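One common way to get that kind of index type checking (a sketch under assumed names, not necessarily how I did it) is to tag the index with a phantom type parameter, so that an index into one table cannot be used to subscript a different one, even though both are just wrapped integers:

```swift
// Phantom tags: never instantiated, they exist only to distinguish
// index types at compile time.
enum Ops {}
enum Solutions {}

struct Index<Table> {
    var raw: Int
}

// An array that can only be subscripted with an index bearing
// its own tag.
struct TypedArray<Table, Element> {
    var elements: [Element] = []

    subscript(_ i: Index<Table>) -> Element {
        get { elements[i.raw] }
        set { elements[i.raw] = newValue }
    }

    mutating func append(_ e: Element) -> Index<Table> {
        elements.append(e)
        return Index(raw: elements.count - 1)
    }
}
```

With this, `ops[i]` compiles only if `i` is an `Index<Ops>`; handing it an `Index<Solutions>` is a type error rather than a silent out-of-context access.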
So, were these transgressions worth it?
| N | T | factor |
|---:|---:|:---|
| 1 | 100 | ref |
| 2 | 57.4 | x1.74 |
| 3 | 39.3 | x2.55 |
| 4 | 34.8 | x2.87 |
| 5 | 31.5 | x3.18 |
| 6 | 28.7 | x3.49 |
| 7 | 26.9 | x3.73 |
| 8 | 26.0 | x3.84 |
| 9 | 24.7 | x4.06 |
| 10 | 24.0 | x4.17 |
| 11 | 24.8 | x4.04 |
| 12 | 24.9 | x4.02 |
Oh, yeah, now we're cooking with gas. In particular, we're now within spitting distance from C. And all it took was a return to the original structured programming practices!
Okay, I jest: Swift does provide me with more memory safety than any language of that era did. Nevertheless, since I need to recycle instances, I miss out on the benefits of immutability (recycling implies mutation) and definite initialization (what I recover is basically garbage). And while I like structs/aggregations/records/product types just as much as anyone, they prevent the compiler from performing a full lifetime analysis of the fields therein, so the optimizer will not be able to apply some optimizations, such as pre-scaling of array indexes, that it could if these fields had been local variables instead, even ones captured by closures.
Next step: merging that with the features developed in parallel.
1: though everything old is new again, as CHERI (I recommend Gankra's explainer) could again make pointers unwieldy, at 16 bytes…
