I took my quick & dirty chess example and used it as a benchmark, experimenting with struct vs class types and different value storage methods.
// M1 Pro, release, maxDepth=6, duration of the first move with randomisation switched off.
type    storage         time    ARC
struct  unsafe pointer   1.5s   no
class   unsafe pointer   1.9s   yes (via class)
struct  1D tuple         4.2s   no
class   1D tuple         7.1s   yes (via class)
struct  1D log tuple     4.7s   no
class   1D log tuple     4.9s   yes (via class)
struct  2D tuple         9.7s   no
class   2D tuple        12.3s   yes (via class)
struct  1D array         5.1s   yes (via array)
class   1D array         2.4s   yes (via class and array)
struct  2D array        17.2s   yes (via array)
class   2D array         3.0s   yes (via class and array)
In the table: "unsafe pointer" is a linear malloc'ed area used as a 1D array; "1D tuple" is a tuple of 64 cells; "2D tuple" is a tuple of 8 tuples of 8 cells; "1D / 2D array" are the analogous array setups. "1D log tuple" is the same as "1D tuple", except that the modified subscript does a "logarithmic lookup" to drill down to the relevant cell rather than relying on the behavior of the built-in switch statement. The 2D lookup is "elements[y][x]" and the 1D lookup is "elements[y*8 + x]".
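To make two of these storage variants concrete, here is a minimal sketch of what "unsafe pointer" and "1D log tuple" could look like. None of it is taken from the actual benchmark: the names `Cell`, `PointerBoard` and `LogTuple8` are hypothetical, the cell type is assumed to be a single byte, and the tuple is shortened to 8 cells (three binary decisions) to keep the example small, where the benchmark used 64 (six decisions).

```swift
// Hypothetical cell type; the original post does not show how pieces are stored.
typealias Cell = UInt8

// "unsafe pointer" storage: one linear 64-cell allocation indexed as y*8 + x.
// The struct holds only a raw pointer, so copying it causes no ARC traffic,
// but copies share the same memory and the owner must call destroy() by hand.
struct PointerBoard {
    private let cells: UnsafeMutablePointer<Cell>

    init() {
        cells = UnsafeMutablePointer<Cell>.allocate(capacity: 64)
        cells.initialize(repeating: 0, count: 64)
    }

    subscript(x: Int, y: Int) -> Cell {
        get { cells[y * 8 + x] }
        nonmutating set { cells[y * 8 + x] = newValue }
    }

    func destroy() {
        cells.deinitialize(count: 64)
        cells.deallocate()
    }
}

// "log tuple" storage, shortened to 8 cells: the subscript picks the element
// with a binary decision tree instead of one switch over all indices.
// Indices outside 0...7 are not checked in this sketch.
struct LogTuple8 {
    var t: (Cell, Cell, Cell, Cell, Cell, Cell, Cell, Cell) = (0, 0, 0, 0, 0, 0, 0, 0)

    subscript(i: Int) -> Cell {
        get {
            if i < 4 {
                if i < 2 { return i == 0 ? t.0 : t.1 }
                return i == 2 ? t.2 : t.3
            } else {
                if i < 6 { return i == 4 ? t.4 : t.5 }
                return i == 6 ? t.6 : t.7
            }
        }
        set {
            if i < 4 {
                if i < 2 {
                    if i == 0 { t.0 = newValue } else { t.1 = newValue }
                } else {
                    if i == 2 { t.2 = newValue } else { t.3 = newValue }
                }
            } else {
                if i < 6 {
                    if i == 4 { t.4 = newValue } else { t.5 = newValue }
                } else {
                    if i == 6 { t.6 = newValue } else { t.7 = newValue }
                }
            }
        }
    }
}
```

The pointer variant trades value semantics and automatic cleanup for the absence of reference counting, which is presumably why it sits at the top of the table; whether the decision tree beats the compiler's switch lowering is something only a measurement can tell.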
There is a notable anomaly at the end of the table, where the struct + array versions are slower than the corresponding class + array versions (TBD why). In the other cases the struct version outperformed the class version, quite possibly due to ARC overhead (though there could be other performance differences when you switch from struct to class). Also notable: the tuple versions are not that fast.
This is not the most direct benchmark, though. The best approach would be to run the ARC-triggering versions of this app in single-threaded mode and toggle the above-mentioned "atomic ARC" switch.
Is it possible to somehow override retain/release globally? For diagnostic purposes only, not going to ship this to a store, just to use it locally and measure the overhead of ARC operations in a single threaded app.
Oh, and the rest of the story for us: for the classes where we do this, we have a resizable pool of instances that are allocated once and cleaned for reuse.
That makes mass allocations essentially free and ARC overhead almost gone (there are still those high-frequency calls to the retain/release hooks).
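A resizable pool of this kind might look roughly like the sketch below. It is an illustration, not our production code: `Node`, `NodePool`, `reset()`, and the doubling growth policy are all hypothetical, and the pool assumes a single owning thread, so it takes no locks.

```swift
// Hypothetical pooled class; reset() returns an instance to a clean state.
final class Node {
    var value = 0
    func reset() { value = 0 }
}

// Resizable pool: instances are allocated in batches and recycled afterwards.
// Not thread-safe; this sketch assumes one thread owns the pool.
final class NodePool {
    private var free: [Node] = []
    private(set) var allocated = 0

    init(initialCapacity: Int) {
        grow(by: max(initialCapacity, 0))
    }

    // Allocates `count` fresh instances and adds them to the free list.
    private func grow(by count: Int) {
        free.reserveCapacity(free.count + count)
        for _ in 0..<count { free.append(Node()) }
        allocated += count
    }

    // Hands out a recycled instance, doubling the pool when it runs dry.
    func acquire() -> Node {
        if free.isEmpty { grow(by: max(allocated, 1)) }
        return free.removeLast()
    }

    // Cleans the instance and makes it available for reuse.
    func release(_ node: Node) {
        node.reset()
        free.append(node)
    }
}
```

Handing references in and out of the array still performs retain/release on them, so a pool like this removes the allocation cost rather than the reference counting itself.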
Going further into the future, as we move away from unified memory as our core counts grow, you could imagine single-core allocators that use this same motif and are even cheaper (a single-core, single-threaded allocator doesn't need locks).
This whole thing can be done inside the language! So people don't see all this complexity and get even more of a win!