[Pitch] Core team publishes results of performance study: Cooperative Scheduler just introduced, plus compile for non-atomic ARC

I took my quick & dirty chess example and used it as a benchmark, experimenting with struct vs class types and different value storage methods.

```
// M1 Pro, release, maxDepth=6, duration of the first move
// with randomisation switched off.

type    storage         time    ARC

struct  unsafe pointer  1.5s    no
class   unsafe pointer  1.9s    yes (via class)

struct  1D tuple        4.2s    no
class   1D tuple        7.1s    yes (via class)

struct  1D log tuple    4.7s    no
class   1D log tuple    4.9s    yes (via class)

struct  2D tuple        9.7s    no
class   2D tuple        12.3s   yes (via class)

struct  1D array        5.1s    yes (via array)
class   1D array        2.4s    yes (via class and array)

struct  2D array        17.2s   yes (via array)
class   2D array        3.0s    yes (via class and array)
```

In the table: "unsafe pointer" is a linear malloc'ed area used as a 1D array; "1D tuple" is a tuple of 64 cells; "2D tuple" is a tuple of 8 tuples of 8 cells; "1D / 2D array" is the same setup with arrays. "1D log tuple" is the same as "1D tuple", except the modified subscript operation does a "logarithmic lookup" to drill down to the relevant cell rather than relying on the built-in switch-statement behaviour. The 2D lookup is obviously `elements[y][x]`, and the 1D lookup is `elements[y*8 + x]`.

There is a notable anomaly at the end of the table, where the struct + array version works slower than the corresponding class + array version (TBD why). In all other cases the struct version outperformed the class version, quite possibly due to ARC overhead (though there could be other performance differences when you switch from struct to class). Also notable: the tuple versions are not that fast.

This is not the most direct benchmark, though; the best approach would be to run the ARC-triggering versions of this app in single-threaded mode and toggle the above-mentioned "non-atomic ARC" switch.

Is it possible to somehow override retain/release globally? This is for diagnostic purposes only - I'm not going to ship it to a store, just use it locally to measure the overhead of ARC operations in a single-threaded app.


Great ideas!

Please embed your code example in triple backticks (```) - it would be formatted more sanely.

Like so:

```c
#include <stdbool.h>
#include <stdint.h>

extern void (_swift_retain)(void *);
extern void (_swift_release)(void *);
extern void (_swift_retain_n)(void *, uint32_t);

// Forward declarations so the class layout below compiles standalone.
typedef struct objc_ivar_list objc_ivar_list;
typedef struct objc_methodlist objc_methodlist;
typedef struct objc_cache objc_cache;

typedef struct HeapClass HeapClass;
struct HeapClass {
    HeapClass* isa;
    HeapClass* super;
    const char* name;
    long version;
    long info;
    long instancesize;
    objc_ivar_list* ivars;
    objc_methodlist* methods;
    objc_cache* cache;
    objc_cache* protoColL;
    uint32_t flags;
    uint32_t instanceAddressPoint;
    uint32_t instanceSize;
    uint16_t alignMaskAndBits;
    uint16_t reserved;
    uint32_t classSize;
    uint32_t classAddressPoint;
    void* description;
};

typedef uintptr_t __swift_uintptr_t;

typedef struct {
    __swift_uintptr_t refCounts;
} InlineRefCountsPlaceholder;

typedef InlineRefCountsPlaceholder InlineRefCounts;

// In the runtime's heap object the field right after isa is:
//     InlineRefCounts refCounts;
// mightBeBeef below overlays its low 32 bits.
typedef struct {
    HeapClass* isa;
    uint32_t mightBeBeef;
    uint32_t storedPropsB;
    uint32_t storedPropsC;
    uint32_t e;
    // uint32_t f;
} HeapObjectX;

#ifndef __ptrauth_objc_isa_pointer
#define __ptrauth_objc_isa_pointer
#endif

static bool isNoArcObject(void *object) {
    if (object) {
        HeapObjectX* objectX = (HeapObjectX*) object;
        if (objectX->mightBeBeef == 0xDEADBEEF) {
            return true;
        }
        return false;
    } else {
        return true; // return true for NULL so we don't call the built-in
    }
}
```

Thanks for these:

```c
extern void (_swift_retain)(void *);
extern void (_swift_release)(void *);
extern void (_swift_retain_n)(void *, uint32_t);
```

will play with those.



I made a quick repo and put the full code here:
It's the actual code we use, and you'll see there's some debug spew stuff for monitoring things too.


GitHub - brightenai/swiftexamples: some stuff in the open around swift perf discussions


Oh, and for us the rest of the story is that for the classes where we do this, we have a resizable pool of instances that are allocated once and cleaned for reuse.

It makes mass allocations essentially free, and ARC is almost gone (aside from those high-frequency calls to the retain/release hooks).

Going further into the future - as we move away from unified memory while core counts grow - you could imagine single-core allocators that use this same motif and are even cheaper. (Single-core, single-thread allocators don't need locks.)

This whole thing could be done inside the language! Then people wouldn't see all this complexity and would get even more of a win!

So hey, thanks Tera for leading the charge!

Love swift :ukraine:

We added that (with some help) to Benchmark to be able to capture ARC traffic when benchmarking.

You can have a look at:


for some inspiration if you want to play with it.