Improving Swift code for scalability

Technique 1: build your data structures without references; instead, index into well-known arrays that are grouped in structures, along with other context, and passed as inout parameters

(This is a single technique as the whole is more than the sum of its parts)

Using indexes appears to me to be the only reliable way to eliminate references from Swift code, which is a goal since previous exploration has shown their manipulation to hamper scalability due to retain/release traffic. The effect on scalability persists even when contention over cache lines is eliminated: the penalty appears as soon as these atomic operations are executed, regardless of the addresses involved.

But everyone has to agree on the arrays and collaboratively maintain them. In Swift, the least painful way to do so is to have the arrays be passed as inout parameters in every function that needs access to them: that means functions receiving these parameters may (and likely will) need to in turn provide these as inout parameters to their own child functions, and so forth! There are two main reasons for requiring this scheme:

  • Other language features for accessing shared variables (such as globals or captures), assuming you could even render them thread-safe, require expensive beginAccess/endAccess calls to dynamically check access exclusivity. These are simply not a factor with inout parameters: with those, it is statically known that the function has exclusive access to the array, and lends that exclusive access to its own child functions.
  • Factoring the state this way makes it easy for the whole state to be duplicated, allowing a fork where one task follows one side of a bifurcation, and another the other side; but without actually relying on fork(2) (I don't even know whether Swift is fork(2)-safe, and this is a moot point anyway as some environments such as Windows or embedded don't provide it). How? By copying the arrays (trivial: Swift arrays are value types, no matter how large), then providing, as the inout parameter, the copy of the array to the function called inside the subtask.¹

And finally, the arrays should be grouped together, along with other shared variables, as much as possible into a single structure: everything with the same lifetime should be grouped. This seems obvious, but it wasn't to me: I started by providing a bunch of unstructured types as inout parameters, which caused a lot of churn. There are no downsides to this grouping that I know of.
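As a minimal sketch of the whole arrangement (every name here is made up for illustration; none of these are the project's types):

struct Cell: Sendable { var value: Int; var next: Int; }

struct CellContext: Sendable {
    var cells: [Cell];   // the well-known array
    var bestSoFar: Int;  // other shared state with the same lifetime
}

func explore(_ ctx: inout CellContext, startingFrom index: Int) {
    ctx.cells[index].value += 1;  // exclusive access, statically known
    descend(&ctx, index);         // lend that exclusive access downward
}

func descend(_ ctx: inout CellContext, _ index: Int) { /* and so on, inout all the way down */ }

func fork(_ ctx: inout CellContext, startingFrom index: Int) {
    // reseat the divergence over a copy of the whole state: trivial,
    // since Swift arrays are value types no matter how large
    let paramCtx = ctx;
    Task {
        var taskCtx = paramCtx;
        explore(&taskCtx, startingFrom: index);
    }
}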

This technique is best seen in action, in the current revision of the code as of this writing, in the context of this patch:

diff --git a/AsyncCountdown/AsyncAPI.swift b/AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift
+++ b/AsyncCountdown/AsyncAPI.swift
@@ -288,6 +288,7 @@
                                               _ exploreAdditionOfRightNodeCtx: inout ExploreAdditionOfRightNodeCtxNonGCD,
                                               _ kind: Op,
                                               _ startingFrom: IndexedReferenceNonGCD<CountdownArrayCellNonGCD, WritableNonGCD>) async throws {
+
             // reseat this divergence over a copy of the whole state
             let paramResolveCoreCtx = resolveCoreCtx;
             let paramExploreAdditionOfRightNodeCtx = exploreAdditionOfRightNodeCtx;

1: Addendum: this means indexes are meaningless to transmit across such divergences, whether through the outcome of a Task or through inter-task primitives, because the arrays may (and likely will) have diverged in the meantime; however, I am not aware of a way to prevent such transfers (short of making indexes non-sendable, but that would kill the whole point of copying state for a new Task if the indexes for that state couldn't follow). You wouldn't send pointers through IPC, now, would you? In the same way, don't send indexes between Tasks that have already diverged.

Technique 2: rather than being directly expressed as primitive types, these indexes must be expressed as structures that wrap those and instead provide accessors for "dereferencing"

You want these indexes, being a substitute for references, to be clearly distinguished from values that are best expressed as integers in structures and the like, so expressing indexes as single-property structure instances is a given.
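For illustration, a minimal shape such a wrapper could take (g, for "guts", and the phantom Target parameter are as used in the rest of this post; everything else is an assumption):

struct IndexedReferenceNonGCD<Target: Sendable, AccessRights: Sendable>: Sendable {
    let g: UInt32;  // the wrapped index; Target and AccessRights are phantom type parameters
}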

But now that we have defined a new type, it's the perfect vehicle to avoid some infelicities, such as enqueue(table[ctx.top.g].contents), instead enabling enqueue(ctx.top[table].contents)¹: it's not just that we avoid the .g (for "guts") in repeated usage (if that were the main issue, we'd use key path forwarding for that), it's also that the expression can now be read fully left to right: "from ctx we take top, which we dereference through table, and from that we access contents"

This accessor is easy enough to set up: we define a subscript whose parameter type is… that of a well-known array:

extension IndexedReferenceNonGCD
{
    subscript<Element> (_ ctx: TableNonGCD<Element, Target>) -> Element {
        get {
            return ctx.g[Int(self.g)];
        }
    }
}

Note that IndexedReference was previously defined as having a generic type parameter Target that it never manipulates instances of: its role is solely to check that the index corresponds to … something … in the well-known array.

Unfortunately, that is only going to work as a get accessor: subscript parameters are immutable, so a setter along the same principles, which would need to mutate the well-known array, is not going to compile. We're going to have to be more subtle next time…

Meanwhile, this technique is best seen in action, in the current version of the code as of this writing, in the context of this patch:

diff --git a/AsyncCountdown/AsyncExplorers.swift b/AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift
+++ b/AsyncCountdown/AsyncExplorers.swift
@@ -116,6 +116,7 @@
             exploreAdditionOfRightNodeCtx.downstreamSiblingsRecruitmentLoc = currentLoc;
         }
 
+
         guard case let .value(op: candidateOp, value: candidateValue) = rightChild[resolveCoreCtx.mainTable].contents else {
             fatalError("Found element in l not in the value state");
         }

1: I do admit this syntax has got shades of 5["string"] (which is ostensibly valid C as it translates to *(5+"string") which is the same as "string"[5]), but it is readable in practice.

Technique 3: rather than a subscript, define a method on the index type that takes a closure to permit modification of the "pointed-to" data

This definition is going to look like this:

func open<Element, T>(_ context: inout TableNonGCD<Element, Target>, _ cb: (_ spot: inout Element) -> T) -> T {
    return cb(&context.g[Int(self.g)]);
}

But the important part is the expressions that are going to be possible with this method.

Bonus: you can define this method in a separate extension that is dependent on a second generic type parameter AccessRights being Writable, in case you want the equivalent of the distinction between const Element* and Element* in your code.
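A sketch of that bonus (assuming the rights parameter is indeed named AccessRights): the open method above moves into a constrained extension, so it is only available on writable references.

extension IndexedReferenceNonGCD where AccessRights == WritableNonGCD {
    // Same body as before, but the compiler now enforces the
    // const Element* / Element* distinction at the type level.
    func open<Element, T>(_ context: inout TableNonGCD<Element, Target>,
                          _ cb: (_ spot: inout Element) -> T) -> T {
        return cb(&context.g[Int(self.g)]);
    }
}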

This technique is best seen in action, in the current version of the code as of this writing, in the context of this patch:

diff --git a/AsyncCountdown/AsyncExplorers.swift b/AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift
+++ b/AsyncCountdown/AsyncExplorers.swift
@@ -152,6 +152,7 @@
             /* Should normally be done before pushing it to the contribution list,
              but will be safe as there are only loop management calls in the
              interval. */
+
             rightChild.open(&resolveCoreCtx.mainTable, {
                 $0.contents.upgradeToContribution(direction: direction);
             });

Technique 4: when marker values are needed, such as a sentinel to mark the end of a list, don't just steal a value: define a type that is allowed to have this value

The obvious solution, an optional, will in this particular case cause various inefficiencies, the main one being doubled memory requirements in arrays. The same goes for any data-bearing enumeration type in the same situation. But that doesn't mean you can just use IndexedReference for everything: proper typing remains necessary, or you'd just be repeating the billion-dollar mistake.

So what's left is defining a whole new single-property structure that wraps the same primitive type, whose main initializer creates an instance with the stolen marker value:

extension PotentiallySentinelIndexedReferenceNonGCD
{
    init(sentinelFor type: Target.Type, rights: Rights.Type) {
        g = .max;
    }
}

While this type can also store valid index values, it will not have the "accessors" defined earlier, which is a good thing: instead, this new type will have an initializer to convert from a "guaranteed not sentinel" index, and a dedicated u() (for "unwrap") method that either returns such an index, or raises some failure if it turns out to hold the stolen marker value¹: you must first map back to an ordinary indexed reference type prior to any dereference, but also prior to any parameter passing that expects such a type.
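A sketch of what these two members could look like (the case name and exact shapes are assumptions; SentinelBlown itself is referenced later in this post, and this code has to live in the same file as the fileprivate initializer mentioned in technique 6's addendum):

extension PotentiallySentinelIndexedReferenceNonGCD {
    enum SentinelBlown: Error { case blown }

    // Converting from a guaranteed-not-sentinel index always succeeds.
    init(_ reference: IndexedReferenceNonGCD<Target, Rights>) {
        g = reference.g;
    }

    // u() (for "unwrap") returns a Result rather than throwing, to avoid the
    // runtime allocation discussed in the footnote below.
    func u() -> Result<IndexedReferenceNonGCD<Target, Rights>, SentinelBlown> {
        guard g != .max else { return .failure(.blown); }
        return .success(.init(g: g));
    }
}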

This technique is best seen in action, in the current version of the code as of this writing, in the context of this patch:

diff --git a/AsyncCountdown/AsyncExplorers.swift b/AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift
+++ b/AsyncCountdown/AsyncExplorers.swift
@@ -168,6 +168,7 @@
                 
                 let realizedValue = kind.uncombine(exploreAdditionOfRightNodeCtx.forwardValue, exploreAdditionOfRightNodeCtx.reverseValue);
                 
+
                 try! resolveCoreCtx.l.top.u().get().open(&resolveCoreCtx.mainTable, { $0.contents.upgradeToValue(value: realizedValue); });
                 if resolveCoreCtx.l.count == 1 {
                     if realizedValue == startGoal {

1: At the moment, "raising a failure" consists of returning a Result type, as throwing any Error causes an allocation at runtime (I was not able to make typed throws work at the time I came up with the solution; I will watch the next betas for improvements in this area); bailing with a fatalError is not an option, as I sometimes do want to treat a sentinel as an exceptional condition, such as when enumerating a linked list: I can then arrange for "failure" to result in the loop terminating. EDIT: typed throws (or this part of it) is in fact not slated for wide usage as part of this beta cycle, despite some enthusiasm in accompanying materials.

Technique 5: even if you have multiple tables that share indexes, put type checking in place so you won't mistakenly index any table with any random index

It's expected that there will be multiple well-known tables in the end, and that includes the possibility of more than one table sharing their indexes, much in the same way that two relational database tables can have the same primary keys. In the case of this project, this has happened for two reasons:

  • Linked list management routines, which are expressed as methods on the PotentiallySentinel index type, modify both their receiver and, as part of internal maintenance, the next property of the relevant slot of the storage array (the same would go for, say, a tree structure maintenance routine). If the two belong to the same array, the compiler will not allow that code to be written, because of exclusive access checks, which have array granularity at the point the call is made. Casting off the receiver to a secondary array is the solution I found to this problem.
  • In some cases, not only will your storage array contain data-bearing enums/variants, but it is in fact known that a contiguous chunk of the array will never have the most voluminous variant, leading to wasted space and higher cache occupancy. This is the case here with one half of the mainTable, which only contains primitive values: those have no need for a contributions linked list tip property. Even if I don't necessarily save space by casting those off to a secondary array (not until further developments), at least I no longer have holes of unused bytes in the middle of my working set.

But now that the same "reference" is needed to index two arrays, you can no longer apply type checking between indexes and arrays, right? Of course you can! Just wrap your arrays in a "table" structure that takes a second generic type parameter (on top of the array element) that serves as a key: that type must match between the table and the "target" type parameter of the index for dereferencing to be possible. By convention, for this pivot type I pick the type of the element for the "main" table: for that instance of the table structure, the two generic type parameters will be the same.

struct TableNonGCD<Element:Sendable, Target:Sendable>:Sendable {
    var g: [Element];
}

typealias MainTableNonGCD<Contents, Target> = TableNonGCD<NodeNonGCD<Contents, Target>, Target>;

typealias PoolNonGCD<A> = MainTableNonGCD<A, A>;
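For illustration, a hypothetical pair of keys (CellA and CellB are made up; everything else matches the definitions above) and what the compiler now accepts or rejects:

enum CellA: Sendable {}   // made-up key type, playing the role of CountdownArrayCellNonGCD
enum CellB: Sendable {}   // a second, unrelated key type

func demo(_ ref: IndexedReferenceNonGCD<CellA, WritableNonGCD>,
          _ mainTable: TableNonGCD<Int, CellA>,
          _ tipsTable: TableNonGCD<UInt32, CellA>,
          _ otherTable: TableNonGCD<Int, CellB>) {
    _ = ref[mainTable];    // compiles: the key types match
    _ = ref[tipsTable];    // compiles: same index values, different storage
    // _ = ref[otherTable]; // rejected: Target (CellA) does not match the table's key (CellB)
}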

But, since these two tables need to accept the exact same index values, doesn't that mean their item counts need to be kept in sync? Yes, and that is what we're going to see in the next technique.

This technique is best seen in action, in the current version of the code as of this writing, in the context of this patch:

diff --git a/AsyncCountdown/AsyncExplorers.swift b/AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift
+++ b/AsyncCountdown/AsyncExplorers.swift
@@ -131,6 +131,7 @@
             total = kind.uncombine(total, candidateValue);
         }
 
+
         try! resolveCoreCtx.l.top.u().get().open(&resolveCoreCtx.contributionsTipTable, {
             try! $0.wu({
                 $0.pushWithoutAccounting(&resolveCoreCtx.mainTable, rightChild);

Technique 6: only create indexed references through an allocation routine, accessed by a method on a structure that groups all such synchronized tables; template that code through whatever means necessary

By centralizing the production of IndexedReference instances, you can be sure that the relevant arrays are as large as they need to be, and kept in sync. In the ideal case, the lifecycle of allocations is last-out-first-in, such that deallocations can just symmetrically shrink the arrays; in more general situations, more techniques (free list, invalidated element state, etc.) may be necessary.

extension PoolWith2TablesNonGCD
{
    mutating func place(_ elemA: ElementA, _ elemB: ElementB) -> IndexedReferenceNonGCD<Target, WritableNonGCD> {
        guard self[keyPath: Self.aTablePath].g.count == self[keyPath: Self.bTablePath].g.count else {
            fatalError("Mismatch in table lengths within a single pool");
        }
        
        self[keyPath: Self.aTablePath].g.append(elemA);
        self[keyPath: Self.bTablePath].g.append(elemB);
        
        return .init(g: UInt32(self[keyPath: Self.aTablePath].g.count) - 1);
    }
    
    mutating func remove<R: AccessRightsNonGCD>(recovering indexedRef: IndexedReferenceNonGCD<Target, R>) {
        guard self[keyPath: Self.aTablePath].g.count == self[keyPath: Self.bTablePath].g.count else {
            fatalError("Mismatch in table lengths within a single pool");
        }
        
        guard (UInt32(self[keyPath: Self.aTablePath].g.count) - 1) == indexedRef.g else {
            fatalError("Indexed reference deallocated out of order");
        }

        _ = self[keyPath: Self.aTablePath].g.popLast();
        _ = self[keyPath: Self.bTablePath].g.popLast();
    }
}

You really don't want to spell that out multiple times, so I eventually expect to use property packs (for now, only a future direction of parameter packs); meanwhile, a protocol extension will suffice.

But won't these array grow/shrink operations cause dynamic allocations? Well, in most cases you can predict the maximum size the array can possibly reach, and if you pass that to reserveCapacity(), no actual allocation will occur after that; meanwhile, by having the element not exist until you need to allocate it, you reap some benefits such as definite initialization.
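A sketch of that reservation, as a hypothetical helper (reserveWorstCase and its placement are made up):

extension ResolveCoreCtxNonGCD {
    // Reserve the predictable maximum once, up front; within that bound,
    // the append() calls in place() never actually allocate, and popLast()
    // in remove() never shrinks capacity.
    mutating func reserveWorstCase(_ maxCells: Int) {
        mainTable.g.reserveCapacity(maxCells);
        contributionsTipTable.g.reserveCapacity(maxCells);
    }
}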

Addendum: and in order to prevent indexes from being forged, you should make their sole property immutable (let) and restrict their general initializer to be fileprivate, so that only this allocation method can create them ex nihilo.

This technique is best seen in action, in the current version of the code as of this writing, in the context of this patch:

diff --git a/AsyncCountdown/AsyncExplorers.swift b/AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift
+++ b/AsyncCountdown/AsyncExplorers.swift
@@ -38,6 +38,7 @@
                                      _ dispatcher: DispatcherAsync,
                                      kind: Op,
                                      recruitmentLoc: PotentiallyEvenUndefinedIndexedReferenceNonGCD<CountdownArrayCellNonGCD, WritableNonGCD> = .init(CountdownArrayCellNonGCD.self, rights: WritableNonGCD.self)) async rethrows {
+
     resolveCoreCtx.l.push(&resolveCoreCtx.mainTable,
                           resolveCoreCtx.place(.init(.placeholder(op: kind)),
                                                .init(w: .init(sentinelFor: CountdownArrayCellNonGCD.self, rights: ReadOnlyNonGCD.self))

As you go through these it would be really interesting to see performance comparisons step by step.

Unfortunately, I did not come up with these in this exact order: I rearranged them for better pedagogical effect (you can double-check the version control log, I left that history as-is). I did, however, watch for performance regressions every step of the way, so you can assume scaling performance is the same as the last posted graph.

Technique 7: use defer to make manual read-modify-write sequences more readable, no matter how trivial, even in methods that do not throw

Sometimes, when the compiler diagnoses an instance of overlapping mutable accesses, you won't have any solution but to spell out a manual read-modify-write pattern, as suggested by the compiler. But it makes for less straightforward code.

In this case, do not hesitate to use defer, no matter if it is library code, no matter how trivial the code, no matter the length of the containing block, no matter whether the latter even throws. Using defer is just more readable: note in particular how here it allows the final line to directly return the result of a called method (rather than having to define another temporary variable).
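To illustrate the shape with plain types (the exclusivity conflict that forces this in the real code involves the custom table types; this stand-in only shows the read / deferred write-back / direct return structure):

func popFrom(_ lists: inout [[Int]], at i: Int) -> Int? {
    var organ = lists[i];        // read
    defer { lists[i] = organ; }  // write-back, as if pushed on the undo stack as part of the "do"
    return organ.popLast();      // modify, directly returning the called method's result
}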

Addendum: but wait, did I not previously eliminate all uses of defer because I had just added them for convenience? The difference is, using defer here is not merely convenient… if you ask yourself the question "Even if I do not throw right now, what would I want to happen if I later allow the code that operates on or around the organ to throw?", the answer is clear: you most definitely want the current state of the organ to be put back in its place so that your object is in a consistent state before giving up control! So using defer here is convenient, but it is not merely convenient.

This technique is best seen in action, in the current version of the code as of this writing, in the context of this patch:

diff --git a/AsyncCountdown/AsyncCountdown.swift b/AsyncCountdown/AsyncCountdown.swift
--- a/AsyncCountdown/AsyncCountdown.swift
+++ b/AsyncCountdown/AsyncCountdown.swift
@@ -228,6 +228,7 @@
     mutating func popFromWithin(_ context: inout MainTableNonGCD<Contents, Target>, after locator: IndexedReferenceNonGCD<Target, WritableNonGCD>) -> Result<IndexedReferenceNonGCD<Target, WritableNonGCD>, PotentiallySentinelIndexedReferenceNonGCD<Target, WritableNonGCD>.SentinelBlown> {
         var organ = locator[context].next;
         defer { // as if we were pushing on the undo stack as part of the "do"
+
             locator.open(&context, { $0.next = organ; });
         }
         return organ.popWithoutAccounting(&context).map({

Technique 8: do not hesitate to use short closures that capture few variables (three, maybe four at most); however, replace non-trivial closures by any means necessary

Note that many constructs in Swift rely on closures (even defer, which we just saw, internally generates a closure), so you must be careful to keep those short.

This is a shame, as the capabilities and ease of working with captures are what, in my humble opinion, most distinguishes Swift from its peers. But the fact that a moderately complex closure caused dynamic allocations of 32 bytes leads me to distrust them in general; this includes both anonymous closures and local functions.

If you can't use complex closures, what can you replace them with?

  • First, attempt to turn the closure into a top-level function or a method of some object, passing what you would have captured as explicit parameters instead (see the sketch after this list)
  • If that is too cumbersome, then instead turn the closure body into a control structure body, even if you need to, say, set up a for loop for just two iterations (for code that needs to diverge between iterations, so far my best idea is to use an if and/or switch inside the body)
  • And in some cases, you might not be able to fully eliminate the closure, in which case partially applying the techniques above will allow you to reduce it enough
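A sketch of that first replacement (accumulate, values, total, and threshold are all made up; the original closure is assumed to have captured total and threshold):

// Before (sketch): a non-trivial closure capturing two variables.
//     values.forEach({ if $0 > threshold { total += $0; } });

// After: a top-level function taking the former captures as explicit parameters.
func accumulate(_ total: inout Int, adding values: [Int], over threshold: Int) {
    for v in values where v > threshold {
        total += v;
    }
}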

(Note that escaping closures, being reference types, must be eschewed at all costs, regardless of the size of their body)

This technique is best seen in action, in the current version of the code as of this writing, in the context of this patch:

diff --git a/AsyncCountdown/AsyncExplorers.swift b/AsyncCountdown/AsyncExplorers.swift
--- a/AsyncCountdown/AsyncExplorers.swift
+++ b/AsyncCountdown/AsyncExplorers.swift
@@ -138,6 +138,7 @@
         });
 
         for direction in Direction.allCases {
+
             switch direction {
             case .forward:
                 combine(&exploreAdditionOfRightNodeCtx.forwardValue);

Hello, @ZPedro

Thank you for writing these posts; I enjoy reading them.

Can you elaborate on the following, with tiny examples if possible.

Thank you.

This principle is best illustrated here with technique 6.

Indeed, the hot code at:

benefits from an abstraction (the place method of resolveCoreCtx) that relieves it from having to keep the two well-known arrays' lengths in sync. But while ResolveCoreCtx provides the abstraction, the place method is not defined there or even specifically for this type! Instead, this method is provided "for free" by an extension to PoolWith2TablesNonGCD:

With Swift, anything written as protocol extensions is provided automatically to types that conform to the core part of a protocol, and the only thing that ResolveCoreCtx had to do to conform to the protocol, besides specifying PoolWith2TablesNonGCD after a colon, is define read-only static properties aTablePath and bTablePath…

struct ResolveCoreCtxNonGCD: PoolWith2TablesNonGCD {
    typealias ElementA = NodeNonGCD<CountdownArrayCellNonGCD, CountdownArrayCellNonGCD>;
    typealias ElementB = PotentiallyEvenUndefinedIndexedReferenceNonGCD<CountdownArrayCellNonGCD, ReadOnlyNonGCD>;
    typealias Target = CountdownArrayCellNonGCD;
    
    var mainTable: PoolNonGCD<CountdownArrayCellNonGCD>;
    var contributionsTipTable: TableNonGCD<PotentiallyEvenUndefinedIndexedReferenceNonGCD<CountdownArrayCellNonGCD, ReadOnlyNonGCD>, CountdownArrayCellNonGCD>;
    
    static var aTablePath: WritableKeyPath<ResolveCoreCtxNonGCD, TableNonGCD<NodeNonGCD<CountdownArrayCellNonGCD, CountdownArrayCellNonGCD>, CountdownArrayCellNonGCD>> {
        return \Self.mainTable;
    }
    static var bTablePath: WritableKeyPath<ResolveCoreCtxNonGCD, TableNonGCD<PotentiallyEvenUndefinedIndexedReferenceNonGCD<CountdownArrayCellNonGCD, ReadOnlyNonGCD>, CountdownArrayCellNonGCD>> {
        return \Self.contributionsTipTable;
    }
}

(Admittedly, it is still a few more characters than I would like; maybe some of that could be inferred. But the important part is that even if e.g. the complex types were copy-pasted, all of this is compiler-checked, so none of it is duplicate code that can go stale.)

That way, we can have an object (ResolveCoreCtx) provide an ergonomic environment while being itself defined in a reasonably ergonomic way.


Technique 9: liberally use inout parameters so as to secure guarantees that are hard to obtain otherwise; define ad-hoc functions or closures for that purpose if necessary

The short coverage of inout in the reference, mostly centered on it being an alternative to returning, belies its power: this is not your childhood's var keyword from Pascal, and it works better than your teenage years' by-reference parameters from C++. In particular, it performs no conversions (limiting implicit behavior), but will assert exclusive access to the variable, stored property, or subscript, such that the callee can directly manipulate the caller's version, avoiding a read, copy-and-modify, write-back cycle (at least for standard library collections, in the case of subscripts).

Note that if the caller itself received the variable as an inout parameter, passing it in turn to a callee as an inout parameter will not trigger a further exclusivity check, since it was already done when it initially became inout, so stacking such calls is even more efficient. Especially if performing multiple such calls: "reborrowing" is practically free.

In particular, while recent versions of Swift have gotten better at avoiding read-modify-write phenomena when nested data structures are assigned to local variables, I find it better to keep the chain of inout custody unbroken: it is more reliable when interventions inevitably complexify the intermediate code (e.g. for debugging purposes); if necessary, I define a function for just this purpose¹.

Note that this includes the case of functions that mutate a generic element: since that element might in turn be a data structure, arrange for that element to be passed as an inout parameter, to a callback if necessary.
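For instance, a sketch of such an ad-hoc function (withElement is a made-up name): the generic element is in turn handed to the callback as inout, so the chain of inout custody stays unbroken all the way down.

func withElement<Element: Sendable, Target: Sendable, T>(_ table: inout TableNonGCD<Element, Target>,
                                                         at index: Int,
                                                         _ body: (inout Element) -> T) -> T {
    return body(&table.g[index]);  // the element never gets copied into a local
}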

This technique is best seen in action, well, about everywhere in the current version as of this writing: just look for inout or mutating…


1: Note that even with inout, it is possible to observe read-modify-write behavior: this is the case when you pass a computed property as an inout parameter, for instance. But the issue there is that you're accessing a computed property (or a property that is not, according to ABI stability, known to be stored), for which there is no alternative; the issue is not with inout. EDIT: my bad, properties "abstracted" by library evolution can in fact avoid performing the whole dance, as long as they turn out at runtime to actually be implemented as stored properties. I had that mixed up with the other phenomenon, which is that, for properties protected by library evolution, the compiler has to assume that exclusivity over the whole structure is required in order to have exclusivity over the property, even if the property eventually gets implemented as a stored one.


Principle 2: for this kind of code, challenge any kind of unsafe building block

With the techniques above, we have defined an abstraction level we are mostly going to be working in. However, this abstraction level is simultaneously incomplete, leaky, intentionally unsound (two IndexedReferences can have Writable rights to the same location) and unintentionally unsound, which means that more than once you're going to have to rely on the basic guarantees of Swift as a safety net.

It's going to be bad enough whenever you have to work at two levels of abstraction simultaneously, for instance when a fault occurs at the basic Swift level (say, an array access with an out-of-bounds index) and you have to map it back to something you wrote at the higher abstraction level, possibly far removed from the point where the fault was raised.

So how bad do you think it's going to get when adding a third level of abstraction to work simultaneously in, as you would do by adding unsafe code to the mix, which amounts to cutting a hole in the safety net of Swift?

This is particularly true when adding Swift concurrency to the mix (the point of this code is to scale, remember?), which is only going to widen the cognitive gap between abstraction levels. In relation to that, make sure to adopt the Swift 6 language mode as soon as possible. (The only part of my project that fails these compile-time checks is the alternative concurrency scaffolding that relies directly on the libdispatch APIs, which I maintain for comparison purposes with Swift-native concurrency, and which is as such devoid of any concurrency safety annotations.)

Technique 10: if you need to refer to a property by multiple names, then define it as a stored property bearing the most commonly used name, and use KeyPath-based indirection for the remaining accesses; do not use computed properties

This is really a consequence of the generalized use of inout: in the general case, passing a computed property as an inout parameter will cause a read-modify-write cycle; the fact that this apparently also causes the whole structure to be reserved for exclusive access, preventing the simultaneous use of any other part of the structure as a parameter to the same call, is also a problem for general use.

On the other hand, the keyPath: subscript, just like the subscripts of the standard library collections, appears to be usable as inout without requiring a read-modify-write. It's more verbose, less readable, and it too forces the compiler to assume the whole structure is reserved for exclusive access (since at the point of the analysis it obviously does not have the concrete key path yet), so use key paths sparingly: typically, for maintenance operations.
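A sketch of what such key-path-based maintenance access could look like (maintain is a made-up helper; the associated type names follow the conformance shown earlier):

extension PoolWith2TablesNonGCD {
    // The stored property keeps its common name for ordinary use; this
    // maintenance entry point reaches the same storage through the keyPath:
    // subscript, still as a direct inout access, with no read-modify-write.
    mutating func maintain(_ mutate: (inout TableNonGCD<ElementA, Target>) -> Void) {
        mutate(&self[keyPath: Self.aTablePath]);
    }
}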

This technique is best seen in action, in the current version of the code as of this writing, in the context of this patch:

diff --git a/AsyncCountdown/AsyncCountdown.swift b/AsyncCountdown/AsyncCountdown.swift
--- a/AsyncCountdown/AsyncCountdown.swift
+++ b/AsyncCountdown/AsyncCountdown.swift
@@ -262,6 +262,7 @@
         fatalError("Mismatch in table lengths within a single pool");
     }
     
+
     self[keyPath: Self.aTablePath].g.append(elemA);
     self[keyPath: Self.bTablePath].g.append(elemB);

Technique 11: use a "commando" scheme to be able to test and run in reasonable time

(This is mostly for Xcode users, as other IDEs have more straightforward ways of obtaining the same result)

The main technique here, using indexes into well-known arrays instead of references, helps with performance and scalabilityā€¦ but only in release mode: debug mode gets even slower, for various reasons (I did not thoroughly investigate them).

  • Other techniques here do help with performance in release mode, but don't help in debug mode, widening the gap. They don't necessarily make things worse in debug mode directly, but as a result of the improvements in release mode, you're going to want to increase the workload, if only so that the scenario takes an appreciable amount of time: indeed, I require that the "one task in parallel" reference scenario take at least one second, so that the fastest concurrency scenario still takes enough time to be measured with reasonable accuracy.¹
  • And some other techniques add abstractions that do not regress the performance in release mode, but very much do when in debug mode, leading to a yet slower experience in debug mode.

The cumulative effect is not subtle: with all these techniques applied, my project is by now about 100 times slower in debug mode than in release mode. Obviously, this is unacceptably slow to execute when the reference, in release, is ~1 second.

My solution, again close to embedded practice, is to just run in release mode all the time, and I find that in Xcode this is best accomplished by:

  • duplicating the default scheme
  • renaming that copy to "project - Commando"
  • editing it so the build configuration is always "Release"

Yes, this does mean the occasional debugging in release mode; you do lose some source fidelity: the source line will occasionally jump backwards when stepping, and many variables will turn out to be optimized out… but modern debuggers are up to the task.

In the occasional case where you do need to debug in the Debug build configuration, then switch back to the "regular" scheme.


1: It's a mistake to try and measure too short durations: here I wouldn't trust a fully parallel scenario that would last less than 100 ms, as then the measurements are going to be too noisy to be reliable. One solution could be to repeat them 10 or more times instead of just 3, but to keep things comparable in that case you're going to have to apply the same repetition to the longer-lasting reference scenario, leading to unacceptably long profiling runs. Better to keep the same methodology, and vary the workload instead.

Technique 12: only use local functions if the function actually needs to refer to itself; otherwise, use anonymous functions, directly passed as a parameter or directly invoked; create a temporary function for that purpose if necessary

In other words, only top-level functions and self-referencing functions get to have a name, everything else is best left anonymous.
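For illustration, the two anonymous forms this leaves (the values are made up):

let results = [3, 1, 4];

// …directly passed as a parameter:
results.forEach({ print($0); });

// …or directly invoked, replacing a named local function that would have
// been called exactly once:
let header = { (count: Int) -> String in
    return "\(count) solutions so far";
}(results.count);

print(header);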

There are two main reasons for this requirement:

  1. by having the closure be directly provided as a parameter, the inference engine (and in turn the compiler in general) can better tell your intent with regard to the function and provide the appropriate diagnostics. By contrast, before I applied this technique, the compiler seemed to err on the side of assuming the function to be escaping, leading it to diagnose that an escaping function can't do such and such (e.g. capture an inout parameter). But my intent in the first place was for this function to be non-escaping, and I wanted to know instead what required it to be escaping so that I could fix it! That, the compiler wouldn't tell.

  2. It appears to improve code generation: remember when I had a mysterious 32-byte allocation? Had I applied this technique at the time… well, I don't know exactly how the compiler proceeds differently (and I don't want to go there), but I know that I no longer get this 32-byte allocation, even though the closure body is almost the same¹.

This technique is best seen in action, in the current version of the code as of this writing, in the context of this patch:

diff --git a/AsyncCountdown/AsyncAPI.swift b/AsyncCountdown/AsyncAPI.swift
--- a/AsyncCountdown/AsyncAPI.swift
+++ b/AsyncCountdown/AsyncAPI.swift
@@ -313,6 +313,7 @@
                                                                            _ innerStartingFrom: IndexedReferenceNonGCD<CountdownArrayCellNonGCD, WritableNonGCD>) {
                         iteratePossibleLeftNodesPseudoReasync(&innerResolveCoreCtx,
                                                               &innerExploreAdditionOfRightNodeCtx,
+
                                                               { result in innerResultsScopingGroup.addTask { try await innerMediator.yield(result)} },
                                                               startGoal,
                                                               innerKind,

1: and in case you're wondering whether that slight difference is responsible: no, it is not. In another branch I keep the function as a named local one and adjust it to have the exact same body as when I made it anonymous, just to make sure; in that other branch I do observe the 32-byte allocation, contrary to the anonymous-but-otherwise-identical-body case.

Not-so-small intermezzo: to what exactly can we attribute the lack of scalability, back when we still used references?

Earlier in this exploration, most if not all impediments to scalability were removed when the algorithm was rewritten to eschew references, which is certainly a strong lead towards indicting them. However, I thought it would be useful to better characterize the effect of references; in particular, I was wondering whether frequent use of references could cause bottlenecks not only within our process, but also across processes: what if my computation load could cause other processing on the system to slow down through memory bus activity, before we even saturate the cores?

When developing a mental model of scalability, I find it useful to think of the computing system as a number of diverse resources; any scalability issue can then be modeled as one resource or another being (at least at some point) the bottleneck. Sometimes the resources in question are obvious, e.g. a core to which our thread can be allocated, but sometimes less so, for instance write bandwidth to a specific filesystem device. Keep in mind that a cache line at a particular address can be one such resource, for instance when it is used as a semaphore in some manner (e.g. it contains a mutex, or atomic memory operations directly target an address cell in there, etc.), so it is best to think of these resources as being defined in software, even if their semantics end up being enforced in hardware.

evaluation 1: scalability when the code treats different "requests" in different processes

Let us get back to the revision immediately prior to the no-allocations rewrite (equivalent to swift-without-dictionaries-and-with-reduced-walker-calls), and run that code with a "cpu count" scenario of 1… but 100 instances of it, with parallelism controlled by xargs; first, the reference scenario (multiple runs were made, but they all had consistent results):

echo {0..99} | net_wanderingcoder_cpucount=C /usr/bin/time xargs -n 1 -P 1 /Users/pierretemp2/Library/Developer/Xcode/DerivedData/AsyncCountdown-bpbnpedmqnnszzeqogzfyrevtzpv/Build/Products/Release/AsyncCountdown
# ...
       99,72 real        99,72 user         4,06 sys

Good, let us crank inter-process parallelism to 4 (again, multiple runs but so consistent I am only showing one):

echo {0..99} | net_wanderingcoder_cpucount=C /usr/bin/time xargs -n 1 -P 4 /Users/pierretemp2/Library/Developer/Xcode/DerivedData/AsyncCountdown-bpbnpedmqnnszzeqogzfyrevtzpv/Build/Products/Release/AsyncCountdown
# ...
       27,42 real       109,82 user         7,79 sys

99.72/27.42 is x3.637; in other words, better scalability than what we got from C; the scalability overhead is less than 10%. The 8-way parallelism scenario confirms that (multiple runs but so consistent I am only showing one):

echo {0..99} | net_wanderingcoder_cpucount=C /usr/bin/time xargs -n 1 -P 8 /Users/pierretemp2/Library/Developer/Xcode/DerivedData/AsyncCountdown-bpbnpedmqnnszzeqogzfyrevtzpv/Build/Products/Release/AsyncCountdown
# ...
       18,13 real       132,86 user         2,85 sys

99.72/18.13 = x5.500 is obviously less than x8, as we spill from the performance cores to the efficiency ones, but what matters is that it is infinitely better than the best intra-process scalability we ever obtained here. And that was with the C version: the code we're benchmarking here in fact managed x3.15 at best when the parallelism was internal to the process.

With these figures, I assert there is no need to look at all intermediate parallelism values or further characterize how this scales: it does so as linearly as we could hope for.

So the first conclusion here is: whatever the scalability issues discussed here, they are unlikely to be the cause of two separate processes slowing down each other before core saturation comes into play.

evaluation 2: scalability when the code treats different "requests" in a single process

So, are we good here? Of course not: we have just shown that the same code that had poor scalability when we tried to run it on multiple tasks within the same process has no trouble scaling with itself when run in separate processes, disproving the theory that the bottleneck was a global hardware resource¹. But we have also made sure that different refcounted objects were used in different threads, and that did not improve the situation any, so the bottleneck resource can't be any of these; so what's left?

But wait, did we make sure of that, really? How about we try this: take the same 100 "requests", but now run them within the same process. That will allow us, for instance, to have the same number of instances of resolveCore() running as in the previous evaluation, whether in total or at any given time, while varying a single variable: how these instances share processes. Not at all, in evaluation 1; all in the same process, in evaluation 2.

But to do that, we can no longer rely on xargs; we will have to provide that code, which I eventually tagged as parallelism-experiments-evaluation-2. Let us run the reference scenario (multiple runs but so consistent I am only showing one):

net_wanderingcoder_parallelcount=C net_wanderingcoder_cpucount=C /usr/bin/time /Users/pierretemp2/Library/Developer/Xcode/DerivedData/AsyncCountdown-bpbnpedmqnnszzeqogzfyrevtzpv/Build/Products/Release/AsyncCountdown 
# ...
       83,92 real       112,71 user         3,56 sys

Timing is lower (better) than in evaluation 1, as I also ported a few performance improvements in between. Now with 4-way parallelism (multiple runs but so consistent I am only showing one):

net_wanderingcoder_parallelcount=CCCC net_wanderingcoder_cpucount=C /usr/bin/time /Users/pierretemp2/Library/Developer/Xcode/DerivedData/AsyncCountdown-bpbnpedmqnnszzeqogzfyrevtzpv/Build/Products/Release/AsyncCountdown
# ...
       45,14 real       219,05 user         7,34 sys

83.92/45.14 is x1.859. This is completely comparable to the scalability observed when we adjusted net_wanderingcoder_cpucount rather than net_wanderingcoder_parallelcount. Let us look at 8-way… (multiple runs but so consistent I am only showing one)

net_wanderingcoder_parallelcount=CCCCcccc net_wanderingcoder_cpucount=C /usr/bin/time /Users/pierretemp2/Library/Developer/Xcode/DerivedData/AsyncCountdown-bpbnpedmqnnszzeqogzfyrevtzpv/Build/Products/Release/AsyncCountdown
# ...
       33,16 real       254,78 user         0,57 sys

83.92/33.16 = x2.531. This is in fact bad enough that we need to have a closer look². To the flame graphs!



There is clearly something going on with (partial apply for) iteratePossibleLeftNodesFakeDispatchPseudoReasync(), which (if a reminder is needed) is a function that does not do much at all: it pretty much forwards control to the next recursion, because we're already in a subtask and not allowed to further subdivide the batch we've been tasked with.

The flame graph for 8-way parallelism is harder to analyze than the 4-way one: since we've spilled onto the efficiency cores as well, that is bound to affect the profile obtained from those CPUs, so there are additional phenomena at play.

Unfortunately, this is where I am coming up short in my analysis: I have no idea where to go next, either from the CPU documentation or from anything in the Xcode toolchain. I can only conclude this:

While I can't attribute responsibility between the Apple Silicon hardware and the Swift runtime, know that manipulation of references causes scalability issues when attempting to scale within a single address space, even when the instances being referred to have been segregated between different threads to the largest extent seemingly possible.


1: worse, that discovery does not further the initial goal, as that multi-process evaluation only allows throughput improvements (i.e. treating more requests per second), without improving the treatment latency of a single request, which is what we had been pursuing so far.

2: In fact, an earlier version of the code had a bug where the parallelism was off by one, i.e. the "reference" scenario already benefitted from 2-way parallelism, which obviously skewed the whole thing; after fixing it, I double-checked all further builds with Instruments.

I find that if a local function only refers to parameters that have been passed into it, and no other external state, then it generates code identical to that of a fileprivate function. It would definitely be nice to have an annotation that caused a diagnostic if such a function accidentally closed over a variable from its parent scope.

Technique 13: do not hesitate to use sum types for state, for instance for the cell of the well-known array(s), even when you "know" what the state is supposed to be at that point

At first, doing so might seem redundant with what you already know about the data stored in there: for instance, if I follow the chain of nexts starting from a "contribution tip", I know the structure in there will have a defined direction, befitting a contribution. So it might seem strange to define a data-bearing enum that I will never query for what to do next: the only thing I can do with it is assert it has the state I expect at that point:

        let outcome = try! self[contributionsTipTable].u().get().reduce(context, OngoingForward.init(current: "(", isFirst: true, numRev: 0), {(innerContext, storySoFar, contributionLoc) in
            guard case let .contribution(op: _, direction: direction, value: _) = contributionLoc[innerContext].contents else {
                fatalError("Found an element of a contributions list not in the contributing state");
            }

However, this is useful nevertheless, as I would have a hard time exposing that knowledge to the compiler any other way: the well-known array must have a single element type that we can use regardless of the indexed reference used to dereference it; and in order to ensure safety, that type cannot be a union, so the next best thing is a tagged union, better known to Swift as a data-bearing enum.
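Inferred from the cases matched throughout this post (.placeholder, .value, .contribution), the cell enum presumably looks something like this sketch (the payload types are assumptions):

enum CellContentsNonGCD {
    case placeholder(op: Op?)                                    // op optional to support primitives
    case value(op: Op?, value: Int)
    case contribution(op: Op?, direction: Direction, value: Int)
}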

Meanwhile, this provides a number of benefits, chief among them being definite initialization: none of the fields of the borne data need to be an optional, for instance (except for Op, to support primitives), so you avoid a number of "non-nil" assertions, substituting them with the higher-level one: that the state is the one you expect, which doubles as documentation of your invariants.

        guard case let .value(op: candidateOp, value: candidateValue) = rightChild[resolveCoreCtx.mainTable].contents else {
            fatalError("Found element in l not in the value state");
        }

There are subtler benefits: on occasion, you might want to transitorily weaken these invariants to avoid pointless churn. For instance, immediately before I further the recursion, I store the direction that the value-turned-contribution I manipulate is to have for that branch of the bifurcation: either .forward or .reverse. If I were to strictly maintain my invariants, I would do so together with pushing the contribution.

            rightChild.open(&resolveCoreCtx.mainTable, {
                $0.contents.upgradeToContribution(direction: direction);
            });

But that push is not done here, nor immediately before or after. Why? Because after exploring the .forward branch of the bifurcation, it would be pretty pointless for me to pop the contribution from the contributions list (of the placeholder currently being built), only to pretty quickly add it back for the purposes of the .reverse branch. Instead, this push is done once, ahead of time (and undone once too), which means the invariant (that a member of a contribution list is, you know, in the contribution state) is temporarily violated.

        try! resolveCoreCtx.l.top.u().get().open(&resolveCoreCtx.contributionsTipTable, {
            try! $0.wu({
                $0.pushWithoutAccounting(&resolveCoreCtx.mainTable, rightChild);
            });
        });

But the data-bearing enum I recover from the well-known array provides me with a safety net: if for some reason control escapes while in that window of invalidity, the assertion will play its role when we later attempt to use this "contribution" as such, so the reason for the fault will be obvious (more so than in the case of a force-unwrapped nil), and with that knowledge it should be pretty easy to track down where we allowed control to escape while our invariants were dangling in the wind, since that ought to be in the call stack.

As for the cost, I have not found it to be a problem. If you need to recover info that you know is borne regardless of the state, just define an accessor that dispatches according to the state: with any luck, the requested field will be at the same location regardless of the state, so the optimizer will be able to remove any branch and just translate that into a memory access.

var op: Op? {
    let extractedOp: Op?;
    switch self {
    case let .placeholder(op: foundOp):
        extractedOp = foundOp;
        break;
    case let .value(op: foundOp, value: _):
        extractedOp = foundOp;
        break;
    case let .contribution(op: foundOp, direction: _, value: _):
        extractedOp = foundOp;
        break;
    }
    
    return extractedOp;
}

As for the enum state bits themselves, since the associated data is an aggregation of fields, the compiler will be able to fit them into unused bits that would otherwise have been lost to alignment padding, so I expect the memory cost to be zero (it might help to list the fields in order of increasing size).
