Memory pools (re. binary tree performance)

wadetregaskis · August 3, 2023, 9:20pm

Basically: how can memory pools be utilised in Swift? ("allocation zones" if you're an Objective-C old-timer)

Just for fun I took a stab at optimising the Swift implementation of "binary-trees" in the Computer Language Benchmarks Game. This benchmark basically just builds & traverses binary trees over and over again.

Why this benchmark?

I have actually hit performance problems with trees in Swift in real-world use, and I think this particular benchmark actually captures the essence of the relevant performance problem.

Why is it slow?

In the Swift implementation, pretty much all its time is spent in refcounting and alloc/free. That implementation uses a pretty typical approach for a tree structure: an indirect enum (alternatively one can use a class, but it's essentially equivalent at runtime).

What's faster?

If you look at the faster implementations, e.g. in C++, you'll see that they get around the memory management burden by using a memory allocation pool, so that they can trivially allocate nodes (stack style) and then dump the entire tree of them in a single free.

Can Swift do that?

Swift can be similarly efficient iff you can stack-allocate, but you cannot stack-allocate indirect enums or classes, and as far as I can tell there's no way to get Swift to let me use a recursive struct or a direct enum (even though I do actually know the depth of the tree in advance). And even if it would, I assume that wouldn't work for larger trees (tens of gigabytes) because thread stacks presumably have size limits?

It is possible to mimic a memory allocation pool in Swift by replacing references with manual indexing into an array (the pool). Example below. This speeds up this benchmark by about twelve-fold on my M2 MacBook Air, which might make it the fastest implementation in the CLBG across all languages (unfortunately the leading implementation, in C++, doesn't compile with clang* so I can't compare directly, and CLBG isn't accepting contributions).

It's also much more memory efficient - no headers for the tree nodes (for refcounts etc) and the references are half the size because only 32 bits are needed for indices, rather than the unavoidable 64 bits for pointers.

* = It appears clang doesn't support C++17 fully; it's missing std::execution and other such things.

Can it be simpler?

I don't know if all those memory savings could realistically be achieved in a more abstract way, but it'd be nice if at least that manual memory management could be abstracted away, so that the execution time gains could be realised with simpler code. As far as I'm aware there's no way to do so in Swift today, but am I missing something? Are memory allocation pools something that's being considered for future versions of Swift?

Optimised imlpementation

// The Computer Language Benchmarks Game
// http://benchmarksgame.alioth.debian.org/
//
// Optimisation of binary-trees Swift #2,
// an adaptation of binary-trees Go #8,
// that in turn is based on Rust #4
//
// contributed by Wade Tregaskis

import Dispatch
import Foundation

fileprivate struct Tree {
    let left: Int32
    let right: Int32

    func check(_ pool: [Tree]) -> UInt32 {
        (1
         + (0 <= left ? pool[Int(left)].check(pool) : 1)
         + (0 <= right ? pool[Int(right)].check(pool) : 1))
    }

    static func bottomUp(_ depth: UInt32, _ pool: inout [Tree]) -> Int32 {
        guard 0 < depth else {
            return -1
        }

        pool.append(Tree(left: bottomUp(depth - 1, &pool),
                         right: bottomUp(depth - 1, &pool)))

        return Int32(pool.count - 1)
    }
}

@inline(__always)
fileprivate func makePool(_ depth: UInt32) -> [Tree] {
    var pool = [Tree]()
    pool.reserveCapacity((1 << depth) - 1)
    return pool
}

let n = UInt32(CommandLine.arguments.dropFirst().first ?? "") ?? 10

let minDepth: UInt32 = 4
let maxDepth = (n > minDepth + 2) ? n : minDepth + 2

var messages: [UInt32: String] = [:]
let messageQueue = DispatchQueue(label: "messageQueue", qos: .background)

let group = DispatchGroup()
let workerQueue = DispatchQueue.global(qos: .userInteractive)

group.enter()
workerQueue.async {
    let depth = maxDepth + 1

    var pool = makePool(depth)
    let tree = Tree.bottomUp(depth, &pool)

    messageQueue.async {
        messages[0] = "stretch tree of depth \(depth)\t check: \(pool[Int(tree)].check(pool))"
        group.leave()
    }
}

group.enter()
workerQueue.async {
    var pool = makePool(maxDepth)
    let longLivedTree = Tree.bottomUp(maxDepth, &pool)

    messageQueue.async {
        messages[UINT32_MAX] = "long lived tree of depth \(maxDepth)\t check: \(pool[Int(longLivedTree)].check(pool))"
        group.leave()
    }
}

let iterations = Int((maxDepth - minDepth) / 2) + 1

DispatchQueue.concurrentPerform(iterations: iterations) {
    let depth = minDepth + UInt32((iterations - $0 - 1) * 2)
    let iterations = UInt32(1 << (maxDepth + minDepth - depth))

    var chk: UInt32 = 0
    var pool = makePool(depth)

    for _ in 0..<iterations {
        pool.removeAll(keepingCapacity: true)
        chk += pool[Int(Tree.bottomUp(depth, &pool))].check(pool)
    }

    let msg = "\(iterations)\t trees of depth \(depth)\t check: \(chk)"

    group.enter()
    messageQueue.async {
        messages[depth] = msg
        group.leave()
    }
}

group.wait()

for msg in messages.sorted(by: { $0.0 < $1.0 }) {
    print(msg.value)
}

tera · August 3, 2023, 11:32pm

Maybe this?

struct Tree1 {}

struct Tree2 {
    var left: Tree1
    var right: Tree1
}

struct Tree3 {
    var left: Tree2
    var right: Tree2
}

Not convenient obviously.

I wonder about this as well: what is the actual stack limit? In theory on 64 bit computer virtual stack size for all stacks in total could be exabytes (right?), and real stack size would probably be limited by available RAM (minus whatever is needed for OS / heap / etc, say, half of RAM).

Is that 8 cores, or 4? They use quad core CPU for the benchmarks.

I ran a naïve single threaded version of that benchmark (with nodes done via classes and no optimization tricks other than "release") on M1 Pro and got these results:

with allocations / deallocations: 29 sec
with no allocations / deallocations: 17 sec
(preallocated nodes that are never disposed)

Also tried to check the version with no ARC overhead by making nodes immortal but that didn't affect the timing:

class Node {  ... }
...
        let node = Node(left: nil, right: nil)
        let p = Unmanaged.passUnretained(node).toOpaque()
        _swift_stdlib_immortalize(p)
        newNodes.append(node)

bridging header:

void _swift_stdlib_immortalize(void * __nonnull obj);

Edit:

Be careful here, the number of preallocated nodes (if to not recycle the nodes) for this benchmark at depth 21 is about 14% of UInt32.max... with a slightly deeper tree (one or two levels down) it would easily overflow UInt32.

wadetregaskis · August 4, 2023, 12:43am

Thread.current.stackSize says 524,288 (bytes) in a playground on my iMac Pro, which seems plausible. ulimit -Hs and sysctl kern.stack_size say 64kiB and 16kiB respectively, which I know aren't correct (for a Swift app, at least), so I don't know what they think they're reporting.

I've wondered how Swift thinks about this. In Objective-C there weren't really any significant stack-allocated types, it was largely just frame data and local scalar storage (and the occasional alloca use, perhaps). So usually your stack only needed to support function recursion, to whatever degree you actually did that in your program - and you wanted your stack to be fairly small so that you'd catch infinite recursion bugs quicker.

In Swift a lot more is stack-allocated by design, so stack sizing presumably needs to consider not just recursive functions' frames but also a lot more actual data storage.

Then again, most collections in Swift aren't true value types - they heap-allocate their actual storage, for copy-on-write reasons if nothing else. So I don't really have a good instinct for what typical stack sizes are in Swift applications.

And x86-64, and Linux, and Swift 5.8 (instead of 5.9 as I'm using). So yeah, the numbers aren't comparable directly. But I observed that my "8-core" (4P+4E) M2 is about twice as fast as CLBG's published numbers.

Those seem oddly large durations… is this with the "21" CLI argument as used for the official benchmark? On an -O or -Ounchecked build?

I get about 430ms for the optimised version I posted (~650ms on first run, because dyld does a bunch of one-time stuff).

That could make sense - I had already wondered if it's the actual malloc activity that's the time sink, not merely some integer arithmetic [for the refcounts].

Yeah, I know, but I tested that it will crash on overflow so it's at least noticeable. And Int32 is still large enough for program arguments up to "29" at least (which takes a couple of minutes to complete even in my optimised version, so I didn't bother testing any larger values).

wadetregaskis · August 4, 2023, 1:02am

And interestingly (but quite tangentially), my iMac Pro (10-core) is only slightly faster, at about 350ms (and very similar for "29"). Which matches my general observation that the current MacBook Air can sometimes hold its own against the much older iMac, but there's still no substitute for cores (when you consider that the M2 has only four "real" cores, versus the iMac's ten, then it's pretty impressive that it's comparable at all).

Though in this case it's basically memory-latency-bound, I assume, at least for trees of non-trivial size. I did not yet put in any optimisations for the memory layout or "random" memory walk aspects. If I did, those might improve the performance an order of magnitude further. e.g. similar to this.

tera · August 4, 2023, 1:20am

I meant making a new thread and setting the stack size explicitly, what's the limit in this case.

Yes, 21 and it is -O. Too long? 17 sec / 4 (in multithreaded case) would give ~4sec which would be on par with C# in their table, not too bad.

I just rewrote it to use malloc to get rid of ARC overhead, the result is much better (single threaded case again):

normally allocated / deallocated class instances: 29 sec
preallocated class instances (still ARC overhead): 17 sec
malloced allocated / deallocated nodes (no ARC overhead): 15 sec
malloced preallocated nodes (no ARC overhead): 4.6 sec

That's true. And because there's nothing else in the nodes – just left/right references – by switching to 32-bit references you are effectively reducing memory I/O by half.

If it's easy for you to switch from 32 to 64 bit for references and limit the number of threads to 1, then I wonder what result would you get, will that be 4 - 5 sec or better.

wadetregaskis · August 4, 2023, 4:34am

You can't compare directly though. How does the original Swift version perform on your machine - that can give you a reference point since that specific implementation has published numbers in the rankings.

Oh, sorry, I overlooked that you were talking about a single-threaded version.

I replaced all the asyncs with syncs and the concurrentPerform with a simple for-loop, and that slows it down to 2.0 seconds on my M2 MacBook Air. Code:

// The Computer Language Benchmarks Game
// http://benchmarksgame.alioth.debian.org/
//
// Optimisation of binary-trees Swift #2,
// an adaptation of binary-trees Go #8,
// that in turn is based on Rust #4
//
// contributed by Wade Tregaskis

import Dispatch
import Foundation

fileprivate struct Tree {
    let left: Int32
    let right: Int32

    func check(_ pool: [Tree]) -> UInt32 {
        (1
         + (0 <= left ? pool[Int(left)].check(pool) : 1)
         + (0 <= right ? pool[Int(right)].check(pool) : 1))
    }

    static func bottomUp(_ depth: UInt32, _ pool: inout [Tree]) -> Int32 {
        guard 0 < depth else {
            return -1
        }

        pool.append(Tree(left: bottomUp(depth - 1, &pool),
                         right: bottomUp(depth - 1, &pool)))

        return Int32(pool.count - 1)
    }
}

@inline(__always)
fileprivate func makePool(_ depth: UInt32) -> [Tree] {
    var pool = [Tree]()
    pool.reserveCapacity((1 << depth) - 1)
    return pool
}

let n = UInt32(CommandLine.arguments.dropFirst().first ?? "") ?? 10

let minDepth: UInt32 = 4
let maxDepth = (n > minDepth + 2) ? n : minDepth + 2

var messages: [UInt32: String] = [:]
let messageQueue = DispatchQueue(label: "messageQueue", qos: .background)

let group = DispatchGroup()
let workerQueue = DispatchQueue.global(qos: .userInteractive)

group.enter()
workerQueue.sync {
    let depth = maxDepth + 1

    var pool = makePool(depth)
    let tree = Tree.bottomUp(depth, &pool)

    messageQueue.sync {
        messages[0] = "stretch tree of depth \(depth)\t check: \(pool[Int(tree)].check(pool))"
        group.leave()
    }
}

group.enter()
workerQueue.sync {
    var pool = makePool(maxDepth)
    let longLivedTree = Tree.bottomUp(maxDepth, &pool)

    messageQueue.sync {
        messages[UINT32_MAX] = "long lived tree of depth \(maxDepth)\t check: \(pool[Int(longLivedTree)].check(pool))"
        group.leave()
    }
}

let iterations = Int((maxDepth - minDepth) / 2) + 1

for i in 0..<iterations {
    let depth = minDepth + UInt32((iterations - i - 1) * 2)
    let iterations = UInt32(1 << (maxDepth + minDepth - depth))

    var chk: UInt32 = 0
    var pool = makePool(depth)

    for _ in 0..<iterations {
        pool.removeAll(keepingCapacity: true)
        chk += pool[Int(Tree.bottomUp(depth, &pool))].check(pool)
    }

    let msg = "\(iterations)\t trees of depth \(depth)\t check: \(chk)"

    group.enter()
    messageQueue.sync {
        messages[depth] = msg
        group.leave()
    }
}

group.wait()

for msg in messages.sorted(by: { $0.0 < $1.0 }) {
    print(msg.value)
}

I didn't specifically confirm it doesn't spawn any extra threads, but time says user time matches wallclock time, so it looks fully serialised.

Enlarging the indices to Int64 slows it down only very slightly, to 2.1s. That's with sufficient RAM available - it would of course matter more under memory pressure. I don't have a reproducible environment for testing that.

There must still be some disconnect between what we're each testing, because I can't see any reason why your M1 Pro would be so much slower than an M2 or an old iMac Pro.

I'm using swiftc binarytrees.swift-2d.swift -Ounchecked -g, with Apple Swift version 5.9 (swiftlang-5.9.0.124.4 clang-1500.0.38.1) and Target: arm64-apple-macosx13.0.

wadetregaskis · August 4, 2023, 4:39am

Also, tangentially: to Swift's credit, there's basically no performance differences between -O and -Ounchecked in all the versions I've tested. And the value of -O is demonstrated with the reference version since it contains an unsigned integer underflow bug (though it's not functionally a bug because the code adds another value immediately, which overflows the unsigned integer back to good ).

tera · August 4, 2023, 9:20am

Good call. Swift submissions run in about 3.7 sec on my M1 Pro, making my hardware about 3.5 times faster than what they have.

I confirm that - 1.8 sec on my hardware.

Ditto here.

I like your version, it's really quick.
I bet if you do the same in C++.... that would be still quicker

wadetregaskis · August 4, 2023, 3:40pm

If I change the allocation order of the nodes to put them in perfect order for the subsequent traversal, e.g.:

        let index = pool.count

        pool.append(Tree(left: -2, right: -2))
        pool[index].left = bottomUp(depth - 1, &pool)
        pool[index].right = bottomUp(depth - 1, &pool)

        return Int32(index)

…it actually worsens performance significantly (15% - 30% depending on exactly how I do that insert then modify). It appears the extra cost is entirely in array indexing; it looks like indexing into an array is surprisingly expensive. Perhaps because that makes the memory write pattern during tree initialisation non-sequential.

Correcting it so that it's completely sequential in both building and traversing the tree does finally improve performance further, now down to ~330ms (for "21") on my M2 MacBook Air:

// The Computer Language Benchmarks Game
// http://benchmarksgame.alioth.debian.org/
//
// Optimisation of binary-trees Swift #2,
// an adaptation of binary-trees Go #8,
// that in turn is based on Rust #4
//
// contributed by Wade Tregaskis

import Dispatch
import Foundation

fileprivate struct Tree {
    var left: Int32
    var right: Int32

    func check(_ pool: [Tree]) -> UInt32 {
        (1
         + (0 <= left ? pool[Int(left)].check(pool) : 1)
         + (0 <= right ? pool[Int(right)].check(pool) : 1))
    }

    static func bottomUp(_ depth: UInt32, _ pool: inout [Tree]) -> Int32 {
        guard 0 < depth else {
            return -1
        }

        guard 1 < depth else {
            pool.append(Tree(left: -1, right: -1))
            return Int32(pool.count - 1)
        }

        let index = Int32(pool.count)
        let leftIndex = index + 1
        let rightIndex = index + (1 << (depth - 1))

        pool.append(Tree(left: leftIndex, right: rightIndex))
        let left = bottomUp(depth - 1, &pool)
        assert(leftIndex == left, "Left: \(leftIndex) != \(left)")
        let right = bottomUp(depth - 1, &pool)
        assert(rightIndex == right, "Right: \(rightIndex) != \(right)")

        return index
    }
}

@inline(__always)
fileprivate func makePool(_ depth: UInt32) -> [Tree] {
    var pool = [Tree]()
    pool.reserveCapacity((1 << depth) - 1)
    return pool
}

let n = UInt32(CommandLine.arguments.dropFirst().first ?? "") ?? 10

let minDepth: UInt32 = 4
let maxDepth = (n > minDepth + 2) ? n : minDepth + 2

var messages: [UInt32: String] = [:]
let messageQueue = DispatchQueue(label: "messageQueue", qos: .background)

let group = DispatchGroup()
let workerQueue = DispatchQueue.global(qos: .userInteractive)

group.enter()
workerQueue.async {
    let depth = maxDepth + 1

    var pool = makePool(depth)
    let tree = Tree.bottomUp(depth, &pool)

    messageQueue.async {
        messages[0] = "stretch tree of depth \(depth)\t check: \(pool[Int(tree)].check(pool))"
        group.leave()
    }
}

group.enter()
workerQueue.async {
    var pool = makePool(maxDepth)
    let longLivedTree = Tree.bottomUp(maxDepth, &pool)

    messageQueue.async {
        messages[UINT32_MAX] = "long lived tree of depth \(maxDepth)\t check: \(pool[Int(longLivedTree)].check(pool))"
        group.leave()
    }
}

let iterations = Int((maxDepth - minDepth) / 2) + 1

DispatchQueue.concurrentPerform(iterations: iterations) {
    let depth = minDepth + UInt32((iterations - $0 - 1) * 2)
    let iterations = UInt32(1 << (maxDepth + minDepth - depth))

    var chk: UInt32 = 0
    var pool = makePool(depth)

    for _ in 0..<iterations {
        pool.removeAll(keepingCapacity: true)
        chk += pool[Int(Tree.bottomUp(depth, &pool))].check(pool)
    }

    let msg = "\(iterations)\t trees of depth \(depth)\t check: \(chk)"

    group.enter()
    messageQueue.async {
        messages[depth] = msg
        group.leave()
    }
}

group.wait()

for msg in messages.sorted(by: { $0.0 < $1.0 }) {
    print(msg.value)
}

Note that returning the index from bottomUp is now technically redundant, but interestingly removing that (hard-coding the assumption that the root node is always index zero, etc) along with the asserts etc has absolutely no impact on the final performance (in optimised builds). So safer to keep them in.

Of course, this optimisation isn't viable in non-trivial trees where you can't easily anticipate how many nodes a subtree will have (whether because it's not perfectly balanced, or has an indeterminate depth, etc).

But going back to the original question of this thread, I'm still no closer to finding a way to achieve good performance using more regular, idiomatic Swift. This "manual memory management" approach isn't terrible - and still safer than the equivalent in C/C++ - but it's not how most people would ever think to write this kind of structure, I think.

scanon · August 4, 2023, 3:44pm

Entirely system-dependent. Values between 16k and "a couple megs" are pretty common for "normal" computer systems.

Worth noting that for the purposes of an arena allocator / memory pool, there is effectively no performance difference between stack and heap memory; allocation is a one-time up-front cost that you will completely amortize either way, so you might as well use heap and avoid the hassles of dealing with stack size and scoping.

tera · August 4, 2023, 4:40pm

Meanwhile I optimised the simplified single threaded version of your app to use malloc instead of arrays:

Single threaded unsafe version based on malloc

import Foundation

typealias IndexType = Int
let maxIndex = IndexType.max

class Pool {
    
    init(capacity: Int) {
        self.capacity = capacity
        trees = malloc(MemoryLayout<Tree>.stride * capacity)!.assumingMemoryBound(to: Tree.self)
    }
    
    deinit {
        free(trees)
    }
    
    var trees: UnsafeMutablePointer<Tree>
    let capacity: Int
    private (set) var count: Int = 0
    
    subscript(_ index: Int) -> Tree {
        get {
            precondition(index >= 0 && index < count)
            return trees[index]
        }
        set {
            precondition(index >= 0 && index < count)
            trees[index] = newValue
        }
    }
    
    func append(_ tree: Tree) {
        precondition(count < capacity)
        trees[count] = tree
        count += 1
    }
    
    func removeAll() {
        count = 0
    }
}

struct Tree {
    let left: IndexType
    let right: IndexType

    func check(_ pool: Pool) -> IndexType {
        (1
         + (0 <= left ? pool[Int(left)].check(pool) : 1)
         + (0 <= right ? pool[Int(right)].check(pool) : 1))
    }

    static func bottomUp(_ depth: IndexType, _ pool: Pool) -> IndexType {
        guard 0 < depth else {
            return -1
        }
        pool.append(Tree(left: bottomUp(depth - 1, pool), right: bottomUp(depth - 1, pool)))
        return IndexType(pool.count) - 1
    }
}

func makePool(_ depth: IndexType) -> Pool {
    Pool(capacity: (1 << depth) - 1)
}

let n: IndexType = 21
let minDepth: IndexType = 4
let maxDepth = (n > minDepth + 2) ? n : minDepth + 2
var messages: [IndexType: String] = [:]
let depth = maxDepth + 1
var pool1 = makePool(depth)
let tree1 = Tree.bottomUp(depth, pool1)
messages[0] = "stretch tree of depth \(depth)\t check: \(pool1[Int(tree1)].check(pool1))"

let pool2 = makePool(maxDepth)
let longLivedTree2 = Tree.bottomUp(maxDepth, pool2)
messages[maxIndex] = "long lived tree of depth \(maxDepth)\t check: \(pool2[Int(longLivedTree2)].check(pool2))"
let iterations = Int((maxDepth - minDepth) / 2) + 1

for i in 0 ..< iterations {
    let depth = minDepth + IndexType((iterations - i - 1) * 2)
    let iterations = IndexType(1 << (maxDepth + minDepth - depth))
    var chk: IndexType = 0
    let pool = makePool(depth)

    for _ in 0 ..< iterations {
        pool.removeAll()
        chk += pool[Int(Tree.bottomUp(depth, pool))].check(pool)
    }
    let msg = "\(iterations)\t trees of depth \(depth)\t check: \(chk)"
    messages[depth] = msg
}

for msg in messages.sorted(by: { $0.0 < $1.0 }) {
    print(msg.value)
}

which now takes 1.4 sec. Interestingly switching the index type from Int to Int32 makes it a bit slower, which is somewhat counter intuitive.

I estimate it to take around 350ms when back converting to multithreaded version running on 4 cores, and around 1sec when running on their slower hardware.

Very true.

scanon · August 4, 2023, 5:24pm

FWIW, this is exactly the pattern I want someone to reach for in Swift, and we can build some affordances to make it easier and safer.

tera · August 4, 2023, 5:32pm

Yeah, I was mostly interested in the maximum value, not the default value.

stack size test

import Foundation

let wantedStackSize = 123*1024*1024*1024 // 123 GB

for _ in 0 ..< 1000 {
    let t = Thread {
        print("NSThread stack size is \(Thread.current.stackSize/1024/1024/1024) GB")
        Thread.sleep(forTimeInterval: 1)
    }
    t.stackSize = wantedStackSize
    t.start()
}

for _ in 0 ..< 1000 {
    var attr = pthread_attr_t()
    var err = pthread_attr_init(&attr)
    precondition(err != -1)
    err = pthread_attr_setstacksize(&attr, wantedStackSize)
    precondition(err != -1)
    var t: pthread_t? = .init(bitPattern: 0)!
    err = pthread_create(&t, &attr, { _ in
        let t = pthread_self()
        let size = pthread_get_stacksize_np(t)
        print("pthread size is \(size/1024/1024/1024) GB")
        sleep(1)
        return nil
    }, nil)
    precondition(err != -1)
}
RunLoop.main.run(until: .distantFuture)

Looks like pthreads have no problem with large stacks (got 1000 threads with 123GB stack size each), whilst NSThreads when asked for 123GB limited stack size to 1GB for some reason.

How do we promote it?

adamkuipers · August 5, 2023, 12:25am

Assuming you're trying to compete with the other languages in the benchmark, this manual memory management is explicitly against the rules, "Please don't implement your own custom "arena" or "memory pool" or "free list" - they will not be accepted."^

wadetregaskis · August 5, 2023, 1:15am

I'd argue that my implementation is still within the bounds of that rule. Using an array and indices is usefully distinct from actual manual memory management.

Also, all the leading implementations I looked at do exactly this kind of thing, in C, C++, Rust, etc. They clearly don't enforce those rules, or have a somewhat contrived definition of what "your own" means.

They're not accepting new submissions anyway. I came across the benchmarks randomly while reading some CS papers, and I became interested in this particular benchmark because it represents some real-world performance challenges I've run into. It was useful having numerous implementations in other languages as guidance for how fast I could conceivably make it. I was just interested in seeing how far I can take it in Swift without resorting to what I personally consider unreasonable for real-world Swift code (e.g. literal manual memory & pointer management).

Which is not to disparage approaches like @tera's which go that far - and from the sounds of it get even better performance as a result - but I personally just prefer to drop to actual C or C++ if I'm going to do significant manual memory management. I find it much easier to do that sort of thing in those languages than in Swift (the myriad of UnsafeFoo pointer types creates very verbose & confusing code compared to the C/C++ equivalents, I find).

tera · August 5, 2023, 1:45am

I believe the way to fix that is to systematically remove "Unsafe" prefixes, make unsafe functions callable only from functions marked "unsafe" (as we discussed), and possibly simplify the API in favour of, say, "data.bytes()" instead of "data.withBytes { ... }" – unsafe code won't be much less safer as a result but would look simpler and even similar to what you could do in C/C++.

I'd also like to see some built-in sanctioned way to have non atomic retain count operations – that will only work for objects accessible from a single thread (along with a precondition check that will fire if I am using those pinned to thread objects from different threads. And a placement new analogue from C++.

Nobel · November 6, 2023, 1:30am

Swift's manual memory management is done using unsafe pointers and unsafe buffer pointers. Which, as you pointed out, are quite inconvenient in use.
So a solution could be to write a wrapper around this, to at least abstract away the manual allocation and deallocation of the buffer and the objects inside it. This would still leave one with having to use UnsafePointers or UnsafeMutablePointers to interact with these objects in the buffer.
To circumvent that, one could implement a 'reference wrapper', that would allow for a more convenient access to the underlying object. But since we cannot write perfect wrapper types yet (AFAIK) due to key paths not being able to refer to methods, we do need some additional way to access these methods.
But I believe that these 2 strategies combined could be a viable option for a more abstracted way for manual memory management in Swift.

Here is a small example of that binary tree benchmark program, that uses this form of memory management. Here, we assume that these are part of a hypothetical module called 'SwiftMemoryResource'.

import SwiftMemoryResource

struct Node {
    var left, right: UnsafeRef<Node>
    
    func check() -> Int {
        // End of tree is signalized by left and right node being
        // indentical.
        guard left != right else {
            return 1
        }
        return 1 + left.ref.check() + right.ref.check()
    }
}

func makeTree(depth: Int, pool: inout MonotonicBuffer) -> UnsafeRef<Node> {
    guard let node = pool.allocate(Node.self) else {
        fatalError("Not enough memory available!")
    }
    
    if depth > 0 {
        node.initialize(to: Node(left: makeTree(depth: depth-1, pool: &pool),
                                 right: makeTree(depth: depth-1, pool: &pool)))
    } else {
        node.initialize(to: Node(left: node, right: node))
    }
    
    return node
}

struct BinaryTree {
    init(depth: Int, initialize: Bool = true) {
        pool = MonotonicBuffer.create(
            size: MemoryLayout<Node>.size * (1 << (depth + 1)))
        self.depth = depth
        root = makeTree(depth: initialize ? depth : 0, pool: &pool)
    }
    
    init(from oldTree: BinaryTree, depth: Int) {
        pool = oldTree.pool
        pool.reset()
        self.depth = oldTree.depth
        root = makeTree(depth: depth, pool: &pool)
    }
    
    var root: UnsafeRef<Node>
    var pool: MonotonicBuffer
    let depth: Int
}

let cmdLineDepth = Int(CommandLine.arguments.count > 1 ? CommandLine.arguments[1] : "10")

let minDepth = 4
let maxDepth = max(cmdLineDepth!, minDepth + 1)

let stretchDepth = maxDepth + 1

// Allocate stretch tree
do {
    let tree = BinaryTree(depth: stretchDepth)
    print("Stretch tree of depth \(stretchDepth)\t check: \(tree.root.ref.check())")
}

let longLivingTree = BinaryTree(depth: maxDepth)
var bufferTree = BinaryTree(depth: maxDepth, initialize: false)

for currentDepth in stride(from: minDepth, through: maxDepth, by: 2) {
    let iterations = 1 << (maxDepth - currentDepth + minDepth)
    var checkSum = 0
    for _ in 1...iterations {
        let tree = BinaryTree(from: bufferTree, depth: currentDepth)
        checkSum += tree.root.ref.check()
    }
    print("\(iterations)\t trees of depth \(currentDepth)\t check: \(checkSum)")
}

// Check long living tree and print out check info
print("long lived tree of depth \(maxDepth)\t check: \(longLivingTree.root.ref.check())")

Here, we use a memory resource called 'MonotonicBuffer', which allocates one pool of the requested size. It is only deallocated when the MonotonicBuffer instance goes out of scope. New objects can be allocated inside the pool using the methods 'allocate' or 'construct'. 'allocate' will simply allocate memory for the new object in the pool, however, it will not initialize the memory. This can be done in an second step by calling 'initialize' on the returned reference. 'construct' allows to allocate and initialize an object in the memory pool directly ('construct' is not used in the example as the order of nodes would be different, resulting in worse layout for this traversal). The buffer is not limited to a single type of elements that it can hold. The functions 'allocate' and 'construct' determine which type of object is created.
Also, as one can see, accessing methods through the reference wrapper requires using the computed property 'ref', which returns a reference to the actual memory. If one was only interested in properties (both stored and computed), one can directly access or modify them through the reference wrapper.

As for the performance, here are the measurements for the different binary tree benchmarks on my machine (MacBook Pro M2 Max).
Swift #1 is the above shown program.
Swift #2 is your initial version (both multithreaded and single threaded).
Swift #3 is your final version (both multithreaded and single threaded without workers).
C++ #1 is the fastest C++ program from CLBG, compiled with gcc 13.2.

Program	Threads	Time	Time Total
Swift #1	single	1.80 s	-
Swift #1 no module	single	1.67 s	-
Swift #2	single	1.81 s	-
Swift #2	multi	260 ms	2.14 s
Swift #3	single	1.01 s	-
Swift #3	multi	160 ms	1.32 s
C++ #1	single	1.68 s	-
C++ #1	multi	321 ms	2.26 s

This shows the performance is comparable to your initial array-based implementation, but allows working on a higher abstraction layer and is suitable for different tasks. If the buffers are not loaded from a module but implemented directly inside the project, the performance reached is quite similar to the C++ version using pmr's monotonic buffer resources.

The MonotonicBuffer does not allocate new memory when it runs out, instead, the methods 'allocate' and 'construct' will return nil. So one must know which amount of memory will be used beforehand.
If it is not possible to know the required capacity before, this module offers a growable buffer, called 'GrowableMonotonicBuffer'. When this buffer runs out of memory, it will allocate a new block of memory, so one can keep creating new objects.

Memory safety is important in Swift. Allocations in a memory pool and using pointers or references to these objects is not really safe. The memory pool could be deallocated while there are still live pointers or references to the objects inside the buffer. Or a buffer reset would allow overwriting the data still in use.
In order to prevent this, I have also implemented safe versions of the buffers. When allocating or constructing objects, the functions return safe references of type 'Ref<T>'. Safe references will keep the underlying memory pool alive even if the buffer itself goes out of scope. Also, they lock the buffer, which prevents it from being reset.
Here is how an implementation of the binary tree program could look like using a safe buffer. Only the tree root will be a safe reference, otherwise, we end up with a cycle condition. If safe references are stored in the buffer (in this example as part of the Node type), they will not be destroyed automatically once the safe reference to them goes out of scope. One would have to manually overwrite the data in the buffer. Therefore, having only the root node be a safe node offers enough safety for the entire tree, while the child nodes can be regarded as weak references.

Binary trees using a safe buffer

import SwiftMemoryResource

struct Node {
    var left, right: UnsafeRef<Node>?
    
    func check() -> Int {
        // End of tree is signalized by left and right node being
        // nil.
        guard let leftNode = left else {
            return 1
        }
        return 1 + leftNode.ref.check() + right!.ref.check()
    }
}

func makeTree(depth: Int, pool: inout SafeMonotonicBuffer) -> UnsafeRef<Node>? {
    guard let node = pool.allocateUnsafe(Node.self) else {
        return nil
    }
    
    if depth > 0 {
        node.initialize(to: Node(left: makeTree(depth: depth-1, pool: &pool),
                                 right: makeTree(depth: depth-1, pool: &pool)))
    } else {
        node.initialize(to: Node(left: nil, right: nil))
    }
    
    return node
}

struct BinaryTree {
    init(depth: Int, initialize: Bool = true) {
        pool = SafeMonotonicBuffer.create(
            size: MemoryLayout<Node>.size * (1 << (depth + 1)))
        self.depth = depth
        if initialize {
            root = pool.construct(Node())
            root!.left = makeTree(depth: depth-1, pool: &pool)
            root!.right = makeTree(depth: depth-1, pool: &pool)
        }
    }
    
    init(from oldTree: inout BinaryTree, depth: Int) {
        pool = oldTree.pool
        if !pool.reset() {
            fatalError("Pool is locked!")
        }
        self.depth = oldTree.depth
        root = pool.construct(Node())
        root!.left = makeTree(depth: depth-1, pool: &pool)
        root!.right = makeTree(depth: depth-1, pool: &pool)
    }
    
    var root: Ref<Node>?
    var pool: SafeMonotonicBuffer
    let depth: Int
}

let cmdLineDepth = Int(CommandLine.arguments.count > 1 ? CommandLine.arguments[1] : "10")

let minDepth = 4
let maxDepth = max(cmdLineDepth!, minDepth + 1)

let stretchDepth = maxDepth + 1

// Allocate stretch tree
do {
    let tree = BinaryTree(depth: stretchDepth)
    print("Stretch tree of depth \(stretchDepth)\t check: \(tree.root!.ref.check())")
}

let longLivingTree = BinaryTree(depth: maxDepth)
var bufferTree = BinaryTree(depth: maxDepth, initialize: false)

for currentDepth in stride(from: minDepth, through: maxDepth, by: 2) {
    let iterations = 1 << (maxDepth - currentDepth + minDepth)
    var checkSum = 0
    for _ in 1...iterations {
        let tree = BinaryTree(from: &bufferTree, depth: currentDepth)
        checkSum += tree.root!.ref.check()
    }
    print("\(iterations)\t trees of depth \(currentDepth)\t check: \(checkSum)")
}

// Check long living tree and print out check info
print("long lived tree of depth \(maxDepth)\t check: \(longLivingTree.root!.ref.check())")

Growable and safe buffers require additional logic or safety checks, leading to some performance penalties. But the tradeoffs are manageable.
These are the benchmarks for the different buffer types, all in single threaded tests.

Memory Resource Type	Time
Monotonic Buffer	1.80 s
Growable Monotonic Buffer	1.84 s
Safe Monotonic Buffer	2.12 s
Safe Growable Monotonic Buffer	2.18 s

Here are the implementations for the buffers:

Unsafe Buffer Implementations

@dynamicMemberLookup
public struct UnsafeRef<T> {
    public var ref: T {
        @_transparent unsafeAddress {
            return UnsafePointer(self.pointer)
        }
        @_transparent nonmutating unsafeMutableAddress {
            return self.pointer
        }
    }
    
    public func initialize(to obj: T) {
        self.pointer.initialize(to: obj)
    }
    
    public subscript<U>(dynamicMember keyPath: WritableKeyPath<T, U>) -> U {
        get {
            return ref[keyPath: keyPath]
        }
        set {
            return ref[keyPath: keyPath] = newValue
        }
    }
    
    public static func == (lhs: UnsafeRef<T>, rhs: UnsafeRef<T>) -> Bool {
            return lhs.pointer == rhs.pointer
        }

    public static func != (lhs: UnsafeRef<T>, rhs: UnsafeRef<T>) -> Bool {
        return lhs.pointer != rhs.pointer
    }
    
    init(pointer: UnsafeMutablePointer<T>) {
        self.pointer = pointer
    }
    
    @usableFromInline
    var pointer: UnsafeMutablePointer<T>
}

public class Pool {
    init(size: Int) {
        buffer = UnsafeMutableRawBufferPointer.allocate(
            byteCount: size, alignment: MemoryLayout<UInt8>.alignment)
    }
    
    deinit {
        buffer.deallocate()
    }
    
    var buffer: UnsafeMutableRawBufferPointer
    var offset: Int = 0
}

public struct MonotonicBuffer {
    public mutating func allocate<T>(_ type: T.Type) -> UnsafeRef<T>? {
        let requiredSize = MemoryLayout<T>.size
        guard pool.offset + requiredSize <= pool.buffer.count else {
            return nil
        }
        
        let pointer = (pool.buffer.baseAddress! + pool.offset)
            .bindMemory(to: T.self, capacity: 1)
        pool.offset += requiredSize
        return UnsafeRef(pointer: pointer)
    }
    
    public mutating func construct<T>(_ object: T) -> UnsafeRef<T>? {
        let requiredSize = MemoryLayout<T>.size
        guard pool.offset + requiredSize <= pool.buffer.count else {
            return nil
        }
        
        let pointer = (pool.buffer.baseAddress! + pool.offset)
            .bindMemory(to: T.self, capacity: 1)
        pointer.initialize(to: object)
        pool.offset += requiredSize
        return UnsafeRef(pointer: pointer)
    }
    
    public mutating func reset() {
        pool.offset = 0
    }
    
    public static func create(size: Int) -> MonotonicBuffer {
        return MonotonicBuffer(size: size)
    }
    
    init(size: Int) {
        pool = Pool(size: size)
    }
    
    var pool: Pool
}

public struct GrowableMonotonicBuffer {
    public mutating func allocate<T>(_ type: T.Type) -> UnsafeRef<T> {
        let requiredSize = MemoryLayout<T>.size
        if currentPool.offset + requiredSize > currentPool.buffer.count {
            getNextPool(requiredSize: requiredSize)
        }
        
        let pointer = (currentPool.buffer.baseAddress! + currentPool.offset)
            .bindMemory(to: T.self, capacity: 1)
        currentPool.offset += requiredSize
        
        return UnsafeRef(pointer: pointer)
    }
    
    public mutating func construct<T>(_ object: T) -> UnsafeRef<T> {
        let requiredSize = MemoryLayout<T>.size
        if currentPool.offset + requiredSize > currentPool.buffer.count {
            getNextPool(requiredSize: requiredSize)
        }
        
        let pointer = (currentPool.buffer.baseAddress! + currentPool.offset)
            .bindMemory(to: T.self, capacity: 1)
        pointer.initialize(to: object)
        currentPool.offset += requiredSize

        return UnsafeRef(pointer: pointer)
    }
    
    mutating func getNextPool(requiredSize: Int) {
        for pool in pools {
            if pool.offset + requiredSize <= pool.buffer.count {
                currentPool = pool
                return
            }
        }
        pools.append(Pool(size: max(newPoolSize, requiredSize)))
        currentPool = pools.last!
    }
    
    public mutating func reset() {
        for pool in pools {
            pool.offset = 0
        }
        currentPool = pools.first!
    }
    
    public mutating func reset(poolId: Int) {
        if pools.indices.contains(poolId) {
            pools[poolId].offset = 0
        }
    }
    
    public static func create(size: Int) -> GrowableMonotonicBuffer {
        return GrowableMonotonicBuffer(size: size, newPoolSize: size)
    }
    
    public static func create(size: Int, newPoolSize: Int)
    -> GrowableMonotonicBuffer {
        return GrowableMonotonicBuffer(size: size, newPoolSize: newPoolSize)
    }
    
    init(size: Int, newPoolSize: Int) {
        pools = [Pool(size: size)]
        currentPool = pools.first!
        self.newPoolSize = newPoolSize
    }
    
    var pools: [Pool]
    var currentPool: Pool
    var newPoolSize: Int
}

Safe Buffer Implementations

@dynamicMemberLookup
public struct Ref<T> {
    public var ref: T {
        @_transparent unsafeAddress {
            return UnsafePointer(self.pointer)
        }
        @_transparent nonmutating unsafeMutableAddress {
            return self.pointer
        }
    }
    
    public func initialize(to obj: T) {
        self.pointer.initialize(to: obj)
    }
    
    public subscript<U>(dynamicMember keyPath: WritableKeyPath<T, U>) -> U {
        get {
            return ref[keyPath: keyPath]
        }
        set {
            return ref[keyPath: keyPath] = newValue
        }
    }
    
    public static func == (lhs: Ref<T>, rhs: Ref<T>) -> Bool {
            return lhs.pointer == rhs.pointer
        }

    public static func != (lhs: Ref<T>, rhs: Ref<T>) -> Bool {
        return lhs.pointer != rhs.pointer
    }
    
    init(pointer: UnsafeMutablePointer<T>, lock: SafePool.Lock) {
        self.pointer = pointer
        self.lock = lock
    }
    
    @usableFromInline
    var pointer: UnsafeMutablePointer<T>
    private let lock: SafePool.Lock
}

public class SafePool {
    class Lock {
        init(_ pool: SafePool) {
            self.pool = pool
            self.pool.locked = true
        }
        
        deinit {
            pool.locked = false
        }
        
        var pool: SafePool
    }
    
    init(size: Int) {
        buffer = UnsafeMutableRawBufferPointer.allocate(
            byteCount: size,alignment: MemoryLayout<UInt8>.alignment)
    }
    
    deinit {
        buffer.deallocate()
    }

    var buffer: UnsafeMutableRawBufferPointer
    var offset: Int = 0
    var locked: Bool = false
    unowned var lock: Lock = unsafeBitCast(0, to: Lock.self)
}

public struct SafeMonotonicBuffer {
    public mutating func allocate<T>(_ type: T.Type) -> Ref<T>? {
        let requiredSize = MemoryLayout<T>.size
        guard pool.offset + requiredSize <= pool.buffer.count else {
            return nil
        }
        
        let pointer = (pool.buffer.baseAddress! + pool.offset)
            .bindMemory(to: T.self, capacity: 1)
        pool.offset += requiredSize
        
        if !pool.locked {
            let lock = SafePool.Lock(pool)
            pool.lock = lock
            return Ref(pointer: pointer, lock: lock)
        }
        
        return Ref(pointer: pointer, lock: pool.lock)
    }
    
    public mutating func allocateUnsafe<T>(_ type: T.Type) -> UnsafeRef<T>? {
        let requiredSize = MemoryLayout<T>.size
        guard pool.offset + requiredSize <= pool.buffer.count else {
            return nil
        }
        
        let pointer = (pool.buffer.baseAddress! + pool.offset)
            .bindMemory(to: T.self, capacity: 1)
        pool.offset += requiredSize
        
        return UnsafeRef(pointer: pointer)
    }
    
    public mutating func construct<T>(_ object: T) -> Ref<T>? {
        let requiredSize = MemoryLayout<T>.size
        guard pool.offset + requiredSize <= pool.buffer.count else {
            return nil
        }
        
        let pointer = (pool.buffer.baseAddress! + pool.offset)
            .bindMemory(to: T.self, capacity: 1)
        pointer.initialize(to: object)
        pool.offset += requiredSize
        
        if !pool.locked {
            let lock = SafePool.Lock(pool)
            pool.lock = lock
            return Ref(pointer: pointer, lock: lock)
        }
        
        return Ref(pointer: pointer, lock: pool.lock)
    }
    
    public mutating func constructUnsafe<T>(_ object: T) -> UnsafeRef<T>? {
        let requiredSize = MemoryLayout<T>.size
        
        guard pool.offset + requiredSize <= pool.buffer.count else {
            return nil
        }
        
        let pointer = (pool.buffer.baseAddress! + pool.offset)
            .bindMemory(to: T.self, capacity: 1)
        pointer.initialize(to: object)
        pool.offset += requiredSize
        
        return UnsafeRef(pointer: pointer)
    }
    
    @discardableResult
    public func reset() -> Bool {
        if !pool.locked {
            pool.offset = 0
            return true
        }
        return false
    }
    
    public static func create(size: Int) -> SafeMonotonicBuffer {
        return SafeMonotonicBuffer(size: size)
    }
    
    init(size: Int) {
        pool = SafePool(size: size)
    }
    
    var pool: SafePool
}

public struct SafeGrowableMonotonicBuffer {
    public mutating func allocate<T>(_ type: T.Type) -> Ref<T> {
        let requiredSize = MemoryLayout<T>.size
        if currentPool.offset + requiredSize > currentPool.buffer.count {
            getNextPool(requiredSize: requiredSize)
        }
        
        let pointer = (currentPool.buffer.baseAddress! + currentPool.offset)
            .bindMemory(to: T.self, capacity: 1)
        currentPool.offset += requiredSize
        
        if !currentPool.locked {
            let lock = SafePool.Lock(currentPool)
            currentPool.lock = lock
            return Ref(pointer: pointer, lock: lock)
        }
        
        return Ref(pointer: pointer, lock: currentPool.lock)
    }
    
    public mutating func allocateUnsafe<T>(_ type: T.Type) -> UnsafeRef<T> {
        let requiredSize = MemoryLayout<T>.size
        if currentPool.offset + requiredSize > currentPool.buffer.count {
            getNextPool(requiredSize: requiredSize)
        }
        
        let pointer = (currentPool.buffer.baseAddress! + currentPool.offset)
            .bindMemory(to: T.self, capacity: 1)
        currentPool.offset += requiredSize
        
        return UnsafeRef(pointer: pointer)
    }
    
    public mutating func construct<T>(_ object: T) -> Ref<T> {
        let requiredSize = MemoryLayout<T>.size
        if currentPool.offset + requiredSize > currentPool.buffer.count {
            getNextPool(requiredSize: requiredSize)
        }
        
        let pointer = (currentPool.buffer.baseAddress! + currentPool.offset)
            .bindMemory(to: T.self, capacity: 1)
        pointer.initialize(to: object)
        currentPool.offset += requiredSize
        
        if !currentPool.locked {
            let lock = SafePool.Lock(currentPool)
            currentPool.lock = lock
            return Ref(pointer: pointer, lock: lock)
        }
        
        return Ref(pointer: pointer, lock: currentPool.lock)
    }
    
    public mutating func constructUnsafe<T>(_ object: T) -> UnsafeRef<T> {
        let requiredSize = MemoryLayout<T>.size
        if currentPool.offset + requiredSize > currentPool.buffer.count {
            getNextPool(requiredSize: requiredSize)
        }
        
        let pointer = (currentPool.buffer.baseAddress! + currentPool.offset)
            .bindMemory(to: T.self, capacity: 1)
        pointer.initialize(to: object)
        currentPool.offset += requiredSize
        
        return UnsafeRef(pointer: pointer)
    }
    
    @discardableResult
    public mutating func reset() -> Bool {
        var didReset = false
        for pool in pools {
            if !pool.locked {
                pool.offset = 0
                didReset = true
            }
        }
        return didReset
    }
    
    @discardableResult
    public mutating func reset(poolId: Int) -> Bool {
        if pools.indices.contains(poolId) && !pools[poolId].locked {
            pools[poolId].offset = 0
            return true
        }
        return false
    }
    
    public static func create(size: Int) -> SafeGrowableMonotonicBuffer {
        return SafeGrowableMonotonicBuffer(size: size, newPoolSize: size)
    }
    
    public static func create(size: Int, newPoolSize: Int)
    -> SafeGrowableMonotonicBuffer {
        return SafeGrowableMonotonicBuffer(size: size, newPoolSize: newPoolSize)
    }
    
    mutating func getNextPool(requiredSize: Int) {
        for pool in pools {
            if pool.offset + requiredSize <= pool.buffer.count {
                currentPool = pool
                return
            }
        }
        pools.append(SafePool(size: max(newPoolSize, requiredSize)))
        currentPool = pools.last!
    }
    
    init(size: Int, newPoolSize: Int) {
        pools = [SafePool(size: size)]
        currentPool = pools.first!
        self.newPoolSize = newPoolSize
    }
    
    var pools: [SafePool]
    var currentPool: SafePool
    var newPoolSize: Int
}

Does this approach meet what you were looking for? I welcome any thoughts or feedback you might have.