Conditional copy/clone on write

I am currently (re)implementing a persistent binary tree data structure in Swift, and in the simplest case I'm already seeing pretty good performance with entirely safe code. This is awesome (my naive implementation is only about 1.5x slower than a pretty optimized Rust version).

The insert function is a recursive function, which takes a Node currently defined as something like (simplified for demonstration purposes):

final class Node<Key: Hashable & Comparable, Value> {
  var entry: (key: Key, value: Value)
  var left: Node<Key, Value>?
  var right: Node<Key, Value>?

  init(
    entry: (key: Key, value: Value),
    left: Node<Key, Value>? = nil,
    right: Node<Key, Value>? = nil
  ) {
    self.entry = entry
    self.left = left
    self.right = right
  }
}

and if possible, it should mutate the node in place; but if there are multiple references to the Node, it should copy the node first and mutate the copy in place. Here is the signature for insert that I have now; the code itself isn't all that relevant...

static func insert(node: inout Node<Key, Value>?, entry: Entry<Key, Value>) -> (Node<Key, Value>, Bool)

...except in a few places. First, I have a guard at the top of the function. Since the guard binding appears to always create an extra reference, isKnownUniquelyReferenced(&node) always returns false afterwards (obviously), so I can't do a conditional copy even when there is only one reference.

guard let node else {
  return ... // Returns a tree with entry as the single item
}

So my first question is: how do I handle this particular case, where I have an optional whose inner value (if not nil) needs to be conditionally copied before a mutation, but where I know there is going to be more than one reference to it (even if I own both of them)?

After the guard, any mutations can be implemented using something like:

...
let (inserted, isNewKey) = insert(node: &node.left, entry: entry)
node.left = inserted
...

So my first thought was to leverage something like Swift-CowBox (GitHub - Swift-CowBox/Swift-CowBox: Easy Copy-on-Write Semantics for Swift Structs), but that appears to kill my performance, possibly because my use of guards causes too many copies. I am still learning about __shared and __owned, so I am unsure whether those provide the right solution here. It also looks like some clever use of modify accessors might do the trick, but again, I'm still new to Swift, so I'm not entirely sure that's the right path. Alternatively, perhaps I should drop the optional and instead use a recursive enum to represent a Node or Empty, though I was a bit worried about performance in that case. Finally, this pitch looks very close to what I want/need.
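For reference, here is a minimal sketch of the kind of modify accessor idea mentioned above. Note that `_modify` is an underscored, officially unsupported feature, and `Wrapper`/`values` are hypothetical names used purely for illustration:

```swift
struct Wrapper {
  private var storage: [Int] = []

  var values: [Int] {
    get { storage }
    // _modify yields the storage for in-place mutation, avoiding the
    // temporary copy a get/set round-trip would make. Underscored
    // feature: it works in current compilers but is not officially
    // supported API.
    _modify { yield &storage }
  }
}
```

With this, something like `w.values.append(1)` mutates the array in place instead of copying it out and writing it back.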

So my question is essentially this: what is the current best practice for safe persistent trees in Swift? Or more generally, how do I match something like Rust's Rc::make_mut semantics in Swift? Here are the relevant docs for make_mut:

Makes a mutable reference into the given Rc.
If there are other Rc pointers to the same allocation, then make_mut will clone the inner value to a new allocation to ensure unique ownership. This is also referred to as clone-on-write.
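For what it's worth, a rough Swift analogue of that clone-on-write behavior can be sketched with isKnownUniquelyReferenced; `Box` and `makeMut` are hypothetical names for illustration, not standard library API:

```swift
final class Box<T> {
  var value: T
  init(_ value: T) { self.value = value }
}

// Rough analogue of Rust's Rc::make_mut: if the box is shared, clone the
// inner value into a fresh allocation so the caller holds a unique copy.
@discardableResult
func makeMut<T>(_ ref: inout Box<T>) -> Box<T> {
  if !isKnownUniquelyReferenced(&ref) {
    ref = Box(ref.value)  // clone-on-write: copy only when shared
  }
  return ref
}
```

After `makeMut(&a)`, mutations through the returned box cannot be observed through any other reference that existed before the call.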

3 Likes

Representing nodes as value types has its downsides in Swift, and right now it would probably be much less efficient for general cases (with non-copyable generics it might get another chance - example of linked list)

I’m not sure what your intention was with inout here, but classes (reference types) in Swift can be mutated regardless of being declared as var or passed as inout. There is no equivalent to Rust’s let & let mut that applies to all types in the language. In Swift, var has mutation consequences only for value types (structs, enums and non-class protocols). So your Node type is currently freely modifiable by everyone as it is (declaring it as let node = Node(...) will still allow modifications to its properties). You’d probably want to hide write access to its properties if you want to implement Copy-on-Write behavior.
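To illustrate hiding write access: a common pattern is to put the class behind a struct so callers only ever see value semantics. A minimal sketch, where `Counter` and its private `Storage` class are hypothetical names:

```swift
struct Counter {
  // The reference type is private, so callers cannot mutate it directly.
  private final class Storage {
    var count: Int
    init(_ count: Int) { self.count = count }
  }

  private var storage = Storage(0)

  init() {}

  var count: Int { storage.count }

  mutating func increment() {
    // Copy the backing storage only when it is shared with another
    // Counter value; otherwise mutate in place.
    if !isKnownUniquelyReferenced(&storage) {
      storage = Storage(storage.count)
    }
    storage.count += 1
  }
}
```

Copies of the struct share storage until one of them mutates, at which point only the mutating copy pays for a fresh allocation.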

The closest thing that comes to mind is CFGetRetainCount, which is available with the Foundation framework on Apple platforms, and (at least according to the code) is present in GitHub - apple/swift-corelibs-foundation: The Foundation Project, providing core utilities, internationalization, and OS independence, so it should also be available on Linux (Windows?) -- though there is some confusion about the open-source Foundation versions that I'm not familiar enough with. I suppose Rc in Rust works roughly the same way, maybe with mechanisms more native to the language, but I don't know of a Swift equivalent to CFGetRetainCount. So you can use this to replicate conditional-copy behavior when the count is greater than 1.

4 Likes

Ahh… interesting! Did you measure benchmarks? Do you have an example project I could reference? I would like to see if there is something easy to optimize in the CowBox implementation for this. Was it memory performance or CPU performance (or both) that was worse with CowBox?

Hmm… maybe take a look at the Swift Collections TreeDictionary CHAMP tree for some ideas.

1 Like

As @vns mentioned you don't need inout with classes:

static func insert(node: Node<Key, Value>?, entry: Entry<Key, Value>) -> (Node<Key, Value>, Bool)

and this could be further split into two:

static func insert(node: Node<Key, Value>, entry: Entry<Key, Value>)
static func insert(entry: Entry<Key, Value>) -> (Node<Key, Value>, Bool)

Furthermore, I don't see how "guard let" was used in the original code... you'd need at least "guard var" to be able to pass &node to isKnownUniquelyReferenced. If that creates an extra unwanted reference, just don't do it: you can simply check node != nil without an extra let/var binding.
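Worth noting: the standard library also ships an overload of isKnownUniquelyReferenced that takes an Optional directly, so the uniqueness check can run before any let binding adds a reference. A minimal sketch (this `Node` and `mutateInPlaceIfPossible` are stand-ins, not the types from the thread):

```swift
final class Node {
  var value: Int
  init(_ value: Int) { self.value = value }
}

// Capture uniqueness via the Optional overload of
// isKnownUniquelyReferenced *before* unwrapping, since the guard-let
// binding itself adds a reference.
func mutateInPlaceIfPossible(_ node: inout Node?, newValue: Int) {
  let isUnique = isKnownUniquelyReferenced(&node)
  guard let existing = node else {
    node = Node(newValue)  // empty: create the first node
    return
  }
  if isUnique {
    existing.value = newValue  // sole owner: mutate in place
  } else {
    node = Node(newValue)      // shared: leave the original untouched
  }
}
```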

1 Like

I actually missed one detail:

Maybe the intention was to modify the original variable passed from the caller side (not inside this boolean method, which I suppose doesn't need to accept inout after all, but somewhere inside insert itself), in which case it makes sense to have node as an inout parameter.

1 Like

Tangentially [perhaps], many tree-based algorithms benefit a lot from manual memory management in Swift. Irrespective of whether you use structs or classes or other such architectural choices, letting Swift do the memory management is expensive.

See for example this thread.

A facet of this that might apply more directly is to use Unmanaged to manually eliminate some ARC traffic, which might help with erroneous "is not uniquely referenced" results.
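As a tiny illustration of the Unmanaged idea (the `Buffer` type is hypothetical, and this assumes the object is kept alive by another strong reference for the wrapper's whole lifetime):

```swift
final class Buffer {
  var data: [Int] = []
}

// `owner` keeps the object alive; the Unmanaged wrapper carries the
// reference without any ARC retain/release traffic of its own.
let owner = Buffer()
let ref = Unmanaged.passUnretained(owner)

// takeUnretainedValue() hands the object back without consuming a
// retain, so tight traversal loops can shed per-step ARC operations.
ref.takeUnretainedValue().data.append(42)
```

The trade-off is that nothing checks the object is still alive: if `owner` were released first, the access above would be undefined behavior.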

5 Likes

Ah sorry, I should have kept a reproducible example before posting something like that! I don't have one, but IIRC it was mostly CPU performance that I was testing.

Thanks for the CHAMP pointer, I did take a look at that one. There are certainly some useful bits in there (which I did in fact borrow), though many of the manual memory management bits were a little advanced for me at this early stage of my Swift exploration. Obviously worth investing some more time in, though!

1 Like

So in truth, it is actually split into two, so glad to hear that was an ok idea.

I'm a little bit embarrassed that I didn't even try just checking with if node != nil... so thanks for that!

2 Likes

Yeh, I kinda figured this might be the case, thanks for the links!

Wow, thanks all. I'm not used to this level of helpful and insightful activity on forum posts! I might lean harder into Swift just for the community?!

8 Likes

Hmm… my expectation here with the copy-on-write value type (whether it is the CowBox repo, another repo, or writing your own from scratch) is that engineers considering this will:

  • Start with a data model built on simple classic Swift struct value types.
  • Want (or need) to preserve value (immutable) semantics.
  • Want to reduce the memory usage of many copies by leveraging object references under the surface.

If you start with a solution built on Swift classes and are comfortable with (or have defensively coded around) reference semantics… there probably isn't much that CowBox can do to improve your perf (and I could understand that this migration actually harms perf).

If you have a solution built on Swift classes and are considering moving to a copy-on-write value type… my expectation would be that the benefits from migrating to value semantics (the ability to "reason locally" about code) would be important enough from a DevX POV that the perf regressions were an acceptable tradeoff.

That ability to reason locally leads to another benefit. Suppose you are currently blocked from migrating to parallel (concurrent) computing because your reference semantics make writing thread-safe code impractical. Migrating those reference types to (immutable) value types might cause a perf regression in a single-threaded algorithm (where the unsafe behavior of a mutable object reference is less of an issue to defend against), but the migration might still be a net positive if it lets you move the algorithm to parallel concurrency, now that you have immutable value types that are safe to pass across threads.

If you do have a reproducible example project I would still like to take a look if you were interested. But there are going to be situations (this might be one) where continuing with reference semantics (and living with the risks and pain-points of mutable data) might just be the right tool for the job.

2 Likes

Here's the (arguably slow) implementation for reference; I probably should have just shared this earlier (I've stripped it down to a minimal set). This is based on (a variant of) Zip Trees, which is a great paper by the way!

import Foundation

final class Entry<Key: Hashable, Value> {
  let key: Key
  var value: Value

  init(key: Key, value: Value) {
    self.key = key
    self.value = value
  }
}

final class Node<Key: Hashable & Comparable, Value> {
  var entry: Entry<Key, Value>
  let rank: Int
  var left: Node<Key, Value>?
  var right: Node<Key, Value>?

  init(
    entry: Entry<Key, Value>,
    rank: Int,
    left: Node<Key, Value>? = nil,
    right: Node<Key, Value>? = nil
  ) {
    self.entry = entry
    self.rank = rank
    self.left = left
    self.right = right
  }
}

extension Node {
  convenience init(singleton entry: Entry<Key, Value>) {
    self.init(entry: entry, rank: entry.key.hashValue.trailingZeroBitCount + 1)
  }

  convenience init(other: Node) {
    self.init(entry: other.entry, rank: other.rank, left: other.left, right: other.right)
  }

  func get(_ key: Key) -> Entry<Key, Value>? {
    switch key {
    case _ where key < entry.key:
      return left?.get(key)
    case _ where key > entry.key:
      return right?.get(key)
    default:
      return entry
    }
  }

  static func insert(node: inout Node<Key, Value>?, entry: Entry<Key, Value>) -> Node<Key, Value> {
    let isUnique = Swift.isKnownUniquelyReferenced(&node)
    guard let node = node else {
      return Node(singleton: entry)
    }
    let root = isUnique ? node : Node(other: node)
    if entry.key == root.entry.key {
      root.entry.value = entry.value
      return root
    } else if entry.key < root.entry.key {
      let inserted = insert(node: &root.left, entry: entry)
      if inserted.rank < root.rank {
        root.left = inserted
        return root
      } else {
        inserted.right = root
        return inserted
      }
    } else {
      let inserted = insert(node: &root.right, entry: entry)
      if inserted.rank <= root.rank {
        root.right = inserted
        return root
      } else {
        inserted.left = root
        return inserted
      }
    }
  }
}

public struct Tree<Key, Value> where Key: Hashable & Comparable {
  var root: Node<Key, Value>? = nil

  public init() {}

  public func get(key: Key) -> Value? {
    return root?.get(key)?.value
  }

  public mutating func insert(key: Key, value: Value) {
    let entry = Entry(key: key, value: value)
    root = Node.insert(node: &root, entry: entry)
  }
}

A quick test might look something like this, which runs in 15s on my system, but ideally it should be less than half this if it is to be competitive with my Rust version. The 'get' calls appear to be almost negligible in the scheme of things.

import Tree

func insertTest(values: [UInt32]) {
  var tree = Tree<UInt32, UInt32>()
  for (index, value) in values.enumerated() {
    tree.insert(key: value, value: 2 * value)

    let other = values[index / 2]

    assert(tree.get(key: value) == 2 * value)
    assert(tree.get(key: other) == 2 * other)
  }
}

let limit = 25_000
var permutation: [UInt32] = Array(0..<256)
for _ in 0..<limit {
  permutation.shuffle()
  insertTest(values: permutation)
}

I'm still new to Swift and the dev tooling, so I'm having to go through some blogs and docs to learn more about profiling Swift code.

1 Like

My advice would be to try the Benchmarks repo from Ordo One for measuring memory together with CPU.

The swift-collections-benchmark repo is great for measuring CPU as your collection scales in size.

1 Like

Are you running both the Rust and Swift versions as debug or release builds? I assume that at least the Swift tests are running in debug; otherwise the asserts would be removed, so get wouldn't be called at all.

1 Like

Also, I’d try making Entry a struct. It should improve the time, since its instantiation is probably one of the significant time consumers here (you create one on each insert, so in the test you create a class instance 25,000 * 256 times, which is not cheap in Swift, while structs are cheap).

UPD: that turned out to be an invalid assumption; it only made things slower.

1 Like

Both running in release mode. In the actual test I was running, those bits were commented out. But, good callout.

Yes exactly, I did test that (since it was so easy to switch up), but it showed a marked degradation in performance.

1 Like

I see similar results. Here is an example of what an Ordo One Benchmark looks like:

func insertTest<T: Hashable & Comparable & BinaryInteger>(values: [T]) {
  var tree = Tree<T, T>()
  for (index, value) in values.enumerated() {
    tree.insert(key: value, value: 2 * value)
    
    let other = values[index / 2]
    
    precondition(tree.get(key: value) == 2 * value)
    precondition(tree.get(key: other) == 2 * other)
  }
}

import Benchmark

let benchmarks = {
  
  Benchmark.defaultConfiguration.metrics = .default
  Benchmark.defaultConfiguration.timeUnits = .microseconds
  Benchmark.defaultConfiguration.maxDuration = .seconds(86400)
  Benchmark.defaultConfiguration.maxIterations = .count(250_000)
  
  Benchmark("Final Class Entry UInt32") { benchmark in
    var permutation = Array(UInt32(0)..<UInt32(256))
    permutation.shuffle()
    benchmark.startMeasurement()
    insertTest(values: permutation)
    benchmark.stopMeasurement()
  }
  
  Benchmark("Final Class Entry UInt64") { benchmark in
    var permutation = Array(UInt64(0)..<UInt64(256))
    permutation.shuffle()
    benchmark.startMeasurement()
    insertTest(values: permutation)
    benchmark.stopMeasurement()
  }

}

Here is what that looks like running:

Final Class Entry UInt32

Metric p0 p25 p50 p75 p90 p99 p100 Samples
Instructions (K) * 2158 3166 3482 3871 4293 5226 8461 250000
Malloc (total) * 767 1246 1386 1551 1726 2103 3298 250000
Memory (resident peak) (K) 8749 9126 9126 9126 9126 9126 9126 250000
Throughput (# / s) (#) 6351 4151 3749 3347 2991 2397 47 250000
Time (total CPU) (μs) * 159 242 268 299 335 417 1097 250000
Time (wall clock) (μs) * 157 241 267 299 335 418 21309 250000

Final Class Entry UInt64

Metric p0 p25 p50 p75 p90 p99 p100 Samples
Instructions (K) * 2272 3389 3727 4141 4596 5599 8560 250000
Malloc (total) * 752 1251 1396 1570 1756 2161 3165 250000
Memory (resident peak) (K) 8716 9093 9093 9093 9093 9093 9093 250000
Throughput (# / s) (#) 6242 3939 3547 3151 2797 2207 706 250000
Time (total CPU) (μs) * 162 254 282 317 356 451 751 250000
Time (wall clock) (μs) * 160 254 282 318 358 453 1416 250000

I can also attempt a struct entry:

-final class Entry<Key: Hashable, Value> {
+struct Entry<Key: Hashable, Value> {
   let key: Key
   var value: Value
   Benchmark.defaultConfiguration.maxDuration = .seconds(86400)
   Benchmark.defaultConfiguration.maxIterations = .count(250_000)
   
-  Benchmark("Final Class Entry UInt32") { benchmark in
+  Benchmark("Struct Entry UInt32") { benchmark in
     var permutation = Array(UInt32(0)..<UInt32(256))
     permutation.shuffle()
     benchmark.startMeasurement()
     benchmark.stopMeasurement()
   }
   
-  Benchmark("Final Class Entry UInt64") { benchmark in
+  Benchmark("Struct Entry UInt64") { benchmark in
     var permutation = Array(UInt64(0)..<UInt64(256))
     permutation.shuffle()
     benchmark.startMeasurement()

And here is what that looks like running:

Struct Entry UInt32

Metric p0 p25 p50 p75 p90 p99 p100 Samples
Instructions (K) * 2448 4147 4616 5181 5784 7082 10756 250000
Malloc (total) * 479 1014 1157 1325 1502 1870 2892 250000
Memory (resident peak) (K) 8716 9110 9110 9110 9110 9110 9110 250000
Throughput (# / s) (#) 5635 3349 2989 2641 2335 1839 650 250000
Time (total CPU) (μs) * 179 300 335 379 428 543 1041 250000
Time (wall clock) (μs) * 177 299 335 379 429 544 1539 250000

Struct Entry UInt64

Metric p0 p25 p50 p75 p90 p99 p100 Samples
Instructions (K) * 2651 4284 4755 5325 5939 7279 12927 250000
Malloc (total) * 508 989 1131 1296 1474 1854 3311 250000
Memory (resident peak) (K) 8749 9167 9167 9167 9183 9191 9191 250000
Throughput (# / s) (#) 5667 3261 2915 2573 2275 1784 135 250000
Time (total CPU) (μs) * 178 307 343 388 439 559 1084 250000
Time (wall clock) (μs) * 176 307 343 389 440 561 7392 250000
3 Likes

Another idea to try (if you haven't already) is implementing a non-recursive algorithm for insertion.
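In case it helps, here's what a non-recursive insertion loop might look like for a plain binary search tree. This is just the shape of the iteration, not a drop-in replacement: the zip-tree rank/unzipping logic from the thread is omitted, and `BSTNode` is a hypothetical stand-in type.

```swift
final class BSTNode {
  var key: Int
  var left: BSTNode?
  var right: BSTNode?
  init(_ key: Int) { self.key = key }
}

// Iterative BST insertion: walk down from the root, following left or
// right children until an empty slot is found, with no recursion and
// hence no call-stack growth.
func insert(_ key: Int, into root: inout BSTNode?) {
  guard let start = root else {
    root = BSTNode(key)  // empty tree: the new node becomes the root
    return
  }
  var current = start
  while true {
    if key < current.key {
      if let next = current.left { current = next }
      else { current.left = BSTNode(key); return }
    } else if key > current.key {
      if let next = current.right { current = next }
      else { current.right = BSTNode(key); return }
    } else {
      return  // key already present; nothing to insert
    }
  }
}
```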