Conditional copy/clone on write

I am currently (re)implementing a persistent binary tree data structure in Swift, and in the simplest case I'm already seeing pretty good performance with entirely safe code. This is awesome (my naive implementation is only about 1.5x slower than a pretty optimized Rust version).

The insert function is a recursive function, which takes a Node currently defined as something like (simplified for demonstration purposes):

final class Node<Key: Hashable & Comparable, Value> {
  var entry: (key: Key, value: Value)
  var left: Node<Key, Value>?
  var right: Node<Key, Value>?

  init(
    entry: (key: Key, value: Value),
    left: Node<Key, Value>? = nil,
    right: Node<Key, Value>? = nil
  ) {
    self.entry = entry
    self.left = left
    self.right = right
  }
}

and if possible, it should mutate the node in place; but if there are multiple references to the Node, it should copy the node first and mutate the copy in place. Here is the signature for insert that I have now; the code itself isn't all that relevant...

static func insert(node: inout Node<Key, Value>?, entry: Entry<Key, Value>) -> (Node<Key, Value>, Bool)

...except in a few places. First, I have a guard at the top of the function. Since the guard binding appears to always create an extra reference, isKnownUniquelyReferenced(&node) always returns false afterwards (obviously), so I can't do a conditional copy even when there is only one reference.

guard let node else {
  return ... // Returns a tree with entry as the single item
}

So my first question is: how do I handle this particular case, where I have an optional whose inner value (if not nil) needs to be conditionally copied before a mutation, but where I know there is going to be more than one reference to it (even if I own both of them)?

After the guard, any mutations can be implemented using something like:

...
let (inserted, isNewKey) = insert(node: &node.left, entry: entry)
node.left = inserted
...

So my first thought was to leverage something like Swift-CowBox (GitHub - Swift-CowBox/Swift-CowBox: Easy Copy-on-Write Semantics for Swift Structs), but that appears to kill my performance, possibly because my use of guards causes too many copies. I am still learning about __shared and __owned, so I am unsure whether those provide the right solution here. It also looks like some clever use of modify accessors might do the trick, but again, I'm still new to Swift, so I'm not entirely sure that's the right path. Alternatively, perhaps I should drop the optional and instead use a recursive enum to represent a Node or Empty, though I was a bit worried about performance in that case. Finally, this pitch looks very close to what I want/need.
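For reference, here is a minimal sketch of the kind of modify accessor idea mentioned above. Note that `_modify` is an underscored, officially unsupported feature, and `Wrapper`/`values` are hypothetical names used purely for illustration:

```swift
struct Wrapper {
  private var storage: [Int] = []

  var values: [Int] {
    get { storage }
    // _modify yields the storage for in-place mutation, avoiding the
    // temporary copy a get/set round-trip would make. Underscored
    // feature: it works in current compilers but is not officially
    // supported API.
    _modify { yield &storage }
  }
}
```

With this, something like `w.values.append(1)` mutates the array in place instead of copying it out and writing it back.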

So my question is essentially this: what is the current best practice for safe persistent trees in Swift? Or more generally, how do I match something like Rust's Rc::make_mut semantics in Swift? Here are the relevant docs for make_mut:

Makes a mutable reference into the given Rc.
If there are other Rc pointers to the same allocation, then make_mut will clone the inner value to a new allocation to ensure unique ownership. This is also referred to as clone-on-write.
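For what it's worth, a rough Swift analogue of that clone-on-write behavior can be sketched with isKnownUniquelyReferenced; `Box` and `makeMut` are hypothetical names for illustration, not standard library API:

```swift
final class Box<T> {
  var value: T
  init(_ value: T) { self.value = value }
}

// Rough analogue of Rust's Rc::make_mut: if the box is shared, clone the
// inner value into a fresh allocation so the caller holds a unique copy.
@discardableResult
func makeMut<T>(_ ref: inout Box<T>) -> Box<T> {
  if !isKnownUniquelyReferenced(&ref) {
    ref = Box(ref.value)  // clone-on-write: copy only when shared
  }
  return ref
}
```

After `makeMut(&a)`, mutations through the returned box cannot be observed through any other reference that existed before the call.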

3 Likes

Representing nodes as value types has its downsides in Swift, and right now it would probably be much less efficient for general cases (with non-copyable generics it might get another chance - example of linked list)

I’m not sure what your intention was with inout here, but classes (reference types) in Swift can be mutated regardless of being declared as var or passed as inout. There is no equivalent to Rust’s let & let mut that applies to all types in the language. In Swift, var has mutation consequences only for value types (structs, enums and non-class protocols). So your Node type is currently freely modifiable by everyone as it is (declaring it as let node = Node(...) will still allow modifications to its properties). You’d probably want to hide write access to its properties if you want to implement Copy-on-Write behavior.
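To illustrate hiding write access: a common pattern is to put the class behind a struct so callers only ever see value semantics. A minimal sketch, where `Counter` and its private `Storage` class are hypothetical names:

```swift
struct Counter {
  // The reference type is private, so callers cannot mutate it directly.
  private final class Storage {
    var count: Int
    init(_ count: Int) { self.count = count }
  }

  private var storage = Storage(0)

  init() {}

  var count: Int { storage.count }

  mutating func increment() {
    // Copy the backing storage only when it is shared with another
    // Counter value; otherwise mutate in place.
    if !isKnownUniquelyReferenced(&storage) {
      storage = Storage(storage.count)
    }
    storage.count += 1
  }
}
```

Copies of the struct share storage until one of them mutates, at which point only the mutating copy pays for a fresh allocation.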

The closest thing that comes to mind is CFGetRetainCount, which is available with the Foundation framework on Apple platforms, and (at least according to the code) is present in GitHub - apple/swift-corelibs-foundation: The Foundation Project, providing core utilities, internationalization, and OS independence, so it should also be available on Linux (Windows?) -- though there is some confusion about the open-source Foundation versions that I'm not familiar enough with. I suppose Rc in Rust works roughly the same way, maybe with mechanisms more native to the language, but I don't know of a Swift equivalent to CFGetRetainCount. So you can use this to replicate conditional-copy behavior when the count is greater than 1.

4 Likes

Ahh… interesting! Did you measure benchmarks? Do you have an example project I could reference? I would like to see if there is something easy to optimize in the CowBox implementation for this. Was it memory performance or CPU performance (or both) that was worse with CowBox?

Hmm… maybe take a look at the Swift Collections TreeDictionary CHAMP tree for some ideas.

1 Like

As @vns mentioned you don't need inout with classes:

static func insert(node: Node<Key, Value>?, entry: Entry<Key, Value>) -> (Node<Key, Value>, Bool)

and this could be further split into two:

static func insert(node: Node<Key, Value>, entry: Entry<Key, Value>)
static func insert(entry: Entry<Key, Value>) -> (Node<Key, Value>, Bool)

Furthermore, I don't see how "guard let" was used in the original code... you'd need at least "guard var" to be able to pass &node to isKnownUniquelyReferenced. If that creates an extra unwanted reference, just don't do it: you can simply check node != nil without an extra let/var binding.
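Worth noting: the standard library also ships an overload of isKnownUniquelyReferenced that takes an Optional directly, so the uniqueness check can run before any let binding adds a reference. A minimal sketch (this `Node` and `mutateInPlaceIfPossible` are stand-ins, not the types from the thread):

```swift
final class Node {
  var value: Int
  init(_ value: Int) { self.value = value }
}

// Capture uniqueness via the Optional overload of
// isKnownUniquelyReferenced *before* unwrapping, since the guard-let
// binding itself adds a reference.
func mutateInPlaceIfPossible(_ node: inout Node?, newValue: Int) {
  let isUnique = isKnownUniquelyReferenced(&node)
  guard let existing = node else {
    node = Node(newValue)  // empty: create the first node
    return
  }
  if isUnique {
    existing.value = newValue  // sole owner: mutate in place
  } else {
    node = Node(newValue)      // shared: leave the original untouched
  }
}
```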

1 Like

I actually missed one detail:

Maybe the intention was to modify the original variable passed from the caller side (not inside this boolean method, which I suppose doesn't need to accept inout after all, but somewhere inside insert itself), in which case it makes sense to have node as an inout parameter.

1 Like

Tangentially [perhaps], many tree-based algorithms benefit a lot from manual memory management in Swift. Irrespective of whether you use structs or classes or other such architectural choices, letting Swift do the memory management is expensive.

See for example this thread.

A facet of this that might apply more directly is to use Unmanaged to manually eliminate some ARC traffic, which might help with erroneous "is not uniquely referenced" results.
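As a tiny illustration of the Unmanaged idea (the `Buffer` type is hypothetical, and this assumes the object is kept alive by another strong reference for the wrapper's whole lifetime):

```swift
final class Buffer {
  var data: [Int] = []
}

// `owner` keeps the object alive; the Unmanaged wrapper carries the
// reference without any ARC retain/release traffic of its own.
let owner = Buffer()
let ref = Unmanaged.passUnretained(owner)

// takeUnretainedValue() hands the object back without consuming a
// retain, so tight traversal loops can shed per-step ARC operations.
ref.takeUnretainedValue().data.append(42)
```

The trade-off is that nothing checks the object is still alive: if `owner` were released first, the access above would be undefined behavior.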

5 Likes

Ah sorry, I should have kept a reproducible example before posting something like that! I don't have one, but IIRC it was mostly CPU performance that I was testing.

Thanks for the CHAMP pointer, I did take a look at that one. There are certainly some useful bits in there (which I did in fact borrow), though many of the manual memory management bits were a little advanced for me at this early stage of my Swift exploration. Obviously worth investing some more time in, though!

1 Like

So in truth, it is actually split into two, so glad to hear that was an ok idea.

I'm a little bit embarrassed that I didn't even try just checking with if node != nil... so thanks for that!

2 Likes

Yeh, I kinda figured this might be the case, thanks for the links!

Wow, thanks all. I'm not used to this level of helpful and insightful activity on forum posts! I might lean harder into Swift just for the community?!

8 Likes

Hmm… my expectation here with the copy-on-write value type (whether it is the CowBox repo, another repo, or writing your own from scratch) is that engineers considering this will:

  • Start with a data model built on simple classic Swift struct value types.
  • Want (or need) to preserve value (immutable) semantics.
  • Want to reduce the memory usage of many copies by leveraging object references under the surface.

If you start with a solution built on Swift classes and are comfortable with (or have defensively coded around) reference semantics… there probably isn't much that CowBox can do to improve your perf (and I could understand that this migration actually harms perf).

If you have a solution built on Swift classes and are considering moving to a copy-on-write value type… my expectation would be that the benefits from migrating to value semantics (the ability to "reason locally" about code) would be important enough from a DevX POV that the perf regressions were an acceptable tradeoff.

That ability to reason locally leads to another benefit. Suppose you are currently blocked from migrating to parallel (concurrent) computing because your reference semantics make writing thread-safe code impractical. Migrating those reference types to (immutable) value types might cause a perf regression in a single-threaded algorithm (where the unsafe behavior of a mutable object reference is less of an issue to defend against), but the migration might still be a net positive if it lets you move the algorithm to parallel concurrency, now that you have immutable value types that are safe to pass across threads.

If you do have a reproducible example project I would still like to take a look if you were interested. But there are going to be situations (this might be one) where continuing with reference semantics (and living with the risks and pain-points of mutable data) might just be the right tool for the job.

2 Likes

Here's the (arguably slow) implementation for reference; I probably should have just shared this earlier (I've stripped it down to a minimal set). This is based on (a variant of) Zip Trees, which is a great paper by the way!

import Foundation

final class Entry<Key: Hashable, Value> {
  let key: Key
  var value: Value

  init(key: Key, value: Value) {
    self.key = key
    self.value = value
  }
}

final class Node<Key: Hashable & Comparable, Value> {
  var entry: Entry<Key, Value>
  let rank: Int
  var left: Node<Key, Value>?
  var right: Node<Key, Value>?

  init(
    entry: Entry<Key, Value>,
    rank: Int,
    left: Node<Key, Value>? = nil,
    right: Node<Key, Value>? = nil
  ) {
    self.entry = entry
    self.rank = rank
    self.left = left
    self.right = right
  }
}

extension Node {
  convenience init(singleton entry: Entry<Key, Value>) {
    self.init(entry: entry, rank: entry.key.hashValue.trailingZeroBitCount + 1)
  }

  convenience init(other: Node) {
    self.init(entry: other.entry, rank: other.rank, left: other.left, right: other.right)
  }

  func get(_ key: Key) -> Entry<Key, Value>? {
    switch key {
    case _ where key < entry.key:
      return left?.get(key)
    case _ where key > entry.key:
      return right?.get(key)
    default:
      return entry
    }
  }

  static func insert(node: inout Node<Key, Value>?, entry: Entry<Key, Value>) -> Node<Key, Value> {
    let isUnique = Swift.isKnownUniquelyReferenced(&node)
    guard let node = node else {
      return Node(singleton: entry)
    }
    let root = isUnique ? node : Node(other: node)
    if entry.key == root.entry.key {
      root.entry.value = entry.value
      return root
    } else if entry.key < root.entry.key {
      let inserted = insert(node: &root.left, entry: entry)
      if inserted.rank < root.rank {
        root.left = inserted
        return root
      } else {
        inserted.right = root
        return inserted
      }
    } else {
      let inserted = insert(node: &root.right, entry: entry)
      if inserted.rank <= root.rank {
        root.right = inserted
        return root
      } else {
        inserted.left = root
        return inserted
      }
    }
  }
}

public struct Tree<Key, Value> where Key: Hashable & Comparable {
  var root: Node<Key, Value>? = nil

  public init() {}

  public func get(key: Key) -> Value? {
    return root?.get(key)?.value
  }

  public mutating func insert(key: Key, value: Value) {
    let entry = Entry(key: key, value: value)
    root = Node.insert(node: &root, entry: entry)
  }
}

A quick test might look something like this, which runs in 15s on my system, but ideally it should be less than half this if it is to be competitive with my Rust version. The 'get' calls appear to be almost negligible in the scheme of things.

import Tree

func insertTest(values: [UInt32]) {
  var tree = Tree<UInt32, UInt32>()
  for (index, value) in values.enumerated() {
    tree.insert(key: value, value: 2 * value)

    let other = values[index / 2]

    assert(tree.get(key: value) == 2 * value)
    assert(tree.get(key: other) == 2 * other)
  }
}

let limit = 25_000
var permutation: [UInt32] = Array(0..<256)
for _ in 0..<limit {
  permutation.shuffle()
  insertTest(values: permutation)
}

I'm still new to Swift and the dev tooling, so I'm having to go through some blogs and docs to learn more about profiling Swift code.

1 Like

My advice would be to try the Benchmarks repo from Ordo One for measuring memory together with CPU.

The swift-collections-benchmark repo is great for measuring CPU as your collection scales in size.

1 Like

Are you running both the Rust and Swift versions as debug or release builds? I assume that at least the Swift tests are running in debug; otherwise the asserts would be removed, so get wouldn't be called at all.

1 Like

Also, I’d try making Entry a struct. It should improve the time, since its instantiation is probably one of the significant time consumers here (you create one on each insert, so in the test you create a class instance 25,000 * 256 times, which is not cheap in Swift, while structs are cheap).

UPD: that turned out to be an invalid assumption; it only made things slower.

1 Like

Both running in release mode. In the actual test I was running, those bits were commented out. But, good callout.

Yes exactly, I did test that (since it was so easy to switch up), but it showed a marked degradation in performance.

1 Like

I see similar results. Here is an example of what an Ordo One Benchmark looks like:

func insertTest<T: Hashable & Comparable & BinaryInteger>(values: [T]) {
  var tree = Tree<T, T>()
  for (index, value) in values.enumerated() {
    tree.insert(key: value, value: 2 * value)
    
    let other = values[index / 2]
    
    precondition(tree.get(key: value) == 2 * value)
    precondition(tree.get(key: other) == 2 * other)
  }
}

import Benchmark

let benchmarks = {
  
  Benchmark.defaultConfiguration.metrics = .default
  Benchmark.defaultConfiguration.timeUnits = .microseconds
  Benchmark.defaultConfiguration.maxDuration = .seconds(86400)
  Benchmark.defaultConfiguration.maxIterations = .count(250_000)
  
  Benchmark("Final Class Entry UInt32") { benchmark in
    var permutation = Array(UInt32(0)..<UInt32(256))
    permutation.shuffle()
    benchmark.startMeasurement()
    insertTest(values: permutation)
    benchmark.stopMeasurement()
  }
  
  Benchmark("Final Class Entry UInt64") { benchmark in
    var permutation = Array(UInt64(0)..<UInt64(256))
    permutation.shuffle()
    benchmark.startMeasurement()
    insertTest(values: permutation)
    benchmark.stopMeasurement()
  }

}

Here is what that looks like running:

Final Class Entry UInt32

Metric p0 p25 p50 p75 p90 p99 p100 Samples
Instructions (K) * 2158 3166 3482 3871 4293 5226 8461 250000
Malloc (total) * 767 1246 1386 1551 1726 2103 3298 250000
Memory (resident peak) (K) 8749 9126 9126 9126 9126 9126 9126 250000
Throughput (# / s) (#) 6351 4151 3749 3347 2991 2397 47 250000
Time (total CPU) (μs) * 159 242 268 299 335 417 1097 250000
Time (wall clock) (μs) * 157 241 267 299 335 418 21309 250000

Final Class Entry UInt64

Metric p0 p25 p50 p75 p90 p99 p100 Samples
Instructions (K) * 2272 3389 3727 4141 4596 5599 8560 250000
Malloc (total) * 752 1251 1396 1570 1756 2161 3165 250000
Memory (resident peak) (K) 8716 9093 9093 9093 9093 9093 9093 250000
Throughput (# / s) (#) 6242 3939 3547 3151 2797 2207 706 250000
Time (total CPU) (μs) * 162 254 282 317 356 451 751 250000
Time (wall clock) (μs) * 160 254 282 318 358 453 1416 250000

I can also attempt a struct entry:

-final class Entry<Key: Hashable, Value> {
+struct Entry<Key: Hashable, Value> {
   let key: Key
   var value: Value
   Benchmark.defaultConfiguration.maxDuration = .seconds(86400)
   Benchmark.defaultConfiguration.maxIterations = .count(250_000)
   
-  Benchmark("Final Class Entry UInt32") { benchmark in
+  Benchmark("Struct Entry UInt32") { benchmark in
     var permutation = Array(UInt32(0)..<UInt32(256))
     permutation.shuffle()
     benchmark.startMeasurement()
     benchmark.stopMeasurement()
   }
   
-  Benchmark("Final Class Entry UInt64") { benchmark in
+  Benchmark("Struct Entry UInt64") { benchmark in
     var permutation = Array(UInt64(0)..<UInt64(256))
     permutation.shuffle()
     benchmark.startMeasurement()

And here is what that looks like running:

Struct Entry UInt32

Metric p0 p25 p50 p75 p90 p99 p100 Samples
Instructions (K) * 2448 4147 4616 5181 5784 7082 10756 250000
Malloc (total) * 479 1014 1157 1325 1502 1870 2892 250000
Memory (resident peak) (K) 8716 9110 9110 9110 9110 9110 9110 250000
Throughput (# / s) (#) 5635 3349 2989 2641 2335 1839 650 250000
Time (total CPU) (μs) * 179 300 335 379 428 543 1041 250000
Time (wall clock) (μs) * 177 299 335 379 429 544 1539 250000

Struct Entry UInt64

Metric p0 p25 p50 p75 p90 p99 p100 Samples
Instructions (K) * 2651 4284 4755 5325 5939 7279 12927 250000
Malloc (total) * 508 989 1131 1296 1474 1854 3311 250000
Memory (resident peak) (K) 8749 9167 9167 9167 9183 9191 9191 250000
Throughput (# / s) (#) 5667 3261 2915 2573 2275 1784 135 250000
Time (total CPU) (μs) * 178 307 343 388 439 559 1084 250000
Time (wall clock) (μs) * 176 307 343 389 440 561 7392 250000
3 Likes

Another idea to try (if you haven't already) is implementing a non-recursive algorithm for insertion.
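In case it helps, here's what a non-recursive insertion loop might look like for a plain binary search tree. This is just the shape of the iteration, not a drop-in replacement: the zip-tree rank/unzipping logic from the thread is omitted, and `BSTNode` is a hypothetical stand-in type.

```swift
final class BSTNode {
  var key: Int
  var left: BSTNode?
  var right: BSTNode?
  init(_ key: Int) { self.key = key }
}

// Iterative BST insertion: walk down from the root, following left or
// right children until an empty slot is found, with no recursion and
// hence no call-stack growth.
func insert(_ key: Int, into root: inout BSTNode?) {
  guard let start = root else {
    root = BSTNode(key)  // empty tree: the new node becomes the root
    return
  }
  var current = start
  while true {
    if key < current.key {
      if let next = current.left { current = next }
      else { current.left = BSTNode(key); return }
    } else if key > current.key {
      if let next = current.right { current = next }
      else { current.right = BSTNode(key); return }
    } else {
      return  // key already present; nothing to insert
    }
  }
}
```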