Copy-on-write implementation choice

Hi everyone,

I’m currently trying to understand Copy-on-Write (COW) behavior, and I’ve come across two different implementations. I was wondering why some people might prefer the first approach over the second.

Here’s the first implementation:

struct MyStruct_1 {
    private final class _Storage {
        var array: [Int64] = []

        func copy() -> Self {
            let copy = Self()
            copy.array = self.array
            return copy
        }

        func append(_ element: Int64) {
            array.append(element)
        }
    }

    private var _storage = _Storage()

    mutating func append(_ element: Int64) {
        if !isKnownUniquelyReferenced(&self._storage) {
            self._storage = self._storage.copy()
        }
        self._storage.append(element)
    }
}

And here’s the second one:

struct MyStruct_2 {
    var array: [Int64]
}

Since Swift’s Array already implements Copy-on-Write semantics, the array property in MyStruct_2 should avoid duplicating its content when MyStruct_2 is passed around—until it’s modified.
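
For example (a minimal sketch of what I mean, with comments describing my understanding of Array's behavior):

let first = MyStruct_2(array: [1, 2, 3])
var second = first      // only a reference to the shared Int64 buffer is copied here
second.array.append(4)  // Array's copy-on-write duplicates the buffer at this point
// first.array is still [1, 2, 3]; second.array is [1, 2, 3, 4]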

So, my question is:

Is there a performance or other practical reason to store the array inside a reference type (_Storage) in the first implementation?

Thanks in advance for your insights!

If it's just an array, then no. But if MyStruct_1 contains a large number of stored properties, the first approach might have some performance benefit: copying the struct then only copies (and retains) a single reference instead of copying every stored property, and the full copy of the fields is deferred until you actually mutate a value whose storage is still shared.

I believe you can also fake this without resorting to isKnownUniquelyReferenced(), by using a one-element array instead:

struct CowPoint {
  private struct Storage {
    var x: Int = 0
    var y: Int = 0
    var z: Int = 0
  }

  private var _storage = [Storage()]

  var x: Int { get { return _storage[0].x } set { _storage[0].x = newValue } }
  var y: Int { get { return _storage[0].y } set { _storage[0].y = newValue } }
  var z: Int { get { return _storage[0].z } set { _storage[0].z = newValue } }
}
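
Copies then behave the way you'd expect (quick sketch; as written you'd also want to add a plain init() {}, since the implicit memberwise initializer ends up private when a stored property is private):

var point = CowPoint()      // assumes an internal `init() {}` has been added
let snapshot = point        // only the one-element buffer's reference is copied
point.x = 42                // Array's copy-on-write duplicates the buffer on this first mutation
print(point.x, snapshot.x)  // 42 0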

Thank you for taking the time to respond 🙂

This is an interesting idea… and it got me thinking: to what extent – if any – could a "regular old" Swift.Array serve as a legit "cow storage" box? Would that be more performant than a solution built from a "roll your own" storage box?

The Swift-CowBox repo (full disclosure: self-promotion) is a set of macros for adding copy-on-write semantics to Swift structs: it writes the boilerplate for you at compile time. For the most part, CowBox follows the patterns from Swift NIO[1][2][3]: a custom storage class is the "Box" that is copied by reference when the struct container is copied.
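
Roughly speaking, the boilerplate that gets written looks like this (a simplified, hand-written sketch of the pattern on a hypothetical Person type, not the literal macro expansion):

struct Person {
  private final class _Storage {
    let id: Int64
    var name: String

    init(id: Int64, name: String) {
      self.id = id
      self.name = name
    }

    func copy() -> _Storage {
      _Storage(id: self.id, name: self.name)
    }
  }

  private var _storage: _Storage

  init(id: Int64, name: String) {
    self._storage = _Storage(id: id, name: name)
  }

  // Read-only access goes straight through the box.
  var id: Int64 { _storage.id }

  // Writes copy the box first if anyone else still holds a reference to it.
  var name: String {
    get { _storage.name }
    set {
      if !isKnownUniquelyReferenced(&_storage) {
        _storage = _storage.copy()
      }
      _storage.name = newValue
    }
  }
}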

The CowBox repo comes with Performance Benchmarks built from the Ordo One Benchmarks package. We focus on three operations for measuring performance (sketched in code after the list):

  • We create a Swift.Array of N elements.
  • We create a Swift.Array of N elements, copy the array, and mutate the copy with one new element.
  • We create a Swift.Array of N elements, copy the array, mutate the copy with one new element, and then test both arrays for equality.
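
In simplified form, the three operations look roughly like this (a sketch; element construction and the Benchmark harness are omitted, and the mutation here overwrites the last element so the two arrays stay the same length and the equality check still has to walk them):

// Operation 1: create an array of N elements.
func createArray<Element>(count: Int, makeElement: (Int64) -> Element) -> [Element] {
  var array = [Element]()
  array.reserveCapacity(count)
  for index in 0..<count {
    array.append(makeElement(Int64(index)))
  }
  return array
}

// Operation 2: copy the array and mutate the copy with one new element.
func copyAndMutate<Element>(_ array: [Element], newElement: Element) -> [Element] {
  var copy = array                    // copy-on-write: no buffer copy yet
  copy[copy.count - 1] = newElement   // the element buffer is copied here
  return copy
}

// Operation 3: copy, mutate, then test both arrays for equality.
func copyMutateAndCompare<Element: Equatable>(_ array: [Element], newElement: Element) -> Bool {
  let copy = copyAndMutate(array, newElement: newElement)
  return array == copy                // false; exercises Element's == for every shared element
}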

For those three operations, we test against an Element built from a "classic" swift struct:

struct StructElement: Hashable {
  let a: Int64
  let b: Int64
  let c: Int64
  let d: Int64
  let e: Int64
  let f: Int64
  let g: Int64
  let h: Int64
  let i: Int64
  let j: Int64
}

And a "CowBox" struct:

@CowBox struct CowBoxElement: Hashable {
  @CowBoxNonMutating var a: Int64
  @CowBoxNonMutating var b: Int64
  @CowBoxNonMutating var c: Int64
  @CowBoxNonMutating var d: Int64
  @CowBoxNonMutating var e: Int64
  @CowBoxNonMutating var f: Int64
  @CowBoxNonMutating var g: Int64
  @CowBoxNonMutating var h: Int64
  @CowBoxNonMutating var i: Int64
  @CowBoxNonMutating var j: Int64
}

Here is a new element type for testing the performance of copy-on-write storage built from a Swift.Array:

struct ArrayElement: Hashable {
  private struct _Storage: Hashable {
    let a: Int64
    let b: Int64
    let c: Int64
    let d: Int64
    let e: Int64
    let f: Int64
    let g: Int64
    let h: Int64
    let i: Int64
    let j: Int64
  }
  
  private var _storage: [_Storage]
}
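
For illustration, field access goes through the single boxed _Storage value (a simplified sketch, not the exact benchmark code; it assumes the extension lives in the same file so the private members are visible):

extension ArrayElement {
  // Box one _Storage value in a one-element array.
  init(a: Int64, b: Int64, c: Int64, d: Int64, e: Int64,
       f: Int64, g: Int64, h: Int64, i: Int64, j: Int64) {
    self._storage = [_Storage(a: a, b: b, c: c, d: d, e: e,
                              f: f, g: g, h: h, i: i, j: j)]
  }

  // Each field reads through element zero; the remaining fields follow the same pattern.
  var a: Int64 { _storage[0].a }
  var b: Int64 { _storage[0].b }
}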

Once we have these three element types defined, we can run our three operations on each element (for nine benchmarks). Let's focus on three metrics: Instructions, Memory, and CPU.

For our experiment, we create arrays of 10 million elements and run each benchmark 100 times. These results are from running on an MBP M2 Pro Max with 96 GB of memory.

Let's start with measuring instructions:

Instructions

Test p0 p25 p50 p75 p90 p99 p100 Samples
ArrayElement Create (M) * 7219 7227 7227 7231 7231 7234 7234 100
CowBoxElement Create (M) * 4763 4769 4773 4777 4777 4777 4779 100
StructElement Create (M) * 440 441 441 441 441 441 441 100

Instructions

Test p0 p25 p50 p75 p90 p99 p100 Samples
ArrayElement Copy (M) * 432 438 440 443 445 447 462 100
CowBoxElement Copy (M) * 378 381 381 381 381 382 383 100
StructElement Copy (M) * 437 438 438 438 439 439 439 100

Instructions

Test p0 p25 p50 p75 p90 p99 p100 Samples
ArrayElement Equal (M) * 152 152 152 152 152 153 153 100
CowBoxElement Equal (M) * 80 82 82 82 82 83 83 100
StructElement Equal (M) * 316 319 320 320 320 321 322 100

Across the board, our CowBoxElement uses fewer instructions than ArrayElement.

Here are the results for memory:

Memory (resident peak)

Test p0 p25 p50 p75 p90 p99 p100 Samples
ArrayElement Create (M) 1184 1214 1228 1228 1228 1228 1228 100
CowBoxElement Create (M) 1057 1057 1057 1057 1057 1057 1057 100
StructElement Create (M) 763 807 809 809 809 809 809 100

Memory (resident peak)

Test p0 p25 p50 p75 p90 p99 p100 Samples
ArrayElement Copy (M) 1278 1294 1302 1302 1302 1302 1302 100
CowBoxElement Copy (M) 1114 1137 1137 1137 1137 1137 1137 100
StructElement Copy (M) 1561 1609 1609 1609 1609 1609 1609 100

Memory (resident peak)

Test p0 p25 p50 p75 p90 p99 p100 Samples
ArrayElement Equal (M) 1293 1294 1310 1310 1310 1310 1310 100
CowBoxElement Equal (M) 1127 1137 1137 1137 1137 1137 1137 100
StructElement Equal (M) 1609 1609 1609 1609 1609 1609 1609 100

Across the board, our CowBoxElement uses less memory than ArrayElement.

Here are the results for CPU time:

Time (total CPU)

Test p0 p25 p50 p75 p90 p99 p100 Samples
ArrayElement Create (μs) * 400034 408420 412353 416547 418382 422838 423065 100
CowBoxElement Create (μs) * 240640 250348 250872 251265 251527 253493 253817 100
StructElement Create (μs) * 64299 65765 65962 66519 69140 71369 71391 100

Time (total CPU)

Test p0 p25 p50 p75 p90 p99 p100 Samples
ArrayElement Copy (μs) * 42760 44499 46006 48071 49775 52527 74258 100
CowBoxElement Copy (μs) * 31069 33505 34144 35062 40075 45941 46178 100
StructElement Copy (μs) * 55292 56459 56885 57344 59113 61374 61536 100

Time (total CPU)

Test p0 p25 p50 p75 p90 p99 p100 Samples
ArrayElement Equal (μs) * 22184 22839 22938 23069 23167 23806 24130 100
CowBoxElement Equal (μs) * 3396 4354 4395 4436 4477 4813 5473 100
StructElement Equal (μs) * 34959 36733 37356 37650 37814 38928 39466 100

Across the board, our CowBoxElement uses less CPU than ArrayElement. We also see what looks like a big win when comparing two non-identical array instances for equality.
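
That equality result is consistent with the copy-on-write box paying off at comparison time: after the copy-and-mutate, almost every element in the two arrays still shares its storage object, so an == that checks reference identity before comparing fields can exit early. A hand-rolled version of that check looks like this (sketch with a hypothetical Box type, not the literal generated code):

struct Box: Equatable {
  private final class _Storage {
    let value: Int64
    init(value: Int64) { self.value = value }
  }

  private var _storage: _Storage

  init(value: Int64) { self._storage = _Storage(value: value) }

  static func == (lhs: Box, rhs: Box) -> Bool {
    // Copies of a Box share the same _Storage instance, so most comparisons
    // after a copy never need to look at the stored fields.
    if lhs._storage === rhs._storage { return true }
    return lhs._storage.value == rhs._storage.value
  }
}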

Based on these experiments, a custom-built copy-on-write storage class seems to perform better than leveraging a Swift.Array for data storage. The Swift.Array approach does mean less code than building copy-on-write semantics by hand, but the CowBox macro can also write that boilerplate for you.

You can experiment with this for yourself. Here is a patch you can apply on the 1.0.0 commit from Swift-CowBox to set up the new benchmarks:


  1. https://www.youtube.com/watch?v=iLDldae64xE

  2. https://www.youtube.com/watch?v=WCUj581Dpec

  3. https://www.youtube.com/watch?v=m9JZmP9E12M


Swift.Array somewhat unsurprisingly has more overhead because there's additional indirection inside the type implementation to support multiple storage representations, like bridged NSArrays from Objective-C. It's not just a simple wrapper around a heap allocation like a hand-written CoW type would tend to be.


I would be extremely surprised if that was the source of the overhead here. Single branches are very cheap, and the compiler has a lot of knowledge of Array internals, to the point where ContiguousArray sometimes ends up slower than Array.

Step one if someone wants to dig into this would be to get time profiles of the different cases so we can avoid guessing.


There's also the fact that the array version has to perform a bounds check on each access.
