I’m currently trying to understand Copy-on-Write (COW) behavior, and I’ve come across two different implementations. I was wondering why some people might prefer the first approach over the second.
Since Swift’s Array already implements Copy-on-Write semantics, the array property in MyStruct_2 should avoid duplicating its content when MyStruct_2 is passed around—until it’s modified.
So, my question is:
Is there a performance or other practical reason to store the array inside a reference type (_Storage) in the first implementation?
If it's just an array, then no, but if MyStruct_1 contains a large number of stored properties, then the implemented approach might have some performance benefit.
I believe you can also fake this without resorting to isKnownUniquelyReferenced(), by using a one-element array instead:
struct CowPoint {
private struct Storage {
var x: Int = 0
var y: Int = 0
var z: Int = 0
}
private var _storage = [Storage()]
var x: Int { get { return _storage[0].x } set { _storage[0].x = newValue } }
var y: Int { get { return _storage[0].y } set { _storage[0].y = newValue } }
var z: Int { get { return _storage[0].z } set { _storage[0].z = newValue } }
}
This is an interesting idea… and it got me thinking: to what extent – if any – could a "regular old" Swift.Array serve as a legit "cow storage" box? Would that be more performant than a solution built from a "roll your own" storage box?
The Swift-CowBox repo (full disclosure: self-promotion) is a set of macros for adding copy-on-write semantics to Swift Structs: it writes the boilerplate for you at compile-time. For the most part, CowBox follows the patterns from Swift NIO[1][2][3]: a custom Storage class is the "Box" that is copied by-reference when the struct container is copied.
The CowBox repo comes with Performance Benchmarks built from the Ordo One Benchmarks package. We focus on three operations for measuring performance:
We create a Swift.Array of N elements.
We create a Swift.Array of N elements, copy the array, and mutate the copy with one new element.
We create a Swift.Array of N elements, copy the array, mutate the copy with one new element, and then test both arrays for equality.
For those three operations, we test against an Element built from a "classic" swift struct:
struct StructElement: Hashable {
let a: Int64
let b: Int64
let c: Int64
let d: Int64
let e: Int64
let f: Int64
let g: Int64
let h: Int64
let i: Int64
let j: Int64
}
And a "CowBox" struct:
@CowBox struct CowBoxElement: Hashable {
@CowBoxNonMutating var a: Int64
@CowBoxNonMutating var b: Int64
@CowBoxNonMutating var c: Int64
@CowBoxNonMutating var d: Int64
@CowBoxNonMutating var e: Int64
@CowBoxNonMutating var f: Int64
@CowBoxNonMutating var g: Int64
@CowBoxNonMutating var h: Int64
@CowBoxNonMutating var i: Int64
@CowBoxNonMutating var j: Int64
}
Here is a new element for testing the performance of an element built from a Swift.Array cow storage:
struct ArrayElement: Hashable {
private struct _Storage: Hashable {
let a: Int64
let b: Int64
let c: Int64
let d: Int64
let e: Int64
let f: Int64
let g: Int64
let h: Int64
let i: Int64
let j: Int64
}
private var _storage: [_Storage]
}
Once we have these three element types defined, we can run our three operations on each element (for nine benchmarks). Let's focus on three metrics: Instructions, Memory, and CPU.
For our experiment, we create arrays of 10 million elements and run each benchmark 100 times. These results are from running on a MBP M2 Pro Max with 96GB memory.
Let's start with measuring instructions:
Instructions
Test
p0
p25
p50
p75
p90
p99
p100
Samples
ArrayElement Create (M) *
7219
7227
7227
7231
7231
7234
7234
100
CowBoxElement Create (M) *
4763
4769
4773
4777
4777
4777
4779
100
StructElement Create (M) *
440
441
441
441
441
441
441
100
Instructions
Test
p0
p25
p50
p75
p90
p99
p100
Samples
ArrayElement Copy (M) *
432
438
440
443
445
447
462
100
CowBoxElement Copy (M) *
378
381
381
381
381
382
383
100
StructElement Copy (M) *
437
438
438
438
439
439
439
100
Instructions
Test
p0
p25
p50
p75
p90
p99
p100
Samples
ArrayElement Equal (M) *
152
152
152
152
152
153
153
100
CowBoxElement Equal (M) *
80
82
82
82
82
83
83
100
StructElement Equal (M) *
316
319
320
320
320
321
322
100
Across the board, our CowBoxElement uses fewer instructions than ArrayElement.
Here are the results for memory:
Memory (resident peak)
Test
p0
p25
p50
p75
p90
p99
p100
Samples
ArrayElement Create (M)
1184
1214
1228
1228
1228
1228
1228
100
CowBoxElement Create (M)
1057
1057
1057
1057
1057
1057
1057
100
StructElement Create (M)
763
807
809
809
809
809
809
100
Memory (resident peak)
Test
p0
p25
p50
p75
p90
p99
p100
Samples
ArrayElement Copy (M)
1278
1294
1302
1302
1302
1302
1302
100
CowBoxElement Copy (M)
1114
1137
1137
1137
1137
1137
1137
100
StructElement Copy (M)
1561
1609
1609
1609
1609
1609
1609
100
Memory (resident peak)
Test
p0
p25
p50
p75
p90
p99
p100
Samples
ArrayElement Equal (M)
1293
1294
1310
1310
1310
1310
1310
100
CowBoxElement Equal (M)
1127
1137
1137
1137
1137
1137
1137
100
StructElement Equal (M)
1609
1609
1609
1609
1609
1609
1609
100
Across the board, our CowBoxElement uses less memory than ArrayElement.
Here are the results for CPU time:
Time (total CPU)
Test
p0
p25
p50
p75
p90
p99
p100
Samples
ArrayElement Create (μs) *
400034
408420
412353
416547
418382
422838
423065
100
CowBoxElement Create (μs) *
240640
250348
250872
251265
251527
253493
253817
100
StructElement Create (μs) *
64299
65765
65962
66519
69140
71369
71391
100
Time (total CPU)
Test
p0
p25
p50
p75
p90
p99
p100
Samples
ArrayElement Copy (μs) *
42760
44499
46006
48071
49775
52527
74258
100
CowBoxElement Copy (μs) *
31069
33505
34144
35062
40075
45941
46178
100
StructElement Copy (μs) *
55292
56459
56885
57344
59113
61374
61536
100
Time (total CPU)
Test
p0
p25
p50
p75
p90
p99
p100
Samples
ArrayElement Equal (μs) *
22184
22839
22938
23069
23167
23806
24130
100
CowBoxElement Equal (μs) *
3396
4354
4395
4436
4477
4813
5473
100
StructElement Equal (μs) *
34959
36733
37356
37650
37814
38928
39466
100
Across the board, our CowBoxElement uses less CPU than ArrayElement. We also see what looks like a big win when comparing two non-identical array instances for equality.
Based on these experiments, a custom-built copy-on-write data storage class seems to perform better than leveraging a Swift.Array for data storage. The Swift.Array approach does mean less code than building copy-on-write semantics by-hand, but the CowBox macro can also build that boilerplate for you.
You can experiment with this for yourself. Here is a patch you can apply on the 1.0.0 commit from Swift-CowBox to set up the new benchmarks:
Swift.Array somewhat unsurprisingly has more overhead because there's additional indirection inside the type implementation to support multiple storage representations, like bridged NSArrays from Objective-C. It's not just a simple wrapper around a heap allocation like a hand-written CoW type would tend to be.
I would be extremely surprised if that was the source of the overhead here. Single branches are very cheap, and the compiler has a lot of knowledge of Array internals, to the point where ContiguousArray sometimes ends up slower than Array.
Step one if someone wants to dig into this would be to get time profiles of the different cases so we can avoid guessing.