How to implement copy-on-write behavior for my own collections?

Swift Arrays, Sets, and Dictionaries are copy-on-write structs. The storage is shared between copies of the collection until one copy modifies the data, then it will receive a private copy that can be modified. I want to do something like that for a collection I’m implementing.

I looked at the Swift stdlib sources and this appears to be implemented using a function called isUniquelyReferenced(). This function appears to do lots of black-magic mojo.

Is there a ‘legit’ way to implement this? If it requires internals, is there a plan to make this an officially supported capability that can be safely used?

isKnownUniquelyReferenced(_:) does the trick and is as “legit” as anything. The documentation discusses how it can be used to implement copy-on-write:

https://developer.apple.com/documentation/swift/2429905-isknownuniquelyreferenced

1 Like

Excellent, thanks!

The docs for isKnownUniquelyReferenced give an example that is not thread safe:

mutating func update(withValue value: T) {
    if !isKnownUniquelyReferenced(&myStorage) {
        myStorage = self.copiedStorage()
    }
    myStorage.update(withValue: value)
}

I thought the standard Swift collections like Array were thread safe and would maintain the value type illusion, even with multiple threads. If so, what do they use for a locking mechanism? I can probably do something with a pthread mutex but I'd like to use whatever the standard library collections are using.

It is thread safe for multiple arrays instances to share the storage. You have either:

  • isKnownUniquelyReferenced returns true: there is only one pointer to the storage and thus it belongs to the current thread, so no need to lock for mutations.
  • isKnownUniquelyReferenced returns false: you can only read the storage because other threads might have access to it (also, other arrays want to preserve their own value); so you get a new storage when mutating and only the current array is affected.

What is not thread safe is to share the same var with multiple threads (unless you synchronize access to that var). That is true for most types, including Array.

3 Likes

The following case is confusing me. It seems like you could have two separate vars but still have a problem. For example, imagine the update above is the array subscript setter.

  • We start with just one var. Thread #1 enters the setter, checks isKnownUniquelyReferenced and sees that it is true, so it proceeds without copy. It gets interrupted, before it actually updates the storage.
  • Thread #2 makes a copy of the value to a second var. The two vars point to the same storage so isKnownUniquelyReferenced is false. Thread #2 reads a value from array[0] and decides to set array[1] to match that value. It calls the setter. It gets interrupted.
  • Thread #1 finishes its update, which sets array[0] in the shared storage.
  • Thread #2 resumes, copies the storage, and updates array[1]. It thinks its first and second elements are equal, but they're not, because it has a "value" that changed out from under it.

Isn't that a problem? Does it break the illusion that an Array is a value type?

This implies that the first var is accessible concurrently by two threads, otherwise how does it make a copy? That's not thread safe.

Instead, Thread #1 thread should make a copy and pass it to Thread #2.

3 Likes

You have two threads accessing the same var here. Thread #2 cannot make a copy of the value without accessing the original var. Swift is not going to protect you against race conditions like this.

3 Likes

Ah, I understand now. Thanks!

To note, from the isKnownUniquelyReferenced(_:) reference:

If the instance passed as object is being accessed by multiple threads simultaneously, this function may still return true . Therefore, you must only call this function from mutating methods with appropriate thread synchronization. That will ensure that isKnownUniquelyReferenced(_:) only returns true when there is really one accessor, or when there is a race condition, which is already undefined behavior.

A lot of stdlib types already assume single-thread access, and leave the handling about race to things like Dispatch. I suppose you could do the same.