A way to check for unique storage in standard library collections

dabrahams · June 15, 2020, 3:36pm

Consider:

struct Vector<T> {
  var scalars: [T]
  static func +=(lhs: inout Vector, rhs: Vector) {
    for i in lhs.scalars.indices { lhs.scalars[i] += rhs.scalars[i] }
  }
}

This implementation is suboptimal in the case where lhs.scalars shares its storage with another array, because it first copies the scalars in lhs into new storage and then writes over them, accumulating elements from rhs into it. What we really want to be able to do is:

struct Vector<T> {
  var scalars: [T]
  static func +=(lhs: inout Vector, rhs: Vector) {
    if hasUniquelyReferencedStorage(&lhs.scalars) {
      for i in lhs.scalars.indices { lhs.scalars[i] += rhs.scalars[i] }
    }
    else { 
      lhs = lhs + rhs // build new storage once and replace old storage with it
    }
  }
}

Currently the only known way to do the check for unique storage is a horrible hack that probably violates the memory model:

extension ObjectIdentifier {
  /// Returns true iff the object identified by `self` is uniquely referenced.
  ///
  /// - Requires: the object identified by `self` exists.
  /// - Note: will only work when called from a mutating method
  internal func _liveObjectIsUniquelyReferenced() -> Bool {
    var me = self
    return withUnsafeMutablePointer(to: &me) {
      $0.withMemoryRebound(to: AnyObject.self, capacity: 1) {
        isKnownUniquelyReferenced(&$0.pointee)
      }
    }    
  }
}

/// Returns `true` iff the reference contained in `x` is uniquely referenced.
///
/// - Requires: `T` contains exactly one reference or optional reference
///   and no other stored properties or is itself a reference.
func unsafeHasUniquelyReferencedStorage<T>(_ x: inout T) -> Bool {
  unsafeBitCast(x, to: ObjectIdentifier.self)._liveObjectIsUniquelyReferenced()
}

Ideally, we would have a way to ask generically whether all the references in any given type were uniquely-referenced, but as a start, it would be extremely useful just to be able to ask that question about arrays, sets, and dictionaries. My pitch to you is that we expose unsafeHasUniquelyReferencedStorage in the standard library.

The alternative, for those unwilling to use horrible hacks, is to reconstruct all the functionality of the standard library collections on top of ManagedBuffer et. al just so that one is able to access the underlying reference and ask the uniqueness question, and even then, you have a type that doesn't interoperate with Array et. al.

Q: Why not a first-class safe property on Array, Dictionary, Set, and String? A: we could do that, but there are two reasons I pitched the unsafe API. First, this extremely low-level property would clutter the API of these centrally-important types. Second, the unsafe API is more broadly useful. That said, I'm not attached to the form of the answer and would be happy to propose a different variant if consensus dictates that.

Karl · June 15, 2020, 6:13pm

I feel that this is something the compiler ought to be able to do. Maybe it will soon; Array recently moved to use brand-new COW storage built-ins, so hopefully the optimiser will get smarter about how it initialises the new storage, notice that the previous stores are dead, and will directly initialise the storage with the final value.

Maybe @Erik_Eckstein could say whether that might be possible.

dabrahams · June 15, 2020, 6:26pm

This is not something the compiler can possibly do in the general case of arbitrary in-place mutations. There are quite a few mutating algorithms that are too complicated for the compiler to analyze, and even if they weren't, whose best out-of-place implementations are so different from the in-place ones that they can't be derived from one another.

Karl · June 15, 2020, 6:30pm

Okay, that's a more compelling example. I agree it seems useful to know if in-place mutations would require a copy for those algorithms with very divergent paths.

jrose · June 15, 2020, 7:34pm

I'm not even sure your implementation is correct; recall that the underlying storage of Array is a Builtin.BridgeObject (object reference with flags and possibly no retain), not an AnyObject. To that end, I think a standard-named property* on Array et al would be a better direction; anyone implementing their own CoW type should be able to add a similar property without too much difficulty:

struct MyCoWValue {
  private var backingStorage: ManagedBuffer<MyCoWHeader, MyCoWElement>

  var hasUniquelyOwnedStorage: Bool {
    mutating get {
      return isKnownUniquelyReferenced(&self.backingStorage)
    }
  }
}

EDIT: This also handles the case where the struct's unique storage is not the first member.

* or conditional callback, like withContiguousStorageIfAvailable

dabrahams · June 15, 2020, 9:01pm

That doesn't matter for the purposes of this pitch because a correct implementation is easy to build using primitives available to the standard library.

jrose · June 15, 2020, 9:35pm

I suppose it is, since I think any BridgeObject is a valid AnyObject today, but I think the point stands that unsafeHasUniquelyReferencedStorage is "unsafe" in a way that makes the preconditions very hard to satisfy: it is only safe to use on structures whose first field as stored is a valid object reference (possibly optional). That means we'd still be explicitly documenting which types are safe to use with it anyway, which to me means we might as well just have separate properties. It's true that people would be able to use it for their own types that have that layout as well, but I don't see the benefit in providing the unsafe operation when the safe one isn't too hard to implement for each type that cares.

EDIT: This response doesn't address your point about cluttering the API of the types in question. It'd be a small bit of code size overhead, but how about a marker protocol and a top-level generic function? (Or extension on MemoryLayout.) The protocol could even provide a key path to the storage member to be checked.

dabrahams · June 15, 2020, 9:39pm

I wasn't going to mention that, but yes.

Whose only field is a valid object reference or valid optional object reference, with possible spare bits set.

That's a fair argument.

Andrew_Trick · June 16, 2020, 5:14am

The "standard-named property" approach sounds great to me.

lukasa · June 16, 2020, 6:02am

As an extra data point, NIO had a need for this as well on our ByteBuffer type. We ended up implementing this:

    @inlinable
    public mutating func modifyIfUniquelyOwned<T>(_ body: (inout ByteBuffer) throws -> T) rethrows -> T? {
        if isKnownUniquelyReferenced(&self._storage) {
            return try body(&self)
        } else {
            return nil
        }
    }

We came to the conclusion that this would work better than having a property. This was in part because in order to call this method you must be holding the buffer mutably, and so we can guarantee that we are reporting the answer in the context of the operation you’re trying to perform (that is, this approach accounts for varying different accessors).