That's a very good point. Then instead of an API that exposes identity, we could have an API that compares it: a.isSameIdentity(as: b)
protocol IdentityComparable {
    func isSameIdentity(as other: Self) -> Bool
}

extension Set: IdentityComparable {
    // Reinterpret Set's single-word storage reference as an ObjectIdentifier.
    private var identity: ObjectIdentifier {
        unsafeBitCast(self, to: ObjectIdentifier.self)
    }

    func isSameIdentity(as other: Self) -> Bool {
        identity == other.identity
    }
}

extension String: IdentityComparable {
    // String's 16-byte representation, viewed as two 64-bit words.
    private struct Identity: Hashable {
        var a, b: UInt64
    }

    private var identity: Identity {
        unsafeBitCast(self, to: Identity.self)
    }

    func isSameIdentity(as other: Self) -> Bool {
        identity == other.identity
    }
}
let a: Set = [1,2,3]
let b = a
let c: Set = [1,2,3]
precondition(a.isSameIdentity(as: b))
precondition(!a.isSameIdentity(as: c))
// The C-string initializer copies the bytes at run time, so each long string
// gets its own freshly allocated storage.
let d = String(cString: "short str")
let e = String(cString: "short str")
let f = e
let g = String(cString: "long long string string")
let h = String(cString: "long long string string")
let i = h
precondition(d.isSameIdentity(as: e)) // same short strings are identical
precondition(f.isSameIdentity(as: e))
precondition(!g.isSameIdentity(as: h)) // same long strings are not identical
precondition(i.isSameIdentity(as: h))
Could you help me understand more about the rules for shipping a protocol in the standard library that would then be adopted on a type from Foundation?
I understand that shipping a protocol in the standard library in 6.3 and then adopting that protocol on a type from the standard library in 6.4 is breaking. I checked the documentation and I'm not completely clear what the rules would then be for adopting that protocol on a type from Foundation, like Data. I'm guessing we would want to define the protocol and choose the types from the standard library and Foundation that adopt the protocol all at the same time. Does that sound correct?
@lorentey Could you think of any gotchas or blockers if this protocol ships in the standard library and we then want to adopt that protocol on types from swift-collections? Like TreeDictionary?
You can add new protocols, and new conformances of concrete types to both existing and new protocols, without restriction.
A narrower set of rules governs what can be added to an existing protocol. You can add new requirements with default implementations, and new associated types with defaults, but you can't change an existing protocol's inheritance clause after the fact, even to add a new protocol.
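To make that concrete, a rough sketch (the protocol and method names are just illustrative, borrowed from the code earlier in the thread):

// Shipped in an earlier release:
//     public protocol IdentityComparable {
//         func isSameIdentity(as other: Self) -> Bool
//     }

// Allowed in a later release: a new requirement, as long as it ships with a default.
public protocol IdentityComparable {
    func isSameIdentity(as other: Self) -> Bool
    func isKnownIdentical(to other: Self) -> Bool // new requirement
}

extension IdentityComparable {
    // The default keeps existing conformances compiling and linking.
    public func isKnownIdentical(to other: Self) -> Bool {
        isSameIdentity(as: other)
    }
}

// Not allowed in a later release: changing the inheritance clause, e.g.
// public protocol IdentityComparable: Equatable { ... }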
Sorry for the late reply. Just use a uuid. Example:
import Foundation

public protocol StorageIdentifiable<StorageID> {
    associatedtype StorageID: Hashable
    var storageID: Self.StorageID { get }
}

// The private reference type that holds the container buffer
private class ContainerStorage: StorageIdentifiable {
    private var uuid: UUID?

    var storageID: UUID {
        // creates and stores a unique UUID only if required
        if uuid == nil {
            uuid = UUID()
        }
        return uuid!
    }

    /* ... etc ... */
}

// The public value type that implements the COW machinery
public struct Container {
    private var storage: ContainerStorage
    /* ... etc ... */
}

extension Container: StorageIdentifiable {
    public var storageID: UUID {
        storage.storageID
    }
}
Please let's not expose identities as escapable objects; I fully concur with @Slava_Pestov's note.
I agree there is a legitimate need for quick isIdentical(to:) checks, and I think it would be a good idea to add such methods as public members on any concrete type where it makes sense: types that implement the copy on write optimization, on referential types like UnsafeBufferPointer, etc.
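For instance (a sketch only, reusing the Container naming from earlier in the thread; none of this is an actual stdlib API), a COW type could expose it as a plain concrete member:

public struct Container {
    // The private reference type that owns the buffer.
    private final class Storage { /* ... */ }
    private var storage = Storage()

    /// True if both values share the same underlying storage object.
    /// A false result does not mean the contents differ.
    public func isIdentical(to other: Self) -> Bool {
        storage === other.storage
    }
}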
SE-0447 has already established the name isIdentical(to:). We added these to support simple tests for referential equality without having to conform Span/RawSpan to Equatable. (It would not have been possible to do that, as Equatable currently requires escapability; but even if we fixed that (as we're planning to do), the conformance would be highly ambiguous: some people would read span1 == span2 to compare identities, others would expect it to compare contents. We do intend Span to conform to a collection-like protocol, and Equatable currently carries an expectation for elementwise equality for those. It seemed better to avoid this ambiguity, and instead provide clearly named methods for the two different interpretations.)
I don't really see why we would need to define a standard protocol for this, though. (I'd particularly not like to have to think about types like Bool getting forced to conform to it.) If the protocol has widespread appeal, then it ought to be possible to demonstrate this by shipping it (and building on it) in a package first -- we don't need to immediately jump to defining it in the Standard Library. The package can make use of concrete isIdentical(to:) implementations in the stdlib to define its own conformances, as it sees fit.
The one mainstream benefit of a separate standard protocol I do see would be that it would allow us to use the === operator as an alias for (or in place of) isIdentical(to:), including over wrappers like Optional. Span/RawSpan avoided adding any new overloads of === to prevent triggering a tsunami of similar overloads for other types, which would be unlikely to scale well.
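Roughly, the protocol would let the overloads be written once, generically (a sketch with an illustrative protocol name, not a real declaration):

public protocol IdenticalComparable {
    func isIdentical(to other: Self) -> Bool
}

// One generic overload instead of one overload per concrete type...
public func === <T: IdenticalComparable>(lhs: T, rhs: T) -> Bool {
    lhs.isIdentical(to: rhs)
}

// ...and it extends naturally to wrappers like Optional.
public func === <T: IdenticalComparable>(lhs: T?, rhs: T?) -> Bool {
    switch (lhs, rhs) {
    case (nil, nil): return true
    case let (l?, r?): return l.isIdentical(to: r)
    default: return false
    }
}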
An important alternative to a new protocol is to add an isKnownIdentical(to:) requirement to Equatable, with a default implementation that returns false.
This avoids the (mostly mental) overhead of a whole new protocol, but it does put such an operation into the member namespace of every equatable type, which is quite noisy -- so this too would be best proposed once there is widespread agreement that we need to routinely invoke such operations from generic contexts. (Note that default implementations for protocol requirements tend to get compiled into Swift binaries as part of specialization/inlining, and so it will lead to components of an application disagreeing on the return value of isKnownIdentical. This is not necessarily a problem, but it is a complication.)
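For illustration only -- Equatable itself obviously can't be redeclared here, so this models the shape of that alternative with a stand-in protocol:

public protocol EquatableWithIdentity: Equatable {
    /// True only when the two values are known to be identical;
    /// false means "can't tell cheaply", not "different".
    func isKnownIdentical(to other: Self) -> Bool
}

extension EquatableWithIdentity {
    // Default: we never *know* that two values are identical.
    public func isKnownIdentical(to other: Self) -> Bool { false }
}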
If we are willing to accept the possibility of padding bytes giving us a false negative, it is trivially easy to write a function that checks two values of the same type for identicality -- identicalness?
extension Equatable {
    @inlinable /* @backDeployed etc */
    public func isKnownIdentical(to other: Self) -> Bool {
        withUnsafeBytes(of: self) { lhs in
            withUnsafeBytes(of: other) { rhs in
                lhs.elementsEqual(rhs) // or just memcmp
            }
        }
    }
}
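For a COW type like Array, that default would compare only the immediate representation (the 8-byte buffer reference), so copies that share storage compare as identical -- roughly (expected behavior, assuming the extension above):

let x = [1, 2, 3]
let y = x            // the copy shares x's buffer
let z = [1, 2, 3]    // separately allocated buffer

precondition(x.isKnownIdentical(to: y))   // same buffer reference
precondition(!x.isKnownIdentical(to: z))  // different buffers, even though x == z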
Have we definitely ruled out the utility of that option in this thread? It would be at least as correct as a default implementation that always returned false.
I think the point is that this pitch is trying to have the new method (whatever it's called) return its result in constant time instead of linear. memcmp (or an equivalent) may be heavily optimized but at the end of the day is still something above O(1) and that could rear its head for large types.
Not quite! The complexity of the code I posted is effectively optimal, and is only O(n) over the size of the immediate value, not the size of any backing memory or referenced objects. The code compares the bits/bytes of the two immediate values. Going back to the type used in @vanvoorden's original question… any implementation you could write that correctly detected two instances of Array were identical would have the same algorithmic complexity.
That approach really relies on there being a reference to other memory in the type you're trying to compare (where, yes, it's pretty optimal) - what happens if the type itself is massive, like some huge InlineArray?
I'm not proposing anything, and think this notion of being identical needs to be handled on a type-by-type basis due to different notions of equality and different implementation details - just saying that your extension on every Equatable type above will fall flat on something like InlineArray (if/when it eventually gains Equatable conformance), or even just particularly large structs.
We could set an upper limit on the number of bytes the function will compare, e.g. "the size of a typical CPU cache line" or some other relevant heuristic. Having a reasonable default implementation (whatever that might be) doesn't preclude specializations with different implementations, of course.
Mike, the "isKnownIdentical" implementation above doesn't compare the bytes of the actual value (†)... only the bytes of a "pointer" (like 8 bytes for Array/Dictionary etc. and 16 bytes for String). Don't let memcmp confuse you: here it compares a fixed number of bytes, so it's very much O(1).
(†) unless they are stored inline, like very short strings.
As for large structs / InlineArrays, etc. -- here you are correct, that would compare the whole thing. InlineArrays won't be huge.
And yes, it would be good for the type to participate in this "isIdentical" / "definitelyDifferent" / "can't tell" machinery. For example, "definitelyDifferent" would be very type-specific.
First, the bitwise equality of two objects does not necessarily imply that they are identical. An object's identity sometimes includes the location of its bits in addition to their value. For example, this is the case for noncopyable synchronization constructs like struct Atomic and struct Mutex -- two distinct atomic integers are never identical, even if at some point they both happen to have the same integer value, i.e., their storage consists of the same bits. Some Swift types are not bitwise comparable.
(This can also go in the other way. Two semantically identical Swift values can sometimes differ in their bit patterns due to padding bytes whose values are unspecified and can vary. In general, raw bitwise equality can be used neither to deny nor confirm that two Swift objects are identical.)
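A small illustration of the padding case (hypothetical type, just to show the layout):

struct Padded: Equatable {
    var flag: Bool    // 1 byte, followed by 7 bytes of padding
    var value: Int64  // 8 bytes
}
// MemoryLayout<Padded>.size == 16, but only 9 of those bytes carry the value.
// Two instances with the same flag and value compare equal with ==, yet their
// raw 16-byte representations may differ in the unspecified padding bytes.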
Second, regarding asymptotic complexity of comparing instances using memcmp: each concrete type in Swift has a constant memory layout, known at runtime. To decide whether the bits of two instances of the same type are the same, in the worst case (bitwise equality) we need to access each bit in both objects. This requires accessing/comparing ~2 * MemoryLayout<Self>.size bytes, which gives us a constant upper bound. The time complexity of bitwise instance comparisons in Swift is O(1). This is true for every concrete Swift type, from Bool to InlineArray<1_000_000_000, Int>.
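To spell out that bound, the computation has this shape (a sketch; the helper name is made up):

func bitwiseIdentical<T>(_ a: T, _ b: T) -> Bool {
    withUnsafeBytes(of: a) { lhs in
        withUnsafeBytes(of: b) { rhs in
            // Touches at most 2 * MemoryLayout<T>.size bytes -- a number that is
            // fixed for any concrete T, regardless of which values are passed in.
            lhs.elementsEqual(rhs)
        }
    }
}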
I'm well aware, but such types would either want their own specialization (where Self: ~Copyable vs where Self: Copyable, or however @_rawLayout might shake out) or would not be able to meaningfully conform to this sort of protocol anyway.
I did mention padding bytes, yes. Barring compiler support for a synthesized implementation, I don't think there's much you can realistically do about that in a default/generic implementation, but also the promise here would be "definitely identical" vs. "unknown", so a padding-related false negative is acceptable.
That highlights that the type itself should have the first word on whether two of its values are identical or not.
BTW, what's your view on Jordan's comment? I believe it's equally important to support that ("definitely not equal") in addition to the discussed "definitely equal".
I think we could compare bytes only until there's a difference, so for some InlineArray<1_000_000_000, Int> values we would know the answer after comparing the first element, and for others only after comparing all the elements.
BTW, the way you put it is like saying that "for every concrete N, comparing two strings of length N is O(1)".
The time complexity of memcmp() is linear, not constant. It doesn't matter if the size is encoded as part of the type, part of the value, or supplied out of line.
That's great, but it does not in fact matter. Given a Swift type, the asymptotic time complexity of bitwise comparing two of its instances is O(1) whether or not we stop comparing bytes at the first mismatch.
Asymptotic complexity analysis is concerned with the growth rate of functions. The big omicron notation O(...) gives us an upper bound for how fast a function can possibly grow. An O(1) result is the best possible outcome -- it means that the function does not at all grow "with its input": its output stays within some constant range, no matter what its input is. The function in this case is the time complexity of an algorithm presented as a piece of Swift code. (Time complexity is measured by -- say -- counting the number of CPU operations executed while running it.)
For any given concrete Swift type, Jonathan's code performs at most MemoryLayout<Self>.size units of work, each taking constant time. MemoryLayout<Self>.size is itself a constant value. This makes the asymptotic time complexity of the overall algorithm O(1) -- even in the worst case, the code's time complexity remains completely independent of what values we pass into it.
O(1) is already the best possible outcome of asymptotic complexity analysis; as far as the model is concerned, it cannot be improved upon. Stopping the comparison at the first mismatching byte does not change this result -- it's still O(1).
We can reasonably argue about whether asymptotic analysis is an impractical oversimplification, and whether it is useful at all, given how often it is misunderstood. But O(1) and O(n) have specific meanings; it is highly counterproductive to mix up one with the other.
That's great, but we are not talking about memcmp in general. I'm talking about the asymptotic time complexity of bitwise comparing the representations of two instances of a given Swift type. That is a constant time operation, whether or not we stop at the first mismatching byte, and no matter if we choose to do the comparison using memcmp, URBP.elementsEqual, or if we roll our own loop.
The time complexity of memcmp is O(n) memory accesses, where n is the size_t argument. When memcmp is used to compare the bitwise representations of two instances of a Swift type T, n is MemoryLayout<T>.size. That is a constant value.
It does not matter if the constant is 1 or 8_000_000_000; it is still a constant.
The size passed to memcmp is not variable. It does not change depending on what two instances we choose to compare. It is always the same. It is unvarying, fixed, immutable. It gives us a constant upper bound.
It does not matter that Jonathan's code defines a generic function -- complexity analysis is concerned about the behavior of a single generic instantiation at a time.
(If y'all think comparing the byte representations of two instances of a Swift type is a "linear-time" operation, then so is passing a value to any Swift function -- because it can involve loading its full representation into specific registers or copying it into a temporary stack location. That and the withUnsafeBytes(of:) invocations in Jonathan's code often result in binaries that make several full copies of the input arguments. A clever enough optimizer might get rid of those, but I do not think ours is clever enough for that -- at least not in general.)