How to check two `Array` instances for identity equality in constant time?

Slava_Pestov · April 2, 2025, 8:59pm

It’s only a source break with the legacy “pre-any” P existential type syntax. Once any P was introduced you could form an existential type for any protocol (get it?) regardless of whether it has associated types or Self requirements, so it’s not source breaking to change a protocol in this way as long as any usage is enforced.

vanvoorden · April 2, 2025, 9:14pm

I can see how that makes sense for the case where a new requirement would be the first use of Self. I think my question was then more specifically if:

a new requirement uses Self
an existing requirement already uses Self

That's what wasn't clear for me reading the evolution doc. If those two conditions hold would that then not be a source breaking change… including for product engineers building on the legacy existential syntax:

struct S: Equatable { }
let eq: Equatable = S()

Slava_Pestov · April 2, 2025, 9:17pm

We never accepted bare Equatable as a type so that’s fine too.

vanvoorden · April 2, 2025, 9:24pm

Hmm… this code builds for me with warnings instead of errors from 6.1 and Swift 6 language mode:

func f(_: Equatable) { }
          `- warning: use of protocol 'Equatable' as a type must be written 'any Equatable'; this will be an error in a future Swift language mode

func main() {
  struct S: Equatable { }
  let eq: Equatable = S()
          `- warning: use of protocol 'Equatable' as a type must be written 'any Equatable'; this will be an error in a future Swift language mode
  f(eq)
}

Is the implication that this code should not break if the Equatable protocol was changed with that new requirement?

Slava_Pestov · April 2, 2025, 9:33pm

I see, it was downgraded to a warning. Yes, adding new requirements to Equatable will not change existing code. However, I don’t think a “probably equal” requirement belongs on Equatable. There are many possible ways to define “probable equality” and picking one “official” choice of axioms is difficult.

tera · April 3, 2025, 1:07am

Maybe worth making it a base of Equatable? Being a separate protocol it could then be used without (and unrelated to) Equatable.

// bike shed name

protocol ProbablyEqual {
    func probablyEqual(_ other: Self) -> Bool
}

protocol Equatable: ProbablyEqual { ... }

Slava_Pestov · April 3, 2025, 1:16am

Unfortunately inserting a new base protocol is an ABI break.

tera · April 3, 2025, 1:23am

BTW, "probably equal" name doesn't convey the right message. What is the "not" of it? "definitely equal"? "definitely different?". With "definitely equal" there's no confusion.

Example of using DefinitelyEqual to speed Set.contains operation up(pseudo code):

func contains(_ element: Element) -> Bool {
    if count < 100 { // threshold
        for v in all set elements {
            if element.definitelyEqual(v) { return true }
        }
    }
    // traditional path, calculate hash, etc
    ....
}

vanvoorden · April 3, 2025, 1:34am

Hmm… I was kind of leaning in the direction of maybeDifferent or mightHaveChanged:

false implies these two instances must be "equal" (in quotes)
true does not imply anything about "equality"

And then the "equality" concept means something more general than value-equality if this lives outside of Equatable. My use-cases so far are all for putting this on types that already adopt Equatable but also have some kind of implementation detail (like CoW storage) that make this kind of function a O(1) operation.

tera · April 3, 2025, 1:54am

Similarly for Array's contains (pseudocode):

func contains(_ element: Element) -> Bool {
    for v in all array elements {
        if element.definitelyEqual(v) { return true }
    }
    // traditional path:
    for v in all array elements {
        if element == v { return true }
    }
}

The initial quick path that does a shallow check won't bring element contents into cache, could be a win in some situations.

watch the hands:
- "maybe different" ~> "maybe not equal" (obviously)
- "maybe not equal" ~> "maybe equal" (right?)
- "maybe equal" ~> "probably equal" (semantics)

and then we are back to what would it mean when something is "not ("probably equal")"

Slava_Pestov · April 3, 2025, 2:10am

tera:

Similarly for Array's contains (pseudocode):

func contains(_ element: Element) -> Bool {
    for v in all array elements {
        if element.definitelyEqual(v) { return true }
    }
    // traditional path:
    for v in all array elements {
        if element == v { return true }
    }
}

I think if you’re optimizing a contains() check at this level you really want to build a specialized data structure for your application instead of relying on magical behaviors of Array.

Always performing this definitelyEqual check for an Element type of Int for example is just pure overhead.

tera · April 3, 2025, 2:20am

unless there's some (very quick) check along the lines of:

// the above shallow loop

// then:
if element type is purely bitwise comparable {
    return false // we are done, nothing else left to check
}

// traditional path:
...

OTOH the "not" of "definitely equal" could be also confusing...

What is the opposite of "we know it's true"...
formal logic suggests "we don't know whether it's true or lie" but it could (and would) be confused with "we know it's a lie".

Slava_Pestov · April 3, 2025, 2:31am

tera:

unless there's some (very quick) check along the lines of:

// the above shallow loop

// then:
if element type is purely bitwise comparable {
    return false // we are done, nothing else left to check
}

// traditional path:
...

A hypothetical fast path for bitwise comparable types could avoid protocol witness calls inside the loop altogether, and would directly compare bytes in a vectorized loop. So it’s certainly plausible to have an implementation of contains() or similar that would check the condition first, and then execute one of two loops depending on the outcome.

tera · April 3, 2025, 2:49am

Quick and dirty test suggests huge wins are possible in certain scenarios.

struct S: Equatable {
    var x: Int
    var y: Int
}

extension Array where Element == String {
    func contains2(_ element: Element) -> Bool {
        let a = unsafeBitCast(element, to: S.self)
        for v in self {
            let b = unsafeBitCast(v, to: S.self)
            if a == b {
                return true
            }
        }
        for v in self {
            if element == v { return true }
        }
        return false
    }
}

let strings = (0 ..< 10_000).map { i in
    var a = String(repeating: "a", count: 10_000)
    a.append("\(i)")
    return a
}
var a = strings[9000]

var start = Date()
let x = strings.contains(a)
let elapsed1 = Date().timeIntervalSince(start)
start = Date()
let y = strings.contains2(a)
let elapsed2 = Date().timeIntervalSince(start)
start = Date()
print(elapsed1 / elapsed2) // 1623

vanvoorden · April 3, 2025, 6:13pm

The other direction I was thinking was instead of figuring out how to express some concept of "might be changed" or "most certainly did not change" we could express "is identical".

For a copy-on-write data structure like Array this naturally can map to both values wrapping "the same" storage reference. At this point… we are kind of "leaking" an implementation detail. Is it right for a product engineer to know this "value" type is actually kinda sorta also a reference behind the scenes? There's pros and cons… but it does help clean up the ambiguity about "might be changed".

If the API exposed isIdentical it should hopefully be clear what true and false are trying to express. The product engineer could then use that true or false to express their "might be changed" or "definitely equal" semantics in their most preferred style.

There is also some public-but-underscored precedent in standard library. This function on String seems to work similar to what we are talking about here:

github.com/swiftlang/swift

stdlib/public/core/String.swift

swift-6.1-RELEASE


      
          extension String {
            /// Returns a boolean value indicating whether this string is identical to
            /// `other`.
            ///
            /// Two string values are identical if there is no way to distinguish between
            /// them.
            ///
            /// Comparing strings this way includes comparing (normally) hidden
            /// implementation details such as the memory location of any underlying
            /// string storage object. Therefore, identical strings are guaranteed to
            /// compare equal with `==`, but not all equal strings are considered
            /// identical.
            ///
            /// - Performance: O(1)
            @_alwaysEmitIntoClient
            public func _isIdentical(to other: Self) -> Bool {
              self._guts.rawBits == other._guts.rawBits
            }
          }

AFAIK we are also planning to ship a isIdentical function on Span and RawSpan from SE-0447.^[1]

swift-evolution/proposals/0447-span-access-shared-contiguous-storage.md at main · swiftlang/swift-evolution · GitHub ↩︎

Loooop · April 4, 2025, 9:11am

I think all structs that use the Copy on Write mechanism should conform to a protocol that provides the identity of the internal memory in which the data is stored.
Something like this:

protocol StorageIdentifiable<StorageID> {
	associatedtype StorageID : Hashable
	var storageID: Self.StorageID { get }
}

extension Array:StorageIdentifiable {
	var storageID: ObjectIdentifier {
		withUnsafeBytes { unsafeBitCast( $0.baseAddress, to: ObjectIdentifier.self) }
	}
}

/*
extension Set:StorageIdentifiable {
	var storageID: /* not possible today */
}

extension Dictionary:StorageIdentifiable {
	var storageID: /* not possible today */
}
*/

let a = [1,2,3]
let b = a

print("Same storage = \( a.storageID == b.storageID )")
// print "Same storage = true"

In this way, not only does it become easy to perform comparisons such as the one required not only for arrays, but also for sets, dictionaries, etc., but, above all, to implement file persistence without saving the contents of container structures that refer to the same data multiple times.

vanvoorden · April 4, 2025, 5:41pm

It's an interesting idea… but also keep in mind that CoW is not exactly a contract that needs to be visible externally… it's an implementation choice on the part of the infra engineer. And some types that do CoW on "large" values might choose to save "small" values on the stack as a performance optimization. So a type like Data might not always have the ability to return a storageID.

vanvoorden · April 4, 2025, 5:43pm

This is great! My immediate and practical use-case example for wanting this API were for product engineers to use this check as an optimization going into SwiftUI… but if there are side-quests that can refactor internal standard library code that SGTM.

tera · April 4, 2025, 5:57pm

Why is it not possible today for sets / dictionaries?

Looks doable. Example with Set and String, Dictionary would be similar to Set:

protocol Identifiable<Identity> {
    associatedtype Identity : Hashable
    var identity: Self.Identity { get }
}

extension Set: Identifiable {
    var identity: ObjectIdentifier {
        unsafeBitCast(self, to: ObjectIdentifier.self)
    }
}

extension String: Identifiable {
    struct Identity: Hashable {
        var a, b: UInt64
    }
    var identity: Identity {
        unsafeBitCast(self, to: Identity.self)
    }
}

let a: Set = [1,2,3]
let b = a
let c: Set = [1,2,3]
precondition(a.identity == b.identity)
precondition(a.identity != c.identity)

let d = String(cString: "short str")
let e = String(cString: "short str")
let f = e
let g = String(cString: "long long string string")
let h = String(cString: "long long string string")
let i = h

precondition(d.identity == e.identity) // same short strings are identical
precondition(f.identity == e.identity)
precondition(g.identity != h.identity) // same long strings are not identical
precondition(i.identity == h.identity)

Slava_Pestov · April 4, 2025, 7:40pm

The problem is that once you form the Identifier there is nothing keeping the original buffer alive, so it might be deallocated underneath you, in which case a subsequent allocation might reuse the same address and result in a false positive.