Could `Data` ship a "fast path" check for identity equality?

Hi! I've been looking around through StdLib and Foundation for types that might expose some kind of "identity equality". This would return true in constant time if we were sure that two values were equal. This would be related to this pitch:

I thought Foundation.Data would be a good candidate… but it looks like the Equatable operator currently checks value equality without an early exit if two Data values are equal by identity:

Which is kind of surprising… do I understand this code correctly? There is no constant-time early return if two Data values are identical?

I checked the blame but it doesn't tell me much:

So there's no super helpful context in there. I also searched for GitHub issues and PR diffs related to Data and Equatable and I couldn't find any more information in there.

I can try a simple benchmark to see the growth:

import CollectionsBenchmark
import Foundation

var benchmark = Benchmark(title: "Benchmark")

benchmark.add(
  title: "Array Equal",
  input: [Int].self
) { input in
  return { timer in
    precondition(input == input)
  }
}

benchmark.add(
  title: "Data Equal",
  input: [Int].self
) { input in
  let d1 = input.withUnsafeBufferPointer {
    Data(buffer: $0)
  }
  let d2 = d1
  return { timer in
    precondition(d1 == d2)
  }
}

benchmark.main()

And here is what that looks like:

The Array equality looks like constant time because the value equality operator has an early return if two Array values are equal by identity. We lack that for Data. Hence the linear growth.

Does anyone know any more history there about that? I would think that Data would be a good candidate for a constant-time check for identity… but maybe there are some legit reasons that didn't ship as an optimization in the value equality operator? Some kind of weird footguns or underwater stones?

5 Likes

This is sad and strange indeed as it is so low hanging fruit and an easy win...


As a side note there was (and still is on Apple platforms!) another gotcha that Data only includes the first 80 bytes (and its size) in the hash value.

3 Likes

Isn’t that fine though? A regular hash (not for cryptographic use cases) isn’t inherently unique, instead it aims to resolve most conflicts in a lightweight manner.

3 Likes

Modern implementations of memcmp will exit immediately (i.e., return zero) if the pointers passed in are identical. So why the linear growth?

I don't think the benchmark is measuring quite what you are hoping for it to measure. The Data(buffer:) initialiser is copying the contents of the passed UnsafeBufferPointer<Int>, which is the reason for the linear growth in the execution time. No such copy is present in the array equality check.


Also, are you sure that the precondition in the array equality check doesn't get optimised out? The less than 1ns equality check seems too fast for an identity check, or maybe I'm underestimating the power of todays computers. For example, this function

func testArrayEquality(_ a: [Int]) {
    precondition(a == a)
}

generates

output.testArrayEquality([Swift.Int]) -> ():
        ret

which is just an empty function call and no equality check is actually performed.

2 Likes

I believe OP was using the benchmark library correctly, the code uses Benchmark.add and it returns a closure, only the execution of this closure should be measured.

1 Like

Ah I totally missed that a closure was returned. Indeed then it is very curious why the linear growth since memcmp should first compare the equality of the passed pointers and in case of equality return early.

That actually came up in the pitch thread:

3 Likes

Oh that is news to me. Then I agree that kind of check could indeed be added.

We can test to confirm the performance of copying these values with and without a mutation applied:

benchmark.add(
  title: "Array Copy",
  input: [Int].self
) { input in
  let lhs = input
  return { timer in
    let rhs = lhs
    blackHole(rhs)
  }
}

benchmark.add(
  title: "Array Copy on Write",
  input: [Int].self
) { input in
  let lhs = input
  return { timer in
    var rhs = lhs
    rhs.append(input.count)
    blackHole(rhs)
  }
}

benchmark.add(
  title: "Array Equal",
  input: [Int].self
) { input in
  let lhs = input
  let rhs = lhs
  return { timer in
    let isEqual = lhs == rhs
    precondition(isEqual)
    blackHole(isEqual)
  }
}

benchmark.add(
  title: "Data Copy",
  input: [Int].self
) { input in
  let lhs = input.withUnsafeBufferPointer {
    Data(buffer: $0)
  }
  return { timer in
    let rhs = lhs
    blackHole(rhs)
  }
}

benchmark.add(
  title: "Data Copy on Write",
  input: [Int].self
) { input in
  let lhs = input.withUnsafeBufferPointer {
    Data(buffer: $0)
  }
  return { timer in
    var rhs = lhs
    rhs.append(contentsOf: [0])
    blackHole(rhs)
  }
}

benchmark.add(
  title: "Data Equal",
  input: [Int].self
) { input in
  let lhs = input.withUnsafeBufferPointer {
    Data(buffer: $0)
  }
  let rhs = lhs
  return { timer in
    let isEqual = lhs == rhs
    precondition(isEqual)
    blackHole(isEqual)
  }
}

Copying Array without a mutation is constant time. The linear time copy takes place when a mutation is applied. And value equality is constant time before a mutation is applied.

Copying Data without a mutation is constant time. The linear time copy takes place when a mutation is applied. And value equality is linear time before a mutation is applied.

I confirm OP results with a totally different test

Simple test
var eqCount = 0
let k = 10000
for i in 0 ..< k {
    let n = 100_000 * I
    let data = Data(repeating: 0, count: n)
    let data2 = data
    let start = Date()
    eqCount += data == data2 ? 1 : 0
    let elapsed = Date().timeIntervalSince(start)
    print(n, elapsed * 1000, "ms")
}
precondition(eqCount == k)

Did it on macOS to see if it's Foundation implementation behaves differently in this case - no, it's still O(n) there.


You could see more details in another thread, but in a nutshell – no, it shouldn't be like this. Besides other things it makes a vector of DoS attack possible, a similar one to what hash randomisation is solving.

var data = Data(repeating: 0, count: 81)
data[data.count - 1] = 1
let a = data.hashValue
data[data.count - 1] = 2
let b = data.hashValue
print(a, b, a == b) // 7025483855299775759 7025483855299775759 true

You could see it here and it is the same behaviour on Apple's version of Foundation used on macOS/iOS.

2 Likes

Thanks for elaborating @tera !

1 Like

Let's fix it!

20 Likes