Does anyone know what is the most efficient way to check if an array contains duplicate elements?
Something better than eg Set(arr).count < arr.count
?
Does anyone know what is the most efficient way to check if an array contains duplicate elements?
Something better than eg Set(arr).count < arr.count
?
Well, any such algorithm must be at least O(n), since if the elements are unique, you still have to check them all to be sure.
I have it on good authority that heapify
can be performed in O(n), so perhaps build a heap and check each element as it is inserted to see if it is a duplicate, and exit early if so.
If the array is sorted, then simply iterate over the elements, checking to see if the next element is the same. This has a time complexity of O(n).
if the array is not sorted, then iterate over the elements, checking a dictionary to determine whether the dictionary contains the element. If not, add the element. This has a time complexity fo O(n).
A set will do, too
Earlier replies have suggested using Set
, Dictionary
, or heaps, all of which can get you to O(n) performance. The former two require conformance to Hashable
whereas the latter requires Comparable
.
With Dictionary
, you can do one trick that'll improve over your Set(arr).count < arr.count
, return early in the case of duplicate elements:
enum DuplicateError: Error { case dupe }
extension Array where Element: Hashable {
var isUnique: Bool {
do {
_ = try Dictionary(self.map { ($0, ()) }, uniquingKeysWith: { _, _ in
throw DuplicateError.dupe
})
return true
} catch {
return false
}
}
}
(Hope I got it right, typing on my phone.)
Edit: With Set
, there's this neat trick:
extension Array where Element: Hashable {
var isUnique: Bool {
var seen = Set<Element>()
return allSatisfy { seen.insert($0).inserted }
}
}
Thanks, here's what I ended up using:
extension Sequence where Element: Hashable {
/// Returns true if no element is equal to any other element.
func isDistinct() -> Bool {
var set = Set<Element>()
for e in self {
if set.insert(e).inserted == false { return false }
}
return true
}
}
very similar to yours although more verbose, I wonder if allSatisfy
and its closure has any cost.
Compare in godbolt, but I strongly doubt there will be any difference. The closure is non-escaping, so it can be inlined.
A possible improvement would be to only see if a hash value has been seen before, instead of having to insert whole objects into a Set (bonus: better space complexity):
extension Array where Element: Hashable {
var isUnique: Bool {
var seen = Set<Int>()
return allSatisfy { seen.insert($0.hashValue).inserted }
}
}
Here is a benchmark.
Hash values are not guaranteed unique.
Downside: It isn't actually correct, because unlike with cryptographic hashes, Swift's hashValue
may collide in practice.
It's not mentioned in the documentation. Do you have some official docs on that?
It's the nature of hashes. Uniqueness is never guaranteed. In the case of hashValue
, you want to look here.
The short of it is, two objects which compare equal must have the same hashValue
, but two objects which hash to the same value may be unequal.
Neither Set
nor Dictionary
assume hashes are unique. In such data structures, the hash is a starting point. Items with matching hashes are always compared for equality. Due to the requirement (in Foundation) that equal items hash the same, the implementations can assume that unequal hashes mean unequal items, but the reverse can't be assumed. Unequal items may very well hash the same.
This is getting to be rather off-topic for this thread. If you are still confused about the relationship between Hashable
and Equatable
, please start a new thread.
No.
There was some discussion of this in an older thread: Remove duplicate elements from a collection - #23 by Torust.
I'm going to restate what I said there: if you have a small array of simple types, the trivial O(n^2) operation might just end up being the fastest. Otherwise, I'd expect sorting or using a Set
to be your best options, as has already been discussed.
Cryptographic hashes can also collide. There's no way to represent kn
bits of data uniquely using only an n
bit hash. For every possible n
bit hash value, k
different values will collide (on average, for a mythical perfectly uniform hash function).
Think of hashes like police descriptions of suspect vehicles. They're a heuristic for equality. If a car thief is reported at large, driving in a grey Toyota Corolla, and I own a Toyota Corolla, does that necessitate that I'm the thief? Of course not. Police offers will pull me over, and verify using more detailed attributes (my facial appearance, name, social insurance number, date of birth, VIN, etc.). That's a test for equality.
All a hash can do is exclude certainly non matching options, so you don't needlessly "go into the details" of obvious non-matches. A cop, knowing he's looking for a grey Toyata Corolla, doesn't have to pull over and interrogate the driver of a pink Lambourghuini.
At best, a hash can point you in the right direction, but you still need to do the fine level verification of the identity of a desired item, to be sure it's what you're looking for, lest you accidentally pick something that's unequal but with a colliding hash value.
Not because of the current case, anyway.
You're quite right, but in practice they hardly do. I was simply trying to say that at 64 bits, Swift's hashValue
is somewhat prone to birthday collisions. And more so because it isn't particularly documented to be cryptographic in particular (but rather it's designed with hashed collections in mind).