Most efficient way to check if the set of a sequence == another set?

Let's say we have some set:

let set: Set = [1, 2, 3]

And some array (these are three examples of that array):

let arr1: Array = [1, 2, 3, 3, 2]
let arr2: Array = [1, 2, 3, 3, 8]
let arr3: Array = [1, 2, 1, 2, 2]

Now, we want to check if the set of the array is equal to the set.

Here's one way of doing that:

print(set == Set(arr1)) // true
print(set == Set(arr2)) // false
print(set == Set(arr3)) // false

But let's say this check is done many times per second and set and arr are large.

Question: What's the most efficient way to do this check?

It should work for any Sequence (not just an Array).

The only way to know for sure is to measure various options. But I'd probably start with

array.allSatisfy(set.contains)

That alone won't be enough:

let array = [1, 1, 1]
let set: Set = [1, 2, 3]

I guess there are at least two most efficient ways - depending on wether you focus on speed or memory consumption.
For the latter, iterating through the set and checking if each element is also in the array should be optimal - but of course terrible slow ;-)

2 Likes

Derp. Ignore me.

Speed is more important than memory consumption in my use case.

I guess I’d try

func areEqual<E>(_ set: Set<E>, _ array: [E]) -> Bool {
    var remainder = set
    for e in array {
        guard set.contains(e) else { return false }
        remainder.remove(e)
    }
    return remainder.isEmpty
}
1 Like

Don't mind - it's early in the morning ;-)

I thought so ;-)
May expectation is that it is faster to iterate though… what @mayoff said ;-)

Taking a copy from a set should be faster than creating a completely new set.
If you can restrict requirements, you could also start with a check for equal size: When the set is bigger, the other collection can't be equal… oh, wait: Creating a Set from another collection will eliminate duplicates, right?

That will also create a new set and looks like it might be doing more work than the original naive solution:

let areEqualSets = (set == Set(sequence))

I guess it cannot be done without some extra storage (since it's not enough to just iterate through the sequence and do set.contains on each element, we have to keep track of the "opposite" too, ie that every element of the set is also in the sequence).

So the naive approach might not be that bad after all, but maybe keeping and reusing a mutable set setOfSequence would be faster (than the proposed areEqual method above):

setOfSequence.removeAll(keepingCapacity: true)
setOfSequence.formUnion(sequence)
let areEqualSets = (set == setOfSequence)

Oh, yes, didn't think about that.


We're probably at a point where profiling > speculation (will not investigate this further myself right now though).

let set: Set = [1]
let sequence = [1, 1]

Should be false, shouldn't it?

No, the set of your sequence is equal to your set there (see the title of this thread).

The sequence can contain duplicate elements or not, it doesn't matter for the check we're talking about.

I see… that make's things tougher ;-)
You could transform the reference set into a Dictionary and use this to identify duplicates - but I guess by avoiding the creation of new Sets is the best optimization. Especially when the Set changes often, that Dictionary would be quite a lot overhead.

func checkSequence<S: Sequence, I: Equatable>(_ sequence: S, cache: inout Dictionary<S.Element, I>, identifier: I) -> Bool where S.Element: Hashable {
    var count = 0
    var dups = 0
    for element in sequence {
        count += 1
        if let old = cache[element] {
            if old == identifier {
                dups += 1
            } else {
                cache[element] = identifier
            }
        } else {
            return false
        }
    }
    return count - dups == cache.count
}

let set: Set = [1, 2, 3, 4]
var cache = Dictionary<Int, Int>()
for element in set {
    cache[element] = 0
}

checkSequence([1, 1, 4, 3], cache: &cache, identifier: 1)
// It would be less brittle to use something like the hash of the sequence as identifier - but that needs to be calculated... 
checkSequence([1, 1, 4, 3, 2], cache: &cache, identifier: 2)
checkSequence([1, 4, 3, 1], cache: &cache, identifier: 3)
checkSequence([1, 1, 4, 3, 2, 4], cache: &cache, identifier: 4)
checkSequence([99, 1, 4, 2, 3], cache: &cache, identifier: 5)
1 Like

I might be missing something but I can't see why the dictionary and identification of duplicates would be needed (Sets already takes care of duplicates :slight_smile: ), or be more efficient than simply:

Which works, and can be put in a method like this:

extension Set {
    func isEqualTo<S>(setOf values: S, reusing set: inout Set<Element>) -> Bool
        where S: Sequence, S.Element == Element
    {
        set.removeAll(keepingCapacity: true)
        set.formUnion(values)
        return self == set
    }
}

Used like so:

func test() {
    let set: Set = [1, 2, 3]

    let arr1: Array = [1, 2, 3, 3, 2]
    let arr2: Array = [1, 2, 3, 3, 8]
    let arr3: Array = [1, 2, 1, 2, 2]

    // Original naive version of check (correct result):
    print(set == Set(arr1)) // true
    print(set == Set(arr2)) // false
    print(set == Set(arr3)) // false

    // The above (faster?) version of the check:
    var reusedSet = Set<Int>()
    print(set.isEqualTo(setOf: arr1, reusing: &reusedSet)) // true
    print(set.isEqualTo(setOf: arr2, reusing: &reusedSet)) // false
    print(set.isEqualTo(setOf: arr3, reusing: &reusedSet)) // false
}
test()

No objection:

But I think you know better than most of us that it is often very hard to judge performance of a piece of Swift ;-)

1 Like

Do you have any control over the sequences?

Can you guarantee the elements are Hashable?
What about Comparable?
Equatable?

Can you ensure the sequences conform to RandomAccessCollection?
Can you control the order of the elements (eg. always sorted, or a tree, etc.)?
Can you just use a Set (or a counted set) instead of a generic sequence?

I believe the Set == Set comparison is fast since it should be able to simply compare the whole block of memory. It also looks like that init basically just inserts every item of the array, which should also be pretty efficient.

So really you're only iterating over each array once with an insert, and then making a full comparison. I can't imagine a manual-iteration that would be much-if-any faster than that, even in ideal circumstances.

If speed is an absolute priority, making it highly generic (i.e. compare any Sequence) is probably a mistake. You should instead write concrete implementations tailored to specific types. Not just what collection, but what they contain. Trivial elements will behave differently to ref-counted ones, in some cases the comparison may even vectorize. Shortcuts for sorted types like Range will be available. Do you also have to use a Set? If the data is relatively static, a sorted Array to test against may be faster.

2 Likes

Not sure exactly what you mean, but for most interpretations of this sentence, this isn’t true.

Really? How is it done?

To clarify what I mean, my understanding is it doesn't need to iterate through each item in the Sets and make sure they match. They should be able to just compare the entire thing. Is the internal make-up of a Set not consistent at runtime? Maybe that's where my understanding's incorrect.

You can't just compare memory like that. Some elements might be equal but not the same in memory (i.e. two strings can be equal while being backed by different buffers). And even in terms of elements with guaranteed bitwise equality, the insertion order when building the set can result in different internal layouts.

There are fast paths: if the two sets are backed by the same buffer, they are guaranteed equal. The implementation also checks the counts of the two sets. But then it iterates. There is a potential (unimplemented, probably unwise) fast path that two buffers that are bitwise equal are guaranteed equal, but that is a much more expensive check and it can generate false negatives so you'd always have to iterate.

2 Likes

Ah ok, I guess I assumed with simple data like Integers there were faster paths available. Thanks for the clarification!

In that case this problem seems almost identical to what Set.== has to do.