Optimising Set<String> comparisons


(Nial Giacomelli) #1

Many thanks to Tim for suggesting isSubset, which I had somehow missed
while browsing the documentation.

To answer Dave's questions: unfortunately the Set comparisons aren't an
ideal candidate for asynchronous work. The comparisons take place as part
of a CSS-like styling phase, whereby a number of Sets (the stylesheet
definitions) are compared against many thousands of other Sets (the class
lists for each styleable object). While I believe the Int comparisons would
undoubtedly be faster, I'm likely to lose even more time calculating the
hashValues for each Set.

I believe the use of isSubset as well as some result caching (storing
matches for subsequent queries including identical class lists) is likely
to result in a decent speed improvement. Thank you both for taking the time
to respond.

And if anyone has any suggestions for alternative approaches I'm all ears
:slight_smile:

···

On Sun, Nov 13, 2016 at 11:49 PM, David Sweeris <davesweeris@mac.com> wrote:

On Nov 13, 2016, at 1:58 PM, Nial Giacomelli via swift-users < > swift-users@swift.org> wrote:

Using Swift 3 I have a function that's called extremely frequently and is
appearing regularly in Profiler runs. The function takes two Set<String>
instances and simply attempts to determine whether all items from Set A are
present in Set B (it's important to note that Set B may well contain
additional items).

I've attempted to approach the problem in two ways:

let diff = a.subtracting(b)
guard diff.count == 0 else { return false }

And also by simply iterating over the contents of Set A, like so:

for item in a {
if !b.contains(item) {
return false
}
}

Both ultimately end up spending the majority of their time in String._
compareDeterministicUnicodeCollaton(String) -> Int. Which makes sense,
given what I'm doing - but ideally I'd like to come up with a more
efficient way of performing the above check. Swift's String representation
is incredibly robust, but for my needs the strings could be adequately
represented in ASCII encoding. I've also considered storing and comparing
the hashValue of the strings, to speed up comparisons...

Hopefully this is an acceptable question for this mailing list. I'm aware
that this may not be a Swift-specific question and could well be solved
with a more efficient data structure or approach, but I'd really appreciate
some input from more experienced Swift developers :slight_smile:

How often do the sets change? And where do you get them from? I’m
wondering, if there’s enough time between when you get the strings and when
you need to know if setA is a subset of setB, could you somehow compute the
answer asynchronously in a background thread or something? Strictly
speaking it wouldn't be any less work, but if you can move it to where
you’re waiting for user input or something you’ll probably get your answer
sooner.

Regarding how to get the answer using less work in the first place, I
think you’re doing a tiny bit of extra work — just some allocation
overhead, really — by calculating the set difference and checking that its
count == 0, rather than just checking if a.isSubset(of: b). Beyond that,
I’m not really sure… You might be able to do something with the strings’
hash values. I mean, I don’t know if they really “work like that” (I’m
quite foggy on how hash values are calculated and exactly what they’re
suitable for… this might be a horrible idea), but if they do, comparing
hash values (Ints) will definitely be faster than comparing Strings. I did
some quick’n’dirty testing on my machine with some code very much like:

let setA: Set<String> = ["some", "set", "of", "random", "strings"]
let setB: Set<String> = ["some", "other", "set", "of", "strings"]
let hashedA = Set(setA.map{$0.hashValue})
let hashedB = Set(setB.map{$0.hashValue})
setA.isSubset(of: setB)
hashedA.isSubset(of: hashedB)

The actual execution of hashedA.isSubset(of: hashedB) seems to be very
roughly 2x faster than setA.isSubset(of: setB). However, if you include
the overhead of calculating them, using hash values drops to about 5x
*slower*. So unless you’d be reusing the hashed values a lot or
something, I think you’re already doing about the least amount of work I
can think of.

- Dave Sweeris