Dictionary Collision Resolution Guarantees

I’m currently cleaning up the docs on Dictionary to reflect the new indexing model, and I stumbled across the note that the following code should work assuming no reallocations occur:

//
// var (i, found) = d.find(k) // i is associated with d's buffer
// if found {
//   var e = d              // now d is sharing its data with e
//   e[newKey] = newValue   // e now has a unique copy of the data
//   return e[i]            // use i to access e
// }
//

This is effectively assuming that the open-addressing scheme being used is first-come-first-served (FCFS). That is, any element being inserted can *only* be inserted into vacant buckets, rather than displacing existing elements. This is currently only in internal docs, but do we actually want to guarantee this? The popular alternative, Robin Hood hashing (RH), doesn’t follow this.
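To make the FCFS property concrete, here is a minimal sketch of an FCFS insert under linear probing (the occupied/keys/values arrays and the hashFromKey/indexFromHash/nextIndex helpers are hypothetical names for this illustration, not the stdlib’s). The incoming entry only ever claims a vacant bucket, so stored entries never move and existing indices stay valid; a Robin Hood insert would instead swap the in-flight entry with any resident entry that sits closer to its ideal bucket, which is exactly what breaks this property.

// Sketch only: assumes `key` is not already present.
mutating func insertFCFS(_ key: Key, _ value: Value) {
  var i = indexFromHash(hashFromKey(key))
  while occupied[i] {          // probe until the first vacant bucket
    i = nextIndex(i)
  }
  occupied[i] = true           // claim it; no resident entry is displaced
  keys[i] = key
  values[i] = value
}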

Some background: One interesting optimization avenue worth exploring is to tweak Dictionary to store hashes, rather than bits, to identify occupied slots (with some careful masking so that 0 still means “unoccupied”). Then searching for elements can be implemented as follows:

let hash = hashFromKey(key)   // masked to be nonzero; 0 is reserved for "vacant"
var i = indexFromHash(hash)
while true {
  if hashes[i] == 0 {
    break                     // vacant bucket: not contained
  } else if hashes[i] == hash {
    // stored hash matches; maybe already exists?
    if keys[i] == key {
      break                   // actually exists, at bucket i
    }
  }
  i = nextIndex(i)            // hash collision: keep probing
}

The interesting property of this is that almost all of the search time is spent linearly walking through an array and doing simple comparisons on integers, which is of course really easy to optimize (potentially even SIMD). 99.9999% of the time, if the element isn’t stored in the Dictionary, then you’ll just hit a 0 hash, and never look at the keys. Similarly, 99.9999% of the time, if the element *is* stored in the Dictionary, you’ll only do a single key comparison (on the actually equal element). So this is also really great for cache — pretty much all of your accesses are just linear scans of the hashes.

The main downside is you’re now “wasting” more memory to store hashes, but you can potentially compensate for this by truncating the hashes to 8/16 bits for small Dictionaries (which will be better for SIMD anyway).
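As a rough sketch of how truncation could preserve the "0 means vacant" convention (tag8 is a hypothetical helper, not anything in the stdlib): fold the full hash into a nonzero 8-bit tag, and treat a tag match only as a hint that still has to be confirmed by a real key comparison, exactly as in the loop above.

func tag8(fromHash hash: Int) -> UInt8 {
  let t = UInt8(truncatingIfNeeded: hash)
  return t == 0 ? 1 : t   // remap 0 so it can never collide with the "vacant" marker
}

With 8-bit tags, a probe can also compare a whole group of buckets at once. Purely as an illustration, using Swift’s SIMD vector types (which postdate this thread) and an assumed group size of 16:

func groupContains(_ group: SIMD16<UInt8>, tag: UInt8) -> Bool {
  return any(group .== SIMD16<UInt8>(repeating: tag))   // any candidate match in these 16 buckets?
}

func groupHasVacancy(_ group: SIMD16<UInt8>) -> Bool {
  return any(group .== SIMD16<UInt8>(repeating: 0))     // any vacant bucket, i.e. a place to stop?
}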

So what does this have to do with the RH scheme? Compared to FCFS, RH generally leads to lower variance in search distances, and provides a mechanism to short-circuit long runs if you have the hashes handy: if you reach entries whose hashes want to live later in the array than where you started probing, then the element you’re looking for definitely isn’t contained. This means you can increase the load factor on your dictionary and further improve cache/memory usage (compensating for the memory-usage loss from storing hashes). However, it will be prohibitively expensive if you require hashes to be recomputed on the fly.
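For illustration, here is a minimal sketch of that early exit, reusing the hypothetical hashes/keys arrays and helpers from the snippet above, plus an assumed probeDistance(of:at:) that reports how far the entry stored in a bucket has been displaced from its ideal bucket:

func findRH(_ key: Key) -> Int? {
  let hash = hashFromKey(key)
  var i = indexFromHash(hash)
  var distance = 0            // how far we've probed from this key's ideal bucket
  while true {
    if hashes[i] == 0 {
      return nil              // vacant: not contained
    }
    if probeDistance(of: hashes[i], at: i) < distance {
      return nil              // resident is closer to home than we are; under the
                              // RH invariant our key can't appear later in the run
    }
    if hashes[i] == hash && keys[i] == key {
      return i                // actually exists
    }
    i = nextIndex(i)
    distance += 1
  }
}

The same probe-distance comparison drives RH insertion (the "richer" resident gets displaced), which is why RH moves existing elements and conflicts with the FCFS assumption above.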

This is effectively assuming that the open-addressing scheme being
used is first-come-first-served (FCFS). That is, any element being
inserted can *only* be inserted into vacant buckets, rather than
displacing existing elements. This is currently only in internal
docs, but do we actually want to guarantee this?

No, not to users. But "//" comments are not user-level comments and
don't imply any guarantees.

The interesting property of this is that almost all of the search
time is spent linearly walking through an array and doing simple
comparisons on integers, which is of course really easy to optimize
(potentially even SIMD).

Yep, that's cool. It does cost a lot more storage, though. Tradeoffs.

The main downside is you’re now “wasting” more memory to store hashes,
but you can potentially compensate for this by truncating the hashes
to 8/16 bits for small Dictionaries (which will be better for SIMD
anyway).

Hey, cool.

So what does this have to do with the RH scheme? Compared to FCFS, RH
generally leads to lower variance in search distances, and provides a
mechanism to short-circuit long runs if you have the hashes handy: if
you reach entries whose hashes want to live later in the array than
where you started probing, then the element you’re looking for
definitely isn’t contained. This means you can increase the load
factor on your dictionary and further improve cache/memory usage
(compensating for the memory-usage loss from storing hashes).
However, it will be prohibitively expensive if you require hashes to
be recomputed on the fly.

Sounds like a worthy optimization!

···

on Thu Oct 13 2016, Alexis <swift-dev-AT-swift.org> wrote:

--
-Dave

No, not to users. But "//" comments are not user-level comments and
don't imply any guarantees.

OK cool, is there any reason it’s even written down? I don’t see any code
that’s obviously relying on it. (seems fine to delete it?)

···

On Oct 13, 2016, at 6:09 PM, Dave Abrahams via swift-dev <swift-dev@swift.org> wrote:
on Thu Oct 13 2016, Alexis <swift-dev-AT-swift.org> wrote:

It's written down because we've never formalized our index invalidation
rules. I didn't realize that this comment was related to index
invalidation. The guarantee mentioned is one we might want to give, at
least under some circumstances. We'll have to think about that more
carefully. In any case, it's not time to delete it yet.

···

on Thu Oct 13 2016, Alexis <abeingessner-AT-apple.com> wrote:

OK cool, is there any reason it’s even written down? I don’t see any code
that’s obviously relying on it. (seems fine to delete it?)

--
-Dave

For what it's worth, this rule is also explicitly mentioned in docs/IndexInvalidation.rst (https://github.com/apple/swift/blob/master/docs/IndexInvalidation.rst):

"Insertion into a Dictionary invalidates indexes only on a rehash. If a Dictionary has enough free buckets (guaranteed by calling an initializer or reserving space), then inserting elements does not invalidate indexes.

Note: unlike C++'s std::unordered_map, removing elements from a Dictionary invalidates indexes."

(I realize the stuff in /docs is not necessarily official documentation (right?), I just wanted to mention it.)
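A small usage sketch of that rule (purely illustrative; the docs don't include this example): reserve capacity up front, grab an index, insert without triggering a rehash, and the index should remain usable.

var d = Dictionary<String, Int>(minimumCapacity: 16)  // enough free buckets reserved up front
d["a"] = 1
let i = d.index(forKey: "a")!   // index into d's current buffer
d["b"] = 2                      // no rehash expected, since capacity was reserved
print(d[i])                     // per the rule above, i should still be valid: (key: "a", value: 1)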

···

On 14/10/2016 02:46, Dave Abrahams wrote:

It's written down because we've never formalized our index invalidation
rules. I didn't realize that this comment was related to index
invalidation. The guarantee mentioned is one we might want to give, at
least under some circumstances. We'll have to think about that more
carefully. In any case, it's not time to delete it yet.

Thanks for pointing that out; that's important.

···

on Fri Oct 14 2016, Ole Begemann <ole-AT-oleb.net> wrote:

--
-Dave