PSA: Per-instance seeding for Set and Dictionary


(Karoy Lorentey) #1

Quick heads up -- with PR #19533, the standard library will start to use a different hash seed for every Set and Dictionary value. I'm doing this to radically improve performance for union-like operations on large sets and dictionaries; see the PR for details.

The practical upshot is that sets and dictionaries containing the same elements will more frequently contain them in a different order:

// Swift 4.2:
▶︎ Set(0..<10)
[7, 3, 4, 6, 9, 8, 2, 5, 1, 0]
▶︎ Set(0..<10)
[7, 3, 4, 6, 9, 8, 2, 5, 1, 0]

// with PR #19533:
▶︎ Set(0..<10)
[3, 6, 4, 7, 5, 2, 9, 1, 8, 0]
▶︎ Set(0..<10)
[1, 2, 5, 7, 4, 8, 3, 0, 9, 6]

Note that Set/Dictionary never guaranteed that equal instances contained elements in the same order. The order of elements currently depends on the execution hash seed, the table size, as well as the insertion order:

// Swift 1.0 -- 4.2:
▶︎ Set(0..<10)
[7, 3, 4, 6, 9, 8, 2, 5, 1, 0]
▶︎ Set((0..<10).reversed())
[7, 6, 4, 9, 3, 8, 2, 5, 1, 0]     // Spot the differences!

However, in Swift 4.2 and below, performing the same mutations in the same order on two different empty sets/dictionaries produce identical results -- and PR #19533 changes this.

There is a small chance that there is some code out there that relies on Set and Dictionary behaving in a deterministic way. Please let me know if your code might be affected! (It is quite possible we will end up not landing this change, or we may need to revert it later.)

Overview of recent developments in this area:

  • Swift 4.1 and below used the same hash seed for all tables, and exhibited quadratic performance for many operations that filter/merge elements from multiple sets or dictionaries.
  • In Swift 4.2, the hash seed depends on the capacity of the table, as well as the process-wide random 128-bit execution seed. This eliminated the issue for some, but not all, operations.
  • With this change, we're updating the implementation to mix in the storage address instead of the capacity. This eliminates the remaining issues with Set.union, Set.symmetricDifference, Dictionary.merge, and similar operations.

Because we need to guarantee value semantics, we can't use a different hash seed for mutated copy-on-write copies of the same set/dictionary. So even with per-instance seeding, you can still get quadratic behavior, although you'll need to work for it. (We need to guarantee indices remain valid on CoW copies even across insertions, as long as the table doesn't get resized.)


(Jonathan Hise Kaldma) #2

This sounds great. I recently got bitten by the current behavior.

I'd written something like this to compare two dictionaries:

for (a, b) in zip(dictionaryA, dictionaryB) {
    if a == b {
        ...
    }
}

That worked for simple test cases, but of course not for real-life values.

If the order had always been different, I would have found the bug much earlier.