PSA: The stdlib now uses randomly seeded hash values

(Karoy Lorentey) #1

With PR #14913, the stdlib has switched to a randomly seeded, high-quality hash function for all hashing. (The function is currently SipHash-1-3, but this is an implementation detail that is subject to change.)

“Random seeding” means that hashValue properties will return different values on each execution of a Swift program. This is an important tool for improving the reliability of the Standard Library’s hashing collections, Set and Dictionary. In particular, random seeding enables better protection against (accidental or deliberate) hash-flooding attacks.

This change fulfills a long-standing prophecy in Hashable's documentation:

Hash values are not guaranteed to be equal across different executions of your program. Do not save hash values to use during a future execution.

All Hashable types in the standard library (including primitive types like Bool and Int) now generate randomized hash values:

$ swift
 1> 10.hashValue
$R0: Int = 3189672894122490707
 2> 20.hashValue
$R1: Int = -731278079191967151
 3> true.hashValue
$R2: Int = 1698870856037189238

$ swift
 1> 10.hashValue
$R0: Int = -7576852868862274754
 2> 20.hashValue
$R1: Int = 6522600449632548270
 3> true.hashValue
$R2: Int = 2902202285030183828

I expect hash randomization will have minimal/no impact on the vast majority of existing code. Please let us know by posting here if you have questions or concerns.

Synthesized Hashable implementations are currently still using a lesser form of hashing; I’m preparing an update to change this in PR #15122. Once this gets done, the next step is to consider making the new hashing interface public – we can discuss this in Combining hashes.

Combining hashes
What's the intended behavior of this program that uses a Set as the Sequence it currently is?
(Karoy Lorentey) #2

Update: See below on how to disable hash randomization in more recent builds.

If you need deterministic hash values (for example, to make your test cases repeatable), you can currently override the random seed by setting the _Hasher._secretKey property to a pair of UInt64 values.

_Hasher._secretKey = (0, 0)

Because updating the seed affects all hash values in the current process, you’ll have to do this before the first Set or Dictionary is created. The StdlibUnitTest package has an example of how this may be done – it ensures that the seed is overridden when the first TestSuite is initialized.

Note that _Hasher._secretKey is not intended for use in regular Swift programs, and will probably disappear in a future update. (In all likelihood, it will be replaced by a better interface for overriding the seed.)

(Karoy Lorentey) #3

In certain controlled environments, such as while running certain CI tests, it may be helpful to selectively disable hash seed randomization, so that hash values and Set/Dictionary orderings remain consistent across executions.

In recent trunk development builds, you can disable hash seed randomization by defining the environment variable SWIFT_DETERMINISTIC_HASHING with the value of 1. The Swift runtime looks at this variable during process startup and, if it is defined, replaces the random seed with a constant value.

If you need more control over the seed, please let us know by describing your use case below.

Note that hash randomization has no effect on the performance of Swift programs; however, it is an important factor in the reliability of Set and Dictionary, so generally it should not be disabled.

Additionally, setting SWIFT_DETERMINISTIC_HASHING does not guarantee that hash values remain stable across different versions of the compiler/stdlib – please do not rely on specific hash values and do not save them across executions.

Deterministic "randomness" in Swift
(Pierpaolo Frasa) #4

Apologies for reviving this thread.

We're currently investigating an issue in our application related to our hashing algorithm for certain core data structures. This issue appears randomly, causing random test failures. We've been able to narrow down a test case that fails in maybe every 20th or 30th test run (and otherwise succeeds) and some debugging shows that the issue lies within our hash implementation, which in turn uses the stdlib hash values.

For further analysis, it would be immensely useful to be able to print the hash seed when the failure occurs and to then be able to start a test run with that particular seed, so the failure can be easily reproduced.

Is this something that could be added?

(Karoy Lorentey) #5

Hash replayability did come up a couple of times over the last year. Unfortunately it's generally unfeasible to reproduce the exact Set/Dictionary orderings exhibited by a Swift process, unless SWIFT_DETERMINISTIC_HASHING is turned on.

Details: In Swift 5.0, `Set` and `Dictionary` use per-instance hash seeds, based on the memory address of their storage buffer as well as the global hash seed. Capturing and overriding the global seed in a new process isn't enough: you also need to ensure hashing collections get allocated at the exact same addresses -- which isn't normally practical. So hash tables are unlikely to exhibit repeatable behavior in regular runs, even if we added a mechanism to override the global seed.

Setting SWIFT_DETERMINISTIC_HASHING disables per-instance seeding. So it would make sense to allow the seed to be overridden in this mode (and only this mode) -- so that you could try a series of different seeds until you find one that reliably exhibits the issue in your test suite. We don't have this yet, but it wouldn't be difficult to implement it, and I can see other usecases for such an override, especially in automated testing. I'm very much open to adding it!

However, it would be laborious to debug a crasher through that -- this would really only be useful as a last resort. It sounds like you already have a reproducible (if nondeterministic) crash; getting it to reproduce 100% wouldn't necessarily get you much closer to fixing the issue.

The vast majority of nondeterministic crashes in Set/Dictionary operations that I've seen traced back to one of three underlying problems:

  1. A Hashable implementation that violates hashing requirements. This is typically caused by a hash(into:) (or hashValue) implementation that processes a component that isn't compared in ==. This can cause all sorts of bizarre behavior, including lost/duplicate keys and crashes in various stdlib preconditions. (If you know the types involved, it is usually straightforward to audit the Hashable implementations -- any hashing code that doesn't follow the exact same structure as the corresponding == deserves a closer look.)

  2. A race condition where a set/dictionary variable is shared between multiple threads, and at least one of them mutates it without adequate synchronization. The crash typically occurs when one thread accesses a hash table while another thread is busy resizing the same; this is relatively easy to recognize in crash reports.

  3. (Rarely) a key gets mutated after it was inserted into a hash table. This cannot happen with Swift's value types, but it is possible to accidentally do this when a Dictionary key is a reference type (or happens to contains one).

Supposing that you can rule out the race condition case, the most probable cause is an invalid Hashable implementation. It should be possible to catch it by auditing the Key type (and its components). (Does the crash go away if you replace all hash(into:) implementations with an empty function? If so, you can try isolating the culprit by restoring the original implementations one by one, starting from the top.)

Note: It may be helpful to try running your tests on the latest Xcode beta or a toolchain snapshot. Swift 5's stdlib is a lot better at identifying crashes caused by inconsistent Hashable implementations, especially in debug builds -- if you ever see a crash in a function called KEY_TYPE_OF_DICTIONARY_VIOLATES_HASHABLE_REQUIREMENTS, then it's most likely caused by issue (1) or (3) above. The key type is included in the trap message to aid debugging. (Sadly, Set/Dictionary cannot detect such issues with 100% reliability, so they remain nondeterministic -- but at least you have a better chance of identifying the problem when it does occur.)

(Pierpaolo Frasa) #6

First of all, thank you for the extensive reply.

Our situation is a little bit different than what you are describing in that our app never crashed, it just displayed wrong behaviour (that we could observe in tests). We currently know exactly where the problem occurs and can try to fix it, it's just that it took us a very long time to be able to definitely determine that yes, the underlying cause was indeed related to hashing (as opposed to any other component of the system). If there had been a way to a) print the random seed in case of a test failure and b) set that seed when running the tests, we would have been able to verify this assumption much more quickly. Therefore, for any future situations, this would be helpful.

I understand the point about non-deterministic set/dictionary orderings. In our case, however, it was never about sets or dictionaries. What we did is that we compared the hash values of two different objects in order to quickly determine whether they are equal. Of course, such a comparison will always have false positives, since hash functions generally have collisions. What we were seeing however were collisions on several orders of magnitude more frequently that you would assume given a truly uniform hash function (which, given that it's range contains 2^64 elements on a 64-bit machine, would only occur after seeing about 2^32 elements, according to the Birthday problem). Therefore, simply specifying the seed would have helped determine the root cause much more quickly.

Of course, the problem has nothing to do with Swift's underlying hash implementations, and everything with faults in our own implementation (for example, we wrote it before Swift's Hasher became available in, I think, Swift 4?), it just would have been easier for us to debug with more control over the seed.

(Karoy Lorentey) #7

Interesting! It sounds like SWIFT_DETERMINISTIC_HASHING with a customizable seed override would indeed help a lot with debugging in your case.