PSA: The stdlib now uses randomly seeded hash values

lorentey · March 10, 2018, 12:05am

With PR #14913, the stdlib has switched to a randomly seeded, high-quality hash function for all hashing. (The function is currently SipHash-1-3, but this is an implementation detail that is subject to change.)

"Random seeding" means that hashValue properties will return different values on each execution of a Swift program. This is an important tool for improving the reliability of the Standard Library's hashing collections, Set and Dictionary. In particular, random seeding enables better protection against (accidental or deliberate) hash-flooding attacks.

This change fulfills a long-standing prophecy in Hashable's documentation:

Hash values are not guaranteed to be equal across different executions of your program. Do not save hash values to use during a future execution.

All Hashable types in the standard library (including primitive types like Bool and Int) now generate randomized hash values:

$ swift
 1> 10.hashValue
$R0: Int = 3189672894122490707
 2> 20.hashValue
$R1: Int = -731278079191967151
 3> true.hashValue
$R2: Int = 1698870856037189238
 4>

$ swift
 1> 10.hashValue
$R0: Int = -7576852868862274754
 2> 20.hashValue
$R1: Int = 6522600449632548270
 3> true.hashValue
$R2: Int = 2902202285030183828
 4>

I expect hash randomization will have minimal/no impact on the vast majority of existing code. Please let us know by posting here if you have questions or concerns.

Synthesized Hashable implementations are currently still using a lesser form of hashing; I'm preparing an update to change this in PR #15122. Once this gets done, the next step is to consider making the new hashing interface public -- we can discuss this in Combining hashes.

lorentey · March 10, 2018, 2:00am

Update: See below on how to disable hash randomization in more recent builds.

If you need deterministic hash values (for example, to make your test cases repeatable), you can currently override the random seed by setting the _Hasher._secretKey property to a pair of UInt64 values.

~~_Hasher._secretKey = (0, 0)~~

Because updating the seed affects all hash values in the current process, you'll have to do this before the first Set or Dictionary is created. The StdlibUnitTest package has an example of how this may be done -- it ensures that the seed is overridden when the first TestSuite is initialized.

Note that _Hasher._secretKey is not intended for use in regular Swift programs, and will probably disappear in a future update. (In all likelihood, it will be replaced by a better interface for overriding the seed.)

lorentey · March 16, 2018, 4:24pm

In certain controlled environments, such as while running certain CI tests, it may be helpful to selectively disable hash seed randomization, so that hash values and Set/Dictionary orderings remain consistent across executions.

In recent trunk development builds, you can disable hash seed randomization by defining the environment variable SWIFT_DETERMINISTIC_HASHING with the value of 1. The Swift runtime looks at this variable during process startup and, if it is defined, replaces the random seed with a constant value.

If you need more control over the seed, please let us know by describing your use case below.

Note that hash randomization has no effect on the performance of Swift programs; however, it is an important factor in the reliability of Set and Dictionary, so generally it should not be disabled.

Additionally, setting SWIFT_DETERMINISTIC_HASHING does not guarantee that hash values remain stable across different versions of the compiler/stdlib -- please do not rely on specific hash values and do not save them across executions.

yxckjhasdkjh · February 18, 2019, 3:42pm

Apologies for reviving this thread.

We're currently investigating an issue in our application related to our hashing algorithm for certain core data structures. This issue appears randomly, causing random test failures. We've been able to narrow down a test case that fails in maybe every 20th or 30th test run (and otherwise succeeds) and some debugging shows that the issue lies within our hash implementation, which in turn uses the stdlib hash values.

For further analysis, it would be immensely useful to be able to print the hash seed when the failure occurs and to then be able to start a test run with that particular seed, so the failure can be easily reproduced.

Is this something that could be added?

lorentey · February 20, 2019, 5:28am

Hash replayability did come up a couple of times over the last year. Unfortunately it's generally unfeasible to reproduce the exact Set/Dictionary orderings exhibited by a Swift process, unless SWIFT_DETERMINISTIC_HASHING is turned on.

Details: In Swift 5.0, `Set` and `Dictionary` use per-instance hash seeds, based on the memory address of their storage buffer as well as the global hash seed. Capturing and overriding the global seed in a new process isn't enough: you also need to ensure hashing collections get allocated at the exact same addresses -- which isn't normally practical. So hash tables are unlikely to exhibit repeatable behavior in regular runs, even if we added a mechanism to override the global seed.

Setting SWIFT_DETERMINISTIC_HASHING disables per-instance seeding. So it would make sense to allow the seed to be overridden in this mode (and only this mode) -- so that you could try a series of different seeds until you find one that reliably exhibits the issue in your test suite. We don't have this yet, but it wouldn't be difficult to implement it, and I can see other usecases for such an override, especially in automated testing. I'm very much open to adding it!

However, it would be laborious to debug a crasher through that -- this would really only be useful as a last resort. It sounds like you already have a reproducible (if nondeterministic) crash; getting it to reproduce 100% wouldn't necessarily get you much closer to fixing the issue.

The vast majority of nondeterministic crashes in Set/Dictionary operations that I've seen traced back to one of three underlying problems:

A Hashable implementation that violates hashing requirements. This is typically caused by a hash(into:) (or hashValue) implementation that processes a component that isn't compared in ==. This can cause all sorts of bizarre behavior, including lost/duplicate keys and crashes in various stdlib preconditions. (If you know the types involved, it is usually straightforward to audit the Hashable implementations -- any hashing code that doesn't follow the exact same structure as the corresponding == deserves a closer look.)
A race condition where a set/dictionary variable is shared between multiple threads, and at least one of them mutates it without adequate synchronization. The crash typically occurs when one thread accesses a hash table while another thread is busy resizing the same; this is relatively easy to recognize in crash reports.
(Rarely) a key gets mutated after it was inserted into a hash table. This cannot happen with Swift's value types, but it is possible to accidentally do this when a Dictionary key is a reference type (or happens to contains one).

Supposing that you can rule out the race condition case, the most probable cause is an invalid Hashable implementation. It should be possible to catch it by auditing the Key type (and its components). (Does the crash go away if you replace all hash(into:) implementations with an empty function? If so, you can try isolating the culprit by restoring the original implementations one by one, starting from the top.)

Note: It may be helpful to try running your tests on the latest Xcode beta or a toolchain snapshot. Swift 5's stdlib is a lot better at identifying crashes caused by inconsistent Hashable implementations, especially in debug builds -- if you ever see a crash in a function called KEY_TYPE_OF_DICTIONARY_VIOLATES_HASHABLE_REQUIREMENTS, then it's most likely caused by issue (1) or (3) above. The key type is included in the trap message to aid debugging. (Sadly, Set/Dictionary cannot detect such issues with 100% reliability, so they remain nondeterministic -- but at least you have a better chance of identifying the problem when it does occur.)

yxckjhasdkjh · February 20, 2019, 4:00pm

First of all, thank you for the extensive reply.

Our situation is a little bit different than what you are describing in that our app never crashed, it just displayed wrong behaviour (that we could observe in tests). We currently know exactly where the problem occurs and can try to fix it, it's just that it took us a very long time to be able to definitely determine that yes, the underlying cause was indeed related to hashing (as opposed to any other component of the system). If there had been a way to a) print the random seed in case of a test failure and b) set that seed when running the tests, we would have been able to verify this assumption much more quickly. Therefore, for any future situations, this would be helpful.

I understand the point about non-deterministic set/dictionary orderings. In our case, however, it was never about sets or dictionaries. What we did is that we compared the hash values of two different objects in order to quickly determine whether they are equal. Of course, such a comparison will always have false positives, since hash functions generally have collisions. What we were seeing however were collisions on several orders of magnitude more frequently that you would assume given a truly uniform hash function (which, given that it's range contains 2^64 elements on a 64-bit machine, would only occur after seeing about 2^32 elements, according to the Birthday problem). Therefore, simply specifying the seed would have helped determine the root cause much more quickly.

Of course, the problem has nothing to do with Swift's underlying hash implementations, and everything with faults in our own implementation (for example, we wrote it before Swift's Hasher became available in, I think, Swift 4?), it just would have been easier for us to debug with more control over the seed.

lorentey · February 20, 2019, 11:02pm

Interesting! It sounds like SWIFT_DETERMINISTIC_HASHING with a customizable seed override would indeed help a lot with debugging in your case.

AnuzaMaxima · February 25, 2020, 4:36am

Hello, Lorentey. You asked if hashing randomization matters to me, and unfortunately, I have to answer yes . In my case, I use hashing to determine the identity of a type for a function that relies on that stable hash value between program runs to perform memoization. But with the recent deprecation of hashValue property in Hashable protocol, it is no longer possible to provide the necessary functionality. Is there a way to implement hashing for types that is compatible with Hashable and that can be provided for user to opt?

ps There is nothing like HasherProtocol to make custom hashers.

TheCodingart · January 8, 2021, 9:02pm

To reiterate off of this, we were using hash values to generate unique identifiers for database storage. It seems quite silly that there's no opt out for having randomly generated seeded values per launch...

jrose · January 8, 2021, 10:25pm

It still wouldn't help, since Hasher doesn't promise to use the same hash algorithm going forward. I think there's still a missing API here, but deterministic seeding alone wouldn't solve the problem.

Joe_Groff · January 8, 2021, 10:39pm

And even if the seed were deterministic, and Hasher promised to use the same algorithm forever, types conforming to Hashable also don't promise they'll hash their members in the same way.

The closest alternative for producing a stable persistent identifier that I can think of would be to write a custom Encoder implementation which uses your own hash algorithm; that would allow you to hash any Codable type and produce a stable identifier using an algorithm you control.

Jumhyn · January 8, 2021, 10:50pm

Doesn't that have the same issue as Hashable in that types will not necessarily always encode themselves in the same way? It would be perfectly valid to update my Encodable implementation to use a completely different format AFAICT (and the Decodable implementation could be updated correspondingly to allow migration, even). I don't think any custom Encoder could reasonably protect against that. Or am I missing something?

Joe_Groff · January 8, 2021, 10:52pm

Perhaps, though you'd generally need to be more careful with breaking Codable changes, since its primary purpose is already to allow serializing and deserializing values across processes.

Jumhyn · January 8, 2021, 10:55pm

Right, it makes sense that Codable conformances are (on the whole) likely to be more stable than Hashable conformances, but neither protocol provides the guarantees that I would think are necessary to rely on for something like a stable database identifier.

John_McCall · January 8, 2021, 11:08pm

A pointer-width integer is not a good basis for a presumed-unique database identifier anyway. Among other things, you cannot possibly get the same identifiers on 32-bit and 64-bit platforms.

Joe_Groff · January 8, 2021, 11:16pm

Standard library types like String and Int do at least guarantee their coding behavior, unlike their hashing behavior, because it's effectively ABI for deserializing existing coded archives. So user-defined types have the ability to provide stronger guarantees about their own composed coding behavior than they can for hashing.

lorentey · March 12, 2021, 2:12am

Belated reply: the purpose of Hashable is to enable using Swift types as keys in in-memory hash tables -- such as (but not limited to) the hash tables in general-purpose data structures such as Set and Dictionary.

Seed randomization is unfortunately unavoidable these days -- it provides crucial protection against (accidental or deliberate, but sadly very real) denial of service attacks targeting the hash algorithm. Set and Dictionary must work reliably no matter what data is put in them.

As nicely pointed out by people above, Hashable.hashValue never was, is not, and cannot ever be an appropriate choice for a stable identifier. The documentation has always been quite explicit about this fact. (Unconditionally enabling random seeding in fact had the nice benefit that it made this deadly problem immediately apparent, rather than making it a trap that only eventually triggers (say, after days/weeks/months/years of potential production use, when you or your users install an OS update, upgrade to a minor bugfix release of some dependency, or when you recompile for a different architecture).)

If you want to use a content-based hash as a stable, unique identifier, one good way to do that would be to (1) use a carefully selected cryptographic digest (with up front work to allow migrating to a better one when (not if) it gets broken) and (2) to make absolutely sure you are in full control over exactly what gets hashed and in what order.

By "full control" I mean you yourself probably need to write and carefully audit the code that feeds every byte into the hash function, and that code must never ever change. Existing protocols such as Codable or Hashable are designed to allow conforming types to vary the encoding from one release to another, but even the most trivial change could break the stable identifier use case.

CryptoKit already provides the HashFunction protocol generalizing Hasher for cryptography applications. I believe it would be technically possible to add a CryptographicHashable protocol that used it -- however, it might not be a good idea to do so: it would need to come with semantic requirements about stable encodings that would be very difficult to verify/enforce. If my business was relying on stable digests, I'd be unlikely to trust code I don't directly control to get this right.

BasemEmara · May 19, 2021, 11:19am

I also need a persistent identifier for a struct, which represents a user's request that I store for later so I don't have to run an expensive logic in the future again if the user's "request" hasn't changed (which implies the computed response that's also stored will be the same if run with the same request).

I thought Hashable would be perfect for this, but the random seed was a show stopper. Then also realized Codable does not guarantee the order of properties in the generated output, just that if can be decoded / encoded.

I like @lorentey idea about converting the request into a cryptographic value, but my struct has many nested structs and feels like doing everything by hand for every nested type and property might as well just create a simple identifier manually.

I wondered if Identifiable could help here. If a struct's properties and nested types all conform to Identifiable, wouldn't it make sense for Swift to synthesis Identifiable for the top-level struct if added? As it stands, adding Identifiable to the top-level struct forces me to manually implement the identifier by concatenating all the sub types and property identifiers together which is painful.

It feels like something's missing for generating a persistent identifier for a type even though I have all the pieces in place.

BasemEmara · May 19, 2021, 1:27pm

Although a synthesized Identifiable method would be really cool, I was able to solve my case with Codable. I made my type conform to Codable and Equatable then stored the encoded data. On next run, I'm decoding the stored data then checking if it is equal to the current request.

EricRabil · November 14, 2021, 6:23pm

This solution is very fragile, and will break the moment any changes are made to Hasher's memory layout. I'm not recommending it unless you're in a situation where you truly need reproducible hashes across executions and don't want an external library. Additionally, for the reasons mentioned above, you probably truly do not need this. The hashing algorithm makes no promises to be stable across Swift versions.

This is 50 shades of wrong but I needed runtime seeding of Hashable and there was no public API, so here's my solution:

public extension Hasher {
    init(seed: [UInt8]) {
        guard seed.count == 32 else {
            preconditionFailure("Seed must be exactly 32 bytes")
        }
        
        let buffer = UnsafeMutableRawBufferPointer.allocate(byteCount: MemoryLayout<Hasher>.size, alignment: MemoryLayout<Hasher>.alignment)
        
        for index in 0..<8 {
            buffer[index] = 0
        }
        
        for index in 8..<40 {
            buffer[index] = seed[index - 8]
        }
        
        for index in 40..<72 {
            buffer[index] = 0
        }

        defer {
            buffer.deallocate()
        }
        
        self = buffer.bindMemory(to: Hasher.self).baseAddress!.move()
    }
}

and an example which will yield the same finalized hash across executions:

// Returns an array of random UInt8 that you can pass to Hasher(seed:), if you need to bootstrap a seed for your project
func randomSeed() -> [UInt8] {
    Array(unsafeUninitializedCapacity: 32) { pointer, initializedCount in
        for i in 0..<32 {
            pointer[i] = UInt8(arc4random())
        }
        
        initializedCount = 32
    }
}

let PROJECT_SEED: [UInt8] = [235, 45, 216, 214, 16, 112, 184, 247, 156, 181, 159, 38, 245, 165, 35, 224, 255, 58, 206, 200, 16, 122, 174, 232, 130, 191, 143, 49, 246, 179, 41, 240]

var hasher = Hasher(seed: PROJECT_SEED)

"asdf".hash(into: &hasher)

print(hasher.finalize())