PSA: The stdlib now uses randomly seeded hash values

Please do not do this; it will eventually break. To get a Hasher with your own custom seed, all you need to do is combine the seed into a stock Hasher instance.

public extension Hasher {
  init(seed: Int) {
    self.init()
    self.combine(seed)
  }
}

var hasher = Hasher(seed: 42)
"asdf".hash(into: &hasher)
print(hasher.finalize())

When combined with the SWIFT_DETERMINISTIC_HASHING=1 environment variable, this will produce repeatable hash values.

Can you tell us a bit more about why you needed to do this?


Try this: Saklad5/Identifier on GitHub, a Swift package that provides a type-safe identifier for use with Identifiable.

It’s a simple wrapper generic over a type of your choice (Self, most likely). As for the underlying ID, I recommend a UUID.

One option, depending on your use case, might be to switch to using OrderedDictionary and OrderedSet, found in Swift Collections. My problem was that sets and dictionaries do not provide deterministic output when you serialize them, something I discovered when taking cryptographic hashes. While I understand why the stdlib uses randomly seeded hash values, etc., having non-deterministic serialization seems less than ideal. That is, it is not possible to compare two datasets using a cryptographic hash to see if they are the same, assuming that they contain standard data structures like dictionaries and sets.


OrderedDictionary and OrderedSet are excellent recommendations, if you are able to guarantee a stable insertion order.

To generate repeatable output, it’s usually best to explicitly copy the items into an Array and manually sort them before serialization.
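As a rough illustration of the copy-and-sort approach, here is a minimal sketch. The helper name stableJSON(of:) is made up for this example, and it assumes the elements are Comparable so that sorting gives one well-defined order:

```swift
import Foundation

// Hypothetical helper: serialize a Set in a repeatable order by copying its
// elements into a sorted Array first. The Set's own iteration order depends on
// per-process hash seeding, so it is never encoded directly.
func stableJSON<T: Hashable & Comparable & Encodable>(of set: Set<T>) throws -> String {
  let sortedElements = set.sorted()
  let data = try JSONEncoder().encode(sortedElements)
  return String(decoding: data, as: UTF8.self)
}

let tags: Set<String> = ["delta", "alpha", "charlie", "bravo"]
print(try stableJSON(of: tags))  // ["alpha","bravo","charlie","delta"]
```

The same idea applies to a Dictionary: sort its key/value pairs by key and encode the resulting array, not the dictionary itself.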

Be careful around keyed encoders/decoders in Codable — in many serialization formats, the corresponding representations (such as JSON objects) are explicitly documented as unordered, and the ordering of items may not be preserved during serialization/deserialization. In some cases, the order of keyed items may even change during transport. This makes keyed encoders tricky to work with if you need to produce reproducible output.

Note: SortedDictionary and SortedSet are planned additions to the Swift Collections package that will provide deterministic ordering even if items get inserted in non-deterministic order.


While technically correct, this concern and the attendant implementation of Codable for the Ordered* collections rather misses the point. The vast majority of JSON implementations provide de facto order stability even though the standard doesn't guarantee it. Some backend implementations go so far as to require consistent ordering, and some even give significance to relative ordering. Ordered*'s insistence on technical correctness in that regard makes it far less useful than it should be, and leaves most users entirely without the solution Ordered* was intended to provide.


JSONEncoder has a sortedKeys option in outputFormatting, which should give sorted results:

["a": 1, "b": 1]

although if there are floating-point numbers, the two results can still differ (from a strcmp perspective):

["a": 1.00000001, "b": 1]

(and strictly speaking JSON format doesn't differentiate ints and floats, so beware).
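For reference, a minimal sketch of the sortedKeys option in use. The helper name sortedJSON is invented for this example; JSONEncoder and .sortedKeys are the real Foundation APIs:

```swift
import Foundation

// Hypothetical helper: encode any Encodable value with keys sorted, so the
// output no longer depends on Dictionary iteration order.
func sortedJSON<T: Encodable>(_ value: T) throws -> String {
  let encoder = JSONEncoder()
  encoder.outputFormatting = [.sortedKeys]
  let data = try encoder.encode(value)
  return String(decoding: data, as: UTF8.self)
}

print(try sortedJSON(["b": 1, "a": 1]))  // {"a":1,"b":1}
```

This only fixes key order. How floating-point values are spelled in the output is still up to the encoder, so byte-for-byte comparison of such payloads remains fragile.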

OrderedSet/OrderedDictionary are general-purpose implementations that need to avoid making assumptions that aren’t guaranteed; however, if you need to customize the Codable implementation of these (or any other) types, I highly recommend wrapping them in a custom struct:

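import OrderedCollections  // from the swift-collections package

// Element must also be Codable for the wrapper to encode/decode its items.
struct CustomOrderedSet<Element: Hashable & Codable>: Codable {
  var value: OrderedSet<Element>

  // <custom encoding/decoding members>
}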

While we're at it, is there a setting similar to SWIFT_DETERMINISTIC_HASHING that seeds Swift's random number generator with a given value?

You’ve simply traded one set of assumptions (ordering) for another (representation). If the default implementation isn’t useful in most cases, something’s wrong. Of course, this may be more of a Codable issue than any particular implementation.

The wrapper suggestion is workable, as long as we only use it at the coding boundaries, as otherwise it’s rather hard to use since it can’t do everything the underlying collection can.

There is not.
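If repeatable random values are needed, one workaround is to pass your own generator to the using: variants of the stdlib's random APIs. Below is a minimal sketch, assuming a SplitMix64-style generator is good enough for the purpose. The type name SeededGenerator is made up for this example, and the generator is not cryptographically secure:

```swift
// Hypothetical seeded generator based on the SplitMix64 algorithm.
struct SeededGenerator: RandomNumberGenerator {
  private var state: UInt64

  init(seed: UInt64) { state = seed }

  mutating func next() -> UInt64 {
    // Advance the state, then scramble it (standard SplitMix64 constants).
    state &+= 0x9E3779B97F4A7C15
    var z = state
    z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
    z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
    return z ^ (z >> 31)
  }
}

// Two generators with the same seed produce the same sequence of values.
var first = SeededGenerator(seed: 42)
var second = SeededGenerator(seed: 42)
let rollsA = (0..<5).map { _ in Int.random(in: 1...6, using: &first) }
let rollsB = (0..<5).map { _ in Int.random(in: 1...6, using: &second) }
print(rollsA == rollsB)  // true
```

The generator's raw output is fixed by the seed, but how the stdlib maps that output onto a range or a shuffle is not something I'd assume stays identical across toolchain versions.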

The primary goal of a Codable conformance for any type is to ensure safe & reliable round tripping through serialization.

OrderedDictionary is built around the notion that the ordering of items is significant; risking the loss of that ordering during a serialization round trip seems like an absurd ask to me.

Indeed. The Codable family of protocols does not guarantee that a keyed container preserves ordering during an encode/transmit/decode round trip. (And indeed it can't, since, technically, neither does JSON.)

Therefore, OrderedDictionary cannot, by default, encode itself into a keyed container. Its current implementation is the correct way to guarantee that ordering is preserved after an encode/decode round trip. This implementation is obviously useful for serialization purposes.

The wrapper suggestion is workable, as long as we only use it at the coding boundaries, as otherwise it’s rather hard to use since it can’t do everything the underlying collection can.

Yep.

If for some reason you want/need to serialize or deserialize an ordered dictionary to/from a keyed container (for example, because you know you're always going to be using an order-preserving serializer/deserializer, and you have external constraints on the precise way your data gets serialized -- which, I have to note, isn't a use case Codable was specifically designed to cover*), then one simple way to do that is to wrap the collection into an adapter type on the fly, within the encode/decode members:

func encode(to encoder: Encoder) throws {
  ...
  try keyedContainer.encode(KeyedEncodingAdapter(orderedDict), forKey: .dict)
  ...
}

init(from decoder: Decoder) throws {
  ...
  orderedDict = try keyedContainer.decode(
    KeyedEncodingAdapter<Key, Value>.self, 
    forKey: .dict
  ).value
  ...
}
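The adapter type itself isn't spelled out above. As a rough sketch of what such an adapter could look like, here is a self-contained analogue. Everything in it is an assumption made for illustration: KeyedPairsAdapter and DynamicKey are invented names, and an array of key/value pairs with String keys stands in for OrderedDictionary so the example runs without swift-collections.

```swift
import Foundation

// A coding key that can represent any string, so keys can come from the data.
struct DynamicKey: CodingKey {
  var stringValue: String
  var intValue: Int? { nil }
  init(_ string: String) { stringValue = string }
  init?(stringValue: String) { self.stringValue = stringValue }
  init?(intValue: Int) { return nil }
}

// Hypothetical adapter: wraps ordered key/value pairs and codes them through
// a keyed container instead of an unkeyed one.
struct KeyedPairsAdapter<Value: Codable>: Codable {
  var value: [(key: String, value: Value)]

  init(_ value: [(key: String, value: Value)]) { self.value = value }

  func encode(to encoder: Encoder) throws {
    var container = encoder.container(keyedBy: DynamicKey.self)
    // Keys are written in the wrapped order; whether the serialized output
    // keeps that order is up to the encoder.
    for (key, element) in value {
      try container.encode(element, forKey: DynamicKey(key))
    }
  }

  init(from decoder: Decoder) throws {
    let container = try decoder.container(keyedBy: DynamicKey.self)
    var pairs: [(key: String, value: Value)] = []
    // allKeys arrives in whatever order the decoder chooses to provide.
    for key in container.allKeys {
      let element = try container.decode(Value.self, forKey: key)
      pairs.append((key: key.stringValue, value: element))
    }
    value = pairs
  }
}

let adapter = KeyedPairsAdapter([(key: "b", value: 2), (key: "a", value: 1)])
let data = try JSONEncoder().encode(adapter)
let decoded = try JSONDecoder().decode(KeyedPairsAdapter<Int>.self, from: data)
print(decoded.value.count)  // 2
```

Note that this round trip only promises the same set of pairs, not the same order; order survives only if the serializer and deserializer in use both happen to preserve it, which is exactly the assumption being made explicit here.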

You can also define your own extension methods on Keyed[Encoding,Decoding]Container to achieve this less verbosely:

  try keyedContainer.encodeIntoKeyedContainer(orderedDict, forKey: .dict)
  ...
  orderedDict = try keyedContainer.decodeFromKeyedContainer(OrderedDictionary<Key, Value>.self, forKey: .dict)

Of course, this means you cannot rely on compiler synthesis, which, given the risks associated with using a keyed container in this context, seems like a good thing! Assumptions are best when they're made explicitly.

* (A serialization facility that was designed to provide direct control over the serialization output would most likely (1) be explicitly designed around that specific serialization format, and (2) have extensive hooks for convenient remapping of names and customizing all other details of the serialization.)
