Assigning a dictionary without re-hashing

Alvae · May 30, 2022, 3:52pm

Looking at the API of Dictionary, I am under the impression that it misses one operation that would set a key/value pair if and only if one doesn't already exist. The issue is that I suspect the following code will hash my key twice:

if d[key] == nil { d[key] = value }

There's updateValue(_:forKey:), but I suspect that method will unnecessarily compute a new value if the key/value pair is already in the dictionary. I'm looking for a method along these lines:

extension Dictionary {
  mutating func setIfNotAlready(_ value: @autoclosure () -> Value, forKey: Key) { ... }
}

For some context, I'm trying to use a dictionary as a memoization cache and would like to avoid hashing my keys twice to set new key/value pair.

tera · May 30, 2022, 6:48pm

updateValue is the same as using subscript's setter.

Do you need this to work with your specific keys? If that's the case and the individual fields of your key are immutable you may recalculate the hash value in the initialiser once and your "hash(into:)" could be very fast to be of a concern.

Alternatively, if the hash calculation is very slow, you may speed it up by caching the hash values themselves in an external side table.

I wonder how much CPU is taken by the hash calculation in your case (if that's the concern).

taylorswift · May 30, 2022, 6:53pm

i am with you completely. i feel like i am reaching for this operation half the times i am working with Dictionary.

Alvae · May 30, 2022, 6:57pm

Thanks for the reply.

I'm building a generic library around a data structure that uses hashing extensively. I can't predict hash performance because the library gets instantiated with user-provided data types. It also seems like it would be difficult to store pre-computed hashes for the same reason.

AFAICT, most people using that library do so with fairly easy-to-hash data types. But still, I think there's an opportunity for some speedup.

Alejandro · May 30, 2022, 6:58pm

If you only want a single hash:

let x = "hello"
let y = ["hello": 0]

if let idx = y.index(forKey: x) {
  print(y[idx]) // (key: "hello", value: 0)
}

Alvae · May 30, 2022, 7:00pm

I think that won't work. The interesting case is when there's a cache miss; i.e., when the key isn't in the dictionary. Ideally, you would:

compute the hash
search for the key
if the key is missing, compute the value
reuse the hash from step 1 to insert a new key/value pair

tera · May 30, 2022, 8:06pm

The very same could be said about the Dictionary itself!... yet it's quite alright as it is. Having said that, the proposed func setIfNotAlready could be useful.

It doesn't feel right to calculate the hash again and again if the value hasn't been changed after the first hash calculation. If hash calculation is really slow (and only if it is really slow) I'd do something like this:

class CachedHashable<T: Hashable>: Hashable {
    
    static func == (a: CachedHashable<T>, b: CachedHashable<T>) -> Bool {
        a.value == b.value
    }
    
    func hash(into hasher: inout Hasher) {
        if cachedHash == nil {
            recalculateHash()
        }
        hasher.combine(cachedHash!)
    }
    
    var value: T {
        didSet {
            invalidateHash()
        }
    }
    
    private var cachedHash: Int?

    init(_ value: T) {
        self.value = value
    }
    
    private func invalidateHash() {
        cachedHash = nil
    }
    
    private func recalculateHash() {
        var hasher = Hasher()
        hasher.combine(value)
        cachedHash = hasher.finalize()
    }
}

Alvae · May 30, 2022, 8:09pm

Fair point ;)

lukasa · May 31, 2022, 7:02am

I think the way you get this to work is to force the Dictionary into the _modify accessor:

extension Dictionary {
    mutating func setIfNotAlready(_ value: @autoclosure () -> Value, forKey key: Key) {
        self[key].setIfNil(value)
    }
}

extension Optional {
    fileprivate mutating func setIfNil(_ newValue: () -> Wrapped) {
        if self == nil {
            self = .some(newValue())
        }
    }
}

You can see the code for the _modify accessor here, which includes only one call to mutatingFind which does the actual hashing. All other calls have already found a bucket, so no further hashing is required to perform the insertion.

gwendal.roue · May 31, 2022, 7:40am

I tried to apply your idea with the following code:

do {
    print("---")
    var d: [Key: String] = [:]
    let key = Key(value: 1)
    if d[key] == nil { d[key] = "foo" }
    if d[key] == nil { d[key] = "bar" }
    print(d)
}

do {
    print("---")
    var d: [Key: String] = [:]
    let key = Key(value: 1)
    d.setIfNotAlready("foo", forKey: key)
    d.setIfNotAlready("bar", forKey: key)
    print(d)
}

Full code

extension Dictionary {
    mutating func setIfNotAlready(_ value: @autoclosure () -> Value, forKey key: Key) {
        self[key].setIfNil(value)
    }
}

extension Optional {
    fileprivate mutating func setIfNil(_ newValue: () -> Wrapped) {
        if self == nil {
            self = .some(newValue())
        }
    }
}

struct Key: Hashable {
    var value: Int
    
    static func == (lhs: Key, rhs: Key) -> Bool {
        return lhs.value == rhs.value
    }
    
    func hash(into hasher: inout Hasher) {
        print("compute hash")
        hasher.combine(value)
    }
}

do {
    print("---")
    var d: [Key: String] = [:]
    let key = Key(value: 1)
    if d[key] == nil { d[key] = "foo" }
    if d[key] == nil { d[key] = "bar" }
    print(d)
}

do {
    print("---")
    var d: [Key: String] = [:]
    let key = Key(value: 1)
    d.setIfNotAlready("foo", forKey: key)
    d.setIfNotAlready("bar", forKey: key)
    print(d)
}

Both variants produce the same output (in both debug and release configurations):

---
compute hash
compute hash
compute hash
[Key(value: 1): "foo"]
---
compute hash
compute hash
compute hash
[Key(value: 1): "foo"]

lukasa · May 31, 2022, 9:34am

Try inserting something into the Dictionary first:

var d: [Key: String] = [Key(value: 2): "baz"]

The get accessor on Dictionary has a fast-path for when the dictionary is empty that avoids hashing the key. You'll then see a difference in the number of hashing operations. I'll note that the number does not appear to be as low as I'd expect it to be, but the version using _modify certainly hashes less often.

gwendal.roue · May 31, 2022, 10:03am

Thanks for fixing my sample code, and raising the awareness about _modify, @lukasa :-)

Here we do not use _modify in our code, so the discussion is not polluted by the fact that it is still not public. Instead, we're using our knowledge of its existence in order to write more efficient code – that's quite precious

This thread should be able to improve a few micro-benchmarks. @Alvae, please tell us if this makes a difference in your app!

lorentey · June 1, 2022, 9:59am

Another way of doing this is to use the Dictionary subscript operation that allows specifying a default value. Unfortunately this still requires a helper function to force things through the _modify accessor, but it's slightly simpler:

func modify<T>(_ value: inout T) { }

extension Dictionary {
  mutating func setIfNotAlready(_ value: @autoclosure () -> Value, forKey key: Key) {
    modify(&self[key, default: value()])
  }
}

Ideally the standard library would supply a version of this out of the box. I have been toying with something like this variant:

extension Dictionary {
  mutating func memoizedValue(
    forKey key: Key, 
    generator: (Key) throws -> Value
  ) rethrows -> Value {
    ... 
  }
}

(The name would be subject to heavy bikeshedding, I'm sure. I find it more useful to have the value generator supplied by a regular function argument rather than an @autoclosure. This variant also returns the actual cached value, as that tends to prevent yet another lookup immediately following this operation. In this version I also have the key as an explicit argument -- this lets callers avoid capturing the key in the supplied closure, but perhaps that's too subtle a point to insist on.)

Sample usage in a silly dynamic programming example:

var fibTable: [Int: Int] = [:]

func fib(_ n: Int) -> Int {
  fibTable.memoizedValue(forKey: n) { n in 
    guard n > 1 else { return n }
    return fib(n - 1) + fib(n - 2)
  }
}

Unfortunately, this does not actually work, and this demonstrates an inherent limitation of this general approach: attempts at recursion tend to violate the Law of Exclusivity. Sadly fibTable is being exclusively accessed by memoizedValue while the closure is executing, so the recursive calls lead to concurrent modify accesses, which are illegal.

print(fib(10)) // Simultaneous accesses to 0x1079d0250, but modification requires exclusive access.

However, as long as we're okay with this limitation, I think such a Dictionary method would be a worthwhile addition!

Note: The original two-phase solution doesn't have the same problem, as the recursion happens outside of accesses to the table:

var fibTable: [Int: Int] = [:]

func _fib(_ n: Int) -> Int {
  guard n > 1 else { return n }
  return fib(n - 1) + fib(n - 2)
}

func fib(_ n: Int) -> Int {
  if let cached = fibTable[n] { return cached }
  let result = _fib(n)
  fibTable[n] = result
  return result
}

print(fib(10)) // ⟹ 55

(Note how the table may receive an arbitrary number of insertions between the initial lookup of cached and the final setting of result -- this is another reason why we cannot generally do this with a single hash table lookup: the table may well be resized in between the two stages.)

(There are clever ways to try automating this memoization of such recursive functions via a construct not unlike a property wrapper; however I suspect that we can't really express that yet in Swift. Or can we? )

tera · June 1, 2022, 11:35am

lorentey:

func modify<T>(_ value: inout T) { }

extension Dictionary {
  mutating func setIfNotAlready(_ value: @autoclosure () -> Value, forKey key: Key) {
    modify(&self[key, default: value()])
  }
}

Huh, calling a function that visually does nothing... this is so magical!

BTW, why there's no setter for dictionary index subscript?

let index = dict.index(forKey: key)!
_ = dict[index] // ok
dict[index] = 1 // not ok

(it won't help in this case though.)

jrose · June 1, 2022, 4:07pm

The dictionary index subscript returns a key-value pair as a tuple, and there’s no way to enforce that you only change the value and not the key. However, you can mutate dict.values[index], and maybe we should add an unavailable subscript setter to Dictionary to direct people there.

tera · June 1, 2022, 5:11pm

Runtime's precondition can do, similar to array[index] or dictionary[index] with an invalid index.

Good to know.

Alvae · June 2, 2022, 5:02pm

I played with the suggestions from @lukasa and @lorentey.

I wasn't able to observe any significant speedup in my app before the time to hash a single element becomes prohibitively expensive anyway. It also appears the cases where I can use setIfNotAlready are not as frequent as I thought, because of the LoE.

My attempts to observe speedup in micro benchmarks have failed so far. It seems the optimizer is too clever to miss the opportunity in bare-bones examples.

lorentey · June 3, 2022, 1:15am

Out of curiosity, I did some microbenchmarking of my own.

Click to see benchmark sources

extension Optional {
  mutating func setIfNil(_ newValue: () -> Wrapped) -> Wrapped {
    if let v = self { return v }
    let v = newValue()
    self = v
    return v
  }
}

inline(__always)
func modify<T>(_ value: inout T) -> T { value }

extension Dictionary {
  mutating func memoizedValue0(
    forKey key: Key,
    generator: (Key) -> Value
  ) -> Value {
    if let value = self[key] { return value }
    let value = generator(key)
    self[key] = value
    return value
  }

  mutating func memoizedValue1(
    forKey key: Key,
    generator: (Key) -> Value
  ) -> Value {
    self[key].setIfNil { generator(key) }
  }

  mutating func memoizedValue2(
    forKey key: Key,
    generator: (Key) -> Value
  ) -> Value {
    modify(&self[key, default: generator(key)])
  }
}

var benchmark = Benchmark(title: "Collection Benchmarks")
defer { benchmark.main() }

benchmark.add(
  title: "Dictionary<Int, Int> memoization - 2-phase",
  input: [Int].self
) { input in
  return { timer in
    var d: [Int: Int] = [:]
    timer.measure {
      for i in input {
        let v = d.memoizedValue0(forKey: i, generator: { 2 * $0 })
        precondition(v == 2 * i)
      }
    }
    precondition(d.count == input.count)
    blackHole(d)
  }
}

benchmark.add(
  title: "Dictionary<Int, Int> memoization - regular subscript",
  input: [Int].self
) { input in
  return { timer in
    var d: [Int: Int] = [:]
    timer.measure {
      for i in input {
        let v = d.memoizedValue1(forKey: i, generator: { 2 * $0 })
        precondition(v == 2 * i)
      }
    }
    precondition(d.count == input.count)
    blackHole(d)
  }
}

benchmark.add(
  title: "Dictionary<Int, Int> memoization - defaulted subscript",
  input: [Int].self
) { input in
  return { timer in
    var d: [Int: Int] = [:]
    timer.measure {
      for i in input {
        let v = d.memoizedValue2(forKey: i, generator: { 2 * $0 })
        precondition(v == 2 * i)
      }
    }
    precondition(d.count == input.count)
    blackHole(d)
  }
}

@lukasa's approach using the regular subscript's _modify proved measurably faster than the two-phase variant. Surprisingly, using the defaulted subscript turned out to be 2x slower than the regular subscript on this microbenchmark. The defaulted subscript is significantly simpler than the regular one, so I'd have expected the opposite! This is a performance anomaly we'll need to look into.

As a Swift user, I wouldn't draw too many conclusions from these results -- in these microbenchmarks, the cached function is a trivial 2 * $0 closure, so these are dominated by hash table operations. In actual use cases, the generator function will likely be much slower, and there will be no measurable difference between the three variants.

tera · June 3, 2022, 1:48am

Interesting results, thank you.

How did you arrive at this conclusion? If hash function is slower → it should matter more if it is calculated once or twice. Could you rerun your test with some realistic key, e.g. with keys of the form UUID().uuidString ? (I presume there is no caching of String's hash values which would mud water even further).

lorentey · June 3, 2022, 2:00am

I'm talking about the function whose results are getting cached in the dictionary, not the hash function as implemented by Hasher. The multiply-by-two function I'm using in these microbenchmarks is not a very practical thing to memoize the results of. In actual cases, the function we're caching will be (hopefully!) doing a lot more work than that -- otherwise using a dictionary to cache its returned values would not be very useful.

Looking up a key in a dictionary has a certain amount of complexity, a part of which is spent calculating the hash value for the key (other parts are spent on equality comparisons while scanning for a matching key within the corresponding collision chain, RAM latency, etc.) For these caching/memoization schemes to be worth the memory they need, the function whose repeated invocations we'd like to optimize away needs to be overwhelmingly more expensive than a couple of dictionary lookups.