Assigning a dictionary without re-hashing

Out of curiosity, I did some microbenchmarking of my own.

Benchmark sources:
// Requires the swift-collections-benchmark package for `Benchmark` and `blackHole`.
import CollectionsBenchmark

extension Optional {
  /// If `self` is nil, sets it to `newValue()`; returns the wrapped value either way.
  mutating func setIfNil(_ newValue: () -> Wrapped) -> Wrapped {
    if let v = self { return v }
    let v = newValue()
    self = v
    return v
  }
}

// Forces an inout access on its argument; with `self[key, default:]` this runs
// the defaulted subscript's modify accessor, inserting the default if needed.
@inline(__always)
func modify<T>(_ value: inout T) -> T { value }

extension Dictionary {
  mutating func memoizedValue0(
    forKey key: Key,
    generator: (Key) -> Value
  ) -> Value {
    if let value = self[key] { return value }
    let value = generator(key)
    self[key] = value
    return value
  }

  mutating func memoizedValue1(
    forKey key: Key,
    generator: (Key) -> Value
  ) -> Value {
    self[key].setIfNil { generator(key) }
  }

  mutating func memoizedValue2(
    forKey key: Key,
    generator: (Key) -> Value
  ) -> Value {
    modify(&self[key, default: generator(key)])
  }
}

var benchmark = Benchmark(title: "Collection Benchmarks")
defer { benchmark.main() }

benchmark.add(
  title: "Dictionary<Int, Int> memoization - 2-phase",
  input: [Int].self
) { input in
  return { timer in
    var d: [Int: Int] = [:]
    timer.measure {
      for i in input {
        let v = d.memoizedValue0(forKey: i, generator: { 2 * $0 })
        precondition(v == 2 * i)
      }
    }
    precondition(d.count == input.count)
    blackHole(d)
  }
}

benchmark.add(
  title: "Dictionary<Int, Int> memoization - regular subscript",
  input: [Int].self
) { input in
  return { timer in
    var d: [Int: Int] = [:]
    timer.measure {
      for i in input {
        let v = d.memoizedValue1(forKey: i, generator: { 2 * $0 })
        precondition(v == 2 * i)
      }
    }
    precondition(d.count == input.count)
    blackHole(d)
  }
}

benchmark.add(
  title: "Dictionary<Int, Int> memoization - defaulted subscript",
  input: [Int].self
) { input in
  return { timer in
    var d: [Int: Int] = [:]
    timer.measure {
      for i in input {
        let v = d.memoizedValue2(forKey: i, generator: { 2 * $0 })
        precondition(v == 2 * i)
      }
    }
    precondition(d.count == input.count)
    blackHole(d)
  }
}

@lukasa's approach using the regular subscript's _modify proved measurably faster than the two-phase variant. Surprisingly, using the defaulted subscript turned out to be 2x slower than the regular subscript on this microbenchmark. The defaulted subscript is significantly simpler than the regular one, so I'd have expected the opposite! This is a performance anomaly we'll need to look into.

As a Swift user, I wouldn't draw too many conclusions from these results -- in these microbenchmarks, the cached function is a trivial 2 * $0 closure, so these are dominated by hash table operations. In actual use cases, the generator function will likely be much slower, and there will be no measurable difference between the three variants.

5 Likes

Interesting results, thank you.

How did you arrive at this conclusion? If the hash function is slower, it should matter more whether it is calculated once or twice. Could you rerun your test with a more realistic key, e.g. keys of the form UUID().uuidString? (I presume there is no caching of String's hash values, which would muddy the waters even further.)

I'm talking about the function whose results are getting cached in the dictionary, not the hash function as implemented by Hasher. The multiply-by-two function I'm using in these microbenchmarks is not a very practical thing to memoize the results of. :wink: In actual cases, the function we're caching will be (hopefully!) doing a lot more work than that -- otherwise using a dictionary to cache its returned values would not be very useful.

Looking up a key in a dictionary has a certain amount of complexity, a part of which is spent calculating the hash value for the key (other parts are spent on equality comparisons while scanning for a matching key within the corresponding collision chain, RAM latency, etc.). For these caching/memoization schemes to be worth the memory they need, the function whose repeated invocations we'd like to optimize away needs to be overwhelmingly more expensive than a couple of dictionary lookups.
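
To illustrate the point, here is a sketch (not part of the benchmark above; `slowFib`, `fibCache`, and `memoizedFib` are hypothetical names) of a case where the generator clearly dominates the cost of the lookups:

// Memoization pays off when the generator is much more expensive than a
// couple of dictionary lookups.
var fibCache: [Int: Int] = [:]

func slowFib(_ n: Int) -> Int {
  n < 2 ? n : slowFib(n - 1) + slowFib(n - 2)   // exponential without caching
}

func memoizedFib(_ n: Int) -> Int {
  fibCache.memoizedValue0(forKey: n) { slowFib($0) }   // second call for the same n is cheap
}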

1 Like

FWIW, I did a quick test: hash value calculation for UUID().uuidString-based keys takes about 30 ns on my old computer; if the whole access is 100 ns, that's a significant chunk.
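
(A rough sketch of how such a measurement might be set up using Hasher directly -- this is illustrative, not the exact test used above:)

// Times just the hashing of UUID-string keys.
import Foundation
import Dispatch

let keys = (0 ..< 100_000).map { _ in UUID().uuidString }
let start = DispatchTime.now().uptimeNanoseconds
var sink = 0
for key in keys {
  var hasher = Hasher()
  hasher.combine(key)
  sink ^= hasher.finalize()   // keep the work from being optimized away
}
let elapsed = DispatchTime.now().uptimeNanoseconds - start
print("per-key hash time: \(elapsed / UInt64(keys.count)) ns (sink: \(sink))")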

When is a dictionary element index invalidated, or when may it become invalid? I haven't found this in the documentation :astonished:. Possible candidates:

  1. when removing the indexed element (even if we are not using the index itself)
  2. when removing an element (different from the indexed one)
  3. when adding another element
  4. when changing element values (including the indexed element's), excluding setting a value to nil, which is equivalent to removing the element (see above).

Obviously in case 1, and it also feels like in cases 2 and 3, but not 4 -- is this a correct assumption?

That's a good question. In the absence of documentation, you must assume that all indexes are invalidated on any mutation. MutableCollection imposes some additional requirements, so you can assume that mutating through dict.values[index] does not invalidate indexes.

Strictly speaking, mutating through dict[key] could invalidate indexes when key is already in the dictionary. It would be reasonable to ask for that as a guarantee in the documentation.

#1 and #3 can absolutely invalidate. #2 may or may not, but it's an implementation decision that I suspect we don't want to promise anything about.
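
A minimal sketch of the dict.values[index] point above -- mutating a value through the Values view at a known index leaves that index valid:

// Value mutation through the Values view (a MutableCollection) must not
// invalidate indices.
var dict = ["a": 1, "b": 2]
if let i = dict.index(forKey: "a") {
  dict.values[i] *= 10
  print(dict[i])   // (key: "a", value: 10) -- the index is still usable
}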

1 Like

Thank you.

Do you know if the following foo and bar are equivalent?

struct Item {
    var foo: Int = 0
    var bar: String = ""
}

var items: [String: Item] = ...

func change(_ item: inout Item) {
    item.foo += 1
}

func foo(key: String) {
    change(&items[key, default: Item()])
}

func bar(key: String) {
    var item = items[key, default: Item()]
    change(&item)
    items[key] = item
}
1 Like

They should have the same effect; for a subscript implemented with get and set they would have the same behavior as well. However, a subscript implemented with _modify could provide an optimized implementation that only has to do one lookup of key.
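
A familiar example of that single-lookup path is mutating directly through the defaulted subscript, which can go through its modify accessor and touch the table once per call:

// One lookup per iteration: the defaulted subscript inserts the default and
// mutates the value in place.
var counts: [String: Int] = [:]
for word in ["a", "b", "a"] {
  counts[word, default: 0] += 1
}
// counts == ["a": 2, "b": 1]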

1 Like

Dictionary never invalidates indices when changing the value of an existing key. This is not explicitly spelled out in the docs, but an important hint is that the Dictionary.Values view is a MutableCollection.

(We go to extraordinary lengths to preserve indices on value mutations even when the original dictionary is a "verbatim bridged" NSDictionary instance. There is some eldritch code hiding in there -- consider that NSDictionary doesn't really model indices at all. :sweat_smile:)

Dictionary does provide the following guarantee about index invalidation vs insertions:

A dictionary’s indices stay valid across additions to the dictionary as long as the dictionary has enough capacity to store the added values without allocating more buffer. When a dictionary outgrows its buffer, existing indices may be invalidated without any notification.

When you know how many new values you’re adding to a dictionary, use the init(minimumCapacity:) initializer to allocate the correct amount of buffer.

I believe this originated in C++. I find this a rather unfortunate guarantee, as it prevents the Dictionary from reordering collision chains on insertion to improve average lookup performance (as in Robin Hood hashing). For better or worse, we are stuck with it now, though.

Removing a key from a Dictionary always invalidates every index within it, as the removal operation involves rearranging the collision chain that included the original item.

Set operates the same way.
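
A minimal sketch of code relying on that documented guarantee (hypothetical names) -- reserve enough capacity up front, and indices captured before the insertions stay valid:

// With enough capacity reserved, these insertions do not reallocate the
// buffer, so the previously obtained index remains valid.
var dict = Dictionary<String, Int>(minimumCapacity: 100)
dict["a"] = 1
let i = dict.index(forKey: "a")!
for n in 0 ..< 50 {
  dict["key\(n)"] = n
}
print(dict[i])   // (key: "a", value: 1)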

3 Likes

Can we formalize this guarantee?

Absolutely. This is a (semantic) API change, so I expect it'll require going through Swift Evolution. I've been collecting other ill-defined aspects of the stdlib -- perhaps we have a large enough batch of these now to be worth the expense of a small proposal.

(Edit: unifying and documenting the expected behavior of base properties on mutable slices could also be part of this, although that one also requires actual behavioral changes in some obscure corners of the stdlib.)

IMO, dealing with base semantics is big enough to tackle on its own, particularly if the rest of the semantic changes are going to be uncontroversial and require little debate.

3 Likes

i just do

{ _ in }(&dictionary[key, default: value])
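
Spelled out, the trick works roughly like this (illustrative sketch; parenthesizing the closure sidesteps the trailing-closure ambiguity mentioned a few posts below):

// The empty closure takes its argument inout, so the call runs the defaulted
// subscript's modify accessor, inserting `value` when `key` is missing --
// a single lookup, no separate read and write-back.
var dictionary: [String: Int] = [:]
let (key, value) = ("k", 42)
({ (_: inout Int) in })(&dictionary[key, default: value])
assert(dictionary[key] == 42)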

Looks quite cryptic. :slight_smile:

I still wonder, though, whether the whole issue could be sidestepped (with an overall app/system performance gain as a bonus) if we somehow cached the calculated hash value, greatly reducing the number of hash calculations and also providing a quick path to test whether two values are unequal. I guess this is only realistically possible with reference types, or with value types like String/UUID/Array/Dictionary (and their specific variants) that are backed by reference types. To approach this systematically, the first thing I'd do is measure the time spent on hash calculations (plus the equality checks that end up returning false) in some realistic benchmarks; if that's, say, 10%, it's worth exploring further, but if it's 1%, it isn't worth the complexity.
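
(A rough sketch of that idea, purely for illustration -- `CachedHashKey` is a hypothetical wrapper that hashes the underlying value once, reuses the cached hash, and uses it as a cheap not-equal fast path:)

// Computes the hash once at init, feeds the cached hash to Hasher, and uses
// it as a quick "definitely not equal" check before the full comparison.
struct CachedHashKey: Hashable {
  let value: String
  private let cachedHash: Int

  init(_ value: String) {
    self.value = value
    self.cachedHash = value.hashValue   // already incorporates the per-process seed
  }

  func hash(into hasher: inout Hasher) {
    hasher.combine(cachedHash)
  }

  static func == (lhs: Self, rhs: Self) -> Bool {
    lhs.cachedHash == rhs.cachedHash && lhs.value == rhs.value
  }
}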

i like { _ in }(&). it’s short, clear about what it’s doing, easy for the compiler to detect and optimize, and doesn’t make us design and pitch and review and lobby a new feature. i just wish it didn’t need a semicolon on the previous statement.

Mmm…

6 Likes

Personally, I'd love to be able to express this:

dict[key] = dict[key] ?? value

as this:

dict[key] ??= value

Let's call it the nil-replacement operator. It'd work anywhere you have an optional, not only with dictionaries. And the intent is clear, unlike modify(&) or { _ in }(&).

Basic Implementation
infix operator ??=

func ??= <T>(a: inout T?, b: @autoclosure () -> T?) {
  if a == nil {
    a = b()
  }
}
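
Hypothetical usage, passing a dictionary subscript (or any optional) inout:

// Usage example for the operator sketched above.
var cache: [String: Int] = [:]
cache["answer"] ??= 42      // key missing: inserts 42
cache["answer"] ??= 0       // key present: leaves 42 alone

var name: String? = nil
name ??= "default"          // works for any optional, not just dictionaries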
5 Likes

clear about what it’s doing

Just as a data point, I had to copy that code into a test project and dissect it into parts to figure out what it was doing )-:

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

3 Likes

well i suppose it takes getting used to, like @available(*, unavailable) or some vs any. all the alternatives suggested here so far involve adding new API to the standard library (or even new operators!) which feels like overkill to me.

I can see no world in which I would add this code to a project when I can use the func modify solution proposed by @lorentey upthread. I appreciate that these features compose this way and I'm glad this all works, but it's sufficiently inscrutable that I would encourage any team I work in to avoid this in favour of a named method whose intention is far clearer.

1 Like