Challenge: Make a better ordered set without reimplementing hashing / allocation

jrose · July 24, 2020, 8:05pm

I recently had need for a type that uniquely assigned IDs to incoming hashable values and allowed mapping back to the original value. The simplest way I could think was to use an ordered set type with a set-vector implementation:

struct SetVector<Value: Hashable>: RandomAccessCollection {
	private var allValues: [Value] = []
	private var indexesByValue: [Value: Int] = [:]

	mutating func insert(_ value: Value) -> Int {
		func addIfMissing(_ binding: inout Int?) -> Int {
			if binding == nil {
				binding = allValues.count
				allValues.append(value)
			}
			return binding!
		}
		// Avoid a double hash lookup
		return addIfMissing(&indexesByValue[value])
	}

	var startIndex: Int { 0 }
	var endIndex: Int { allValues.count }

	subscript(_ index: Int) -> Value {
		allValues[index]
	}
}

…but, this isn't super efficient because it stores the values twice (once for uniquing and once for lookup-by-ID). I'd like to put the dictionary index in the array instead, but those get invalidated when the dictionary needs to reallocate. I could box all the values, but that's a lot of boxing. And I could make SetVector a class, and invent a proxy key type that stores the reference to self and the index into the array, but then I lose value semantics (because the keys have references in them).

Now, there are dictionary implementations that maintain order, but Swift's isn't one of them. (Changing that is out of scope for this, and it's at least partly deliberate to make replayable denial-of-service attacks harder.) I could implement my own dictionary, and possibly even steal implementation details from the standard library, but, well, that's a lot of work! And easy to get wrong—hashing and knowing when to reallocate is tricky stuff.

All of the ordered set / ordered dictionary types I've seen online in a quick search do the same thing as I've done here, which does suggest that there's not an easy answer. But maybe someone else can think of something clever?

lorentey · July 24, 2020, 10:12pm

I'd like to put the dictionary index in the array instead, but those get invalidated when the dictionary needs to reallocate.

I have some plans to go the other way with ordered sets -- I'd put the elements in flat buffers and use the hash table to store their indices instead. This would speed up iteration and bulk access (which I expect to be important for an ordered set), and (depending on how much headroom we leave in buffer allocation) it could save lots of space since only buffer indices would suffer the <=75% fill ratio constraint, and they can be far smaller than the elements themselves.

I don't see an (easy) way to implement this with the stock Dictionary/Set types, because they don't provide hooks to access custom table-local data in their lookup code -- and storing a reference to the buffer inside the hashed items would be far too wasteful. If forced, one could use thread-local variables to communicate the buffer identity to the BufferIndex.hash(into:)/== implementation, but that's really rather messy (and we also don't have constructs for thread-local variables yet).

Ordered sets would potentially also be a good use case for size-dependent hash tables. (E.g., disable hashing if count < 16, use 8-bit buffer indices if count < 256, etc.)

lorentey · July 24, 2020, 10:56pm

BTW, _HashTable in the stdlib codebase attempts to isolate the table algorithms and I think it has the hooks to support ordered tables -- as long as you're happy with the stdlib's hash table implementation.

(There are many ways to dice this onion, and the stdlib's constraints are likely irrelevant for these more specialized use cases. E.g., if you have control over the element types, tombstones become a far more viable option. If you don't need to keep indices stable over regular insertions, tweaks like Robin Hood hashing could lead to interesting performance tradeoffs. Sometimes it could be worth creating a multi-level table to reduce disruptions due to all-or-nothing reallocations. These didn't pan out for the stdlib's Set/Dictionary, but any of these could be useful in an ordered set!)

s-k · July 27, 2020, 4:45am

I don't have much time to write now, so my answer will be a little short.

I'm not sure if this answers your question, but instead of having

private var indexesByValue: [Value: Int] = [:]

you could have

private var indexesByHash: [Int: Set<Int>] = [:]

Then in insert(), you could hash the value, look it up in indexesByHash and check all matches for equality with the value to be inserted.

jrose · July 27, 2020, 6:20pm

Oh, I like that! Here's a version with that implemented (and an enum to avoid allocations for non-colliding hashes):

fileprivate enum Indexes {
	case single(Int)
	case multiple([Int])
}

struct SetVector<Value: Hashable>: RandomAccessCollection {
	private var allValues: [Value] = []
	// Hash-based keys suggested by Stephen Kockentiedt
	// https://forums.swift.org/t/challenge-make-a-better-ordered-set-without-reimplementing-hashing-allocation/38729/4
	private var indexesByHash: [Int: Indexes] = [:]

	mutating func insert(_ value: Value) -> Int {
		// Helper to avoid a double lookup in indexesByHash
		func addIfMissing(_ indexesForHash: inout Indexes?) -> Int {
			let newIndex = allValues.count // may not get used

			switch indexesForHash {
			case nil:
				indexesForHash = .single(newIndex)

			case .single(let singleIndex)?:
				if allValues[singleIndex] == value {
					return singleIndex
				}
				indexesForHash = .multiple([singleIndex, newIndex])

			case .multiple(var indexes)?:
				for i in indexes {
					if allValues[i] == value {
						return i
					}
				}
				indexes.append(newIndex)
				indexesForHash = .multiple(indexes)
			}

			// If we got here, nothing matched.
			allValues.append(value)
			return newIndex
		}

		var hasher = Hasher()
		hasher.combine(value)
		let hash = hasher.finalize()

		return addIfMissing(&indexesByHash[hash])
	}

	var startIndex: Int { 0 }
	var endIndex: Int { allValues.count }

	subscript(_ index: Int) -> Value {
		allValues[index]
	}
}

(If anyone wants to use this for something let's say it's MIT licensed.)