An always-sorted set

CTMacUser · August 23, 2019, 5:57am

The Set and Dictionary types around here require their elements/keys to be Hashable, which makes node retrieval more efficient. Many other languages have this hash-ability requirement for their set and dictionary/map types. But I'm from an earlier age, where my first major introduction to these types was an earlier version of C++, where they used a comparability (i.e. <) requirement instead of hashing. I miss those days and wrote a version of that kind of set in Swift; it's in a gist.

(I was going to make a post about the binary-search algorithm I'm using there before this post. But searching on these forums about earlier B.S. efforts has inspired me to tinker a little more before posting here specifically about that.)

A key point of the implementation is that it does not arrange storage as a tree of nodes. It uses a single block of memory, which I guess can be an advantage sometimes. It's also cute in that it is its own SubSequence.

Writing this has shown me another flaw in our protocol hierarchy. This type, along with Set and Dictionary, need the arbitrary-removal part of RangeReplaceableCollection without the arbitrary-insertion part. That indicates a split to an intermediate protocol. (Yes, I know that adding non-leaf protocols to a hierarchy, even if done in a way that preserves source stability, still violates ABI stability. It has to wait until an ABI Version 2.) They all workaround by manually adding the removal methods of RRC without an associated protocol.

The elements are stored in a single block already sorted by <, and insertions maintain that sort. That's why I needed a binary-search algorithm.

Due to the single block, I feel the best way to use the sorted-set is to first read from it more than you will alter it. And to do all the element adds/removals in the beginning of the program.

Since they don't include much explanation, am I using the quasi-secret optimization methods correctly? (These are the members after count in the RandomAccessCollection block.)

Besides interest in copying the older C++ set design to Swift, the main interest (that made me write this now) is to solve a problem with the removing along arbitrary indices at once.

Edit: added link to multi-index removal thread.

millenomi · August 23, 2019, 2:51pm

Yeah, this is spot-on. Another problem is that you may not want MutableCollection at all, because the subscript set operation is essentially a range replacement insert special case.

millenomi · August 23, 2019, 3:12pm

(Unless you do something exceedingly clever with indices, but most of the cleverness may require Hashable.)

CTMacUser · August 23, 2019, 6:21pm

Out of the collection hierarchy, SortedSet only conforms to RandomAccessCollection (and its ancestors). It does not conform to MutableCollection nor RangeReplaceableCollection; the mutation interface comes from SetAlgebra and custom RRC-like removal methods.

lorentey · August 23, 2019, 8:43pm

I like this implementation, especially the way it wraps the underlying array index into an opaque struct.

The performance of an always-contiguous SortedSet is great at smaller element counts, but it gets disappointing beyond a few thousand elements, so a better approach would be to switch to a B-tree structure when the array grows too large. (See e.g. the (somewhat out of date) SortedSet implementation in BTree.)

Naturally, it wouldn't be wise to conform such a tree to RandomAccessCollection, or any of the current mutable collection protocols. (At the same time, there is no reason SortedSet wouldn't be a SetAlgebra.) This is fine -- it's okay for data structures to provide their own unique operations. Not every method has to fit into a neat protocol hierarchy.

If they prove useful, we can add a remove-only variant of RangeReplaceableCollection (or a swap-only variant of MutableCollection) later. (It may even be possible to insert them at their appropriate place in the Collection protocol hierarchy with minimal ABI breakage.) Meanwhile, there is nothing wrong with individual collections implementing some RRC/MC requirements without declaring full conformance.