[API Review for 1.1] Hash-Array Mapped Prefix Trees

Nevin · November 4, 2022, 12:55pm

In Swift parlance, I think we would call these “Branch on write” (BoW) collections.

Of course, we don’t use “Copy on write” (or CoW) in the names of types, so we wouldn’t want to use “Branch on write” there either.

Perhaps the module could be BranchOnWriteCollections. Then people would immediately understand that it contains BoW data structures.

For the types, TreeSet and TreeDictionary as suggested above sound good to me. They don’t per se communicate everything salient, but they do suggest “branchiness” which is a good start.

Another option would be BranchSet or BranchingSet. Those describe what the type does (“it branches”), and they don’t inadvertently give the impression of supporting standard tree-structure operations.

Either way, when someone reads the documentation for TreeSet or BranchSet or BranchingSet, they’ll see it is a branch-on-write data structure implemented with a tree, and the name will make sense.

lorentey · November 5, 2022, 5:12am

Oh, that is just due to me being a non-native speaker. A rudimentary search confirms that Shareable is the more typical spelling (by far), so we should definitely go with that -- I agree ShareableSet and ShareableDictionary is the better spelling.

If there are no better suggestions, I'm happy to go with these type names.

Yes, definitely! As I mentioned, SortedSet/SortedDictionary are the canonical counter-example, and their addition is very much on the horizon.

A tree-based "array"/list/string type would be another example of a highly useful persistent, but not hashed, collection, and those are also very much expected addition(s) to this package.

We will want these to ship in new module(s), separate from ShareableHashedCollections, because of to the sheer size of the code that implements them. (We ship these in separate modules to keep the code size costs of depending on this package low for clients that are sensitive to that.)

In your own code base, you can of course implement whatever search tree you like best, and you have my full blessing to name it whatever you like!

I suspect we have a fundamental disagreement on what the Swift Collections package is supposed to be, though. Let's make one (and just one) honest attempt at clearing this up, then! Here goes:

As I see it, the mandate of this package is to provide standard data structure implementations for common coding scenarios. I look at this package as an informal extension of the Swift Standard Library -- and in fact it has been the explicit goal of this package to eventually migrate its types there.

Accordingly, this package most definitely isn't supposed to be a comprehensive catalog of every data structure ever invented -- on the contrary, this needs to be a carefully curated collection of just the ones that we recommend for general use.

To curate a good selection of standard data structures, we must figure out a sensible way to carve up the space of all possible data structure implementations into nice, well-labeled categories, based on their function -- for example, we may draw a category for "Set" types, with sub-categories called "Hashed Sets", "Sorted Sets", "Radix Sets" etc. On another axis, we may categorize the various sets based on whether they provide value semantics, and if so, whether mutated copies of collection values are able to continue to share pieces of storage between them.

Then we have to go and carefully select a single standard implementation of each of the most frequently useful combination of categories. The goal is to put best of class implementations in the hand of programmers who just happen to need, say, a sorted set, but aren't particularly nitpicky about the details of how it works. (As long as it works quickly and it doesn't waste too much memory.)

The natural name for a canonical implementation of a sorted set is, of course, SortedSet. The precise data structure is generally immaterial, and (in theory) it may not even stay the same across releases; trying to precisely describe it would be silly, and would only serve to confuse people.

If this package remains successful, the implementations we add to this package will continue to be The Canonical Implementation of their corresponding category. They will be the baseline against which other implementations will compare and contrast themselves.

Furthermore, if these standard implementations are Good Enough, then they will eliminate the need to build other implementations in the first place, in all but the most specialized use cases. It takes weeks / months of difficult work to build a nontrivial, production quality Collection implementation; if the standard type is already good enough, that effort is generally far better spent on something that gives better bang for the buck.

For instance, before we released Swift Collections, the typical mid-to-large Swift project used to (anecdotally) have at least one untested, slightly broken and/or shockingly wasteful implementation of 25% of the OrderedSet/OrderedDictionary APIs. From what I can tell, that is no longer the case, and the Swift ecosystem is a tiny little bit better off overall because of it.

On the topic of sorted sets in particular:

There are many ways to implement a sorted set in Swift -- we have, what, 70-80 years' worth of active research to select from? Every data structures text book provides a variety of classic examples. The thing is, most of these are rather impractical on today's computers. Binary trees such as AVL trees and red-black trees are very clever coding exercises, but in my (not so humble) experience they make very little sense on today's machines. Data structures with poor locality of reference are generally not a recipe for success, and they tend to be terrible candidates for widespread production use.

Accordingly, binary search trees are not planned for inclusion in this package. Instead, we're working on a variant of in-memory B-trees, which I expect will, on the whole, provide ~an order of magnitude(ish) better performance, and far more sensible memory overhead than any binary tree ever could.

(Feel free to prove me wrong though -- I'd be very happy to eat my words once we have the candidate b-tree based SortedSet implementation to compare against!)

This term is not part of Swift's jargon. I have never heard anybody use it before, and I would not know what it means if I saw it written somewhere. I do not wish to see it become part of our terminology, either.

tclementdev · November 5, 2022, 8:52am

I don't see anyone mentioned SharedSet/SharedDictionary. Seems more in line with Ordered and Sorted, and sounds nicer than Shareable.

scanon · November 5, 2022, 1:26pm

Agreed, "BoW" doesn't make much sense to me. If I had to pick a name in that vein, I would call them "piecewise CoW" (insert butcher shop reference here).

Nevin · November 5, 2022, 1:49pm

I think you are being needlessly harsh and dismissive here.

Yes, of course “branch on write” was not previously part of Swift’s jargon. Of course you’ve never heard anybody else use it before.

That is because I thought of it specifically to describe the behavior of these collections. I had never heard it before either. As far as I knew, it was a brand new phrase.

You stated which aspect of the proposed data structures was important to succinctly convey, and in response to that prompt I started thinking about how to convey that information succinctly.

I came up with something that, to my mind, sounded clear and obvious and Swifty. Something that aligns with our existing conventions. Something that I expected anyone on these forums—anyone familiar with Swift data structures, value semantics, and copy-on-write optimization—would immediately understand.

And now that I search the web, I find that in fact other people already use “branch on write” for precisely this feature. The phrase I came up with on my own, independently, to describe the salient aspect of these data structures in a Swifty way, is the same phrase that other people elsewhere already use for the same purpose.

So not only do I think “branch on write” is intuitive, accurate, and fits well with Swift, but it is also an existing term of art with the meaning that we want.

This contrasts against “persistent” which is counter-intuitive (because the types do not persist the data), and whose technical definition either fails to apply (because past versions of the data are not accessible) or else means the same thing as “value semantic” (because the definition only requires that snapshots of the data remain independent).

The more I think about it, the more convinced I become that “branch on write” is the correct terminology to describe these collections in Swift. It parallels “copy on write”, it implies a tree structure, and it communicates how the data structures behave. And yet, despite that, “branch on write” still should not appear in the names of the types.

BranchOnWriteCollections is an accurate and descriptive name for the module, if that is indeed the module’s purpose.

For the types themselves, the best I’ve come up with is BranchingSet, and I think TreeSet is equally good. I’m not certain what the best name is, but both of those seem orders of magnitude better than the misleading term “persistent”.

tera · November 5, 2022, 2:49pm

The name "Shareable" suggests me this is a reference type collection rather than a value type. (btw, sometimes reference type collections would be handy for some cases). In that sense the newly brought "branch on write" concept and "Branch/BranchingSet" name may indeed convey the meaning somewhat better, even though it is a new term previously unheard of.

I only brought them as naming examples..

I love B-trees! Implemented quite a few of them myself, and agree that as data structures they offer multiple benefits in many cases over binary trees (besides locality the main advantage would be ease of balancing on removal / insertion and better applicability for page style cases like in file systems). Binary trees still have their uses today, and that's an excellent topic of discussion when to prefer one to another (e.g. you happen to need 10K of micro trees - the time and space overhead of B-trees here might be significant), and I love discussing topics like this over a beer, but let me not digress.

Nevin · November 5, 2022, 3:20pm

Another possible module name is HashTreeCollections.

Karl · November 6, 2022, 4:57am

I quite like the term "branch on write", and I think it is a good idea to think of a more general term for these values whose storage may be split up with some parts shared by multiple values. It seems like the kind of thing we may want more of in future - I'm reminded of DispatchData, and one could imagine the same storage-sharing mechanism being applied to a similar buffer type in the future.

And from there, I get the idea for another pair of candidates - DiscontiguousDictionary and DiscontiguousSet.

And if you start off 'reflexively hating' those names - well, so did I.

But as I consider them more, I actually come to like them. They very clearly communicate that the storage is broken up in to pieces, and that helps to explain what the benefit is in using these types - because the storage is broken up, pieces may be shared, and when COW occurs, it only needs to copy or create a piece rather than copying the entire storage for all values.

Even though contiguous storage is not part of the regular Dictionary/Set API, I don't think that actually matters when it comes to introducing versions that are explicitly discontiguous and which have different performance characteristics as a result of that.

lorentey · November 8, 2022, 2:30am

Thanks very much for all the input on the type names. I'm sticking to this principle:

We are not inventing a new data structure discipline from scratch here, so let's keep our feet planted firmly on the ground, and let's stick to established labels.

The name we choose must have some intuitive meaning that sounds reasonable to people who are familiar with purely functional data structures or to people who are well-versed in Swift -- preferably both. Inventing opaque new termini technicus that are equally unfamiliar to both groups is simply not on the table, unless we exhaust all other choices.

In addition, like all other data structures in both this package and the Standard Library, the names of the new HAMT collection types ought to reflect their functionality, rather than their internal implementation. I.e., we should name them after the thing that distinguishes these types in actual use. In our case, the primary thing that distinguishes the new types from Set and Dictionary is that the new types are algorithmically better at mutating collection values that have outstanding copies. I expect this is, in fact, the number one reason* why someone would reach for these types.

(* A (distant) second reason would be that primitive mutations of these types (inserting or removing a single item) tend to have far more consistent performance than Set and Dictionary, where the need to resize the underlying hash table sometimes induces dramatic latency spikes. This might be of some interest to ~"real-time" use cases, even if they never make a copy of these collections -- but, then again, this benefit is softened by two things: that (1) such spikes in the standard types can often be eliminated by reserving enough capacity in advance, and that (2) these are all hashed collections, and as such they are subject to random hash collision artifacts, which make it tricky/impossible to provide useful guarantees on worst-case performance -- so real-time use cases would probably be better off if they looked at something else entirely.)

Regarding TreeSet and similar variations: The fact that they organize their storage into trees is very interesting, but it isn't apparent from their API: the public interface does not expose their underlying tree structure. Additionally, tree-ness does not imply shareability -- a frequently desirable design choice is to put parent links in child nodes (or next/previous links in siblings), which makes some operations easier/faster, but it prevents sharing nodes across mutated copies. So calling these "tree set" and "tree dictionary" goes against our established naming principles. (Which aren't written in stone, but we shouldn't deviate from them unless we have exhausted all other options.)

Regarding DiscontiguousSet: It's true that the new types are discontiguous collections, but this doesn't distinguish them from the standard Set/Dictionary. While the standard types use a single, contiguous hash table as their storage representation, they are not contiguous collections in the sense that the Standard Library uses this term: they do not store their elements contiguously in a single buffer. So calling the new types "discontiguous set" and "discontiguous dictionary" would be, in my opinion, both redundant and irrelevant.

The distinguishing feature of the new collections is that they are better at mutating collection values that have outstanding copies, because they support partially sharing storage between collection instances. The original names drew attention to this feature, and this still feels very likely to be the right direction to me. If we decide not to go with the preexisting term of art "persistent", then using some variation of the word "share" instead seems like a promising direction.

SharedSet is not great; unlike OrderedSet and SortedSet, storage sharing is not an inherent property of these types. Just because two HAMT sets contain the same values, it does not follow that they share any storage between them.
ShareableSet seems better. It strongly hints at the fact that sharing only happens as a side effect of certain user actions. It solves both of my original problems with PersistentSet while hopefully remaining at least somewhat understandable to folks who might have encountered these types in a classic purely functional environment.
SharingSet is a third option. To me it feels like it sits somewhere in between SharedSet and ShareableSet, and it's less obvious about the circumstances of the sharing than the latter.

Note: the standard Set and Dictionary are also "shareable" in the sense that unmutated copies of their values do share references to the same hash table. I think this is fine; the new types are far more shareable than the existing ones, and it seems appropriate to set the precedent that basic copy-on-write behavior isn't enough to label a type "shareable".

For now, I'm provisionally updating the names of the new types to ShareableSet and ShareableDictionary. The new module name is ShareableHashedCollections. (It is necessary to label the module "hashed", because future shareable types will come in their own modules. The Standard Library calls Set and Dictionary hashed collections, so "shareable hashed collections" is the appropriate name for a module that defines their shareable variants.)

Future discussion on these type names is still welcome, of course, but any new suggestions need to position themselves against ShareableSet and ShareableDictionary, and they should do better when considering the two principles above:

The name needs to be at least somewhat understandable to folks who are familiar with this family of data structures.
The name needs to reflect the primary feature that distinguishes these types from Set and Dictionary; namely, that they are pretty good at mutating copied values. (Because they can continue to share parts of their storage with their copies after such mutations.)

The discussion is still fully open about other aspects of the new types.

Other updates:

Added ShareableSet.removeAll(where:) and ShareableDictionary.removeAll(where:).
Removed the fluent interfaces ShareableSet.inserting, ShareableDictionary.removingValue(forKey:) et al. I think these do have a useful role on shareable types, but they are easily implementable outside of this package using the mutating APIs, with no loss of efficiency, and I don't feel strongly enough about them to meaningfully argue for their inclusion. We can re-add them in a future update if their omission turns out to be more painful than expected.

The diffs are in PR #237, which includes an updated API proposal text. Here is a direct link to the new document:

swift-collections/ShareableHashedCollections.md

Updated DocC-generated documentation is also available, (temporarily) hosted under my own Git fork:

lorentey/swift-collections: ShareableHashedCollections

Thanks very much everyone for the interesting and fruitful review so far!

pyrtsa · November 8, 2022, 6:59am

Please let me express my disagreement to this reasoning which, I think, considers only one meaning for shared, that is the one which two or more owners take part to. There is also the social media style meaning which I think these data structures fit well to: being offered up for reuse. Also shared [ahem!] in a way by C++'s std::shared_ptr.

A big upside of SharedSet/SharedDictionary in my opinion is the terseness not only in the written form but also when pronounced.

kylemacomber · November 9, 2022, 5:16pm

Caveat: I haven't had the time to catch up on the full discourse in the thread, so I apologize for repeating arguments others may have made.

Any name based on "shared" (or a derivative of it) carries strong connotations of reference semantics and shared mutable state (at least for me!), so my intuition is to look for another word. The name shared is leaning too much on CoW, which is an implementation detail and thus an inappropriate basis for the name IMO—the mental model of passing around value types is copying not sharing.

I know "big" was suggested and rejected because it didn't fully convey the nuance of when these data structures should be used (i.e. when you're anticipating CoW). However, I think it captures some of the important connotations to keep in mind—these data structures are more important to use when you have large data sets. If your data set is small, it really doesn't make a significant difference between using Set and PersistentSet.

I think "big" is preferable to "shareable" because for me it's not actively conveying something misleading (i.e. reference semantics). "Big" is a short prefix and if we choose it, we'll effectively be establishing via naming convention that for Swift collections "big" stands for persistent data structure. I think that's good.

hooman · November 9, 2022, 6:59pm

I won't be able to catch up to all of the discussion, so my apologies if this feedback is way off. Trying to think out of the box, how about we go with a prefix that hints at potentially less expensive mutation compared to the ordinary variants? Some wild suggestions:

AdjustableSet / AdjustableDictionary
AdaptableSet / AdaptableDictionary
AlterableSet / AlterableDictionary

or even:

DuctileSet / DuctileDictionary
PliableSet / PliableDictionary
AmenableSet / AmenableDictionary
MalleableSet / MalleableDictionary
TractableSet / TractableDictionary

hisekaldma · November 9, 2022, 8:11pm

I’m increasingly leaning towards “persistent” being the best name after all. All of the alternatives suggested so far, “big” included, seem to invite misunderstandings of their own. And if the name is going to be misleading in some way or other, it’s better to just use the common name for these things. Then once users have gotten past the initial misunderstanding, they will have learned a term that is applicable not just in Swift, but in other languages too.

We have to keep in mind that persistent data structures are a pretty recent development. As they become more common, more people will know what they are, and the misunderstandings around the term “persistent” will be fewer and fewer. It would be ironic if, in our quest for an easy to understand name, we actually made these collections harder to find once they become more widely known in the industry. (As an anecdotal data point, I heard the term used on the Future of Coding podcast the other day.)

kiel · November 10, 2022, 2:09am

I like this line of thinking (and the word "big" or "large") because it helps me as a developer decide what type might be appropriate to use for the data I have to work with. It tells me what these data types are for. If I am looking for a type to work with a lot of data, "big thing" or "large thing" would occur to me before "sharable thing" (sharing with what? other scopes? I'll want to do that with pretty much any data type).

I think "persistent" is the wrong word unless these data types are getting the ability to preserve or revert to previous versions of themselves. A quick search for "persistent data types" and you get a definition of what these types are not since 1986.

I notice the API for these types seem to be identical with Set and Dictionary (unless the plan is for them to get the ability to preserve or revert to previous versions). Is that right? If so, is it fair to characterise these additions as amounting to an optimisation of an implementation detail (state is copied on write) of those existing data structures? If that is fair, is there any point to having entirely new data types? Could this functionality be accessed through special initialisers on those existing types and therefore avoid the naming problem -- or confine it to an initialiser? (I remember thinking class clusters were cool once upon a time).

tera · November 10, 2022, 2:23am

+1 for "Big", same reasoning.

+1 for the hint parameter to normal Dictionary / Set.

And for data structures that allow accessing previous versions names "VersionedSet" / "VersionedDictionary" sound reasonable.

lorentey · November 10, 2022, 4:18am

I don't feel as strongly about the distinction between "shared" & "shareable" as I do about some of the other naming options. I don't feel that shared_ptr is a great analogue, though -- in that case, the pointers are sharing ownership of their pointee, and two equal pointers are, by definition, pointing to the same physical object. Two equal ShareableSets aren't guaranteed to share any storage between them.

Interesting. Do other folks have similar expectations when reading ShareableSet?

FWIW, this does rhyme with my initial worry that the word "shareable" might confuse some folks into believing that these types support concurrent mutations of the same value.

Nevertheless, I like "shareable" because it seem less opaque to newcomers than "persistent", and it doesn't seem nearly as alienating to folks who are already familiar with these data structures as some of the other options.

The thing is, as far as I can see, the only use cases for ShareableSet/ShareableDictionary are situations where the all-or-nothing copying behavior of Set/Dictionary isn't desirable. The way these types implement shared mutations isn't just an implementation detail, but it's the core reason for their existence.

These types have (at first approximation) the same API surface as the standard hashed collections, with roughly the same behavior, too. The only meaningful behavioral difference is that they optimize for mutating shared rather than unique storage.

This crucial difference is why these types exist, and I think we should choose a name that reflects this.

I still think that calling these BigSet and BigDictionary would be extremely misleading. These types aren't universally good choices at storing large amounts of data -- the standard Set and Dictionary are far more efficient choices, unless we expect that the collection value will sometimes be mutated while copies of it are still around.

I expect a uniquely held flat hash table with pre-reserved capacity will win an insertion race against any persistent data structure, and it won't even break much sweat doing it. Insertions into a uniquely stored ShareableSet are 2-8 times slower than into Set, and (on average, for big enough data sets) the result also tends to use more memory.

(I'm pretty sure we can considerably improve the memory usage story (I filed some low-hanging bugs to that effect), and I expect we may be able to speed up insertions by, say, 30-40% with some extra work. However, improving memory use will require more frequent node reallocations (to get rid of unused space in internal nodes as their contents get replaced by child refs), and that will slow down insertions even more.)

The situation will be even more messy when we get persistent versions of Array and String. A tree-based Array equivalent will be nowhere near as efficient as a flat Array when accessing items at random positions; it won't even conform to a RandomAccessCollection. So calling these BigArray and BigString would likely be even more misleading than BigSet and BigDictionary. (Then again, we do have the option to go with something like List and Text instead, treating these as brand new categories rather than just variants of existing constructs. But ideally the qualifier we choose would be universally applicable -- we should be setting a precedent for consistent terminology.)

I think highlighting the relative plasticity of these types (and most tree-based data structures in general) is an interesting option, but it's firmly in the "inventing new terms of art" camp, and I don't think it's desirable here. These are about as intuitive names as if we called them ElbowSet and KneeDictionary.

That said, I think these do have the benefit over BigSet of not being actively misleading.

Indeed.

I expect at minimum we'll want to mention persistent data structures in the documentation of these types.

Well, this is all relative. They are more recent than the invention of, say, the B-tree, but the technical term has been around since at least 1986 (Drisco, Sarnak, Sleator, Tarjan 1986), and the precise sense in which we'd be using the term was very much established a decade later:

Traditional methods of amortization break down when old versions of a data structure, not just the most recent, are available for further processing. This property is known as persistence, and is taken for granted in functional languages. (Chris Okasaki's PhD thesis, September 1996)

I don't think 26-36 years ago was that recent.

lorentey · November 10, 2022, 4:43am

They are close, but the APIs aren't identical.
ShareableSet/ShareableDictionary don't provide the reserveCapacity method or the capacity property, and these persistent variants are much stricter about invalidating indices.

The standard Set and Dictionary are documented not to invalidate indices on insertions unless the insertion requires storage to be resized -- ShareableSet and ShareableDictionary have no concept of structure-preserving insertions, so they need to invalidate all indices on every insertion.

(A more obvious difference is that the ShareableDictionary.Values view is not a MutableCollection, for a similar reason. This one we could fix, but at the cost of considerably slowing down indexing operations -- we'd need to remove the direct pointer to the final node from the index representation, as the node may need to be copied on value mutations.)

No and no for the first two questions, even if we ignore the subtle behavioral differences about indices.

Like all other native collection types in the stdlib, the memory layout of the standard Set and Dictionary is part of Swift's ABI, and it is (naturally, and fundamentally) incompatible with the way ShareableSet and ShareableDictionary works.

Unlike NSArray/NSSet/NSDictionary/NSString, the standard collection types are not opaque. The native Swift way to vary representation is by using code that is generic over protocols.

I don't think this is an option. Even if we were okay with breaking ABI (we aren't), I strongly suspect we would still want these new set and dictionary representations to be different types.

Karl · November 10, 2022, 5:21am

I can see it. One thing that is worth considering is that what is actually shareable is the storage, not the set.

The idea of sharing storage does not make sense unless that storage has some kind of underlying reference semantics - by their nature, values cannot be "shared". The concept of "shared Ints" does not make sense, and similarly, the idea of "shared Set<Int>s" does not make sense.

I'm not a fan of the name "Big" though. I would have a lot of expectations of a "BigSet" that this data type would not meet. These days, I think the name "Big" is probably more closely associated with BigData/distributed systems (e.g. Google's BigTable and BigQuery systems)

hisekaldma · November 10, 2022, 12:47pm

Figured I might be wrong on that. As they say: “the best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer”

Regardless of when they were invented, my impression is that it’s only during the last 15 years that they’ve become more widely used/known, basically popularized by Clojure. But I might be wrong on that too!

rvsrvs · November 10, 2022, 4:11pm

Okasaki did his implementations in ML and Haskell. i.e. the research on these all came out of the FP community. Clojure being in the line of FP langs with immutable data types was able to ingest those algorithms immediately.