[API Review for 1.1] Hash-Array Mapped Prefix Trees

Paul_Cantrell · November 1, 2022, 10:25pm

I think that’s exactly wrong.

As the proposal itself points out, there are many ways of implementing these ADTs so that portions of the underlying structure can remain shared (“persist”) across changes. A HAMT / CHAMP comes with very specific time / space complexity tradeoffs, and those specific tradeoffs are exactly what’s important here. If the purpose is to provide a specific set of performance and complexity characteristics provided by a specific data structure, just name the data structure!

lorentey · November 1, 2022, 11:48pm

"Persistent" is the correct term of art for these data structures. The #1 thing that distinguishes PersistentSet from Set is that the former is able to gracefully handle cases where older snapshots are preserved across mutations, i.e, that it is a persistent data structure.

That said, I chose these names specifically to trigger this argument.

I actually do not like the names PersistentSet and PersistentDictionary, not because they aren't precise (they are), or they are misleading (only to a subset of engineers, and the misunderstanding is superficial), but because these names are way too unwieldy to be practical. "Persistent dictionary" falls very much in the same naming philosophy that gave us "Unicode scalars", and I would very much like to avoid repeating the same mistake if we can. (The unwieldy, impractical name for Unicode scalars is, I believe, mostly responsible for the regretfully low use of String.UnicodeScalarView in Swift, and the correspondingly neglected state of its APIs.)

So I'm very much open to discussing naming alternatives!

Michael's original names for these were HashMap and HashSet (from Scala et al), which I don't feel are a good fit -- I admit I personally do prefer the word "map" for a dictionary, but Swift's well-established tradition is to use "dictionary" for these, and I think we should stick to it. The prefix "hash" is not relevant -- the fact that these are hashed collections is not a distinguishing feature of these types, as so are the standard Set and Dictionary.

Hard no. We are not inventing a new data structure discipline from scratch here, so let's keep our feet planted firmly on the ground, and let's stick to established labels.

Yeah; perhaps emphasizing the structure of these values rather than the main semantic distinguishing feature is the right way to go. It does break with the (implicit, but obvious) naming scheme we've been using so far, but if we cannot find a good term to succinctly express the distinction, and we aren't comfortable with the (inarguably) term-of-art "persistent", then this is probably the way to go.

(Note how none of the names Array, Set, Dictionary, String, OrderedSet, OrderedDictionary and Deque tell you anything about their implementation. (The same is true for the upcoming, B-tree based SortedSet and SortedDictionary.) The names are mostly determined by the functionality these types provide; it is never just the name of the underlying data structure. The names OrderedSet and SortedSet tell you how these types differ from the standard Set in their high-level capabilities; they don't give you any insight into their implementation details. I think this is Good.)

My notes:

I've been making an intentional effort to eliminate all mentions of the word "trie" throughout this codebase & associated documentation -- much like "deque", "trie" is a shiboleth, as its "correct" pronunciation (same as "tree", from "retrieval") is extremely non-obvious. (I regret using Deque, and I'd like to avoid making the same mistake again.) Unlike deque, the word "trie" is also quite confusing in everyday conversation -- I noticed I kept spelling it out aloud to avoid my conversation partners hearing it as "tree" or "try". I prefer using the term "prefix tree" instead, which doesn't have these drawbacks, but I think it's rather too long (and waay too technical) to use in top-level API names.
I would also prefer if we could avoid using acronyms or initialisms in top-level type names. HAMT and CHAMP are very precise and succinct names, but I feel they are quite uninviting unless you are already familiar with the underlying data structures.

One reasonable idea would be to direct attention to the obvious structural difference -- that the new types do not use a flat storage buffer, but they organize their contents into a tree. Knowing the name of the precise data structure doesn't really seem relevant to me when using these, but knowing that these are trees does strongly carry the implication that they would be persistent (with logarithmic complexity mutations), and that is the essential thing these names must achieve.

What if we named these TreeSet and TreeDictionary?

lorentey · November 2, 2022, 12:17am

Both isEmpty and count are O(1) operations.

Side note: Incidentally, in the current implementation, the tree is augmented with subtree counts, so the index(_:offsetBy:) operation takes logarithmic time. This seems unnecessarily fast (who needs to randomly jump around in an unordered collection type?), and storing all those subtree counts has very regrettable memory costs, so I have plans to get rid of them in the near future. (The current implementation was a nonetheless very interesting exercise that will prove extremely useful in the B-tree effort.) (isEmpty and count will remain O(1) operations even if the augmentation gets removed.)

Side note 2: even with the unnecessary augmentation, and even though node shrinking hasn't been implemented yet, the memory usage of a standalone PersistentDictionary (with no shared nodes) is already quite competitive with the standard Dictionary, especially at smaller counts, despite the overhead of having to account for node references.

lorentey · November 2, 2022, 12:40am

hisekaldma:

Another naming question:
lorentey:
func isEqualSet(to: `Self`) -> Bool
func isEqualSet<Value>(to: PersistentDictionary<Element, Value>.Keys) -> Bool
func isEqualSet<S>(to: S) -> Bool
It feels like there’s some potential for confusion with elementsEqual() here. Since these are in the same family as isSubset(of:), isSuperset(of:), etc would it make sense to call these isSameSet(as:) or isIdenticalSet()?

I think/hope that the suffix Set in isEqualSet should already strongly imply that these are testing set equality, rather than doing lexical comparisons.

The distinction does get extremely important for OrderedSet, whose == implementation implements lexical comparisons, but we still want its isEqualSet to treat it as a regular set (just like the preexisting implementations of isSubset/isSuperset etc.).

I don't love the names isSameSet and isIdenticalSet -- they could be easily read to mean that these are comparing the set's identity rather then its elements (like === does for class instances). The stdlib's documentation uses the term "identity" to mean that. These operations are testing for set equality in the mathematical sense, so it ought to be okay to name them that way.

Perhaps the real problem is with the naming of Sequence.elementsEqual. It implicitly assumes that the ordering of elements matters -- an assumption that isn't true for Set/PersistentSet. But sadly that's water under the bridge now.

Indeed! I see two options: formalize and define these for Set in swift-collections first, then propose them as official APIs in the stdlib, or we could go directly to Swift Evolution. I think it would be easier to argue for adding these to the stdlib if we already had a variety of use cases for these.

(There is also a chance that these are too niche to fit in the stdlib. The omission of a generalized equality check is quite an obvious symmetry violation in the stdlib, but I only realized it was missing while implementing my nth SetAlgebra type, so perhaps it's rarely if ever needed in practice.)

Good catch, thanks! Would this be better?

PersistentSet supports all standard set comparisons (subset tests, superset tests, disjunctness test), including the customary overloads established by Set. As an additional extension, the isEqualSet family of member functions generalize the standard == operation to support checking whether a PersistentSet consists of exactly the same members as an arbitrary sequence. Like ==, the isEqualSet functions ignore element ordering and duplicates (if any).

scanon · November 2, 2022, 2:30am

These are the only alternatives I've seen that I don't reflexively hate (but: I expect a decent chunk of users to think those are ordered set/dictionary at first). There's a rich literature that calls these "persistent" data structures, and thus a pretty heavy burden to justify using any other name.

Certainly PersistentX is not quite as unwieldy as String.UnicodeScalarView. I don't love it, but I can live with it, especially considering that these are mostly useful as building blocks for other things that can hopefully get more approachable names.

tera · November 2, 2022, 2:47am

Sorry, I still don't follow. "consists of exactly the same members as an arbitrary sequence" ?
Perhaps a small example will do that shows when "==" and "isEqualSet" give different results. Or two examples if both of these is possible: "==" -> true, "isEqualSet" -> false and "==" -> false, "isEqualSet" -> true.

~~As you are not commenting on the naming choice I take it you are firmly settled on it and we should no longer suggest renaming?~~

Edit: missed your comment on naming above.

I wonder what name would you give to a dictionary implementation that is persisted to disk?

"Tree" to me somehow suggests it will use "less than" vs "hash" and that I can traverse it left to right / right to left, depth first, etc.

patrickgoley · November 2, 2022, 3:59am

Big +1 and I’m all for keeping the Persistent name since it’s in line with prior art. A quick google of PersistentSet or persistent data structures will bring up similar implementations and detailed explanations.

As a side note. I do wonder how much the need for this will be mitigated by the proposed ownership model. takeing a Dictionary or Set and modifying it would be significantly faster than the extra bookkeeping needed for persistent mutations. Obviously not all scenarios allow for a take to occur, so still a useful tool.

Dante-Broggi · November 2, 2022, 4:05am

As a variation on PersistentSet, how about PersistableSet?
Because it does not persist itself, but is designed to be efficiently be persisted.

Otherwise I think TreeSet is best, and PersistentSet is acceptable so long as the domain in which it is a term of art is noted in the docs.

tera · November 2, 2022, 4:23am

Ok.. Here's a compromise naming:

/// `PSet` /ˌpiːˈset/
/// "P" stands for "Persistent" as in "Persistent data structure" (https://en.wikipedia.org/wiki/Persistent_data_structure)
/// Not to be confused with "Persistent storage".
///
/// An unordered collection of unique elements, optimized for mutating shared
/// copies and comparing different snapshots of the same collection.
///
/// `PSet` has the same functionality as a standard `Set`, and
/// ...
/// table. However the algorithmic improvements above usually more than make up
/// for this, as long as the use case can make use of them.

@frozen // Not really -- this package is not at all ABI stable
public struct PSet<Element: Hashable> {
    ...
}

Let's have a closer look, from wikipedia article:

Does the proposed data structure adhere to that definition? Would be really cool to be able enumerating versions / accessing previous versions (in read only mode would suffice). Ability to merge would be also cool to have. OTOH, if versions are not accessible (and there are no plans to make them accessible) then perhaps the name "persistent" would be confusing even to folks already familiar with the term.

I tend to agree with this. If it can be implemented via many different ways (as google suggests), including some naïve "copy everything" approaches – it could have very different O-big characteristics.

Is it correct to assume that all Big-O complexities of the proposed implementations match those of normal Set/Dictionary (just with a different constant factor upfront that doesn't change Big-O complexity)?

hisekaldma · November 2, 2022, 6:33am

Indeed, if they had been called CHAMPSet/CHAMPDictionary I wouldn’t have had a clue what these were. I’m only familiar with these types of data structures as a potential user of them, not as an implementer of them.

This doesn’t tell really tell you anything about the most salient characteristic of the types though: the sharing/persistence. And it doesn’t distinguish them from the B-tree collections; those are also “tree sets” and “tree dictionaries.”

If we don’t go with “persistent”, I think we need to find something that hints at you why you’d use these types. Something like SharedTreeSet and SharedTreeDictionary. Or maybe SharedNodeSet and SharedNodeDictionary? It’s a bit unwieldy, but at least it gets the point across.

EDIT: Come to think of it, would PersistentNodeSet and PersistentNodeDictionary alleviate the concerns people have with the word “persistent”? It is after all not the entire set/dictionary that persists but the nodes inside it.

dhoepfl · November 2, 2022, 6:40am

While it might be the correct term for the data structure, it also is the correct term for a data structure that is persisted on disk. So it is highly confusing. Especially since there are other use cases, such as NSPersistentStore (and many others in Core Date). In Foundation there is FileVersion.persistentIdentifier.

If you like the name, think about where you come from: You come from the data structure and think about a matching name. Most coders (and code readers!) will see the class name and try to understand what it does.

As someone reading code, I do not care about the implementation detail, I care about what the class is good for. The class should be named based on that.

These new classes shine when they store large amounts of data. So how about LargeSet? Or BigSet.

hisekaldma · November 2, 2022, 6:53am

The way I understand it isEqualSet always gives the same result as ==. The difference is how they can be used: == can only be used when both the LHS and RHS are PersistentSet. isEqualSet can be used with an arbitrary sequence as the RHS, e.g. an array, without first converting it to PersistentSet.

But that makes me wonder, should these actually just be overloads on ==? It is the canonical spelling for equivalence after all, and there is some precedent for overloading ==. Or would that completely trash type-checker performance?

tera · November 2, 2022, 7:33am

I can see now, thank you.

+1

So is it only when the sequence is on the RHS it is allowed, but not on the LHS? Somewhat strange.

Another strange thing would be lack of transitivity:

pset == seqA // true
pset == seqB // true
but
seqA == seqB // can be false

Would be more fair to have something like this:

pset == PSet(seqA) // true, where converting from setA to "a special kind of PSet" is "quick" (in some "toll free bridging" sense of the word)
pset == PSet(seqB) // ditto
PSet(seqA) == PSet(seqA) // still true, and still quick conversion
seqA == seqB // can be false, obviously, but doesn't contradict anything

Edit:

This would be somewhat analogous to:

var a: Double = 2.718281828
var b: Double = 2.718281827
var c: Float = 2.7182818

c == a // 🛑 Binary operator '==' cannot be applied to operands of type 'Float' and 'Double'
c == Float(a) // true, and conversion is quick
c == Float(b) // true, and conversion is quick
Float(a) == Float(b) // true, and conversion is quick
a == b // false

As a bonus there would be no proliferation of "isEqual" functions.
On the minus side there would be two versions of initialiser from a sequence - one that makes a real representation (O(n) time complexity) and another that makes "lazy" representation (O(1) complexity), and some added complexity when (and if) the lazy representation needs to be converted to the real representation. Tradeoffs.

hisekaldma · November 2, 2022, 7:56am

Yup. That’s a very good argument against making it an overload on ==.

Karl · November 2, 2022, 2:27pm

Aside: I actually quite like the term "unicode scalar".

The unfortunate truth is that Unicode is a mess of terminology - from code-units to code-points ("Unicode Scalars") to extended grapheme clusters (what we call a "Character"). A lot of this is due to legacy considerations and the particular way Unicode developed.

The real reason UnicodeScalarView is not used so often is, IMO, because the Character view is superior for high-level Unicode text processing, and the UTF8 view is superior for ASCII-centric text processing. The scalar view sits in a weird middle-ground and most developers should prefer one of the other views. It is a good thing that this view sees low use.

Bringing it back to the topic (and the discussion about naming things): in the case of Unicode, Swift ignored the established terms of art and - in fact - repurposed those terms (Unicode refers to code-points/scalars as "characters"). Swapping the meanings of terms in such a technical, nuanced domain would typically be considered highly inadvisable. And yet, IMO, our solution is clearer than Unicode's own terminology, and shows the value of designing the best API for today's Swift developers - the developers we actually have - rather than just inheriting all of the legacy baggage because of established practice or terminology.

tera · November 2, 2022, 10:33pm

+1. Another established state of art tradition is using reference types for mutable objects like dictionaries and arrays, and Swift wisely didn't follow that tradition.

lorentey · November 4, 2022, 3:16am

I believe the existing OrderedCollections/OrderedSet/OrderedDictionary are eminently named, because they highlight the most salient distinguishing feature of these implementations with a succinct, unambiguous prefix. (OrderedDictionary is a bit long, but oh well.) These are the right names, even though they do not convey that these are hashed collections, nor do they imply anything about the precise data structures that they implement.

Similarly, the future SortedCollections, SortedSet and SortedDictionary look as perfect to me as Swift names can be, for the same reason. "Sorted set" does not spell it out that this will also be a (sort of) persistent data structure, or that it will implement an in-memory b-tree variant, or even that it requires Comparable rather than Hashable from its Element type. While these are all super interesting features, the primary reason one would choose to use these is that they keep their items sorted, and this is what should be reflected in their naming.

I bring up OrderedSet and SortedSet as excellent names, because they highlight a very obvious but also very much unavoidable "issue" -- the difference between the two is far from intuitive. We're using the words "ordered" and "sorted" to mean very different things, but I know many (most?) newcomers initially expect "ordered set" to mean the same as a "sorted set", and (as I have seen, repeatedly) they often reach for one when they would be better off with the other.

However, as long as the documentation is properly explaining the distinction, and as long as we always use these terms consistently, this is not at all a problem: if you're new to these collection types, you quickly learn the terminology when you first encounter them, as part of the general learning process. Equally importantly, these names have enough precedent that a veteran programmer who has used these in other languages will hopefully find them relatively familiar and intuitively obvious.

The same is true for PersistentSet and PersistentDictionary. I don't love the prefix "persistent", as (1) I find it hard to say it aloud without rolling my eyes and pausing for breath and (2) I find that it's precise meaning is even less obvious to people new to this concept than the "sorted" vs "ordered" distinction.

Note that as Steve has pointed out, the first issue here is my own personal problem, not an objective truth -- and also note that I do like these well enough to propose to use these names in the first place.

The second issue, on the other hand, is probably unsolvable. I very much doubt we can find a usable name for these whose meaning will be intuitively obvious to people who have never come across such data structures before.

What I'm hoping though is that someone would point out an obviously better alternative that is still cromulent to folks who have used these in other languages. (And perhaps avoids the superficial & temporary -- but still real -- confusion about the meaning of the word "persistent".)

Failing that, I'm very happy to stick with PersistentSet and PersistentDictionary.

SharedTreeSet is a bit of a mouthful -- it would be nice if we could stick to the pattern, and prefix these names with a single, preferably short term. Highlighting that these are sharable would get rid of the need to describe their internal representation -- so we can delete Tree. (I don't see how using the term Node would add anything, either.) Instead of Shared, Sharable might be a more appropriate prefix.

This way, we'd end up with SharableSet and SharableDictionary. I don't think I have seen these used elsewhere, and a search did not pop up any prior art, either. But I also don't immediately hate them -- they are just generic enough to (maybe) let them go past my no-newly-invented-terminology rule.

The crucial question is: would these names be obvious enough to folks who have used such data structures elsewhere?

For example: would SharableSet be likely to be misunderstood as a set that is safe to share across threads or actors?

I do think highlighting that these are trees does heavily hint at their persistency / shareable nature, but on reflection I strongly agree with your second point about the module name -- PersistentCollections isn't nearly specific enough, and something like TreeCollections/SharableCollections would share the same drawback.

We got a free pass on omitting "hashed" from the names PersistentSet and PersistentDictionary, because it's implied from the naming of the standard Set & Dictionary. But PersistentCollections reads like it would be the right module to hold all persistent data structure implementations, which isn't correct -- for example, SortedSet and SortedDictionary will also be (at least somewhat) persistent data structures, but they will ship in a separate SortedCollections module, not in PersistentCollections. So probably the module name should include some qualifier about the hashing nature of its contents.

The stdlib documentation calls Set and Dictionary "hashed collections", so PersistentHashedCollections and SharableHashedCollections are the current front runners for the module name for me.

That is generally not true. These types optimize for mutations of shared copies; however they aren't necessarily any better for large data sets than the standard Set and Dictionary, unless the collections do get copied before mutation, at least some of the time.

In general, I expect that unless snapshots are kept, insertions/removals into these will be slower (by a constant factor) than the standard types. (Lookups will often be quite close, as not having to probe through multiple candidate buckets seems to sometimes make up for the overhead of having to dereference a series of children down the tree. The effect is likely larger for complex keys like String than it is for, say, simple integers.) To add insult to injury, the new types are typically using less memory than the standard types for really tiny counts (because of the clever node compression scheme), but if there are many items, the tree-based types (currently) tend to use a bit more memory overall.

"Set that optimizes for shared mutations" really is the thing that the names need to succinctly convey to people.

lorentey · November 4, 2022, 3:59am

Re isEqualSet(to:): I believe it would be desirable if we could check whether or not a set-like collection type contains the same elements as an arbitrary sequence, ignoring the order of items and any duplicates.

In Set and OrderedSet we already provide a family of such checks for isSubset, isStrictSubset, isSuperset, isStrictSuperset and isDisjunct, each taking sequences. That we do not provide a similar generalized check for equality seems like an oversight.

This omission gets more painful the more set-like collection types we have -- it would be very much desirable to be able to easily check if a PersistentSet contains the same elements as a Set, for example.

Adding additional overloads on == is a non-starter for me. I do not wish to add overloads that support heterogenous comparisons, or that make some overloads of == generic -- I find them too confusing to consider. Following the is* naming convention established by the other set comparison operations, isEqualSet(to:) seemed like the obvious choice.

Note also that as I pointed out above, == is not necessarily the same operation as isEqualSet(to:) -- the latter always implements mathematical set comparisons (just like isSubset(of:) / isStrictSuperset(of:) etc.), while the former may have different semantics. The canonical example where the difference matters is OrderedSet: its == operator does lexical comparisons, so it does not (cannot) conform to SetAlgebra. Still, it is desirable for OrderedSet to directly provide the customary set operations for convenience, and the customary operations ought to include the ability to compare two ordered sets for set (as opposed to lexical) equality (or indeed, to compare an ordered set with a standard Set, a sorted set, a persistent set, or whatever).

hisekaldma · November 4, 2022, 9:29am

Those aren’t too bad! They’re not as obvious to me as PersistentSet, and PersistentDictionary, but they definitely convey what’s important.

I think the only problem is spelling. I know that “sharable” is a common American spelling, but to my non-American eyes it just looks like someone couldn’t spell “shareable”. And for some reason, it’s much more grating than -able.

Regardless of spelling and whether the prefix is Persistent or something else, I don’t think Hashed adds anything in the module name. If a single word is good enough for the types, that same word is good enough for the module. And breaking the pattern established by OrderedCollections just invites confusion: if these are the persistent hashed collections, does that mean there are persistent non-hashed collections?

Given that the term for data that’s safe to share across concurrency domains in Swift is now ”sendable”, and the regular Set and Dictionary are already conditionally Sendable, that particular misunderstanding doesn’t seem too likely.

A more likely misunderstanding for me at least would be that they’re somehow shareable across the network, i.e. that they’re CRDTs.

tera · November 4, 2022, 12:44pm

So much so to the extent such naming doesn't really help – with so many years of experience I'd still have to look up what difference you are talking about – and doubly so because such naming is so generic... Granted the exact match searching for "OrderedSet swift" would lead to a proper page but not so search for "ordered set" / "sorted set"). Names like "BTree", "AVLTree", "RedBlackTree", (often shortened to RBTree) are so good because their names, while not saying anything obvious upfront, are unique, unambiguous and don't pretend to mean something that no other data structure can have ("are you telling me I can not have a completely different set implementation with totally different performance characteristics / requirements / oddities that is also sorted?"). To my ears the name "Persistent" for these structures is even worse than the other notorious term of art name we are doomed to have: "QuickSort".