[API Review for 1.1] Hash-Array Mapped Prefix Trees

drewmccormack · November 10, 2022, 4:22pm

Adding my thoughts as sparsely as possible...

Persistent - First reaction is it has to do with storage.
Shared - First reaction is it is about cross thread sharing.

"Persistence" of history is what distinguishes these types, but the term is very confusing because it is so commonly used for other purposes. Perhaps similar terms, which convey unchanging history, could work:

PerpetualSet
ArchivalSet
ChronicledSet
NarrativeSet
RecountingSet/RecountedSet

I realize these aren't existing terms from the field, but IMO trying to reuse such terms is destined to lead to confusion, simply because they are so overloaded. Better in my view to find a term that cannot be confused, and one which conveys the most important property of the types, i.e., that they are history preserving.

hassila · November 10, 2022, 7:57pm

So it’s really the ‘MultiverseSet’

Edit: to not just be a drive-by joke - I agree with what’s been said about both persistent and shareable and their connotations. But it’s logically a “MultiversionSet” or “VersionedSet”, no? (PolySet?!)

hisekaldma · November 11, 2022, 12:12pm

That implies that the set keeps track of all previous versions, which it doesn’t – it only shares data with them. It’s perfect as a building block for a VersionedSet though!
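
To make the distinction concrete, here is a minimal sketch of what such a version-tracking wrapper might look like. `VersionedSet` is only the hypothetical name floated in this thread, not a type in Swift Collections. The sketch is generic over any `SetAlgebra` value type and uses the standard `Set` so that it runs without dependencies. With a persistent set as the base, each stored snapshot would share storage with its neighbours instead of being a full copy.

```swift
// Hypothetical sketch: `VersionedSet` is not part of Swift Collections.
struct VersionedSet<Base: SetAlgebra> {
    // Every recorded snapshot, oldest first; starts with the empty set.
    private(set) var history: [Base] = [Base()]

    // The most recent snapshot.
    var current: Base { history[history.count - 1] }

    // Inserts `element` and records the result as a new version.
    mutating func insert(_ element: Base.Element) {
        var next = current
        next.insert(element)
        history.append(next)
    }

    // Returns the snapshot recorded at `index`.
    func version(_ index: Int) -> Base { history[index] }
}

var tags = VersionedSet<Set<String>>()
tags.insert("swift")
tags.insert("hamt")
print(tags.version(1).count, tags.current.count) // prints "1 2"
```

The wrapper is what keeps track of history; the underlying set type only has to make holding many snapshots cheap.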

hooman · November 11, 2022, 5:15pm

Thank you for bringing this up.

The issue with (using the initial term of art) PersistentSet / PersistentDictionary is that the reason for using them is not a high-level domain requirement (as is the case for OrderedSet, SortedSet, ...). They are just the original currency types with different performance/memory trade-offs. They provide better performance/memory characteristics for some usage patterns, and worse in others. But they do not provide meaningful added functionality compared to Set/Dictionary.

It would be ideal if we could hide them as implementation details of the main currency types and have the currency types automatically switch their internal representations as the usage pattern of a given Set/Dictionary became clear. The APIs unique to them can also be added to the currency Set/Dictionary APIs and calling them could cause the internal representation to switch. This would be similar to what String and NSArray do today.

I think we should provide VersionedSet / VersionedDictionary as the meaningful-at-the-domain-level collections and also have the proposed collections public as the low-level foundation of these high-level collections. We may be able to choose a name that indicates they are the lower-level core data structures behind versioned Set/Dictionary that can be used directly to address performance issues with Set/Dictionary while these are not integrated as automatic runtime optimizations of the standard library Set/Dictionary.

It would make them easier to understand and teach: Each modification creates a new version/snapshot, not a full copy. Each instance sees its own version/snapshot and can modify its own branch. Then we can say if you know that you are going to have a large Set/Dictionary that is going to be passed around and mutated a lot, use these snapshotting variants of Set/Dictionary. This may help find a better/clearer name for them as well. (Maybe BranchedSet/BranchedDictionary?)
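
The "new version, not a full copy" idea can be shown in miniature. The following is a sketch under stated assumptions: it uses an immutable binary search tree with path copying rather than a HAMT, and `Node`, `inserting(_:into:)` and `contains(_:in:)` are hypothetical names made up for the illustration. Only the nodes on the search path are rebuilt; every other subtree is shared between the old and the new version.

```swift
// Hypothetical sketch: an immutable binary search tree with path copying.
// This is not a HAMT, only the same sharing idea in miniature.
final class Node {
    let value: Int
    let left: Node?
    let right: Node?
    init(_ value: Int, left: Node? = nil, right: Node? = nil) {
        self.value = value
        self.left = left
        self.right = right
    }
}

// Returns a root for a tree that also contains `value`. Only nodes on the
// search path are copied; all other subtrees are shared with the input tree.
func inserting(_ value: Int, into node: Node?) -> Node {
    guard let node = node else { return Node(value) }
    if value < node.value {
        return Node(node.value, left: inserting(value, into: node.left), right: node.right)
    } else if value > node.value {
        return Node(node.value, left: node.left, right: inserting(value, into: node.right))
    }
    return node // already present: the existing subtree is reused as is
}

func contains(_ value: Int, in node: Node?) -> Bool {
    guard let node = node else { return false }
    if value == node.value { return true }
    return contains(value, in: value < node.value ? node.left : node.right)
}

let v1 = [5, 2, 8].reduce(nil as Node?) { inserting($1, into: $0) }
let v2 = inserting(9, into: v1) // a new version branching off v1

print(contains(9, in: v1), contains(9, in: v2)) // prints "false true"
print(v1?.left === v2.left) // prints "true": the untouched subtree is shared
```

Both `v1` and `v2` remain valid and independent, which is the branching behaviour described above.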

As a side note: I think we need to integrate STM (Software Transactional Memory) with Swift actors to provide elegant solutions for actor reentrancy issues and these could help there as well.

lorentey · November 12, 2022, 12:27am

This is true; however, providing such variants is well within the mandate of this package. Some new data structure implementations will be inherently just variants of existing types, with different performance trade-offs. We need to embrace and welcome this, not try to sweep it under the carpet.

For example, the upcoming BitSet is effectively just a new variant of Set<Int>, with very different performance/memory trade-offs. BitArray is "just" a new variant of Array<Bool>. A tree based Text type (whatever we will end up calling it) will be just a new variant of String.
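
As a rough illustration of how a type can offer set-like operations with a very different representation, here is a minimal bit set sketch. `TinyBitSet` is a hypothetical name and a deliberately simplified stand-in, not the package's BitSet: it stores one bit per possible non-negative member, so its memory use depends on the largest member rather than on the number of members.

```swift
// Hypothetical sketch: `TinyBitSet` is illustrative, not the package's BitSet.
struct TinyBitSet {
    // Bit (i % 64) of words[i / 64] records whether i is a member.
    private var words: [UInt64] = []

    private func mask(_ member: Int) -> UInt64 {
        UInt64(1) << UInt64(member % 64)
    }

    func contains(_ member: Int) -> Bool {
        guard member >= 0, member / 64 < words.count else { return false }
        return words[member / 64] & mask(member) != 0
    }

    mutating func insert(_ member: Int) {
        precondition(member >= 0, "only non-negative integers can be stored")
        // Grow the storage until it covers the word holding `member`.
        while words.count <= member / 64 { words.append(0) }
        words[member / 64] |= mask(member)
    }
}

var small = TinyBitSet()
small.insert(3)
small.insert(130)
print(small.contains(3), small.contains(4), small.contains(130)) // prints "true false true"
```

Membership tests are a shift and a mask, but inserting a single very large integer allocates storage for every smaller one, which is the kind of trade-off that a `Set<Int>` does not have.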

A data structure providing the same API as an existing type does not mean that the two implementations must be amalgamated into some sort of hybrid. On the contrary -- I believe in the value of keeping the performance of Swift's concrete collection types predictable. A type that magically switches between fundamentally different data structure representations is the opposite of that.

(I say this while being fully aware of (and in very deliberate opposition to) how the standard collection types are sometimes able to dispatch to the classic NSArray, NSSet, NSDictionary, NSString classes ("verbatim bridging" or "lazy bridging"). I don't consider blurring the lines between such mismatching data structures a success story, largely because of the way it makes performance difficult to predict (or rely on).)

There is also a significant up front performance cost to dynamically dispatching operations to select between various competing algorithms, and (for generic collection types) there is an especially large cost to keeping parts of the generic implementation internal to the module, to allow for replacing it with a newly implemented data structure later.

Once again: ideal or not, like it or not, this is not feasible. I see no practical way to do this. It's not in the cards. It's off the table.

There is no way to turn a Swift Array into a B-tree, or a Swift Set into a HAMT.

Swift has fundamentally diverged from Objective-C in this regard. The implementations of Swift's core container types aren't dynamic.

The memory layouts of Set, Dictionary, Array, and String are deeply etched into Swift's ABI. These types work as well as they do because they expose their implementation to clients. With the partial exception of the non-generic(!) String, none of the standard types have hooks to allow their operations to dispatch to non-inlinable (i.e., extensible) parts of the stdlib. These types are what they are.

The way to make a piece of Swift code work equally well over wildly different data structures is to make that code generic, and to express the algorithm using protocol requirements.
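
A small sketch of that approach, using only standard library types: `isSorted` is a hypothetical helper written purely against `Collection` requirements, so the same code runs over an array, a range, and a string even though their storage is entirely different.

```swift
// Hypothetical helper: true if every element is <= its successor.
// It relies only on Collection requirements, not on any concrete storage.
func isSorted<C: Collection>(_ items: C) -> Bool where C.Element: Comparable {
    zip(items, items.dropFirst()).allSatisfy { $0 <= $1 }
}

print(isSorted([1, 2, 2, 5]))  // Array<Int>: prints "true"
print(isSorted(1...5))         // ClosedRange<Int>: prints "true"
print(isSorted("hello"))       // String: prints "false"
```

Any new collection type that conforms to `Collection` works with this function without further changes.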

Indeed!

One important function of Swift Collections (and all the other stdlib-adjacent packages, like Swift Atomics or Swift Numerics) is to provide fundamental building blocks on which people can build more fanciful structures. In order to get to the higher-level types, we need to get these building blocks in place.

The HAMT-backed set and dictionary types are crucial foundations for certain higher-level types, ones that may have far more universal appeal. The concrete use case that we have mentioned before on this thread is CRDTs, but if you have a great idea for building a VersionedSet -- go for it!

Having excellent implementations of a core data structure such as a hash-array mapped prefix tree available for general use will allow a far wider set of people to experiment with such high-level constructs, without getting bogged down by the need to implement these basics.

Karl · November 12, 2022, 3:06pm

lorentey:

One important function of Swift Collections (and all the other stdlib-adjacent packages, like Swift Atomics or Swift Numerics) is to provide fundamental building blocks on which people can build more fanciful structures. In order to get to the higher-level types, we need to get these building blocks in place.

The HAMT-backed set and dictionary types are crucial foundations for certain higher-level types, ones that may have far more universal appeal. The concrete use case that we have mentioned before on this thread is CRDTs, but if you have a great idea for building a VersionedSet -- go for it!

Having excellent implementations of a core data structure such as a hash-array mapped prefix tree available for general use will allow a far wider set of people to experiment with such high-level constructs, without getting bogged down by the need to implement these basics.

To me, this suggests that it is appropriate for Swift Collections to vend types which are explicitly implemented using HAMTs; that instead of trying to think of a somewhat vague name which describes the usage but is neutral to the implementation strategy (e.g. ShareableSet), it should go in the complete opposite direction.

Once you start naming things based on usage, I would argue that you are instead implementing one of those higher-level types as you mention in the second paragraph.

The building blocks should be explicit about their implementation strategies in order to clearly communicate their strengths and weaknesses - this is wood, or brick, or stone; it doesn't help me build a house if my supplier only describes the building blocks as consisting of 'some house-building material'.

Nevin · November 12, 2022, 3:18pm

Karl:

To me, this suggests that it is appropriate for Swift Collections to vend types which are explicitly implemented using HAMTs; that instead of trying to think of a somewhat vague name which describes the usage but is neutral to the implementation strategy (e.g. ShareableSet), it should go in the complete opposite direction.

Once you start naming things based on usage, I would argue that you are instead implementing one of those higher-level types as you mention in the second paragraph.

The building blocks should be explicit about their implementation strategies in order to clearly communicate their strengths and weaknesses - this is wood, or brick, or stone; it doesn't help me build a house if my supplier only describes the building blocks as consisting of 'some house-building material'.

That’s a very good point.

Perhaps HashTreeSet would make sense.

lorentey · November 12, 2022, 10:15pm

You're arguing here against the established naming conventions of the Swift Standard Library, and standard libraries in general. This is an uphill battle, and I'm not really interested in participating in it.

I suspect you are also underestimating the value and importance of persistency as a functional distinguishing property. To me, a persistent set is functionally very different to a flat hash table, even if they implement the same API.

As I said above, the only reason these types exist is to provide persistency; the precise data structure they internally implement is useful to know, but it's strictly secondary to this property -- it ought to be described in the documentation, but it does not need to be repeated ad nauseam in the type name.

However, I also previously wrote:

So the current options that I find palatable are:

| set type | dictionary type | module name |
| --- | --- | --- |
| `PersistentSet` | `PersistentDictionary` | `PersistentHashedCollections` |
| `ShareableSet` | `ShareableDictionary` | `ShareableHashedCollections` |
| `TreeSet` | `TreeDictionary` | `HashTreeCollections` |

The first (and original) entry is what I think are the natural names for these. The second entry is the current choice, designed to eliminate confusion* about the meaning of "persistency" and to find a friendlier name than a bone dry Latin one.

(* I am not particularly convinced the confusion is more than skin deep; and no matter what name we choose, the documentation will need to explain that these types are persistent data structures, as that is the established term of art for these.)

People have noted that "shareable" can also be misinterpreted -- however, the confusion in that case seems even less convincing to me: this precise term isn't currently in widespread use for anything else in this field; the root "share" does pop up in a lot of places, but I don't see it being applied to container types. "Shareable" does have the drawback of being far less specific than "persistent", and it is arguably venturing uncomfortably far into the practice of Inventing New Terms of Art.

The structural names TreeSet/TreeDictionary would be mediocre choices that do not directly emphasize the crucial persistence property. But (1) they do indirectly hint at it; (2) they have plenty of precedent in other languages; and (3) they are so tepidly noncommittal that there is zero chance of (non-performative) confusion -- all users will need to learn the true properties of these types by reading the documentation. They also have the (to me) very attractive benefit of being succinct.

Thank you all for your input over these two weeks! It has been an interesting and productive discussion.

I'll ponder the arguments, weigh the pros and cons, run things by some experienced folks off this forum, and reach a decision on naming soon. I suspect we will end up going with one of these three choices.

Les_Pruszynski · November 12, 2022, 10:45pm

Having read the original "Pitch" and the discussion that ensued from that I still much prefer the names as suggested in the pitch.

barnard-b · November 13, 2022, 1:46am

My first thought when seeing the post title was persisting data to some type of storage.

This has been brought up already, but just to add another data point...

kiel · November 13, 2022, 9:43am

Some proposals include a “Future Directions” section so I’m guessing there are no further plans for these types that may affect their naming. Is this a fair guess?

Or perhaps there could be future directions but the package might introduce new, dedicated types that compose with these types, e.g. for true “persistent data structures” which preserve and revert between versions?

dhoepfl · November 14, 2022, 6:44am

Yes. 100%.

Two or more people have a shared history. You do not share your history with yourself.

drewmccormack · November 14, 2022, 8:21am

Indeed. I could live with "Shareable", but I think "Persistent" would be a grave mistake. It is so baked into our collective heads as pertaining to persistency between launches, i.e., on disk, that it would certainly lead to confusion. The first question anyone is going to ask is "How do I make this PersistentSet persist?" Who's on first?

While "Shareable" is acceptable, I still think it would be better to go for a term that means basically the same, but is not widely used, to avoid any prejudice about what it is for. E.g., the term "Stake" is used in the blockchain world, but not in the Swift world, and means basically the same thing (e.g., proof of stake = proof of your sharing, stake a claim = take a share). It is widely understood, but not overloaded in our community.

hooman · November 14, 2022, 3:07pm

Thank you very much for taking the time to provide the detailed response, I really appreciate it. It really helps clarify some very important points. It also deepens my concern with the direction of the development of the language and the lack of the voice of the vast majority of the language users who don't regularly follow or participate in these forums.

I am not a typical user of the language myself, but enough of an outsider to see the ever increasing bias in the language's developers and curators.

An aside about me if you are curious

I am just an old humble civil engineer who started programming with FORTRAN IV using punch cards and have been lucky enough to be exposed to a variety of languages and paradigms over the decades. Not to mention being through revolution, war, immigration, etc.

As you noted, the language is moving away from providing currency types that just work well enough in almost any case (I believe this is what mere mortal programmers like) towards static and well defined performance characteristics that computer scientists like. I understand that this precise and predictable set of data structures is absolutely necessary as low-level building blocks for framework and system level programming, and I really love and appreciate what the Swift Collections package is doing.

On the other hand, I thought Swift wants to be a language that helps ordinary developers write safe and performant code, even if they are not mathematicians and computer scientists.

I am too busy to write more, but there is so much more to say.

Meanwhile, to help younger folks get a sense of how Apple's currency data structures used to be, check out this 17-year-old blog post.

scanon · November 14, 2022, 6:29pm

I am confused by this perception; nothing is being taken away. Array and Set and Dictionary are still there, and most programmers should (and will) still use them. This proposal is about adding another option for cases where those types are inadequate, rather than making those types more complicated and difficult to use correctly.

hooman · November 14, 2022, 9:36pm

Sorry, I was not clear, and my post is technically off-topic:

We absolutely need the functionality provided by Swift Collections library and this particular API. I know we are not taking away anything from the standard library, and I understand that with ABI stability, Set, Dictionary and Array are here to stay as they are. Also, they are a good (to acceptable) fit for over 80% of use cases.

What concerns me is that we don't get the functionality provided by Swift Collections for (almost) free and as part of the currency types used with high level frameworks. I don't agree with this push away from providing high-level adaptable constructs like CFArray/NSArray (as linked above) which used to make life easier for average programmers by not forcing them to need something like the Swift Collections library to get adequate performance for their ordinary boring needs.

I regret that Swift Collections (and knowing the details of data structure trade-offs) is becoming a mandatory requirement for writing performant everyday run-of-the-mill apps. This will also push framework developers to either accept the lost performance or add a dependency and API surface to cover different variants of the same conceptual container. I don't like having to write multiple versions of the same API to cover accepting/returning say Array<Bool> and BitArray.

I wish there was some more R&D to evaluate the feasibility of providing higher level and more opaque currency types for use by higher level Swift frameworks: A modern-day native Swift equivalent to the legacy Foundation types (CFArray/NSArray,...) instead of completely abandoning that approach.

Back to the topic:

Given all this, now I am convinced that we should stick with the CS terms of art and even use the full backing data structure names (as @lorentey already did in the title of this topic) to correctly communicate what these are.

lorentey · November 15, 2022, 5:18am

Fundamental issues with the core design principles of the Swift Standard Library are best raised with (and best addressed by) the Core Team or the Language Workgroup, not this (project-specific) forum. If you believe these additions surface a major issue that requires a course correction in the stdlib, please raise it on the Standard Library forum or directly on one of the sub-forums for Swift Evolution.

(I'm sorry for snubbing you like this. I actually composed a detailed reply but I won't post it -- I can't take on the time commitment of opening this particular box, and I'm in fact a bit miffed that I let myself be nerd sniped like that. I do think it would be very important to explain e.g. how your entire premise is fundamentally wrong; or how the closest analogue to the NSArray class cluster in Swift is the RandomAccessCollection protocol, not Array; or why std::vector<bool> is bad design; or how that infamous article can be so misleading, etc. etc. Sadly this really isn't the appropriate time for me to do it, and (tragically) I wouldn't have capacity to track the inevitable discussion that follows, no matter how interesting it would be.)

Quick note: I don't know why the title reverted to Persistent -- I suspect it's either due to some technical issue, or it was some well-meaning but silent moderation. (I don't much care either way; the title is fine as is.)

In any case, the current title does not indicate anything about the eventual outcome of this review. (Which is not yet decided. Thanks y'all again for the valuable input; I need some time to digest it.)

John_McCall · November 18, 2022, 9:40pm

6 posts were merged into an existing topic: Adding 0-based integer signed offset and offset range subscripts to RandomAccessCollection

hooman · November 15, 2022, 2:29pm

Ah! I was baiting to get that reply and open that box. Since I also don't have time to fully participate, I was planning on just sitting back and enjoying the show. OK, I am really busy this morning. I will open a new appropriate topic to raise these concerns. I will help push you guys to better communicate and document these design choices and the chosen trade-offs.

I am fully aware that generics and protocols are supposed to fill the role that class clusters used to play, but they still leave a lot to be desired. Until very recently, generics (and existentials) have been suffering from serious usability issues that have been intimidating mere mortal programmers. I am grateful that they are being addressed now. Still, there is much more to be done, especially when module boundaries and resilience are involved.

Moderator note: the final part of this post has been removed to a new thread, linked below.

michelf · November 19, 2022, 1:50am

Are we talking about a Merkle tree?