[API Review for 1.1] Hash-Array Mapped Prefix Trees

lorentey · October 31, 2022, 11:42pm

[API Review for 1.1] Hash-Array Mapped Prefix Trees

(née Persistent Collections)

Authors: Michael Steindorfer, Karoy Lorentey
Implementation: swift-collections/Sources/ShareableHashedCollections at release/1.1 · apple/swift-collections · GitHub

A while ago, Michael Steindorfer contributed a hash-array mapped prefix tree implementation to the Swift Collections package. This code has now bloomed into a new module that is very close to being production ready! One thing that is still missing is a critical review of the new API additions -- hence, this thread.

Background

The Swift Collections package is not currently governed by the Swift Evolution Process, but to leverage the immense collective knowledge & experience of the people using this package, we still follow an informal API review process that is similar to what is expected of Standard Library additions. Accordingly, we would like to gather feedback on the additions described in this document.

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, to determine the direction of the package. When writing your review, here are some questions you might want to cover:

Do you foresee ever using these new APIs in your own code? (Can you quickly describe your use case? E.g., how large is your typical data set? What operations are most interesting to you? Do you have particular performance requirements, such as about latency, throughput, or memory use?)
If you have used other libraries with a similar feature (in Swift or any other language), how do you feel that this proposal compares to those?
Do the proposed new APIs follow the Swift API naming guidelines? Are they a good fit with the Standard Library and the other packages in the Apple Swift Package Collection? Do any of the new APIs strike you as oddly named?

The current version 2 of this API proposal is available at the following link:

github.com

apple/swift-collections/blob/release/1.1/Documentation/Reviews/2022-10-31.ShareableHashedCollections/ShareableHashedCollections.md

# Shareable Hashed Collections Module

- Authors: [Michael Steindorfer](https://github.com/msteindorfer), [Karoy Lorentey](https://forums.swift.org/u/lorentey)
- Implementation: https://github.com/apple/swift-collections/tree/release/1.1/Sources/ShareableHashedCollections

## Table of contents

* [Introduction](#introduction)
* [Motivation](#motivation)
* [Proposed solution](#proposed-solution)
    - [`ShareableSet`](#shareableset)
    - [`ShareableDictionary`](#shareabledictionary)
* [Detailed design](#detailed-design)

## Introduction

`ShareableHashedCollections` is a new module in Swift Collections, containing new hashed collection types that store their elements in a prefix tree structure, based on their hash values. The new types are tree-based equivalents to the standard `Set` and `Dictionary`; they're called `ShareableSet` and `ShareableDictionary`.

Like the standard hashed collections, `ShareableSet` and `ShareableDictionary` are unordered collections of unique items. However, the new collections are optimizing for making mutations of shared copies as efficient as practical, both by making such changes faster, but, just as importantly, by letting mutated copies of the same collection value share as much of their structure as possible, saving a considerable amount of memory.

This file has been truncated. show original

The text of the first version of the proposal (with PersistentCollections) is still available at commit 1098c526 in the Git repository.

lorentey · November 1, 2022, 12:16am

To take the proposed APIs on a test drive, add a branch-based specification of the swift-collection package as a dependency to your test project:

let package = Package(
  ...
  dependencies: [
    .package(url: "https://github.com/apple/swift-collections", branch: "release/1.1"),
    ...
  ]
  ...
)

The release/1.1 branch includes all potential additions that are currently scheduled for the 1.1.0 release, including ones whose API hasn't yet been discussed on this forum.

(Important note: never use such a branch specification with the swift-collections package in production code. Only tagged releases promise source compatibility, and random commits that appear on branches may even contain known bugs that render them unusable.)

NSExceptional · November 1, 2022, 12:58am

Would it be possible for you to remove the manual line breaks from this post? Makes it hard to read on mobile

Edit: thank you!!

lorentey · November 1, 2022, 1:41am

Oh, aggravating! It's fixed now.

hisekaldma · November 1, 2022, 8:24am

This is absolutely fantastic! It looks like it could be the thing that finally pushes us to start using Swift Collections. Our use case is implementing a CRDT with change history. Today we’re using Dictionary for storage, but that has exactly the problems described in the post.

I haven’t taken the additions on a test drive yet, but overall the API looks great. Following Set and Dictionary as closely as possible makes perfect sense – should make it easy to swap out the non-persistent version for the persistent version.

Some random thoughts:

What about removeAll(where: (Element) throws -> Bool) rethrows (i.e. in-place filter)?

Having these on the persistent collection but not on the regular one sounds like it could be confusing when working with both types. "Why can I do this there, but not here?" It would also make it harder to switch an implementation from the persistent collection back to the regular one, e.g. for testing the performance implications or if you realize that you’re not actually making use of the sharing.

I’m assuming isEmpty is O(1)?
Is count O(1), O(n) or O(log n)?

hisekaldma · November 1, 2022, 11:55am

lorentey:

  func intersection(PersistentSet<Key>) -> Self
  func intersection<Value2>(PersistentDictionary<Key, Value2>.Keys) -> Self

  func subtracting(PersistentSet<Key>) -> Self
  func subtracting<Value2>(PersistentDictionary<Key, Value2>.Keys) -> Self

Also, the generic parameter name Value2 confused me initially here. It’s like it implies that there’s something about the values and order that’s important (cf. func zip<Sequence1, Sequence2>), when actually the values don’t matter here at all. Naming the generic parameter something like OtherValue might make this easier to grok.

(Great methods though!)

Nevin · November 1, 2022, 12:51pm

These look quite useful for certain applications.

However, in the grand bikeshed tradition… are we sure that “persistent” is the best name for them?

To me, “persistent” implies, well, persisting. My first impression on seeing the name is “Do these collections get stored to disk automatically?”

Reading the documentation I see that is not the case, which makes for an immediately jarring experience where the name does not match the expectations it evokes.

Furthermore, the original PR links to the Wikipedia page for persistent data structures, which states:

In computing, a persistent data structure or not ephemeral data structure is a data structure that always preserves the previous version of itself when it is modified.

And:

A data structure is partially persistent if all versions can be accessed but only the newest version can be modified. The data structure is fully persistent if every version can be both accessed and modified.

The two new proposed data structures do not preserve previous versions of themselves. Given an instance of one, there is no way for the programmer to access previous versions of it. The update history is not part of the type.

Instead, these types are designed in such a way that a persistent data structure could be efficiently built on top of them. Some other type could keep track of the update history. That other type would meet the definition of “persistent data structure”, but the two proposed types do not.

Speaking for myself, “persistent” in my mind is strongly linked to storing data to disk, or otherwise being able to restore state across runs of a program, and I imagine I am not alone in this. The technical definition of “persistent data structure” is one I have not previously encountered, and the proposed collections do not even satisfy it regardless.

Thus, I feel that “persistent” is not the best naming choice for these data structures.

hisekaldma · November 1, 2022, 1:17pm

FWIW, being tangentially familiar with this type of data structure (e.g. from reading about Clojure and watching a couple of Rich Hickey talks), I immediately understood what "persistent" meant here. The other term of art that I’ve seen for this type of data structure is "immutable", and that’s even worse in the context of Swift.

(That’s not to say that another name that’s not a term of art wouldn’t be better!)

tera · November 1, 2022, 1:21pm

Ditto.

rvsrvs · November 1, 2022, 2:50pm

I need to look at the pitch in more detail, and maybe I'm not understanding your point, but I believe that persistent here means that if you were holding a previous version and made a new version that your previous copy remains unchanged just as if you were holding an array and someone else modified the array, i.e. they got CoW behavior. In the case of these new types they get the CoW behavior without having to do a full copy of the entire collection. Persistent here, IIUC, does not mean that if you have the current version you can access all previous versions.

tera · November 1, 2022, 3:31pm

If that's the case "persistent" doesn't convey the meaning.
Perhaps "provident" / "prudent"?

rvsrvs · November 1, 2022, 3:40pm

As an old OOP guy, I agree, but this is the actual term of art in the FP community.

I was recently thinking of trying my hand at doing a swift implementation of some of the data structures from Okasaki's Purely Functional Data Structures and found this definition on page 2:

A data structure that supports multiple versions Is called persistent while a data structure that allows only a single version at a time is called ephemeral

All data structures in FP langs like Haskell are persistent in that sense. I also immediately think, oh this thing lives on disk, so maybe in the Swift eco-system we need to use another term. Or rename a lot of our current Collection types to be EphemeralXXX. :)

Karl · November 1, 2022, 3:41pm

lorentey:

Non-mutating Dictionary Operations

To acknowledge the efficiency of PersistentDictionary at inserting and removing items from copies of a dictionary value, we provide a family of non-mutating primitives that perform these operations. Instead of directly modifying the dictionary they're given, they return a brand new dictionary with the desired result. Note that when called on a unique copy of a dictionary, these are still considerably slower than the corresponding direct mutation operations, so these should not be generally considered an appropriate replacement for those.
func updatingValue(Value, forKey: Key) -> PersistentDictionary<Key, Value>
func removingValue(forKey: Key) -> PersistentDictionary<Key, Value>

Do these have any advantages over the mutating operations? Or are there only potential disadvantages?

Also, this sort of API tends to encourage chaining, which IIUC would give the very worst performance:

someDict
  .updatingValue("foo", forKey: "bar")
  .updatingValue("baz", forKey: "qux")
  .removingValue(forKey: "qax")

hisekaldma · November 1, 2022, 5:53pm

Another naming question:

It feels like there’s some potential for confusion with elementsEqual() here. Since these are in the same family as isSubset(of:), isSuperset(of:), etc would it make sense to call these isSameSet(as:) or isIdenticalSet()?

(And whatever they end up being called, having them on Set would be nice too!)

Paul_Cantrell · November 1, 2022, 6:57pm

Agreed that “persistent” is a bad name for this marvelous data structure, for all the reasons Nevin stated.

If “persistent” is in fact a term of art in this little corner of the data structures world, that’s an audience at least two orders of magnitude smaller than the audience who will hear it (as I did!) as “persistent storage.”

The term in general use for this family of data structures seems to be variations on “hash array mapped trie” or “hash-array mapped prefix tree:”

Hash array mapped trie - Wikipedia (which does mention “persistent” downpage, FWIW, but as a classification of the data structure, not its specific name)
GitHub - norswap/triemap: Compressed Hash-Array Mapped Prefix-tree (CHAMP) Implementation

Better names might be:

MappedTrieSet
TrieMapSet
HAMTSet
PrefixTrieSet
PrefixTreeSet

lorentey · November 1, 2022, 7:04pm

That sounds very similar to my own reason for working on this.

Oh, sure -- good catch! I don't think I have a good implementation of an in-place reduction yet (hence subtract, formIntersection forwarding to the copying variants), but we should at least have the API in place (on both collection types).

hisekaldma:

lorentey:
func inserting(Element) -> PersistentSet<Element>
func updating(with: Element) -> PersistentSet<Element>
func removing(Element) -> PersistentSet<Element>
lorentey:
func updatingValue(Value, forKey: Key) -> PersistentDictionary<Key, Value>
func removingValue(forKey: Key) -> PersistentDictionary<Key, Value>
Having these on the persistent collection but not on the regular one sounds like it could be confusing when working with both types. "Why can I do this there, but not here?" It would also make it harder to switch an implementation from the persistent collection back to the regular one, e.g. for testing the performance implications or if you realize that you’re not actually making use of the sharing.

Karl:

Do these have any advantages over the mutating operations? Or are there only potential disadvantages?

Also, this sort of API tends to encourage chaining, which IIUC would give the very worst performance:
someDict
  .updatingValue("foo", forKey: "bar")
  .updatingValue("baz", forKey: "qux")
  .removingValue(forKey: "qax")

The idea with these is to reduce the pain of having to explicitly make a var copy of the set/dictionary value when adding/removing a single element is precisely the operation we want. Chaining these is a bit silly, but (for large enough persistent collections) it is still going to be way faster than making just one full copy of, say, an equivalent flat collection's storage. (And if the collection is small, it generally doesn't matter if updating it takes a few dozen nanoseconds longer.)

These are purely convenience additions and they don't have any performance benefit over spelling out the copy+mutation by hand. For example, the two snippets below have precisely the same behavior:

// 1
return a.inserting(42)

// 2
var b = a
b.insert(42)
return b

The only difference is that readers may find the former more readable. Having to explicitly spell out such temporary copies can feel quite a bit annoying, especially for people who have some functional programming background.

The two new collection types have roughly similar API as the existing hashed collections in the stdlib, but I don't think an exact match would be practical*, and I think it makes sense to have the new types come with a small set of extensions that emphasize their unique strengths. I don't have a hard rule that code written specifically for PersistentSet or PersistentDictionary should directly build after simply changing the types to Set and Dictionary, although I do expect most code would work in the opposite direction.)

(* One small but crucial difference is that in the current implementation of the new types, all mutation operations invalidate every previously returned index. Set and Dictionary have C++-style rules about index invalidation that sometimes allow reusing indices across mutations, including insertions.)

That said, these are non-essential APIs so I wouldn't mind withdrawing them if the consensus opinion is negative. (Clients who prefer to use them will always be able to define them locally, with no loss of performance.)

tera · November 1, 2022, 7:21pm

I don't understand this. I assume that "==" also ignores element order (and there is no way to store duplicate elements in this structure). Please reword that paragraph, otherwise the difference is not clear.

Karl · November 1, 2022, 8:11pm

I remember we discussed this a looooong time ago for the standard library, and IIRC the consensus was that it doesn't scale - the logic applies to every mutating operation ever. As a result, we have Bool.toggle (mutating) but no corresponding Bool.toggled. We have String.append (mutating), but no String.appending (well... not in the standard library, anyway; it's a Foundation-ism).

But I think it is broadly accepted as a bit of an ergonomics issue, and I think a general solution would be welcome for a whole host of APIs. We'd like developers to take advantage of mutable value semantics, but it isn't nice that they have to introduce named, temporary copies which are scoped to the entire function.

One idea that has been floated (and which I quite like) is to introduce a with/using/modifying function. It could even make use of features like take parameters so ownership could be transferred, allowing for in-place mutation if the source variable is not used later in the caller.

func modifying<T>(_ value: take T, body: (inout T) -> Void) -> T {
  body(&value)
  return value
}

let newDict = modifying(someDict) {
  $0.update(...)
  $0.update(...)
  $0.remove(...)
}

hisekaldma · November 1, 2022, 9:59pm

I think these are all worse than the current names. These suggestions describe the specifics of how the types are implemented, and that’s really not what’s important here; what’s important is that large parts of the underlying structure are persisted/shared when making changes. And while “persistent” also has other connotations that are confusing here, at least it captures that.

tera · November 1, 2022, 10:11pm

Yes and no. As a related example for me it is much more important to know that "Dictionary" requires "hash" operation rather than "less than" operation then to know that if assign a directory to another variable and modify it - it works faster than usually (reusing large parts of the directory) at the expense of working overall slower in other regards – this is an implementation detail, could be a "hint" parameter passed to a "unified" dictionary implementation with appropriate ".auto" option as a default.

I'd maintain that "persistent" name would be more confusing (to a greater number of developers) than it would be helpful (to a smaller number of developers familiar with the term in this context).