[Proposal] A standard library type for working with shared data in a concurrent system

lorentey · October 19, 2021, 2:46am

I don't need to be convinced. As I said, I'm very, very interested in CRDTs. (And to guide my interest, of course I'm heavily relying on Will and other folks who are far more experienced in this domain than myself.)

If we introduce reusable conflict-free mergeable data structures, then ideally I'd like them to adopt as many of the conventions established by collection types in the stdlib as practical, while also making them as space/time efficient as possible. (For example, I have this silly idea lodged very firmly in my head that diffing/merging algorithms for these things ought to be ~linear in the size of the diff, not the size of the data structure.) Achieving all this while also allowing for robust serialization is an interesting challenge. (I have some promising ideas, but at this stage they are far too delicate to withstand the heat of this forum.)

Re: the impossibility of automated merging: I believe that fully reliable merging simply isn't possible without a deep understanding of not only the meaning of the data, but also the motivation behind the changes being merged. E.g., given this situation:

Original: Bob tagged a new release.
Change A: ~~Bob~~Alice tagged a new release.
Change B: Bob tagged a new release, then he went on vacation.

I expect a conflict-free mergeable string type would merge these two changes as "Alice tagged a new release, then he went on vacation" -- which is a reasonable outcome, but it's almost certainly not going to be the "correct" one. No algorithm (save some sort of strong AI) will ever be able to understand our documents well enough to merge parallel changes without supervision. (The correct outcome might be "Alice tagged a new release, then she continued working on the next one", or something completely different. It probably involves asking Alice about it.)

CRDTs err on the side of (apparent) simplicity and predictability, which I think is a good idea. It's a technology that obviously works well enough for a great many use cases. But there is still plenty of room for other approaches! For example, I am not holding my breath for CRDTs to replace version control systems in software engineering. (Although they are encroaching on some of their territory.)

I think so! We're extremely lucky to have Michael also working on their Swift implementation.

(I'm hoping to land his PR soon and then apply some stdlib engineering tricks to make it really scream.)

Indeed. What I meant is that I believe that persistent data structures make a far better building block for any "mergeable" data structure than the stdlib's current contiguous collections. (The existing collections are optimizing for read only access and in-place mutations of uniquely held storage. This makes perfect sense, but this particular use case is all about mutations of shared copies.)

CRDTs tend to accumulate oodles of data over their lifetime, and I expect that alone will cause plenty of trouble to deal with. All-or-nothing copy-on-write implementations would just add a needless exponent on top of these troubles.

We are very interested in CRDTs. I'm hopeful our explorations in this space will bear fruit, but it's too early to tell if it will taste sweet enough!

drewmccormack · October 19, 2021, 6:45am

This is partially why I favor the original 3-way merge approach. It doesn't rule out just asking Alice

We have been using 3-way merging extensively in the agenda.com note taking app. Sometimes we favor a simple merge, analogous to a CRDT, such as most recent wins. Other times, like in our upcoming collaborative text editing, we use the diff-match-patch that Google developed. It's not exactly asking Alice, but in theory, you really could do that.

Again, this feels like a tick for a 3-way merge approach. The fact that developers have had so much benefit from this approach in source control. Surprises me it hasn't made the jump to user land.

This is another reason I like the 3-way merge approach: it cleans up after itself. There is a cap on the maximum size: it is

 (number_of_auxiliary_branches * 2 + 1) * unbranched_size

That is, each branch can have a most recent value, and a common ancestor with the main branch. That's it. When you merge, you remove the old common ancestors, and only introduce new ones when branches actually diverge again. A fully merged branching resource is the same size as an unbranching resource.

Terje · October 19, 2021, 6:57am

The file format has a theoretical limitation in the hundreds of terabytes. A beast like that can’t be memory mapped, can it? I know SQLite has a virtual file system but that’s not a part I’m currently interested in. A few GB’s is already plenty (if blobs are stored externally). And the paging is in a separate class so should be easily swappable.

No, no pointers. Everything are page numbers and offsets within the pages and these are also what gets stored.

I’ll stop here because I’m afraid it will distract the thread from the original topic. Maybe I’ll ask specific questions in using swift once in a while or post updates of the progress if that’s permissible?

drewmccormack · October 19, 2021, 7:01am

Sorry for being so pushy here. Can you give an indication on where things are at with this? Is it a question of waiting for the persistent data types, and then seeing where those can be taken to make CRDTs? Or is there already someone working privately on a CRDTs package, including just very simple types like a register (not everything has to be an advanced b-tree-like structure).

I think the official Swift libraries can play an important role in:

Formalizing types, so that we don't end up with 10 libraries with 10 different CRDT register types
Giving some visibility to the project. Simply bringing this approach to developer attention, and giving them the confidence to move forward knowing it has some Apple backing.

The risk I see is that the collections project says "we might do it" or "we are investigating", and that nothing ever comes out of it, and in the meantime the oxygen is completely sucked out of the ecosystem because nobody will take this on if they think Apple is already working on it.

tera · October 19, 2021, 2:13pm

shall be ok provided your app is allowed to use that many bytes of VM. when you iterate through such a file from top to bottom sequentially or in a random manner the relevant I/O will happen, will be some trashing along the way of course. it shall not much be any different then the classical read(offset, len).

automerge people did it in a hybrid "rust backend + another language frontend" (swift included) to keep the core of their CRDT logic in one language. would be interesting to see pure swift CRDT.

lorentey · October 20, 2021, 8:41pm

Interestingly, you aren’t the first person to assume this. Nothing could be further from the truth — implementing a new container type is a fairly trivial exercise. All one needs to do is to read a paper and adapt the algorithms it presents into Swift; and the latter part gets easy once one has been repeatedly doing this exact thing in a production context for half a decade…

The real challenge is modeling syncing in a way that feels native to Swift, and works well enough to be useful for a variety of use cases.

I believe CRDTs have proven to be a robust enough syncing solution, and their relative ease of use makes them very interesting candidates for adoption in/near the standard library. However, they come with risks/challenges very much unlike a regular everyday container type.

As I have already mentioned, the way I think about this is that ideally, mergeable data structures ought to approximate regular collection types; this includes copy-on-write value semantics and conditional Sendable conformance. Implementing value semantics without overly compromising efficiency (e.g., without unnecessarily accumulating metadata) is a somewhat risky endeavor. (Syncing is inherently a referential activity; marrying it with value semantics comes with its own unique challenges.) I think we can make it work, but I’m not ready to write a manifesto about it -- yet?

It is a very safe bet that Apple is working on data modeling and syncing solutions; that has been the case for, what, 25, 30 years now? AFAICT, this has never really stopped motivated people from implementing their own alternatives, and I don’t see how it should be different this time.

Syncing is a blindingly obvious missing puzzle piece in the Swift ecosystem. I don't think there is a one size fits all solution that will work in all cases (as if there ever is...), but I think Swift can and should provide something useful that helps people do everyday (and perhaps some not so everyday) syncing tasks while remaining in the warm embrace of Swift's comfortable conventions.

We're looking at CRDTs, for they have proved to work reliably, for their high reusability, and for their (relative!) ease of use. I do have some concerns about the way they tend to sweep the merge problem under the carpet; and there are many practical challenges in designing/implementing a family of "standard" mergeable types. It would be really interesting to see other approaches that go in a completely different direction. (As well as other takes on CRDTs, of course.)

drewmccormack · October 21, 2021, 8:06am

Wasn't an assumption, it was a question based on your earlier posts, but I probably misunderstood.

You would think, but Apple has a pretty appalling record in this area, unfortunately. The recent CloudKit mirroring seems to be the best attempt so far. The others were so broken as to be unusable, and eventually abandoned. (Interestingly, it seems CRDTs are used internally to sync Apple Notes content, but nothing like that has ever been available to 3rd parties.)

On the contrary, it is probably the single most important reason for someone to not start a project of this nature.

OK, I'll steer clear then. Sounds like this is something that is going to be undertaken internally at Apple, so will look at other options, like 3-way merging approaches.

Thanks for being open about your plans.

dabrahams · October 21, 2021, 6:58pm

That strategy is not incompatible with the properties of CRDTs. People forget that what makes something a CRDT is the set of algebraic properties of its merge operation, and nothing else.

For example, I think it would be possible to design an auto-merge procedure for Git that makes it have CRDT properties in the case of concurrent edits to the same file. Whether or not anybody wants those results is of course another matter;-)

drewmccormack · October 21, 2021, 7:25pm

It is not incompatible, but also not guaranteed to meet the algebraic requirements of a CRDT. It depends on what your merge policy is. A merge policy may not meet the requirements of idempotency, for example, though you can certainly choose policies that do meet these requirements.

dabrahams · October 21, 2021, 9:26pm

Yes. And AFAICT idempotency can easily be layered on top of any merge algorithm you choose, so if one of the CRDT laws is going to be a challenge, it's not that one.

I have found only one potentially desirable merge property that conflicts with the CRDT laws: sequential consistency. You can read about how that arises here (See the section entitled “SEC is incomparable to sequential consistency”). I don't happen to think that would be a problem in practice for most collaborative editing apps: in the scenario outlined each user sees their own changes being applied sequentially, it's just the changes of other users that may appear re-ordered and I don't think most users would be trying to analyze the order in which other users apply their changes. But it surely could be a problem for some possible applications of CRDTs. I still wonder if there are other such properties.