[Proposal] A standard library type for working with shared data in a concurrent system

The addition of async/await and actors has greatly enhanced Swift for handling concurrency, and the Algorithms and Collections packages add numerous useful algorithms and data structures to the ecosystem. In this proposal, I want to push the idea that we can start to build on these to add data structures explicitly designed for concurrent systems: shared data types.

Currently, the philosophy of working with data and concurrency in Swift is to copy the data to guarantee exclusivity, pass it off to another thread/queue/task, and fetch the results at the end. In essence, it is a message passing approach. It is encouraged further by the widespread use of value types, which are automatically copied on a function call.

This approach to working with data in a concurrent system is established, and works well, but it isn't the only approach. There is room for safe shared data structures, which can be read and written safely from any thread, and are not susceptible to races, locking delays or clobbering of updates.

To give you a feel for what I am on about, I want to propose such a type here: a branching resource.

A BranchingResource would be a type with a generic parameter for the payload it carries (i.e., the resource). The resource would begin with a single branch, called "main" or "trunk" or "truth". The app could add as many auxiliary, named branches as it likes.

In pseudo Swift, the main API would look something like this:

struct Branch: Hashable {
    let name: String
    static let trunk = Branch(name: "trunk")
}

struct BranchingResource<Payload> {
    private(set) var auxiliaryBranches: Set<Branch>
    init(payload: Payload, auxiliaryBranches: Set<Branch>)
    func payload(in branch: Branch) throws -> Payload
    mutating func update(_ branch: Branch, with payload: Payload) throws
    mutating func mergeIntoTrunk(auxiliaryBranch: Branch, resolvingConflictsWith resolver: Resolver) throws
    mutating func mergeTrunk(intoAuxiliaryBranch auxiliaryBranch: Branch, resolvingConflictsWith resolver: Resolver) throws
}

Think of this as a small piece of Git functionality, but for in-app data. (Note that, unlike Git, it does not keep a complete history of changes.)

The Payload could be simply Data, or Codable types, or even files on disk. The BranchingResource should include enough hooks for these possibilities. (It would even be possible to add asynchronous API that carries out tasks like querying a network resource during merging.)

Without going into the implementation in detail, such a type tracks the branches and, most importantly, ensures that each branch keeps a so-called "common ancestor" with the trunk. This decoupling of the trunk from other branches is what makes this data structure lockless and safely sharable. What happens on one branch is independent of what is happening on another. Only when a particular branch wants to get the latest changes from other branches is a merge carried out, and the timing is completely at the discretion of the branch owner.
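To make that a little more concrete, here is a speculative sketch of the bookkeeping it implies. The names and layout are hypothetical, not part of the proposal, and a real implementation would also avoid storing duplicate payloads when branches are identical:

    // Speculative sketch of the internal bookkeeping; names and layout are hypothetical.
    struct BranchState<Payload> {
        var payload: Payload          // the branch's current value
        var commonAncestor: Payload   // snapshot shared with the trunk at the last branch/merge point
    }

    struct BranchingResourceStorage<Payload> {
        var trunkPayload: Payload
        var auxiliary: [Branch: BranchState<Payload>] = [:]

        // Updating a branch only touches that branch's state; the trunk,
        // the other branches, and their common ancestors are left untouched.
        mutating func update(_ branch: Branch, with payload: Payload) {
            auxiliary[branch]?.payload = payload
        }
    }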

The merging is a 3-way merge, like in Git. Given the two new branch payloads, and the common ancestor, the Resolver can choose how to merge the payloads. The Resolver would be a protocol so that users could design custom merge logic, but some standard resolvers could be included, like last-change-wins.
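The Resolver protocol is not spelled out above. As a hedged sketch of its likely shape (the names here are illustrative, not part of the proposal):

    // Hypothetical shape of the Resolver protocol; not part of the pseudo-API above.
    protocol Resolver {
        associatedtype Payload
        /// Given both sides and their common ancestor, produce the merged payload.
        func resolve(trunk: Payload, branch: Payload, commonAncestor: Payload) throws -> Payload
    }

    // A trivial built-in policy: the auxiliary branch always wins.
    struct BranchWinsResolver<Payload>: Resolver {
        func resolve(trunk: Payload, branch: Payload, commonAncestor: Payload) throws -> Payload {
            branch
        }
    }

A last-change-wins resolver would have the same shape, provided the payload carries enough metadata (e.g. a modification date) to decide which side is newer.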

The advantage of an abstraction like a branching resource is that it decouples components of an app. Is there a component downloading something, but at the same time your UI is busy in some modal state which makes it difficult to handle the downloaded data? No problem, make a BranchingResource for the downloaded data. The "network" branch keeps the downloaded data until you are ready to merge it into the "ui" branch. If the app user also made changes while the downloading was taking place, they won't be clobbered by the network download. Each merge has a common ancestor, and you can always determine with a diff what has happened in each branch, and merge at the desired granularity.
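Under the pseudo-API above, that scenario might look roughly like this; initialModel, downloadedModel and resolver are illustrative placeholders:

    // Illustrative use of the pseudo-API above; the model values and resolver are placeholders.
    let network = Branch(name: "network")
    let ui = Branch(name: "ui")
    var resource = BranchingResource(payload: initialModel, auxiliaryBranches: [network, ui])

    // The downloader writes into its own branch, without touching the UI's data.
    try resource.update(network, with: downloadedModel)

    // Later, at a convenient moment, the download is merged into the trunk,
    // and the UI branch pulls the trunk in when it is ready.
    try resource.mergeIntoTrunk(auxiliaryBranch: network, resolvingConflictsWith: resolver)
    try resource.mergeTrunk(intoAuxiliaryBranch: ui, resolvingConflictsWith: resolver)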

I want to preempt a question I'm sure many will have: many branches does not necessarily mean many payloads. In the worst case, you can have a copy of the payload for each branch, in addition to the common ancestors. Although that situation can arise, it is usually short lived, because once the trunk is merged into a branch, and the branch merged back into the trunk, the two are exactly the same. The branch payload, the trunk payload, and the common ancestor are all the same, so only one payload is stored. In a situation where all branches are fully merged into the trunk, there is only a single payload, exactly the same size as a non-branching resource.

I realize this is quite a break from the types of data structures already implemented in Swift, and from the concurrency aspects too. A BranchingResource brings the two together. Just as actors help to make concurrency simpler, a BranchingResource type would help make working with shared data safer and simpler.

Ideas?

9 Likes

This reminds me of Core Data’s approach to concurrency, among others. That makes sense, as it is also an object graph.

I think it’s an excellent idea, though I’m a little unclear on how the merging would be implemented. It would need to block both the ancestor branch and each child branch being merged.

Why only allow three-way merges, by the way? You obviously need to allow two-way merges (it'll usually be a fast-forward, but not if something else was merged in the interim). But you could also allow more than that.

I believe Swift Collections has a few draft implementations of persistent data structures. I’m not very familiar with the topic myself, but I believe you could use confluently persistent data structures to implement a lot of the desired behavior.

I think a history of changes is definitely important: without that, it’d be impossible to tell which changes happened when, and consequently difficult to handle merging. For the sake of minimizing memory usage, we could drop past versions of payloads (like a shallow clone in Git) if a branch has no parent.

By the way, I feel it may be worth considering whether branches should be reference types rather than value types.

Thanks for the feedback on this.

Indeed, Core Data does have a similar approach, using different NSManagedObjectContexts. You can think of a branch as a context. Core Data really only has two-way merging, though. There is no tracking of common ancestors, just the latest version of each context. And it goes without saying that Core Data's implementation is not generalizable outside the framework. The idea of this proposal is to make something generally applicable.

That is true. Updating any of the branches would require it to be locked. For small amounts of data, this should be fast, and importantly, locking one branch needn't lock others. (E.g., locking the network branch should not affect updates to the UI branch.)

It perhaps won't surprise you that I have been using a very similar approach for managing files on disk. In that case, there is a simple synchronous call, and an asynchronous merge which uses optimistic locking. In essence, the merge is launched, and snapshots are taken of the branches involved. Upon completion, the branches are checked to see whether anything changed in the meantime. If anything did change, an error occurs, and the merge has to be retried.
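As a rough sketch of that optimistic-locking loop (this is not the actual code; all names are illustrative):

    // Hypothetical sketch of an optimistic-locking merge; all names are illustrative.
    enum MergeError: Error { case branchChangedDuringMerge }

    func optimisticMerge<Payload: Equatable>(
        from source: Branch, into target: Branch,
        attempts: Int = 3,
        read: (Branch) throws -> Payload,            // snapshot a branch
        write: (Branch, Payload) throws -> Void,     // commit a branch's payload
        merge: (Payload, Payload) throws -> Payload  // produce the merged value
    ) throws {
        for _ in 0..<attempts {
            // 1. Take snapshots of the branches involved, without holding a lock.
            let sourceSnapshot = try read(source)
            let targetSnapshot = try read(target)

            // 2. Do the (possibly slow) merge work against the snapshots.
            let merged = try merge(sourceSnapshot, targetSnapshot)

            // 3. Before committing, check that nothing changed in the meantime.
            if try read(source) == sourceSnapshot, try read(target) == targetSnapshot {
                try write(target, merged)
                return
            }
            // Something changed: retry with fresh snapshots.
        }
        throw MergeError.branchChangedDuringMerge
    }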

I think this is just terminology. I call a fast-forward a 3-way merge, even though only two distinct values exist, i.e., the common ancestor is equivalent to one of the recent branch values.

Thanks! I'm going to investigate that.

Well, not exactly. I have built a DAG-history-based system, similar to Git (https://swiftpackageregistry.com/mentalfaculty/LLVS). But having done that, I realize it is possible to build a much more compact approach, which is self-pruning. The trick is to make sure that when a branch is updated, the appropriate common ancestor is kept. There is some accounting to do for this, but it is definitely doable, because I've implemented it in a production app.

In short, you don't need to keep the whole history. In fact, as stated in the original text, in the best case scenario, the stored data is the same size as the equivalent non-branching resource.

That may well be. Haven't thought that far ahead, to be honest.

Are you aware of CRDTs, which seem to overlap significantly with what you're trying to do? As far as I can tell, the challenge in this area isn't to have a standardized API for merging changes, but to develop the merging algorithms and the data structures that support them. But maybe I don't fully understand what you're proposing…

1 Like

Yes, I have quite a lot of experience in the distributed data field. I am the developer of the Core Data sync framework Ensembles (ensembles.io), and wrote a series about CRDTs in Swift last year (Replicating Types in Swift | A p p D e c e n t r a l).

You are right, there is overlap. The challenge, whether across devices separated in time and space, or just across threads isolated on one device, is to be able to handle and merge concurrent changes. It is effectively the same challenge you have in a system like Git.

The branching resource idea is one I developed recently, and have been using in a collaborative editing app. It worked so well, I wanted to propose it as a general approach to handling concurrent changes to data, though perhaps it doesn't belong in the standard library.

CRDTs are one way of merging concurrent changes. They put in place a system that always merges in some specific way; the merge rules are, in effect, hard coded.

A branching resource says nothing about how you merge the data. It just gives you enough information to implement most possibilities. With the 3-way merge, you can determine the changes on the conflicting branches and decide how you want to merge them. You could even decide to just use CRDTs with the branching resource, though this would be a bit wasteful, because you would effectively never be making use of the common ancestor, and the price may be a CRDT that grows in size indefinitely.
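To make the contrast concrete, here is a minimal, self-contained sketch of a last-writer-wins register, a simple CRDT whose merge rule is baked into the type (illustrative code, not from any existing library):

    import Foundation

    // Minimal last-writer-wins register; the merge rule is fixed by the type itself.
    struct LWWRegister<Value> {
        var value: Value
        var timestamp = Date()

        mutating func set(_ newValue: Value) {
            value = newValue
            timestamp = Date()
        }

        // Merging always resolves the same way: the most recent write wins.
        func merged(with other: LWWRegister) -> LWWRegister {
            timestamp >= other.timestamp ? self : other
        }
    }

With a branching resource, by contrast, the same decision is deferred to whatever Resolver the merge is invoked with.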

If there is no interest in a branching resource, what about CRDTs in Swift? I haven't seen that they have been proposed anywhere either.

1 Like

can you show a minimal working or semi-working example of using BranchingResource to share some data structure between threads?

I haven't released any open source code for this. I also haven't yet developed a BranchingResource. What I did do is develop a BranchingFile, which is a file shared between threads. The code is not currently open source, so I can't link to it.

I'm happy to answer general questions. I think the API above gives a feel for how it works. Internally there is serialization, perhaps an actor or queue. Updating a particular branch should be done serially by the calling app, in general, on a single thread. This should be a fast, synchronous process.

Merging can involve more than one branch, which has the potential to cause a block. I have mitigated this in my branching file type by including optimistic locking, in which a block is avoided, but committing the merge result can fail.

If you have ever worked with Core Data, you have encountered these concepts. In Core Data, you share a resource, the SQLite store. You can have multiple contexts on different branches, each updated independently in memory, with no locking. When you want to commit/save, it has to lock the SQLite store and update the DB. At this point, it is possible that it fails, because some other thread has updated since it last loaded data. This would require the data be reloaded, and the save reattempted. That is optimistic locking in Core Data.
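For reference, a minimal sketch of that save/reload/retry pattern in Core Data (simplified; it assumes the context uses the default error-on-conflict merge policy):

    import CoreData

    // Simplified sketch of optimistic locking with Core Data: try to save, and on a
    // merge conflict refresh from the store and try again. Assumes the default
    // NSErrorMergePolicy, which reports conflicts as errors instead of resolving them.
    func saveOptimistically(_ context: NSManagedObjectContext, attempts: Int = 3) throws {
        var lastError: Error?
        for _ in 0..<attempts {
            do {
                try context.save()
                return
            } catch let error as NSError where error.code == NSManagedObjectMergeError {
                lastError = error
                // Another context saved first: pull in the latest store values
                // (keeping unsaved changes) and reattempt the save.
                context.refreshAllObjects()
            }
        }
        if let lastError { throw lastError }
    }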

In short, my proposal is an abstraction of what Core Data has built in, which could be applied to many other types. E.g., you could apply it to a JSON file created using Swift Codable. But you could equally just use it in your data-driven SwiftUI app to manage data movement between components of the app (e.g. network downloader, UI, sharing extension), in a more controlled (less ad hoc) way.

Although I can't point directly to an open source implementation, I can post a few examples of using my BranchingFile type here. Maybe that will show how it is used in an app better.

First, setting up a BranchingFile with two branches:

        let data = try JSONEncoder().encode(sharedNote)
        let branchingFile = try BranchingFile(rootDirectory: fileRoot, initialContent: data)
        try branchingFile.createBranch(.app)
        try branchingFile.createBranch(.cloud)

Then updating a branch

        try file.update(branch, withContents: try JSONEncoder().encode(sharedNote))

And merging

        branchingFile.mergeMain(into: .cloud, resolvingConflictsWith: resolver, confirmingWith: committer) { result in
               ...
        }

The resolver decides how to handle conflicts (if any); it is a protocol that the user of the class can implement. It is like a Core Data merge policy, but with 3-way merging rather than 2-way, i.e., it has the common-ancestor snapshot of what the data was before the branches diverged.

The committer is there to make the merge atomic in cases where you want to make sure something takes place before committing. E.g., imagine a branch representing the current state of a cloud store. The committer could upload the merge result, and only after that does the merge actually fully succeed. This makes the merge effectively atomic with the upload. You won't be stuck with a merge that succeeds but with the file missing from the cloud.
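As a hedged sketch, the committer hook might look something like this (the names are illustrative, not the actual API):

    import Foundation

    // Hypothetical shape of the committer hook; names are illustrative.
    protocol MergeCommitter {
        /// Called with the merged contents before the merge is finalized.
        /// Throwing here aborts the merge, leaving all branches unchanged.
        func commit(_ mergedContents: Data) throws
    }

    // Example: the merge only succeeds once the merged file has been uploaded.
    struct CloudUploadCommitter: MergeCommitter {
        let upload: (Data) throws -> Void
        func commit(_ mergedContents: Data) throws {
            try upload(mergedContents)
        }
    }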

I'll throw a few more chunks of code from my test suite here, to give a better idea of how it works. (In these tests, sharedFile is set up beforehand with initial contents "blah" on the main branch.)

    func testTwoBranches() {
        let hi = "hi".data(using: .utf8)!
        let there = "there".data(using: .utf8)!
        let you = "you".data(using: .utf8)!
        try! sharedFile.createBranch(.branch1)
        try! sharedFile.update(.branch1, withContents: hi)
        try! sharedFile.createBranch(.branch2)
        try! sharedFile.update(.branch2, withContents: there)
        XCTAssertEqual(String(data: try! sharedFile.mostRecentFileContents(in: .main), encoding: .utf8), "blah")
        XCTAssertEqual(String(data: try! sharedFile.mostRecentFileContents(in: .branch1), encoding: .utf8), "hi")
        XCTAssertEqual(String(data: try! sharedFile.mostRecentFileContents(in: .branch2), encoding: .utf8), "there")
        try! sharedFile.update(.main, withContents: you)
        XCTAssertEqual(String(data: try! sharedFile.mostRecentFileContents(in: .main), encoding: .utf8), "you")
        XCTAssertEqual(String(data: try! sharedFile.mostRecentFileContents(in: .branch1), encoding: .utf8), "hi")
        XCTAssertEqual(String(data: try! sharedFile.mostRecentFileContents(in: .branch2), encoding: .utf8), "there")
    }

    func testFastForwardingMainFromChangesOnBranch() {
        try! sharedFile.createBranch(.branch)
        let hi = "hi".data(using: .utf8)!
        try! sharedFile.update(.branch, withContents: hi)

        XCTAssertEqual(String(data: try! sharedFile.mostRecentFileContents(in: .main), encoding: .utf8), "blah")
        XCTAssertEqual(String(data: try! sharedFile.mostRecentFileContents(in: .branch), encoding: .utf8), "hi")

        let mergeResult = try! sharedFile.merge(.branch)
        XCTAssertEqual(mergeResult.action, .mainFastForwarded)
        XCTAssertEqual(String(data: try! sharedFile.mostRecentFileContents(in: .main), encoding: .utf8), "hi")
        XCTAssertEqual(String(data: try! sharedFile.mostRecentFileContents(in: .branch), encoding: .utf8), "hi")
    }

I realize that some things you'd call branching resources don't have the properties of CRDTs, but it's easy to imagine CRDTs that make use of common-ancestor information to do their merges. So I don't think the ideas are incompatible in any way. I guess the real question is: who really wants those non-CRDT properties, anyway?

IMO that would be a wonderful addition to the Swift ecosystem. /cc @wthimbleby

This sounds like an interesting approach. (Reliably syncing data is, needless to say, a pretty important thing these days, and the Swift ecosystem obviously needs tools that implement it. Some of these tools ought to be part of the core Swift distribution, perhaps even the Standard Library.)

My (very crude) general impression is that CRDTs seem to be primarily concentrating on the consistency of merged results over their correctness, by automating merges in a way that allows (and requires) only limited customization. Their reusability makes them very, very interesting to me as a stdlib engineer, so I'm focused in this direction — but there is definitely room for alternatives.

I think persistent data structures are absolutely essential to any scheme where shared copies of data are mutated on several parallel branches. (I consider it a priority that CHAMPs and B-trees make their way into Swift Collections in the foreseeable future; “luckily” we already have implementations of these in progress.)

5 Likes

Thanks for giving that insight.

I had seen the work you were doing on persistent data structures. Very interesting.

I was initially confused by the term "persistent", thinking it related to storage on disk, but if I understand it correctly now, it is about making the container effectively "append only", so that existing data is immutable. (I realize now that the LLVS store I developed as an experiment has these same properties, as well as being persistent on disk and mergeable across devices. It is "doubly persistent" if you like :smiley:)

Seems these persistent data structures are motivated by FP, and probably used on servers and the like, giving lockless shared data.

I come from a different direction. I develop apps for Mac and iOS, and these apps have to present an eventually consistent picture of the user’s data. To me it feels like there is not a lot around to help developers do this. CRDTs are certainly an interesting approach.

Is there interest in the collections project to develop CRDTs, both for cross-thread usage on a single device and, even more importantly, as a means to reach data consistency across devices, and maybe even with clients in the cloud? Or would this fit better in a different project, and if so, which? I would quite like to be involved in any effort to solve the "sync" problem.

1 Like

I would be disappointed if this topic didn't progress much further, because it seems to be an intriguing solution to a common enough kind of problem. And good solutions to common enough kinds of problems are what (at least, I imagine) a trusted, high-profile or standard library should provide, to have your back when you hit that problem.

In your experience, what specific kinds of use cases does/would this type enable that more people can relate to, and that might motivate a proposal? For example, I can imagine it being useful for an app and its extensions (widget, watch, notification UI, etc.) reading and writing to an essential, shared JSON file.

If by "correctness" you mean avoiding unexpected behavior during conflict resolution and capturing human intention, I understand what you're saying… but I don't think that's a very useful point of view. The consistency is the part that's easily captured mathematically, so that's the part for which we have laws. The other stuff is the creative/engineering challenge of designing a good CRDT merge operation for a particular application. The advantage of following the CRDT laws, as opposed to arbitrary, open-ended merge operations without constraints (which is what seems to be proposed here), is that we understand the emergent properties we're going to get by following the laws (e.g. no need for server mediation, no changes rejected because of conflicts, etc…), and we actually have some useful constraints in selecting a merge procedure, which after all is an extremely open-ended space.

When I asked "who really wants those non-CRDT properties, anyway?" I was trying to say that, aside from a few specialized applications where centralized control is really necessary (e.g. a bank account that can never be allowed to be overdrawn, so the server must be able to reject edits), I've never seen an example where the CRDT laws interfere with what (I assume) you mean by "correctness" or any other goals. Unless there is a substantial space of applications where we want to merge concurrent edits but the CRDT laws can be shown to be unachievable, I think it's far better to focus on the CRDT space.

You really should have a meeting with @wthimbleby (who's at Apple) if you haven't already; he's an authority in this area and I owe most of my understanding of CRDTs to him.

I think persistent data structures are absolutely essential to any scheme where shared copies of data are mutated on several parallel branches. (I consider it a priority that CHAMPs and B-trees make their way into Swift Collections in the foreseeable future; “luckily” we already have implementations of these in progress.)

Ooh, CHAMPs look interesting! I hadn't seen them before; is Michael Steindorfer's thesis the canonical source to read more on these?

To be clear, persistent data structures are classically studied in the context of immutability (and thus FP), but the same basic techniques can be used with copy-on-write to build in-place mutable data structures that maximize sharing across (mutated) copies. Composition of copy-on-write data structures already does this naturally (try [[Int]] and use withUnsafeBufferPointer to inspect addresses). Persistent data structures in Swift should be mutable.
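For instance, a quick self-contained way to observe that sharing (not code from the thread, just an illustration):

    // Nested copy-on-write containers already share storage across copies.
    let a: [[Int]] = [[1, 2, 3], [4, 5, 6]]
    var b = a

    a[0].withUnsafeBufferPointer { original in
        b[0].withUnsafeBufferPointer { copy in
            // Same base address: the inner array's buffer is shared, not duplicated.
            print(original.baseAddress == copy.baseAddress)   // true
        }
    }

    b[0].append(7)   // b[0] now gets its own storage...
    a[1].withUnsafeBufferPointer { original in
        b[1].withUnsafeBufferPointer { copy in
            print(original.baseAddress == copy.baseAddress)   // ...but b[1] still shares its buffer with a[1]
        }
    }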

1 Like

Might be some crossed lines here. I had the feeling that @lorentey was speaking in favour of CRDTs, because they can be shown to work mathematically, as opposed to my 3-way merging approach, which leaves open the merge policy.

Anyway, interesting that @wthimbleby's name comes up, because I know Will from when he was just starting out, before he was at Apple. Did not know that he was an expert on CRDTs. Good to know.

As much as I appreciate that persistent data structures are very useful in FP, as a way to build a mutable data container based on essentially immutable, append-only data elements, I'm of the opinion that many more Swift users, most of whom are probably not working on server software but on apps, would benefit from data structures that are "mergeable", like 3-way merge structures, and CRDTs.

Feels like a gap in the Swift universe at this point, because even if we do get persistent data structures, I assume they will be essentially in-memory constructions, and not designed for saving, or merging divergent instances across a network.

I would love to see some movement in the decentral data space, where app developers could easily sync data across devices, or with app extensions or watch apps. Not particularly fussed if this is a CRDT approach, a 3-way merge approach, or something else. But some movement in this direction would be welcome.

2 Likes

Yes, I think you have named some good cases. Sharing a file containing Codable types is one case. Effectively this is the JSON case you mention.

Indeed, a sharing extension. The sharing extension can write into its branch, and later the main app can merge the changes. Same goes the other way: the main app can keep its branch up to date, and the extension can merge the changes it needs when it launches.

But even within a single, more complex app, you have multiple components, often working independently. This is basically why actors were introduced. You can imagine an import component, or a network component, undertaking long running tasks, while the main UI has to keep working.

The risk when you try to solve this in an ad hoc way is that you have to interrupt the main UI thread to apply changes, perhaps even undoing recent in-memory edits made by the user. The solution I proposed, which is very similar to the multiple-context approach of Core Data, is to decouple these processes and only merge after the fact, at a convenient point in time.

To me, this is a truly data driven system. Changes can be made on a background thread, and you later merge, determining diffs that drive the UI. So SwiftUI is all about diffing, and with this approach, the data level is also all about diffing. When I have worked with this approach in SwiftUI, the whole thing feels like it fits together perfectly. The data drives the whole app.

I agree. I have plenty of apps I want to make, but storing data sometimes remains a black box. JSON files are an option, but parsing an entire file can take a while. SQLite is an option, but SQLite databases can't live on a network drive (iCloud, Dropbox, …). Core Data is nice but not cross-platform. Long ago I tried to make a Git-like file format by subclassing NSIncrementalStore.

I never seem to get exactly what I want out of existing solutions, so I am now delving deeper into how databases work under the hood. It's refreshing and instructive. I got a BTree working (somewhat) as the storage layer. Keys can be anything, not just numbers. Values are spread out over multiple pages. Pages could be any size, but I chose 4K since that is (I think) the page size on Apple's ARM devices. Ints and page sizes/references are stored in a compressed format. Codable is used to transform a value into an internal data structure, though anything could really be stored in any way, by just conforming to the cell write protocol. This BTree is just a first step: modelling, history, sharing, user permissions, local/remote…

But all of this could be improved (e.g. the capacity of Data is actually bigger than 4K :slightly_frowning_face:, there's no regard for endianness yet, actors are reentrant), and I don't have a CS background, so I just stumble into things and figure them out as I go. I'm not sure anyone would be interested in my code or any collaboration like code-reviewing or mentoring me (self-taught, solo dev, 1 day/week spare time). If somebody is interested, nice :grinning:; if not, my apologies for making you read this.

Yes, I know there are other established projects out there, but using those doesn't teach me much about the underlying principles. I've just always been interested in how things work, as close to the metal as possible.

2 Likes

i like your attitude ) i was in that boat a number of years ago, and one of my first projects was ... a B-tree!

two further ideas for your implementation: make your disk format and your memory format exactly the same, and don't read/write but memory-map (instant load/save times, and free caching by leveraging the system VM to do the job).

1 Like

Thanks. That’s encouraging to hear. I was really hesitant to post my reply.

I'm now opening a file with FileHandle and reading 4K pages into Data. What I have in memory (i.e. Data) is the same as what's on disk*. Pages are created as needed and appended to the end of the file. I knew not to read the file as a whole, since that limits it to the available RAM, right? Does memory-mapping not also read the entire file into, well, memory in one go? In the past I usually read/wrote files in one go, since that was the easiest thing to do, but that's of course not appropriate here. I'll read up on it. Thanks for the tip.

*The struct value in userland and the on-disk representation aren't the same, if that's what you mean, like the difference between an NSManagedObject and a SQLite table. I'm bridging that gap via Codable. Sorry if I've misunderstood your advice.

that's the beauty of it: you map the whole, say, 1 GB file into memory. then you can access random locations of that memory and the relevant I/O happens magically. if the system needs to unmap some area to free RAM for something else, and that memory chunk is unchanged, it is just dropped; and if it was changed, the system magically writes it back to your file.

i meant an even closer relationship, where "your disk representation is your RAM representation". e.g. instead of having a pointer in your memory struct, you have an offset. obviously there would sometimes be complications, and it is not easy or wise to do this in all cases. but when you can, everything just flies.
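to make that concrete, here is a hedged sketch of the mmap route on Apple platforms; the path and error handling are purely illustrative. (for read-only access, Foundation's Data(contentsOf:options:) with the .mappedIfSafe option gives you a mapped file with far less ceremony.)

    import Darwin

    // Illustrative sketch of read/write memory-mapping; path and error handling are minimal.
    let path = "/tmp/store.db"
    let fd = open(path, O_RDWR)
    precondition(fd >= 0, "could not open file")

    var info = stat()
    fstat(fd, &info)
    let length = Int(info.st_size)

    // MAP_SHARED means changes made through the pointer are written back to the
    // file by the VM system; there are no explicit read or write calls.
    guard let raw = mmap(nil, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0),
          raw != UnsafeMutableRawPointer(bitPattern: -1) else {   // -1 is MAP_FAILED
        fatalError("mmap failed")
    }

    let bytes = raw.bindMemory(to: UInt8.self, capacity: length)
    bytes[0] = 42                   // pages are faulted in lazily as you touch them
    msync(raw, length, MS_SYNC)     // optionally force dirty pages to disk now
    munmap(raw, length)
    close(fd)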