Ordered collection diffing

I like the direction of this proposal. However there are two things that I‘m not sure about.

  • The naming scheme ˋby areEquivalentˋ reads weirdly (judging a non native English speaker)
  • I would have expectaed the Diff struct to also conform to the OrderedCollection protocol, which is mentioned in the alternative section.

I'm wholly in favor of moving forward on this. I've dug a bit in to diffing, and have a couple of thoughts:

  • I'd like to see a way to get an edit distance without getting the actual diff. Typically you can compute the distance between to ordered collections faster than you can compute the actual difference, and that can be useful in situations (such as sorting autocompletion results by "which one is 'closest'"); This is similar to how we've added .count(where:) to collections so I don't have to do filter(_:).count.

  • I'm not sure of the proper terminology for them (if it even exists), but there's a difference (ha) between "eventually consistent" and "absolutely consistent" diffs. An "absolutely consistent" diff would be one where the output of each individual step is a valid data structure. An "eventually consistent" diff is one where the insertions/deletions/moves might be out of order. As an example of this: UITableView allows you to specify eventually-consistent diff operations, because the call to endUpdates() will internally sort them in to absolute consistency for application. However (IIRC), NSTableView requires you to specify diffs in order, up front (ie, an absolutely consistent diff). It might worth considering allowing clients to request an eventually-consistent diff instead of always providing an absolutely-consistent diff. (An eventually consistent diff can be computed more quickly than an absolutely consistent one)

  • I think this was in the proposal (inferringMoves()?), but I want to make sure I can request a diff that has a deletion + subsequent insertion of a value, or a diff that just has a move of a value.

  • There needs to be a way to indicate that an element has changed, but that does not correspond to a deletion/insertion. This is analogous to wanting to reload a row in a tableview because the underlying model has updated.

I'll probably have more thoughts later, but this is what I've got for now. As I said, an enthusiastic +1 on the overall idea.

9 Likes

Sorry about that, this API is following the precedent of existing API such as starts(with:by:) and elementsEqual(_:by:).

As mentioned in the alternative section, two diffs produce the same state transition if they contain the same changes regardless of their order. For example, the enumeration order is well-defined for safety and a different order is used for fast application in RangeReplaceableCollection. This makes OrderedCollectionDifference an unordered collection.

I'm pretty sure that the methods of computing edit distance that claim to be dramatically faster than performing a diff are not conforming to longest common subsequence semantics.

I have been chatting with some people about exposing more access to the intermediate data structures that are involved in diffing in the future, as they're very useful for projects for which diffing is a core part of their functionality. That said, the triangular matrix that encodes the final edit path costs O(n*d) to construct and the backtracking traversal for generating the changes is O(d) so in practice the extra expense of generating changes when you're only interested in an overall count is forgettable.

And since OrderedCollectionDifference is a Collection, you can compute a levenshtein distance with custom weights using .count(where:) :slight_smile:

I'm not sure how to refer to the differing batch semantics that NSTableView and UITableView use to produce state transitions in polite company, but I am very familiar with them. The enumeration order of OrderedCollectionDifference is defined to be safe for iterative application to both.

You can drive either table view type with code such as:

let tableView : NSTableView
// ...
tableView.beginUpdates()
for c in diff {
    switch c {
    case .remove(offset: let o, element: _, associatedWith: _):
        tableView.removeRows(at: [o], withAnimation: .effectFade)
    case .insert(offset: let o, element: _, associatedWith: _):
        tableView.insertRows(at: [o], withAnimation: .effectFade)
    }
}
tableView.endUpdates()

Conspiciously, the code above avoids calling .moveRow(at:to:). This is because of the differing batch semantics between the two table views; NSTableView can't take advantage of associated offsets without accounting for element drift, but a naïve approach will work with UITableView:

let tableView : UITableView
// ...
tableView.beginUpdates()
for c in diff {
    switch c {
    case .remove(offset: let o, element: _, associatedWith: let a):
        if let a = a {
            tableView.moveRow(at: IndexPath(row: o, section: 0), to: IndexPath(row: a, section: 0))
        } else {
            tableView.deleteRows(at: [IndexPath(row: o, section: 0)], with: .fade)
        }
    case .insert(offset: let o, element: _, associatedWith: let a):
        if a == nil {
            tableView.insertRows(at: [IndexPath(row: o, section: 0)], with: .fade)
        }
    }
}
tableView.endUpdates()

For unassociated removals/insertions, you want the result of difference(from:). For move associations, pass the result through inferringMoves() (example).

Tracking the mutating contents of identity types that reside in a collection is a non-goal of this proposal, but you can represent item replacement in diffs you generate yourself by constructing a diff with associated .insert and .remove changes containing different elements at the same position, such as:

let diff = OrderedCollectionDifference<String>([
    .remove(offset:0, element: "oldvalue", associatedWith: 0),
    .insert(offset:0, element: "newvalue", associatedWith: 0)
])
5 Likes

This is great, @numist, thanks! I'm keen to see something like this in the standard library.

I can see how the inferringMoves() method works for both of these cases, however it'd be nice to be able to request some extended form of OrderedCollectionDifference that includes discrete move(…) and update(…) enumeration cases, given how common these two activities are in AppKit and UIKit code.

2 Likes

I think this is a great idea. I've definitely found myself googling algorithms for this and rolling my own in the past. +1

My most recent use case though, was diffing two arrays to decide how to insert & remove rows from a tableView. I know, that's not a Swift language problem. I agree. It's outside of this proposal, but I think that an integration with UITableView / UICollectionView / mac equivalents would be a really helpful

Greetings Scott,

I like where this is going. First off, contact me privately if you would like some fun background reading.

  1. Would the name "Delta" be appropriate here instead of "Difference"? Specifically as a productive suffix. I think it would save a lot of typing and spelling errors, for one, and for another, it's an already accepted term of art for this construction.

  2. Is there a mechanism to indicate moves oneself? For example when implementing undo, issuing the Undo command would behave differently if the user switched the order of elements or if the user, say, dragged an element away and then put an element back. Just to give the general idea of what I mean.

Cheers,
-Colin

2 Likes

I think the general functionality proposed here would be useful, but the major problem I have is that it creates a new concept of an OrderedCollection which tries to address, in a very narrow way, the fundamental issues with Dictionary and Set conforming to Collection that have been very extensively discussed elsewhere on the forums. There's no reason why this particular functionality is any different to all the other order-dependent methods on Collection, etc. This post is a great example:

As mentioned, you can already do this destructuring on Dictionary and Set manually with existing methods:

let d = [1: "one", 2: "two", 3: "three"]
let first = d.first
let rest = d.dropFirst()

so why would it be different in switch statements? Why are all these other order-dependent methods available on Set and Dictionary but not the diffing ones?

My suggestion is that this proposal focuses on the collection diffing part and not try to simultaneously litigate the orthogonal issue of the suitability of Set and Dictionary conformances to Collection. If you think the right solution to that orthogonal issue is to e.g. split Collection in two by introducing OrderedCollection and deprecating methods on Set and Dictionary, then I think that should be a separate proposal. Another much-discussed alternative would be to move Collection conformance to a "view" property on these types instead of conforming directly.

6 Likes

Early versions of this proposal included cases like .move and .replace, but they added significant complexity for adopters that didn't care about the distinctions and were impossible to order safely for naïve applicators. .update (distinct from .replace) was extra tricky because it requires identity types.

Multiplying the API surface is harder to pitch and doesn't resolve the safety concerns (the type would be impossible to apply to NSTableView).

As a compromise, associated offsets allow this API to use only insertions and removals to express moves (associated changes with the same element) as well as replacements (associated changes with different elements) that can be interpreted as updates by codebases that have their own contracts around identity.

"Delta" was one of the names we considered when I started working on this; while it is accepted jargon, it's not as good for people with more casual experience with diffing and can be awkward conversationally. "Patch" was unwieldy because it's both a noun and a verb; out of the notebook of synonyms we tried, the one that felt most natural to say and write was a "difference" comprised of "changes".

OrderedCollectionDifference.init(_:) allows you to create any difference you want as long as it meets the constraints of the validator.

.dropFirst() in those conditions operates functionally like .dropRandom() and is pretty easy to rationalize, but the result of diffing two instances of Dictionary or Set can never be applied onto a statically-predictable base state. As far as I know it would be the first order-dependant method on Collection that is completely unusable with Set and Dictionary.

The introduction of OrderedCollection is only supposed to be the beginning of a solution, with further reorganization to follow in future proposals as you suggest. Maybe BidirectionalCollection (as pointed out by @Michael_Ilseman) would be a better beachhead for diffing, but it would be a shame to leave out unidirectional ordered collections…

1 Like

All collections in Swift have an order, and as it's exposed, you can calculate a diff - just like you can reverse them, or ask for a prefix.
The only question is: Are the results of such operations meaningful?

I really don't think so (to such an extend that I consider the hierarchy of Collection/Sequence-protocols broken), but that's what we have. Actually, I can think of use cases for diffing Sets, whereas many other operations just make no sense without order (I've been waiting a long time for examples to prove the opposite :wink:).

If we had OrderedCollection, I think it would be sensible to assume that Collection is not ordered - and it would feel odd to keep the majority of order-depended operations where they are, and introduce a separate protocol for a small subset.

[a big +1 for the core of the pitch - Swift needs more order ;-)]

3 Likes

Do you expect users to implement this customization point in a way that differs than what Sequence.elementsEqual(_: by:) already does? If not, when why isn't OrderedCollection just a marker protocol (like RandomAccessCollection is relative to BidirectionalCollection)? The sub-sequence functions were ripped out as customization points for Sequence and Collection because there was no need to ever customize them, and said customizations required heroic measures.

Just a quick look, and I don't see what the point of this protocol is? Unless the point of its elementsEqual differs from the one in Sequence, Its functionality is just Collection! I'm not saying that the difference methods or the OrderedCollectionDifference type are bad ideas; those can just be extensions on Collection.

It looks like you're working around the fact that Set and Dictionary are inappropriately sequences. So, unless a user is making a type similar to those two types, 99% of them will be slapping OrderedCollection on their collection type instead of Collection, making them distinct without a difference. To use a C++ adage, you're inappropriately designing against Machiavelli instead of just Murphy. I can break your grand design by just adding conformances to OrderedCollection onto Set and Dictionary manually. You just have to accept the current stance on using ordered operations on Set and Dictionary: "don't do that"/"too bad".

The real solution is to make an IndexedContainer root protocol that has the following from Collection: Element, Index, subscript(position:), Indices, indices, isEmpty, and count. Index only has to be Equatable. Collection would refine both Sequence and IndexedContainer and still have the additional members to make it work as it does now. The for-in statement has to be modified to accept Sequence or IndexedContainer objects, calling indices for the latter.

Set and Dictionary would dump Collection conformance for IndexedContainer. This part can't be done if this plan is done after ABI stability, though.

2 Likes

@dabrahams has made this observation too, that in principle it’d be better for Set and Dictionary to not have been Collections directly and to instead have an items view property that vended an arbitrarily-ordered collection of elements. While I can see that point of view, it would also mean that Set and Dictionary lose a lot of useful generic API they get from Collection, and taking that conformance away now would be a pretty disruptive compatibility change. Having a protocol for meaningfully-ordered collections seems like a good idea to me so that APIs going forward have the opportunity to make this distinction.

3 Likes

This looks like a great addition to the standard library. I've written partial and bug-ridden implementations of diffing myself. I also like that you called it OrderedCollectionDifference. That leaves the possibility to have other kinds of diffs. For example, in the future, maybe we could one day generically derive diffing for a datatype similar to how Codable works.

6 Likes

Absolutely yes, but diffing keyed types is already trivial thanks to set operations and membership testing.

I'd expect a future proposal to deprecate Sequence.elementsEqual(_: by:) because it assumes not only that the elements are ordered but also that the sequences are finite.

I wanted to leave that kind of dirty work to @Ben_Cohen & co. though as my primary concern is diffing.

Related to my reply to @Tino, I believe there are two fundamental diff types, keyed and ordered, and diffing two objects of the same type is a keyed diff. I think introducing a concept like Keyed that comes with a diff type generic on Key and Value is absolutely worth doing, but it warrants a proposal of its own :slight_smile:

3 Likes

Diffing JSON or a dictionary would be super useful for unit tests. at the moment diffing those is not really straight forward. So keyed diffing would be super useful.

1 Like

I want to preface this by saying that I think that diffing methods would be a great addition to the standard library, I only disagree with mingling this with the ordered collection issue.

Depends what you mean by “on Collection” but there are several methods inherited from Sequence that are very difficult to think of a reasonable use for.

This seems like the wrong order to me. It hasn't been decided that OrderedCollection is the right way to address this issue, so making this proposal contingent on OrderedCollection entangles the two issues unnecessarily (and, in my opinion, counterproductively).

Sure, I don't want to push for a particular solution here, or really discuss it at all, I'm only arguing that this proposal is the wrong level to make this decision at because it has much broader implications for a controversial topic. I would really prefer either that the “ordered” collection question is discussed/solved before it's decided where these diffing methods should live, or that this proposal ignores that orthogonal issue entirely and leaves it to be dealt with holistically later.

5 Likes

To my surprise, there’s an elementsEqual function in section Order Dependent Operations on Dictionary

One thing I wonder about as part of formalizing ordered collections: should partition(by:) be stable by default for Ordered Collections? (i.e., it doesn't shuffle the order of the contents).

I, perhaps naively, expected partition(by:) on Array to be stable, since Array is conceptually an ordered collection, however it currently shuffles the elements. The current implementation is simpler to implement and possibly more efficient than a stable algorithm, but the behavior was unexpected.

In the Array partition(by:) std lib documentation you can see the shuffling:

var numbers = [30, 40, 20, 30, 30, 60, 10]
let p = numbers.partition(by: { $0 > 30 })
// p == 5
// numbers == [30, 10, 20, 30, 30, 60, 40]

To be clear, a stable algorithm would produce:

// p == 5
// numbers == [30, 20, 30, 30, 10, 40, 60]

Just for educational purposes and to exercise my brain, without looking up references I implemented a naive stable algorithm that I'm fairly certain has no increase in complexity of space or time (for Array at least) over the current unstable partition(by:) in Array. There are likely standard implementations that are more elegant or concise, but it does at least suggest it's possible to implement in a reasonable fashion.

If nothing else, including stablePartition(by:) in the std library for ordered collections would be useful.

Perhaps this is expanding scope on the proposal to formalize ordered collections and should be a separate proposal for a later time.

I would love an InsertionOrderedDictionary. I use one in C++ in a few places and would like one in Swift.