SE-0240: Ordered Collection Diffing


(Douglas Gregor) #1

The review of SE-0240: Ordered Collection Diffing begins now and runs through Tuesday, January 22nd, 2019.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to me as the review manager via email or direct message on the forums. If you send me email, please put "SE-0240" somewhere in the subject line.

What goes into a review of a proposal?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift.

When reviewing a proposal, here are some questions to consider:

  • What is your evaluation of the proposal?

  • Is the problem being addressed significant enough to warrant a change to Swift?

  • Does this proposal fit well with the feel and direction of Swift?

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Thank you for contributing to Swift!

Doug Gregor
Review Manager


[Accepted with modifications] SE-0240: Ordered Collection Diffing
(Anders Ha) #2

-0.5.

As much as I'd love to have diffing available universally in Swift, the proposal does not cover in-place mutations that could be prevalent especially for GUI applications.

The proposed solution relies solely on the Equatable, which frames the algorithm to treat the whole value as the identity of an element. This in turn implies that there shall have no in-place mutation detected, because if an element is partially changed, the whole-value identity is consequentially different.

For an example where detecting in-place mutations is useful, let's say we have a stream of list of conversation models, where:

  1. each conversation has a natural unique identity alongside its other properties, e.g. an UUID; and
  2. we want to react to only conversations that have a particular set of attributes being changed.

With the current proposed solution, it would not be easy, as partially mutated conversation models are differentiated by their whole values, resulting in a pair of removal and insertion which does not directly resemble the scenario.

In short, the generalised solution should allow the identity to be decoupled from value equality, and using value equality only to determine whether the element has changed.

(Had gone through the proposal. Experience in deploying diffing in production apps for UI animations and management.)


#3

+1

I feel that diffing algorithms are sufficiently difficult to implement correctly that there is value in providing a standard implementation.

I have not used similar features in other languages.

I read the initial proposal and the associated discussion.


(Jeremy David Giesbrecht) #4

I’m neutral on this.

I wrote and use something similar myself.

I wish I had noticed the pitch thread to mention the following before it got this far, but there are several reasons the addition as proposed would not be a suitable replacement for many of my use cases.

  1. Reporting distances instead of indices can be a real slowdown. If you go that route, the whole thing should be constrained to RandomAccessCollection, not just BidirectionalCollection. But I would find it much more useful if it reported indices directly and stayed on BidirectionalCollection. For users who need to send it over a network or something, it would be source‐trivial to convert Array(bidirectionalOnly) in order to get correlated integer indices on both ends on the connection. If that is too much allocation, it is also possible to create an IndexedByOffset wrapper collection.

  2. Why is the element stored in the reported difference? That means the difference type is duplicating a lot of the memory of the base collections. If the changes carried indices, it would only be O(1) to get the elements if they are needed.

  3. The ranges of changes tend to be far more useful than exhaustive lists of indices (regardless of the index/offset issue). Was that forgone on purpose?

  4. It is often just as useful to know what didn’t change. I think case keep(...) is missing from the reported difference. (I know that information can be derived from the others, but with ranges reported it requires a clumsy loop, and with indices/offsets reported it requires re‐iterating the whole thing.)


(Daniel Höpfl) #5

What is your evaluation of the proposal?

+0.25
(+1 if there was a .modified and .moved as already posted by @Anders_Ha)

Still positive because it should be possible to transform the .insert/.remove change script in a post-processing step.

Is the problem being addressed significant enough to warrant a change to Swift?

Yes

Does this proposal fit well with the feel and direction of Swift?

Yes

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

Had to write my own diff for UITableView.
Looked at https://github.com/mcudich/HeckelDiff but did not like the (ab)use of hashValue vs. ==.


(Svein Halvor Halvorsen) #6

+1. I'd use this for table and collection view updates.


(David Hart) #7

As mentioned by @Anders_Ha, it will be difficult to use this diffing API for anything but the most simple table and collection view updates: the difference returned by the algorithm won't tell when you need to reload a row instead of deleting then inserting one. For that, the algorithm needs to evaluate identity vs equitability. For an example, have a look at the https://github.com/lxcid/ListDiff library.

To illustrate the difference (pun intended), imagine we have two states for a list of people:

Before

[
    {
        "uuid": "4628c50f-4516-4d52-ad30-28ed81b3c1b3",
        "name": "Chris Lattner",
        "employer": "Apple"
    },
    {
        "uuid": "6a5ff808-91e0-4b1a-a4dc-37b30704c8a1",
        "name": "Mark Zuckerberg",
        "employer": "Facebook"
    }
]

After

[
    {
        "uuid": "4628c50f-4516-4d52-ad30-28ed81b3c1b3",
        "name": "Chris Lattner",
        "employer": "Google"
    },
    {
        "uuid": "2e683426-801c-4754-924c-95920fc760a5",
        "name": "Bill Gates",
        "employer": "Microsoft"
    }
]

The algorithm we need for smooth table updates should generate a difference of (following the proposal's notation):

reload 0
delete 1
insert 1

instead of:

delete 0
insert 0
delete 1
insert 1

(Gwendal Roué) #8

I agree with everything that has been said about identity and, generally, moved/updated items. As a GUI app developper, this is something I often need. And this is something that is currently provided by hand-made code, or third-party libraries. And as most of fellow developers, I struggle a little bit when it is time to choose the best-suited algorithm, both in terms of feature set and computational complexity.

But: does it invalidate the whole proposal? Doesn't it address one thing, and doesn't it address it well? Shouldn't we consider its value solely based on what is proposed?

It is worth quoting the Motivation section, which does not address table and collection views:

Representing, manufacturing, and applying transactions between states today requires writing a lot of error-prone code. This proposal is inspired by the convenience of the diffutils suite when interacting with text files, and the reluctance to solve similar problems in code with libgit2.

Many state management patterns would benefit from improvements in this area, including undo/redo stacks, generational stores, and syncing differential content to/from a service.


(David Hart) #9

Not necessarily. I just wonder if there is a more general solution that would fit both needs. I just worry that it is a lot of new types (ie, complexity) in the Standard Library while still not bringing a solution to the kind of diffing that a huge part of the current Swift community (Apple platform GUI developers) needs.


(Erik Little) #10

Could you not use the provided

public func shortestEditScript<C>(
        from other: C, by areEquivalent: (Element, C.Element) -> Bool
    ) -> OrderedCollectionDifference<Element>
        where C : BidirectionalCollection, C.Element == Self.Element

to use identity instead of the straight Equatable method?

As to the naming of these methods, is script a term of art in diffing? I don't fully understand the meaning of script in this context. Or is it simply meaning that the returned value can be used as a "script" to apply the difference to another collection?


(Svein Halvor Halvorsen) #11

Although the proposal explicitly does not include a reversed() function, it does mention one:

For example, this propsal could have included a reversed() function on the difference type that would return a new difference that would undo the application of the original.

I would highly recommend against such a function. We already have a reversed() on over ninety varieties of collections and sequences, and they all return a new collection with the same elements, but in reverse order as the original.

If this function is to ever be implemented, it should be called inversed() or something like that.


(Nate Cook) #12

To support this given the proposed diffing tools, we would need to offer a chance to provide a different equality test for inferringMoves than for the original diff. This alternative works for me (pardon the long code sample — implementation here):

extension OrderedCollectionDifference {
    func inferringMoves<T: Hashable>(
        by f: (ChangeElement) -> T
    ) -> OrderedCollectionDifference<ChangeElement> { ... }
}

let before = [
    Person(uuid: "4628c50f-4516-4d52-ad30-28ed81b3c1b3",
        name: "Chris Lattner",
        employer: "Apple"),
    Person(uuid: "6a5ff808-91e0-4b1a-a4dc-37b30704c8a1",
        name: "Mark Zuckerberg",
        employer: "Facebook")
]

let after = [
    Person(uuid: "4628c50f-4516-4d52-ad30-28ed81b3c1b3",
        name: "Chris Lattner",
        employer: "Google"),
    Person(uuid: "2e683426-801c-4754-924c-95920fc760a5",
        name: "Bill Gates",
        employer: "Microsoft")
]

let diff = before
        .difference(from: after)
        .inferringMoves(by: { $0.uuid })
// Diffing.OrderedCollectionDifference<Person> = {
//   insertions = 2 values {
//     [0] = insert {
//       insert = {
//         offset = 0
//         element = {
//           uuid = "4628c50f-4516-4d52-ad30-28ed81b3c1b3"
//           name = "Chris Lattner"
//           employer = "Apple"
//         }
//         associatedWith = 0
//       }
//     }
//     [1] = insert {
//       insert = {
//         offset = 1
//         element = {
//           uuid = "6a5ff808-91e0-4b1a-a4dc-37b30704c8a1"
//           name = "Mark Zuckerberg"
//           employer = "Facebook"
//         }
//         associatedWith = nil
//       }
//     }
//   }
//   removals = 2 values {
//     [0] = remove {
//       remove = {
//         offset = 0
//         element = {
//           uuid = "4628c50f-4516-4d52-ad30-28ed81b3c1b3"
//           name = "Chris Lattner"
//           employer = "Google"
//         }
//         associatedWith = 0
//       }
//     }
//     [1] = remove {
//       remove = {
//         offset = 1
//         element = {
//           uuid = "2e683426-801c-4754-924c-95920fc760a5"
//           name = "Bill Gates"
//           employer = "Microsoft"
//         }
//         associatedWith = nil
//       }
//     }
//   }
// }

(Jordan Rose) #13

I'm lukewarm on actually adding this to the standard library. It's got varied uses, but it's not something everyone needs, and it's not an obvious gap in what it offers. So count me as abstaining from the "evaluation" part of the review.

Two API design comments:

applying(_:) throws -> Self instead of applying(_:) -> Self?

Applying a diff can only fail when the base state is incompatible. As such, the additional granularity provided by an error type does not add any value.

One thing I hate is when patch or git apply fails but won't tell me where the mismatch was. An advantage of throwing here is that you could include information about how the diff failed.

Use Index instead of offset in Change

Because indexes cannot be navigated in the absence of the collection instance that generated them, a diff based on indexes instead of offsets would be much more limited in usefulness as a boundary type. If indexes are required, they can be rehydrated from the offsets in the presence of the collection(s) to which they belong.

My first instinct is definitely to agree with @SDGGiesbrecht (and was when I ran into @numist in person a few weeks ago). At the time we talked about how you'd have to walk the collection anyway to apply the diff, but that's not necessarily true for single insertions or deletions (a very common operation when hooked up to a table view). For larger diffs it at least won't make the algorithm quadratic, though.

On the other hand, Indexes are definitely transient, and so using Indexes would make it impossible to "un-apply" a diff. That seems pretty important to me.

If we do decide to use indexes instead, I suggest allowing OrderedCollectionDifference itself to convert to and from an Int-based form, similar to the conversions String provides for NSRange, or Collection provides for RangeExpression.

extension OrderedCollectionDifference where Index == Int {
  func relative<C: Collection>(
    to collection: C
  ) -> OrderedCollectionDifference<Element, C.Index>
}

extension OrderedCollectionDifference {
  func usingOffsets<C: Collection>(
    in collection: C
  ) -> OrderedCollectionDifference<Element, Int>
  where C.Index == Self.Index
}

// (and then probably restrict the Codable conformance to the Int index case)

(David Hart) #14

That sounds like a good solution! Do we then infer reloads from insert/remove associated with the same offset?


(Nate Cook) #15

Right, you'd need to build some logic on top. Something like:

  • associated with same offset: reload
  • associated with different offset, elements equal: move
  • associated with different offset, elements not equal: move and reload? (not sure of how that gets handled in UIs)
  • no associated offset: add/delete

(Jeremy David Giesbrecht) #16

This would get the best of both worlds, wouldn’t it? I like that idea.


(Jeremy David Giesbrecht) #17

Another reason to leave the element out of it is to have universally Codable differences even when the element is not Codable.

For cases where the decoder won’t have the end result to pull additions from, there could be a constrained extension to fetch a variant with them stored internally:

struct OrderedCollectionMutation<Insert> { // ← The name is just a placeholder.
    // Almost the same as OrderedCollectionDifference,
    // but with a storage for inserts:
    enum Mutation {
        case insert(offset: Int, element: Insert, associatedWith: Int?)
        case remove(offset: Int, associatedWith: Int?)
    }
    // ...
}

extension OrderedCollectionDifference {
    func storingAdditions<C : Collection>(
        from result: C
    ) -> OrderedCollectionMutation<C.Element>
    where C.Element : Codable {
        // ...
    }
}

A related thought: What if it were used as proposed on a lazy LineView on a String, which vends its elements as Substring instances? (This is by far the most common thing I use difference analysis with.) Save for copy‐on‐write magic provided by String, wouldn’t that result in the difference having a possible memory footprint of up to the square of the first collection’s count plus the square of the second’s count, since the subviews hold the entire collection? I guess what I’m asking is, isn’t the API as proposed unwise to use with collections where the element type itself is intended to be transient, such as SubSequence in general? But dropping the storage of the element (until requested) would remove that restriction.


(Nate Cook) #18

Re providing indexes instead of offsets: Range-replaceable collections may invalidate any existing indexes upon insertion or deletion. Any indexes after the first element in the difference is added or deleted couldn't be guaranteed to be semantically valid. Offsets, on the other hand, are always valid as long as the correct algorithm to apply the diff is followed.


(Jeremy David Giesbrecht) #19

The proposal provides no in‐place mutation anyway:

Intentional omissions:

[...]

mutating apply(_:)

There is no mutating applicator because there is no algorithmic advantage to in-place application.


(Dave DeLong) #20

+1 on the idea
-1 on this API


I believe diffing belongs in the standard library. It is a complicated problem that is difficult to get efficiently right. The usage of a good diff is widely applicable.


I do not like this implementation, for two huge reasons:

  1. I do not like that this is essentially the naïve implementation of diffing: it will only tell you what has been inserted and what has been deleted. There is minimal advantage to this instead of just doing some basic Set<T> symmetric differences in my own code

  2. The code, as it stands, does not provide a way for me to supplement the diffing algorithm with my own varieties. There is inferringMoves(), but even that is insufficient because it does not allow me to indicate "equal but not identical" (ie, "updated")


I would much prefer to see a public protocol defined that defines the API to diff two collections. The standard library could provide a single, concrete implementation of this protocol to do the InsertAndDeleteDiff implementation as it currently stands. The inferringMoves() bit of the API would simply be an alternate implementation of the algorithm that delegates to InsertAndDeleteDiff, and then layers the move inference on top of it.

However, the usage of a protocol would then allow me, as an app author, to provide yet another diffing algorithm that accounts for object updatability, so that for my problem space, I can get the semantics that I require.

I could see the API looking something like this (I'm using Array as the collection type for brevity):

protocol DiffChange { }

protocol DiffingAlgorithm {
    associatedtype ChangeType: DiffChange
    func compute<T>(_ source: Array<T>, destination: Array<T>, elementsEqual: (T, T) -> Bool) -> Array<ChangeType>
}

The associated type on the protocol means that for an InsertAndDelete implementation, you could have an enum that only has .insert and .delete cases. But if you used the InferringMoves implementation, then the enum could have .insert, .delete, and .move cases. (Or it doesn't even have to be an enum!)

From this you have the basic building blocks to build a diffing system that could truly be extensible (as something in the standard library should be) instead of being locked in to a single implementation that will not be useful except for the most trivial of cases.

The default implementations on BidirectionalCollection or whatever would basically be:

extension BidirectionalCollection {
    func difference<D: DiffingAlgorithm>(to other: Self, using: D) → Array<D.ChangeType> { ... }
    func difference(to other: Self) → Array<InsertAndDelete.ChangeType> {
        return difference(to: other, using: InsertAndDelete())
    }
}