SE-0240: Ordered Collection Diffing

  • What is your evaluation of the proposal?

Like many of the other commenters, I very much support the idea of including collection diffing algorithm(s) in the standard library, but I also have some reservations about the proposed design.

Specifically, I think the support for changes (incorporating a model for identity independent of equality) and moves is inadequate and storing elements themselves in the change model is unnecessary (it isn't clear to me what the purpose of that is and the copies seem unnecessary). I would prefer to see the change type model moves and changes directly by default with support for projecting to the primitive insert / delete representation. The insert / delete representation requires two entries in the difference set for changes and moves which would be quite awkward to work with if you need to extract change / move information. It is much easier to go the other way and requires less entries in the difference collection.

I am also not a fan of the name shortestEditScript. The proposal says this is a term of art, so maybe it is the best name but I don't find it especially descriptive. Something along the lines of changes(relativeTo:) seems like a more descriptive approach to naming this operation.

Beyond this, I will just +1 everything @brentdax said above.

  • Is the problem being addressed significant enough to warrant a change to Swift?

Yes

  • Does this proposal fit well with the feel and direction of Swift?

Yes

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

N/A

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I read the proposal and the review thread and am quite familiar with modeling collection changes in order to drive UI updates.

1 Like

I'll hold off at the moment, since a concern would make me give it a big ol' negative-infinity. Mainly because of this:

Collections can only be efficiently diffed when they have a strong sense of order, so difference production is added to BidirectionalCollection

We have a protocol for collections with a strong sense of order Collection (or Sequence if you don't need indexing support)! Don't upshift algorithms to BidirectionalCollection as a "solution" for the Dictionary-and-Set-aren't-really-Sequences problem if the algorithm doesn't actually need bidirectional traversal. This is a "bug" on Dictionary and Set, and we cannot let a solution sneak into the library; in the worst case, we would have to completely rebuild this library once a proper solution goes in.

So, downgrade shortestEditScript(from: by:) to Collection (or Sequence) if it doesn't really need BDC support. Extending from that, I don't think OrderedCollectionDifference is needed; any collection of Change should do. Past those two obstacles, I can see a decent API peeking out:

enum DeltaAtom<ChangeElement> {  // Formerly OrderedCollectionDifference.Change
    case insert(offset: Int, element: ChangeElement, associatedWith: Int?)
    case remove(offset: Int, element: ChangeElement, associatedWith: Int?)

    var isInsertion: Bool {  // These could be internal
        guard case .insert = self else { return false }
        return true
    }
    var isRemoval: Bool {
        guard case .remove = self else { return false }
        return true
    }

    static func inEnumeraitonOrder(lhs: DeltaAtom, rhs: DeltaAtom) -> Bool {
        switch (lhs, rhs) {
        case (.remove, .insert):
            return true
        case (.remove(let lo, _, _), .remove(let ro, _, _)):
            return lo > ro
        case (.insert(let lo, _, _), .insert(let ro, _, _)):
            return lo < ro
        case (.insert, .remove):
            return false
        }
    }
}

extension MutableCollection where Element == DeltaAtom<Any> {  // Yikes, I forgot how to actually specify this.
    mutating func validateChanges() -> Index? {
        let insertsStart = partition(by: { $0.isInsertion })
        /* ... */
        return insertsStart  // or nil if invalid
    }
}

extension Collection where Element == DeltaAtom<Hashable> {  // Here too
    func inferredMoves<Delta: RangeReplaceableCollection>() -> Delta where Delta.Element == Element {
        return /* ... */
    }
}

extension Sequence {  // Could this be lazy too?
    func applying<Result: RangeReplaceableCollection, Delta: Collection>(delta: Delta) -> Result? where Result.Element == Element, Delta.Element == DeltaAtom<Element> {
        return /*...*/
    }
}

extension Sequence {  // Same here on laziness
    func shortestEditScript<Delta: RangeReplaceableCollection, C: Collection>(from other: C, by areEquivalent: (Element, C.Element) -> Bool) -> Delta where C.Element == Element, Delta.Element == DeltaAtom<Element> {
        /*...*/
    }
}
2 Likes

-1

I'm not sure if this really belongs in the standard library. Originally we had the goal of keeping the standard library minimalist, and despite diffing being important for many applications and difficult to get right, I'm not sure if it really meets that threshold of being 'essential'. Why can't it be a 3rd-party package?

In fact, there are already 3rd-party packages for this. For instance, I've used the 'Differentiator' library in the past. It's from the RxSwift team but has been separated from their other work so you don't need to use reactive programming with it. As I understand it, the process for standard library evolution was supposed to be that we start as a 3rd-party package (whenever possible), collect experience, then eventually discuss absorbing that in to the stdlib. With that in mind, and if this does belong in the stdlib, I would suggest taking a look at the battle-tested RxSwift design instead of creating our own thing from scratch.

In particular, they have a protocol to represent record identity. I think that's an essential part of what people want to do with diffs, and would support adding that protocol (or a similar one) to the standard library.

1 Like

"script" is a term of art, and your definition is exactly correct :slight_smile:

This would work, but without Hashable conformance inferringMoves() would be O(n²), and taking a closure for both == and hash(into:) is pretty messy API. Boxing (or unwrapping) would be more readable and achieve best performance.

Honestly the problem of identity is one that I've grappled with constantly while designing this API, which has been a lot longer in the making than most people realize. Philosophically, I think it's best to avoid specialized API for serving identity types until the concept of identity is better formalized in the language itself. In the meantime my demos have used arr.map { $0.identity } to get around this and they've worked pretty well.

Not exactly, but you're on the right track! "Position" and "offset" are extremely related concepts, but offsets are vulnerable to shifting caused by insertions and removals of earlier elements. This is the most common reason people have trouble with representing (and using) diffs.

Ultimately to determine if an element moved from its start and end offsets, you need to take into account the number of insertions before its final offset minus the number of removes before its original offset (this is one reason why the diff type exposes sorted insertions and removals properties—they allow people to answer the question of position sameness in logarithmic time).

That said, your observation that associated offsets are capable of expressing replacement is exactly right and deserves restating: element and position differences between associated changes are two independent bits capable of expressing:

  • element moved (same element, different position)
  • element replacement/change (different element, same position)
  • element moved and replaced/changed (different element, different position)

My primary intention for this proposal was to introduce a diff type capable of standing alone as an interchange format between modules. Omitting the elements entirely requires that a collection be passed alongside the diff in most cases since the purpose of consuming a diff is usually either storage or application. It also reduces the expressiveness of associations (as described above).

If elements are especially unwanted, the diff type can still be used by making it generic on Optional<T> and an instance initialized from any sane collection of Changes with nil elements.

In the pitch, the function was named difference(from:) which I quite liked, but it was renamed due to the output semantics of LCS/SES (which the current implementation uses) which is specifically different from a more visually appealing (but more computationally demanding) solution space.

Ultimately even shortestEditScript(from:) may run in polynomial time so I'm still interested in the right compromise here. Currently my best ideas are either that difference(from:) should defer to a faster (but less LCS-conforming) algorithm when runtime complexity exceeds some limit or shortestEditScript(from:) should provide another parameter that allows the user to limit the runtime complexity in exchange for no result.

My instinct is that the former is more desirable for v1 of a diffing API, since the API can be extended in the future.

As usual, Dictionary and Set haunt us. The pitch attempted to resolve this by introducing OrderedCollection, which attracted some convincing feedback that resolving that tension was best left to another proposal.

I hope that there will be a resolution so the diffing API can be lowered to the appropriate level, but until then the largest common denominator of all the types that were to be granted conformance to OrderedCollection was BidirectionalCollection. This is useful since a variant of Myers (and some other diffing algorithms) can achieve significant speedups by running bidirectionally, producing a result when the algorithms meet somewhere in the middle.

Philosophically, I think that API should be easy and safe to use, expressive, and reserve the maximum amount of potential for future improvement. Which is to say: I agree that the BidirectionalCollection requirement is not perfect, but I would argue that it is good because it can be lowered in the future.

Assuming that Change would be renamed to something appropriate to avoid confusion with changes in other domains than "ordered" (e.g. for "keyed" or "multidimensional" (treelike) collections), I'd still argue that there is a lot of value in the standardized interpretation and validation of batch changes provided by OrderedCollectionDifference:

    /// To guarantee that instances are unambiguous and safe for compatible base
    /// states, this initializer will fail unless its parameter meets to the
    /// following requirements:
    ///
    /// 1) All insertion offsets are unique
    /// 2) All removal offsets are unique
    /// 3) All offset associations between insertions and removals are symmetric

While there's nothing stopping anyone from using collections of OrderedCollectionDifference<T>.Change for their own purposes, the guarantees above turn out to be extraordinarily useful for validation, minimizing representations, and maximizing the fidelity of Equatable—lessons learned the hard way by anyone who's written batch code against both NSTableView and UITableView.

1 Like

I think this is why our design priorities differ so vastly. My use cases have been primarily concerned with discovering changes caused by some external source in order to analyze or flag them. I am starting to see how your design makes a lot of sense for what you want to use it for. I’m now wondering if between what each of us thinking, the name and the underlying LCS algorithm are really all that’s shared, since the rest of the implementation would need to be designed quite differently for each case. So from now on I will try to put myself in your shoes and attempt to assess it solely based on the purposes you have in mind.

The first thought that jumps at me is what flags when you say this:

This sort of logic is only reliable for Array and its immediate relatives (ArraySlice, ContiguousArray, etc.) which have such semantics. You cannot expect offsets to remain valid after an insert or removal on any arbitrary BidirectionalCollection. A collection may make additional guarantees about it’s contents or order and perform ”fixes“ after each mutation to maintain that guarantee. I’ve seen things like SortedCollection and NonEmptyCollection out in the wild. I regularly use an NFDString. Indeed, even the Standard Library’s String type could produce a broken edit script in this case:

let start = "िकख"
let end = "नकग"

// Depending on the exact order of the inserts and removals in the script,
// this could trap on due to an out‐of‐bounds offset:
let reapplied = start.applying(end.shortedEditScript(from: start))

// And if it’s lucky and hasn’t trapped yet,
// this assertion likely does not hold:
assert(reapplied == end)

Whereas here we accomplish the same thing in a completely safe manner, because Array maintains its offsets:

let start = Array("िकख")
let end = Array("नकग")
let reapplied = String(start.applying(end.shortestEditScript(from: start)))
assert(reapplied == end)

The same gotcha and fix hold true for any of the collections mentioned above.

There are a few possible solutions:

  1. Drop BidirectionalCollection and go only with Array. Users can convert like the fix above.

  2. Provide a new protocol that makes the same guarantees as Array and restrict it to that. Users can conform their types where reasonable.

  3. Accept any BidirectionalCollection, but use Array internally for safety while applying the script.

  4. Require the presence of both start and end, never mutate them while constructing new, have the script offsets refer to positions in start and end, and have the script’s order represent that of new, such that the only operation ever done during construction is append() with elements fished from either start or end. (This is the strategy implemented by the library I use now.)

It strikes me that all of these three of these options erase the advantage of offsets over indices. Option 2 would even be faster with direct indices (e.g. less math for ArraySlice to do in its conformance). Edit: Crossed out some nonsense. ArraySlice is precicely why an offset would still be necessary for option 2.

This is basically how the applicator in the prototype works right now—diffing requires BidirectionalCollection and expresses offsets in terms of offsets into start and end, and applying(:) requires a base state that is isomorphic to start and conforms to RangeReplaceableCollection so the result can be efficiently constructed in terms of ranges from beginning to end.

1 Like

I like the idea, but like others have said I don't think it solves the problems most people are trying to solve and I'm not sure how I feel about this being in the standard library. Just my two cents.

My experience writing and using utilities like this is that people think they want the minimum edit script until they actually use it on a large collection with complicated differences (e.g. if you're a low-level serializer that doesn't know that the user just clicked "sort"), and then they remember that quadratic algorithms are bad. The almost-naive algorithm of stripping a common prefix and suffix and then just replacing one list with the other is often good enough and has a far favorable asymptotic bound. Fortunately, the Myers algorithm is pretty conducive to accepting an arbitrary bound on the complexity of the resulting edit script — if you haven't found a solution by iteration k, you abandon the search and do the simple thing. (Of course you want to strip the prefix and suffix as a pre-flight before getting into Myers in the first place.)

I would suggest taking an (optional?) parameter that bounds the complexity of the diff. Contra some of the other suggestions, I don't think this needs to be a rich interface, and instead I would allow exactly three possibilities:

  • fastest (guaranteed O(n)): strip the ends, maybe try a quick heuristic or two if that doesn't resolve it
  • balanced (guaranteed O(n log n)): strip the ends, compute a kMax that's O(log(n)) for the remaining n, iterate Myers through at most that many rows
  • smallest (guaranteed O(n * d)): strip the ends, run Myers to completion
10 Likes

I've taken some time to work up what I think a better API would be, which I've uploaded here: https://github.com/davedelong/SE-0240

The core of it is the DiffingStrategy protocol:

public protocol DiffingStrategy {
  associatedtype Change
  associatedtype Element
  associatedtype Changes = Array<Change>

  func apply<C: Collection>(changes: Changes, to collection: C) → Array<Element> where C.Element == Element

  func compute<C1: Collection, C2: Collection>(source: C1, destination: C2) -> Changes where C1.Element == Element, C2.Element == Element
}

A DiffingStrategy takes two collections of values, and computes the difference (or "edit script", I guess?) between them and returns that. Since the kind of edit (Change) is an associated type on the protocol, that means that different conformances can have different kinds of changes. By way of example, the repository linked to has a simple implementation of Wagner-Fisher (insert/replace/delete), Wagner-Fisher-with-inferred-moves (insert/replace/delete/move), SetMutations (insert/delete), DictionaryMutations (add/remove/replace), and Wagner-Fisher-with-identity (insert/replace/delete/reload).

This has several advantages:

  1. The standard library can vend "preferred" implementations, such as the Myer implementation
  2. There can be type-specific algorithms. In the repo, the SetMutations strategy is only usable on Sets (or set-like things) because it requires its diffable elements to be Hashable
  3. You can provide your own custom implementations, such as a "Wagner-Fisher with inferred moves and identity checking". The explicit nature of the protocol means that you can take advantage of composing implementations.
  4. It lends itself well to swapping in-and-out implementations based on developers' needs; as @John_McCall pointed out, sometimes computing the minimum script is more work than you are really interested in doing.

Edited to add: As for the scenario of "what if the user doesn't want to know what the explicit strategy used is", consider these extensions to Collection:

public extension Collection {
    func computeChanges<C: Collection, D: DiffingStrategy>(to other: C, using strategy: D) -> D.Changes where C.Element == Element, D.Element == Element {
        return strategy.compute(source: self, destination: other)
    }
}

public extension Collection where Element: Equatable {
    // this method would use the "preferred" strategy to compute changes
    // in this example, that's the basic Wagner-Fisher algorithm, along with whatever results it produces
    public func computeChanges<C: Collection>(to other: C) -> WagnerFisher<Element>.Changes where C.Element == Element {
        return computeChanges(to: other, using: WagnerFisher())
    }   
    public func computeChangesInferringMoves<C: Collection>(to other: C) -> WagnerFisherMoves<Element>.Changes where C.Element == Element {
        return computeChanges(to: other, using: WagnerFisherMoves())
    }
}
3 Likes

Yes, that will work safely. Quite clever how you even avoided the need for a reference to end. That shortcut hadn’t occurred to me. Good job on that part.

  • What is your evaluation of the proposal?

I like it!

I agree with @brentdax that the proposed names could be more consistent. shortestEditScript(from:) does not scream difference to me, making it harder to spot in e.g. contextual code completion lists. My favorite name for this diffing variant would be shortestDifference(from:).

I have pedantic reservations about having the OrderedCollectionDifference type itself conform to Collection. My feeling is that these APIs can (and probably will) get rich enough that people won’t typically need to manually look at individual changes in a difference, and as @sveinhal pointed out, Collection’s extensions can be confusing in this context. I think it would be slightly better to expose the full collection of changes as an explicit changes property, returning a nested view type (with the same collection implementation). This would reduce the OrderedCollectionDifference API surface to the handful of members that are directly relevant to differences, which would help with their discoverability. There is more than one way to enumerate changes in a diff — so it makes sense to highlight which ordering is used by making it an explicit choice.

  • Is the problem being addressed significant enough to warrant a change to Swift?

Yes. It would be extremely useful to have a standard boundary type for passing around ordered differences.

Having a stdlib-provided diffing algorithm is also helpful, especially given how tricky it is to implement. But this is a lot less important than standardizing the diff representation: diffs can be usefully created, merged, transformed, propagated and consumed without the need to ever compare collections.

  • Does this proposal fit well with the feel and direction of Swift?

Yes. There are two points that seem non-obvious at first: the use of BidirectionalCollection and the use of offsets rather than indices. I agree these are the correct choices:

Using BidirectionalCollection as a stand-in for OrderedCollection seems like a pragmatic choice. Bidirectionality on its own has nothing to do with whether the ordering is meaningful, but BidirectionalCollection does somewhat imply it in practice. E.g. Set and Dictionary aren’t bidirectional in Swift 3+ by deliberate choice rather than any inherent technical difficulty in implementing index(before:). So this choice makes sense even if we never introduce an OrderedCollection protocol.

The use of offsets rather than indices seems unavoidable in a general-purpose ordered diff type. Indices violate value semantics for collections in that they are generally only valid for use in the particular collection value that generated them (and its verbatim copies). (E.g., indices in ”café” aren’t necessarily valid in ”cafe\u{301}”, even though the two strings are equal.) Using offsets avoids these pitfalls, although it might slow down some usecases.

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

In my own implementations, I kept the insert/remove and insert/remove/move/update variants as separate types. I was initially surprised to see moves/updates represented by optional associatedWith payloads, but I’ve grown to like this approach.

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I was involved in some pre-pitch discussions and I closely followed the discussion on these forums.

1 Like

I’ve been gradually warming to the API over the course of the thread.

These didn’t matter to me because of guarantees provided by my own context. But I can see so many ways a developer could write broken code on top of this without knowing. You’ve convinced me offsets are the better default.

I’ve also looked closer at the prototype implementation and reconsidered my existing use cases, asking how I have might have written them if this API had already been available. I believe I can get the same level of performance for all the same use cases just by refactoring at the usage site. It just requires thinking differently. So in the end I would probably not use any of the customization I asked for upthread after all.

I consider all my concerns answered and my vote is now a in favour of the API more‐or‐less as proposed. (I’ll stay out of the naming discussion. Any variation is fine with me.) Thank you for writing it @numist, and for patiently pushing back when I started asking questions.

2 Likes

Thank you for the thoughtful feedback! In general this thread has been very effective at getting to the most difficult decisions we had to make when developing this proposal, which is a very good sign.

The most common piece of feedback remaining in this thread is that shortestEditScript(from:) is an unwieldy name for a method. This is true, and wasn't the case with the API that was pitched. After chatting with @Douglas_Gregor and other members of the team, I've pushed a PR that renames the method back to difference(from:).

3 Likes

I've worked on several diff-related problems before. In my experience the subject area is just too broad to be covered by a universal solution without significant compromises. I see the following practical problems with proporsal:

  1. I often needed both bounded and unbounded versions of the algorithm. For some uses there may be a good fallback when the edit script is getting too long – think animating model updates in the UI or fallback to reload. In other cases length of the edit script can be used to control behaviour – like "when too many changes just reset" logic. But when the diff is part of the core logic, you want to always calculate it no matter what.
  2. In general there are multiple shortest edit paths to choose from. Even more broadly – sometimes you want not the shortest but the cheapest path on the weighted edit graph. For the canonical example of updating table view one great criteria is the range of currently visible items – using it you can achieve "prefer updating non-visible parts" and similar behaviour.
  3. There are several practical versions of the algorithm with widely varying performance characteristics:
    • Basic Myers version which is proposed (quadratic space by edit distance)
    • "A Linear Space Refinement" from the original article
    • Refinements from the xdiff implementation used in Git
  4. For some popular trivial cases faster algorithms could be used:
    • For sorted collections there is linear lazy algorithm
    • When external index is available for collections
    • When only edit distance is needed and not the edit script
  5. There are many edit script representations suitable for different use cases:
    • ranges vs single elements
    • indices vs elements
    • providing unchanged items
    • handling moves
    • identity vs equality

Given all these different use cases and requirements I think the task is not for a stdlib-level API but for a full featured framework/module which can:

  • have unlimited API surface to cover all use cases
  • provide distinct API's for trivial cases and encourage their use
  • have different levels API's from easy to use high-level to low-level for fine tuning or extension
  • be designed for extension and customisation

With all limitations I don't think I would be able to use the proposed version in any of my use cases. Neither directly nor by extending it or re-using any of the types.

11 Likes

This proposal has been accepted with modifications, thanks all!

5 Likes

Can you explain why this proposal was accepted without addressing the fundamental usage problems brought up in this review thread? Even by scrolling up the last ~10 posts, there are numerous problems listed that were not addressed by "weakening" the semantics on the returned difference. Changing the name from shortestEditScript() to difference() is pure bike-shedding that doesn't address the real problems.

As it stands, what we have here is an inflexible and non-extensible alternative for doing symmetric differences on two arrays.

So again: why was this proposal accepted without any broadcasted justification for leaving it as-is?

6 Likes

I agree with @davedelong. This seems pretty rushed just to potentially get it into Swift 5?

2 Likes

Aside from any potential usage problems, I'm just concerned that this seems to be a brand-new design which hasn't been tested well-enough.

I'm not saying it isn't well thought-out - just that some kind of incubation period as a library would increase my confidence.

Another option would be for the CollectionDifference type to hide its internal details (and probably also drop Codable for now), so it could possibly evolve to support moves.

2 Likes

There were some issues in the review that the Core Team felt sorted themselves out in the thread and don't need further discussion because the proposal as-written has the right design, such as the question of storing offsets rather than indices.

Much of this discussion here was about making the diffing algorithm more flexible, with a range of solutions from adding a policy parameter ("balanced" vs. "always shortest" vs. "always fast") to introducing protocols that abstract the diffing algorithm itself. The core team discussed these idea of parameterizing the diffing algorithm, but felt that it added complexity (to the implementation, to the API, etc.) for everyone without necessarily serving the needs of every user.

As an example, consider sort: we absolutely could parametrize sort based on various policy decisions (stable vs. unstable, whether it is allowed to allocation additional storage during its execution), and there are users for which the these choices may be critical. However, we have chosen not to expose such policies, opting instead for a simple API backed by an algorithm that works well for most people most of the time.

The presence of sort, or differences(from:), or any other algorithm in the standard library does not preclude the existence of a special-purpose package that does that one thing better. It would be great to have a Swift DiffUtils package that implements multiple algorithms and allows power users to tune their behavior. However, we don't need or want the full complexity of that in the standard library, where it's better to have a simpler API that works for many applications.

Doug

19 Likes

That's not consistent with my recollection. IIRC we were forced to make them unidirectional because Foundation doesn't provide reverse enumeration for bridged NSSet and NSDictionary representations. I remember being really disappointed that we had to make that sacrifice; certainly there's nothing inherent about the Swift-native data structures that makes them unsuitable for bidirectional traversal.

5 Likes