SE-0240: Ordered Collection Diffing

SDGGiesbrecht · January 15, 2019, 9:55pm

Such a protocol‐based approach (which reminds me of the randomization API) would provide the entry points to inject solutions to any or all of the 4 issues I mentioned earlier, while still using the standard library API and taking advantage of whichever parts it does do well for the particular use case.

When general collections are swapped in to make a real API, I would prefer the two collections not need to be the same type. Pairs like String and Substring should be usable without conversion.

(Edit for the sake of clarity: The following is a separate thought from the above.)

Some of the performance‐related comments seem to assume the difference will only ever be computed with the intention of actually applying it to something. The implementation should watch that it is still efficient even if it will never be applied. Most of what I would use the API for is concerned only with finding out/reporting/analyzing what changed, and never intending to reproduce the change.

beccadax · January 16, 2019, 1:02am

What is your evaluation of the proposal?

It's an awkward fit for some use cases

There's a ton to like about this proposal. A few years ago, I wrote an app whose view controllers diffed its models in didSet observers using a heavily modified version of ArrayDiff.swift to drive UI updates. This approach was so elegant and so successful that I ended up porting this code forward through several Swift versions and incorporating it into several different projects.

However, the proposed API is not particularly good for that task. For view updating, you need an algorithm which distinguishes record identity from equality and uses both. This would also be useful for several other use cases, such as undo/redo and syncing, which may care about the size of differences and want to dig into sub-objects to find the smallest possible diffs.

The proposed feature isn't very good at doing that, although it can probably be used as a building block to get that (by diffing and inferring moves for x.lazy.map { $0.id } instead of just x, mapping each OrderedCollectionDifference<X.ID>.Change back to an OrderedCollectionDifference<X>.Change, and equality-comparing the untouched elements to see if they need to be updated). Still, it would be nice to have a "batteries included" API which supported these use cases more directly. Or to at least include a method like inferringMoves(by:hashingWith:) which can be used to do identity diffing with fewer clumsy conversion steps.

Other issues

I kind of hate `associatedWith`

This representation of moves—an insertion/deletion pair where associatedWith contains the offset in the other part of the pair—is super-awkward. It's not clear how to translate it into calls to an API like UITableView.moveRow(at:to:)—do you process the remove? the insert? Both? Separately from them?
If shortestEditScript(from:) never fills in the field, why does it return a type which includes it? I'd prefer to see two separate types for "insertions and deletions only" vs. "insertions, deletions, and moves". shortestEditScript(from:) should return the type without move information; inferringMoves() should return the type with move information.

The names need work

I don't want to get bogged down debating specific alternatives, but I want to point out some issues with these ones:

The terminology is all over the place. shortestEditScript(from:) returns an OrderedCollectionDifference which contains Change instances. Why do none of these terms have anything to do with each other? Shouldn't shortestEditScript(from:) return an EditScript of Edits, or differences(from:) return an OrderedCollectionDifference containing Differences?
OrderedCollectionDifference is unnecessarily wordy. We don't care that it's Ordered and users don't care about the Collection protocol. What they care about is that this is a bunch of changes which should be applied together. Call it ChangeSet or Differences or EditScript or what-have-you. Or put two typealiases on Collection so you can call the types Array<Int>.Edit and Array<Int>.EditScript.
If associatedWith is kept, its name at least should not be. The name is content-free, giving no indication of how the integer provided is "associated with" the change. Labels like movedToOffset and movedFromOffset would convey the field's meaning more clearly.

Like I said, I don't want to debate exact replacements; I think the core team and proposal authors can select better ones without a giant thread full of bikeshedding.

Miscellaneous

It may make sense to have Change conform to Comparable with an implementation which matches the sort order of OrderedCollectionDifference. At that point, OrderedCollectionDifference might be better generalized into some kind of SortedSet or SortedArray, because that's all it really is. Or just return an [Array] and document that elements are in sorted order, and that this is the order it's usually safest to apply them in.
This is probably an outright mistake: OrderedCollectionDifference.Change should be Codable. There's no reason you should need to use OrderedCollectionDifference to code an individual Change.

All that said...

I think this design is basically workable—just not ideal for some of the use cases it's likely to be used for.

Is the problem being addressed significant enough to warrant a change to Swift?

Yes. We should encourage the use of diffing-driven UI updates, and this would help do that.

Does this proposal fit well with the feel and direction of Swift?

Very. It supports a powerful use of value semantics. It doesn't have to be in the standard library, but it will be great to have there.

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

I haven't done anything similar in another language.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Quick reading. I read the pitch thread but unfortunately didn't get around to replying.

davedelong:

I could see the API looking something like this (I'm using Array as the collection type for brevity):
protocol DiffChange { }

protocol DiffingAlgorithm {
    associatedtype ChangeType: DiffChange
    func compute<T>(_ source: Array<T>, destination: Array<T>, elementsEqual: (T, T) -> Bool) -> Array<ChangeType>
}

If ChangeType is opaque, I'm not sure how useful this protocol would actually be. You can get an array of changes, but what can you actually do with those changes? You can't, for instance, replay the changes to a UITableView without analyzing the ChangeType values.

Perhaps your sketch leaves out details that you imagine would be present in the real API, but it's hard to judge your suggestion without those details.

Correct. My diffing code ended up using offsets for this very reason: the indices in your diff become invalid as soon as you apply the first change.

anandabits · January 16, 2019, 1:51am

What is your evaluation of the proposal?

Like many of the other commenters, I very much support the idea of including collection diffing algorithm(s) in the standard library, but I also have some reservations about the proposed design.

Specifically, I think the support for changes (incorporating a model for identity independent of equality) and moves is inadequate and storing elements themselves in the change model is unnecessary (it isn't clear to me what the purpose of that is and the copies seem unnecessary). I would prefer to see the change type model moves and changes directly by default with support for projecting to the primitive insert / delete representation. The insert / delete representation requires two entries in the difference set for changes and moves which would be quite awkward to work with if you need to extract change / move information. It is much easier to go the other way and requires less entries in the difference collection.

I am also not a fan of the name shortestEditScript. The proposal says this is a term of art, so maybe it is the best name but I don't find it especially descriptive. Something along the lines of changes(relativeTo:) seems like a more descriptive approach to naming this operation.

Beyond this, I will just +1 everything @beccadax said above.

Is the problem being addressed significant enough to warrant a change to Swift?

Yes

Does this proposal fit well with the feel and direction of Swift?

Yes

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

N/A

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I read the proposal and the review thread and am quite familiar with modeling collection changes in order to drive UI updates.

CTMacUser · January 16, 2019, 9:57am

I'll hold off at the moment, since a concern would make me give it a big ol' negative-infinity. Mainly because of this:

Collections can only be efficiently diffed when they have a strong sense of order, so difference production is added to BidirectionalCollection

We have a protocol for collections with a strong sense of order Collection (or Sequence if you don't need indexing support)! Don't upshift algorithms to BidirectionalCollection as a "solution" for the Dictionary-and-Set-aren't-really-Sequences problem if the algorithm doesn't actually need bidirectional traversal. This is a "bug" on Dictionary and Set, and we cannot let a solution sneak into the library; in the worst case, we would have to completely rebuild this library once a proper solution goes in.

So, downgrade shortestEditScript(from: by:) to Collection (or Sequence) if it doesn't really need BDC support. Extending from that, I don't think OrderedCollectionDifference is needed; any collection of Change should do. Past those two obstacles, I can see a decent API peeking out:

enum DeltaAtom<ChangeElement> {  // Formerly OrderedCollectionDifference.Change
    case insert(offset: Int, element: ChangeElement, associatedWith: Int?)
    case remove(offset: Int, element: ChangeElement, associatedWith: Int?)

    var isInsertion: Bool {  // These could be internal
        guard case .insert = self else { return false }
        return true
    }
    var isRemoval: Bool {
        guard case .remove = self else { return false }
        return true
    }

    static func inEnumeraitonOrder(lhs: DeltaAtom, rhs: DeltaAtom) -> Bool {
        switch (lhs, rhs) {
        case (.remove, .insert):
            return true
        case (.remove(let lo, _, _), .remove(let ro, _, _)):
            return lo > ro
        case (.insert(let lo, _, _), .insert(let ro, _, _)):
            return lo < ro
        case (.insert, .remove):
            return false
        }
    }
}

extension MutableCollection where Element == DeltaAtom<Any> {  // Yikes, I forgot how to actually specify this.
    mutating func validateChanges() -> Index? {
        let insertsStart = partition(by: { $0.isInsertion })
        /* ... */
        return insertsStart  // or nil if invalid
    }
}

extension Collection where Element == DeltaAtom<Hashable> {  // Here too
    func inferredMoves<Delta: RangeReplaceableCollection>() -> Delta where Delta.Element == Element {
        return /* ... */
    }
}

extension Sequence {  // Could this be lazy too?
    func applying<Result: RangeReplaceableCollection, Delta: Collection>(delta: Delta) -> Result? where Result.Element == Element, Delta.Element == DeltaAtom<Element> {
        return /*...*/
    }
}

extension Sequence {  // Same here on laziness
    func shortestEditScript<Delta: RangeReplaceableCollection, C: Collection>(from other: C, by areEquivalent: (Element, C.Element) -> Bool) -> Delta where C.Element == Element, Delta.Element == DeltaAtom<Element> {
        /*...*/
    }
}

Karl · January 17, 2019, 5:10pm

-1

I'm not sure if this really belongs in the standard library. Originally we had the goal of keeping the standard library minimalist, and despite diffing being important for many applications and difficult to get right, I'm not sure if it really meets that threshold of being 'essential'. Why can't it be a 3rd-party package?

In fact, there are already 3rd-party packages for this. For instance, I've used the 'Differentiator' library in the past. It's from the RxSwift team but has been separated from their other work so you don't need to use reactive programming with it. As I understand it, the process for standard library evolution was supposed to be that we start as a 3rd-party package (whenever possible), collect experience, then eventually discuss absorbing that in to the stdlib. With that in mind, and if this does belong in the stdlib, I would suggest taking a look at the battle-tested RxSwift design instead of creating our own thing from scratch.

In particular, they have a protocol to represent record identity. I think that's an essential part of what people want to do with diffs, and would support adding that protocol (or a similar one) to the standard library.

numist · January 17, 2019, 7:12pm

"script" is a term of art, and your definition is exactly correct

This would work, but without Hashable conformance inferringMoves() would be O(n²), and taking a closure for both == and hash(into:) is pretty messy API. Boxing (or unwrapping) would be more readable and achieve best performance.

Honestly the problem of identity is one that I've grappled with constantly while designing this API, which has been a lot longer in the making than most people realize. Philosophically, I think it's best to avoid specialized API for serving identity types until the concept of identity is better formalized in the language itself. In the meantime my demos have used arr.map { $0.identity } to get around this and they've worked pretty well.

Not exactly, but you're on the right track! "Position" and "offset" are extremely related concepts, but offsets are vulnerable to shifting caused by insertions and removals of earlier elements. This is the most common reason people have trouble with representing (and using) diffs.

Ultimately to determine if an element moved from its start and end offsets, you need to take into account the number of insertions before its final offset minus the number of removes before its original offset (this is one reason why the diff type exposes sorted insertions and removals properties—they allow people to answer the question of position sameness in logarithmic time).

That said, your observation that associated offsets are capable of expressing replacement is exactly right and deserves restating: element and position differences between associated changes are two independent bits capable of expressing:

element moved (same element, different position)
element replacement/change (different element, same position)
element moved and replaced/changed (different element, different position)

My primary intention for this proposal was to introduce a diff type capable of standing alone as an interchange format between modules. Omitting the elements entirely requires that a collection be passed alongside the diff in most cases since the purpose of consuming a diff is usually either storage or application. It also reduces the expressiveness of associations (as described above).

If elements are especially unwanted, the diff type can still be used by making it generic on Optional<T> and an instance initialized from any sane collection of Changes with nil elements.

In the pitch, the function was named difference(from:) which I quite liked, but it was renamed due to the output semantics of LCS/SES (which the current implementation uses) which is specifically different from a more visually appealing (but more computationally demanding) solution space.

Ultimately even shortestEditScript(from:) may run in polynomial time so I'm still interested in the right compromise here. Currently my best ideas are either that difference(from:) should defer to a faster (but less LCS-conforming) algorithm when runtime complexity exceeds some limit or shortestEditScript(from:) should provide another parameter that allows the user to limit the runtime complexity in exchange for no result.

My instinct is that the former is more desirable for v1 of a diffing API, since the API can be extended in the future.

As usual, Dictionary and Set haunt us. The pitch attempted to resolve this by introducing OrderedCollection, which attracted some convincing feedback that resolving that tension was best left to another proposal.

I hope that there will be a resolution so the diffing API can be lowered to the appropriate level, but until then the largest common denominator of all the types that were to be granted conformance to OrderedCollection was BidirectionalCollection. This is useful since a variant of Myers (and some other diffing algorithms) can achieve significant speedups by running bidirectionally, producing a result when the algorithms meet somewhere in the middle.

Philosophically, I think that API should be easy and safe to use, expressive, and reserve the maximum amount of potential for future improvement. Which is to say: I agree that the BidirectionalCollection requirement is not perfect, but I would argue that it is good because it can be lowered in the future.

Assuming that Change would be renamed to something appropriate to avoid confusion with changes in other domains than "ordered" (e.g. for "keyed" or "multidimensional" (treelike) collections), I'd still argue that there is a lot of value in the standardized interpretation and validation of batch changes provided by OrderedCollectionDifference:

    /// To guarantee that instances are unambiguous and safe for compatible base
    /// states, this initializer will fail unless its parameter meets to the
    /// following requirements:
    ///
    /// 1) All insertion offsets are unique
    /// 2) All removal offsets are unique
    /// 3) All offset associations between insertions and removals are symmetric

While there's nothing stopping anyone from using collections of OrderedCollectionDifference<T>.Change for their own purposes, the guarantees above turn out to be extraordinarily useful for validation, minimizing representations, and maximizing the fidelity of Equatable—lessons learned the hard way by anyone who's written batch code against both NSTableView and UITableView.

SDGGiesbrecht · January 17, 2019, 9:01pm

I think this is why our design priorities differ so vastly. My use cases have been primarily concerned with discovering changes caused by some external source in order to analyze or flag them. I am starting to see how your design makes a lot of sense for what you want to use it for. I’m now wondering if between what each of us thinking, the name and the underlying LCS algorithm are really all that’s shared, since the rest of the implementation would need to be designed quite differently for each case. So from now on I will try to put myself in your shoes and attempt to assess it solely based on the purposes you have in mind.

The first thought that jumps at me is what flags when you say this:

This sort of logic is only reliable for Array and its immediate relatives (ArraySlice, ContiguousArray, etc.) which have such semantics. You cannot expect offsets to remain valid after an insert or removal on any arbitrary BidirectionalCollection. A collection may make additional guarantees about it’s contents or order and perform ”fixes“ after each mutation to maintain that guarantee. I’ve seen things like SortedCollection and NonEmptyCollection out in the wild. I regularly use an NFDString. Indeed, even the Standard Library’s String type could produce a broken edit script in this case:

let start = "िकख"
let end = "नकग"

// Depending on the exact order of the inserts and removals in the script,
// this could trap on due to an out‐of‐bounds offset:
let reapplied = start.applying(end.shortedEditScript(from: start))

// And if it’s lucky and hasn’t trapped yet,
// this assertion likely does not hold:
assert(reapplied == end)

Whereas here we accomplish the same thing in a completely safe manner, because Array maintains its offsets:

let start = Array("िकख")
let end = Array("नकग")
let reapplied = String(start.applying(end.shortestEditScript(from: start)))
assert(reapplied == end)

The same gotcha and fix hold true for any of the collections mentioned above.

There are a few possible solutions:

Drop BidirectionalCollection and go only with Array. Users can convert like the fix above.
Provide a new protocol that makes the same guarantees as Array and restrict it to that. Users can conform their types where reasonable.
Accept any BidirectionalCollection, but use Array internally for safety while applying the script.
Require the presence of both start and end, never mutate them while constructing new, have the script offsets refer to positions in start and end, and have the script’s order represent that of new, such that the only operation ever done during construction is append() with elements fished from either start or end. (This is the strategy implemented by the library I use now.)

It strikes me that ~~all of these~~ three of these options erase the advantage of offsets over indices. ~~Option 2 would even be faster with direct indices (e.g. less math for ArraySlice to do in its conformance).~~ Edit: Crossed out some nonsense. ArraySlice is precicely why an offset would still be necessary for option 2.

numist · January 17, 2019, 11:16pm

This is basically how the applicator in the prototype works right now—diffing requires BidirectionalCollection and expresses offsets in terms of offsets into start and end, and applying(:) requires a base state that is isomorphic to start and conforms to RangeReplaceableCollection so the result can be efficiently constructed in terms of ranges from beginning to end.

Brad_Hilton · January 18, 2019, 4:33am

I like the idea, but like others have said I don't think it solves the problems most people are trying to solve and I'm not sure how I feel about this being in the standard library. Just my two cents.

John_McCall · January 18, 2019, 6:22am

My experience writing and using utilities like this is that people think they want the minimum edit script until they actually use it on a large collection with complicated differences (e.g. if you're a low-level serializer that doesn't know that the user just clicked "sort"), and then they remember that quadratic algorithms are bad. The almost-naive algorithm of stripping a common prefix and suffix and then just replacing one list with the other is often good enough and has a far favorable asymptotic bound. Fortunately, the Myers algorithm is pretty conducive to accepting an arbitrary bound on the complexity of the resulting edit script — if you haven't found a solution by iteration k, you abandon the search and do the simple thing. (Of course you want to strip the prefix and suffix as a pre-flight before getting into Myers in the first place.)

I would suggest taking an (optional?) parameter that bounds the complexity of the diff. Contra some of the other suggestions, I don't think this needs to be a rich interface, and instead I would allow exactly three possibilities:

fastest (guaranteed O(n)): strip the ends, maybe try a quick heuristic or two if that doesn't resolve it
balanced (guaranteed O(n log n)): strip the ends, compute a kMax that's O(log(n)) for the remaining n, iterate Myers through at most that many rows
smallest (guaranteed O(n * d)): strip the ends, run Myers to completion

davedelong · January 18, 2019, 4:35pm

beccadax:

davedelong:
I could see the API looking something like this (I'm using Array as the collection type for brevity):
protocol DiffChange { }

protocol DiffingAlgorithm {
   associatedtype ChangeType: DiffChange
   func compute<T>(_ source: Array<T>, destination: Array<T>, elementsEqual: (T, T) -> Bool) -> Array<ChangeType>
}
If ChangeType is opaque, I'm not sure how useful this protocol would actually be. You can get an array of changes, but what can you actually do with those changes? You can't, for instance, replay the changes to a UITableView without analyzing the ChangeType values.

Perhaps your sketch leaves out details that you imagine would be present in the real API, but it's hard to judge your suggestion without those details.

I've taken some time to work up what I think a better API would be, which I've uploaded here: GitHub - davedelong/SE-0240: An alternative API for Swift collection diffing

The core of it is the DiffingStrategy protocol:

public protocol DiffingStrategy {
  associatedtype Change
  associatedtype Element
  associatedtype Changes = Array<Change>

  func apply<C: Collection>(changes: Changes, to collection: C) → Array<Element> where C.Element == Element

  func compute<C1: Collection, C2: Collection>(source: C1, destination: C2) -> Changes where C1.Element == Element, C2.Element == Element
}

A DiffingStrategy takes two collections of values, and computes the difference (or "edit script", I guess?) between them and returns that. Since the kind of edit (Change) is an associated type on the protocol, that means that different conformances can have different kinds of changes. By way of example, the repository linked to has a simple implementation of Wagner-Fisher (insert/replace/delete), Wagner-Fisher-with-inferred-moves (insert/replace/delete/move), SetMutations (insert/delete), DictionaryMutations (add/remove/replace), and Wagner-Fisher-with-identity (insert/replace/delete/reload).

This has several advantages:

The standard library can vend "preferred" implementations, such as the Myer implementation
There can be type-specific algorithms. In the repo, the SetMutations strategy is only usable on Sets (or set-like things) because it requires its diffable elements to be Hashable
You can provide your own custom implementations, such as a "Wagner-Fisher with inferred moves and identity checking". The explicit nature of the protocol means that you can take advantage of composing implementations.
It lends itself well to swapping in-and-out implementations based on developers' needs; as @John_McCall pointed out, sometimes computing the minimum script is more work than you are really interested in doing.

Edited to add: As for the scenario of "what if the user doesn't want to know what the explicit strategy used is", consider these extensions to Collection:

public extension Collection {
    func computeChanges<C: Collection, D: DiffingStrategy>(to other: C, using strategy: D) -> D.Changes where C.Element == Element, D.Element == Element {
        return strategy.compute(source: self, destination: other)
    }
}

public extension Collection where Element: Equatable {
    // this method would use the "preferred" strategy to compute changes
    // in this example, that's the basic Wagner-Fisher algorithm, along with whatever results it produces
    public func computeChanges<C: Collection>(to other: C) -> WagnerFisher<Element>.Changes where C.Element == Element {
        return computeChanges(to: other, using: WagnerFisher())
    }   
    public func computeChangesInferringMoves<C: Collection>(to other: C) -> WagnerFisherMoves<Element>.Changes where C.Element == Element {
        return computeChanges(to: other, using: WagnerFisherMoves())
    }
}

SDGGiesbrecht · January 18, 2019, 7:33pm

Yes, that will work safely. Quite clever how you even avoided the need for a reference to end. That shortcut hadn’t occurred to me. Good job on that part.

lorentey · January 22, 2019, 4:36am

What is your evaluation of the proposal?

I like it!

I agree with @beccadax that the proposed names could be more consistent. shortestEditScript(from:) does not scream difference to me, making it harder to spot in e.g. contextual code completion lists. My favorite name for this diffing variant would be shortestDifference(from:).

I have pedantic reservations about having the OrderedCollectionDifference type itself conform to Collection. My feeling is that these APIs can (and probably will) get rich enough that people won’t typically need to manually look at individual changes in a difference, and as @sveinhal pointed out, Collection’s extensions can be confusing in this context. I think it would be slightly better to expose the full collection of changes as an explicit changes property, returning a nested view type (with the same collection implementation). This would reduce the OrderedCollectionDifference API surface to the handful of members that are directly relevant to differences, which would help with their discoverability. There is more than one way to enumerate changes in a diff — so it makes sense to highlight which ordering is used by making it an explicit choice.

Is the problem being addressed significant enough to warrant a change to Swift?

Yes. It would be extremely useful to have a standard boundary type for passing around ordered differences.

Having a stdlib-provided diffing algorithm is also helpful, especially given how tricky it is to implement. But this is a lot less important than standardizing the diff representation: diffs can be usefully created, merged, transformed, propagated and consumed without the need to ever compare collections.

Does this proposal fit well with the feel and direction of Swift?

Yes. There are two points that seem non-obvious at first: the use of BidirectionalCollection and the use of offsets rather than indices. I agree these are the correct choices:

Using BidirectionalCollection as a stand-in for OrderedCollection seems like a pragmatic choice. Bidirectionality on its own has nothing to do with whether the ordering is meaningful, but BidirectionalCollection does somewhat imply it in practice. E.g. Set and Dictionary aren’t bidirectional in Swift 3+ by deliberate choice rather than any inherent technical difficulty in implementing index(before:). So this choice makes sense even if we never introduce an OrderedCollection protocol.

The use of offsets rather than indices seems unavoidable in a general-purpose ordered diff type. Indices violate value semantics for collections in that they are generally only valid for use in the particular collection value that generated them (and its verbatim copies). (E.g., indices in ”café” aren’t necessarily valid in ”cafe\u{301}”, even though the two strings are equal.) Using offsets avoids these pitfalls, although it might slow down some usecases.

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

In my own implementations, I kept the insert/remove and insert/remove/move/update variants as separate types. I was initially surprised to see moves/updates represented by optional associatedWith payloads, but I’ve grown to like this approach.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I was involved in some pre-pitch discussions and I closely followed the discussion on these forums.

SDGGiesbrecht · January 22, 2019, 5:35am

I’ve been gradually warming to the API over the course of the thread.

These didn’t matter to me because of guarantees provided by my own context. But I can see so many ways a developer could write broken code on top of this without knowing. You’ve convinced me offsets are the better default.

I’ve also looked closer at the prototype implementation and reconsidered my existing use cases, asking how I have might have written them if this API had already been available. I believe I can get the same level of performance for all the same use cases just by refactoring at the usage site. It just requires thinking differently. So in the end I would probably not use any of the customization I asked for upthread after all.

I consider all my concerns answered and my vote is now a in favour of the API more‐or‐less as proposed. (I’ll stay out of the naming discussion. Any variation is fine with me.) Thank you for writing it @numist, and for patiently pushing back when I started asking questions.

numist · January 22, 2019, 7:13am

Thank you for the thoughtful feedback! In general this thread has been very effective at getting to the most difficult decisions we had to make when developing this proposal, which is a very good sign.

The most common piece of feedback remaining in this thread is that shortestEditScript(from:) is an unwieldy name for a method. This is true, and wasn't the case with the API that was pitched. After chatting with @Douglas_Gregor and other members of the team, I've pushed a PR that renames the method back to difference(from:).

Maksim_Odnoletkov · January 23, 2019, 3:36am

I've worked on several diff-related problems before. In my experience the subject area is just too broad to be covered by a universal solution without significant compromises. I see the following practical problems with proporsal:

I often needed both bounded and unbounded versions of the algorithm. For some uses there may be a good fallback when the edit script is getting too long – think animating model updates in the UI or fallback to reload. In other cases length of the edit script can be used to control behaviour – like "when too many changes just reset" logic. But when the diff is part of the core logic, you want to always calculate it no matter what.
In general there are multiple shortest edit paths to choose from. Even more broadly – sometimes you want not the shortest but the cheapest path on the weighted edit graph. For the canonical example of updating table view one great criteria is the range of currently visible items – using it you can achieve "prefer updating non-visible parts" and similar behaviour.
There are several practical versions of the algorithm with widely varying performance characteristics:
- Basic Myers version which is proposed (quadratic space by edit distance)
- "A Linear Space Refinement" from the original article
- Refinements from the xdiff implementation used in Git
For some popular trivial cases faster algorithms could be used:
- For sorted collections there is linear lazy algorithm
- When external index is available for collections
- When only edit distance is needed and not the edit script
There are many edit script representations suitable for different use cases:
- ranges vs single elements
- indices vs elements
- providing unchanged items
- handling moves
- identity vs equality

Given all these different use cases and requirements I think the task is not for a stdlib-level API but for a full featured framework/module which can:

have unlimited API surface to cover all use cases
provide distinct API's for trivial cases and encourage their use
have different levels API's from easy to use high-level to low-level for fine tuning or extension
be designed for extension and customisation

With all limitations I don't think I would be able to use the proposed version in any of my use cases. Neither directly nor by extending it or re-using any of the types.

Douglas_Gregor · January 31, 2019, 4:25am

This proposal has been accepted with modifications, thanks all!

davedelong · January 31, 2019, 3:49pm

Can you explain why this proposal was accepted without addressing the fundamental usage problems brought up in this review thread? Even by scrolling up the last ~10 posts, there are numerous problems listed that were not addressed by "weakening" the semantics on the returned difference. Changing the name from shortestEditScript() to difference() is pure bike-shedding that doesn't address the real problems.

As it stands, what we have here is an inflexible and non-extensible alternative for doing symmetric differences on two arrays.

So again: why was this proposal accepted without any broadcasted justification for leaving it as-is?

tkrajacic · January 31, 2019, 4:15pm

I agree with @davedelong. This seems pretty rushed just to potentially get it into Swift 5?

Karl · January 31, 2019, 4:28pm

Aside from any potential usage problems, I'm just concerned that this seems to be a brand-new design which hasn't been tested well-enough.

I'm not saying it isn't well thought-out - just that some kind of incubation period as a library would increase my confidence.

Another option would be for the CollectionDifference type to hide its internal details (and probably also drop Codable for now), so it could possibly evolve to support moves.