SE-0240: Ordered Collection Diffing

numist · January 17, 2019, 7:12pm

"script" is a term of art, and your definition is exactly correct

This would work, but without Hashable conformance inferringMoves() would be O(n²), and taking a closure for both == and hash(into:) is pretty messy API. Boxing (or unwrapping) would be more readable and achieve best performance.

Honestly the problem of identity is one that I've grappled with constantly while designing this API, which has been a lot longer in the making than most people realize. Philosophically, I think it's best to avoid specialized API for serving identity types until the concept of identity is better formalized in the language itself. In the meantime my demos have used arr.map { $0.identity } to get around this and they've worked pretty well.

Not exactly, but you're on the right track! "Position" and "offset" are extremely related concepts, but offsets are vulnerable to shifting caused by insertions and removals of earlier elements. This is the most common reason people have trouble with representing (and using) diffs.

Ultimately to determine if an element moved from its start and end offsets, you need to take into account the number of insertions before its final offset minus the number of removes before its original offset (this is one reason why the diff type exposes sorted insertions and removals properties—they allow people to answer the question of position sameness in logarithmic time).

That said, your observation that associated offsets are capable of expressing replacement is exactly right and deserves restating: element and position differences between associated changes are two independent bits capable of expressing:

element moved (same element, different position)
element replacement/change (different element, same position)
element moved and replaced/changed (different element, different position)

My primary intention for this proposal was to introduce a diff type capable of standing alone as an interchange format between modules. Omitting the elements entirely requires that a collection be passed alongside the diff in most cases since the purpose of consuming a diff is usually either storage or application. It also reduces the expressiveness of associations (as described above).

If elements are especially unwanted, the diff type can still be used by making it generic on Optional<T> and an instance initialized from any sane collection of Changes with nil elements.

In the pitch, the function was named difference(from:) which I quite liked, but it was renamed due to the output semantics of LCS/SES (which the current implementation uses) which is specifically different from a more visually appealing (but more computationally demanding) solution space.

Ultimately even shortestEditScript(from:) may run in polynomial time so I'm still interested in the right compromise here. Currently my best ideas are either that difference(from:) should defer to a faster (but less LCS-conforming) algorithm when runtime complexity exceeds some limit or shortestEditScript(from:) should provide another parameter that allows the user to limit the runtime complexity in exchange for no result.

My instinct is that the former is more desirable for v1 of a diffing API, since the API can be extended in the future.

As usual, Dictionary and Set haunt us. The pitch attempted to resolve this by introducing OrderedCollection, which attracted some convincing feedback that resolving that tension was best left to another proposal.

I hope that there will be a resolution so the diffing API can be lowered to the appropriate level, but until then the largest common denominator of all the types that were to be granted conformance to OrderedCollection was BidirectionalCollection. This is useful since a variant of Myers (and some other diffing algorithms) can achieve significant speedups by running bidirectionally, producing a result when the algorithms meet somewhere in the middle.

Philosophically, I think that API should be easy and safe to use, expressive, and reserve the maximum amount of potential for future improvement. Which is to say: I agree that the BidirectionalCollection requirement is not perfect, but I would argue that it is good because it can be lowered in the future.

Assuming that Change would be renamed to something appropriate to avoid confusion with changes in other domains than "ordered" (e.g. for "keyed" or "multidimensional" (treelike) collections), I'd still argue that there is a lot of value in the standardized interpretation and validation of batch changes provided by OrderedCollectionDifference:

    /// To guarantee that instances are unambiguous and safe for compatible base
    /// states, this initializer will fail unless its parameter meets to the
    /// following requirements:
    ///
    /// 1) All insertion offsets are unique
    /// 2) All removal offsets are unique
    /// 3) All offset associations between insertions and removals are symmetric

While there's nothing stopping anyone from using collections of OrderedCollectionDifference<T>.Change for their own purposes, the guarantees above turn out to be extraordinarily useful for validation, minimizing representations, and maximizing the fidelity of Equatable—lessons learned the hard way by anyone who's written batch code against both NSTableView and UITableView.