Ordered collection diffing

collection

(Scott Perry) #1

Hello everyone!

I'd like to pitch the formalization of ordered collections as well as the addition of diffing functionality and related types necessary to provide easy creation, representation, and application of ordered collection state transitions.

Formalizing the notion of ordered collections would be accomplished by the addition of an OrderedCollection protocol (and its adoption by suitable types):

@available(swift, introduced: 5.1)
public protocol OrderedCollection : Collection
    where SubSequence : OrderedCollection
{
    func elementsEqual<C>(
       _ other: C, by areEquivalent: (Element, C.Element) throws -> Bool
    ) rethrows -> Bool where C : OrderedCollection
}

extension OrderedCollection {
    public func difference<C>(
        from other: C, by areEquivalent: (Element, C.Element) -> Bool
    ) -> OrderedCollectionDifference<Element>
        where C : OrderedCollection, C.Element == Self.Element
}

extension OrderedCollection where Element: Equatable {
    public func difference<C>(from other: C) -> OrderedCollectionDifference<Element>
        where C: OrderedCollection, C.Element == Self.Element
    
    public func elementsEqual<C>(_ other: C) -> Bool
        where C : OrderedCollection, C.Element == Element
}

extension BidirectionalCollection : OrderedCollection {}
extension CountingIndexCollection : OrderedCollection where Base : OrderedCollection {}
extension Slice : OrderedCollection where Base : OrderedCollection {}
extension UnsafeMutableRawBufferPointer : OrderedCollection {}
extension UnsafeRawBufferPointer : OrderedCollection {}

The difference(from:) method produces an instance of a new type, OrderedCollectionDifference:

@available(swift, introduced: 5.1)
public struct OrderedCollectionDifference<ChangeElement> {
    public enum Change {
        case insert(offset: Int, element: ChangeElement, associatedWith: Int?)
        case remove(offset: Int, element: ChangeElement, associatedWith: Int?)
    }

    public init?<C: Collection>(_ c: C) where C.Element == Change

    public var insertions: [Change] { get }
    public var removals: [Change] { get }
}

extension OrderedCollectionDifference : Collection {
    public typealias Element = OrderedCollectionDifference<ChangeElement>.Change
    public struct Index: Comparable, Hashable {}
}

extension OrderedCollectionDifference.Change: Equatable where ChangeElement: Equatable {}
extension OrderedCollectionDifference: Equatable where ChangeElement: Equatable {}

extension OrderedCollectionDifference.Change: Hashable where ChangeElement: Hashable {}
extension OrderedCollectionDifference: Hashable where ChangeElement: Hashable {
    public func inferringMoves() -> OrderedCollectionDifference<ChangeElement>
}

extension OrderedCollectionDifference: Codable where ChangeElement: Codable {}

A difference could be applied to any compatible instance of RangeReplaceableCollection:

extension RangeReplaceableCollection {
    @available(swift, introduced: 5.1)
    public func applying(_ difference: OrderedCollectionDifference<Element>) -> Self?
}

There's a more complete proposal (including headerdocs) posted as a PR on swift-evolution, and a working prototype is available as a Swift package for anyone who's interested in experimenting with the API.


SE-0240: Ordered Collection Diffing
(Thomas Krajacic) #2

This looks great. Diffing is a really common problem to solve and approaching that in a formal/generic way seems fantastic.

However I am not sure that this should be in the standard library. I'd much rather see this a layer above. At some point the idea of standardized/endorsed packages for certain solutions was proposed and I think this would be a use case exactly for that.


(Alex Curran) #3

I like this! However I'd suggest a small API change:

public func applying(_ difference: OrderedCollectionDifference<Element>) throws -> Self

The pitch PR makes the comment:

Applying a diff to an incompatible base state is an error. applying(_:) expresses this by returning nil.

I'd suggest throwing an Error is more semantic to represent an error, rather than returning nil. It also gives more flexibility on how to handle the error, with try? or try! or do/catch


(Scott Perry) #4

Thanks for the thoughtful feedback!

@tkrajacic: I guess that's up to the standard library folks to decide, but I think there's some value to formalizing ordered collections in the standard library as well as making the difference type more widely available than a package would allow.

@amlcurran: I went back and forth on this a lot last year. throws is ideal for operations where the failure is unpredictable and difficult to defend against, such as failing to delete a file because something changed its permissions. But in the case of applying(_:), there's no reason the function can fail other than programmer error.

Returning nil was a compromise because invoking preconditionFailure() would discourage the use of diffs as boundary types.


(Dave Lee) #5

Would you mind detailing your reasoning?


(Ben Cohen) #6

From my perspective, this proposal fits solidly within the remit of the standard library, which is to provide basic building blocks of standard data structures and algorithms on them. This introduces a useful currency type for ABI-stable APIs (it's common that an API needs to talk in terms of differences) as well as an often-needed capability to diff two ordered structures.

Per the usual evaluation criteria:

  • Is it a common need? I'd say yes. Diffing strings or arrays or arrays of strings is a common need. We have quite a few requests for it in radar. Probably more common than other algorithms usually found in standard libraries like permutations or combinations (which I also think deserve inclusion in the std lib).
  • Is it non-trivial/hard to do correctly? Heck yeah.
  • Does it avoid a performance trap? Most naïve attempts at this, using looping and calls to contains, have terrible complexity.
  • Can it be implemented more efficiently in the standard library? Yes, specifically adding customized implementations for String in particular, taking advantage of the string's internal knowledge of its storage and the way Swift's strings implement canonical equivalence.

(Thomas Krajacic) #7

I just re-read https://developer.apple.com/documentation/swift/swift_standard_library to see if my own subjective idea about what should make up the standard library somehow fits, and I have to admit, that the proposal here fits actually quite well.

I have been wrong, and I am changing my opinion about this based on the statement in the documentation.

In my previous opinion I had the feeling that the functionality is maybe a bit too specialized for broad usage and hence not the perfect fit.


(Joe Groff) #8

There's an interesting optimization possible for tree structures built from COW containers in too, where we can short-circuit diffing when we know the underlying identities of two values' COW buffers is the same. If one operand of a diff is an in-place modification of the other, this should mean you only have to diff the changed nodes.


(Michael Ilseman) #9

Big +1 to this, look awesome!

I have a few questions / nits:

The Swift standard library lacks a protocol for operations that are only valid over Collection types with a strong sense of order, such as elementsEqual(_:) . The correctness of diffing operations also depend on order, so this proposal formalizes the characteristic with a new protocol, OrderedCollection

Currently, the closest thing we have to an ordered Collection is BidirectionalCollection. Could you compare/contrast OrderedCollection with BidirectionCollection?

  • Should all Bidi collections conform to Ordered, or vice-versa?
    • E.g., all of the concrete types that you conform are also BidirectionalCollections
  • How/why would someone want to write code generic over Ordered, but not over Bidirectional, or vice versa?

Because indexes cannot be navigated in the absence of the collection instance that generated them, a diff based on indexes instead of offsets would be much more limited in usefulness as a boundary type. If indexes are required, they can be rehydrated from the offsets in the presence of the collection(s) to which they belong.

What do you mean by "boundary type"?. I'm asking because for collections such as String, converting an offset to and Index is O(n) operation, which is unfortunate if you're applying this to the original string the offsets came from.


(Joe Groff) #10

Another nice thing we could eventually use OrderedCollection for is as the protocol for collection pattern matching in switch statements, e.g.:

switch [1, 2, 3] {
case [let first, let rest...]:
  ...
}

Any collection could support pattern matching this way using the index and slicing operations Collection defines, but we'd only want it to work on collections with a significant order.


(Matthew Johnson) #11

I really like this idea!


(Pierpaolo Frasa) #12

In the context of failing tests with complex value objects I've often found myself wanting to see a diff output instead of just "this huge object is not equal to that huge object".

I wonder if a more general Diffable protocol (that could be synthesized e.g. for structs) might be useful, not just for collections. But maybe this is outside of the scope of this.


(Scott Perry) #13

That's a very good point. BidirectionalCollection should probably conform to OrderedCollection.

It's possible to have an ordered collection that is not bidirectional (such as a singly-linked list). For such code I think it would make sense to define functionality in Ordered but provide specialized implementations in Bidirectional.

A type that is used in APIs to transmit data across a link boundary without dependencies—a diff based on indexes would not be useful in the absence of the instances that produced it for ordered collections that use opaque index types.

I adopted the terminology from an excellent talk about using value types as boundaries between components.

Applying a diff can be done as a single pass operation, so the indexes can be rehydrated without increasing the algorithmic complexity. Index rehydration is a valid concern though, and I'm still considering the future introduction of a sibling type for more temporary use if there is sufficient demand.


(Michael Ilseman) #14

Would we also want structural matching-from-the-end for Bidi collections? E.g.:

switch [1, 2, 3] {
case [let prefix..., let last]:
  ...
}

(Joe Groff) #15

It'd be reasonable, as long as you don't have multiple ... subpatterns.


(Matthew Johnson) #16

This syntax also suggests binding more than one value along with rest:

switch [1, 2, 3] {
case [let first, let second, let third, let rest...]:
  ...
}

(Rex) #17

I don't see an OrderedDictionary in the proposal. Intentional? Would that require a separate proposal?


(Ben Cohen) #18

Yup. Both ordered and sorted sets and dictionaries would make for good proposals though, but separate to this one.


(Jonathan Hise Kaldma) #19

This looks great!

However, I don't understand associatedWith. The doc comment states that it is "the offset of the complementary change." What does that mean?


(Scott Perry) #20

Associated changes give the diff type the ability to express moves:

let diff = [0, 1, 2].difference(from:[2, 0, 1])

print(diff)
// OrderedCollectionDifference<Int>(
//     insertions: [Diffing.OrderedCollectionDifference<Swift.Int>.Change.insert(offset: 2, element: 2, associatedWith: nil)],
//     removals: [Diffing.OrderedCollectionDifference<Swift.Int>.Change.remove(offset: 0, element: 2, associatedWith: nil)]
// )

print(diff.inferringMoves())
// OrderedCollectionDifference<Int>(
//     insertions: [Diffing.OrderedCollectionDifference<Swift.Int>.Change.insert(offset: 2, element: 2, associatedWith: Optional(0))],
//     removals: [Diffing.OrderedCollectionDifference<Swift.Int>.Change.remove(offset: 0, element: 2, associatedWith: Optional(2))]
// )

Associations aren't widely used outside inferringMoves() in this proposal but they enable functionality that's useful to adopters and future proposals, like a function that records a collection's mutations (using associations to track move and replace operations), or a UI that uses associations to determine what animations it should use when applying changes.