[Pitch] Remove duplicate elements from a collection

collection
pitch

(Dale Buckley) #21

Thanks for the replies everyone, I've now had a few nights to sleep on this. Considering what @AlexanderM said, and having spoken to @zachrwolfe about the implementation, it does make sense to make Hashable a requirement. I was looking for other possibilities, but in this case it's just the nature of the beast.

I've updated the draft proposal to reflect those changes and added an implementation that seems to run pretty well.

I did play with the idea of unique() myself, but it felt like saying the array itself was unique, not the elements within it. That's why I took the naming idea from other languages and stuck with distinct. I'm open to changing it to unique() if others feel the same, but that's the reason I settled on the naming that I did.

This is also something I've been wrestling with, as you're right: it does make sense to be chainable like that. I think the capability makes sense, but the naming doesn't fit. Then again, it doesn't fit with uniqued() either, so maybe there's another term we can use here?

I do like the idea of making this available to String too. Maybe it would be better as an addition to RangeReplaceableCollection, without the MutableCollection restriction. I'm open to opinions on this and will happily update the proposal if there aren't any objections.
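
For concreteness, here's a rough sketch of what I mean, using the current distinct() working name and only a RangeReplaceableCollection constraint; since String is a RangeReplaceableCollection, it would get the method for free:

extension RangeReplaceableCollection where Element: Hashable {
    func distinct() -> Self {
        var seen = Set<Element>()
        var result = Self()
        // Keep the first occurrence of each element, preserving order.
        for element in self where seen.insert(element).inserted {
            result.append(element)
        }
        return result
    }
}

"mississippi".distinct()  // "misp"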


(Dale Buckley) #22

Thinking about it, I'm tempted to go back on my reasoning for dismissing the removeDuplicates() style of naming. Having removeDuplicates() and removingDuplicates() would fit Swift's naming conventions, and it's actually more obvious to non-native English speakers than distinct(). It also has the benefit of coming up as an autocomplete suggestion alongside every other remove... method, so it wouldn't be hidden behind a more obscure word.
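
To illustrate, a rough sketch of how that pair could be declared, mirroring the sort()/sorted() convention (the bodies are placeholders, not the proposal's final implementation):

extension RangeReplaceableCollection where Element: Hashable {
    /// Nonmutating form, analogous to sorted().
    func removingDuplicates() -> Self {
        var seen = Set<Element>()
        return self.filter { seen.insert($0).inserted }
    }

    /// Mutating form, analogous to sort().
    mutating func removeDuplicates() {
        self = self.removingDuplicates()
    }
}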


(Thomas Roughton) #23

As an implementation note: I'd recommend using an Equatable-based implementation for input sizes under ~256 elements, particularly when the Collection is an Array.

// A sketch of the Equatable-based approach, wrapped in an extension so it's
// self-contained; remove(at:) comes from RangeReplaceableCollection.
extension RangeReplaceableCollection where Element: Equatable {
    mutating func removeDuplicates() {
        // Rather than using an Array here, the indices could be generated on-demand.
        var indicesToRemove = [Index]()

        // An element is a duplicate if it already appears in the prefix before it.
        for i in self.indices.dropFirst() {
            if self[self.startIndex..<i].contains(self[i]) {
                indicesToRemove.append(i)
            }
        }

        // Remove back to front so the earlier stored indices stay valid (safe for Array).
        for i in indicesToRemove.reversed() {
            self.remove(at: i)
        }
    }
}

This can also be done more efficiently when a new collection is being generated (rather than mutating in place), since membership can be checked against the new, smaller collection instead of the old, duplicate-containing one.
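
A rough sketch of that out-of-place variant, assuming only Element: Equatable and a hypothetical removingDuplicates() name; the contains check runs against result, which stays small when the input has many duplicates:

extension RangeReplaceableCollection where Element: Equatable {
    func removingDuplicates() -> Self {
        var result = Self()
        // Membership is tested against the growing, de-duplicated result,
        // not against the full original collection.
        for element in self where !result.contains(element) {
            result.append(element)
        }
        return result
    }
}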


(Happy Human Pointer) #24

I totally agree on this one. As you said, it aids discoverability, and the meaning is very clear. I used to feel that uniqued was the best way to go, but I'm reconsidering.


#25

Why should this be a requirement and not just a specialisation that gives better performance? The standard library already contains algorithms with specialisations for better performance when possible (e.g. RandomAccessCollection vs Collection or BidirectionalCollection). You could then also switch implementations for small collections if advantageous, as @Torust demonstrates.

Edit: See complexity section here for a simple example.
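
To make that concrete, a sketch of the overload-based specialisation (names and constraints are illustrative only):

// General algorithm: O(n^2), requires only Equatable.
extension RangeReplaceableCollection where Element: Equatable {
    func removingDuplicates() -> Self {
        var result = Self()
        for element in self where !result.contains(element) {
            result.append(element)
        }
        return result
    }
}

// Specialisation: O(n) expected time, picked when Element is also Hashable.
extension RangeReplaceableCollection where Element: Hashable {
    func removingDuplicates() -> Self {
        var seen = Set<Element>()
        return self.filter { seen.insert($0).inserted }
    }
}

One caveat: overload resolution happens at compile time, so code that is generic over Element: Equatable will still call the O(n^2) version even when the concrete element type happens to be Hashable.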


(Tellow Krinkle) #26

I would expect the cutoff point to vary based on what type of object you're using (the more expensive the == method, the earlier the cutoff). Was this tested with Ints or Strings?


(Thomas Roughton) #27

This was with Strings formed from random Ints. It's not necessarily the complexity of the == method that matters; for example, String comparison can short-circuit very quickly on strings that differ on the first character but is very slow for long strings with shared prefixes.

This was part of the reason I was somewhat opposed to adding this method in the other thread. The simple approaches (e.g. Set(elements).sorted() or the O(n^2) approach) are fine for small numbers of elements, while for large data sets you really have to know your data to find the best-performing algorithm.
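
For reference, that one-liner requires Element to be both Hashable and Comparable, and it returns the elements in sorted order rather than first-occurrence order:

let deduplicated = Set([3, 1, 3, 2, 1]).sorted()  // [1, 2, 3]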

As a side note, de-duplication can be performed very quickly on sorted input, which is a common precondition for e.g. filtering song names. In that context, both a naïve O(n^2) approach and a set-based approach would be less than ideal.
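
For example, a sketch of the single-pass approach over sorted input (hypothetical name, assuming only Element: Equatable); equal elements are adjacent, so one neighbour comparison per element suffices, giving O(n) with no hashing:

extension Array where Element: Equatable {
    /// Assumes the array is sorted, or at least groups equal elements together.
    func removingAdjacentDuplicates() -> [Element] {
        var result: [Element] = []
        result.reserveCapacity(count)
        // Append an element only if it differs from the last one kept.
        for element in self where element != result.last {
            result.append(element)
        }
        return result
    }
}

["Abbey Road", "Abbey Road", "Hey Jude"].removingAdjacentDuplicates()
// ["Abbey Road", "Hey Jude"]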


(Zach Wolfe) #28

@dlbuckley and I agree that we should have only the minimal set of requirements, and then provide specialized, optimized implementations where, for example, Element: Hashable. I admit that generic programming to this degree is a novel way of thinking for me, so it hadn't previously occurred to me that this was possible or desirable.