Remove duplicate elements from a collection

I had considered using a Set for the implementation and there are already some examples out there of doing it that way, but that requires Hashable (which you have noted). I don't think requiring Hashable is the right way to go as it exposes an implementation detail that the user doesn't care about, and shouldn't need to care about. As far as the operation of the method goes, requiring Equatable makes sense as we are checking the Equality of elements to only leave unique elements in place.

Thats a nice overview though, so thanks!

I won't pretend that I'm very versed in algorithms, I wish I was but I'm not. But I am aware of @Dave_Abrahams1 great talk back at WWDC 18 which I was just re-watching to come up with a half decent implementation so it doesn't become a sticking point in the discussion.

A Set based implementation is an absolute necessity. O(n^2) gets really bad, really fast. Say you want to deduplicate song names in a user's library. It's not uncommon to have 1,000s, even 10,000s of songs. You don't want to be doing O(n^2) algorithms, which would be doing a number of operations on the order of 1,000,000s and 100,000,000, respectively.

Luckily, you can do both:

extension Collection where Element: Hashable {
    func distinct(...) { fastHashBasedImplementation() }
}

extension Collection where Element: Equatable {
    func distinct(...) { slowButBetterThanNothingLinearSearchingOperation() }
}

Implementation details matter. You can't abstract them away, because data structures and algorithms are a very leaky abstraction (because their performance is so obviously noticeable)

3 Likes

I understand that it matters and what the Big-o implications are, I'm not saying that it doesn't. All I'm saying is that I just wanted to explore other avenues to see if they were possible with a decent complexity without exposing implementation details without needing to.

Adding multiple methods to do the same thing with different complexity is an option, but I'm just trying to think if this is done in other places in the stdlib or if it's generally frowned against. Any ideas?

Why would it be? Swift automatically selects the most specific (best optimized, most applicable) implementation for the job. It's done all over the place, e.g. in the specialization of Sequence's methods.

I strongly support this proposal. This is one of the frequently referred questions I see in Stackoverflow

Seems in the case of the collection diffing proposal, a similar argument was made where different algorithms might be needed and eventually the resulting implementation was the one the makes the most sense to most people most of the time (as opposed to providing multiple implementations)

Edit: link - Collection diff link

FWIW this recent thread may be relevant: Uniquing collections efficiently

I think this would be a valuable addition on its own and don't agree that it is better handled by a hypothetical OrderedSet type. As for the implementation question, it is fine (and even preferred) if the complexity is different for types that are just Equatable and types that are Hashable, the difference just needs to be documented. The API should be the same in either case, though.

I would expect to be able to use this method in a pipeline

let firstNames = people.map { $0.firstName }.uniqued().sorted()

Oh, and I think it should be called unique() or uniqued(). distincted() doesn't really work.

Of course I already have an implementation from SO because I needed it so I'm in favor of it being in the standard library.

I responded with a similar idea when you mentioned this in another thread.

Your methods require the collection to conform to both MutableCollection and RangeReplaceableCollection. Since the algorithms ultimately change the number of elements (therefore invalidating all outstanding indices), RangeReplaceableCollection must be the home of these methods. Could there be a RRC-only overload too, for supporting String?

There could be a MutableCollection.partitionDistinctiveness(by:) variant method for just-MC types.

Thanks for the replies everyone, I've now had a few nights to sleep on this. Considering what @AlexanderM said and having spoken to @anon31136981 about the implementation, it does make sense to make Hashable a requirement. I was looking for other possibilities, but in this case its just the nature of the beast.

I've updated the draft proposal to reflect those changes and added an implementation that seems to run pretty well.

I did play with the idea of unique() myself, but it kind of felt like saying the array itself was unique not the elements within it. Thats why I took the naming idea from other languages and stuck with distinct. I'm open to changing it to that if others think the same, but thats the reason why I settled on the naming that I did.

This is also something I've been fighting with as you are right, it does make sense to be chained up like that. I think this capability makes sense, but the naming doesn't fit. On the other hand it also doesn't fit with uniqued() either, so maybe there is another terminology we can use here?

I do like the idea of having the ability available to String too. Maybe this is better to be an addition to the RangeReplaceableCollection without the restriction of MutableCollection too. I'm open to opinions on this and will happily update it if there aren't any objections.

Thinking about it I'm tempted to go back on my reasoning for dismissing the removeDuplicates() type naming. Having removeDuplicates() and removingDuplicates() would fit within Swift's naming scheme and actually it is more obvious to non native English speakers than distinct(). It also has the benefit of coming up as a suggestion when typing any other remove... method name so wouldn't be hidden away by a more obscure word.

5 Likes

As an implementation note: I'd recommend using an Equatable-based implementation for input sizes under ~256, particularly when the Collection is an Array.

// Rather than using an Array here, the indices could be generated on-demand.
var indicesToRemove = [Index]()

for i in self.indices.dropFirst() {
    if self[self.startIndex..<i].contains(self[i]) {
        indicesToRemove.append(i)
    }
}

self.remove(elementsAtSortedIndices: indicesToRemove)

This can also be done more efficiently if a new collection is being generated (rather than being performed in-place) since the new, smaller collection can be checked for membership rather than the old, duplicate-containing one.

1 Like

I totally agree on this one. As you said, it aids discoverability, and the meaning is very clear. I used to feel that uniqued was the best way to go, but I'm reconsidering.

Why should this be a requirement and not just a specialisation that gives better performance? The standard library already contains algorithms with specialisations for better performance when possible (e.g. RandomAccessCollection vs Collection or BidirectionalCollection). You could then also switch implementations for small collections if advantageous, as @Torust demonstrates.

Edit: See complexity section here for a simple example.

2 Likes

I would expect that the cutoff point would vary based on what type of object you're using (the more complicated the == method the earlier the cutoff is), was this testing with Ints or Strings?

1 Like

This was with Strings formed from random Ints. It's not necessarily the complexity of the == method that matters; for example, String comparison can short-circuit very quickly on strings that differ on the first character but is very slow for long strings with shared prefixes.

This was partly the reason why I was somewhat opposed to adding this method in the other thread. The simple approach (e.g. Set(elements).sorted() or the O(n^2) approach) is fine for small numbers of elements, and for large data sets you really have to know what your data is to find the best-performing algorithm.

As a side note, de-duplication can be performed very quickly on sorted input, which is a common precondition for e.g. filtering song names. In that context, both a naïve O(n^2) approach and a set-based approach would be less than ideal.

2 Likes

@dlbuckley and I agree that we should have only the minimal set of requirements, and then provide specialized, optimized implementations where, for example, Element: Hashable. I admit that generic programming to this degree is a novel way of thinking for me, so it hadn’t previously occurred to me that that was possible or desirable.

1 Like

I've not really had time to work on any of this for a while (mostly down to personal time constraints), so in an effort to not let this die completely here is the latest implementation version that me and Zach (@anon31136981 he deleted his account?) put together a couple of months back. If anyone has the time to push this forwards then let me know and I will be happy to help.

extension RangeReplaceableCollection where Self: MutableCollection {

    /// Removes any duplicate elements where the equality is based on
    /// a value other than the element itself.
    ///
    /// Use this method to remove duplicate elements where their equality
    /// is based on the value returned by the closure. The order of the remaining
    /// elements is preserved.
    ///
    /// This example removes objects based on the equality of one of the
    /// properties of the object.
    ///
    ///     struct Animal {
    ///         let kind: String
    ///         let name: String
    ///     }
    ///
    ///     let animals = [
    ///         Animal(kind: "Dog", name: "Fido"),
    ///         Animal(kind: "Cat", name: "Tibbles"),
    ///         Animal(kind: "Lion", name: "Simba"),
    ///         Animal(kind: "Dog", name: "Spot"),
    ///     ]
    ///
    ///     animals.removeDuplicates(by: { $0.kind })
    ///
    ///     // animals == [
    ///     //   Animal(kind: "Dog", name: "Fido"),
    ///     //   Animal(kind: "Cat", name: "Tibbles"),
    ///     //   Animal(kind: "Lion", name: "Simba")
    ///     // ]
    ///
    /// - Parameter distinctValue: A closure that takes an element of the
    ///   sequence as its argument and returns a value that conforms to
    ///   `Hashable` which will in turn be used to compare if an object
    ///   is a duplicate.
    ///
    /// - Complexity: O(*n*), where *n* is the length of the collection.
    mutating func removeDuplicates<T: Hashable>(by distinctValue: (Element) -> T) {
        var seen: Set<T> = []
        removeAll(where: { element in
            let value = distinctValue(element)

            return !seen.insert(value).inserted
        })
    }

    /// Removes any duplicate elements where the equality is based on
    /// a value other than the element itself.
    ///
    /// Use this method to remove duplicate elements where their equality
    /// is based on the value returned by the closure. The order of the remaining
    /// elements is preserved.
    ///
    /// This example removes objects based on the equality of one of the
    /// properties of the object.
    ///
    ///     struct Animal {
    ///         let kind: String
    ///         let name: String
    ///     }
    ///
    ///     let animals = [
    ///         Animal(kind: "Dog", name: "Fido"),
    ///         Animal(kind: "Cat", name: "Tibbles"),
    ///         Animal(kind: "Lion", name: "Simba"),
    ///         Animal(kind: "Dog", name: "Spot"),
    ///     ]
    ///
    ///     animals.removeDuplicates(by: { $0.kind })
    ///
    ///     // animals == [
    ///     //   Animal(kind: "Dog", name: "Fido"),
    ///     //   Animal(kind: "Cat", name: "Tibbles"),
    ///     //   Animal(kind: "Lion", name: "Simba")
    ///     // ]
    ///
    /// - Parameter distinctValue: A closure that takes an element of the
    ///   sequence as its argument and returns a value that conforms to
    ///   `Equatable` which will in turn be used to compare if an object
    ///   is a duplicate.
    ///
    /// - Complexity: O(*n2*), where *n* is the length of the collection.
    mutating func removeDuplicates<T: Equatable>(by distinctValue: (Element) -> T) {
        var result = ContiguousArray<Element>()
        var distinctElements = ContiguousArray<T>()
        for element in self {
            let distinctElement = distinctValue(element)
            guard !distinctElements.contains(distinctElement) else {
                continue
            }
            distinctElements.append(distinctElement)
            result.append(element)
        }
        self = Self(result)
    }
}

extension RangeReplaceableCollection {

    /// Removes any duplicate elements where the equality is based on
    /// a value other than the element itself.
    ///
    /// Use this method to remove duplicate elements where their equality
    /// is based on the value returned by the closure. The order of the remaining
    /// elements is preserved.
    /// This example removes objects based on the equality of one of the
    /// properties of the object.
    ///
    ///     struct Animal {
    ///         let kind: String
    ///         let name: String
    ///     }
    ///
    ///     let animals = [
    ///         Animal(kind: "Dog", name: "Fido"),
    ///         Animal(kind: "Cat", name: "Tibbles"),
    ///         Animal(kind: "Lion", name: "Simba"),
    ///         Animal(kind: "Dog", name: "Spot"),
    ///     ]
    ///
    ///     let uniqueAnimals = animals.removingDuplicates(by: { $0.kind })
    ///
    ///     // uniqueAnimals == [
    ///     //   Animal(kind: "Dog", name: "Fido"),
    ///     //   Animal(kind: "Cat", name: "Tibbles"),
    ///     //   Animal(kind: "Lion", name: "Simba")
    ///     // ]
    ///
    /// - Parameter distinctValue: A closure that takes an element of the
    ///   sequence as its argument and returns a value that conforms to
    ///   `Hashable` which will in turn be used to compare if an object
    ///   is a duplicate.
    ///
    /// - Returns: A new collection with all duplicates removed.
    ///
    /// - Complexity: O(*n*), where *n* is the length of the collection.
    func removingDuplicates<T: Hashable>(by distinctValue: (Element) -> T) -> Self {

        var copy = self
        var seen: Set<T> = []
        copy.removeAll(where: { element in
            let value = distinctValue(element)

            return !seen.insert(value).inserted
        })

        return copy
    }

    /// Returns a new collection of the same type with all duplicate elements
    /// removed based on a value other than the element itself.
    ///
    /// Use this method to create a new collection with all duplicate elements
    /// removed based on the value returned by the closure. The order of the
    /// remaining elements is preserved.
    ///
    /// This example removes objects based on the equality of one of the
    /// properties of the object.
    ///
    ///     struct Animal {
    ///         let kind: String
    ///         let name: String
    ///     }
    ///
    ///     let animals = [
    ///         Animal(kind: "Dog", name: "Fido"),
    ///         Animal(kind: "Cat", name: "Tibbles"),
    ///         Animal(kind: "Lion", name: "Simba"),
    ///         Animal(kind: "Dog", name: "Spot"),
    ///     ]
    ///
    ///     let uniqueAnimals = animals.removingDuplicates(by: { $0.kind })
    ///
    ///     // uniqueAnimals == [
    ///     //   Animal(kind: "Dog", name: "Fido"),
    ///     //   Animal(kind: "Cat", name: "Tibbles"),
    ///     //   Animal(kind: "Lion", name: "Simba")
    ///     // ]
    ///
    /// - Parameter distinctValue: A closure that takes an element of the
    ///   sequence as its argument and returns a value that conforms to
    ///   `Equatable` which will in turn be used to compare if an object
    ///   is a duplicate.
    ///
    /// - Returns: A new collection with all duplicates removed.
    ///
    /// - Complexity: O(*n2*), where *n* is the length of the collection.
    func removingDuplicates<T: Equatable>(by distinctValue: (Element) -> T) -> Self {

        var result = ContiguousArray<Element>()
        var distinctElements = ContiguousArray<T>()
        for element in self {
            let distinctElement = distinctValue(element)
            guard !distinctElements.contains(distinctElement) else {
                continue
            }
            distinctElements.append(distinctElement)
            result.append(element)
        }

        return Self(result)
    }
}

extension RangeReplaceableCollection where Self: MutableCollection, Element: Hashable {

    /// Removes any duplicate elements.
    ///
    /// Use this method to remove duplicate elements. The order of the
    /// remaining elements is preserved.
    /// This example removes duplicate numbers from an array.
    ///
    ///     let numbers = [1, 2, 3, 4, 5, 4, 1, 2, 6]
    ///     numbers.removeDuplicates()
    ///     // numbers == [1, 2, 3, 4, 5, 6]
    ///
    /// - Complexity: O(*n*), where *n* is the length of the collection.
    mutating func removeDuplicates() {
        removeDuplicates { $0 }
    }
}

extension RangeReplaceableCollection where Element: Hashable {

    /// Returns a new collection of the same type with all duplicate elements
    /// removed.
    ///
    /// Use this method to create a new collection with all duplicate elements
    /// removed. The order of the remaining elements is preserved.
    ///
    /// This example takes an array of numbers and creates a new array from it
    /// with all duplicates removed.
    ///
    ///     let numbers = [1, 2, 3, 4, 5, 4, 1, 2, 6]
    ///     let uniqueNumbers = numbers.removingDuplicates()
    ///     // uniqueNumbers == [1, 2, 3, 4, 5, 6]
    ///
    /// - Returns: A new collection with all duplicates removed.
    ///
    /// - Complexity: O(*n*), where *n* is the length of the collection.
    func removingDuplicates() -> Self {
        return self.removingDuplicates { $0 }
    }
}

extension RangeReplaceableCollection where Self: MutableCollection, Element: Equatable {

    /// Removes any duplicate elements.
    ///
    /// Use this method to remove duplicate elements. The order of the
    /// remaining elements is preserved.
    /// This example removes duplicate numbers from an array.
    ///
    ///     let numbers = [1, 2, 3, 4, 5, 4, 1, 2, 6]
    ///     numbers.removeDuplicates()
    ///     // numbers == [1, 2, 3, 4, 5, 6]
    ///
    /// - Complexity: O(*n2*), where *n* is the length of the collection.
    mutating func removeDuplicates() {
        removeDuplicates { $0 }
    }
}

extension RangeReplaceableCollection where Element: Equatable {

    /// Returns a new collection of the same type with all duplicate elements
    /// removed.
    ///
    /// Use this method to create a new collection with all duplicate elements
    /// removed. The order of the remaining elements is preserved.
    ///
    /// This example takes an array of numbers and creates a new array from it
    /// with all duplicates removed.
    ///
    ///     let numbers = [1, 2, 3, 4, 5, 4, 1, 2, 6]
    ///     let uniqueNumbers = numbers.removingDuplicates()
    ///     // uniqueNumbers == [1, 2, 3, 4, 5, 6]
    ///
    /// - Returns: A new collection with all duplicates removed.
    ///
    /// - Note: For reduced complexity, ensure `Element` conforms to `Hashable`.
    ///
    /// - Complexity: O(*n2*), where *n* is the length of the collection.
    func removingDuplicates() -> Self {
        return self.removingDuplicates { $0 }
    }
}

There are specialised versions for using both Equatable and Hashable which include a bunch of feedback from this thread plus a bunch more research from our side on how it would best fit into the standard library.

1 Like