Remove duplicate elements from a collection

dlbuckley · February 1, 2019, 2:33pm

I've started putting together a pitch to add a couple of methods to the stdlib for removing duplicate elements from a collection. You can obviously do this by using a Set, but you loose element ordering which is required in a number of cases. We don't have an OrderedSet value in Swift so I thought that this would be the next best thing. It's also worth noting that this is a feature found in a number of other languages out there but not in Swift, so this seems a good opportunity to fill that hole.

Lantua · February 1, 2019, 3:25pm

I think it’d be better to have OrderedSet esp. when Foundation already has NSOrderedSet. Though I don’t have strong opinion on whether we should have it at all or not.

dlbuckley · February 1, 2019, 3:49pm

Having OrderedSet would be great, but with ABI stability around the corner it pushes the barrier to entry for adding a new type to greater heights. There is also the fact that you might not always want to use an OrderedSet and would prefer to use an array, so these methods would bridge that gap in the mean time and would also continue to be useful if and when OrderedSet is added.

Lantua · February 1, 2019, 4:03pm

I’m thinking that in case where you need to repeatedly distinct data, without OrderedSet, the performance may be less ideal.

Alejandro · February 1, 2019, 4:07pm

I’m almost certain they aren’t taking any new changes into Swift 5 at this point (maybe besides some of the string issues). Not arguing for or against the pitch, but you shouldn’t have to worry about adding a new type to the stdlib like OrderedSet post 5.

dlbuckley · February 1, 2019, 4:10pm

I’m thinking that in case where you need to repeatedly distinct data, without OrderedSet , the performance may be less ideal.

While thats true I think thats two different use cases, right? OrderedSet would be great when you need to always work with ordered unique values, but distinct() simply removes duplicate values so you can do something else with the remaining ones.

I’m almost certain they aren’t taking any new changes into Swift 5 at this point (maybe besides some of the string issues). Not arguing for or against the pitch, but you shouldn’t have to worry about adding a new type to the stdlib like OrderedSet post 5.

Oh I totally agree, this isn't getting into Swift 5 and I wouldn't expect it to. I'm just saying that adding these new methods and adding OrderedSet aren't mutually exclusive. It's just adding these new methods will be a whole lot easier to get in than a whole new type which will have a vast number of inputs and opinions to take into account.

Alejandro · February 1, 2019, 4:13pm

You’d be surprised x)

dlbuckley · February 1, 2019, 4:17pm

You’d be surprised x)

Well after the first few replies I'm starting to think thats true!

Unfortunately I've seen it a number of times over the past few years of swift evolution that someone puts forward a really nice small pitch and then it gets put on pause for the much grander idea to solve all the woes plus more, and then the grander idea never comes to pass.

AlexanderM · February 1, 2019, 4:37pm

(foreword: I'm aware the implementations are marked as preliminary)

I think there are multiple implementations that are each worth considering:

Algorithms

Linear searching: filter the collection, building up an Array of "seen" elements and using Array.contains(_:) (linear search, itself O(n) in time) it for every element.
- Preserves ordering
- Only requires Equatable conformance
- O(n) in time
- O(n) in extra space
Set searching: filter the collection, building up a Set of "seen" elements and using Set.contains(_:) (hash-based search, O(1) in time) it for every element.
- Preserves ordering
- Requires Hashable conformance
- O(n) in time
- O(n) in extra space
Convert to Set: Simply Set(sourceCollection).
- Loses ordering
- Requires Hashable conformance
- O(n) in time
- No extra space requirement
Convert to TreeSet: Simply TreeSet(sourceCollection). Requires TreeSet implementation (a tree-based Set that preserves natural ordering of elements.)
- Enforced "natural" ordering
- Requires Comparable conformance
- O(log(n)) in time
- No extra space requirement

Use cases:

Element type	Order Doesn't Matter	Discover Order Desired	Sorted Order Desired
`Equatable`	1	1	?
`Hashable`	3	2	?
`Comparable`	1	?	4

All space complexity assessments assume that the original collection is discard.

Discussion

I think that the trade-offs between these algorithms are too large for us to prescribe a one-size-fits-all solution. I think we should implement them all. However, I don't think we should introduce separate functions for them, instead, I think we should try to unify them under the minimum possible number of functions, each parameterized with things like whether ordering is desired or not. Something like:

Note: I take "first unique element" to mean: "Out of a set of duplicate elements, take the one which was discovered earliest"

enum OutputOrderingType {
    // Output contains first unique elements, in the order they were first seen. 
    // E.g. [4, 6, 4, 1] => [4, 6, 1]
    case discoveryOrder
    // Output contains first unique elements, in sorted order
    // E.g. [4, 6, 4, 1] => [1, 4, 6]
    case sortedOrder
    // Output contains first unique elements, in an undefined order
    // E.g. [4, 6, 4, 1] => [4, 1, 6]  or [1, 4, 6]  or [6, 4, 1], or ...
    case undefinedOrder
}

//prototype:
func distinctElements(
    by deriveKey: KeyDerivingFunction = { $0 },
    ordering: OutputOrderingType
) -> ReturnType

dlbuckley · February 1, 2019, 4:51pm

I had considered using a Set for the implementation and there are already some examples out there of doing it that way, but that requires Hashable (which you have noted). I don't think requiring Hashable is the right way to go as it exposes an implementation detail that the user doesn't care about, and shouldn't need to care about. As far as the operation of the method goes, requiring Equatable makes sense as we are checking the Equality of elements to only leave unique elements in place.

Thats a nice overview though, so thanks!

I won't pretend that I'm very versed in algorithms, I wish I was but I'm not. But I am aware of @Dave_Abrahams1 great talk back at WWDC 18 which I was just re-watching to come up with a half decent implementation so it doesn't become a sticking point in the discussion.

AlexanderM · February 1, 2019, 4:57pm

A Set based implementation is an absolute necessity. O(n^2) gets really bad, really fast. Say you want to deduplicate song names in a user's library. It's not uncommon to have 1,000s, even 10,000s of songs. You don't want to be doing O(n^2) algorithms, which would be doing a number of operations on the order of 1,000,000s and 100,000,000, respectively.

Luckily, you can do both:

extension Collection where Element: Hashable {
    func distinct(...) { fastHashBasedImplementation() }
}

extension Collection where Element: Equatable {
    func distinct(...) { slowButBetterThanNothingLinearSearchingOperation() }
}

Implementation details matter. You can't abstract them away, because data structures and algorithms are a very leaky abstraction (because their performance is so obviously noticeable)

dlbuckley · February 1, 2019, 5:13pm

I understand that it matters and what the Big-o implications are, I'm not saying that it doesn't. All I'm saying is that I just wanted to explore other avenues to see if they were possible with a decent complexity without exposing implementation details without needing to.

Adding multiple methods to do the same thing with different complexity is an option, but I'm just trying to think if this is done in other places in the stdlib or if it's generally frowned against. Any ideas?

AlexanderM · February 1, 2019, 5:34pm

Why would it be? Swift automatically selects the most specific (best optimized, most applicable) implementation for the job. It's done all over the place, e.g. in the specialization of Sequence's methods.

amosavian · February 1, 2019, 5:51pm

I strongly support this proposal. This is one of the frequently referred questions I see in Stackoverflow

GalCohen · February 2, 2019, 6:03am

Seems in the case of the collection diffing proposal, a similar argument was made where different algorithms might be needed and eventually the resulting implementation was the one the makes the most sense to most people most of the time (as opposed to providing multiple implementations)

Edit: link - Collection diff link

Torust · February 2, 2019, 7:05am

FWIW this recent thread may be relevant: Uniquing collections efficiently

jawbroken · February 2, 2019, 10:36am

I think this would be a valuable addition on its own and don't agree that it is better handled by a hypothetical OrderedSet type. As for the implementation question, it is fine (and even preferred) if the complexity is different for types that are just Equatable and types that are Hashable, the difference just needs to be documented. The API should be the same in either case, though.

phoneyDev · February 2, 2019, 8:13pm

I would expect to be able to use this method in a pipeline

let firstNames = people.map { $0.firstName }.uniqued().sorted()

Oh, and I think it should be called unique() or uniqued(). distincted() doesn't really work.

Of course I already have an implementation from SO because I needed it so I'm in favor of it being in the standard library.

CTMacUser · February 3, 2019, 8:59am

I responded with a similar idea when you mentioned this in another thread.

CTMacUser · February 3, 2019, 9:09am

Your methods require the collection to conform to both MutableCollection and RangeReplaceableCollection. Since the algorithms ultimately change the number of elements (therefore invalidating all outstanding indices), RangeReplaceableCollection must be the home of these methods. Could there be a RRC-only overload too, for supporting String?

There could be a MutableCollection.partitionDistinctiveness(by:) variant method for just-MC types.