[Pitch] Remove duplicate elements from a collection

collection
pitch

(Dale Buckley) #21

Thanks for the replies everyone, I've now had a few nights to sleep on this. Considering what @AlexanderM said, and having spoken to @zachrwolfe about the implementation, it does make sense to make Hashable a requirement. I was looking for other possibilities, but in this case it's just the nature of the beast.

I've updated the draft proposal to reflect those changes and added an implementation that seems to run pretty well.

I did play with the idea of unique() myself, but it felt like saying the array itself was unique, not the elements within it. That's why I took the naming idea from other languages and stuck with distinct. I'm open to changing it to unique() if others feel the same, but that's the reason I settled on the naming that I did.

This is also something I've been wrestling with, as you're right: it does make sense to be chainable like that. I think the capability makes sense, but the naming doesn't fit. Then again, it doesn't fit with uniqued() either, so maybe there's another term we can use here?

I do like the idea of making this available to String too. Maybe it would be better as an addition to RangeReplaceableCollection, without the MutableCollection restriction. I'm open to opinions on this and will happily update the proposal if there aren't any objections.
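
For concreteness, here's a rough sketch of what I mean, using the current distinct() working name and only a RangeReplaceableCollection constraint; since String is a RangeReplaceableCollection, it would get the method for free:

extension RangeReplaceableCollection where Element: Hashable {
    func distinct() -> Self {
        var seen = Set<Element>()
        var result = Self()
        // Keep the first occurrence of each element, preserving order.
        for element in self where seen.insert(element).inserted {
            result.append(element)
        }
        return result
    }
}

"mississippi".distinct()  // "misp"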


(Dale Buckley) #22

Thinking about it, I'm tempted to go back on my reasoning for dismissing the removeDuplicates() style of naming. Having removeDuplicates() and removingDuplicates() would fit Swift's naming conventions, and it's actually more obvious to non-native English speakers than distinct(). It also has the benefit of coming up as an autocomplete suggestion alongside every other remove... method, so it wouldn't be hidden behind a more obscure word.
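
To illustrate, a rough sketch of how that pair could be declared, mirroring the sort()/sorted() convention (the bodies are placeholders, not the proposal's final implementation):

extension RangeReplaceableCollection where Element: Hashable {
    /// Nonmutating form, analogous to sorted().
    func removingDuplicates() -> Self {
        var seen = Set<Element>()
        return self.filter { seen.insert($0).inserted }
    }

    /// Mutating form, analogous to sort().
    mutating func removeDuplicates() {
        self = self.removingDuplicates()
    }
}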


(Thomas Roughton) #23

As an implementation note: I'd recommend using an Equatable-based implementation for input sizes under ~256 elements, particularly when the Collection is an Array.

// A sketch of the Equatable-based approach, wrapped in an extension so it's
// self-contained; remove(at:) comes from RangeReplaceableCollection.
extension RangeReplaceableCollection where Element: Equatable {
    mutating func removeDuplicates() {
        // Rather than using an Array here, the indices could be generated on-demand.
        var indicesToRemove = [Index]()

        // An element is a duplicate if it already appears in the prefix before it.
        for i in self.indices.dropFirst() {
            if self[self.startIndex..<i].contains(self[i]) {
                indicesToRemove.append(i)
            }
        }

        // Remove back to front so the earlier stored indices stay valid (safe for Array).
        for i in indicesToRemove.reversed() {
            self.remove(at: i)
        }
    }
}

This can also be done more efficiently when a new collection is being generated (rather than mutating in place), since membership can be checked against the new, smaller collection instead of the old, duplicate-containing one.
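
A rough sketch of that out-of-place variant, assuming only Element: Equatable and a hypothetical removingDuplicates() name; the contains check runs against result, which stays small when the input has many duplicates:

extension RangeReplaceableCollection where Element: Equatable {
    func removingDuplicates() -> Self {
        var result = Self()
        // Membership is tested against the growing, de-duplicated result,
        // not against the full original collection.
        for element in self where !result.contains(element) {
            result.append(element)
        }
        return result
    }
}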


(Happy Human Pointer) #24

I totally agree on this one. As you said, it aids discoverability, and the meaning is very clear. I used to feel that uniqued was the best way to go, but I'm reconsidering.


#25

Why should this be a requirement and not just a specialisation that gives better performance? The standard library already contains algorithms with specialisations for better performance when possible (e.g. RandomAccessCollection vs Collection or BidirectionalCollection). You could then also switch implementations for small collections if advantageous, as @Torust demonstrates.

Edit: See complexity section here for a simple example.
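
To make that concrete, a sketch of the overload-based specialisation (names and constraints are illustrative only):

// General algorithm: O(n^2), requires only Equatable.
extension RangeReplaceableCollection where Element: Equatable {
    func removingDuplicates() -> Self {
        var result = Self()
        for element in self where !result.contains(element) {
            result.append(element)
        }
        return result
    }
}

// Specialisation: O(n) expected time, picked when Element is also Hashable.
extension RangeReplaceableCollection where Element: Hashable {
    func removingDuplicates() -> Self {
        var seen = Set<Element>()
        return self.filter { seen.insert($0).inserted }
    }
}

One caveat: overload resolution happens at compile time, so code that is generic over Element: Equatable will still call the O(n^2) version even when the concrete element type happens to be Hashable.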


(Tellow Krinkle) #26

I would expect the cutoff point to vary based on what type of object you're using (the more expensive the == method, the earlier the cutoff). Was this tested with Ints or Strings?


(Thomas Roughton) #27

This was with Strings formed from random Ints. It's not necessarily the complexity of the == method that matters; for example, String comparison can short-circuit very quickly on strings that differ on the first character but is very slow for long strings with shared prefixes.

This was part of the reason I was somewhat opposed to adding this method in the other thread. The simple approaches (e.g. Set(elements).sorted() or the O(n^2) approach) are fine for small numbers of elements, while for large data sets you really have to know your data to find the best-performing algorithm.
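
For reference, that one-liner requires Element to be both Hashable and Comparable, and it returns the elements in sorted order rather than first-occurrence order:

let deduplicated = Set([3, 1, 3, 2, 1]).sorted()  // [1, 2, 3]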

As a side note, de-duplication can be performed very quickly on sorted input, which is a common precondition for e.g. filtering song names. In that context, both a naïve O(n^2) approach and a set-based approach would be less than ideal.
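
For example, a sketch of the single-pass approach over sorted input (hypothetical name, assuming only Element: Equatable); equal elements are adjacent, so one neighbour comparison per element suffices, giving O(n) with no hashing:

extension Array where Element: Equatable {
    /// Assumes the array is sorted, or at least groups equal elements together.
    func removingAdjacentDuplicates() -> [Element] {
        var result: [Element] = []
        result.reserveCapacity(count)
        // Append an element only if it differs from the last one kept.
        for element in self where element != result.last {
            result.append(element)
        }
        return result
    }
}

["Abbey Road", "Abbey Road", "Hey Jude"].removingAdjacentDuplicates()
// ["Abbey Road", "Hey Jude"]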


(Zach Wolfe) #28

@dlbuckley and I agree that we should have only the minimal set of requirements, and then provide specialized, optimized implementations where, for example, Element: Hashable. I admit that generic programming to this degree is a novel way of thinking for me, so it hadn't previously occurred to me that this was possible or desirable.