Uniquing elements in a Sequence

soroush · September 30, 2018, 11:54pm

One thing that Swift has almost no ability to do is remove duplicate elements from a Sequence. The only real built-in way to solve this problem is let uniqued = Array(Set(original)). However, this approach doesn't 1) preserve the original order, 2) requires Hashable values, and 3) doesn't lend itself to chaining.

Uniquing is common in other languages and toolkits:

uniq in shell programming
uniq in Ruby
@distinctUnionOfObjects.self in Objective-C's KVC)

It's been has proven to be useful enough to me to be included in my default extensions that I bring to new project.

We've had some discussion on this forum about OrderedSet, which solves the ordering problem (1), but doesn't solve the strict Hashable requirement (2) or chaining (3).

There are a few ways to write this algorithm, with various affordances for bare values, Equatable values, and Hashable values. In this thread, I'd like to pitch a few of them to get a temperature of the community for a full proposal.

By projection

If the value isn't Equatable, in almost every case, I want to project to a child of the value, and unique on that. This projection can be done with a block ((Element) -> T where T: Hashable) or keypath (KeyPath<Element, Value>, Value: Hashable).

Here's a sample implementation of that:

// 1️⃣
extension Sequence {
    func uniqueElements<T: Hashable>(byProperty propertyAccessor: (Element) -> T) -> [Element] {
        var seen: Set<T> = []
        var result: [Element] = []
        for element in self {
            let property = propertyAccessor(element)
            if !seen.contains(property) {
                result.append(element)
                seen.insert(property)
            }
        }
        return result
    }
}

This is useful when you have objects where you only care about the uniqueness of a part of them. For example, in a Swift on the server project, one of our queries results in devices that can sometimes be duplicated:

let uniquedDevices = devices.uniqueElements(byProperty: { $0.deviceID })

Doing it with a block would dovetail nicely with any proposal that allows keypaths to be upgraded to blocks.

Projection to self

You can also add a small extension for when the element itself is Hashable:

// 2️⃣
extension Sequence where Element: Hashable {
    func uniqueElements() -> [Element] {
        return uniqueElements(byProperty:  { $0 })
    }
}

By custom equality

One option that brings even more flexibilty is to allow non-Equatable values to bring their own defintion of equality. This is similar to how sort(by:) or elementsEqual(_:by:) work.

// 3️⃣
extension Sequence {
    func uniqueElements(by elementsEqual: (Element, Element) -> Bool) -> [Element] {
        var result: [Element] = []
        for element in self {
            if !result.contains(where: { resultElement in elementsEqual(element, resultElement) }) {
                result.append(element)
            }
        }
        return result
    }
}

This, like option , has no constraints on the type, leading to maximum flexibility.

Equality on self

You can also add a small extension for elements that actually are Equatable:

// 4️⃣
extension Sequence where Element: Equatable {
    func uniqueElements() -> [Element] {
        return uniqueElements(by: ==)
    }
}

This function, becuase it can't lean on Hashable, has a big downside, which is that it requires nested passes of the original collection, making it have quadratic complexity: O(n²).

For elements that are hashable

Lastly for elements that are themselves Hashable, it's possible to make a function with linear time complexity (O(n)) and no parameters. This is similar in interface to option above, but has no dependency on another function.

// 5️⃣
extension Sequence where Element: Hashable {
    func uniqueElements() -> [Element] {
        var seen: Set<Element> = []
        var result: [Element] = []
        for element in self {
            if !seen.contains(element) {
                result.append(element)
                seen.insert(element)
            }
        }
        return result
    }
}

Uniquing in-place

In the same way that we added removeAll(_:) as a mutating, in-place version of filter(_:), we may want to add an in-place version of this, perhaps called removeDuplicates(). The main advantage to this approach is that we can add optimizations, minimizing array shifting. The disadvantage is that we could theoretically do an in-place version of all 4 of the examples above, doubling our API surface.

Which should we choose?

I personally like all four of the approaches above. They are all my children, and it's hard to choose between them. If I had to pick only one, I think , the projection/keypath approach, provides the best combination of speed, functionality, and simplicity.

I'm curious what the greater community thinks of these additions to the standard library. I'm happy to synthesize any conclusions from this thread into a propsoal, and write up a reference implementation for inclusion into Swift.

Paul_Cantrell · October 1, 2018, 1:39am

I was momentarily puzzled by this until I remembered that Hashable implies Equatable. (Go, Paul brain, go! You can do it!)

is just an alternate implementation of , indistinguishable in both signature and behavior, right?

My quick take:

or : necessary; this proposal would be incomplete without
: very nice; wish we had this for sorting. I’d suggest just uniqueElements(by:) or byValue: or byKey: instead of byProperty:, since “property” to me means the thing you declare with a type-level var/let, but the projection may be some derived value that is not literally a property of the elements.
and : no. Less useful, and a performance pitfall.

The other one we could consider is an O(n log n) version for comparable items:

extension Sequence {
    // comparator returns true iff $0 < $1
    func uniqueSortedElements(by comparator: (Element, Element) -> Bool) -> [Element] { … }
}

extension Sequence where Element: Comparable {
    func uniqueSortedElements() -> [Element] {
        return uniqueElements(byComparator:  { $0 < $1 })
    }
}

However, I really doubt that would carry its weight.

jrose · October 1, 2018, 3:46am

I'll also note there are two kinds of unique, and the shell-programming uniq and the C++ std::unique are a more limited but more efficient one: remove consecutive duplicate elements. That's useful for data already known to be sorted, but probably more expensive than the Set implementation when it isn't.

beccadax · October 1, 2018, 5:13am

I think that’s a good reason to make some versions of uniqued() be customization points, but I’m not sure it justifies a separate adjacent-only version on all collections.

soroush · October 1, 2018, 3:27pm

Yeah, byProperty: is just a label for now to distinguish between this version and the (Element, Element) -> Bool version.

Oh wow I forgot about this distinction. Either way, I think Ruby's uniq is more what we're after, rather than the shell's uniq.

Yeah, that could work really nicely with SortedCollection protocol discussion that's also happening.