One thing that Swift has almost no ability to do is remove duplicate elements from a Sequence. The only real built-in way to solve this problem is let uniqued = Array(Set(original))
. However, this approach doesn't 1) preserve the original order, 2) requires Hashable
values, and 3) doesn't lend itself to chaining.
Uniquing is common in other languages and toolkits:
-
uniq
in shell programming -
uniq
in Ruby -
@distinctUnionOfObjects.self
in Objective-C's KVC)
It's been has proven to be useful enough to me to be included in my default extensions that I bring to new project.
We've had some discussion on this forum about OrderedSet, which solves the ordering problem (1), but doesn't solve the strict Hashable
requirement (2) or chaining (3).
There are a few ways to write this algorithm, with various affordances for bare values, Equatable
values, and Hashable
values. In this thread, I'd like to pitch a few of them to get a temperature of the community for a full proposal.
By projection
If the value isn't Equatable
, in almost every case, I want to project to a child of the value, and unique on that. This projection can be done with a block ((Element) -> T
where T: Hashable
) or keypath (KeyPath<Element, Value>
, Value: Hashable
).
Here's a sample implementation of that:
// 1️⃣
extension Sequence {
func uniqueElements<T: Hashable>(byProperty propertyAccessor: (Element) -> T) -> [Element] {
var seen: Set<T> = []
var result: [Element] = []
for element in self {
let property = propertyAccessor(element)
if !seen.contains(property) {
result.append(element)
seen.insert(property)
}
}
return result
}
}
This is useful when you have objects where you only care about the uniqueness of a part of them. For example, in a Swift on the server project, one of our queries results in devices that can sometimes be duplicated:
let uniquedDevices = devices.uniqueElements(byProperty: { $0.deviceID })
Doing it with a block would dovetail nicely with any proposal that allows keypaths to be upgraded to blocks.
Projection to self
You can also add a small extension for when the element itself is Hashable
:
// 2️⃣
extension Sequence where Element: Hashable {
func uniqueElements() -> [Element] {
return uniqueElements(byProperty: { $0 })
}
}
By custom equality
One option that brings even more flexibilty is to allow non-Equatable
values to bring their own defintion of equality. This is similar to how sort(by:)
or elementsEqual(_:by:)
work.
// 3️⃣
extension Sequence {
func uniqueElements(by elementsEqual: (Element, Element) -> Bool) -> [Element] {
var result: [Element] = []
for element in self {
if !result.contains(where: { resultElement in elementsEqual(element, resultElement) }) {
result.append(element)
}
}
return result
}
}
This, like option , has no constraints on the type, leading to maximum flexibility.
Equality on self
You can also add a small extension for elements that actually are Equatable
:
// 4️⃣
extension Sequence where Element: Equatable {
func uniqueElements() -> [Element] {
return uniqueElements(by: ==)
}
}
This function, becuase it can't lean on Hashable
, has a big downside, which is that it requires nested passes of the original collection, making it have quadratic complexity: O(n²)
.
For elements that are hashable
Lastly for elements that are themselves Hashable
, it's possible to make a function with linear time complexity (O(n)
) and no parameters. This is similar in interface to option above, but has no dependency on another function.
// 5️⃣
extension Sequence where Element: Hashable {
func uniqueElements() -> [Element] {
var seen: Set<Element> = []
var result: [Element] = []
for element in self {
if !seen.contains(element) {
result.append(element)
seen.insert(element)
}
}
return result
}
}
Uniquing in-place
In the same way that we added removeAll(_:)
as a mutating, in-place version of filter(_:)
, we may want to add an in-place version of this, perhaps called removeDuplicates()
. The main advantage to this approach is that we can add optimizations, minimizing array shifting. The disadvantage is that we could theoretically do an in-place version of all 4 of the examples above, doubling our API surface.
Which should we choose?
I personally like all four of the approaches above. They are all my children, and it's hard to choose between them. If I had to pick only one, I think , the projection/keypath approach, provides the best combination of speed, functionality, and simplicity.
I'm curious what the greater community thinks of these additions to the standard library. I'm happy to synthesize any conclusions from this thread into a propsoal, and write up a reference implementation for inclusion into Swift.