Uniquing collections efficiently

It often comes up that you want to unique a collection, that is, to create a new collection with no duplicate elements. I often see people do something like this:

var uniqued = [Int]()
for element in myArray {
  if !uniqued.contains(element) {
    uniqued.append(element)
  }
}

The problem with this approach is that it runs in O(N^2) time, because contains performs a linear scan for every element.

This Stack Overflow question tried to use reduce.
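The linked question isn't reproduced here, but a typical reduce-based attempt (a hypothetical reconstruction) looks like this. Note that it has the same quadratic behavior, since contains scans the accumulator on every step:

```swift
// Hypothetical reconstruction of a typical reduce-based answer.
// Still O(N^2): `contains` does a linear scan per element.
let uniqued = [1, 2, 1, 3, 2].reduce(into: [Int]()) { result, element in
    if !result.contains(element) {
        result.append(element)
    }
}
print(uniqued) // [1, 2, 3]
```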

However, using Hashable we can make an algorithm that runs in linear time on average.

var uniqued = [Int]()
var set = Set<Int>()
for element in myArray {
  if set.insert(element).inserted {
    uniqued.append(element)
  }
}

Would it be useful to add a uniqued function to the standard library? And if so, should there also be a RangeReplaceableCollection overload that returns Self rather than [Element]?

extension Collection where Element: Hashable {
  /// Returns a collection with no duplicate elements.
  ///
  /// Complexity: O(n) on average, where n is the number of elements in the collection.
  func uniqued() -> [Element] {
    var uniqued = [Element]()
    var set = Set<Element>()
    for element in self {
      if set.insert(element).inserted {
        uniqued.append(element)
      }
    }
    return uniqued
  }
}

extension RangeReplaceableCollection where Element: Hashable {
  /// Returns a collection with no duplicate elements.
  ///
  /// Complexity: O(n) on average, where n is the number of elements in the collection.
  func uniqued() -> Self {
    var uniqued = Self()
    var set = Set<Element>()
    for element in self {
      if set.insert(element).inserted {
        uniqued.append(element)
      }
    }
    return uniqued
  }
}

I think there are two different needs to address here.

If one wants to convert a Sequence into a Collection with unique elements, the obvious thing to do is introduce an OrderedSet type, similar to NSOrderedSet.

It is not difficult to implement using Array and Dictionary as underlying types, but I think that it would be a very valuable addition to the Standard Library to ensure that it is as efficient as possible.
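As a rough sketch (the name and API here are placeholders, not a proposed design), such a type can pair an Array for ordering with a Set for membership:

```swift
/// Illustrative sketch only: keeps insertion order in an Array
/// and membership in a Set, so inserts are O(1) on average.
struct OrderedSet<Element: Hashable>: Sequence {
    private var ordered: [Element] = []
    private var membership: Set<Element> = []

    init<S: Sequence>(_ sequence: S) where S.Element == Element {
        for element in sequence { insert(element) }
    }

    /// Appends the element only if it hasn't been seen before.
    @discardableResult
    mutating func insert(_ element: Element) -> Bool {
        guard membership.insert(element).inserted else { return false }
        ordered.append(element)
        return true
    }

    func contains(_ element: Element) -> Bool {
        membership.contains(element)
    }

    func makeIterator() -> IndexingIterator<[Element]> {
        ordered.makeIterator()
    }
}

let deduped = OrderedSet([3, 1, 3, 2, 1])
print(Array(deduped)) // [3, 1, 2]
```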

On the other hand, if one just wants to lazily iterate over a Sequence, I'm using this in my code:

struct UniquingSequenceIterator<Base: Sequence>: IteratorProtocol where Base.Element: Hashable {
    typealias Element = Base.Element

    private var baseIterator: Base.Iterator
    private var seen = Set<Element>()

    init(_ base: Base) {
        baseIterator = base.makeIterator()
    }

    mutating func next() -> Element? {
        while let candidate = baseIterator.next() {
            if seen.insert(candidate).inserted {
                return candidate
            }
        }
        return nil
    }
}

struct UniquingSequence<Base: Sequence>: Sequence where Base.Element: Hashable {
    typealias Element = Base.Element
    typealias Iterator = UniquingSequenceIterator<Base>

    private var base: Base

    init(_ base: Base) {
        self.base = base
    }

    func makeIterator() -> Iterator {
        return UniquingSequenceIterator(base)
    }
}

extension Sequence where Element: Hashable {
    func droppingDuplicates() -> UniquingSequence<Self> {
        return UniquingSequence(self)
    }
}
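For a quick one-off, similar behavior can be had with lazy.filter and a captured Set, though the shared state makes the result effectively single-pass, unlike the restartable wrapper above:

```swift
// One-off lazy uniquing with a captured Set. Iterating the result
// more than once would give wrong answers, since `seen` persists
// across iterations.
var seen = Set<Int>()
let unique = [1, 2, 1, 3, 2, 4].lazy.filter { seen.insert($0).inserted }
print(Array(unique)) // [1, 2, 3, 4]
```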

Performance-wise, this unfortunately isn't quite as straightforward as you've presented it. Depending on the number of duplicates in the sequence, the cost of hashing, memory allocations, and so on, the naïve O(N^2) version may run faster than the Set-based version. Iterating over linear arrays is generally very fast (cache-friendly, with low CPU instruction cost).

In addition, I think uniqued(), if it were to exist, should exist on all Equatable sequences and not just Hashable ones.
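A sketch of what an Equatable-only version might look like; without Hashable there is no Set to lean on, so this is necessarily the O(N^2) comparison-based approach:

```swift
// Sketch of an Equatable-only uniqued(). O(N^2), since
// `contains` performs a linear scan for every element.
extension Sequence where Element: Equatable {
    func uniqued() -> [Element] {
        var result: [Element] = []
        for element in self where !result.contains(element) {
            result.append(element)
        }
        return result
    }
}

// Works for types that are Equatable but not Hashable:
struct Point: Equatable { var x, y: Double }
let points = [Point(x: 0, y: 0), Point(x: 1, y: 0), Point(x: 0, y: 0)]
print(points.uniqued().count) // 2
```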

If you compare the performance of this:

let array = (0..<1024).map { i in Int.random(in: 0...i) }

var array0 = array
var array1 = array

var startTime = DispatchTime.now().uptimeNanoseconds

var i = 0
var set = Set<Int>()
while i < array0.count {
    if set.insert(array0[i]).inserted {
        i += 1
    } else {
        array0.remove(at: i)
    }
}

var endTime = DispatchTime.now().uptimeNanoseconds
print("Set-based took \(endTime - startTime)")
startTime = DispatchTime.now().uptimeNanoseconds
i = 0
while i < array1.count {
    if array1[0..<i].contains(array1[i]) {
        array1.remove(at: i)
    } else {
        i += 1
    }
}

endTime = DispatchTime.now().uptimeNanoseconds
print("Array-based took \(endTime - startTime)")

you'll find that both algorithms are roughly even at around 1000 elements (with the O(N^2) method usually being quicker, depending on the composition of the array to unique), that below that the array method wins handily, and that only when the element count is extremely high does the set version start to outperform the array.

In short, I don't think the standard library can easily provide an implementation that will consistently outperform a naïve implementation on all data. I also think that if order doesn't matter Set(sequence) is concise enough, and if the order matters it's probably the sorted order in which case Set(sequence).sorted() works.
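Concretely, the two one-liners mentioned:

```swift
let values = [3, 1, 2, 3, 1]

// Order doesn't matter: Set alone drops the duplicates.
let unordered = Set(values)
print(unordered.count)      // 3

// Sorted order wanted: unique, then sort.
print(Set(values).sorted()) // [1, 2, 3]
```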


I suppose that strings make the set algorithm take longer?

Just for reference, here are the results using the other method.

[screenshot of results]

The speed of string comparison vs. hashing could definitely be a factor. For strings, having a bunch of very long strings which all only differ at the end would be a pathological case for the comparison-based algorithm, but it would perform very well if the strings differ in the first couple of characters. On the other hand, each string only needs to be hashed once for the Set. Again, it all depends on the data.
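For example, a pathological input for the comparison-based algorithm (a hypothetical construction) would be strings sharing a long common prefix:

```swift
// Long shared prefix: comparing any two of these strings must walk
// ~1000 characters before finding a difference, whereas the Set
// only hashes each string once.
let sharedPrefix = String(repeating: "a", count: 1_000)
let worstCase = (0..<100).map { sharedPrefix + String($0) }

// Strings that differ immediately are the comparison-friendly case:
let bestCase = (0..<100).map { String($0) + sharedPrefix }
```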

It's worth noting I did my performance testing in -O (release) configuration. I'm not sure what online.swiftplayground.run uses by default; it may have optimisations disabled.

I have never used the command line to run Swift, only Xcode. I was using that website because at that time I only had my phone.

Maybe I will test, say, 500 to 1000 character strings.

Input:

import CoreFoundation

func measure(block: () -> ()) -> CFAbsoluteTime {
    let start = CFAbsoluteTimeGetCurrent()
    block()
    let end = CFAbsoluteTimeGetCurrent()
    return end - start
}

extension String {
    public static func random(length: Int) -> String {
        let chars = (0 ..< length).map { _ -> Character in
            let codePoint = UInt8.random(in: 0 ..< 100)
            let scalar = Unicode.Scalar(codePoint)
            return Character(scalar)
        }
        return String(chars)
    }
}

let count = 500
let strings: [String] = (0 ..< 10).map { _ -> String in
    return String.random(length: count)
}

var results = [(Double, Double, Double)]()
let testCount = 10

for _ in 0 ..< testCount {
    let array = measure {
        var array = [String]()
        for str in strings {
            if !array.contains(str) {
                array.append(str)
            }
        }
    }
    let set = measure {
        var array = [String]()
        var set = Set<String>()
        for str in strings {
            if set.insert(str).inserted {
                array.append(str)
            }
        }
    }
    let setOnly = measure {
        var set = Set<String>()
        for str in strings {
            set.insert(str)
        }
        _ = Array(set)
    }
    results.append((array, set, setOnly))
}

let (array, set, setOnly) = results.reduce(into: (0.0, 0.0, 0.0)) {
    $0.0 += $1.0 / Double(testCount)
    $0.1 += $1.1 / Double(testCount)
    $0.2 += $1.2 / Double(testCount)
}

print("Array: \(array)")
print("Set: \(set)")
print("Set init: \(setOnly)")

Output:

Array: 0.003508102893829346
Set: 0.0033342003822326664
Set init: 0.0012320280075073242

I see what you mean. The set approach is only faster by a fraction of a millisecond.

I have been thinking about this for a while, and am wondering if this should be moved to a different thread, but here is my idea.

Design

There should be a protocol called CheapHashable. (The name is just a stand-in for a more creative/expressive name.) The protocol should apply to all types that are known to hash cheaply. It would simply be defined as:

/// A type that can be cheaply hashed into a Hasher to produce an integer hash value
public protocol CheapHashable: Hashable {}

Certain types, like String, will only conform to Hashable, but types that can be hashed easily, like Bool, will adopt CheapHashable:

extension Collection where Element: Hashable {
  internal func uniqued<T: RangeReplaceableCollection>(storingAs type: T.Type) -> T where T.Element == Element {
    // Preprocess: count how many elements are cheaply hashable.
    let cheapCount = self.filter { $0 is CheapHashable }.count
    var uniqued = T()
    if cheapCount == self.count {
      // Every element hashes cheaply, so take the Set-based path.
      var set = Set<Element>()
      for element in self {
        if set.insert(element).inserted {
          uniqued.append(element)
        }
      }
    } else {
      // Fall back to the comparison-based path.
      for element in self {
        if !uniqued.contains(element) {
          uniqued.append(element)
        }
      }
    }
    return uniqued
  }
  /// Returns a collection with no duplicate elements.
  ///
  /// Complexity: O(n) on average when all elements are cheaply hashable; O(n^2) otherwise.
  func uniqued() -> [Element] {
    return uniqued(storingAs: Array<Element>.self)
  }
}
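Conformances might then be declared like this (hypothetical choices; the marker protocol is restated so the snippet stands alone):

```swift
/// Restated marker protocol from the design above.
public protocol CheapHashable: Hashable {}

// Hypothetical conformances: small fixed-size types opt in,
// while String deliberately stays Hashable-only.
extension Int: CheapHashable {}
extension Bool: CheapHashable {}

// The runtime check the dispatch relies on:
let samples: [Any] = [1, true, "hello"]
let cheap = samples.filter { $0 is CheapHashable }
print(cheap.count) // 2
```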

Basic implementation

Potential downfalls

Most beginners will not know about the CheapHashable protocol, so it should not be overused. However, there is a flaw: if a beginner does find the protocol, they may wrongly assume their type can be cheaply hashed, resulting in a performance loss that could have been avoided had the protocol not been added in the first place.