Unordered Array / Counted Set

Saklad5 · September 17, 2021, 5:19pm

I think it would be very useful to have a data structure representing an unordered collection of non-unique elements. Using an Array ascribes a certain degree of significance to the order of elements, which can be misleading and detrimental to both hashing and equivalence checks.

There are reference-type implementations of the concept in Foundation (NSCountedSet) and CoreFoundation (CFBag).

As for my proposed implementation: if elements were Hashable, a simple Dictionary of type [Element: UInt] could be used to back the data structure, with the value being the quantity of elements. I’m not certain that constraint is truly necessary, though, so I’d welcome alternative suggestions.

Saklad5 · September 17, 2021, 5:26pm

An example of when this is useful is storing summands: summation is associative and commutative, meaning order should not affect equivalence. That makes an Array unsuitable. It is not idempotent, however, so the number of times an element appears is consequential. That makes a Set unsuitable.

Note

Yes, I know that computer addition cannot be truly associative. That’s an implementation detail, not a property of my API, and it is handled accordingly.

lorentey · September 18, 2021, 1:13am

Yep, a counted, unordered, hashed multiset could be a reasonable addition to the package. I think this is mostly an API design challenge, rather than a data structure implementation task, as [Element: Int] seems like the right data structure for implementing such a set.

We discussed this a little bit in Counted Set · Issue #20 · apple/swift-collections · GitHub.

Whether this proves to be a worthy addition depends on how much better API the wrapper type provides over a raw dictionary of integers -- and on how well the design would translate to the other multiset variants that we will likely want to add, including the non-counting SortedMultiset. (There are fundamental differences between these two that I expect will be reflected in their API; but how much do they have in common? And what other multiset-like things do we need to consider?)

Saklad5 · September 18, 2021, 1:17am

It would not necessarily be difficult to implement. To optimize, to test, to achieve consistent use across the Swift ecosystem? That’s slightly harder. Not that the difficulty of the implementation should be a barrier to inclusion, of course.

In terms of API interface, I think it should be extremely similar to an Array. The key difference would be hashing, equivalence, and potentially certain performance characteristics. Storing many copies of an element could take far less memory, for instance.

Saklad5 · September 18, 2021, 1:52pm

How should Collection.count work, in terms of time complexity? Adding up every value in a backing dictionary would be an O(n) operation, obviously, so it might be preferable to cache count internally and update it on mutation.

I recognize that NSCountedSet.count returns the number of distinct elements rather than the number of elements overall, but I feel that is misleading at best. The type should probably expose a view for distinct elements, akin to Dictionary.Keys, that can serve that purpose.

jawbroken · September 19, 2021, 12:10pm

There's some discussion of this in this previous thread, and I did a review of some other programming language's implementations there.

Saklad5 · October 10, 2021, 8:32pm

Having reviewed that thread, I might take a crack at an implementation myself.

The way I see it, a CountedSet is just a wrapper around a Dictionary. In light of that, I’ll literally start there. I’m going to ignore Foundation’s CountedSet: anyone who wants a drop-in replacement should probably petition the Foundation team instead.

SetAlgebra conformance seems fairly straightforward in my book: if the type conforms, it has to meet all the axioms. If that can’t be done without being unnecessarily confusing, it shouldn’t conform.

By the way, I find it sort of hilarious that the standard library literally defines a CountedSet type as a code example in documentation.

Saklad5 · October 10, 2021, 8:54pm

@lorentey I assume I should make a separate target for this. Since you expressed plans to potentially add other multisets in the future, should I call it MultiSets rather than CountedSet?

Saklad5 · October 11, 2021, 10:06pm

Should the CountedSet API trap if provided a count that isn’t greater than zero (for instance, in a dictionary literal)? The alternatives are ignoring such counts and trapping (using a precondition or fatal error).

I’m leaning towards trapping with a precondition.

extension CountedSet: ExpressibleByDictionaryLiteral {
  /// Creates a counted set initialized with a dictionary literal.
  ///
  /// Do not call this initializer directly. It is called by the compiler to
  /// handle dictionary literals. To use a dictionary literal as the initial
  /// value of a counted set, enclose a comma-separated list of key-value pairs
  /// in square brackets.
  ///
  /// For example, the code sample below creates a counted set with string
  /// keys.
  ///
  ///     let countriesOfOrigin = ["BR": 2, "GH": 1, "JP": 5]
  ///     print(countriesOfOrigin)
  ///     // Prints "["BR", "BR", "JP", "JP", "JP", "JP", "JP", "GH"]"
  ///
  /// - Parameter elements: The element-count pairs that will make up the new
  ///   counted set. Each key in `elements` must be unique.
  /// - Precondition: Each count must be greater than zero.
  @inlinable
  public init(dictionaryLiteral elements: (Element, Int)...) {
    _storage = .init(uniqueKeysWithValues: elements.lazy.map {
      precondition($0.1.signum() == 1)
      return $0
    })
  }
}

tera · October 11, 2021, 11:20pm

also consider "signed multisets" option (those allow negative, zero and infinite multiplicities).
perhaps via an initialiser parameter:

init(...., options: Multiset.Options = .default)
// .positive == .default
// .nonNegative
// .signed
// .weighted (real)
// .withInfinity (perhaps implemented as Int.max / .min for integral multiplicities)

also worth mentioning: other languages make the difference between hash and tree based data structures obvious (HashSet vs TreeSet / SortedSet / OrderedSet). the difference is important for custom types: whether to provide a hash or lessThan operation. if not too late for Swift to adopt this practice then the name could be HashMultiset.

Saklad5 · October 11, 2021, 11:26pm

There are tentative plans for other multisets in the future. This one mostly just has the semantics of an array.

Saklad5 · November 4, 2021, 11:34pm

I’ve managed to get SetAlgebra conformance working, including tests for all the axioms.

It diverges slightly from the documentation with regards to inserting elements that are already present, admittedly, but it otherwise meets the requirements quite nicely.

Saklad5 · December 4, 2021, 7:04pm

I’ve submitted a draft pull request with what I have so far. I think it’s currently usable, but not quite up to the standard of the Standard Library yet.

I designed the implementation with RawRepresentable, so the underlying dictionary is deliberately exposed. My hope is that that will massively cut down on the type’s complexity: if you want to work with it as a dictionary, you can literally just do so.