RFD: CountedSet

Nevin · April 13, 2018, 4:03am

Ooh, negative counts. My interest is piqued!

Erica_Sadun · April 16, 2018, 5:15pm

Changed count and added elementCount:

  /// Return the number of unique elements in the counted set
  public var elementCount: Int { return _counts.keys.count }
  
  /// Return the number of unique elements in the counted set
  public var count: Int {
    return _counts.keys.reduce(0, {
      accumulated, key in
      accumulated + _counts[key, default: 0]
    })
  }

The remaining set operations and what I plan to change:

Contains: Right now I have it contain a key. I do not have it contain a (key, count). No change.
==, isSubset, isSuperset: Right now, these only compare keys, not counts.
isDisjointWith: keys only
union/formUnion: add counts
intersection/formIntersection: intersect keys and choose the minimum value for the two counts.
subtracting/subtract: subtract counts, removing any count that falls to zero or below

Any objections to these? I know someone mentioned negative counts but I need some real world context as to why this would be allowed. It would throw a monkey wrench into my "membership means non-zero" model.

riley · April 16, 2018, 9:48pm

My gut says there shouldn't be an "elementCount" property, but rather follow the precedent set by Dictionary (and String with its character views) and expose the elements themselves:

var elements: Set<T> {
    return _contents.keys
}

Then, if people want the number of distinct elements, they can retrieve that by elements' own count property:

let count = countedSet.elements.count

Similarly, I think we should expose counts as it's own property too:

var counts: [Int] {
    return _contents.values
}

let counts = countedSet.counts

(Just like Dictionary uses LazyMapCollection, CountedSet should probably have its own collection types for these properties, but I'm just using Set and Array for now)

HOWEVER, if we choose to define counts as "sum of all elements' counts", we're in a very strange position. Specifically, what should happen when iterating through CountedSet? I would expect iterating would return (element, count) tuples, but then count does not match the number of items returned by iteration, which I think is a violation of the Collection protocol requirements, but even if it isn't, it's still very strange.

I'm not sure what to do about this if count is the sum of all elements' counts, so despite the fact that I think this definition of count makes more sense, given this constraint I think count should instead return the number of distinct elements in the CountedSet.

Furthermore, I don't think CountedSet should bother exposing a "sum of all elements' counts" property in the first place. I'm not sure what use it really has beyond checking if the CountedSet is empty (which is still fulfilled with isEmpty), but if someone did still want this information, they can calculate relatively easily it like so:

countedSet.counts.reduce(0, +)

palimondo · April 17, 2018, 12:35am

Scala has gone through major collections rewrite in version 2.8 and has landed in an exemplary place that is well worth emulating. Looking at documentation of the root package, it’s fair to say scala.collection is their . Not only because it comes with all batteries included, but it is extremely modular, which allows you to build your own collections on top of all the generic algorithms quite easily.

Though multisets are not included, there is this Bag implementation that’s worth a look at least for inspiration (@jawbroken ?).

jawbroken · April 17, 2018, 2:08am

To me this is still a strange mix of functionality from a multiset and a set that happens to have some auxiliary count information. ==, isSubset, isSuperset, isDisjointWith ignore counts. union is addition, not multiset union. intersection is multiset intersection. I find this pretty confusing and would have to consult the documentation every time to understand what happens with the counts. It's also still very unclear to me which algorithms that are generic in SetAlgebra will break with this model. I definitely agree that there's no reason for counts to be negative.

Yes, this is what other languages call the uniqueElements view, which is very useful for doing set operations ignoring the counts.

This is a good point, but if we're thinking about the CountedSet as a multiset then the obvious answer is that iterating over it directly would produce just a sequence of Elements, and the elements would appear in the sequence with the same multiplicity as their count. This is what the Python implementation does, which to me is the example to follow here (if we pretend they didn't implement negative counts):

elements() Return an iterator over elements repeating each as many times as its count. Elements are returned in arbitrary order.
>>> c = Counter(a=4, b=2, c=0, d=-2)
>>> list(c.elements())
['a', 'a', 'a', 'a', 'b', 'b']

You would probably want to also provide a separate iteration view that produced (element, count) tuples, but I think this is the sensible option if you want reasonable Collection and SetAlgebra conformances. To me, a CountedSet is a fundamentally a collection of elements that is efficiently stored and allows set-like queries to be answered efficiently, not a collection of (element, count) tuples.

Thanks for the pointer. Looking quickly at the tutorial for this:

Operation	Notes
Relationship to `Set`	Doesn't relate to `Set`/`SetAlgebra` (not a sub- or supertype), I think largely because they chose an incoherent definition of `union`.
Iteration	Provides multiple iteration options: `iterator`, `distinctIterator`, `bucketsIterator`
`++`, `--`	Addition and subtraction of counts
`subset`	Multiset definition (this.count ≤ other.count)
`union`	Actually addition (this.count + other.count) which is also provided as `++`. Only definition that isn't the multiset one.
`maxUnion`	Multiset `union` (max(this.count, other.count)
`intersection`	Multiset definition (min(this.count, other.count))

I still advocate for:

Following the multiset definitions for the operations, which provides a simple and learnable mental model (it's just a set that can contain multiple of the same element) and makes most or all generic SetAlgebra algorithms work for CountedSet.
Including addition and subtraction operations that just add and subtract counts (but don't go negative).
A view that exposes a SetAlgebra conforming type containing just the unique elements, probably called .uniqueElements or similar.
count for the CountedSet being the multiplicity of the multiset (sum of all element counts).
Iteration/Sequence/etc producing a sequence of elements with each element repeated to match its count.
A view that exposes a sequence of (element, count) pairs, which I haven't thought of a good name for yet.

But I'm open to changing my mind on these aspects, especially if they cause issues with protocol conformances and generic algorithms, or they turn out to not be ergonomic enough.

palimondo · April 17, 2018, 5:06am

It seems that many multiset-like collections from other languages have a bunch of idiosyncrasies. I’d venture a guess they were driven by pressures from varying type hierarchies in their host languages. Therefore the most honest direction to me would be to follow the strict, coherent and best studied definition from mathematics. Would you agree?

Erica_Sadun · April 17, 2018, 8:30pm

I'm going to table this proposal for the moment:

Following Apple's NSCountedSet model seems problematic.
Scala's collection model is a much bigger design task.
I don't personally want to implement or push forward on a "studied mathematical" definition.

If you decide to pick up the proposal and run with it, I'll be happy to contribute further opinions but I will not continue with an implementation or write-up.

Thank you everyone for your feedback. I hope a counted set of some kind and/or a full collection model is proposed and adopted.

palimondo · April 18, 2018, 3:06am

When I read the above, I get a feeling you were discouraged by my pedantry (based on points 2&3). If so, I’d like to apologize (again) and add some context to my post.

Your approach to grounding the emerging design in tangible use case and especially your excellent choice of easily relatable example from RPG gaming are the best. This approach leads to internally consistent result, which is IMO the most important goal.

During this process naturally come moments where multiple reasonable definitions are possible and we need to make choices. For them to not be arbitrary, we can look at prior art in other languages, informed by priceless summaries @jawbroken has contributed.

Lastly, and only if prior art from other languages shows no consensus, we should step out of the programming domain and ground our choice in mathematics. I too, dislike to pursuit of purity for its own sake and prefer pragmatic solutions.

I’ve pointed out to Scala’s example several times recently in the forum to illustrate an alternative approach, when the shortcomings of current Swift collection type hierarchy emerged as roadblocks. I take no joy in this, as it seems we’re in a final push towards ABI stability and will have to live with status quo forever.

jawbroken · April 18, 2018, 3:45am

I'm sure that if anyone's pedantry was discouraging here then it was mine, not yours. I'll see if I can find the time to do some work on this, because I think that a CountedSet or similar would be a valuable addition.

Adahus · November 24, 2019, 7:40pm

I'm aware this has been closed for a year bit I find it a very useful API to have and have a usable implementation and it seems a shame if it's not used more widely.

I used the name Multiset (in line from the maths instead of the given CountedSet)

My implementation is at: https://github.com/adahus/Multiset
With fairly complete documentation at: https://adahus.github.io/Multiset/Structs/Multiset.html

I've been using this for some time in various bits of code and haven't needed to edit for quite a while.
I based the implementation on Swift's existing Set and thought fairly carefully about changes to the API from that - i.e. just when it was needed because of the differences between Set and Multiset.

It includes:

Fairly complete range of initialisations including literals
Inserting & Removing elements in various ways
3 ways to iterate (Through all elements, through groups of like elements, just the distinct elements)
Sequence & Collection Operations
Conformity to SetAlgebra

I think there's at least one test for every feature but it's by no means an exhaustive set of tests.
I haven't had the time to check out the complexity and/or document with Big O notation.

Realistically I haven't got much time to give to this but thought I'd share to see if this might be of use.