RFD: CountedSet

I decided that I liked a counted set returning a @discardableResult for insertions and removals. Thoughts?

Here's where I stopped for now: https://gist.github.com/erica/8abad5f7154fb4dc7dfc9d924a0a7a1b

2 Likes

Thanks for the concrete code example, it is really helpful and raises some interesting questions in my mind. To confirm, is a CountedSet essentially supposed to model a multiset, where the count is the multiplicity of each element? Presuming that is the case, and it’s not some other object entirely, here are the questions that I had while reading your gist:

  • Should count return the number of unique elements, or the sum of all the counts? The former would be the cardinality of the support, while the latter is just the cardinality. Should both of these be exposed in some way?
  • How should a subset defined? It looks like the code currently checks for equality between counts, but the multiset equivalent would be inclusion, which checks ≤ instead, e.g. {a, b} is included in {a, a, b}.
  • Similar question for union and intersection, which are also defined here for multisets. Your code currently defines the union as what would be called the sum of multisets (adding the counts together), while the mathematical definition takes the maximum of the counts. Your intersection currently removes all the elements that have count 0 in the other CountedSet and leaves the other counts untouched (which is probably an issue because it isn’t commutative), while the mathematical definition is the minimum of the counts.
  • Subtracting just removes all the keys present in the other CountedSet. Should there instead be addition and subtracting operations (and operators?) that add and subtract the counts? This seems useful (e.g. add all the items from this chest to your inventory, subtract all the items you sold to the store) and more in line with the definition of sum for multisets. Subtraction opens the question of what do to when the count goes below 0 though (throwing probably isn’t possible because of SetAlgebra, should it trap or just go to 0?).

My current thoughts are:

  • count should be the sum of all the counts because I believe this makes more semantic sense when used in combination with the multiset operation definitions (e.g. the sum of two multisets has a count that is the sum of their counts). The number of keys (or the cardinality of the support) should be exposed under another name.
  • Subset, union and intersection should have the mathematical definitions, along with addition and subtraction.
  • If subtraction results in a negative count it should just silently be set to 0. It’s a little tempting to make this a programmer error and trap, but generic algorithms on SetAlgebra expect to be able to use subtract(_:) without checking that other is a subset by silently ignoring elements that aren’t present.

I haven’t checked carefully if these definitions would be compatible with generic algorithms on SetAlgebra, though.

1 Like

I stole a lot of the docs / design (not all but most) from NSCountedSet

I was thinking of how this would be used more than what is the "correct" way. I originally was going to go with exactly your way and then decided in practice, if someone were combining things or seeing if a set had a "subset", they'd probably ignore the counts. So that's the way I went. The questions I were imagining were like "Does this user have at least this core group of arrows in his quiver?" so I went more set-like than counted-like. I also omitted proper subset and proper superset because I couldn't figure a reason they'd be used with counted sets. I did not include "same elements ignoring count" but I think I'd want to add that for practical reasons on similar lines.

Union I assumed was "Add the arrows from this quiver to that quiver". Intersection: "What arrows and magnitude appear in both". I think I might have forgotten to do the - on the count for intersection.

Again practically inspired and I"m not sure I love what I came up with, but "Get rid of any arrow that appears in that other quiver".

I may be trying to go too far with the set operations.

As for your other thoughts, count is by Apple. I can add sumCount or something distinct. I'm not happy with any of the mathematical definitions without practical use cases considered. My guide is "how would this be used in an RPG game" over "what is the mathematical core of a proper multiset"

It could return an Optional, to signify that there is no valid substraction (e.g. “you are missing some of the items required to craft this recipe”)

I think one guiding principle here would be to try to ensure that algorithms written for SetAlgebra generally behave reasonably when used with a CountedSet. There is a heavier focus on (correct) generic algorithms in Swift than there was in Objective-C, so the design might end up in a different place to NSCountedSet. Or perhaps CountedSet shouldn’t conform to SetAlgebra at all?

Yes, I like the practical example of an RPG inventory to anchor the proposal. I’m not sure it points to set-like being more useful though. Questions like “Does this user have enough of each ingredient to make this potion?” [subset] and the examples from my previous post (“Add all the items from this chest to your inventory” [addition], “subtract all the items you sold to the store” [subtraction]) are more counted-like.

Union and intersection are trickier to know what to do with. My assumption is that sticking to the mathematical definitions would preserve more of the semantics of SetAlgebra. “Add the arrows from this quiver to that quiver” seems more naturally covered by addition than union (though there is a potential for confusion because they are essentially the same for Set). If I understand “What arrows and magnitude appear in both” correctly, then this is the multiset definition of intersection (the minimum count over all elements from both sets).

Sure, a count is required for SetAlgebra, but which definition? As I said, I think defining it as the cardinality of the multiset is more compatible with the semantics of SetAlgebra, if conformance to SetAlgebra is a goal.

The set-like vs counted-like divide that you mention is a good way to think about it. If it’s the latter then we can probably just take the multiset definitions directly, but if it’s the former then we have to think about what happens to the counts during every SetAlgebra operation. Are they added together? Subtracted? Max? Min? Just preserved from self? I’m not sure yet how to make that decision in a principled way, and I’m not convinced that there’s always a right answer that makes sense for all uses of CountedSet.

Perhaps CountedSet itself could follow multiset definitions, but provide a Set view of the elements so you can also easily do set-like operations when you want to ignore the counts (e.g. “Does this user have at least this core group of arrows in his quiver”).

1 Like

In an RPG, “union” could be used for things like “does the party collectively have the skills to open this door”. Maybe it needs someone with at least 5 ranks in deciphering runes, someone with at least 15 points in search, someone with at least 10 points of lockpicking, and someone with at least 20 strength, but they don’t have to be the same person.

So you take the union of everyone’s stats (using the “max” definition), and check whether the requirements are a subset (using “≤”) of the result.

However, I think the real questions is, what would people expect when they read the code. If “union” is likely to be a major point of confusion because people expect it to add the counts, then that is a strong argument for giving it those semantics, or using a different spelling for the “max” version.

2 Likes

Sounds like a consensus is forming that there be “counted” and “set” versions of the set operations. How would these be named? Don’t forget there are two variations for each operation: mutating and non-mutating, so you must name a total of four method for every overlapped intent.

We can certainly consider adding this type to Foundation/swift-corelibs-foundation. I don’t think we need a separate library dedicated to collections. Between the standard library and Foundation we can find a home for the critical ones.

2 Likes

Also, cc @Ben_Cohen since we are having an ongoing discussion about how to split up some of these types.

That’s one question, though I think the ability to use generic algorithms for SetAlgebra is another one, at least if it’s decided that a CountedSet should conform to SetAlgebra. The multiset definitions seem mostly safe to use generically as sets (because they behave essentially like you have a set where multiple items are distinguished by a sequential index, i.e. {a, a, b} represented as a set by {a₁, a₂, b}) but I fear the ad hoc nature of the alternative definitions (sometimes ignoring counts, sometimes adding them, sometimes breaking commutativity by retaining the counts from the lhs, etc) makes it hard to understand what you can safely do with them from a SetAlgebra perspective.

I had some time to poke around for CountedSet equivalents in other programming languages. Here are some quick notes:

  • C++ multiset: Maintains a sorted list of all items using a custom comparison function, rather than counting them. The count is the cardinality of the multiset (not the number of unique elements), and their set intersection method for multisets appears to behave like a mathematical multiset.
  • Java Bag (Apache Commons): Looks like it also maintains all inserted items, instead of counting them. Makes a mess of things, with most methods seemingly violating the semantics of Collection and tagged “(Violation)”, which is what I’m trying to avoid with SetAlgebra. size is the cardinality of the multiset (sum of all counts), uniqueSet returns the set of unique elements (like my suggestion above for a Set-like view that ignores counts).
  • Java Multiset (Google): This one is actually counted. size is the multiset cardinality, provides an elementSet view without the counts, doesn’t appear to have union or intersection operations.
  • Objective-C NSCountedSet: The starting point for this proposal. The behaviour of this seems all over the place (and not well documented). I played around with it in a playground for a while and it seems like intersection ignores counts entirely, just intersecting the sets of unique elements and setting the counts to 1 (?). union also behaves weirdly, it seems to ignore the counts in the second NSCountedSet and just insert one of each unique element from it. e.g.
    [a: 2, b: 1, c: 1].union([a: 3, b: 1]) = [a: 3, b: 2, c: 1]. I don’t think NSCountedSet is a good example to emulate.
  • Python collections.Counter: Dictionary with counts. “Addition and subtraction combine counters by adding or subtracting the counts of corresponding elements. Intersection and union return the minimum and maximum of corresponding counts” i.e. the standard multiset definitions. Counts can be negative for some reason.

I guess my conclusion is that some languages provide a multiset where all the elements are actually stored, rather than counting them (I guess because sometimes elements considered “equal” might be distinct in other ways). Some languages have multisets that you can’t do set operations on, or can only do a subset of them. Of languages that do have set operations, Python has the most reasonable implementation to me (if you ignore that counts can be negative), and the Objective-C behaviour seems very idiosyncratic in ways that don’t seem particularly useful, let alone correct.

Another option to mixing this functionality together and coming up with distinct names, mentioned at the end of my previous post and with precedent in several of the languages above, would be to provide a Set/SetAlgebra view of the Elements with non-zero count, so you could do operations that only depend on the presence or absence of Elements. Perhaps you could also (or instead) provide some targeted APIs that answer questions like “does this CountedSet contain at least one of each of the elements in this Sequence.

6 Likes

Ooh, negative counts. My interest is piqued!

2 Likes

Changed count and added elementCount:

  /// Return the number of unique elements in the counted set
  public var elementCount: Int { return _counts.keys.count }
  
  /// Return the number of unique elements in the counted set
  public var count: Int {
    return _counts.keys.reduce(0, {
      accumulated, key in
      accumulated + _counts[key, default: 0]
    })
  }

The remaining set operations and what I plan to change:

  • Contains: Right now I have it contain a key. I do not have it contain a (key, count). No change.
  • ==, isSubset, isSuperset: Right now, these only compare keys, not counts.
  • isDisjointWith: keys only
  • union/formUnion: add counts
  • intersection/formIntersection: intersect keys and choose the minimum value for the two counts.
  • subtracting/subtract: subtract counts, removing any count that falls to zero or below

Any objections to these? I know someone mentioned negative counts but I need some real world context as to why this would be allowed. It would throw a monkey wrench into my “membership means non-zero” model.

My gut says there shouldn’t be an “elementCount” property, but rather follow the precedent set by Dictionary (and String with its character views) and expose the elements themselves:

var elements: Set<T> {
    return _contents.keys
}

Then, if people want the number of distinct elements, they can retrieve that by elements’ own count property:

let count = countedSet.elements.count

Similarly, I think we should expose counts as it’s own property too:

var counts: [Int] {
    return _contents.values
}

let counts = countedSet.counts

(Just like Dictionary uses LazyMapCollection, CountedSet should probably have its own collection types for these properties, but I’m just using Set and Array for now)

HOWEVER, if we choose to define counts as “sum of all elements’ counts”, we’re in a very strange position. Specifically, what should happen when iterating through CountedSet? I would expect iterating would return (element, count) tuples, but then count does not match the number of items returned by iteration, which I think is a violation of the Collection protocol requirements, but even if it isn’t, it’s still very strange.

I’m not sure what to do about this if count is the sum of all elements’ counts, so despite the fact that I think this definition of count makes more sense, given this constraint I think count should instead return the number of distinct elements in the CountedSet.

Furthermore, I don’t think CountedSet should bother exposing a “sum of all elements’ counts” property in the first place. I’m not sure what use it really has beyond checking if the CountedSet is empty (which is still fulfilled with isEmpty), but if someone did still want this information, they can calculate relatively easily it like so:

countedSet.counts.reduce(0, +)
1 Like

Scala has gone through major collections rewrite in version 2.8 and has landed in an exemplary place that is well worth emulating. Looking at documentation of the root package, it’s fair to say scala.collection is their :crown: :gem:. Not only because it comes with all batteries included, but it is extremely modular, which allows you to build your own collections on top of all the generic algorithms quite easily.

Though multisets are not included, there is this Bag implementation that’s worth a look at least for inspiration (@jawbroken ?).

1 Like

To me this is still a strange mix of functionality from a multiset and a set that happens to have some auxiliary count information. ==, isSubset, isSuperset, isDisjointWith ignore counts. union is addition, not multiset union. intersection is multiset intersection. I find this pretty confusing and would have to consult the documentation every time to understand what happens with the counts. It’s also still very unclear to me which algorithms that are generic in SetAlgebra will break with this model. I definitely agree that there’s no reason for counts to be negative.

Yes, this is what other languages call the uniqueElements view, which is very useful for doing set operations ignoring the counts.

This is a good point, but if we’re thinking about the CountedSet as a multiset then the obvious answer is that iterating over it directly would produce just a sequence of Elements, and the elements would appear in the sequence with the same multiplicity as their count. This is what the Python implementation does, which to me is the example to follow here (if we pretend they didn’t implement negative counts):

elements() Return an iterator over elements repeating each as many times as its count. Elements are returned in arbitrary order.
>>> c = Counter(a=4, b=2, c=0, d=-2)
>>> list(c.elements())
['a', 'a', 'a', 'a', 'b', 'b']

You would probably want to also provide a separate iteration view that produced (element, count) tuples, but I think this is the sensible option if you want reasonable Collection and SetAlgebra conformances. To me, a CountedSet is a fundamentally a collection of elements that is efficiently stored and allows set-like queries to be answered efficiently, not a collection of (element, count) tuples.

Thanks for the pointer. Looking quickly at the tutorial for this:

Operation Notes
Relationship to Set Doesn’t relate to Set/SetAlgebra (not a sub- or supertype), I think largely because they chose an incoherent definition of union.
Iteration Provides multiple iteration options: iterator, distinctIterator, bucketsIterator
++, -- Addition and subtraction of counts
subset Multiset definition (this.count ≤ other.count)
union Actually addition (this.count + other.count) which is also provided as ++. Only definition that isn’t the multiset one.
maxUnion Multiset union (max(this.count, other.count)
intersection Multiset definition (min(this.count, other.count))

I still advocate for:

  • Following the multiset definitions for the operations, which provides a simple and learnable mental model (it’s just a set that can contain multiple of the same element) and makes most or all generic SetAlgebra algorithms work for CountedSet.
  • Including addition and subtraction operations that just add and subtract counts (but don’t go negative).
  • A view that exposes a SetAlgebra conforming type containing just the unique elements, probably called .uniqueElements or similar.
  • count for the CountedSet being the multiplicity of the multiset (sum of all element counts).
  • Iteration/Sequence/etc producing a sequence of elements with each element repeated to match its count.
  • A view that exposes a sequence of (element, count) pairs, which I haven’t thought of a good name for yet.

But I’m open to changing my mind on these aspects, especially if they cause issues with protocol conformances and generic algorithms, or they turn out to not be ergonomic enough.

1 Like

:thinking: It seems that many multiset-like collections from other languages have a bunch of idiosyncrasies. I’d venture a guess they were driven by pressures from varying type hierarchies in their host languages. Therefore the most honest direction to me would be to follow the strict, coherent and best studied definition from mathematics. Would you agree?

I’m going to table this proposal for the moment:

  • Following Apple’s NSCountedSet model seems problematic.
  • Scala’s collection model is a much bigger design task.
  • I don’t personally want to implement or push forward on a “studied mathematical” definition.

If you decide to pick up the proposal and run with it, I’ll be happy to contribute further opinions but I will not continue with an implementation or write-up.

Thank you everyone for your feedback. I hope a counted set of some kind and/or a full collection model is proposed and adopted.

2 Likes

When I read the above, I get a feeling you were discouraged by my pedantry (based on points 2&3). If so, I’d like to apologize (again) and add some context to my post.

Your approach to grounding the emerging design in tangible use case and especially your excellent choice of easily relatable example from RPG gaming are the best. This approach leads to internally consistent result, which is IMO the most important goal.

During this process naturally come moments where multiple reasonable definitions are possible and we need to make choices. For them to not be arbitrary, we can look at prior art in other languages, informed by priceless summaries @jawbroken has contributed.

Lastly, and only if prior art from other languages shows no consensus, we should step out of the programming domain and ground our choice in mathematics. I too, dislike to pursuit of purity for its own sake and prefer pragmatic solutions.

I’ve pointed out to Scala’s example several times recently in the forum to illustrate an alternative approach, when the shortcomings of current Swift collection type hierarchy emerged as roadblocks. I take no joy in this, as it seems we’re in a final push towards ABI stability and will have to live with status quo forever.

2 Likes

I’m sure that if anyone’s pedantry was discouraging here then it was mine, not yours. I’ll see if I can find the time to do some work on this, because I think that a CountedSet or similar would be a valuable addition.

2 Likes

I'm aware this has been closed for a year bit I find it a very useful API to have and have a usable implementation and it seems a shame if it's not used more widely.

I used the name Multiset (in line from the maths instead of the given CountedSet)

My implementation is at: https://github.com/adahus/Multiset
With fairly complete documentation at: https://adahus.github.io/Multiset/Structs/Multiset.html

I've been using this for some time in various bits of code and haven't needed to edit for quite a while.
I based the implementation on Swift's existing Set and thought fairly carefully about changes to the API from that - i.e. just when it was needed because of the differences between Set and Multiset.

It includes:

  • Fairly complete range of initialisations including literals
  • Inserting & Removing elements in various ways
  • 3 ways to iterate (Through all elements, through groups of like elements, just the distinct elements)
  • Sequence & Collection Operations
  • Conformity to SetAlgebra

I think there's at least one test for every feature but it's by no means an exhaustive set of tests.
I haven't had the time to check out the complexity and/or document with Big O notation.

Realistically I haven't got much time to give to this but thought I'd share to see if this might be of use.

7 Likes
Terms of Service

Privacy Policy

Cookie Policy