RFD: CountedSet

Erica_Sadun · April 9, 2018, 5:11pm

I'm looking to take the community temperature about adding CountedSet to Swift Foundation/Standard Library. This is a Swift native version of NSCountedSet with the following preliminary design:

/// An unordered collection of unique hashable elements that
/// may appear more than once in the collection.
///
/// Each distinct value inserted into an `CountedSet`
/// has an associated counter. `CountedSet` keeps
/// track of the number of times each value is inserted.
/// That value must be removed the same number
/// of times.
///
/// Like `Set`, each `CountedSet` contains one entry for
/// each value, even when a value has been added multiple
/// times. The `count` method for both `Set` and `CountedSet`
/// returns the total number of values stored in each set.
/// A `CountedSet` does _not_ return the total number of times
/// each value is represented in the set.
///
/// New counted sets can be constructed from arrays.
/// Assign an array literal to a variable or constant
/// of the `CountedSet` type.
///
///    let quiver: CountedSet<Arrow> = [Arrow.iron, Arrow.elven, Arrow. elven]
///    if let count = quiver[.iron] {
///        print("There are \(count) iron arrows in your quiver")
///    }

Counted sets (also called bags, see CFBag) are often used in game development for tracking player inventories. They are useful for any task where one keeps count of unique values in a collection. They provide support for standard set operations (subsets, contains, insert, remove, union, intersection, etc). They can be iterated over (without guarantees of order), providing (value, count) tuples. Some function, likely a subscript, returns the number of times a value has been inserted into the set.

The type is easily implemented by wrapping a dictionary.

What are your thoughts about adding this type to Swift?

felix91gr · April 9, 2018, 6:34pm

It kinda sounds to me like it belongs to Foundation. I like the API and I think it serves a use case that should be addressed, but somehow it feels like it should be outside of the stdlib...? I dunno. I don’t really feel strongly that way either.

But I would add it. +1

GarthSnyder · April 9, 2018, 7:07pm

For what it's worth, I've needed this data structure dozens of times over the last year. And the dictionary-based stand-in, although a one-liner to create, is neither efficient nor readable:

var dictOfCounts = Dictionary(sequence.map({ ($0, 1) })) { $0 + $1 }

Later use of the dictionary is a mess, too. To delete one copy of an object, for example, you have to check whether it's the last copy, and if so, delete the record from the dictionary entirely. (Either that or live with dictionary entries that have 0 copies.)

So yes: it's a common need with a not-so-great workaround, and I think it should be in the stdlib somewhere. My only caveat is that I'd like to see it framed in more general terms. See some of the other recent discussions about sets: the major axes of variation seem to be that they (allow duplicates, don't allow duplicates) and (are unordered, are ordered by the user, have an intrinsic order set by a rule). I don't see why all possible combinations shouldn't be allowed.

nuclearace · April 9, 2018, 9:06pm

There have been a lot of talks recently about adding different collections. Someone else mentioned that maybe we need a dedicated framework for these that would ship with the default toolchain, and I'm inclined to believe that that is a good idea. Open-Source Foundation is a bit of a mess IMO.

Erica_Sadun · April 10, 2018, 1:54am

I'd like to see a Grand Unified Collection but is it pragmatic and are there examples from other languages that provide models for Swift to follow?

felix91gr · April 10, 2018, 3:25am

C# has one

jawbroken · April 10, 2018, 3:31am

I would like to see something to provide this functionality, because while you note that this can be easily implemented by wrapping a Dictionary, it's fairly onerous to reimplement all the protocol conformances by forwarding to the underlying Dictionary.

I will say, without necessarily suggesting that it's a better option, that there is an alternative here that is more flexible and precedented in other languages: a Dictionary type where you can customise the behaviour of looking up an unknown key. In Ruby, for example, the default Hash (their name for Dictionary) behaviour is to return nil for unknown keys, but there are two other initialisers that can change this. The first one takes an object to return instead of nil. The second one takes a closure that is provided the Hash instance and the unknown key as parameters, so you can provide more custom behaviour.

A DictionaryWithDefault type in Swift that provided one or both of those options could be useful. Then a CountedSet could be emulated using a Dictionary that defaults to returning 0 for unknown keys, so you could insert using inventory[item] += 1 and remove similarly. This can be useful in other situations, like defaulting to returning an empty Array so you can blindly append to it without scattering the checks for if the key is already present throughout your code.

riley · April 10, 2018, 7:07am

As of Swift 4.0 (I believe), this is actually possible :)

dictionary[key, default: 0] += 1

As for the idea of CountedSet, I agree we need a struct cousin to NSCountedSet, but IMO it should be in Foundation, not the standard library (I don’t believe it’s more important than Data, for instance, yet Data is just in Foundation)

jawbroken · April 10, 2018, 9:17am

That's not really the same. The default I mentioned is provided in the initialiser so you don't have to write the default value at every single usage site throughout your program. Imagine the refactoring job if you then needed to change the default.

xwu · April 11, 2018, 9:38pm

I agree with @felix91gr and @nuclearace that this sounds useful and appropriate for Foundation and/or another dedicated collection library. I wonder if a similar design process to that of OrderedSet right now would be a good approach, and maybe we can get these two value types into Foundation at the same time.

Erica_Sadun · April 11, 2018, 10:44pm

I decided that I liked a counted set returning a @discardableResult for insertions and removals. Thoughts?

Here's where I stopped for now: CountedSet.swift · GitHub

jawbroken · April 12, 2018, 2:36am

Thanks for the concrete code example, it is really helpful and raises some interesting questions in my mind. To confirm, is a CountedSet essentially supposed to model a multiset, where the count is the multiplicity of each element? Presuming that is the case, and it's not some other object entirely, here are the questions that I had while reading your gist:

Should count return the number of unique elements, or the sum of all the counts? The former would be the cardinality of the support, while the latter is just the cardinality. Should both of these be exposed in some way?
How should a subset defined? It looks like the code currently checks for equality between counts, but the multiset equivalent would be inclusion, which checks ≤ instead, e.g. {a, b} is included in {a, a, b}.
Similar question for union and intersection, which are also defined here for multisets. Your code currently defines the union as what would be called the sum of multisets (adding the counts together), while the mathematical definition takes the maximum of the counts. Your intersection currently removes all the elements that have count 0 in the other CountedSet and leaves the other counts untouched (which is probably an issue because it isn't commutative), while the mathematical definition is the minimum of the counts.
Subtracting just removes all the keys present in the other CountedSet. Should there instead be addition and subtracting operations (and operators?) that add and subtract the counts? This seems useful (e.g. add all the items from this chest to your inventory, subtract all the items you sold to the store) and more in line with the definition of sum for multisets. Subtraction opens the question of what do to when the count goes below 0 though (throwing probably isn't possible because of SetAlgebra, should it trap or just go to 0?).

My current thoughts are:

count should be the sum of all the counts because I believe this makes more semantic sense when used in combination with the multiset operation definitions (e.g. the sum of two multisets has a count that is the sum of their counts). The number of keys (or the cardinality of the support) should be exposed under another name.
Subset, union and intersection should have the mathematical definitions, along with addition and subtraction.
If subtraction results in a negative count it should just silently be set to 0. It's a little tempting to make this a programmer error and trap, but generic algorithms on SetAlgebra expect to be able to use subtract(_:) without checking that other is a subset by silently ignoring elements that aren't present.

I haven't checked carefully if these definitions would be compatible with generic algorithms on SetAlgebra, though.

Erica_Sadun · April 12, 2018, 2:55am

I stole a lot of the docs / design (not all but most) from NSCountedSet

I was thinking of how this would be used more than what is the "correct" way. I originally was going to go with exactly your way and then decided in practice, if someone were combining things or seeing if a set had a "subset", they'd probably ignore the counts. So that's the way I went. The questions I were imagining were like "Does this user have at least this core group of arrows in his quiver?" so I went more set-like than counted-like. I also omitted proper subset and proper superset because I couldn't figure a reason they'd be used with counted sets. I did not include "same elements ignoring count" but I think I'd want to add that for practical reasons on similar lines.

Union I assumed was "Add the arrows from this quiver to that quiver". Intersection: "What arrows and magnitude appear in both". I think I might have forgotten to do the - on the count for intersection.

Again practically inspired and I"m not sure I love what I came up with, but "Get rid of any arrow that appears in that other quiver".

I may be trying to go too far with the set operations.

As for your other thoughts, count is by Apple. I can add sumCount or something distinct. I'm not happy with any of the mathematical definitions without practical use cases considered. My guide is "how would this be used in an RPG game" over "what is the mathematical core of a proper multiset"

felix91gr · April 12, 2018, 3:38am

It could return an Optional, to signify that there is no valid substraction (e.g. “you are missing some of the items required to craft this recipe”)

jawbroken · April 12, 2018, 3:51am

I think one guiding principle here would be to try to ensure that algorithms written for SetAlgebra generally behave reasonably when used with a CountedSet. There is a heavier focus on (correct) generic algorithms in Swift than there was in Objective-C, so the design might end up in a different place to NSCountedSet. Or perhaps CountedSet shouldn't conform to SetAlgebra at all?

Yes, I like the practical example of an RPG inventory to anchor the proposal. I'm not sure it points to set-like being more useful though. Questions like “Does this user have enough of each ingredient to make this potion?” [subset] and the examples from my previous post (“Add all the items from this chest to your inventory” [addition], “subtract all the items you sold to the store” [subtraction]) are more counted-like.

Union and intersection are trickier to know what to do with. My assumption is that sticking to the mathematical definitions would preserve more of the semantics of SetAlgebra. “Add the arrows from this quiver to that quiver” seems more naturally covered by addition than union (though there is a potential for confusion because they are essentially the same for Set). If I understand “What arrows and magnitude appear in both” correctly, then this is the multiset definition of intersection (the minimum count over all elements from both sets).

Sure, a count is required for SetAlgebra, but which definition? As I said, I think defining it as the cardinality of the multiset is more compatible with the semantics of SetAlgebra, if conformance to SetAlgebra is a goal.

The set-like vs counted-like divide that you mention is a good way to think about it. If it's the latter then we can probably just take the multiset definitions directly, but if it's the former then we have to think about what happens to the counts during every SetAlgebra operation. Are they added together? Subtracted? Max? Min? Just preserved from self? I'm not sure yet how to make that decision in a principled way, and I'm not convinced that there's always a right answer that makes sense for all uses of CountedSet.

Perhaps CountedSet itself could follow multiset definitions, but provide a Set view of the elements so you can also easily do set-like operations when you want to ignore the counts (e.g. “Does this user have at least this core group of arrows in his quiver”).

Nevin · April 12, 2018, 3:38pm

In an RPG, “union” could be used for things like “does the party collectively have the skills to open this door”. Maybe it needs someone with at least 5 ranks in deciphering runes, someone with at least 15 points in search, someone with at least 10 points of lockpicking, and someone with at least 20 strength, but they don’t have to be the same person.

So you take the union of everyone’s stats (using the “max” definition), and check whether the requirements are a subset (using “≤”) of the result.

However, I think the real questions is, what would people expect when they read the code. If “union” is likely to be a major point of confusion because people expect it to add the counts, then that is a strong argument for giving it those semantics, or using a different spelling for the “max” version.

Erica_Sadun · April 12, 2018, 4:09pm

Sounds like a consensus is forming that there be "counted" and "set" versions of the set operations. How would these be named? Don't forget there are two variations for each operation: mutating and non-mutating, so you must name a total of four method for every overlapped intent.

Tony_Parker · April 12, 2018, 8:10pm

We can certainly consider adding this type to Foundation/swift-corelibs-foundation. I don't think we need a separate library dedicated to collections. Between the standard library and Foundation we can find a home for the critical ones.

Tony_Parker · April 12, 2018, 8:12pm

Also, cc @Ben_Cohen since we are having an ongoing discussion about how to split up some of these types.

jawbroken · April 13, 2018, 3:52am

That's one question, though I think the ability to use generic algorithms for SetAlgebra is another one, at least if it's decided that a CountedSet should conform to SetAlgebra. The multiset definitions seem mostly safe to use generically as sets (because they behave essentially like you have a set where multiple items are distinguished by a sequential index, i.e. {a, a, b} represented as a set by {a₁, a₂, b}) but I fear the ad hoc nature of the alternative definitions (sometimes ignoring counts, sometimes adding them, sometimes breaking commutativity by retaining the counts from the lhs, etc) makes it hard to understand what you can safely do with them from a SetAlgebra perspective.

I had some time to poke around for CountedSet equivalents in other programming languages. Here are some quick notes:

C++ multiset: Maintains a sorted list of all items using a custom comparison function, rather than counting them. The count is the cardinality of the multiset (not the number of unique elements), and their set intersection method for multisets appears to behave like a mathematical multiset.
Java Bag (Apache Commons): Looks like it also maintains all inserted items, instead of counting them. Makes a mess of things, with most methods seemingly violating the semantics of Collection and tagged “(Violation)”, which is what I'm trying to avoid with SetAlgebra. size is the cardinality of the multiset (sum of all counts), uniqueSet returns the set of unique elements (like my suggestion above for a Set-like view that ignores counts).
Java Multiset (Google): This one is actually counted. size is the multiset cardinality, provides an elementSet view without the counts, doesn't appear to have union or intersection operations.
Objective-C NSCountedSet: The starting point for this proposal. The behaviour of this seems all over the place (and not well documented). I played around with it in a playground for a while and it seems like intersection ignores counts entirely, just intersecting the sets of unique elements and setting the counts to 1 (?). union also behaves weirdly, it seems to ignore the counts in the second NSCountedSet and just insert one of each unique element from it. e.g.
[a: 2, b: 1, c: 1].union([a: 3, b: 1]) = [a: 3, b: 2, c: 1]. I don't think NSCountedSet is a good example to emulate.
Python collections.Counter: Dictionary with counts. “Addition and subtraction combine counters by adding or subtracting the counts of corresponding elements. Intersection and union return the minimum and maximum of corresponding counts” i.e. the standard multiset definitions. Counts can be negative for some reason.

I guess my conclusion is that some languages provide a multiset where all the elements are actually stored, rather than counting them (I guess because sometimes elements considered “equal” might be distinct in other ways). Some languages have multisets that you can't do set operations on, or can only do a subset of them. Of languages that do have set operations, Python has the most reasonable implementation to me (if you ignore that counts can be negative), and the Objective-C behaviour seems very idiosyncratic in ways that don't seem particularly useful, let alone correct.

Another option to mixing this functionality together and coming up with distinct names, mentioned at the end of my previous post and with precedent in several of the languages above, would be to provide a Set/SetAlgebra view of the Elements with non-zero count, so you could do operations that only depend on the presence or absence of Elements. Perhaps you could also (or instead) provide some targeted APIs that answer questions like “does this CountedSet contain at least one of each of the elements in this Sequence.