RFP: OrderedSet

After thinking about this some more, I agree there are important use cases for API like this:

orderedSet.insert(3, at: 0)

But I think this is the wrong name. This should be an ordering API, not an insertion API.

I think there's value in roughly dividing OrderedSet APIs into two subgroups: one for insertions and deletions, and one for orderings. I'd expect there might be cross-over cases, but establishing subgroups helps for semantic consistency.

So, for insertions and deletions, OrderedSet would use the words insert and remove, just like Set, along with update and the various operations from SetAlgebra like union and subtract. The semantic rule would be something like: unless otherwise specified, elements Equatable-equal to existing elements are silently ignored, and elements actually inserted go to the end of the current ordering. You would not normally have an index parameter with this kind of API.

For orderings, there would be API to rearrange objects by index, and the above example would be an ordering API, something like:

orderedSet.order(3, at: 0)

which would mean "find the element in the set that is Equatable-equal to 3, and move it to index position 0; if no such element exists, insert the first parameter value, and move it to index position 0". Other ordering APIs would generally have an index or a range of indexes as a method parameter.

I think that makes the semantics comprehensible. If you're doing an ordering operation, any insertions implied by the operation are handled in a standardized default way. If you're doing an insertion or deletion operation, any reorderings implied by the operation are handled in a standard default way.

2 Likes

I think I like this approach. I'm not certain I love the spelling of order(_:at:) but that's something we can bike-shed as we go along and fill in the rest of the APIs. Over-all, this approach falls into the third option listed above: "Do something entirely different that I haven’t thought of yet" :wink:

Reordering implies just moving the same object but I'm not sure that works with the point you made about equal but different objects.

There are useful 2 cases I can think of:

  1. Add objects to Set, keep the first one -> ignore new objects.
  2. Add objects to Set, keep the most recent one -> remove old and append new object.

Maybe we should consider 2 methods for each use case with very obvious names to avoid confusion.

The concept was that the value you passed in as a parameter is an "ID" (a way of specifying) the element that's already in the set — the ID tells you which element to move.

There are 2 cases, but it's not clear they're both useful, and it's not obvious that completeness would be a good justification either. Some actual use cases might help clarify.


I've been delving into the protocols today, and I think the semantics of OrderedSet are constrained (in some ways) by the un-desirability of introducing some radically new behavior into the standard library:

  • It seems to me that OrderedSet needs to have the same behavior as Set, which is to say it must conform to Collection and to SetAlgebra. (Apart from initializers, I don't think Set has any behavior except what comes from these and the other protocols it conforms to, although I didn't check very carefully.)

  • It seems to me that the ordering behavior that OrderedSet would need is already pretty much covered by MutableCollection and RangeReplaceableCollection. Other ways of handling reordering are possible, but I think re-using these two protocols is going to be a win.

Of course, OrderedSet can't conform to MutableCollection and RangeReplaceableCollection without some adjustments. I think it can be done with just the following rules:

  1. It can conform to MutableCollection if we give its subscript setter a precondition: the existing element identified by the Index parameter must be Equatable-equal to the value of the newValue parameter. (So, the subscript setter is basically the same function as Set's update.) Since this precondition will crash if not met, the subscript setter is not extremely useful on its own, but it enables the other useful API of MutableCollection such as swapAt. That is, the MutableCollection conformance gives the ability to rearrange existing elements in an OrderedSet's order.

  2. It can conform to RangeReplaceableCollection if there is a convention about what happens to inserted values already in (i.e. Equatable-equal to something in) the OrderedSet. For consistency with Set, I think the rule has to be that values Equatable-equal to elements in the OrderedSet are ignored, and the insertion or replacement just uses the rest of the values passed in.

That's about all it takes, AFAICT, and we'd have a very functional OrderedSet API. The only other piece I can think of is the one we were talking about: forcing the arrangement of the OrderedSet at a given index to match a sequence of one more values. To be concrete, let's call it restructure, rather than insert or order, with the understanding that the actual name can be chosen later.

The API here would be something like:

func restructure (with value: Self.Element, at index: Self.Index)
func restructure (with values: Sequence<Self.Element>, at index: Self.Index)

(Well, Sequence isn't generic, but you get what I mean, I hope.)

The use-case for this sort of API is a MRU/LRU cache, where you want to promote an entry to (or towards) the beginning or end when it's used in some way. As I said before, I think it makes more sense to keep the existing set element, if there is one, and only use the parameter value as a way of finding it — unless it isn't present, when the parameter value is actually inserted. The other case — forcibly replacing the existing element — can easily be done as a remove followed by an insert, and isn't likely to be much more inefficient done that way.

Dictionary and Set have element-manipulating methods, but they’re custom because of the needed restrictions. They only conform to Collection since they can’t support the general manipulations of MutableCollection and RangeReplaceableCollection.

Is this the accepted pattern (interfaces with same spelling but no direct conformance) in the stdlib for this situation?

Can we not meet the complexity guarantees for RandomAccessCollection?

I think this pattern works from a usability perspective (you probably already know the API) but it does prevent usage of generic algorithms. Maybe that's ok as a way to signal that it's close but not quite on the conformance to the protocol.

Missing out on improvements to the protocols would be a loss, but I think for such a fundamental type we have to be absolutely sure that we're 100% conforming with all of the expectations of conformance, including things not enforced by the compiler like complexity.

@Ben_Cohen or @lorentey may have some additional thoughts on this point.

Whether or not OrderedSet can/should conform to RandomAccessCollection impacts the kind of implementation that's used. For example, if it needs to conform, then the implementation should probably use an array to store the order, which has impacts on add/removal performance. If it's not a requirement, then the order could be stored in a non-contiguous format, like a linked list or a sparse array, which might have better performance for add/removal (but worse for non-sequential traversal).

I don't think either of these types can have MutableCollection support, since the algorithms written for MutableCollection assume that updating a value at a certain index (a) doesn't change anything else about the collection (length, etc) and (b) doesn't invalidate any other indices. This assumption isn't documented anywhere, but is present in the algorithms. Thanks for this discussion!

2 Likes

Yes—this is typical. For example, Dictionary and Set both define remove(at i: Index), which matches the RangeReplaceableCollection requirement, even though they can't conform to RRC.

Agree with all Nate said.

Additionally, I don't think it would be useful to conform something to MutableCollection with a precondition like that. If you think about the use case for conforming to a protocol such as MutableCollection, it is so that you can use it with a variety of different generic algorithms. With this precondition, there would be almost no useful algorithms for MutableSet you could even use with it.

The precondition as given is probably a little too strong: with some more implementation complexity you could weaken it to "you must not update an entry to have the value that already exists elsewhere in the collection". This would let you swap elements around in the collection (very useful).

But then when given a type that has a precondition like that, how do you know which algorithms are ok and which not? You would have to theorize about the underlying implementation of the algorithm to figure it out.

For example, a reasonable implementation of remove(where:) that relied on MutableCollection might either 1) swap elements around, or 2) just copy kept elements down over the top of removed ones. The "no duplicates" precondition would be OK with 1), but violated by 2). But you have no way of knowing which is used inside the algorithm.

Actually, it wouldn't. To swap two elements via assignment to a subscript, you'd have to go through an intermediate state where one of the elements was present at two different subscripts. You couldn't remove one while setting the other, because MutableCollection has no API for removing elements from a collection. (But MutableCollection has swapAt anyway.)

Actually, it wouldn't be able to do (2), for the same reason as above. MutableCollection has no API for moving anything down, and trying to do it via sequential assignment would be trying to introduce duplicates temporarily.

This really ought to be documented somewhere!


I understand that providing a settable subscript for MutableCollection conformance isn't palatable, and (based on what Nate and Ben said) probably not viable, so let me try falling back a little:

Both MutableCollection and RangeReplaceableCollection have two fundamental methods, one that does insertions (settable subscript and insert, respectively), and one that rearranges without inserting (swapAt and remove, respectively).

I think it's viable for OrderedSet to provide the functionality of the non-inserting APIs, plus its own unique API for insertions (that may be similar to, but is not constrained by, existing Collection-related protocols).

From that point of view, it's useful to think of MutableCollection and RangeReplaceableCollection each being split into two more primitive protocols:

protocol MutableCollection: ElementsSwappableCollection, ElementsAssignableCollection { }
protocol RangeReplaceableCollection: RangeRemovableCollection, RangeInsertableCollection

(For now, this is a thought experiment, not a proposal. Methods not mentioned above would go into one or other of the primitive proposals according to what methods their default implementations depend on, or would go into the composed protocol if they depend on both.)

With that breakdown, OrderedSet would conform to Collection, ElementsSwappableCollection, RangeRemovableCollection as well as SetAlgebra, and would have ad-hoc API for insertions (and hence, in the case of ranges, replacement).

That would allow OrderedSet to have standard APIs for swapping and removing elements, along with any related API that depends only on those (maybe some sorts, maybe filter, etc).

It may also be acceptable to have OrderedSet conform to RangeInsertableCollection, if a constraint like the one I mention earlier (values already in the set are silently ignored rather being inserted, which is more or less the current Set semantics) is viable. That gives the rest of RangeReplaceableCollection for free.

The wins here are:

  • Almost all of the OrderedSet API would be familiar.

  • Only a few new APIs (things like my earlier restructure example) would need to be invented.

I would strongly, strongly hope that the best natural implementation of OrderedSet would have data structures based on hashing like Set, and random access data structures like Array, in which case complexity guarantees for RandomAccessCollection should be achievable.

That's to answer the criticism of NSOrderedSet that I mentioned earlier, that it doesn't perform very well. I think faster is better than smaller, for this particular implementation.

While this is completely understandable from a timing/planning/compatability perspective, I think it’s really unfortunate that we’re going to be stuck with a subpar interface to Core Data. Is there nothing that can be done with the compiler for compatibility, a la bridging peepholes? Even just establishing the Bridgeable relationship but not rewriting to use it in imported interfaces?

Terminology question: do you mean (an implementation of) multisets?

In order to preserve source compatibility, we certainly can't import NSOrderedSet as OrderedSet.

We should look at some way to do conversions though. One possibility is conformance to the bridging protocols, so you can as cast. That would be the most flexible mechanism, so that if you receive an Any from ObjC API, and it's backed by an NSOrderedSet, you can freely cast it to OrderedSet, same as other bridging mechanisms.

3 Likes

I didn't want to open a new thread just for one question so I post it here into the original thread.

Would it make sense to tackle a whole family of Set types (OrderedSet, CountedSet, SortedSet) - maybe even a SortedArray? All of these types can improve our code dramatically if standartilized.

1 Like

I talked about this a bit with @Ben_Cohen and we'd like to split up the discussion as follows:

Sorted and friends: separate thread, perhaps a larger concept of Sorted-ness as a protocol.

Ordered: this type.

One thing that may help with the potential confusion between the two is giving OrderedSet a name that more directly describes its purpose. One brainstorming idea we had was ArraySet.

2 Likes

In the Java world sets that retain their insertion order are called linked sets and they don't have a seperate protocol they conform to the normal set protocol. Whereas sorted sets do have a seperate protocol. It would be great if in Swift sets had a common set protocol and a seperate sorted protocol, something like:

            Sequence
               |
           Collection
               |
             Unique
               |________ __________
               |        |          |
              Set    LinkedSet  Sorted
                                   |
                               SortedSet

PS Have used Swift terminology; in Java a protocol is an interface and some of the names are not exactly as I stated.

That matches collections in Scala that have following type hierarchy:
image
These are Traits, roughly equivalent to Protocols in Swift. The concrete implementation of SortedSet is TreeSet which is a Red-Black Tree data structure.

Then there is LinkedHashSet a mutable Set implementation offering iterator that maintains the insertion order.


Also forgot to mention before, that @lorentey already quite thoroughly explored the design space and implementation of ordered set. See his 20 min dotSwift 2017 talk, GitHub and excellent book.

2 Likes