RFC: Deque: should it have an opaque index?

lorentey · April 7, 2021, 10:10am

I don’t think self-slicing should be an option for any collection other than:

SubSequence types, where the potential risk of a memory leak is low — it’s quite obvious that values of these may be tiny slices in a huge collection.
Cases where the collection value does not own its storage (like UnsafeBufferPointer) — in these cases, the memory leak simply isn’t possible.

I expect I’ll start to draw up some guidelines at some point in the package documentation — but I’m a very unreliable oracle. (These discussions are extremely valuable for hashing out / refining some of the details that will probably end up there.)

The problem of course is that there are no hard and fast rules; beyond the “formal” protocol requirements (which I’m sad to admit are sometimes ignored when inconvenient — hopefully we’ll be able to fix some of these), there are only opinions, sometimes backed by hard-earned experience, sometimes less so. (For example, I was very strongly convinced that opaque indices are always the right way to go (and was quite vocal about it too), until I tried to sit down and actually use the resulting implementations in the first cut of what became the collections package. And even then I kept trying to make it work for weeks and weeks, because I was so invested in the idea.)

Given all this, sitting down and thinking deep thoughts about what might be the most sensible design for an Ideal Collection Implementation seems far less productive an idea to me than… well, sitting down and actually implementing more collection types. Incidentally, just the three collections we currently have in swift-collections already highlighted several areas where the existing collection hierarchy might be in need of some work:

Deque (and my old B-tree work) underscore the need for a better way to handle piecewise contiguous collections, especially when it comes to copying data between them. (I have a reasonable-looking idea to extend IteratorProtocol to replace/fix the (unpleasantly limited, underscored) Sequence._copyContents requirement we have now.
We probably need a PermutableCollection protocol so that ordered collections can take advantage of generic algorithms that belong in the shuffle/sort/partition family.
We may want to introduce a RangeRemovableProtocol to capture the parts of RRC that can be implemented by ordered collections. (This may be of limited usefulness in general, though.)
We will need some sort of dictionary protocol. Or not! We’ll see clearer when we have more than two dictionary types.
It would be nice to have a UniqueCollection protocol, providing the semantic requirement that elements are unique. (Set, Dictionary, Dictionary.Keys, OrderedSet, OrderedSet.UnorderedView, OrderedDictionary, OrderedDictionary.Elements (plus their subsequence types) might be enough examples to justify the protocol.)
SetAlgebra does not seem very good for anything but Set and OptionSet. Maybe it will work out for SortedSet. However, we may want to consider adding a lightweight “lookupable” protocol that simply adds stricter upper bounds on the complexity of firstIndex(of:)/lastIndex(of:)/contains(_:).

Imagine how many more ideas we’d have if we added three more collections. (E.g., what if we made a serious attempt at implementing a SortedBag? How would it fit in the protocol hierarchy?)

Most of these ideas can and should be prototyped in package form. (Although some are more difficult than others to fully develop this way — IteratorProtocol additions in particular might be too annoying to do that way.)

As for popFirst:

EDIT: aargh, I wrote a whole section thinking about removeFirst, and calling it popFirst. It’s been a long day. I removed it and replaced it with something more relevant:

First of all, popFirst should be a RangeReplaceableCollection requirement (or at least an extension method expressed in terms of removeFirst). I think it is a fairly obviously a bug that it isn’t, whether or not it was an intentional choice.

Second, because popFirst was omitted from RRC, there are no requirements about its behavior. Set and Dictionary certainly make no attempt to emulate the generic extension’s indexing invalidation in their own implementations. Someone could implement a collection type where popFirst returned the tenth element, changing nothing — and they wouldn’t violate or break anything but common sense.

That said, in generic contexts over Collection where Self == SubSequence, we can rely on the fact that popFirst resolves to the stdlib’s extension method that is (currently) expressed in terms of slicing. However, this isn’t true in cases where we’re calling popFirst on a concrete type.

Additionally, the default popFirst does not document what indices it invalidates, so technically callers should assume that it may invalidate all of them — like any other under-documented mutation operation. (Note: this is just a pedantic observation; we obviously wouldn’t risk breaking code by changing the stdlib’s current implementation in an incompatible way, unless we find a very good reason to do so.)

With all that said, I think I agree that it would probably be rude for a self-slicing type to shadow the standard popFirst with different semantics for no good reason. But no requirements apply here, so even the flimsiest of excuses can be considered good enough reason to do so. (And, of course, this doesn’t apply to Deque’s popFirst, because Deque isn’t self-slicing, so its implementation isn’t shadowing anything. (Self-slicing generally isn’t a good idea anyway. )

Remarks:

Beware, SubSequence == Self does not imply that popFirst has O(1) complexity — this does not follow from slicing/indexing requirements. It’s sometimes a reasonable assumption, but it’s just that. For example, the best promise we can make about Substring.popFirst() is that it takes O(n) time, where n is the size of the substring’s UTF-8 encoding.

The rules of index invalidation vary wildly between data structures. There are no universal rules, each data structure is a unique, beautiful butterfly that should be allowed to fly wherever its wings take it. I recommend implementing Array-like indexing for most cases of RandomAccessCollection, because I’d very much prefer if we’d at least kept the trivial cases consistent — and in RAC’s case there is usually no point in doing anything more complicated.