SE-0270 (Review #2): Add Collection Operations on Noncontiguous Elements

dabrahams · January 15, 2020, 10:45pm

nnnnnnnn:

dabrahams:
Why is the first method's first parameter a RangeSet? Its algorithm will work with any representation of a set of a collection's indices. In fact, all it needs is a predicate that tells us whether the element at a given index should be gathered.

This isn't really true — there's going to be a significant difference between performing a gather when using a RangeSet to represent the indices vs a predicate. In particular, using a predicate for the common cases of an empty range, a single element, or a single range of elements will require iterating over the entire collection, while an algorithm that is designed to work with a RangeSet will be able to perform the gather with a single rotation, or to skip the operation entirely. These aren't Big-O differences, but they are differences for the most common intended usages of this method.

Ahh… well, now this comes down to what we mean by “the algorithm.” I view what you're doing here as composing two generic algorithms—gather and rotate—to create a specialized version of gather for RangeSet. This specialization is almost certainly an optimization for empty RangeSets and probably also for those containing everything in a single range. Nothing wrong with that: probably there should be a version of the algorithm specialized for RangeSet, eventually. I realize this is highly subjective, but I just can't get over the feeling that shipping it without the generic gather and rotate is backwards. If we just implement the generic algorithms, they're useful right now for RangeSet, Set<Index>, and predicates. The specialization for RangeSet is nice-to-have, but I consider the fundamental generic algorithms essential.

Me neither, and to boot, I don't think it's a sensible concept. We don't really have an optimization strategy when you get beyond a single contiguous range, do we? That suggests that being sorted isn't even important. But even if we did have such a strategy, there's not enough in such a concept to justify its existence as a protocol; we'd just want an algorithm that accepted a Collection of ranges, with the precondition that it was sorted. Thinking through how this concept (and the more-general PredicateSet) would play out is what convinced me to propose the predicate version of gather instead of pushing for preotocols.

Instead of just saying you don't agree with it, maybe you could explain what's wrong with the binary predicate approach. I haven't heard anything that convinces me there's a better form for the “one true generic gather.” Build as many specialized/convenience implementations as you like on top of it, but AFAICT, it's still the one.

Now we're coupling separate concerns again.

How about we start by making it unnecessary to build a RangeSet if you already have a set of indices, and see if people need it, and then really think about how to do this right. It seems to me that if anything, the method that does that construction should be on Collection rather than RangeSet, since a RangeSet of indices is potentially relevant to every Collection but there are many RangeSets that are completely unrelated to any Collection.

No; since a Range<T> is not necessarily Hashable, you can't always form that type. So to make a generic set of any Range type, you'd build it just like your RangeSet, as a sorted data structure, but it would semantically be a set of Range<T> rather than a set of T.

But I also think this is some kind of logical fallacy I don't know the name of. The fact that you can't ensure nobody will ever misinterpret a name doesn't mean you shouldn't try to head off the really easy misinterpretations:

Adding an extra word to the name RangeSet also doesn't meaningfully disambiguate these things — if someone is confused that a RangeSet is literally a set of ranges, couldn't someone else reasonably be confused that a RangeBasedSet is that same thing?

This sort of thing is always a matter of degree. Sure, someone else could reasonably be confused, but it's a whole lot less likely. The word “based” is a good cue that Range is playing a different role than Element, because—among other things—RangeBasedSet is a worse name for a set of ranges than RangeSet. If nothing else, it's a prompt to ask yourself, “what is the word ‘based’ adding here?” and then be motivated to find out what the name means.

That statement reveals a fundamental difference in the way we're understanding most of the naming questions here. From my point of view, the APIs here are only “Range-related” to the same extent that Array<T>.indices is: Range is just a participating type and says nothing about the semantics or role of the range instance in the API.

Even if there were no Collection protocol forcing us not to encode the specific type in the name indices, we would never have called it Array<T>.range because we have a more-specific relevant and meaningful thing to say about it than simply reflecting its type. In the same way, all of the ranges participating in these APIs play a role of contiguous set of element values.

So it's about the roles things play. If you're going to argue that rangeSet.insert(contentsOf: someRange) is more understandable (and discoverable, but that is much lower priority) than rangeSet.formUnion(someRange), then I think you need to explain why we shouldn't have someSet.insert(contentsOf: someOtherSet). I could totally be convinced that's how we should spell that operation, but I cannot imagine any good argument for spelling these two APIs, which have the same semantics, inconsistently.

Sorry, I don't understand this argument. The weirdness of OptionSet aside, It seems to rely on a false equivalence.

dabrahams:
RangeSet([0..<10, 10..<20])
Doesn't look symmetric with the previous one, and seems to imply we're building a set of ranges. Should it be:
RangeSet(union: [0..<10, 10..<20])
I'm not opposed to your point about adding a label to the parameter, though I don't think it's strictly necessary. init(unionOf:) or init(merging:) both seem appealing.

I don't think it's “strictly necessary” either. The question is, what is the best API, and why? If you're not sure, let's talk through the advantages and disadvantages of each one.

I'm sorry, I must be missing something. That argument seems to be based on the idea that your design is already out there in wide use or something? If there's anything extant that could set an expectation, it's the prior art that semantic union operations on sets include the word “union” in their names.

This again reflects our fundamental differences with regard to naming. We wouldn't name an API Array<T>.ints just because it returns a collection of Int. The fact that this API returns a collection of Range is not an argument for calling it ranges, as far as I can tell. See also my reply to John.

As I've been trying to say, we don't already describe a range of positions in a collection by simply saying “range.” There's no precedent in Swift for talking about “a range of a collection” as far as I know, and to use it that way would implicitly establish a new term of art. I'm not attached to region, and actually, we already have a term of art for this: “subrange.” We intentionally used that word instead of range when naming removeSubrange because it is clearer:

var rs = [1..<2, 3..<4, 5..<6]
rs.removeRange(1..<2)     // if we had made the wrong naming choice
print(rs.contains(1..<2)) // prints "true"!

I understand the rationale behind the name, and I know it's new-ish to RangeReplaceableCollection, which is why I feel somewhat more comfortable challenging its use as a precedent for RangeSet—less extant material depends on it than something that's been around for a few releases. The problem with using “all” to indicate that more than one element may be removed is that it even more strongly implies that all the elements will be removed. removeElements(where:) would do the job without encouraging the wrong interpretation. This is obviously not something to do lightly, but if we are to make a change, sooner is better than later and I wanted to raise it, but perhaps it should go to a different thread.

Yeah, I'm not sure we've hit on the answer yet either. Other ideas might spring from words like “arrange” or “condense.”