SE-0270 (Review #2): Add Collection Operations on Noncontiguous Elements

We chose the name "gather" because that's what this algorithm is called in C++. However, I think a name based on "move" is nice and approachable.

We originally included the "shift" family of algorithms in this proposal in anticipation of this feedback. Gathering and shifting are both types of moving, so I think it behooves us to anticipate these algorithms as well when considering the name for gather:

public mutating func shift(
    from range: Range<Index>, to insertionPoint: Index
) -> Range<Index>

public mutating func shift(
    from source: Index, to insertionPoint: Index
) -> Index
1 Like

Dave — thanks so much again for your detailed review and substantive feedback. I'm going to respond point-by-point to your messages here.

This isn't really true — there's going to be a significant difference between performing a gather when using a RangeSet to represent the indices vs a predicate. In particular, using a predicate for the common cases of an empty range, a single element, or a single range of elements will require iterating over the entire collection, while an algorithm that is designed to work with a RangeSet will be able to perform the gather with a single rotation, or to skip the operation entirely. These aren't Big-O differences, but they are differences for the most common intended usages of this method.

There is a generalization possible — this kind of optimization doesn't require a RangeSet per se, but a type that can vend a collection (or possibly a sequence?) of sorted, nonoverlapping ranges. A protocol that describes such a type would more or less codify the implementation of RangeSet into a protocol, however, and it's not clear to me what other types would usefully conform.

There is definitely a place for a predicate-based version of gather. While I don't agree that the (Element, Index) -> Bool version that you suggest as the "one true gather" can do everything we need, it does point to a serious issue with extending the use of the standard library's typical (Element) -> Bool predicates to the realm of mutating collection methods. If this is indeed a problem to solve, we're already in trouble, as we've added removeAll(where:) with only an element-based predicate, and don't have a way of working with indices at all.

For this reason, I think it makes sense to remove the gather(at:where:) method from this proposal and look at finding the right way to handle this, for gather, removeAll, and any other mutating collection methods we might add in the future.

As an additional note, your examples of how convoluted the code is to convert a Set of indices into a RangeSet are quite illustrative. That is certainly a common operation that the type should support directly, likely by bringing back the RangeSet initializer that takes a sequence of indices and the collection that they index.

Happily, ExpressibleByArrayLiteral conformance was dropped from the implementation that accompanied the revised proposal; I erred in not removing it from the proposal itself. Sorry about that!

I disagree with the reasoning expressed here. There's no need to give a set of ranges another name, as it has already has a name: Set<Range<T>>. We don’t need to make room for a hypothetical typealias, particularly when that alias would contradict the meaning of RangeSet in other languages and libraries.

Adding an extra word to the name RangeSet also doesn't meaningfully disambiguate these things — if someone is confused that a RangeSet is literally a set of ranges, couldn't someone else reasonably be confused that a RangeBasedSet is that same thing? It feels like this discussion is trying to hypothesize about what could be confusing and what could resolve that confusion, relying on the single name of the type to do all the work. Instead, let's look at how the type will be used in context and look at (and improve, if necessary) the documentation for the type and for its API.

I think this approach leads to a less discoverable, less understandable set of methods for common operations. It is true that you can think of a single Range as a contiguous set of indices, but that doesn’t require that we shift all Range-related API to treat them as sets. After all, you can apply the same reasoning to a single element of a Set, but we don't add individual members to a set by calling mySet.formUnion("new element").

I'm not opposed to your point about adding a label to the parameter, though I don't think it's strictly necessary. init(unionOf:) or init(merging:) both seem appealing.

As stated above, I'm not keen on this renaming. The type needs methods for working with individual ranges, and renaming them in pursuit of consistency with SetAlgebra would hide them from people who are looking for methods they recognize.

contiguousSubsets is a step in the wrong direction — the elements of that collection are individual ranges, not sets, even if you can construe them in that way. This feels like an over-emphasis of one somewhat tortured perspective on the fact that a single range can be used semantically as the base case for a RangeSet.

I worry that the medicine of adding this method to Collection is worse than the disease it was intended to cure. This may just need to be cut. I'm definitely not looking to add a new term of art at this point — coming up with a term like "region" isn't necessary to describe something that we already describe as a "range".

removeAll(where: p) is more or less a brand-new method — I think we'd need to show actual harm being caused by the name before we could consider renaming. IIRC the "all" part of the name is to indicate that more than one element may be removed, explicitly distinguishing it from remove(at:) and remove(Element).

Reading more about scatter and gather, there is a bit of an unfortunate overlap in naming. I'm not totally convinced that any of these specific alternatives wins outright for me. move and moveElements feel a bit problematic, because of those words' relationship to moving from unsafe memory and Swift's future move semantics, but I'm not sure those problems are disqualifying. I do like Kyle's proposed shift name, since it works well for single elements, single ranges, and discontiguous elements. The conflict I know of is with that name Javascript's shift/unshift, but we already have names for those operations.


Thanks again for the considered review — I really appreciate the time and seriousness you've committed to this proposal!

5 Likes

I agree with Nate's assessment that extra words add clutter without clarity. I'd add another example for why RangeBasedSet isn't a better solution:

Swift currently has a type Set, which happens to be hash-based. If we'd gone a different route, maybe we'd have an ordered tree-based set called Set, or maybe both, and a reasonable name for the current one would be HashSet. It could be argued (probably would be argued, on this forum) that this is ambiguous. Is it a set of hashes? Maybe it should be called HashBasedSet to avoid that possible confusion. I don't think such a suggestion would be credible, and I feel similarly about RangeSet.

2 Likes

Reading this clarified for me a feeling I have, which is that we took a wrong turn during the review of this proposal.

@Nevin is correct in suggesting that the method should probably be made more flexible. A method on Collection that creates a single-element range at an index is really narrowly useful. Maybe it'd come in handy for other things, or maybe it would be useful for exactly one operation and that's adding an index to a RangeSet. Probably forming a range over n elements would be more useful – but I'm still not sure how useful.

What we have done is taken a situation where there's a clear use case: adding a single index to a proposed RangeSet type. And then we've argued ourselves into thinking that somehow the most obvious and convenient solution – a method RangeSet.insert(index:within:) (or whatever method name... not making a naming point here) – somehow doesn't conform to some conventions.

So instead, we've taken a narrow use case for a specific type, and proposed solving it by adding a dubiously-useful helper method onto Collection. But extensions on Collection are the most valuable real-estate in the standard library. The last thing we should be doing is adding barely-useful convenience methods to it.

It is possible that there is a more general use for creating arbitrary-length ranges of indices, and we should consider that as a proposed extension to Collection. But that addition must stand on its own merits. Convenience for RangeSet use would not be enough.

And even if that method is added, I still think for the sake of discoverability and usability, there should be a convenient method on RangeSet to insert a single index using a collection. Even if it can be trivially composed from elsewhere, it would be less fiddly to use that helper, and unlike methods on Collection, the real-estate on a single concrete type is much cheaper.

So my suggestion is we go back to the original proposed way of inserting a single element, and scrap the Collection method.

9 Likes

Ahh… well, now this comes down to what we mean by “the algorithm.” I view what you're doing here as composing two generic algorithms—gather and rotate—to create a specialized version of gather for RangeSet. This specialization is almost certainly an optimization for empty RangeSets and probably also for those containing everything in a single range. Nothing wrong with that: probably there should be a version of the algorithm specialized for RangeSet, eventually. I realize this is highly subjective, but I just can't get over the feeling that shipping it without the generic gather and rotate is backwards. If we just implement the generic algorithms, they're useful right now for RangeSet, Set<Index>, and predicates. The specialization for RangeSet is nice-to-have, but I consider the fundamental generic algorithms essential.

Me neither, and to boot, I don't think it's a sensible concept. We don't really have an optimization strategy when you get beyond a single contiguous range, do we? That suggests that being sorted isn't even important. But even if we did have such a strategy, there's not enough in such a concept to justify its existence as a protocol; we'd just want an algorithm that accepted a Collection of ranges, with the precondition that it was sorted. Thinking through how this concept (and the more-general PredicateSet) would play out is what convinced me to propose the predicate version of gather instead of pushing for preotocols.

Instead of just saying you don't agree with it, maybe you could explain what's wrong with the binary predicate approach. I haven't heard anything that convinces me there's a better form for the “one true generic gather.” Build as many specialized/convenience implementations as you like on top of it, but AFAICT, it's still the one.

:scream: Now we're coupling separate concerns again. :scream:

How about we start by making it unnecessary to build a RangeSet if you already have a set of indices, and see if people need it, and then really think about how to do this right. It seems to me that if anything, the method that does that construction should be on Collection rather than RangeSet, since a RangeSet of indices is potentially relevant to every Collection but there are many RangeSets that are completely unrelated to any Collection.

No; since a Range<T> is not necessarily Hashable, you can't always form that type. So to make a generic set of any Range type, you'd build it just like your RangeSet, as a sorted data structure, but it would semantically be a set of Range<T> rather than a set of T.

But I also think this is some kind of logical fallacy I don't know the name of. The fact that you can't ensure nobody will ever misinterpret a name doesn't mean you shouldn't try to head off the really easy misinterpretations:

Adding an extra word to the name RangeSet also doesn't meaningfully disambiguate these things — if someone is confused that a RangeSet is literally a set of ranges, couldn't someone else reasonably be confused that a RangeBasedSet is that same thing?

This sort of thing is always a matter of degree. Sure, someone else could reasonably be confused, but it's a whole lot less likely. The word “based” is a good cue that Range is playing a different role than Element, because—among other things—RangeBasedSet is a worse name for a set of ranges than RangeSet. If nothing else, it's a prompt to ask yourself, “what is the word ‘based’ adding here?” and then be motivated to find out what the name means.

That statement reveals a fundamental difference in the way we're understanding most of the naming questions here. From my point of view, the APIs here are only “Range-related” to the same extent that Array<T>.indices is: Range is just a participating type and says nothing about the semantics or role of the range instance in the API.

Even if there were no Collection protocol forcing us not to encode the specific type in the name indices, we would never have called it Array<T>.range because we have a more-specific relevant and meaningful thing to say about it than simply reflecting its type. In the same way, all of the ranges participating in these APIs play a role of contiguous set of element values.

So it's about the roles things play. If you're going to argue that rangeSet.insert(contentsOf: someRange) is more understandable (and discoverable, but that is much lower priority) than rangeSet.formUnion(someRange), then I think you need to explain why we shouldn't have someSet.insert(contentsOf: someOtherSet). I could totally be convinced that's how we should spell that operation, but I cannot imagine any good argument for spelling these two APIs, which have the same semantics, inconsistently.

Sorry, I don't understand this argument. The weirdness of OptionSet aside, It seems to rely on a false equivalence.

I don't think it's “strictly necessary” either. The question is, what is the best API, and why? If you're not sure, let's talk through the advantages and disadvantages of each one.

I'm sorry, I must be missing something. That argument seems to be based on the idea that your design is already out there in wide use or something? If there's anything extant that could set an expectation, it's the prior art that semantic union operations on sets include the word “union” in their names.

This again reflects our fundamental differences with regard to naming. We wouldn't name an API Array<T>.ints just because it returns a collection of Int. The fact that this API returns a collection of Range is not an argument for calling it ranges, as far as I can tell. See also my reply to John.

As I've been trying to say, we don't already describe a range of positions in a collection by simply saying “range.” There's no precedent in Swift for talking about “a range of a collection” as far as I know, and to use it that way would implicitly establish a new term of art. I'm not attached to region, and actually, we already have a term of art for this: “subrange.” We intentionally used that word instead of range when naming removeSubrange because it is clearer:

var rs = [1..<2, 3..<4, 5..<6]
rs.removeRange(1..<2)     // if we had made the wrong naming choice
print(rs.contains(1..<2)) // prints "true"!

I understand the rationale behind the name, and I know it's new-ish to RangeReplaceableCollection, which is why I feel somewhat more comfortable challenging its use as a precedent for RangeSet—less extant material depends on it than something that's been around for a few releases. The problem with using “all” to indicate that more than one element may be removed is that it even more strongly implies that all the elements will be removed. removeElements(where:) would do the job without encouraging the wrong interpretation. This is obviously not something to do lightly, but if we are to make a change, sooner is better than later and I wanted to raise it, but perhaps it should go to a different thread.

Yeah, I'm not sure we've hit on the answer yet either. Other ideas might spring from words like “arrange” or “condense.”

3 Likes

I have questions similar to @nnnnnnnn's about the generic gather algorithm:

  • Should the predicate (Element, Index) -> Bool also include the offset, i.e. (Element, Index, Int) -> Bool?
  • Is this predicate specific to gather or are there other algorithms that need to change their predicate?
  • Why is the closure-based gather preferable to one that accepts something like a PredicateSet restricted argument?

But more importantly, I don't see why we would block this proposal on resolving these questions if we already agree that there should eventually be a specialized version of gather for RangeSet.

As a general rule, starting with the concrete case, and generalizing it subsequently, is very much the appropriate way around.

Generic code harms compile time and client code size, leads to performance cliffs when it can't be specialized, and even when specialized isn't always as efficient due to the need to generalize the implementation. So implementing something generically into the standard library needs a higher justification that it'll be useful beyond a more concrete implementation.

In either case, the common concrete case will need to be implemented within the standard library so that it can be outlined and dispatched to to avoid the potential downsides of the generic code.

This is not an apt comparison, because the word "range" has more meaning in this context than the word "int". That is, when I refer to "a range within an array", it's pretty clear what I mean: the portion of the collection between two indices. Whereas if I said "the ints of an array matching the predicate" when what I meant was "the indices" people would have no idea what I meant.

I disagree. People use this terminology all the time.

Personally I think the word "sub" is a needless word fragment in removeSubrange and if it were proposed today I'd advocate stripping it. I think this example is a contrived and uncompelling argument for it. It is just not reasonable to interpret the above code as removing the first element and I don't think this would ever happen in practice.

Yes, if I have an array of ranges, then the term "range" becomes ambiguous when talking about that specific case. This kind of thing happens in life. We should not litter extra distracting words about the place, at all use sites, worsening the readability of code in most cases, because of the rare possibility of ambiguity and even rarer possibility of confusion when combining specific element types with method names.

For comparison, here is a similar performative misunderstanding for removeSubrange:

var rs = `[1..<5, 1..<10]`
rs.removeSubrange(1..<3)
// surely rs now contains `[3..<5, 3..<10]`

Therefore the method should be called removeRangeOfIndices.

1 Like

Please note that—unless I've overlooked something in this thread—none of @nnnnnnnn's questions about the generic gather algorithm have been revealed to me yet :wink:, so thank you for listing yours…

  • Should the predicate (Element, Index) -> Bool also include the offset, i.e. (Element, Index, Int) -> Bool ?

This is a good point. The offset isn't strictly needed for generality because the algorithm can give a guarantee about the order of traversal and the user can maintain their own counter. Of course, that wouldn't work for a parallel version of the algorithm, but since we're not in a position to implicitly parallelize anything, that would require a different method signature anyway. If we really don't know what the generic algorithm should look like, I agree, it isn't time to add it. I'm not sure whether we're there yet.

These other two, I can answer more definitively:

  • Is this predicate specific to gather or are there other algorithms that need to change their predicate?

The usefulness of a binary (or ternary, as you point out) predicate formulation is specific to mutating algorithms that do ≤O(n) mutations and have a useful unary predicate form. We don't currently have any of those in the standard library, and I don't know of any such algorithms in general other than gather and [stable] partition (on which gather is built)—oh, well, there's the new remove(where:) that is built on partition. It has to be ≤O(n) mutations because otherwise the algorithm would have to pass index positions and offsets of elements that it had already been moved by the algorithm, which would be meaningless to the author of the predicate. It has to be a mutating algorithm to be useful because otherwise the caller could just zip together the elements and indices and/or offsets, and operate on that.

  • Why is the closure-based gather preferable to one that accepts something like a PredicateSet restricted argument?

I've already made reference to that in this thread. The short answer is that it doesn't pull its own weight. If you think about what a PredicateSet protocol would look like, it basically has one method.

protocol PredicateSet {
    associatedtype Element
    func contains(_ e: Element) -> Bool
}

This is nothing but a predicate in another form; hardly a good excuse for adding a new protocol. And now, every algorithm that can accept a unary predicate really ought to have an overload that accepts a PredicateSet argument… or you have to provide an adapter type that makes a predicate into a PredicateSet. On the other hand, if you already have a type with a PredicateSet API (but no PredicateSet protocol), it's really straightforward to call its contains method in a predicate; no adapter needed—closures are like adapters built into the language. I think I've shown elsewhere in this thread that even a hinted lookup is pretty natural to use with a closure.

Absolutely, you make a good point. That said, it's not an exercise you need to repeat in every language in which you build the generic algorithms. C++ has already done the work for us of understanding how to write many generic algorithms—for example, I don't think we erred by failing to start with non-generic versions of sort, index(where:), etc—but maybe the Swift-specific CoW issues that lead to the binary predicate formulation mean that stable partition is a different case.

But more importantly, I don't see why we would block this proposal on resolving these questions if we already agree that there should eventually be a specialized version of gather for RangeSet .

We shouldn't, I'm now convinced, thank you.

Let me go back to my original motivation for pressing this point. I consider the library's thin repertoire of generic algorithms to be a major missed opportunity, and really, an incomplete part of the original vision for the standard library.¹ I spent years unsuccessfully arguing for including a generic stable partition and rotate in the library. Now we're proposing to add a library component whose implementation requires a stable partition, whose optimized implementation requires a rotate—both of which are trivial to genericize—and I feel like I'm the only one talking about how/where the generic forms fit into the library. I hope you can understand why, on a purely non-technical level, that's hard for me to swallow.

The eventual shape of the generic algorithms does have a real bearing on how the specialized algorithms should look—you at least want to create some consistency. So, I'd be a lot happier if we as a group were giving them more attention in this discussion, but again, I agree, they're not a prerequisite for adding the specialized ones.

-Dave

šThis omission is costing Swift adoption: we know of important components that were implemented in C++ instead of Swift, one of the important reasons being that the C++ library had a complete suite of generic algorithms that were ultimately heavily used and greatly sped up development.

4 Likes

On this particular point, these generic algorithms are only excluded from the proposal so that it can be focused on a particular set of APIs — they're absolutely still on the hit list of what we want to add. There are some issues relating to rotate, in particular, that we need to iron out, but there's no reason a proposal for them couldn't be parallel to this one.

1 Like

Apologies for not being more clear! My concern about the binarily-predicated gather centered only on its ability to fully replace the one that's optimized for a collection of subranges; I wasn't trying to say that it shouldn't exist at all. Beyond that, I believe that it's a good design, but it needs to be looked at with the rest of the standard library (including removeAll(where:), as you mentioned), an implementation, a pitch, etc, so that we're not discussing something that's only been presented as part of a single forum post, while trying to evaluate the proposal at hand.

1 Like

It isn't actually clear what you mean, because:

  • you could be referring to a range of values in the array
  • it could be an array of ranges, which is where ”a range within the array” becomes really confusing.

But more importantly, we're not even talking about an API on the collection whose indices are being stored. In fact, such a collection may not exist, because this is an API on RangeSet, which has nothing to do with collections. RangeSet<Float> is a really useful type all on its own (and Collections with Float indices are pretty rare). The same rationale—that there may not be any Collection involved, also argues against subranges for this case—which is why contiguousSubsets remains the best name I've seen yet.

Even if I bought into the idea that “range within a collection” was a well-understood concept, that would still make ranges as inappropriate a name as indices is for a property of RangeSet that is not also a Collection and may not even be related to any Collection. So perhaps my comparison could have been more apt, but making it so only hurts the case for the name ranges.

I'd like to see some evidence to back up that assertion, please. I've heard “range of elements in a collection” and “range of positions in a collection” but never just “range of a collection.”

You're missing the point; this is about comprehensibility of code for someone who doesn't necessarily already know what they're dealing with. If I am reading unfamiliar code, it's very likely I don't know what the element type of a given collection is. When I read x.removeRange(1..<2), and I don't know the standard library APIs well, and I don't know the type of x, what am I going to assume that code means? The obvious interpretation is that x is a thing that contains ranges, and I'm going to remove one.

Now you could say removeSubrange suggests that x is a thing that contains “subranges,” and that's true, but the problem is not as bad because Subrange is not a thing that you will have encountered in the library and may even recognize as the argument to the method. It would be reasonable to claim that removeSubrange does not improve the problem enough to be worth the added syllable. I don't know how to judge that; by any objective measure those three characters are very small penalty to pay—but it's still always possible we could have done better; that's not really the point. The question is really whether it's worth “sweating the details” when it comes to naming.

I'm going to remind everyone again, but in a different way, of how we got here. The process that resulted in adding “sub” to that name was the product of about a year of really hard work and thinking about how to name APIs by a group that included me, @Douglas_Gregor, @Chris_Lattner3, @Tony_Parker, and several others. Then @gribozavr, @moiseev and I worked on the specific changes, got them reviewed by the aforementioned group, and took them through an extended review on this list. That is to say, it wasn't just me, there were principles driving these changes, a substantial group of smart people worked really hard to develop, scrutinize, and ratify them with the community, with the intention to set a precedent for how things should be done—and especially, which things matter when considering naming—in the future. One of the first conclusions (actually precedented in the Cocoa guidelines) was that it's important to have some principles to guide naming choices. Another important conclusion while we would try to avoid names that “feel weird,” surface aesthetics take a back seat to communicating semantics and avoiding easy misinterpretations.

Now, it's no longer my job to defend the design principles of the standard library. Even if it were, I'm not in a position to stop the community if it wants to break with the precedents set in 2016. The questions are a). what are you going to replace those old principles with? and, b). are you going to change the existing names made on the old principles, or are you just going to things differently going forward?

9 Likes
  • The first parameter of replaceSubrange(_:with:) is documented as "The subrange of the collection to replace."
  • The first parameter of removeSubrange(_:) is documented as "The subrange of the collection to remove."

Exactly! subrange, functioning as a term of art.

Furthermore I don’t think it’s been particularly successful at being adopted outside of our documentation.

So I guess subrange is a term of art meaning "range of a collection"?

Nice try;-). It means “range of (consecutive) elements in a collection.” That’s what gets removed/replaced.

1 Like

Folks, please take this to the new review thread.