SE-0270 (Review #2): Add Collection Operations on Noncontiguous Elements

This might not be a huge proposal, but imho the topic itself is - I think even two reviews aren't enough to discuss all implications. I'm not saying we need more review cycles, though: Imho this is a perfect candidate for SE-0264 (review #2) — Standard Library Preview Package...

I agree with @bzamayo that it's odd to talk about ranges all the time, while the real code is actually focused on single indices. Take Collection.range(at:) as an example - this could as well have an additional length parameter, which imho would also make it significantly more useful in other contexts besides this proposal.

Another interesting example is public func ranges(of element: Element) -> RangeSet<Index>.
The natural result of this operation is a set of indexes, and using ranges here looks like an (possible) optimization which leaks out.
On the other hand, there is a very similar functionality which naturally produces ranges, and which I consider to be a big gap in Swift:
We have https://developer.apple.com/documentation/foundation/nsstring/1412937-replacingoccurrences, but that's only part of NSString, not String.
Therefor, I'm puzzled that something like public func ranges<C: Collection>(of subsequence: C) -> RangeSet<Index> isn't even mentioned as a possible next step.

In the long run, I even see some overlap with [Prototype] Protocol-powered generic trimming, searching, splitting,, but I no idea how well the two concepts could work together.

My bottom line:
Improving the ease of use of collections is a crucial topic, and I really doubt that old procedure with small, simple steps will produce great results. Instead, I'd prefer a preview package with a complete vision of how to work with Collection - or even several competing approaches, so that we can compare the pros and cons.

2 Likes

If you only want one, then the index sequence version I proposed would be the one to keep, since you can simulate the predicate version with a prior index-testing pass. But that would be an extra explicit step.

I'm not sure the preview package addresses this problem. IIUC, it's only for things the standard library team thinks are ready to ship, not just for getting something out there to experiment with and seeing if it flies.

Is RangeSet Really an OK Name?

The argument in favor, IIUC, is that the generic argument to RangeSet tells you what it's a set of, so you'll see RangeSet<Int> and read it as “(range set) of Int.” I was prepared to be convinced by that argument, but today I started thinking about a similar, totally reasonable data structure: a set of ranges with generic bound type. What would we call that? “Probably RangeSet, I thought.” In fact, here's a typealias that does it:

typealias RangeSet<T> = Set<Range<T>>

Then a set of ranges of Int would be spelled RangeSet<Int> :frowning:

The fact that the name is equally (or maybe more!) appropriate for another useful type troubles me. When I get to this point, I look at how we'd disambiguate it, and evaluate the costs. Since RangeBasedSet is clearer and not unwieldy, I'm suggesting it as a replacement. I'm interested to hear the community's thoughts.

7 Likes

I agree. The name hasn’t felt right to me for this reason as well. I wasn’t able to think of anything better though.

This is the first name I’ve seen that is both completely clear and relatively concise. I like it.

1 Like

RangeSet treats Ranges as sets; should we name operations accordingly?

[Almost done with these posts; I have just one more on the stack—so far!]

Welcome back, @maven! Actually, this is logically a “set of indices”, or rather, a “set of anything comparable”. I know it can be confusing: here's the place where I finally realized it myself.

I had planned to let someone else answer this post, but I was just re-reading it and after refactoring your statement above I noticed you were saying Range<Bound> is a bit like Set<Bound>, which is an excellent point: Ranges are treated by RangeSet as contiguous sets of the Bound type. Once you realize that, a number of the APIs bear re-examination:

1.

    /// Creates a range set containing the values in the given range.
    public init(_ range: Range<Bound>)
RangeSet(0..<10)

LGTM; it's an information- and semantics- preserving construction of one set from another.

2.

    /// Creates a range set containing the values in the given ranges.
    public init<S: Sequence>(_ ranges: S) where S.Element == Range<Bound>

Hmm,

RangeSet([0..<10, 10..<20])

Doesn't look symmetric with the previous one, and seems to imply we're building a set of ranges. Should it be:

RangeSet(union: [0..<10, 10..<20])

? That doesn't read beautifully to me, but it's at least trying to address the issue. Open to other ideas.

3.

    /// Adds the values represented by the given range to the range set.
    public mutating func insert(contentsOf range: Range<Bound>)
        
    /// Removes the given range of values from the range set.
    public mutating func remove(contentsOf range: Range<Bound>)

Now that we have the contentsOf: label, these are less suggestive than they used to be of the idea that RangeSet is a set of ranges, but the use sites are now longer, and these APIs fail to acknowledge the set-like role played by parameter. The obvious alternatives are s.formUnion(someRange) and s.subtract(someRange), which are both consistent with the set role of the parameter and more compact at the use site. I used to be rather ambivalent about suggesting this change, but seeing the usage written out, it now looks like a no-brainer to me.

Full disclosure: changing these names as I've suggested implies the possiblility of a whole pile of related API, with usage like

let s1 = s.union(someRange)        // non-mutating
let s2 = otherRange.subtracting(s) // non-mutating, reversing order

etc. However, I don't think that implication obliges us to add any of those other APIs today. We should start conservatively and only expand when we have real use cases demonstrating a benefit.

4.

    public struct Ranges: RandomAccessCollection {
        public var startIndex: Int { get }
        public var endIndex: Int { get }
        public subscript(i: Int) -> Range<Bound>
    }
    
    /// A collection of the ranges that make up the range set.
    public var ranges: Ranges { get }

Seems to me contiguousSubsets is a better name for this, despite its length. This is not a name that will be uttered often, and where it is useful, IMO, there's usually a better way. For example, we should discourage

for r in s.ranges {
    for i in r {
        ... // something with i
    }
}

in favor of

for i in c.indices[s] {
    ... // something with i

}

because the latter is simpler, but also because the former is wrong with an arbitrary collection (the inner loop needs to be “for i in a.indices[r]”).

[Also, the property should be documented as being sorted by lowerBound and having non-overlapping elements.]

Conclusion

The proposed naming of these APIs may be a vestige of our initial failure to clearly answer the “what the heck is it?” question, and we should consider going all the way to reflecting a consistent view of the roles of RangeSet and Range. If we decide not to make the changes, we should at least understand why not.

7 Likes

Afaics the preview-proposal isn't even accepted yet - and I think we (well, not actually we in the sense that I'm included ;-) are free to change processes so that they serve us better.
I'm convinced that the current process makes it extremely hard to add big changes to Swift, and that it's time to have an alternative.
Btw, what about Function builders or _modify? Looking at those two, I think we just have to acknowledge that Evolution doesn't work as it used to do...

Has it ever really been the case that every change introduced since Apple started the evolution process has gone through a proposal? I see the process as providing 3rd-party input on proposed features, and a way to propose new features. Has it ever been the only way for Swift to receive new features?

Off the top of my head, I can’t think of anything that didn’t. Function builders and the other features introduced to support SwiftUI went up for review the day SwiftUI was announced, IIRC.

Those features are going to remain, and no review is going to change that. All we can do is shape the syntax and propose improvements. We can't provide any feedback that would cause Apple to remove them completely.

This leaning into the idea of RangeSet<Bound> being like a Set<Bound> is an interesting direction.

RangeSet(union: [0..<10, 10..<20]) seems like an improvement on the version with an unlabeled parameter. I think RangeSet(unionOf: [0..<10, 10..<20]) reads even better. Though I also wonder if this overload is needed at all. Apart from the new RangeSet.ranges, are there any other cases in the standard library where a sequence of Ranges is returned?

Similarly, given the type conversion enabled by public init(_ range: Range<Bound>) and the presence of the set operator methods, I wonder if insert and remove are necessary. That is, instead of:

s.insert(contentsOf: someRange)

it is equivalent, and perhaps clearer, to write:

s.formUnion(RangeSet(someRange))

At the least, this line of thinking makes clear that insert and remove are convenience API and must be justified on ergonomic grounds.

1 Like

Agreed.

Works for me…

These are good questions. There is an efficiency argument for the constructor, which can pre-allocate required space (though it may over-allocate depending on the input sequence). The methods combining RangeSet with plain Ranges could avoid allocations altogether if RangeSet didn't have inline storage for a single Range, but IIUC the inline storage is planned,

One of my general concerns with the proposal has been that it has always been a bit too liberal about committing to new API. Because the proposal is bigger than it needs to be, it's easy to miss things that warrant scrutiny. The more instances we find of API that isn't strictly needed, the more inclined I am to ask that the proposal be pruned back to the absolute minimum set of basis operations.

Miscellaneous Naming Issues

[This is the last of my separate threads].

  • a.range(at: i) is misleading; it strongly implies a is a collection of Ranges. The name I originally suggested to Nate was a.rangeOfElement(at: i), but I actually think a.regionOfElement(at: i) is more appropriate (see last bullet). This is an example where spending a few characters to avoid a potential misreading is totally worth it. Clarity is more important than brevity.

  • I've never been pleased with removeAll(where: p), because the commonality of the base name with removeAll(keepingCapacity:) strongly implies that these methods are a basic family. As such, it can easily be misread to mean “if p is satisfied, remove all elements.” The proposed method really means “remove (all elements where p is satisfied)” and is thus much more related to remove(at:) and Set's remove(value) method—which could be extended to RangeReplaceableCollection—so would be better named remove(where:) or even remove(allWhere:). We already have this API in RangeReplaceableCollection, but I believe the change could be made gradually without substantially disturbing the validity of extant code.

  • I'll repeat Chris' concern about gather conflicting with the naming of common scatter/gather operations from numerics. Aside from moveElements we could consider collect as a name but I'm just throwing it out there; not convinced it's better. FWIW, the argument against moveElements that goes “we avoid using ‘Elements’ in collection method names because it's redundant” is specious. We only avoid it when it's redundant, and a.move(…) would strongly imply that a is being moved, so it's not redundant here.

  • I'll repeat my concern about the use of the word “range” in many of the APIs here, because it drops semantic information: the relationship between the range and the collection. Without it, we could easily be speaking of ranges of element values. We're describing contiguous regions of the collection, or even of set element values.

Thanks for your patience with my many posts in this review, everybody.

10 Likes

Since I can't tell what's happening with this review, I guess I should be explicit and answer all the questions here:

I think it's solving an important problem, but I think the proposal suffers from its evolution from something that was too large and not fully considered. It should not be accepted without at least having some robust discussion of the many issues I've raised in this thread.

  • Is the problem being addressed significant enough to warrant a change to Swift?

Yes.

  • Does this proposal fit well with the feel and direction of Swift?

Mostly.

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

My previous experience is with the STL; it only overlaps a little with what's in this proposal.

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

An in-depth study; I hope that's obvious from my other posts.

1 Like

We chose the name "gather" because that's what this algorithm is called in C++. However, I think a name based on "move" is nice and approachable.

We originally included the "shift" family of algorithms in this proposal in anticipation of this feedback. Gathering and shifting are both types of moving, so I think it behooves us to anticipate these algorithms as well when considering the name for gather:

public mutating func shift(
    from range: Range<Index>, to insertionPoint: Index
) -> Range<Index>

public mutating func shift(
    from source: Index, to insertionPoint: Index
) -> Index
1 Like

Dave — thanks so much again for your detailed review and substantive feedback. I'm going to respond point-by-point to your messages here.

This isn't really true — there's going to be a significant difference between performing a gather when using a RangeSet to represent the indices vs a predicate. In particular, using a predicate for the common cases of an empty range, a single element, or a single range of elements will require iterating over the entire collection, while an algorithm that is designed to work with a RangeSet will be able to perform the gather with a single rotation, or to skip the operation entirely. These aren't Big-O differences, but they are differences for the most common intended usages of this method.

There is a generalization possible — this kind of optimization doesn't require a RangeSet per se, but a type that can vend a collection (or possibly a sequence?) of sorted, nonoverlapping ranges. A protocol that describes such a type would more or less codify the implementation of RangeSet into a protocol, however, and it's not clear to me what other types would usefully conform.

There is definitely a place for a predicate-based version of gather. While I don't agree that the (Element, Index) -> Bool version that you suggest as the "one true gather" can do everything we need, it does point to a serious issue with extending the use of the standard library's typical (Element) -> Bool predicates to the realm of mutating collection methods. If this is indeed a problem to solve, we're already in trouble, as we've added removeAll(where:) with only an element-based predicate, and don't have a way of working with indices at all.

For this reason, I think it makes sense to remove the gather(at:where:) method from this proposal and look at finding the right way to handle this, for gather, removeAll, and any other mutating collection methods we might add in the future.

As an additional note, your examples of how convoluted the code is to convert a Set of indices into a RangeSet are quite illustrative. That is certainly a common operation that the type should support directly, likely by bringing back the RangeSet initializer that takes a sequence of indices and the collection that they index.

Happily, ExpressibleByArrayLiteral conformance was dropped from the implementation that accompanied the revised proposal; I erred in not removing it from the proposal itself. Sorry about that!

I disagree with the reasoning expressed here. There's no need to give a set of ranges another name, as it has already has a name: Set<Range<T>>. We don’t need to make room for a hypothetical typealias, particularly when that alias would contradict the meaning of RangeSet in other languages and libraries.

Adding an extra word to the name RangeSet also doesn't meaningfully disambiguate these things — if someone is confused that a RangeSet is literally a set of ranges, couldn't someone else reasonably be confused that a RangeBasedSet is that same thing? It feels like this discussion is trying to hypothesize about what could be confusing and what could resolve that confusion, relying on the single name of the type to do all the work. Instead, let's look at how the type will be used in context and look at (and improve, if necessary) the documentation for the type and for its API.

I think this approach leads to a less discoverable, less understandable set of methods for common operations. It is true that you can think of a single Range as a contiguous set of indices, but that doesn’t require that we shift all Range-related API to treat them as sets. After all, you can apply the same reasoning to a single element of a Set, but we don't add individual members to a set by calling mySet.formUnion("new element").

I'm not opposed to your point about adding a label to the parameter, though I don't think it's strictly necessary. init(unionOf:) or init(merging:) both seem appealing.

As stated above, I'm not keen on this renaming. The type needs methods for working with individual ranges, and renaming them in pursuit of consistency with SetAlgebra would hide them from people who are looking for methods they recognize.

contiguousSubsets is a step in the wrong direction — the elements of that collection are individual ranges, not sets, even if you can construe them in that way. This feels like an over-emphasis of one somewhat tortured perspective on the fact that a single range can be used semantically as the base case for a RangeSet.

I worry that the medicine of adding this method to Collection is worse than the disease it was intended to cure. This may just need to be cut. I'm definitely not looking to add a new term of art at this point — coming up with a term like "region" isn't necessary to describe something that we already describe as a "range".

removeAll(where: p) is more or less a brand-new method — I think we'd need to show actual harm being caused by the name before we could consider renaming. IIRC the "all" part of the name is to indicate that more than one element may be removed, explicitly distinguishing it from remove(at:) and remove(Element).

Reading more about scatter and gather, there is a bit of an unfortunate overlap in naming. I'm not totally convinced that any of these specific alternatives wins outright for me. move and moveElements feel a bit problematic, because of those words' relationship to moving from unsafe memory and Swift's future move semantics, but I'm not sure those problems are disqualifying. I do like Kyle's proposed shift name, since it works well for single elements, single ranges, and discontiguous elements. The conflict I know of is with that name Javascript's shift/unshift, but we already have names for those operations.


Thanks again for the considered review — I really appreciate the time and seriousness you've committed to this proposal!

5 Likes

I agree with Nate's assessment that extra words add clutter without clarity. I'd add another example for why RangeBasedSet isn't a better solution:

Swift currently has a type Set, which happens to be hash-based. If we'd gone a different route, maybe we'd have an ordered tree-based set called Set, or maybe both, and a reasonable name for the current one would be HashSet. It could be argued (probably would be argued, on this forum) that this is ambiguous. Is it a set of hashes? Maybe it should be called HashBasedSet to avoid that possible confusion. I don't think such a suggestion would be credible, and I feel similarly about RangeSet.

2 Likes

Reading this clarified for me a feeling I have, which is that we took a wrong turn during the review of this proposal.

@Nevin is correct in suggesting that the method should probably be made more flexible. A method on Collection that creates a single-element range at an index is really narrowly useful. Maybe it'd come in handy for other things, or maybe it would be useful for exactly one operation and that's adding an index to a RangeSet. Probably forming a range over n elements would be more useful – but I'm still not sure how useful.

What we have done is taken a situation where there's a clear use case: adding a single index to a proposed RangeSet type. And then we've argued ourselves into thinking that somehow the most obvious and convenient solution – a method RangeSet.insert(index:within:) (or whatever method name... not making a naming point here) – somehow doesn't conform to some conventions.

So instead, we've taken a narrow use case for a specific type, and proposed solving it by adding a dubiously-useful helper method onto Collection. But extensions on Collection are the most valuable real-estate in the standard library. The last thing we should be doing is adding barely-useful convenience methods to it.

It is possible that there is a more general use for creating arbitrary-length ranges of indices, and we should consider that as a proposed extension to Collection. But that addition must stand on its own merits. Convenience for RangeSet use would not be enough.

And even if that method is added, I still think for the sake of discoverability and usability, there should be a convenient method on RangeSet to insert a single index using a collection. Even if it can be trivially composed from elsewhere, it would be less fiddly to use that helper, and unlike methods on Collection, the real-estate on a single concrete type is much cheaper.

So my suggestion is we go back to the original proposed way of inserting a single element, and scrap the Collection method.

9 Likes

Ahh… well, now this comes down to what we mean by “the algorithm.” I view what you're doing here as composing two generic algorithms—gather and rotate—to create a specialized version of gather for RangeSet. This specialization is almost certainly an optimization for empty RangeSets and probably also for those containing everything in a single range. Nothing wrong with that: probably there should be a version of the algorithm specialized for RangeSet, eventually. I realize this is highly subjective, but I just can't get over the feeling that shipping it without the generic gather and rotate is backwards. If we just implement the generic algorithms, they're useful right now for RangeSet, Set<Index>, and predicates. The specialization for RangeSet is nice-to-have, but I consider the fundamental generic algorithms essential.

Me neither, and to boot, I don't think it's a sensible concept. We don't really have an optimization strategy when you get beyond a single contiguous range, do we? That suggests that being sorted isn't even important. But even if we did have such a strategy, there's not enough in such a concept to justify its existence as a protocol; we'd just want an algorithm that accepted a Collection of ranges, with the precondition that it was sorted. Thinking through how this concept (and the more-general PredicateSet) would play out is what convinced me to propose the predicate version of gather instead of pushing for preotocols.

Instead of just saying you don't agree with it, maybe you could explain what's wrong with the binary predicate approach. I haven't heard anything that convinces me there's a better form for the “one true generic gather.” Build as many specialized/convenience implementations as you like on top of it, but AFAICT, it's still the one.

:scream: Now we're coupling separate concerns again. :scream:

How about we start by making it unnecessary to build a RangeSet if you already have a set of indices, and see if people need it, and then really think about how to do this right. It seems to me that if anything, the method that does that construction should be on Collection rather than RangeSet, since a RangeSet of indices is potentially relevant to every Collection but there are many RangeSets that are completely unrelated to any Collection.

No; since a Range<T> is not necessarily Hashable, you can't always form that type. So to make a generic set of any Range type, you'd build it just like your RangeSet, as a sorted data structure, but it would semantically be a set of Range<T> rather than a set of T.

But I also think this is some kind of logical fallacy I don't know the name of. The fact that you can't ensure nobody will ever misinterpret a name doesn't mean you shouldn't try to head off the really easy misinterpretations:

Adding an extra word to the name RangeSet also doesn't meaningfully disambiguate these things — if someone is confused that a RangeSet is literally a set of ranges, couldn't someone else reasonably be confused that a RangeBasedSet is that same thing?

This sort of thing is always a matter of degree. Sure, someone else could reasonably be confused, but it's a whole lot less likely. The word “based” is a good cue that Range is playing a different role than Element, because—among other things—RangeBasedSet is a worse name for a set of ranges than RangeSet. If nothing else, it's a prompt to ask yourself, “what is the word ‘based’ adding here?” and then be motivated to find out what the name means.

That statement reveals a fundamental difference in the way we're understanding most of the naming questions here. From my point of view, the APIs here are only “Range-related” to the same extent that Array<T>.indices is: Range is just a participating type and says nothing about the semantics or role of the range instance in the API.

Even if there were no Collection protocol forcing us not to encode the specific type in the name indices, we would never have called it Array<T>.range because we have a more-specific relevant and meaningful thing to say about it than simply reflecting its type. In the same way, all of the ranges participating in these APIs play a role of contiguous set of element values.

So it's about the roles things play. If you're going to argue that rangeSet.insert(contentsOf: someRange) is more understandable (and discoverable, but that is much lower priority) than rangeSet.formUnion(someRange), then I think you need to explain why we shouldn't have someSet.insert(contentsOf: someOtherSet). I could totally be convinced that's how we should spell that operation, but I cannot imagine any good argument for spelling these two APIs, which have the same semantics, inconsistently.

Sorry, I don't understand this argument. The weirdness of OptionSet aside, It seems to rely on a false equivalence.

I don't think it's “strictly necessary” either. The question is, what is the best API, and why? If you're not sure, let's talk through the advantages and disadvantages of each one.

I'm sorry, I must be missing something. That argument seems to be based on the idea that your design is already out there in wide use or something? If there's anything extant that could set an expectation, it's the prior art that semantic union operations on sets include the word “union” in their names.

This again reflects our fundamental differences with regard to naming. We wouldn't name an API Array<T>.ints just because it returns a collection of Int. The fact that this API returns a collection of Range is not an argument for calling it ranges, as far as I can tell. See also my reply to John.

As I've been trying to say, we don't already describe a range of positions in a collection by simply saying “range.” There's no precedent in Swift for talking about “a range of a collection” as far as I know, and to use it that way would implicitly establish a new term of art. I'm not attached to region, and actually, we already have a term of art for this: “subrange.” We intentionally used that word instead of range when naming removeSubrange because it is clearer:

var rs = [1..<2, 3..<4, 5..<6]
rs.removeRange(1..<2)     // if we had made the wrong naming choice
print(rs.contains(1..<2)) // prints "true"!

I understand the rationale behind the name, and I know it's new-ish to RangeReplaceableCollection, which is why I feel somewhat more comfortable challenging its use as a precedent for RangeSet—less extant material depends on it than something that's been around for a few releases. The problem with using “all” to indicate that more than one element may be removed is that it even more strongly implies that all the elements will be removed. removeElements(where:) would do the job without encouraging the wrong interpretation. This is obviously not something to do lightly, but if we are to make a change, sooner is better than later and I wanted to raise it, but perhaps it should go to a different thread.

Yeah, I'm not sure we've hit on the answer yet either. Other ideas might spring from words like “arrange” or “condense.”

3 Likes