SE-0270: Add Collection Operations on Noncontiguous Elements

dabrahams · November 11, 2019, 8:17pm

The review of SE-0270: Add Collection Operations on Noncontiguous Elements begins now and runs through November 21, 2019.

The proposal is written by @nnnnnnnn.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to me as the review manager via email or direct message on the forums. If you send me email, please put "SE-0270" somewhere in the subject line.

What goes into a review of a proposal?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift.

When reviewing a proposal, here are some questions to consider:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Thank you for contributing to Swift!

Dave Abrahams
Review Manager

QuinceyMorris · November 12, 2019, 12:59am

I strongly agree that we need a NotAnIntIndexSet type to generalize Foundation's current IndexSet. However, I have a series of issues with the proposal, which I'll put into separate posts. First issue …

Please, let's not make the mistake of calling this RangeSet, as if it somehow exhausts all future possibilities for things that have range and/or set semantics. Let's either go more specific (CollectionIndexRangeSet or IndexRangeSet or something), or re-use IndexSet for this and rename the current IndexSet to something else (yeah, right, I know that's not gonna happen).

QuinceyMorris · November 12, 2019, 1:07am

Second issue…

I don't understand why this is proposed as something about ranges, when the range semantics are actually an implementation detail.

Specifically, the following APIs are proposed:

    public func indices(where predicate: (Element) throws -> Bool) rethrows
        -> RangeSet<Index>

    public func indices(of element: Element) -> RangeSet<Index>

which in principle could retrieve any subset of indexes. The choice of arranging these into a set of ranges seems like an implementation choice. (If, for example, the indices found by these functions are sparse, then you could end up with a set of 1-element ranges. That's certainly bearable, but doesn't seem like a necessary design.)

Note that I'm not objecting to the implementation actually using ranges to optimize the set of indexes, and I'm not objecting to the possibility of enumerating indexes as a set of ranges, but it seems wrongly limiting to build this into the APIs as the only and paradigm mechanisms for accessing sets of elements.

I think there was some discussion of this in the pitch thread, but this really seems to stick out of the proposal as a weird conflation of design and implementation.

QuinceyMorris · November 12, 2019, 1:12am

Third issue…

This is not really an objection, more of an observation about something that might benefit from discussion.

With an implementation of RangeSet as proposed, suddenly we have (kind of) 2 domains (I can't use the word "sets" here, which would be more natural) of APIs: collections of elements, and collections of indexes to elements. Is it just me, or does this leave us with an uncomfortable feeling of duplication?

In many cases, it might be difficult to decide whether to form a sub-collection of elements, or a set of [ranges of] indexes to elements.

It's a little bit like Sequence vs. Subsequence, or is it? Does it make anyone else a little uncomfortable?

Fourth issue (sorry, forum software won't let me make a 4th post)…

I don't think I see anything in the proposal about the invalidation of RangeSets when the underlying collection is mutated (or when … what?). At the very least, I think the proposal should at least state the relationship between the integrity of the RangeSet vs. the integrity of the Collection.

Not a big point, but let's make sure this is clear. (Confusion on the validity of indexes has certainly come up in the forum before, so it's obviously not obvious.)

CTMacUser · November 12, 2019, 3:02am

Unlike how we use the term, a subsequence is a not-necessarily contiguous, but still in-order, subset of elements. What we call a SubSequence is a substring under that article's philosophy. (And the "substring" term applies to any sequence-set, not just a String or other sequence of characters.) Maybe we could use "DiscontiguousIndices," or "SomeIndices"?

We could make a lazy version of indices(where:), but the implementation wouldn't be that different from RangeSet (or whatever we rename it), so I'm not sure it's worth it.

I didn't see it either. But it should be the same as Index; a generated index value is valid until a RangeReplaceableCollection method is applied to the source collection (even if that operation's target doesn't specifically include your retained index).

CTMacUser · November 12, 2019, 3:24am

Is this review too fast? It's been less than a month from when the proposal's initial thread was started.

As @QuinceyMorris and I have said, the fact that the type is implemented as a set of ranges isn't as important as what operations it enables for collections, so maybe an improved name is warranted.

The documentation around this part is messed up, but is RangeReplacableCollection.removeAll(at:) a customization point or not? It's important that types with exploitable internal representations (like Array or String) get a chance to optimize the algorithm for their memory pool. And instead of a direct non-mutating variant of remove-all in Collection, maybe an anti-subscript would work.

extension Collection {
    /// Accesses a view of this collection with the elements **not** at the given
    /// indices.
    ///
    /// - Complexity: O(1)
    public subscript(not antiIndices: RangeSet<Index>) -> DiscontiguousSlice<Self> { get }
}

extension MutableCollection {
    /// Accesses a mutable view of this collection with the elements **not** at the
    /// given indices.
    ///
    /// - Complexity: O(1) to access the elements, O(*m*) to mutate the
    ///   elements at the positions outside of `indices`, where *m* is the
    ///   number of elements outside those indicated by `indices`.
    public subscript(not antiIndices: RangeSet<Index>) -> DiscontiguousSlice<Self> { get set }
}

gwendal.roue · November 12, 2019, 7:19am

Thanks for the pitch!

Is it possible to remove the within argument from the inverted method?

For example, we can imagine a user wants to use RangeSet as an efficient way to store integer ranges, independently from any collection. Finally Swift has its NSIndexSet! And now the user stumbles upon inverted(within:). This method can bring much confusion. Is RangeSet the correct type for my plain integer set? When I perform set algebra, why can't I rewrite a.subtracting(b) as a.intersection(b.inverted()) when relevant?

gwendal.roue · November 12, 2019, 12:38pm

I'm sorry to reply to myself, but:

If RangeSet is fundamentally tied to a collection, then why not providing the collection in the initializer, instead of providing it in within arguments, with the risk of user mistake?

If RangeSet is not supposed to be fundamentally tied to a collection (which means that it can be used as a replacement for NSIndexSet), then should it expose an implementation made of a set of ranges? If yes, then we should not have any inverted method, because you can't invert a range into other ranges. If no, then it's possible to implement inverted with a plain inner private boolean flag (or any other implementation deemed efficient enough) - as maybe Foundation.CharacterSet does.

dabrahams · November 12, 2019, 3:29pm

IMO this'd be a great addition to the standard library, but I do have some concerns about the way the semantics of RangeSet are described, which I'd like to see addressed somehow, possibly by renaming or documentation changes, but at least by discussion.

RangeSet is documented as “a set of ranges of any Comparable value.” I believe that is a proper description of its non-mutating semantics. That means that the elements of the set are ranges.

However, the mutation semantics and naming seem to clash with expectations a bit. For example,

let a = 0.0..<10.0, b = 1.0..<9.0
assert(a != b)
var s = RangeSet<Double>()
assert(s.isEmpty)
s.insert(a)
assert(s.contains(b))

Usually, if we take an empty set and insert one element, the set will not contain some other element, unequal to the one inserted. I totally agree that, given the RangeSet data structure, s should contain b at this point. Maybe the problem is with the name insert on the previous line. Or maybe this is a change in naming convention we can tolerate, but it should be called out and discussed.

I'm also uncomfortable with what @QuinceyMorris is calling the "two kinds of domains" problem, starting with the overloaded use of contains to ask whether a given range is in the set and whether any range in the set contains a Bound value. I think we should consider moving all the within: operations somewhere else, possibly onto Collection. The operations that mutate a RangeSet could be seen as similar to the formIndex methods.

I'd also like to see it explicitly stated that a range set containing an empty range is not an empty range set ;-)

-Dave

Ben_Cohen · November 12, 2019, 6:26pm

From the proposal:

The ranges that are exposed through this collection are always in ascending order, are never empty, and never overlap or adjoin. For this reason, inserting an empty range or a range that is subsumed by the ranges already in the set has no effect on the range set. Inserting a range that adjoins an existing range simply extends that range.

Ben_Cohen · November 12, 2019, 6:42pm

This set's primary motivation may be to store indices into collections, but it also works for ranges of other things, like a RangeSet<Float>. Using "Index" in the name of the type would be inappropriate.

The range-based nature is more than just an implementation detail. Being only a set of ranges of indices, not the individual indices, also allows the users to do things otherwise not possible, such as adding a new range given just two indices, not the underlying collection.

nnnnnnnn · November 12, 2019, 6:50pm

RangeSet is the set of all the values in its constitutive ranges, where the values can be of any Comparable type — integers, floats, strings, collection indices, etc. For example, here's a drawing of the space represented by a Comparable type T:

We can create a RangeSet called set1 that covers three different ranges of T — the yellow strips are the ranges comprised by set1, and the gray regions are the values of T that are contained in set1:

We can then define set2 as a set of different ranges, and then union the two, creating a third set that represents the union of the values represented by set1 and set2:

This is why the documentation for the two contains(_:) methods is the way it is: a value is contained by a RangeSet when it is contained within one of the set's ranges, and a range is only contained if it is fully contained within one of those same ranges.

Likewise, RangeSet has no notion of containing empty Ranges, because empty ranges represent no values in T. Inserting an empty range therefore never has an effect on a RangeSet.

nnnnnnnn · November 12, 2019, 6:58pm

A RangeSet is not fundamentally tied to a collection — its only requirement is that its generic parameter be Comparable. On such a type, you're right that it isn't possible to invert the set of ranges with the current design — you'll always need some external way to provide lower and upper bounds for the inverted RangeSet. The most useful point for this is when working with a collection, but we could look at adding additional inverted() overloads for certain kinds of types (e.g. FloatingPoint).

Inverting a RangeSet within a collection is a useful building block for other operations. For example, there have been requests over the years for a filter that returns both sides of the predicate — this is trivial with RangeSet:

extension Collection {
    func twoWayFilter(_ predicate: (Element) -> Bool) -> (yes: DiscontiguousSlice<Self>, no: DiscontiguousSlice<Self>) {
        let matches = self.indices(where: predicate)
        return (self[matches], self[matches.inverted(within: self)])
    }
}

Ben_Cohen · November 12, 2019, 7:01pm

Ignoring that RangeSet isn't exclusively something that can be used with collections, this approach would lead to problems with triggering copy-on-write. If the RangeSet held a copy of the collection, it would mean it is no-longer uniquely referenced, and therefore mutations would trigger COW:

var numbers = Array(1...15)
// get a RangeSet into numbers
let indicesOfEvens = numbers.indices(where: { $0.isMultiple(of: 2) })
// and remove the ranges in-place efficiently
numbers.removeAll(at: indicesOfEvens)

If RangeSet held a copy of numbers, then the call to numbers.removeAll would trigger a fully copy of numbers on first mutation during the remove operation, because the reference count to the underlying buffer would be non-uniquely referenced.

nnnnnnnn · November 12, 2019, 7:13pm

Remember that collections of "elements" can always themselves be collections of indexes, too. Adding RangeSet extends the group of types that are primarly-focused-on-but-not-limited-to-indexes, which right now includes Range, ClosedRange, the RangeExpression protocol (and its three implementors: PartialRangeTo, PartialRangeThrough, and PartialRangeFrom), and finally the crazy UnboundedRange that no one ever talks about.

To put it another way, indices are really important in Swift! We already include a wide variety of ways of representing them and applying them to collections; RangeSet is another in the quiver.

dabrahams · November 12, 2019, 7:21pm

nnnnnnnn
Nate Cook
    November 12
dabrahams:
RangeSet is documented as “a set of ranges of any Comparable value.” I believe that is a proper description of its non-mutating semantics. That means that the elements of the set are ranges.

RangeSet is the set of all the values in its constitutive ranges, where the values can be of any Comparable type — integers, floats, strings, collection indices, etc.

So it’s a set of comparable elements, not a set of ranges I guess the fact that I interpreted it the other way means this proposal needs some editing and clarification! This also explains (along with the fact that I overlooked that part of the proposal) why we're on different pages about constituent empty ranges.

Likewise, RangeSet has no notion of containing empty Ranges, because empty ranges represent no values in T. Inserting an empty range therefore never has an effect on a RangeSet.

In that case, IMO it's clear that the insert that takes a range ought to be spelled something like s.insert(allIn: range), and other APIs should be similarly reconsidered.

-Dave

glessard · November 12, 2019, 11:00pm

What I get from this is that the new type is not a set of ranges as the name seems to indicate, but a Set that is described in the form of ranges.
Hopefully a different name can capture this better.
(I thought of RangedSet, but it also seems wrong.)

michelf · November 13, 2019, 1:51am

This is a good proposal. I'm a big user of IndexSet and I welcome the ability to do the same with more than integers.

But the name RangeSet sounds wrong. It's a set of elements implemented with a collection of ranges of the those elements, but "range set" seems to imply the semantics of a collection of ranges, which is slightly off.

It's obviously important that the name gives a clue about the implementation which is based on ranges, so I'd propose to call it DiscontinuousRange. In other words it's a range that can have discontinuities in it. Of course it's also a set, but I don't think we necessarily need "set" in the name.

Karl · November 13, 2019, 3:37pm

I'm not sure I agree. In principle, a RangeSet does not contain the values in its constitutive ranges any more than a standard Range does, since it cannot iterate over those values. Range is only a Collection when Bound: Strideable.

Some other thoughts:

I feel the name and documentation are "too technical" (if you understand what I mean by that), and the examples too Collection-focussed. This could be a really great data type for a lot of problems, so it would be nice to make it more friendly to newcomers.
I'm trying to think of good non-collection-y examples. One example might be modelling occupancy at a stadium with an Array of RangeSet<Int>s, where each RangeSet represents a row of seats, and marking which places are taken or not by inserting or removing them in the set.
So considering that, inverted sets stand out as a strange omission. That would let you easily answer questions like "is there a space with 3 seats in this row?" by writing something like rowOccupancy.inverted.first { $0.count >= 3 }. Easy inversion is one of the main benefits of Range-based structures (along with faster contains operations) and with thinking in ranges in general.
In order to do inversions, we need a RangeSet which is itself bound to a Range. I think something like that would be a great idea as its own type, withall the same nice APIs RangeSet has (perhaps trapping or clipping when attempting to insert invalid values), or perhaps all RangeSets should just include optional bounds?

Also, I think this should go in the standard library preview package first, so we can all get a chance to thoroughly test the API. Are there any reasons why this couldn't be staged in the preview package first?

BigZaphod · November 13, 2019, 3:47pm

I feel like what I usually want might better be called a SparseRange or something like that. Basically I want a range that I can poke holes into (or invert it and do the opposite).