SE-0270 (Review #2): Add Collection Operations on Noncontiguous Elements

John_McCall · December 13, 2019, 6:22pm

The second review of SE-0270: Add Collection Operations on Noncontiguous Elements begins now and runs through December 19th, 2019. The previous review is here. I have taken over as review manager to give Dave Abrahams more flexibility to participate in the review.

The proposal was returned for revision in order to investigate splitting it into smaller pieces. Nate Cook, the proposal author, has chosen to make several revisions, the most significant of which is to remove several sets of methods; you can see the raw difference here. The Core Team has elected to run a review of the proposal as revised rather than sending it back into the pitch phase.

It sometimes happens with re-reviews that the Core Team has substantively approved parts of the proposal and is asking the community to provide feedback on specific revisions. That is not the case here because the Core Team did not feel like it received a strong signal on the core proposal. Accordingly, please consider this to be a de novo review and provide feedback on the full proposal.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to me as the review manager via email or direct message on the forums. If you send me email, please put "SE-0270" somewhere in the subject line.

What goes into a review of a proposal?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift.

When reviewing a proposal, here are some questions to consider:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

As always, thank you for contributing to Swift.

John McCall
Review Manager

Nevin · December 13, 2019, 7:20pm

I’m not sure how I feel about renaming indices(where:) to ranges(where:). It seems less clear in the example from the proposal:

// new version:
let indicesOfEvens = numbers.ranges(where: { $0.isMultiple(of: 2) })

// previous version:
let indicesOfEvens = numbers.indices(where: { $0.isMultiple(of: 2) })

In particular, it is not necessarily obvious that ranges(where:) is returning ranges of Index.

Ilias_Karim · December 13, 2019, 8:49pm

This proposal sounds great to me after following along the discussion from the previous thread.

Personally, the revised proposal reads easier and the naming changes decided upon improve it. The former naming of .ranges(where implied that the return ought to be of type Set<Index> whereas the renamed method clearly implies a set of ranges.

+1

dabrahams · December 14, 2019, 6:17am

Do you think that example may be a little dissonant for you because of the name of the variable, which could have been rangesOfEvens or evenRegions or just evens instead?

maven · December 14, 2019, 8:57am

What led you to make this (essentially) a Set<Range<Index>> (which is a bit like Set<Set<Index>>) instead purely a Set<Index>?

I've been working with something that is essentially the later (its internal representation is either a single range (which is not exposed) or a set of indices -- the API just yields all contained indices).

Nevin · December 14, 2019, 2:02pm

I’d put it the other way around. The name of the variable is the only hint that we are dealing with indices at all. If the variable had a different name, there would be nothing indicating “index” whatsoever.

In fact, I almost removed the variable and just quoted the method call itself, to make my point even more clearly.

Just going by the method call, numbers.ranges(where: isEven) could plausibly (on the surface at least) seem to return a collection of Element.

xwu · December 14, 2019, 3:02pm

I appreciate the clarification in semantics on this revision. In particular, I think ranges(where:) and ranges(of:) are a significant improvement compared to their prior incarnations. This addresses the concerns laid out in the previous iteration of review.

I have one nit: the helper method proposed here, range(at:), seems inconsistent with Swift standard library conventions and harkens back to older Foundation idioms (NSString.doubleValue, for instance).

The API is one to give more convenience to conversions from an index to a range with respect to a collection. Typically, conversion APIs are spelled in the standard library as an initializer of the destination type. Here, it is a method—and on Collection at that. I wonder if it would be just as convenient when spelled more conventionally (Range(index, in: collection)).

curt · December 14, 2019, 4:23pm

Thanks for the revised proposal. This will be a useful addition. The API benefitted from the tighter proposal.
I do have a couple of questions and points of confusion:

Under Accessing Underlying Ranges and Individual Indices, this sentence seems to describe an extension on Collection or perhaps a requirement on Collection.Indices:

You can access the individual indices represented by the range set by using it as a subscript parameter to a collection's indices property.

However, the API snippet immediately following doesn’t seem to address that, but does have a subscript, which confused me.

Under Accessing elements via RangeSet, I’d appreciate an example of using subscript(indices:) on a MutableCollection. That might be a good addition to the doc comment.

curt · December 14, 2019, 4:28pm

I think range(at:) is the more natural spelling. We’re asking the collection to perform the conversion.

Are there other examples of conversions where the context for the conversion is passed in the initializer?

Nevin · December 14, 2019, 6:32pm

I think range(at:) should actually be range(at:length:) with a default length of 1:

extension Collection {
  func range(at position: Index, length: Int = 1) -> Range<Index> {
    return position ..< index(position, offsetBy: length)
  }
}

• • •

I also think the argument label on insert(contentsOf:) is unnecessary and verbose. When inserting a range of elements into a RangeSet of that same element type, there is no ambiguity. The concise spelling is every bit as clear, without the extraneous noise:

var set = RangeSet([0..<5, 10..<15])
set.insert(contentsOf: 5..<7)  // Current proposal
set.insert(5..<7)              // Original proposal

No one is going to find the version without a label confusing, but people will find the version with the label needlessly tedious.

• • •

Finally, having read the revised proposal more thoroughly now, I am quite convinced that ranges(where:) and ranges(of:) should revert to their original spellings indices(where:) and indices(of:).

Using the example from the proposal:

let str = "Fresh cheese in a breeze"
let allTheEs = str.ranges(of: "e")       // Current proposal
let allTheEs = str.indices(of: "e")      // Original proposal

The method is asking the collection to return a collection of indices. The resulting collection happens to be a RangeSet, but that is almost irrelevant. The specific type of collection, and its internal representation, don’t actually matter for the purpose of writing algorithms. The important point is that these methods return some collection of indices.

It has been said many times on Swift Evolution that code is read far more often than it is written, and the API design guidelines place clarity at the point of use as the foremost concern. From the perspective of someone reading the code, “ranges of e” doesn’t really convey any meaning. On the other hand, “indices of e” says exactly what it does.

Furthermore, a major part of the revision of this proposal is to change RangeSet from modeling a collection of ranges, to modeling a collection of comparable elements. It seems entirely contrary to that goal if the methods themselves get renamed away from talking about the element type into talking about the ranges.

nnnnnnnn · December 14, 2019, 11:32pm

To clarify, the move away from indices(where/of) is in part specifically because these methods don’t return a collection of indices, but a set of values represented by ranges. A range set isn’t a collection, although you can access its constituent ranges as a collection. A collection’s indices property, on the other hand, is exactly what you describe here — a collection whose element type is Index.

Nevin · December 15, 2019, 1:10am

I am looking at this from the perspective of a programmer who wants to get all the indices of a collection where the elements at those indices satisfy a predicate. Unless I am quite mistaken, that is the primary use-case for the indices(where:) method, and it should be named accordingly.

When the programmer describes in words what they want, it is going to sound a lot like “I want all the indices where the predicate is satisfied” and it is not going to sound like “I want all the ranges where the predicate is satisfied”.

The fact that those indices are stored in a RangeSet is an implementation detail. The important part is that the programmer will use this method when they want to get indices into their collection. The indices(where:) method belongs to Collection, and its entire purpose to to get the indices for the elements which satisfy the predicate.

That is why I am firmly of the view that the correct and clear spelling is indices(where:).

Regardless of whether RangeSet conforms to Collection, the purpose of indices(where:) is to collect a subset of indices. When a programmer calls indices(where:) it is because they do not want all the indices, only those indices where the predicate holds.

dabrahams · December 15, 2019, 2:03am

Personally, I don't think it's useful to try to communicate type information in names, except where it's actually needed to prevent confusion. Semantically, an index is just a pointer to an element, and I don't think it's very useful, in most cases, to distinguish a pointer-to-thing from a thing-itself, in naming. You can say, “here's a list of the names of people coming to the party,” but most of us will just say, “here's a list of the people coming to the party” without causing problems of understanding.

Could you explain how you arrived there? I don’t see anything in that name that suggests it is returning a collection of Element. I could buy the argument that it appears to expose a collection of ranges, and that because it only holds-a collection of ranges but it is-a set of indices, the name ranges is misleading. From that point of view, regions might be a better name, because it avoids making any implications about types.

Naming around the RangeSet type is going to be difficult in some ways, and it’s not unique in this regard. There is a large category of types—call them “basic sets”—that can provide an efficient membership test for some type (indices in this case), but can’t function as a collection of that type. Some ranges fall into that category as well (e.g. Range<Float>), and the others may expose elements that are invalid indices for whatever collection they’re being used with. For example, 2..<4 is a valid range of indices for a collection whose indices are [2, 4, 6, 8], but if you treat 2..<4 as a collection, you’ll find it contains 3.

So you could argue that “ranges” is better than “indices” because x.ranges(where: isEven) is a set represented by ranges of positions in x where we can find even values, but the set may also have members that are arbitrary other values of x's index type.

Anyway, as I said, I think the naming problems are tough here. So far I haven't convinced myself of any principle by which we can say one name is clearly better than the other… except that maybe regions holds a slight edge because it doesn't tread into certain potentially misleading territory.

Nevin · December 15, 2019, 2:58am

I’m not quite sure what you’re getting at here. When you want an index, you call a method like firstIndex(where:) or index(after:). When you want more than one index, it follows that you should call a method like indices(where:).

An argument could be made that allIndices(where:) has more symmetry with firstIndex(where:), but regardless the word “indices” should be involved.

I think this supports my point. There’s no reason to make people ask for “a bunch of ranges of indices that satisfy the predicate” when they could instead simply ask for “a bunch of indices that satisfy the predicate”.

It’s more that ranges(where:) could be ranges of anything. There’s no indication that the method is at all related to indices of the collection. The other index-returning methods are named things like firstIndex(where:), and for clarity and consistency we should continue that pattern.

If we were to name it ranges(where:) then it would be clear that it has something to do with ranges, but not what kind of ranges nor what those ranges are for nor what the purpose of the method itself might be. The code becomes unclear and confusing to readers.

In contrast, if we name it indices(where:), then it is clear that the result is a bunch of indices, and readers of the code will immediately grok what they are for. This spelling does make it unclear how those indices are stored and what the return type is, but, if I may quote a member of the Core Team and co-author of the API Design Guidelines:

dabrahams · December 15, 2019, 4:02am

What I'm getting at: is that you don't get indices in the same sense that you get an index from those other methods. What you get is, semantically, a predicate over indices that given an index answers the question, “is this index in the set?“

I hope it is clear that on that question I have no strong position, because I think the naming questions are difficult in this area. But I guess I should also point out that we have a property called indices that is a collection of indices, whose elements you can enumerate, and there's a pretty strong implication that what you'd get from indices(where:) should have the same properties. This one doesn't.

That's very different. The analogous statement would be that there's no reason to ask for a bunch of indices that satisfy the predicate when they could ask for the elements that satisfy the predicate.

Well, OK. I think what you're getting at is part of why I prefer the name regions(where:). IMO, that obviously means “the regions of the collection where the elements satisfy the predicate.” I think it should be obvious that contiguous regions of the collection are represented as half-open ranges bounded by indices.

Yeah, but as I noted above, this method doesn't return any indices: it returns a “basic set” which is essentially a predicate that can answer the containment question. I mean, sure, you can derive from the invariants of RangeSet that the index values that are the lower bound of each range are in the set, and if you happen to know the collection value the set was derived from, you know those are also indices in the collection. But, without the collection (or its indices), you can't use that return value to get any of the set members.

OK, I agree, understanding that name depends on the term “range of a collection” meaning “a range of positions in the collection.” That could become a term of art, but it isn't today, and it's a bit awkward (or at least I find it so). So as we work through these arguments I'm liking regions(where:) more and more.

Touché ! I'll parry with: regions(where:) doesn't have that problem

dabrahams · December 15, 2019, 4:17am

IMO we shouldn't be thinking of this method as a type conversion at all. Ranges are at a different level of abstraction from their Bounds. It's similar to thinking of a Int -> [Int] function as a type conversion. I would never argue that Array<Int>(3) should create an array of one element.

I am not fond of range(at:) either, though. It makes code read as though the collection contains ranges, and to echo @Nevin, this could be a range of anything. We don't have a notion of a “range of a collection.” I'd prefer to see something like region(ofElementAt:). It's a little unwieldy, yes, but it's clear.

gwendal.roue · December 15, 2019, 12:51pm

I frankly don't know what to think of this proposal. I've never seen such a proposal in Swift Evolution yet.

What is your evaluation of the proposal?

The API surface of RangeSet is splattered across many types. I think it will be difficult to remember, learn, use, and understand. I understand all the good reasons for this, but in the end we still have a container type which is not self-contained, with diffuse responsibilities.

Is the problem being addressed significant enough to warrant a change to Swift?

The problem is that the "problem" is not well defined. Yes we need an IndexSet. Why not more general than IndexSet. But does it have to be this RangeSet? Why isn't NSCharacterSet discussed? Is it because of NSCharacterSet.inverted?

Does this proposal fit well with the feel and direction of Swift?

Frankly, no. No existing Swift container exposes implementation details as RangeSet does.

I think the Core Team has to clarify their intentions regarding standard collections. Are we stil in a post-ObjC world where collections are defined by their abstract behavior, or do we move to a Java-C++ world where collections are defined by their implementations?

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

I've never heard of RangeSet.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Followed both reviews, never understood if we are talking about indices or ranges.

Ben_Cohen · December 15, 2019, 5:11pm

This is not true. All the current collections expose aspects of their implementation detail.

Array holds its elements in a contiguous block of memory.* Knowing this is very important to the user of Array. It is why insertions at the front are linear but appending is constant. It’s why you can get an UnsafeBufferPointer of its elements in constant time. Contiguous storage is really important when optimizing high-performance code.
Set and Dictionary are hashing containers. This is why you can’t rely on their ordering, why their elements need to be Hashable, why lookup takes constant not log time. Tree-based collections would have different observable benefits/limitations.
Range, when it can be a collection, is just a pair of numbers, so is very efficient, but can’t do lots of things like insertion/deletion of elements. Understanding when a collection uses a Range for its indices property is vital to understanding whether you can mutate the collection efficiently when iterating its indices.
CollectionOfOne exists purely because of its implementation details.
And so on. Lots more examples already in the library, and hopefully more to come.

*with one exception relating to interop with Objective-C, which we should ignore for now, but if you want not to, it’s yet another implementation detail worth knowing about.

You might take the position that RangeSet is one step beyond these examples, that it takes its exposure of implementation too far. If so, I disagree. The fact that it represents a sorted list of ranges is really important for performance purposes. It is what makes it a maximally efficient representation for operations like removing multiple elements from an Array.

The goal of Swift is to span from low-level performance-critical use cases to high-level abstractions. The low level part isn’t some grubby side-goal we should be resistant to infecting the API surface. It is co-equal with the goal of high-level expressivity. I think RangeSet as proposed here strikes the right balance between these two.

(speaking for myself not the core team)

We are not still in this world because we never were in this world. This is to be celebrated. It’s sometimes taken as a given that hiding implementation details is “obviously” virtuous, but IMO this view can be very harmful when applied too liberally. It’s important to know the precise performance characteristics of Array (which is why the asterisked comment above is so unfortunate). In other cases, it is preferable to hide implementation details (either from the ABI, or from the user) in order to be able to change them later. Both are valid choices, with pros and cons.

I think there’s some confusion here when it comes to concrete types versus collections. The Collection hierarchy is indeed a description of behaviors, and you can write generic code against it instead of against concrete types. When you do this, you also make a choice with pros and cons. The pro is mainly the potential for code reuse. The con is you give up more detailed knowledge of the type you’re working on, and that might result in less efficient or elegant code that could leverage the knowledge of a concrete type and its implementation details.

gwendal.roue · December 15, 2019, 5:16pm

Thank you for your answer. I would have never hoped my questions to trigger a post which is that well articulated.

curt · December 16, 2019, 2:50pm

As you mentioned up thread, with this proposal we might be in the business of coining what will become a term of art. I’m sympathetic toward “region” for that reason.

If we’re willing to entertain some unwieldy options, at least to generate ideas, I’d offer indexRegion:

let names: [String] = ["John", "Paul", "George", "Ringo"]
let position = names.firstIndex(of: "Paul")!

let paul = names.indexRegion(for: position)
let alsoPaul = names.indexRegion(of: "Paul")
let paulIsDead = names.indexRegions(where: { $0.contains("o") })

with apologies to Sir McCartney.