SE-0265: Offset-Based Access to Indices, Elements, and Slices

lorentey · October 19, 2019, 8:46pm

You’re right! The behavior should match what happens if there actually were elements at the specified offset. So the example should read:

"collection".insert("FOOBAR", at: .first - 7, padding: ".")
  // ⟹ "FOOBAR.......collection"
"collection".insert("FOOBAR", at: .first - 5, padding: ".")
  // ⟹ "FOOBAR.....collection"
"collection".insert("FOOBAR", at: .first - 3, padding: ".")
  // ⟹ "FOOBAR...collection"
"collection".insert("FOOBAR", at: .first - 1, padding: ".")
  // ⟹ "FOOBAR.collection"

Serves me right for posting with 2% brain capacity after a long day.

I don’t quite understand this statement. The whole point of this proposal is to introduce new collection behaviors and new collection APIs.

I fully agree with @Lantua here. The most exciting part of this proposal for me personally is that it gives us a sanctioned way to express locations that may be outside of the boundaries of a collection. This is brand new functionality and it is well within the scope of the proposal to define the best possible semantics for operations involving these new kind of locations. Offset bounds create a parallel universe to indices; we’re free to make up new rules on how they work.

(This is why I haven’t loudly complained yet about losing .end — even though I feel extremely strongly that universal support for half-open bounds and zero-based counting is critically important for any stdlib feature that involves ranges and collections.)

The padding behavior for out-of-bounds insertions/replacements is nice in that unlike any other choice I’ve seen so far, the explicit padding argument makes it obvious what the code is going to do when given a shorter than expected collection. (This also holds true for the fancy version that allows you to select the behavior you need.)

lorentey · October 19, 2019, 9:26pm

Adding a label to the subscript may in fact be necessary anyway for technical reasons.

OffsetBound conforms to Equatable, so it should also conform to Hashable. (This isn’t currently part of the proposal, but this is an omission that should be fixed.)

This leads to an overload ambiguity for OptionBound-keyed dictionaries (which are a useful construct we should support) — the keying subscript will conflict with the offsetting subscript.

let d: [OffsetBound: Foo] = ...
// This must unambigously mean the keying subscript:
d[.start + 6]  
// ⟹ Optional<Value>

// which means the offsetting subscript needs a label:
d[offset: .start + 6]  
// ⟹ Optional<(key: Key, value: Value)>

Nevin · October 19, 2019, 9:28pm

The core of the proposal is to introduce a new type, OffsetBound, and a pair of subscripts on Collection that take, respectively, an OffsetBound and a RangeExpression of OffsetBounds.

These subscripts parallel the existing subscripts that work with Index.

I am opposed to the new subscripts having any difference in behavior from the existing ones. In particular, the return types should be the same (non-optional), and out-of-bounds locations should trap. That is what I mean by no new behavior—the behavior should be the same.

In a similar manner, I think that any other method added by the proposal should exactly parallel an existing method, simply replacing Index with OffsetBound in the signature. That is what I mean by no new API—the API using OffsetBound should be exactly the same as (or a subset of) the existing API using Index.

• • •

I see two main use-cases for the OffsetBound feature.

First, when writing code generic over RandomAccessCollection, it will be much more convenient to write things like

.start + n

instead of

index(startIndex, offsetBy: n)

In this situation, the performance and behavior of the two ought to be identical. The OffsetBound version should simply be a shorter, simpler, and more convenient spelling for a common operation.

The second use-case is when working on non-random-access collections like String. Here, the fact that OffsetBound always uses an end of the collection as an anchor makes it a performance trap (though for String specifically, breadcrumbing might help). Furthermore, the simpler syntax makes it an attractive nuisance, because people will reach for OffsetBounds out of habit and convenience without realizing the drawbacks.

I am not fully convinced that we should add OffsetBound to Collection at all. It makes much more sense on RandomAccessCollection where there is no performance penalty.

But if we do add it, regardless of where, I am firmly of the view that the OffsetBound API should exactly match the Index API.

Tino · October 19, 2019, 10:52pm

Maybe the contradiction can be resolved with another outstanding decision:
There has been some demand for adding array accesors which return Optional instead of crashing — and imho the biggest blocker for this addition is the lack of a good spelling...
If we could finally agree how this feature could be implemented, the same way might be the right choice for relative indexing.

lorentey · October 20, 2019, 5:30pm

To be clear, I believe this proposal is the answer to that demand, and it takes it to its logical conclusion.

In my opinion, if we ever add support for out-of-range offsets, then we also need to allow ranges built out of them, and we need to work out sensible semantics for mutating operations that involve such offsets. Merely adding a new subscript leads to an incomplete (and as such, inadequate) abstraction.

Lantua · October 20, 2019, 8:21pm

If we're trying to recover a sensible argument. There's one case that bothers me: range .start + n ..< .last - m with insufficiently-sized collection. This is especially problematic for replaceSubrange.

We can snap to one specific anchor, or swap lowerBound and upperBound. None of these feels like the right thing to do. If we otherwise have mechanism to signal failure due to non-compliant argument, then I see no reason to have the clamp-behavior on other ill-formed scenarios.

kansaichris · October 21, 2019, 7:48am

I realize that this may be a silly question, but is there any scenario in which it would make sense for a subscript operator to throw an error given invalid indexes/offsets? I assume that a throwing subscript operator would be undesirable for a number of reasons, but it would definitely be a way to signal that an operation may not behave the way the programmer expects under certain circumstances.

Lantua · October 21, 2019, 9:28pm

Whether or not we perform recovery on invalid arguments, we should have an opt-in signalling to check if the Offset-Collection pair is proper. We could

Use discardable Bool (or Optional if the action has return value) for mutating actions with recovery, or
Use non-discardable Bool for mutating actions without recovery.

Non-mutating action (getter) shouldn't have much problem (whether we do recovery or not).

I can see the use case of this feature to be in the scripts where high-level prototypes are written. So it shouldn't have much visual signalling that the call can fails, they will be ignored and become noise anyway.

At first, I liked the idea, or at least the model, that there are infinitely many elements outside the collection, with elements outside being solely padding, but the more I think about it, the more I think this isn't very suitable because:

It doesn't solve the case .start + n ..< .last - m with small collection
Users may have a bug that generates ill-formed offset but being silenced by this. Making the bug hard to find.
It would require padding on every usage, even in the scenario where it is guaranteed to not be padded (due to precondition or previous checking).

I changed my mind regarding this.

a[.start + 3 ..< .last - 3]

This to me seems that a should have at least 7 elements. Because then a.index(at: .start + 3)! <= a.index(at: .last - 3)!. We could get 6 if we use begin/end, but that's not the big problem here.

So all ranges have visual cue on the # of elements the original collection should have, while homogenous anchor further add cue on the # of elements the resulting collection should have. This would also give replaceSubrange a valid range to work with.

lorentey · October 23, 2019, 1:43am

This only applies to the replaceRange variant, and it is only a problem for heterogeneous offset ranges.

I think trapping is a fine way to handle this case.

We can also choose to avoid providing a replaceRange operation altogether. Requiring users to model it with an explicit pair of removeRange + insert(_:,at:,padding:) operations eliminates this issue.

This isn't really a new issue. If I calculate the wrong offsets, I will get the wrong results, even in today's Swift. Such an error may or may not lead to an out-of-bounds index; all too often it doesn't.

The explicit padding argument is a great way to set expectations about what will happen in the irregular case.

This isn't necessarily the case -- let me remind you of the "fancy" variant, where padding is an optional sequence:

"collection".insert("A", at: .first + 15, padding: ".") 
    // ⟹ "collection.....A"
"collection".insert("B", at: .first + 15, padding: ". ") 
    // ⟹ "collection. . .B"
"collection".insert("C", at: .first + 15, padding: "")
    // ⟹ "collectionC"
"collection".insert("D", at: .first + 15)
    // ⚠️ Cannot insert an element at an offset outside of the collection.
    //    Did you mean to specify a padding value?

If you omit padding, the operation would default to trapping. (A slightly less fancier variant would disallow multi-element padding sequences, leaving us with a choice between a single padding element, nil padding (i.e., clamping semantics), or trapping. I don't really see a reason to do that; we can be as fancy as we want.)

I also argue that such a flexible insert/replace would be a useful addition to the stdlib anyway. (It is annoyingly fiddly to implement when you do need it.)

lorentey · October 23, 2019, 3:06am

It turns out I can't help thinking about this issue, so I think I should state my objection to removing .end.

I don't mind having .last as a convenience feature, but I believe providing .end is important for reasons of basic API consistency.

From the "Alternatives Considered" section of the proposal:

Having .end made it easier to refer to an open upper bound corresponding to the collection’s endIndex , which mostly manifests in documentation. However, in actual usage, .first and .last are almost always clearer.

We chose to eschew .start and .end members, simplifying the programming model and further distinguishing OffsetBound as an element-position abstraction more akin to working with Collection.first/last than Collection.startIndex/endIndex .

Removing .end does not simplify my programming model; it's quite the opposite -- it promotes closed ranges and one-based counting in certain scenarios, which hurt my brain and violate our established conventions.

I don't think the "almost always" argument holds any water here. Established Swift practice is to prefer using half-open ranges, so it seems self-evident to me that this feature should not arbitrarily decide to hamstring them.

Consider the simple case of replacing the last handful of elements of a collection with some sequence:

c.replaceRange(.last - 5 ... .last, with: "foobar")

Pop quiz: if c started out with ten elements, how many of them will be replaced?

Click to see the answer!

The operation replaces the last six elements. Gotcha! (?)

Trying to force .last into half-open ranges is just as error prone and misleading:

c.replaceRange(.last - 5 ..< .last + 1, with: "foobar")

I know I'd have a hard time remembering these unusual offsetting rules. Preferring closed ranges here goes against established Swift conventions that help prevent off-by-one errors. (And it ignores the legacy of Dijkstra.)

Compare and contrast to the obvious simplicity of the natural half-open range below:

c.replaceRange(.end - 6 ..< .end, with: "foobar")

I'm not arguing to remove .last -- I agree it is often convenient, especially when accessing individual elements. But I don't think the argument for removing .end holds up under scrutiny, and I believe that accepting its loss would strike a blow to the consistency of the stdlib's Collection model.

This may look like a small detail, but I think it's a crucial one.

QuinceyMorris · October 23, 2019, 5:46am

I strongly agree — that both .last and .end should be available, and that the design should not privilege one over the other — and I'm getting more unhappy about the lack of .end as the review period drags on.

Also, AFAIK, @Michael_Ilseman never showed the "actual usage" examples, and never showed the reasoning that .last is "almost always clearer". I'm afraid that it's clearer to the people it's clearer to, and murkier to the rest of us.

Is anyone really going to be upset if we have both .last and .end?

jawbroken · October 23, 2019, 11:05am

I don't have a strong objection to also providing .end, but a few points here.

I'm not sure this is true in my case, I seem to use a fair mix of the open and closed ranges. Is this supported by usage statistics or documentation somewhere?

lorentey:

Consider the simple case of replacing the last handful of elements of a collection with some sequence:
c.replaceRange(.last - 5 ... .last, with: "foobar")
Pop quiz: if c started out with ten elements, how many of them will be replaced?

I'm having a lot of trouble seeing the “gotcha” aspect of this. It's essentially equivalent asking how many elements this replaces:

c.replaceRange(.first ... .first + 5, with: "foobar")

or how many elements this has

array[0 ... 5]

i.e. it's just the obvious/standard property of a closed range.

I had this same thought when the pitch for this proposal was posted and looked up this exact article, but when I re-read it I didn't find it that relevant here. Swift already didn't choose to have a single range type, so all the parts about choosing [a, b) vs [a, b] aren't particularly relevant. And since you're counting from the back when using .last, it's as natural to start from 0 offset (not -1) as it is when counting from the front. And it makes things symmetric, so you can easily flip a range from being relative to .first to be relative to .last by just flipping the bounds around and negating the offsets, as I did above. Though the one tricky part, as you demonstrate above, is that there's currently no >.. operator, which would let you write:

c.replaceRange(.last - 6 >.. .last, with: "foobar")

Lantua · October 23, 2019, 5:12pm

I don't have problem with trapping, but I believe that it should have the same behavior as the out-of-bound offset/range, as it easily come from the same source of error.

I was the one who proposed to keep only one variant. I did (and still do) believe that they are so close together, that the usage easily become the stylistic choice, as oppose to ergonomic one.

lorentey · October 23, 2019, 7:41pm

The standard library APIs do exhibit a clear bias for half open ranges:

They get to use the short, unqualified Range name, while closed ranges have to live with the clumsier ClosedRange
The index range expression that captures the entire collection is a half-open range: startIndex ..< endIndex. (As it should be!)

Dijkstra’s notes remain as relevant as ever. Closed ranges can never be empty, which limits their use cases. Slicing is easier with half open ranges — the endpoint of the slice is the start of the remainder, or vice versa.

Which is not to say I never use closed ranges — I do! It’s helpful to have them, and I don’t mind the (huge) API surface dedicated to them. I don't even have trouble with the upTo/through naming convention*. However, the more complex a slicing expression becomes, the more strongly I prefer to use half open ranges.

* I am well aware though that they're a source of confusion for many developers, and I do sympathize with their frustration.

A quick unscientific search through the Swift Source Compatibility Suite reveals that this bias is also reflected in the general corpus of Swift code: I measured the ratio of ..< to ... to be about 2 : 1. (Even though I used a simple regular expression search that heavily overcounted ... -- the ellipsis is used quite frequently in its regular typographic role in comments and string literals, such as "Processing...".)

But this is largely unrelated to this review discussion. The problem I'm trying to point out is that the proposal in its current form is inconsistent with the existing Collection APIs, and that should be remedied.

Just to be clear here — I would be strongly opposed to the idea of adding even more kinds of ranges. Two kinds already cause plenty of trouble; Swift should not become a Mesa.

Pervasive support for half-open ranges is an important API design tool; the stdlib's pedantic consistency to support (and even prefer) them is our primary defense against off-by-one errors.

QuinceyMorris · October 23, 2019, 8:00pm

I think the truth of that varies from person to person. Some people will be comfortable with .last semantics, other people will be comfortable with .end semantics, yet others will switch comfortably between the two. It would be unfortunate if the decision were based by someone's introspection of their own feelings on the matter.

Actually, I thought it was you (@Lantua) who pointed out in the earlier thread that eliminating the choice between .last and .end just pushed the ambiguity from the choice of symbol to the choice of range operator (..> or ...). Maybe that was someone else, but I very much agree with that idea.

Another way of looking at this is that some people think of ranges as involving elements at each end of the range, while other people think of them as involving the boundaries around the elements.

The distinction is that a collection of n elements has n + 1 boundaries (one before each element, plus one after the last element). By reason of simple counting, the two conceptualizations are not interchangeable.

Lantua · October 23, 2019, 9:38pm

That’s what stylistic preference is.

It wouldn’t be very comfortable.
Choosing between options this similar, without personal preference, would require you to think long and hard between subtle differences, usually in scenarios where there is none.

There are surely many ways to interpret the OffsetBound (as well as Index), but I still think having them both would be, at best, redundant and unnecessary.

scanon · October 23, 2019, 11:15pm

Agree very strongly. If it wasn't for the problem of an integer type not being able to represent the bounds of a half open range comprising all of its values, my preference would be to only have a ..< b.

lorentey · October 24, 2019, 1:12am

The stdlib is quite consistent about supporting both upTo and through methods, and both ..< and ... ranges throughout its API. I agree that this is a source of confusion, but this is what we have.

.end and .last fall in exact the same category for me, with the huge caveat that in the world of indices, we have startIndex and endIndex, but we intentionally* don't provide a lastIndex. So if we were forced to choose only one (which we aren't) I'd argue that .end would be the one to keep.

* The reason for that is that for an empty collection, `lastIndex` would need to return the index before `startIndex`, which is a concept we are deliberately refusing to model. (Because embracing that would lead to complications (including `<..` and `<.<`) that we really don't want to deal with. We need to be careful to ensure that this proposal won't become a vehicle for smuggling in those complications anyway.)

QuinceyMorris · October 24, 2019, 5:35am

There are left-handed people, and right-handed people, and ambidextrous people. That's not a stylistic preference, though it does affect which hand they're comfortable using. FWIW, I don't think I've seen an ambidextrous person frozen with indecision about which hand to use to pick up a fork.

jawbroken · October 24, 2019, 10:00am

Sure, or startIndex ..., or just ....

Thanks for that, it's useful information.

If half-open ranges are the primary defence against off-by-one errors, then I have to say that they don't seem to be a very effective one. And valid offsets from .end starting from -1 would be it's own pervasive source of off-by-one errors, as it is in languages where -1 automatically refers to the last element. I don't think there's any good technical solution to fencepost errors, it's just something you have to think carefully about. But I'm not particularly opposed to having both, I just don't agree with some of your reasoning.