Offset Indexing and Slicing

Karl · August 29, 2019, 1:29pm

I like this. I just have one comment and some thoughts on future directions:

Slicing from the start/end is certainly the common case, but some kinds of exotic collections may expose other salient indexes as properties. In my demo gist from a while back, I added an operator which allows you to construct an OffsetBound by adding an integer offset to a KeyPath. Would something similar be possible with this?

Some thoughts on future directions:

Optimising offsets

I think it would also be interesting to consider a protocol requirement on Collection which 'optimises' an OffsetBound - e.g. if the Collection knows its own count in constant-time, it might decide to transform an OffsetBound of the form .start + x to the form .end - (count - x) if x > (count/2). This would be useful for BidirectionalCollections which cache their count (e.g. String). When you're slicing Strings in LTR languages, it's just easier to write things in terms of offsets from the start (because that's how a human reads it), even if that isn't the most performant way of walking to the desired location.
String slicing performance

I think it's fair to say that the biggest motivating use-case for this is slicing Strings (although things like correcting buggy code with integer-indexed collections are useful, too). I imagine lots of code will look like:
```
let something       = myString[.start + 4..<.start + 10]
let anotherThing    = myString[.start + 12..<.start + 20]
let yetAnotherThing = myString[.start + 25..<.start + 30]
```
...which we can expect to have pretty poor performance because String is not a RandomAccessCollection. I think it's worth considering how we might improve index calculations in general (which particular focus on String). Taking inspiration from @dabrahams post from way back about "upgrading" all Sequences to be Collections (by adding a buffer), I think it would be cool to have a generic wrapper than can upgrade any Collection to be a RandomAccessCollection by adding an index cache (similar to what we do with String's UTF16 breadcrumbs).

Then, if anybody needs to repeatedly slice a Collection which for some reason can't be a RAC, we can advise them to wrap the data:
```
// Add this line to shadow 'myString' with a wrapped version.
// Performs a single O(n) scan of the String and builds an index cache for later lookups in O(log n).
let myString = myString.withRandomAccess()

let something       = myString[.start + 4..<.start + 10]
let anotherThing    = myString[.start + 12..<.start + 20]
let yetAnotherThing = myString[.start + 25..<.start + 30]
```
We would want this to also be a protocol requirement, so RACs can simply return themselves from withRandomAccess().

Michael_Ilseman · August 29, 2019, 4:48pm

That's odd, it might be a type checker issue that was fixed later. I've been running with Swift 5.1, but I don't think I used anything new in 5.1.

The implementation itself doesn't use split, but some of the examples do. You can omit anything in the file after the // Test Harness comment: everything later is just unit tests and examples.

Michael_Ilseman · August 29, 2019, 4:58pm

Type checking was a significant concern with previous formulations of this, mentioned under "The Trouble with Generic Operator Overloading" in Alternatives Considered.

This approach hopefully has minimal impact.

OffsetBound is not ExpressibleByIntegerLiteral, so it is not even a candidate for literals.
OffsetBound is fully concrete. A generic parameter would force the type checker to explore the space of conditional conformances just in case some module somewhere conditionally conformed OffsetBound to ExpressibleByIntegerLiteral based on the generic parameter.
OffsetBounds are always on the LHS of +/-, and an Int is always on the RHS. Ambiguity would require a very contrived setup.

@xedin knows this area better and can provide more details.

Michael_Ilseman · August 29, 2019, 5:09pm

I like this, in addition to slicing with offsets. An index-taking subscript with a length parameter could also be useful. I'll think about whether it makes sense to incorporate into this proposal (while we have momentum on the topic).

It may be that this sort of pattern is better handled by SubSequence consumers, such as in this gist.

Joe_Groff · August 29, 2019, 5:16pm

jawbroken:

Looks good to me. Any thoughts around providing a nicer way to get a slice of a particular length using these offset indices? Looking at
print(str[.start + 3 ..< .start + 6]) // "def"
makes me wonder if there should be a
print(str[.start + 3, length: 3]) // "def"

Could we integrate this into the OffsetBound design? Maybe there could be a relative bound that specifies an offset relative to the other bound:

.start + 3 ..< .bound + 3
.bound - 3 ..< .end - 2

though we'd want it to be an error to use an open-ended .bound, or .bounds on both ends.

Michael_Ilseman · August 29, 2019, 5:45pm

In the general case, forming an offset bound that is relative to some index would require representing that index type in the bound itself, which has significant type checking implications, mentioned in the "Offset Arbitrary Indices" section. Instead, you have to fall back to slicing first and then offsetting into the slice.

For your specific case, where the collection provides salient indices (e.g. .middle), in the proposal as-is you would again have to fall back to slicing first. E.g. myCol[myCol.middle...][.start + 5].

Do you have some concrete examples of such collections? I'm not sure this specific case warrants a targeted solution.

Without making OffsetBound generic, I'm not sure how we could represent the return type of + or -.

You can do this today by customizing index(_:offsetBy:) or others.

In general, this would produce a new type and we don't have a good forwarding story, so you'd lose any APIs specific to that collection in the process.

String's breadcrumbs work because String is a concrete type and not a protocol, so we are the ones defining the in-memory representation.

That would requires an additional allocation of O(N) in size, which muddies the performance story quite a bit in practice. I think a RAC wrapper collection could be an interesting addition. You can make one by just storing an Array(myCollection.indices) and losing the specific APIs on that type, and provide access to the wrapped Collection and mapped indices to regain access to them.

Nevin · August 29, 2019, 8:18pm

Okay, I commented out everything past “// Test Harness” and the playground now runs successfully.

However, when I added the following two lines:

let a = "test"
let b = a[.start]

There was no autocomplete for .start.

Specifically, when I wrote “let b = a[.s” the available code completion options were two variants of String.Index.init and one Range<String.Index>.init.

Michael_Ilseman · August 29, 2019, 8:49pm

I'm not sure, Playgrounds may also behave differently. I put it into a command line tool project and got code completion there. @akyrtzi might know more details but at the language/compiler level, it seems to work in 5.1.

Untitled%202

blangmuir · August 29, 2019, 8:54pm

If you're using Xcode 10.3 this is expected. Code-completion for subscript and unresolved members was improved in Xcode 11. I'm not 100% sure how well this is going to code-complete once you're on the RHS of a complex range expression, but at least all the simple cases seem to work in Xcode 11 beta.

beccadax · August 29, 2019, 10:58pm

I like most of this, but I'm not sure I like the decision to exclude arbitrary-based offsets from the design. I'd honestly like to see this feature subsume direct calls to index(_:offsetBy:limitedBy:), but it can't do that without support for arbitrary bases. This part of the justification in particular seems weak:

Why is this disqualifying for arbitrary bases, but not for start and end? The fact that .index(j) - 1 ..< .index(i) + 1 might have a valid result even if j > i is no different from the fact that .end - 1 ..< .start + 1 might have a valid result even though .end > .start.

jawbroken · August 30, 2019, 12:52am

Joe_Groff:

Could we integrate this into the OffsetBound design? Maybe there could be a relative bound that specifies an offset relative to the other bound:
.start + 3 ..< .bound + 3
.bound - 3 ..< .end - 2
though we'd want it to be an error to use an open-ended .bound , or .bound s on both ends.

I thought about this but came to the opinion that it was too clever (e.g. requiring extra restrictions on the OffsetBounds allowed in a range) and didn't read that well. You also wouldn't be able to use the same design for a real index. It's a possibility, though.

mpangburn · August 30, 2019, 3:10am

An alternative would be to introduce a free function that extends a single OffsetBound into a range:

extending(.start + 3, by: 3)

or perhaps an initializer?

Range(from: .start + 3, length: 3)

xwu · August 30, 2019, 12:59pm

mpangburn:

An alternative would be to introduce a free function that extends a single OffsetBound into a range:
extending(.start + 3, by: 3)
or perhaps an initializer?
Range(from: .start + 3, length: 3)

One reason I find the pitch so well scoped is that I composes well with itself and with other features.

Is there a reason to favor any of these additional APIs over the straightforward composition x[(.start + 1)...][..<(.start + 3)]?

jawbroken · August 30, 2019, 1:57pm

Sure, aesthetics, teachability and readability. If you prefer

str[(.start + 3)...][..<(.start + 3)]

over

str[.start + 3, length: 3]

then I suppose there really is no accounting for taste. It would be better to stick to

str[.start + 3 ..< .start + 6]

at that point or

str[(.start + 3)...].prefix(3)

I suppose similar reasoning to yours, around straightforward composition, would conclude that the prefix method need not exist at all, given you can already subscript with ranges and advance indices.

Edit: And indeed none of this pitch passes that criteria, given that you can already manipulate indices and slice collections using straightforward composition of existing methods. The whole point here is to make this kind of manipulation less onerous and more usable, not to provide a minimal subset of composable primitives.

jrose · August 30, 2019, 4:35pm

No need to be insulting about it. We've seen over and over again in past reviews that people have different opinions on what's a "simplifying improvement" vs. an "unnecessary complication".

Karl · August 30, 2019, 5:03pm

That’s a good point.

We have conditional conformances, but yes, any additional APIs (e.g. String.uppercased()) would have to be accessed via a “.base” property. Slice, Reversed, etc have the same issue.

I’m not sure how that matters (except if you mean that we can have an index cache in the same type without introducing a wrapper).

My understanding of how Strong breadcrumbs work is that it maintains a much smaller Array of Indexes (not all of them), and binary-searches to find the nearest index to the one you want, thus algorithmically reducing lookup time for any random index.

The benefit over having an array of all indexes is that it’s smaller and, for generic collection-y stuff that doesn’t use type-specific APIs (like slicing), you would just get another Collection with faster random access.

But as I said, that’s a separate-but-related thing that we can consider later. I’m just thinking about how we can solve the potential performance impact of people using these convenient APIs naively.

Michael_Ilseman · August 30, 2019, 6:39pm

Ah, I saw you mention String's breadcrumbs as an example of a technique giving a feature that you'd want for all Collections. The breadcrumbs do not introduce a new type, they just accelerate certain APIs. We can do it for specific types we own (barring ABI constraints), but not for all Collections in general without introducing a new wrapper type.

It's still a linearly sized array. It would still perform an allocation (which can be costly in perf-critical code), do a linear pass over the entire Collection, and add O(n) space (even if the N is divided by, say 32). There is a real tradeoff here.

For String's UTF16View it made sense since the scan loop is very fast, orders of magnitude faster than doing character iteration. Scanning between breadcrumbs is likewise very fast; these can actually be done as a branch-free vectorized loop (yet to be implemented that way though). Random-access to UTF-16 in very perf-critical code is uncommon, i.e. it would likely be hidden behind Objective-C message sends and CF bridging logic anyways, so we have significant preexisting latency to hide behind (i.e. our constant-factors can be higher). Most such accesses are preceded by asking for the length in UTF-16, for which we need a linear scan anyways. And, the strings to which this applies are in the minority, most strings created are programmer/system strings and do not undergo the higher-level processing that wants random-access to UTF-16. Finally, String's storage representation is behind a resilience barrier and we had access to a lot of builtins to do tail-allocation and atomic compare-and-swaps.

Unfortunately, the introduce-a-new-wrapper-type approach wouldn't fix naive code (which is using underlying type directly), and any same-type solution needs to be done on the specific conformer.

Michael_Ilseman · August 30, 2019, 9:07pm

I also wanted some alternative to index(_:offsetBy:limitedBy:) and would have preferred if we could keep arbitrary indices. For now, you can do it by taking a detour through slicing.

We can add convenience APIs to do those operations, though it's probably not worth additional subscript overloads. The main issue is what to name it.

Ranges do not have access to the collection at construction time to answer contains nor the bounds for ordering. We have partial ranges which are treated as having a hole in them, and this pitch puts the hole in the middle. Supporting arbitrary bases would be supporting arbitrary holes, rather than one big one in the middle.

Ranges with holes answer .contains() as if applied to a sufficiently large Collection. That is, (5...).contains(100) == true but myArray[5...][100] may trap if myArray isn't big enough.

.end - 1 < .start + 1 would be a malformed range and traps. If we had arbitrary bases, we would have to treat every index as having a hole before and after it, so .index(j) - 1 ..< .index(i) + 1 would also trap if j > i.

What about the other part "The Trouble with Generic Operator Overloading"? Can you smuggle in arbitrary bases without a generic type?

xwu · August 31, 2019, 2:49am

You’re right; that reads better than str[(.start + 3)...][..<(.start + 3)].

So with such an elegant composition, I’m hard-pressed to see the need for a distinct API:

wadetregaskis · August 31, 2019, 4:34am

Readability. Simplicity. Efficiency.

Even if Swift can’t just have str[3:6] like most other languages - I haven’t really followed the technical discussion in this thread to understand why - it will still be compared against that syntax as the gold standard. The latter syntax is much closer to that ideal syntax than the former. I could even see someone having half a chance of rationalising the latter syntax as somehow better because it’s maybe clearer even if more verbose, but there’s no such hope for the former.