Pitch: Offsetting Indices and Relative Ranges

Michael_Ilseman · April 30, 2019, 9:48pm

If we could go back in time, we'd probably do this for all RandomAccessCollections:

Change the default index for RandomAccessCollection to be an opaque struct wrapping an Int
Fix RangeExpression's relative(to:) same-type constraint (and maybe drop contains as well)
Provide overloads of subscript taking an Int and RangeExpression where Bound == Int

Michael_Ilseman · April 30, 2019, 9:59pm

It's a valid alternative.

That's an interesting observation. If the majority of the uses of offsetting an index is forwards by 1, then another way to carve this up could be @xwu's OffsetRange<Bound> and a postfix-and-infix <...

I don't have a lot of indexing-from-the-end examples on hand, but it's a pretty common feature. Perhaps someone here has an example.

I don't see how improvements over dropLast() for indexes from the end are different than improvements over dropFirst() for indexes from the start.

We're talking about offsets from-the-end, i.e. we're talking about BidirectionalCollections whose indices are not Strideable. What non-String-related collections from the stdlib do I have at my disposal for use cases? Why wouldn't an example from String.UnicodeScalarView be as valuable?

jrose · April 30, 2019, 10:04pm

The reason I ask for non-String examples is because (I maintain) a lot of people use offset-based indexes to do string processing that'd be better done with matching. That is, instead of dropLast(1), I'd rather see dropLast("\n"). I guess the former's always going to be faster, though.

Michael_Ilseman · April 30, 2019, 10:12pm

Could you give me an example of a concrete BidirectionalCollection whose index is not Strideable that you would consider valid for use cases?

Nevin · April 30, 2019, 10:44pm

Perhaps what we ought to have is a split-at-offsets method:

extension Collection {
  func splitAtOffsets(_ offsets: [Int]) -> [SubSequence] {
    var result = [SubSequence]()
    result.reserveCapacity(offsets.count + 1)
    
    var i = startIndex
    var m = 0
    
    for n in offsets {
      let j = index(i, offsetBy: n-m)
      result.append(self[i..<j])
      (i, m) = (j, n)
    }
    result.append(self[i...])
    
    return result
  }
  
  func splitAtOffsets(_ offsets: Int...) -> [SubSequence] {
    return splitAtOffsets(offsets)
  }
}

That way if you are doing something like parsing a fixed-field-size string, you can just put the offsets in an array and split all your strings at once:

let offsets = [5, 20, 30]
let strings = getFixedFormatStrings()
let fields = strings.map{ $0.splitAtOffsets(offsets) }

jrose · April 30, 2019, 11:04pm

I'm not sure there are any, which is also important information for this proposal. If the "count from the end" case only applies to Strings for parsing purposes, it might be better to not have it at all.

scanon · April 30, 2019, 11:08pm

If I'm understanding right, it's pretty common in matrix algorithms. There's a whole class of them that want to do things like "grab the trailing 16x16 submatrix, do something with that, then apply an update to the [0 ..< n-16] x [n-16 ..< n] block, then update the [0 ..< n-16] x [0 ..< n-16] block."

beccadax · April 30, 2019, 11:15pm

I need to dust off some of my experiments in this area, which tend to center on things like:

print(string[.startIndex + 1])

They're a little wordier, but they're crystal-clear and they still omit the base collection, which was always my biggest objection to the status quo.

jrose · April 30, 2019, 11:30pm

Michael originally asked for a non-Strideable case, since you can write plain old endIndex - 16 for that. But I guess the shorthand might still be useful.

griotspeak · May 1, 2019, 12:09am

Streams of notes, especially when doing things like analyzing harmony or voice leading. The use encroaches on something that a ring buffer might be good/better for but sometimes but I've found the tradeoff on creating a new type to be a little questionable for these. Array is convenient.

MrBee · May 1, 2019, 12:49am

Swift is a verbose and expressive language, so I think we could better use more clean and clear syntax instead of ambiguous and cryptic syntax.

So…

Instead of string[--4], it would be better to write it as string[from: .endIndex-4, count: 1].
Instead of str[idx++1], it would be better to write it as str[from: idx+1, to: idx+2] or str[from: idx+1, count: 1].
Instead of str[idx--2...idx++3], it would be better to write it as str[from: idx-2, to: idx+3].

It's a lot easier to read and understand.

Well, this syntax can be implemented like this:

string[from: Int, to: Int] means get substring from from index to to index.
string[from: Int, count: Int] means get substring from from index as many count character(s).
count can be positive (forward) or negative (backward) value. If count goes beyond .endIndex then simply returns substring up to the .endIndex. If count goes before .startIndex then simply returns substring down to the .startIndex.

I think it's better. What do you think?

CTMacUser · May 1, 2019, 2:20am

Collection's design intentionally has offset calculations as methods of the collection and not of the Index type. If the Index type has them (for example, Index conforms to Strideable), you're not supposed to use them for index hopping. The use of "--" and "++" is that they look like "-" and "+" but aren't currently used within standard code.

For the rest of the previous post, doesn't the subsequence subscripting API take care of that? Or the "-->" and "<--" syntax I've seen in other posts here?

zwaldowski · May 1, 2019, 1:49pm

I called out my distaste for a one-off or several-off subscript additions in the previous version of the pitch. It offers a certain purity of purpose (and implementation!), but it doesn’t compose outside of the example in that pitch. You end up needing variants combining index start and offset end, offset start and index end, etc etc. A whole parallel universe to ranges. It’s a band-aid to the problem. I’m not saying this new pitch is a magical salve, but it’s no band-aid.

Composing with RangeExpression enables designing APIs that have useful offset convenience syntaxes of their own, and not just having useful offset syntax on Collection. I.e., and pardon my example as I’m typing on a phone, doThings<T>(in range: T) where T: RangeExpression. Passing around recipes for making ranges, including ones formed on offsets, is as valuable as ranges themselves.

I concur with the preference for dropFirst et. al., but there seems to be a problem with them in the community. A “How do I… with String / collections” post on StackOverflow is way more likely to have something horribly verbose using the index primitives rather than the conveniences we have already. Like I said in my post on the other pitch, if we want those to be the answer to this problem space, then there’s some marketing that we need to do.

Since we can talk about the sigils in this thread until we’re blue in the face, there’s probably room to ask for a (probably parallel) design that forms OffsetRanges using functions with good names, as is seen as Swifty. I don’t have a design for that off the top of my head, though.

lukasa · May 1, 2019, 3:03pm

As a relevant data point, Jordan, SwiftNIO's CircularBuffer has non-Strideable indices, but may reasonably want to be indexed from the back.

Michael_Ilseman · May 1, 2019, 8:08pm

What would be a reasonable example that wants to index from the back?

I also noticed that you defined subscript(offset offset:Int), how often is this used and how? Also, why didn't you choose to make this a RandomAccessCollection?

Michael_Ilseman · May 2, 2019, 12:24am

Some more directions and issues to consider.

The `RangeExpression` conformance issue

RangeExpression requires contains(_:Bound) -> Bool, which we can only answer partially if there is an existing index and not at all if there are only pure offsets. In hindsight, these should have been separate concepts, but they’re not so we’re stuck with it.

One option is to “fake” conformance by having contains return false or provide a partial order (as in this pitch). The thought is that relative(to:) is far more likely to be used by code in the wild. Unfortunately, there is some code in the wild which calls contains.

Another option would be to trap inside contains so at least it will be caught rather than give the wrong answer.

Another is to not conform to RangeExpression. Conformance allows us to avoid adding an extra subscript overload (as well as replaceSubrange and removeSubrange if we care enough to do so). We could instead add another concrete overload for RelativeRange or make a new less-constrained IndexRangeExpression protocol. E.g.

protocol IndexRangeExpression {
  associatedtype Index // : Comparable (maybe) because `relative` requires it...
  func relative<C: Collection>(to: C) -> Range<C.Index> where Index == C.Index
}

The same-type constraint on Index would force OffsetRange, discussed below, to continue down the phantom-type path.

Deprecating RangeExpression at this point is a little tough as there are many conformers in the wild. We do not have compiler support currently to insert IndexRangeExpression as a parent of RangeExpression, but we could do that in the future whenever supports lands.

`OffsetRange`

This is @xwu’s suggestion. This would represent offsets from the start or end of a collection but not from a given index. While this loses the flexibility of offsetting an existing index, it gains the ability to be used as a currency type applicable to multiple collections, which Range<Index> is not. On the other hand, such a thing doesn’t represent much functionality and could be replaced with e.g. a closure that slices a collection.

It requires an OffsetBound type as well. If this replaces RelativeRange and RelativeBound, it is a simplification of this pitch. Otherwise, it is a pair of new types in addition to this pitch.

This has the same conformance issues with respect to RangeExpression, just s/Relative/Offset. If this conforms to RangeExpression or some IndexRangeExpression, OffsetBound and OffsetRange will need to be phantom-typed, limiting their applicability to those collections with the same index type it was defined for:

public enum OffsetBound<👻> {
  case fromStart(Int)
  case fromEnd(Int)
}
public struct OffsetRange<👻>: IndexRangeExpression {
  var lowerBound: OffsetBound<👻>
  var upperBound: OffsetBound<👻>
}

If we go the route of no protocols, that is just concrete overloads for subscript (and others if we care), then OffsetRange can lose the phantom type and be a little more general. In exchange, we have more overloads.

Bikeshedding Operators

++/-- are understandably controversial. I gave a prototype of using symmetric --> and <--, which can be substituted with any desired symmetric pair. Let there be bikeshedding.

Faking Syntax

This whole area of ranges is faking proper language syntax. We use operators, which necessitate a gaggle of types such as PartialRangeUpTo, RelativeBound, etc., which are needed for intermediary results. Really, there should be Range and everything else is a just means to get to Range. We don’t have infix operators with implicit arguments, so we fake them with prefix/postfix operators. Prefix/postfix operators do not participate in infix operator precedence, so we fake it by adding yet more overloads. The no-fix ... operator is its own special thing. Further, we are limited in syntax for offsets by what general purpose operators are available (e.g. no ->). These put more stress on the type checker than would be necessary with proper language syntax for range expressions.

An alternative is to do proper language syntax for range expressions. This would be an open syntax design area. It can be library-extensible through an ExpressibleByRangeExpression protocol, which is invoked with the subexpressions and produces a Range (somewhat analogous to how ExpressibleByStringInterpolation is invoked over a list of segments).

RangeExpression and anything we add from this pitch would be obsoleted by the new syntax and approach.

Practically speaking, unless a very motivated individual steps up to drive this, choosing to wait for language syntax is likely postponing progress here by years.

As for potential for type checker issues, @xedin, how can we evaluate the new overloads in this pitch? I was careful to avoid overlap, but how should we think about the impact of these?

nnnnnnnn · May 2, 2019, 1:17am

Re: the offset range, you can get half the functionality of index-based bounds back by slicing the collection with your index first:

// relative range
collection[i++2 ..< --4]
// offset range
array[i...][++2 ..< --4]

You wouldn't be able to get "outside" the index-based slice, but it covers some of the use cases.

hartbit · May 2, 2019, 7:19am

I think not conforming to RangeExpression is the safer solution for users of the API: and they won't care about any extra subscripts.

I'm starting to like the simplifications brought by only having OffsetRange.

What would this syntax look like?

lukasa · May 2, 2019, 9:11am

The nature of a CircularBuffer is that it can be used either as a queue or a stack depending on the specific use-case. When working with it as a stack, indexing from the back is the natural mechanism for implementing stack_peek.

This was implemented because in NIO 1 CircularBuffer was Int-indexed. We wanted to avoid breaking code that already relied on this assumption, so we introduced this subscript. Additionally, we are aware that for Array-like data structures, Int-indexing feels natural to many programmers and we wanted to preserve it.

Given that the vast majority of NIO code in the world was written against NIO 1 and translated to NIO 2, my assumption is that subscript(offset offset: Int) is used pretty heavily, though I have no specific data to hand on its prevalence. I can say that the core NIO projects do not use it as they exclusively use CircularBuffer as a FIFO.

We did.

Michael_Ilseman · May 2, 2019, 5:32pm

That's also where I leaning now. The question is do we introduce a new protocol that will eventually be made a parent of RangeExpression?

What about this do you like? All things being equal, simpler is better of course. But, for the purposes of evaluating tradeoffs, could you provide your perspective as a user?

Also, the phantom-typed one or the non-phanton-typed one? If non-phantom-typed, meaning no IndexRangeExpression-like protocol is available, we could go back to an explicitly labeled subscript taking OffsetRange, and have the ..< and ... syntax available for OffsetRanges. That is, we wouldn't have any syntax such as ++/-- or -->/<--.

It's fairly open ended and up to whoever designs this. We might keep ..< and ... as is for source compatibility and existing knowledge, but as proper syntax. Then, we'd have other syntactic constructs available to us such as -> and : if we wanted.

Pitch: Offsetting Indices and Relative Ranges

The RangeExpression conformance issue

OffsetRange

Bikeshedding Operators

Faking Syntax

The `RangeExpression` conformance issue

`OffsetRange`