Call for Users (and Authors): Offset indexing pitch


(Michael Ilseman) #1

Hello everyone, I’d like to throw out a simplified offset-based indexing and subscript that I think is pretty minimal and straight-forward. There are some design questions involved that can only really be answered through actual usage. Please copy this code into your projects and have fun; let us know how you use it!

Also, I’m looking for a co-author (or primary author) on this pitch who believes in this minimalistic approach. There’s a (albeit narrow) window in which we could squeeze this in for Swift 5.1, but that would require a diligent co-author who can help sort out these issues and feedback from the review. Let me know if you’re interested!

Some points worth exploring

Should the result of subscript(offset:) be optional?

An optional result allows us to avoid having to explicitly guard against the length all over the place, and is a frequently requested addition to even Array. For String, optionality allows us to avoid the O(n) scan to answer count.

But, if most usage is in a context where we already know the length, then optionality will result in spurious !s littered throughout the code.

Related, what about index(atOffset:)? Optional seems a little more appropriate and aligns with index(_:offsetBy:limitedBy:).

Should index(atOffset:) return nil instead of endIndex if passed count?

endIndex is not part of indices, but is used as an upper bound for access. Should we consider count to be an invalid offset, index(atOffset:) should return nil rather than endIndex, which would be consistent with subscript.

What other kinds of benefits can you think of (outside of String)?

Offset-based subscripting seems like it would benefit slices of random-access collections. Integer indices for slices of RACs are relative to the base’s start rather than the slice’s. This can be especially problematic for Data, which is its own slice type. Would this avoid bugs or be useful in your code?

The code

extension Collection {
  internal func _range(fromOffset range: Range<Int>) -> Range<Index> {
    let lower = index(atOffset: range.lowerBound) ?? endIndex
    let upper = index(lower, offsetBy: range.count, limitedBy: endIndex) ?? endIndex
    return lower..<upper
  }
  internal func _range(fromOffset range: ClosedRange<Int>) -> Range<Index> {
    return _range(fromOffset: range.lowerBound..<(range.upperBound &+ 1))
  }
  internal func _range(fromOffset range: PartialRangeUpTo<Int>) -> Range<Index> {
    return _range(fromOffset: 0..<range.upperBound)
  }
  internal func _range(fromOffset range: PartialRangeThrough<Int>) -> Range<Index> {
    return _range(fromOffset: 0...range.upperBound)
  }
  internal func _range(fromOffset range: PartialRangeFrom<Int>) -> Range<Index> {
    return (index(atOffset: range.lowerBound) ?? endIndex)..<endIndex
  }

  public func index(atOffset offset: Int) -> Index? {
    return index(startIndex, offsetBy: offset, limitedBy: endIndex)
  }
  public func offset(of index: Index) -> Int {
    return distance(from: startIndex, to: index)
  }

  public subscript(offset offset: Int) -> Element? {
    guard let idx = index(atOffset: offset), idx != endIndex else { return nil }
    return self[idx]
  }

  public subscript(offset offset: Range<Int>) -> SubSequence {
    return self[_range(fromOffset: offset)]
  }
  public subscript(offset offset: ClosedRange<Int>) -> SubSequence {
    return self[_range(fromOffset: offset)]
  }
  public subscript(offset offset: PartialRangeUpTo<Int>) -> SubSequence {
    return self[_range(fromOffset: offset)]
  }
  public subscript(offset offset: PartialRangeThrough<Int>) -> SubSequence {
    return self[_range(fromOffset: offset)]
  }
  public subscript(offset offset: PartialRangeFrom<Int>) -> SubSequence {
    return self[_range(fromOffset: offset)]
  }
}

extension MutableCollection {
  public subscript(offset offset: Int) -> Element? {
    get {
      guard let idx = index(atOffset: offset), idx != endIndex else { return nil }
      return self[idx]
    }
    set {
      guard let idx = index(atOffset: offset), idx != endIndex && newValue != nil
      else { return }
      self[idx] = newValue!
    }
  }

  public subscript(offset offset: Range<Int>) -> SubSequence {
    get {
      return self[_range(fromOffset: offset)]
    }
    set {
      self[_range(fromOffset: offset)] = newValue
    }
  }
  public subscript(offset offset: ClosedRange<Int>) -> SubSequence {
    get {
      return self[_range(fromOffset: offset)]
    }
    set {
      self[_range(fromOffset: offset)] = newValue
    }
  }
  public subscript(offset offset: PartialRangeUpTo<Int>) -> SubSequence {
    get {
      return self[_range(fromOffset: offset)]
    }
    set {
      self[_range(fromOffset: offset)] = newValue
    }
  }
  public subscript(offset offset: PartialRangeThrough<Int>) -> SubSequence {
    get {
      return self[_range(fromOffset: offset)]
    }
    set {
      self[_range(fromOffset: offset)] = newValue
    }
  }
  public subscript(offset offset: PartialRangeFrom<Int>) -> SubSequence {
    get {
      return self[_range(fromOffset: offset)]
    }
    set {
      self[_range(fromOffset: offset)] = newValue
    }
  }
}
Example use
let str = "abcde\u{301}fg"

print("str elements")

print(str[offset: 0])
print(str[offset: 3])
print(str[offset: 4])
print(str[offset: 5])
print(str[offset: 6])
print(str[offset: 7])
print(str[offset: 8])
print(str[offset: 9])

print("str slices")

print("1")
print(str[offset: 0..<str.count])
print(str[offset: 0...str.count])
print(str[offset: 0...1_000_000])

print("2")
print(str[offset: 3..<str.count])
print(str[offset: 3...str.count])
print(str[offset: 3...1_000_000])

print("3")
print(str[offset: 3..<4])
print(str[offset: 3...4])
print(str[offset: 3...(4+1)])

print("4")
print(str[offset: ..<5])
print(str[offset: ...5])
print(str[offset: ...str.count])
print(str[offset: ...1_000_000])

print("5")
print(str[offset: 4...])
print(str[offset: str.count...])
print(str[offset: 1_000_000...])

var arr = [1,2,3,4,5,6]

arr[offset: arr.count] = 8
print(arr)

arr[offset: 3] = 8
print(arr)

arr[offset: 2...] = Array(arr[offset: 2...].reversed())[...]
print(arr)

print("slice")

arr = [1,2,3,4,5,6]
print(arr[2...][3])
print(arr[2...][offset: 3])

Finally

Big shout-out to @Letan, @dabrahams, @QuinceyMorris, @xwu, @Karl, @nnnnnnnn, @SDGGiesbrecht, @itaiferber, and apologies to whoever I missed. Every one of you have contributed to this area, and I’d like to acknowledge your work in the eventual pitch. This would not preclude a future direction involving a slicing DSL, offsets-from-end, cleaning up RangeExpression, etc.


Persistent Sequence
(Brent Royal-Gordon) #2

Comments, in no particular order:

  • I sort of like killing two birds with one stone by making the result Optional, but I don't like the fact that you can assign nil to an element and it just gets eaten. It seems like a code smell. It might be better to handle this kind of sloppy subscripting with a separate read-only feature.

  • If we split off out-of-bounds access into a separate feature, offset-based subscripting might make more sense to expose as a byOffset or zeroBased property returning a view wrapping the original Collection. That would also mean we could potentially cache the results of index-to-offset conversions (for non-RandomAccessCollections) in a separate instance variable alongside the collection, and invalidate the cache when the collection is mutated.

  • Any reason your implementation doesn't use RangeExpression? It was specifically introduced to abstract over Collection slicing subscripts.

extension Collection {
  internal func _range<R>(fromOffset range: R) -> Range<Index> where R: RangeExpression, R.Bound == Int {
    let r = range.relative(to: 0 ..< .max)
    let lower = index(atOffset: r.lowerBound) ?? endIndex
    let upper = index(lower, offsetBy: r.count, limitedBy: endIndex) ?? endIndex
    return lower..<upper
  }
  
  public func index(atOffset offset: Int) -> Index? {
    return index(startIndex, offsetBy: offset, limitedBy: endIndex)
  }
  public func offset(of index: Index) -> Int {
    return distance(from: startIndex, to: index)
  }
  
  public subscript(offset offset: Int) -> Element? {
    guard let idx = index(atOffset: offset), idx != endIndex else { return nil }
    return self[idx]
  }
  
  public subscript<R>(offset offset: R) -> SubSequence
    where R: RangeExpression, R.Bound == Int 
  {
    return self[_range(fromOffset: offset)]
  }
}

extension MutableCollection {
  public subscript(offset offset: Int) -> Element? {
    get {
      guard let idx = index(atOffset: offset), idx != endIndex else { return nil }
      return self[idx]
    }
    set {
      guard let idx = index(atOffset: offset), idx != endIndex && newValue != nil
        else { return }
      self[idx] = newValue!
    }
  }
  
  public subscript<R>(offset offset: R) -> SubSequence
    where R: RangeExpression, R.Bound == Int 
  {
    get {
      return self[_range(fromOffset: offset)]
    }
    set {
      self[_range(fromOffset: offset)] = newValue
    }
  }
}

(Tony Allevato) #3

It probably kills the 5.1 window, but this is one of those situations that pops up where you want to have a getter and setter with different optionality:

  • Many nil-resettable properties would be better represented as a non-optional getter with an optional setter. For example, UIColor properties where setting to nil reverts it to a default.
  • Subscripts like this, where it makes sense for the getter to be optional but the setter to be non-optional.

Now, the first use case is probably even better served by property behaviors, but if we set that aside, does it make sense to try to address differing optionality first here?


#4

For the benefit of future readers of this thread, I think it's worth clarifying that you're talking about offset-based access to String, but no other Collection. You don't actually say that up front. It becomes clear eventually, but it's a little confusing. [Edit: my mistake.]

I'm in favor of a partial solution to "the offsetting problem" until we get around to a more general solution (for String or Collection), but I would recommend assuming that the partial solution will be superseded by a different solution, rather than being extended into a different solution. (Maybe it will be extended that way, but who knows.)

For that reason, I'd recommend using "throwawayable" keywords — ones that aren't likely to conflict with the "real" solution later. In particular, I'd suggest using "atOffset" instead of just "offset" for the subscript's keyword. That avoids source code breakage if someone decides to use the simple "offset" for something more general later.

Yes, but an optional result forces everyone to handle the optional all over the place every time. I think I'd prefer it to trap instead of returning nil.

How about a second read-only subscript that returns an optional? Something like:

subscript(atOffset offset: Int) -> Element { get set }
subscript(whereOffset offset: Int) -> Element? { get }

(See above, under "throwawayable".)

I may be out on my own here, but I prefer to see less optionality forced on us. Maybe another duplication here, as I suggested for the subscript?

func index(atOffset offset: Int) -> Index
func index(whereOffset offset: Int) -> Index?

The downside is that returning nil makes two things that mean "at end", and I'm betting that would uglify code. What if index(atOffset: count) returns .endIndex, but index(whereOffset: count) returns nil?


(Michael Ilseman) #5

Hmm, I intended the opposite. The code listing is an extension of Collection.

It really depends on how it is used, and that's a main point of this post. I address that in the next paragraph. The whole section:

Maybe a code example is clearer:

Optional:

guard let element = c[offset: i] else { ... }

Trapping:

guard i < c.count else { ... }
c[offset: i]

We need experience from usage before deciding what the obviously correct version is.

A code example might be clearer.

Returns endIndex:

guard idx = c.index(atOffset: i), idx != c.endIndex else { ... }

Returns nil:

guard idx = c.index(atOffset: i) else { ... }

The main thing to see is when is the idea of endIndex being at an offset useful in practice? It's not formally part of a collection's valid indices, but is an upper-bound.

edit: grammar


#6

Oh, I see now. I saw this question:

and ran with the wrong implication.

FWIW, I think it's worth doing something about String, because extracting substrings
in Swift is just embarrassing right now. I'm less motivated to see a Collection-wide quick fix.

Yes, but where does i come from? In some places, you might be handed an arbitrary Int (such as a value passed in as a parameter), but in many actual algorithms, you'll derive i, j, k, etc as a preliminary step that already involves count. Then, having to deal with optionals when using the offsets as subscripts is just aggravating.

You may well have something like:

var insertionIndex = c.index(atOffset: i)

where i might be count, depending on what happened earlier. This is a reasonable scenario to use .endIndex, but you have go the long way round to get it.

I like the idea of doing something simple, but the earlier proposals didn't fail because of complexity, rather because there was genuinely no agreement on how to handle important details like these (whether to return optionals, etc). Maybe it's just me, but we don't seem to have sailed clear of those issues yet.


(Morten Bek Ditlevsen) #7

If the offset subscript is added to Collection, will it not cause confusion for RandomAccessCollections where you can then use both index and offset to get a very similar result?
I think that it could make sense to only add it for String since this is the type where most people get confused about index subscripting.


#8

I feel like most people asking for integer offsets want to treat a collection like Array, or port code written for Array to a generic context, and therefore I think that the subscript should probably match Array behaviour (i.e. trapping). There have been many previous discussions about adding a form of “lenient” subscripting to Array, and any solution there could also be adopted on Collection with the suitable addition of "offset" somewhere in the name.


(Cory Benfield) #9

Before making any concrete comments I want to say that I think this is probably a good idea. NIO just spent some time opaquifying many of its indices into its own custom data structures (those indices were previously Int), and as part of that work we intend to add Int-based subscripts that do essentially exactly this.

That said, there are some notes:

  1. This makes Set and Dictionary pretty weird. Their conformance to Collection is already slightly awkward in a world where their Index is an opaque nothingburger: being able to Int-subscript a Set or Dictionary is profoundly strange. @lorentey may have more thoughts here.

  2. Regarding optional results, and noting @brentdax's feedback about the ability to assign nil, I think that that ship has sailed. In general, it seems to be the case that out-of-bounds indexing has standardised on trapping instead of returning nil. The server community doesn't love this behaviour (it's easier to handle nil), but it's the standard behaviour.

    This also maps coherently with index(after:) and index(_:offsetBy:), both of which have non-optional results that either map to endIndex (which cannot validly subscript the collection) or will trap, depending on what the implementers want to do. Indeed, the documentation for index(_:offsetBy:) call out explicitly:

    The value passed as distance must not offset i beyond the bounds of the collection.

    If we fundamentally consider index(atOffset) as being a shorthand spelling for index(_:offsetBy:) then we should attempt to keep the behaviour the same, unless we have strong reason to want to spell these differently.

  3. I'm nervous about these extending Collection. Specifically if you conform to Collection and not RandomAccessCollection, the index(_:offsetBy:) calls are explicitly expected to be linear in the size of the offsetBy: argument. Repeated indexing using these integer offset subscripts would trivially lead to accidentally quadratic behaviour, which I suspect would be very likely.

    I think I'd want to push for these to be available only on RandomAccessCollection.


#10

We can't relitigate this issue on every single discussion about Collection/Sequence. It needs to be addressed as its own thread/pitch/proposal that somehow addresses the issue holistically, rather than derailing every other discussion trying to find a hack around each individual small corner of weirdness.

String isn't a RandomAccessCollection, and this seems like the most important/frequently requested use for integer offsets. I think we have to just accept that people are going to write things that are accidentally quadratic, and take comfort in some combination of:

  • The labelled subscript somewhat discourages use;
  • For a lot of use cases, particularly in beginner code processing strings where this is oft-requested, efficiency probably isn't that important;
  • Improvements to string-processing APIs in particular, and Collection-processing algorithms in general, will hopefully de-emphasise this kind of subscripting and explicit looping.

(Cory Benfield) #11

I don't think I was derailing or asking to relitigate, nor did I say I was opposed to this proposal. I merely wanted to note it for any future proposal, and attract the attention of someone who worries about these data types more than I do. I agree, we should not relitigate Set and Dictionary's conformance to Collection on this issue. However, I do think it's a requirement that any extension to a standard library protocol should explicitly address any standard library types that may be affected by this extension in a way that is suboptimal.

Put another way: don't mistake the flagging of imperfection as an attempt to derail a discussion. It's not. It's an attempt to provide information that will help produce a proposal that clearly documents decision making, so that in the future we have a reference for the logic behind specific decisions. As I said at the top of my post, I think this is a good idea.

Agreed. However, I do think attention should be paid to how easy we make it to stumble into accidentally quadratic behaviour.

Calling out String is really interesting, because it makes a good point. People coming from non-Swift programming languages repeatedly stumble on the string indexing behaviour, and in that stumble they reveal an important heuristic they are bringing with them from a previous language: the idea that String indexing is constant-time.

Now, it is reasonable to say that programmers should have the affordances they need to translate their mental model from a previous language to Swift. But it is important to consider whether "make this easy" is unquestionably a good thing.

I think there's a sliding scale for this feature.

  1. Dictionary/Set. This is just weird, and the semantics are entirely unclear. Users are likely to conclude that these subscripts have unclear semantics, and avoid them.
  2. String and other non-RandomAccessCollections. Here the semantics are clear, but the performance characteristics may not be. Programmers coming from other languages may translate code that was previously linear-time, but now is quadratic.
  3. RandomAccessCollection. Both the semantics and performance characteristics are clear.

Case (3) is an unalloyed win, and we should definitely do that. Case (2) is a judgement call. For my part, I am personally opposed to performance traps like this one. My opposition comes as a result of noting that the average programmer is bad at tracking down expensive operations in their code. In particular, accidentally-quadratic operations rarely occur in testing environments, and almost invariably occur in production ones.

A particular concern for me is server-side code. Developers using this on any kind of user-provided data on the server are attracting a trivial denial-of-service attack on their service from any user that submits a long string to them. That code, which would have been fine in (say) Python, becomes profoundly not-fine in Swift.

Again, this is not an argument for not doing this: it's an argument for thinking very hard about it.

A note: String can always be treated as a special case, and can have these methods added to it without conforming all Collections. The only reason that I can see to add it to Collection is if we want to allow generic programming across all Collections in this way. I think that is something we should very much avoid: anyone writing generic algorithms across Collection should be thinking very hard about the implied semantics of Collection.


#12

I've participated in several such threads, and I think it is more accurate to say that those wishing for Int-indexable strings don't care about the performance characteristics.


(Cory Benfield) #13

I don't think that's quite accurate. If integer-based String indexing took 30 seconds per operation, users would very much care about the performance characteristics.

I suspect that it's more likely that what you mean is that "for reasonably sized strings and reasonable integer indexing, users don't care so long as the performance feels reasonable" and...sure, yes. But that's because there are weasel words everywhere.

But that heuristic fundamentally boils down to "users don't care about performance until they do". Now, let me say this again: I am totally ok with adding offset-based indexing to String if we think the benefits outweigh the costs. I think this is getting lost in my wider discussion about this proposal. But it's important to remember that this proposal is "add offset-based indexing to Collection", not to String.

My notes on this specific part of the proposal have boiled down to:

  1. I don't think we should add offset-based indexing to all Collections.
  2. I think we should think hard about adding it to String.

On point (2) I care much less: it's fundamentally one weird behaviour of one weird data type. But on point (1) I feel quite strongly. If we extend Collection with this behaviour it will affect every Collection that anyone implements, and I think it's likely that many of those will fall into this accidentally-quadratic trap.


#14

I mean that the people I've seen arguing for Int-based indexes have never come across strings that present problems for their code. I did not mean to suggest that such people are not arguing from a position of selection bias. The general argument boils down to: I do it in other languages, and it's annoying to write my own subscript() and it is just fine and my code is perfect™️.

I, personally, would love to see such Int-based subscripting on strings. I've written plenty of code, in Swift, that would benefit from the conciseness, while not being unduly burdened by the inefficiency. The difference is, I know the trade-off I am making. As you point out, there is a risk to making this trade-off be the default that all newcomers will reach for.


#15

Sure, I wasn't saying that you were trying to derail the discussion, or that you were opposed to it. However, I have seen this seemingly-innocent point cause anything from several dozen to 100+ posts in unrelated Collection API discussions before. The semantics of Collection perfectly support this subscript, and offsetting indices generally, so I personally don't see why it shouldn't be added at the protocol level, along with clear documentation about the performance characteristics and traps.


(Michel Fortin) #16

I love the look of this pitch, but I too have some concerns about it being overused. Fortunately, an [offset:] subscript is easy to notice and search for in a code base once you become concerned about performance, so I think it's fine.

Regarding strings, could it be be that some of those arguing for Int-based access are dealing with ASCII strings? It'd be nice if offset-based indexing was O(1) when a string is purely ASCII. Without a static guaranty about ASCII-ness however, that could make Unicode the degenerate case for those offset-based algorithms.


(Konrad `ktoso` Malawski) #17

Seconding that question, as it is just what I was going to ask. I can only see myself operating like this on strings when I know™ that they are supposed to be ascii, and actually, they are some silly text based protocol etc.

I saw some proposals in the past about ASCIIString anyone more in the loop about those know what's the status of those?


(Karl) #18

I mentioned this in another thread about this topic recently. I think it would generally be worthwhile for us to have a way to promote non-RACs to RAC using a cache. Any RAC can be manipulated by integer offsets without blowing up the existing performance model. This approach would mirror other discussions we've had on here about promoting Sequences to Collections via a buffer, and StringBreadcrumbs, which solves this exact problem for String's UTF16 view.

I think that's generally what you want. If you're slicing a collection by offsets once, we already have very concise prefix/suffix functions and there isn't a huge readability gain by writing that one operation in fewer characters.

In all of the practical examples I've been able to come up with, most of the space for improvements is really about traversing and offset-slicing the same collection multiple times (or slicing, then slicing that slice again), which favours the caching approach.

Another cool thing is that, because the standard library owns the Collection/RAC protocols, we are in a unique position to add dynamically-dispatched members and make the caching a no-op when the type dynamically conforms to RAC, even if we can't statically determine that:

protocol Collection {
  // ...
  func _createBreadcrumbs(_ n: Int) -> [Index]?
}
extension Collection {
  func _createBreadcrumbs(_ n: Int) -> [Index]? { /* default implementation never returns nil */ }
}
extension RandomAccessCollection {
  func _createBreadcrumbs(_ n: Int) -> [Index]? { return nil }
}

func someGenericAlgo<C: Collection>(_ c: C) {
  let wrapper = makeRandomAccess(c)
  // Do stuff using random access, integer indexes, whatever. Remains fast for RACs.
}

let erased = AnyCollection([1, 2, 3, 4])
someGenericAlgo(erased) // Still fast. AnyCollection can forward the requirement.

I'm not sure if all ASCII strings can be done in O(1), because of CRLF. I think we might normalise them all to LF, though. Another nice thing about the above approach is that the type can determine dynamically if there are properties which allow for informal random access.


(Jonas) #19

Would you, or someone else for that matter, mind listing some of your real world use-cases for indexing by an integer offset?

I am extracting substrings all the time, but almost exclusively with either:
a) a range that is coming from another API (regexp match result, String.range(of:) etc. Extracting a substring is str[range]
b) in case of parsing textual formats that need to be parsed character-by-character, a range that starts with a start index of a slice (that slice represents the remaining-to-parse portion of the string). Extracting a range is str[slice.startIndex..<someEndIndexDerivedFromThatSliceStartIndex]


(Davide De Franceschi) #20

I agree that this can be helpful, but in the end isn't makeRandomAccess basically Array.init<S: Sequence>?