Call for Users (and Authors): Offset indexing pitch

That type of example, parsing a string of known fixed-length format, has come up in past threads on this subject. Perhaps we ought to consider addressing such use cases with:

func split(atOffsets: [Int]) -> [SubSequence]

So a single call will tokenize the entire string.
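A minimal sketch of how such a method might behave, assuming the offsets are ascending distances from `startIndex` and that offsets past the end are clamped rather than trapping (both are assumptions; the pitch does not specify either):

```swift
extension Collection {
    /// Hypothetical helper, not part of the standard library.
    /// Cuts the collection at each ascending offset measured from `startIndex`.
    func split(atOffsets offsets: [Int]) -> [SubSequence] {
        var pieces: [SubSequence] = []
        var lower = startIndex
        var consumed = 0
        for offset in offsets {
            precondition(offset >= consumed, "offsets must be ascending")
            // Assumption: clamp to endIndex, so a too-large offset yields a short or empty field.
            let upper = index(lower, offsetBy: offset - consumed, limitedBy: endIndex) ?? endIndex
            pieces.append(self[lower..<upper])
            lower = upper
            consumed = offset
        }
        // Whatever remains after the last offset becomes the final field.
        pieces.append(self[lower..<endIndex])
        return pieces
    }
}

let record = "2019-12-25T08:30"
let fields = record.split(atOffsets: [4, 5, 7, 8, 10])
// fields: "2019", "-", "12", "-", "25", "T08:30"
```

Because each cut continues from the previous index, the whole record is walked once even on a non-random-access collection such as String.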

This dodges the question though :wink: The point I was getting at is whether those are not most often used, and most useful, in ASCII contexts, where such lookups can be guaranteed to be O(1), unlike in general Unicode strings.

Hi Michael, as privately discussed previously — definite +1 on this. I think this is something sorely needed which will make various functionality much easier to represent. The explicit offset: argument is also helpful in differentiating this from regular indexing, as others have noted, and gives an easy way to audit later for potential O(n) access like this.

Some thoughts:

  1. I agree with @beccadax and @lukasa about optionality: introducing it here feels somewhat awkward and inconsistent. In particular, I'm unhappy about nil assignment into a MutableCollection. Someone used to assigning nil into, say, a dictionary expects it to mean "remove this element", but assigning nil into an arbitrary MutableCollection would simply drop the assignment on the floor. (Of course, if you know the reasoning behind this, namely that MutableCollection does not allow you to change its count, things make sense, but it's a bit of a gotcha.)

    I definitely prefer consistency here: indexing past the end traps, like everywhere else. This allows us to drop optionality altogether.

  2. As mentioned prior, I think we should offer these methods for RangeReplaceableCollection as well. There's generally overlap between things which are MutableCollection and things which are RangeReplaceableCollection, but I think there's value in offering both — specifically for allowing assignments which do change the length of the collection (e.g. inserting or replacing subranges of a string with different-length strings/substrings)

    (I've been playing around with this internally and have found it useful)
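As a rough illustration of the length-changing case in point 2, here is a sketch of an offset-based replacement on RangeReplaceableCollection. The `replaceSubrange(offsets:with:)` spelling is made up for this example and is not part of the pitch or the standard library; it traps on out-of-range offsets, in line with point 1:

```swift
extension RangeReplaceableCollection {
    /// Hypothetical spelling: replaces the elements at the given offsets from `startIndex`.
    /// Traps if the offsets run past the end, like ordinary index arithmetic.
    mutating func replaceSubrange<C: Collection>(
        offsets: Range<Int>,
        with newElements: C
    ) where C.Element == Element {
        let lower = index(startIndex, offsetBy: offsets.lowerBound)
        let upper = index(lower, offsetBy: offsets.count)
        replaceSubrange(lower..<upper, with: newElements)
    }
}

var greeting = "Hello, world"
// The replacement is one character longer than the range it replaces.
greeting.replaceSubrange(offsets: 7..<12, with: "Swift!")
// greeting is now "Hello, Swift!"
```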

I'm willing to help co-author once this has been discussed a bit more, though if others are willing to help too I would gladly accept their help. (Like you I am currently pressed for time on some other things.)

2 Likes

I would be -1 for this. The integer subscripting into String has been intentionally rejected multiple times for a reason. This just feels like a more general approach to the same thing (and has been stated as one of the most common use cases for this).

Swift indexes exist for a reason and integer offsets on a basic collection were intentionally left out of the language. I feel like proposals like this are just people swimming against the current. There are good reasons that things like C for loops have been removed from the language and integer offsets into collections are not permitted unless a very specific set of criteria are met.

While this is slightly less likely to be abused because it is a named subscript and therefore not the default, I feel like it would only be a matter of time until everyone is reaching for this instead of the intended tool, especially if all beginners are just taught to use this because it's easier and familiar. Rather than fighting the language's principles to tailor it for beginners, beginners should be taught the reasons behind the language principles. Rather than suggesting people avoid indexes altogether we should help people to embrace them.

Swift claims to be performant. The goal is to run as fast as C code when optimized and things like this would make that impossible and your average user would probably have no idea why their code is running 4x slower than an equivalent piece of C code.

Also, the fact that this would also affect both Set and Dictionary cannot lightly be tossed aside. If an API has an effect on two core objects of the standard library then that effect must be taken into consideration. This would also impact any existing code which conforms to collection, some of which may also not make sense to have an integer offset (I can think of at least one collection in some of my own pet projects where it would not make sense to have an integer offset).

I view this as a purely sugar proposal which can be trivially implemented if truly desired. It results in strange functionality being permitted on multiple core types, not to mention the effect it would have on 3rd party code for which it may or may not be desirable.

2 Likes

Not unless Array is the only random-access collection in the universe.

Just because something is a certain way, doesn't mean no other ways can exist. Otherwise we might as well go home - C++ is the way it is "because reasons", and clearly they must be good reasons or it wouldn't be like that, amirite?

1 Like

As is made evident by the very fact that Swift doesn't follow the ways of all other languages and allow integer indexes on all collections :wink:

C++ ironically allows integer indexes into vector/array.

The only reasons I've seen for integer indexing are simplicity and consistency with other languages. While the reasons against it include, but are not limited to:

  • Integer indexes in Swift are reserved for, and imply, RandomAccessCollection, which String specifically is not
  • Performance pitfalls caused by ease-of-use over correctness of use
  • Pure sugar proposals for trivial extensions have a much higher bar for acceptance

There's probably a lot more, and there are several threads on the forum discussing integer indexes into String specifically, but this pitch is right along those same lines.

Not ironic at all - unless I missed some big news, Swift also allows integer indexes into arrays.

In fact (and this is going to sound crazy given how much people love to talk about it, but I promise you it's true) - Strings are random-access collections and they support integer indexing. Yeah, I know! Problem solved!

The caveat (of course there is one) is that you can only access that functionality via an NSRange, and you have to communicate in terms of UTF-16 code-unit offsets, which are kind of meaningless to users/programmers/Swift itself. But if you're cool with that, it totally works.
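For readers who have not used that route, a small sketch of what it looks like through Foundation's NSString. The sample text is invented for illustration, and the accented letters are written as escapes so that each one is a single UTF-16 code unit:

```swift
import Foundation

// "naïve café" with precomposed accented letters.
let text = NSString(string: "na\u{EF}ve caf\u{E9}")

// NSRange locations and lengths count UTF-16 code units, not Characters.
let word = text.substring(with: NSRange(location: 6, length: 4))
// word == "café"

// Where the two units diverge: one Character, two UTF-16 code units.
let thumbsUp = "\u{1F44D}"
let characterCount = thumbsUp.count
let utf16Count = thumbsUp.utf16.count
```

The last two values differ (1 versus 2), which is why UTF-16 offsets rarely line up with what a user would call "the nth character".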

There's no reason we couldn't do the same thing without the NSRange and using character offsets. In fact, there's no reason we couldn't do the same thing for any old Collection you found wandering around in the rain. But I'm not going to repeat myself - it's in my post like... a tiny bit above this one.

What are some use cases for this? I feel like this comes up because some beginner is trying to rewrite some pseudocode into Swift, and the example they're trying to replicate uses C-style loops and treats strings as arrays.

If you really need to extract a specific offset from a specific well-known string, maybe to validate well-formed user-entered strings, you should probably convert the string-as-a-sequence into an array of characters using let characters = Array(string) and then do, e.g., characters[13] on that.
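Spelled out, that suggestion looks roughly like this (the sample string and field positions are invented for illustration):

```swift
let string = "AB-1234-XYZ"

// One O(n) pass up front; every Int subscript afterwards is O(1).
let characters = Array(string)

let dash = characters[2]                 // "-"
let digits = String(characters[3..<7])   // "1234"
```

The cost is an extra copy of the string's characters, which is usually fine for short, fixed-format input.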

2 Likes

Pure ASCII isn't really the requirement. The requirement is that nothing outside of the ASCII range will match any ASCII characters that one is looking for.

What's needed here is a collection of an unsigned 8-bit type with a semantic guarantee that anything in the 0-127 range is an ASCII character, ASCII strings will only match bytes in that range, and that anything outside of that range will be passed through without any changes. That's what things like parsing PNG headers, ANSI control codes, HTML, Telnet, etc. require.
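String's existing `utf8` view already behaves this way for searching, since every byte of a multi-byte UTF-8 sequence is 0x80 or above and so can never match an ASCII byte. A small sketch under that observation (the key=value input is made up for illustration):

```swift
// "naïve=café": the non-ASCII letters become multi-byte UTF-8 sequences.
let line = "na\u{EF}ve=caf\u{E9}"
let bytes = Array(line.utf8)            // [UInt8], Int-indexed, O(1) access

let separator = UInt8(ascii: "=")
var key = ""
var value = ""
if let i = bytes.firstIndex(of: separator) {
    // Bytes outside the ASCII range pass through untouched on both sides.
    key = String(decoding: bytes[..<i], as: UTF8.self)
    value = String(decoding: bytes[(i + 1)...], as: UTF8.self)
}
// key == "naïve", value == "café"
```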

You're right, a more appropriate comparison would be C++'s string. It is integer-indexable in C++, but it is also inherently different from String in Swift.

You make an excellent point that there are other ways to get random access into any String or collection that don't require any changes to the stdlib and would just require the developer to be more explicit about their intentions!

There are of course tradeoffs to your suggestions above as well. Just because something is possible doesn't mean it's desirable or worthwhile for ALL Swift users. Using String breadcrumbs and/or caching data has also been brought up before (perhaps by you?), and it was pointed out that if the data is only ever used when slicing or subscripting, then creating and carrying it around can mean a lot of overhead and wasted space (specifically on String).

Not every collection is sliced or subscripted during its lifetime, and I don't think everyone who ever makes a collection should pay the price of this overhead.

A negative distance could be used to move backward from the endIndex.

let abc = "ABC"
abc[offsetBy: -3] //-> "A"
abc[offsetBy: -2] //-> "B"
abc[offsetBy: -1] //-> "C"
abc[offsetBy:  0] //-> "A"
abc[offsetBy:  1] //-> "B"
abc[offsetBy:  2] //-> "C"

Would an index(offsetBy:) method that doesn't return endIndex be useful?

extension BidirectionalCollection {

  /// Returns an index at the given distance from `startIndex` or `endIndex`.
  ///
  /// - Parameter distance: The positive (>=0) distance to offset `startIndex`;
  ///   or the negative (<0) distance to offset `endIndex`.
  ///
  /// - Returns: A valid index of the collection that is less than `endIndex`;
  ///   or `nil` if no such index exists.
  ///
  /// - Complexity: O(1) if the collection conforms to `RandomAccessCollection`;
  ///   otherwise, O(*k*), where *k* is the absolute value of `distance`.
  public func index(offsetBy distance: Int) -> Index? {
    guard !isEmpty else {
      return nil
    }
    if distance < 0 {
      return index(
        endIndex,
        offsetBy: distance,
        limitedBy: startIndex)
    } else {
      return index(
        startIndex,
        offsetBy: distance,
        limitedBy: index(before: endIndex))
    }
  }
}
1 Like
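One way the `abc[offsetBy:]` examples above could be backed by standard library calls alone is sketched here. This version returns an optional rather than the bare elements shown in the post, which is an assumption made for the sketch:

```swift
extension BidirectionalCollection {
    /// Hypothetical subscript: non-negative distances count from `startIndex`,
    /// negative distances count back from `endIndex`.
    subscript(offsetBy distance: Int) -> Element? {
        guard !isEmpty else { return nil }
        let position: Index?
        if distance < 0 {
            position = index(endIndex, offsetBy: distance, limitedBy: startIndex)
        } else {
            // Limiting by the last valid index means endIndex is never returned.
            position = index(startIndex, offsetBy: distance, limitedBy: index(before: endIndex))
        }
        return position.map { self[$0] }
    }
}

let abc = "ABC"
let first = abc[offsetBy: 0]      // "A"
let last = abc[offsetBy: -1]      // "C"
let tooFar = abc[offsetBy: 3]     // nil
let tooFarBack = abc[offsetBy: -4] // nil
```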

I disagree with the notion that it's standard behavior for this use case. The contract of Array is that it traps when provided with an invalid index. In Swift's taxonomy, invalid indices are programmer error. In a perfect world, invalid indices should not be representable. When your index is Int that's unavoidable; c'est la vie.

By contrast, the index(_:offsetBy:limitedBy:) returns an optional because invalid offsets are not programmer error. The offset distance is definitely hard-coded to allow invalid values (because it's Int, but unlike with indices it's always Int). If we look at foo[offset: bar] as collapsing down the boilerplate of using index(_:offsetBy:limitedBy:), I stand by it returning an optional and think it falls out naturally of the existing model.

Another example: in SE-0202, Collection.randomElement() is optional because the collection can be empty.

(EDIT: Duh, and as I compose another post for the thread I forget that prefix(_:) et al. also don't trap on too-large offsets, which is what makes them nice to use!)

In my mind, you're asking the collection to perform a (waves hands) using its internal data structures, and that can fail. That's not really your fault if it can't succeed, because we can't reasonably ask the programmer to have checked their offsets first when their only option for doing so is the exact same code the stdlib will execute; we instead ask the programmer to check the result.
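For reference, this is roughly the boilerplate that foo[offset: bar] would collapse, written with today's API. The free function name is invented for the example. Note the extra check: limiting by endIndex can hand back endIndex itself, which is a valid index but not a valid subscript position.

```swift
/// Illustrative helper, not a proposed API.
func character(in string: String, atOffset offset: Int) -> Character? {
    guard offset >= 0,
          let i = string.index(string.startIndex, offsetBy: offset, limitedBy: string.endIndex),
          i < string.endIndex   // an offset equal to count lands on endIndex
    else { return nil }
    return string[i]
}

let word = "Swift"
let third = character(in: word, atOffset: 2)    // "i"
let atEnd = character(in: word, atOffset: 5)    // nil
let beyond = character(in: word, atOffset: 9)   // nil
```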

4 Likes

My first reaction is to ask: Is anyone asking for this on a collection other than strings? Array indexing is already Int-based, these would be a bad idea to use on sets and dictionaries, and I don't think people use indexes into ranges. My own use cases have been 100% around the ergonomics of getting into a string with a known fixed-field format.

I'm also concerned about assigning into offsets — with the streamlined ability to get indexes via offset, could people interested in mutation just get the index they need that way and then use regular mutation/replacement methods?

Last question: do we need to be able to write c.index(atOffset: 5) if we have c.indices[offset: 5]? (Hmm, this last question refutes my earlier implied suggestion that we constrain these to String.)

2 Likes

The only use case I can think of for Array to have an optional-returning subscript is when you have a valid index into the array, and you want to peek at the next item and do something if there's also something there.

So instead of writing:

func parseToken(at index: Int, in array: [Int]) {
  let currentToken = array[index]
  if index + 1 < array.count {
    let nextToken = array[index + 1]
    // handle both tokens
  } else {
    // handle the single token
  }
}

You can just do

func parseToken(at index: Int, in array: [Int]) {
  let currentToken = array[index]
  // array[atOffset:] is the pitched optional-returning subscript, not an existing API
  if let nextToken = array[atOffset: index + 1] {
    // handle both tokens
  } else {
    // handle the single token
  }
}

Worth it? Questionable. But I've found myself wanting this behavior on several occasions.
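To try the peek pattern out today, a stand-in for the pitched subscript can be sketched as below. The `atOffset:` spelling follows the post above; the implementation and the `describeTokens` function are assumptions for illustration only:

```swift
extension Collection {
    /// Hypothetical stand-in for the pitched subscript: nil instead of a trap when out of range.
    subscript(atOffset offset: Int) -> Element? {
        guard offset >= 0,
              let i = index(startIndex, offsetBy: offset, limitedBy: endIndex),
              i < endIndex
        else { return nil }
        return self[i]
    }
}

func describeTokens(at index: Int, in tokens: [Int]) -> String {
    let current = tokens[index]
    // Peek at the following token without a manual bounds check.
    if let next = tokens[atOffset: index + 1] {
        return "pair: \(current), \(next)"
    } else {
        return "single: \(current)"
    }
}

let tokens = [7, 8, 9]
let middle = describeTokens(at: 1, in: tokens)   // "pair: 8, 9"
let tail = describeTokens(at: 2, in: tokens)     // "single: 9"
```

For Array the offset and the index coincide, which is part of why the benefit here is debatable; on an ArraySlice or a String they would not.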

1 Like

On the one hand, I've fought hard and had good success with students and teams both to think in terms of prefix, dropFirst, etc, which have a similar offset-based friendliness. Some teams I've worked with have even proclaimed to like it better in the end.

Does the existing suite of slicing and dicing methods on Collection fit with what we want to encourage people to do in production? If so, should we make ergonomics improvements to those to meet this use case (f.ex. swift-evolution#935, and/or a way to prefix+dropFirst simultaneously), and then message the use of them better (since they seem to be underused). If not, well, why are they there at all?
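For anyone who has not reached for them, the existing methods already cover simple fixed-field extraction without any index arithmetic (the sample record is invented):

```swift
let record = "2019-12-25"

// Skip the first five characters, then keep the next two.
let month = record.dropFirst(5).prefix(2)      // "12"

// Neither method traps on a too-large count; they return what is available.
let nothing = record.dropFirst(50).prefix(2)   // ""
```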

On the other hand, I'm very sensitive to this being an oft-requested addition. As an ergonomics thing, it drives people bonkers coming from other languages to not have integral indices everywhere. So it'd stop a lot of yelling, but so would C-style for loops, :man_shrugging:.

From the "if you give a mouse a cookie" department, I'm almost certain a successful proposal for this would eventually lead to "why can't I [offset: …] a range?", and then we're down the road of building a parallel universe of Swift's indexing model for people who dislike it. That worries me a lot.

1 Like

It's not part of the standard library (yet :crossed_fingers:) but sorted collections which are backed by tree structures generally need this. Because as you might imagine you can't just use an Int to index a tree data structure. So the only easy way to get the nth element from the collection is to use an offset.

1 Like

The other non-speculative use case for it is ArraySlice, where a lot of folks are confused by its non-zero startIndex.
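A short illustration of the ArraySlice confusion: the slice shares the base array's indices, so its first element is not at index 0.

```swift
let numbers = [10, 20, 30, 40, 50]
let slice = numbers[2...]

let start = slice.startIndex          // 2, not 0
let firstBySharedIndex = slice[2]     // 30
// slice[0] would trap: index 0 belongs to the base array, not the slice.

// Offset-based access expressed with today's API:
let firstByOffset = slice[slice.index(slice.startIndex, offsetBy: 0)]   // 30

// Copying into a fresh Array rebases the indices to start at 0.
let rebased = Array(slice)
let firstRebased = rebased[0]         // 30
```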

7 Likes

If people simply aren't learning about Swift's indexing, one of the stdlib's critical abstractions, we should examine why that is, and whether it's a fault of the model. One convenience method isn't a silver bullet for this problem.

6 Likes

I like this story, but it's incomplete. index(_:offsetBy:) is also a function call that exists, and it does not return an optional. While it's reasonable to say that index(_:offsetBy:) should never be preferred to index(_:offsetBy:limitedBy:), the reality is that the shorter spelling and non-optional return value will almost always lead programmers to use the undefined-behaviour version.

I don't strongly object to optionality here (though it raises problems when combined with MutableCollection as discussed up-thread). However, I think it's just too simplistic to say that invalid offsets are not programmer error. Swift is just not consistent on this point.

Any kind of parser: CSV, JSON, source code, network protocols, file formats, etc.