Offset Indexing and Slicing

Nevin · August 31, 2019, 4:40am

Some thoughts on the implementation

Is there a reason for using an enum with associated values? I think things become cleaner if the offset and index are separate properties like so:

struct OffsetBound: Hashable {
  enum Anchor { case start, end }
  
  var offset: Int
  var anchor: Anchor
}

That way Equatable and Hashable are synthesized automatically, and Comparable doesn’t need the bulky switch anymore:

extension OffsetBound: Comparable {
  static func < (lhs: OffsetBound, rhs: OffsetBound) -> Bool {
    if lhs.anchor == rhs.anchor {
      return lhs.offset < rhs.offset
    }
    return lhs.anchor == .start
  }
}

Various other methods can be simplified as well, such as advanced(by:).
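
To illustrate the simplification, here is a minimal, self-contained sketch built on the two-property struct above. The `start` and `end` static members and the body of `advanced(by:)` are assumptions made for this example, not code from the pitch.

```swift
// Sketch only: mirrors the two-property struct above.
// The static start/end members and advanced(by:) are assumptions.
struct OffsetBound: Hashable {
  enum Anchor { case start, end }

  var offset: Int
  var anchor: Anchor

  static let start = OffsetBound(offset: 0, anchor: .start)
  static let end = OffsetBound(offset: 0, anchor: .end)

  // With a stored offset, advancing is plain integer addition.
  func advanced(by n: Int) -> OffsetBound {
    return OffsetBound(offset: offset + n, anchor: anchor)
  }
}

extension OffsetBound: Comparable {
  static func < (lhs: OffsetBound, rhs: OffsetBound) -> Bool {
    if lhs.anchor == rhs.anchor {
      return lhs.offset < rhs.offset
    }
    // Anchors differ: a start-anchored bound sorts before an end-anchored one.
    return lhs.anchor == .start
  }
}

let bounds = [
  OffsetBound.end,
  OffsetBound.start.advanced(by: 3),
  OffsetBound.end.advanced(by: -2),
  OffsetBound.start,
]
// Resulting order: start, start + 3, end - 2, end
let sortedBounds = bounds.sorted()
```

Note that this ordering treats every start-anchored bound as less than every end-anchored one, regardless of the length of any particular collection.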

• • •

Also, there’s probably a way to make the range subscript more efficient by first getting lowerBound then using that as a base for finding upperBound instead of walking all the way from startIndex a second time. (Or vice versa for .end-.end ranges on a BidirectionalCollection.)
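
A rough sketch of that idea for the start-anchored case, using only existing Collection API. The `slice(fromStart:to:)` helper is a hypothetical name invented for this example, and clamping to endIndex is an assumption about the desired behavior.

```swift
extension Collection {
  // Hypothetical helper, not part of the pitch: slices between two
  // start-anchored offsets, clamping both bounds to endIndex.
  func slice(fromStart lower: Int, to upper: Int) -> SubSequence {
    precondition(lower >= 0 && lower <= upper, "offsets must be ordered and non-negative")
    // One walk from startIndex to the lower bound...
    let lo = index(startIndex, offsetBy: lower, limitedBy: endIndex) ?? endIndex
    // ...then continue from there instead of restarting at startIndex.
    let hi = index(lo, offsetBy: upper - lower, limitedBy: endIndex) ?? endIndex
    return self[lo..<hi]
  }
}

let str = "abcdefgh"
let middle = str.slice(fromStart: 3, to: 6)   // "def"
```

For a String this walks six characters in total rather than nine, and the saving grows with the distance of the lower bound from the start.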

jawbroken · August 31, 2019, 5:23am

Oh, that wasn't my intent at all, and I apologise to @xwu if it came across that way.

xwu · August 31, 2019, 12:05pm

str[.start + 3, length: 3]

str[(.start + 3)...].prefix(3)

Can you explain why you find the former more readable, simple, or efficient? In general, learning an additional API instead of a straightforward composition is less simple and efficient. And readability encompasses correctly understanding the behavior; prefix(3) is an old API that has always implied as many as but not necessarily exactly three elements, whereas length or count does not.
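
The "as many as" behavior of prefix can be seen with today's API. This sketch uses index(_:offsetBy:) in place of the pitched `.start + 3` syntax:

```swift
let long = "abcdefgh"
let longStart = long.index(long.startIndex, offsetBy: 3)
let three = long[longStart...].prefix(3)       // "def": three elements available

let short = "abcd"
let shortStart = short.index(short.startIndex, offsetBy: 3)
let clipped = short[shortStart...].prefix(3)   // "d": only one element left
```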

I highly recommend then that you take some time to work through the posted document and previous discussions about String indexing. We already have [3..<6] for arrays and any Int-based indices. The whole point here is that this is something entirely distinct, and understanding the reasons for that is fundamental to understanding the design goals.

bjhomer · August 31, 2019, 1:15pm

One downside of using .prefix() here is that the result is no longer usable as the left-hand side of an assignment.
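
A small sketch of the difference, using Array and plain Int ranges since the pitched offset syntax does not exist yet:

```swift
var numbers = [0, 1, 2, 3, 4, 5, 6, 7]

// A range subscript has a setter, so it can appear on the left-hand side.
numbers[3..<6] = [30, 40, 50]

// The composed spelling cannot: prefix(_:) returns a value, not an
// assignable location, so the following line would not compile.
// numbers[3...].prefix(3) = [30, 40, 50]
```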

gwendal.roue · August 31, 2019, 1:26pm

Many Swift users, especially users who are familiar with syntaxes like str[3:6] or any equivalent notation in other languages, are hoping for a simpler indexing and slicing API. We have seen many attempts in this forum.

Can you explain why you find the former more readable, simple, or efficient?

Because it has less punctuation and is easier to parse for the human eye. This does not invalidate the rest of your answer. Yes, prefix comes with known behavior. The ad-hoc length does not. And length is used in too many contexts: it is difficult to load it with the same efficiency as prefix.

This is because Swift String does not conform to the RandomAccessCollection protocol, the protocol that guarantees "efficient random-access index traversal".

String elements are Character, a not-so-trivial type whose values can be A, É, 㞛, or complex emoji. There are many ways to store characters in memory, and the current trend is to store them in a variable-width encoding (in Swift 5, UTF-8 is the preferred one). Because of those variable-width encodings, and other Unicode subtleties, jumping to the 10th character of a string forces you to iterate through all 9 previous ones. Accessing the Nth element of a String is not a cheap O(1) operation as in Array, but a linear, O(N) operation. This is why String can't adopt RandomAccessCollection.

And this is also why we don't have str[10]. The designers of the Standard Library did not want a potentially costly operation to look like it is cheap. For the same reason, we don't have str[3:6]. And it's likely we will never have it until the Core Team radically changes its mind.
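
A short illustration of why the Nth Character cannot be located by simple arithmetic. The example string is chosen for this sketch only:

```swift
let word = "cafe\u{301}"   // "café" spelled with a combining acute accent

let characterCount = word.count               // 4 Characters
let scalarCount = word.unicodeScalars.count   // 5 code points
let byteCount = word.utf8.count               // 6 UTF-8 code units

// There is no word[3]; the index has to be derived by walking from a known index.
let fourth = word.index(word.startIndex, offsetBy: 3)
let accented = word[fourth]   // a single Character made of two code points
```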

Note that there exist many third-party libraries that extend the standard library with such convenience accessors. You can use them. But they may make your code inefficient (it will run fine with small inputs, and become surprisingly slow with slightly longer inputs). It's a very common mistake.
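
To make the pitfall concrete, here is a sketch of such an accessor. The `subscript(offset:)` name is hypothetical and not taken from any particular library:

```swift
extension String {
  // Hypothetical convenience accessor of the kind such libraries add.
  // Each call walks from startIndex, so it costs O(n), not O(1).
  subscript(offset n: Int) -> Character {
    return self[index(startIndex, offsetBy: n)]
  }
}

let text = String(repeating: "ab", count: 1_000)

// Looks innocent, but performs on the order of n² index steps overall.
var slowCount = 0
for i in 0..<text.count where text[offset: i] == "a" {
  slowCount += 1
}

// Plain iteration visits each Character once.
var fastCount = 0
for ch in text where ch == "a" {
  fastCount += 1
}
```

Both loops compute the same result; only the amount of work differs, which is why the problem tends to show up only on larger inputs.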

For years now, many people have been looking for improvements in the Swift standard library in order to soothe some pain points. Previous discussions are fascinating, but be prepared: it's a long read!

pyrtsa · August 31, 2019, 2:11pm

Maybe it's a future direction, but I didn't see the original proposal include the settable subscript part where the syntax could be used in the LHS position.

If it did, we'd need to define what it means to assign nil to such a slice: no-op or removal (and why)?

wadetregaskis · August 31, 2019, 3:40pm

I've never used this prefix in my life. I don't think I've ever even seen it before. Your assumption that everyone is intimately familiar with it is at least a little inaccurate. In this semi-hypothetical case, I've now had to learn this new, extra API just in order to do this trivial string slicing. (and, pedantically, prefix is an awkward name because of how it relates conceptually to strings, and because it's ambiguous as a verb vs a noun)

The simpler form is fewer steps - essentially just one, which is subscripting / slicing the string at a point and to a length. It's very close to a typical substr function that many people will be familiar with from other languages, and the aforementioned str[3:6] likewise.

Conversely, composition of multiple distinct functions carries more cognitive complexity. You also now have to be familiar with the open range operator and type, which is a fun feature of Swift but an unusual one. You have to parse more complex syntax with all those parentheses and periods and whatnot. It's just a lot more to interpret and mentally glom together.

mattpolzin · August 31, 2019, 4:00pm

IMO either of these spellings must be learned. I don’t think a user is more likely to “stumble” upon the multiple argument subscript spelling than they are to have needed to get the first x characters of a string before and used prefix() to do it.

The most likely scenario is that the newcomer to the language realizes Swift does not use the same syntax for this operation as whatever language they were coming from, and they search Google for something like “slicing a string in swift” or “getting a substring in swift” and read through some Stack Overflow posts or walkthroughs. There’s no shortage of people explaining Swift strings on the internet, and however we spell it in this case, the good folks of the internet will write useful walkthroughs, at least some of which will explain why this is a complex problem in the first place.

[EDIT] I said “newcomer to the language” above and I think more broadly should have said “user who is looking to dig into swift string manipulation for the first time”

wadetregaskis · August 31, 2019, 4:02pm

gwendal.roue:

wadetregaskis:

Even if Swift can’t just have str[3:6] like most other languages - I haven’t really followed the technical discussion in this thread to understand why

I appreciate you explaining this. That does help spell out the situation a bit better.

The crux of the matter then is whether that preference for performance over usability is the right decision. I can't think of any other languages I've used which have chosen performance at the expense of usability, for basic string handling, to the degree that Swift has (I'm also thinking of the annoyances with Substring here, and other such things). Though I see academically that Rust at least has, so Swift isn't entirely alone here, to be fair.

I'm not convinced it's the right choice - I for one would like to use Swift more day-to-day, for random little scripts and whatnot, but frankly why would I when it's so much easier & more succinct to write in Python instead, and the performance differences are irrelevant in practice.

I'd also like to use Swift to replace Python as the go-to language for lots of production things at many companies, including web / API servers and all sorts of things where performance does matter. But - and rational or not - that's going to continue to be a non-starter if the developer experience on such trivial functionality as string manipulations remains comparatively poor. The runtime performance improvement over Python would be incredible even with a suboptimally-performant strings library and developers writing naive string manipulation code (especially if there are efficient ways to do the operations, that can be adopted if and when profiling reveals they're a bottleneck).

I say all this as a performance engineer (and I even worked as one at Apple in the past). Performance only matters in the end, if you get my meaning. Not every bit of code that needs to slice a string has to be fast. Not every bit of code that needs a character at a specific index needs to be perfectly optimal (and realistically, if you need that character the inability to write a nice simple str[10] isn't going to change that fact). It's nice not to lose performance needlessly, but here it's clearly coming at a steep cost. It would likely be better, in my experience, to make it possible to write performance-optimal string manipulations, but not required where it gets in the way of developer performance; when it's actually net harmful in the bigger picture.

Granted this is tangential, and to be clear it does seem that the pitch in this thread does improve the ergonomics slightly for what they are today, so if it's the only option I do technically support it. I just fear it's lipstick on a pig, if you'll forgive the crude metaphor.

wadetregaskis · August 31, 2019, 4:05pm

But how many of them will bother reading any explanation? You're exactly right at the outset - they're just trying to get stuff done. They're going to look up some reference material and essentially copy-paste the answers. But, they're going to be wondering, as they do so, why on Earth the syntax is so convoluted, unintuitive, and verbose compared to other languages. That there is a reason doesn't make it a good one and certainly doesn't mean they'll understand it nor appreciate it.

mattpolzin · August 31, 2019, 4:10pm

I think it’s fair to question the motivations for the way Swift Strings are designed, but unrealistic to seek out change to those designs in a conversation otherwise about improving usability without changing those designs.

gwendal.roue · August 31, 2019, 4:15pm

Exactly :-) This does not mean there is a lack of understanding, or interest, for people who question the current design. Quite the contrary! But as Mathew says, this thread is an evaluation of an extension of the current design. Yet another one: the current design is not easily tamed.

SDGGiesbrecht · August 31, 2019, 9:11pm

Just for the sake of anyone listening who had a role to play in designing String the way it is, and who might be getting discouraged by the endless questioning of their decisions...

As one for whom only one fifth of day‐to‐day strings are English, barely half contain Latin characters and not all even run left‐to‐right:

Swift is the first (and so far only) programming language I have used where the String type feels like it mostly works for me and not against me. I say that both as a programmer myself, and as a user of others’ programs, which end up written with varying awareness of the world beyond their authors’ borders.

Thank you for daring to take the path less trodden.

Nevin · August 31, 2019, 11:51pm

Another thought on the implementation

If we don’t want to represent before-the-start and after-the-end indices, then the in-memory representation of OffsetBound can be the size of an Int:

struct OffsetBound: Hashable {
  enum Anchor { case start, end }
  
  private var rawValue: Int
  
  var offset: Int {
    return rawValue < 0 ? rawValue + 1 : rawValue
  }
  
  var anchor: Anchor {
    return rawValue < 0 ? .end : .start
  }
}
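
One way the packed representation could be constructed, as a sketch. The type name `PackedOffsetBound` and the `start(_:)` / `end(_:)` factories are assumptions added for this example; only the `offset` and `anchor` accessors come from the code above.

```swift
struct PackedOffsetBound: Hashable {
  enum Anchor { case start, end }

  // Non-negative values encode start + n; negative values encode end - n,
  // shifted by one so that `end` itself is stored as -1.
  private var rawValue: Int

  private init(rawValue: Int) { self.rawValue = rawValue }

  // Hypothetical factory: `offset` must not point before the start.
  static func start(_ offset: Int) -> PackedOffsetBound {
    precondition(offset >= 0, "before-the-start bounds are not representable")
    return PackedOffsetBound(rawValue: offset)
  }

  // Hypothetical factory: `offset` must not point after the end.
  static func end(_ offset: Int) -> PackedOffsetBound {
    precondition(offset <= 0 && offset > Int.min, "after-the-end bounds are not representable")
    return PackedOffsetBound(rawValue: offset - 1)
  }

  var offset: Int {
    return rawValue < 0 ? rawValue + 1 : rawValue
  }

  var anchor: Anchor {
    return rawValue < 0 ? .end : .start
  }
}

let third = PackedOffsetBound.start(3)
let secondToLast = PackedOffsetBound.end(-2)
let sameSizeAsInt = MemoryLayout<PackedOffsetBound>.size == MemoryLayout<Int>.size
```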

Lantua · September 1, 2019, 12:36am

We may still want to allow invalid intermediate results though, since it could be relatively common to do this:

array[.start + a - b]
array[.start - b + a]

Which could be surprising if one works but the other doesn’t.
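
A sketch of the two orderings with a representation that keeps offset and anchor separate. The `Bound` type and its operators are stand-ins written for this example, not the pitch's implementation:

```swift
struct Bound: Hashable {
  enum Anchor { case start, end }

  var offset: Int
  var anchor: Anchor

  static let start = Bound(offset: 0, anchor: .start)

  static func + (lhs: Bound, rhs: Int) -> Bound {
    return Bound(offset: lhs.offset + rhs, anchor: lhs.anchor)
  }

  static func - (lhs: Bound, rhs: Int) -> Bound {
    return Bound(offset: lhs.offset - rhs, anchor: lhs.anchor)
  }
}

let a = 5, b = 2
let forward = Bound.start + a - b    // intermediate value: start + 5
let backward = Bound.start - b + a   // intermediate value: start - 2, before the start
let intermediate = Bound.start - b   // representable here, though not a valid position
```

Because the intermediate `start - 2` is representable, both spellings end at the same bound. A representation that rejects before-the-start values would accept the first spelling and fail on the second.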

On an unrelated note, this design also (intentionally?) prevents the bug of having the offset wrap around, which could be tricky to debug in a pure Int design.

CTMacUser · September 1, 2019, 8:23am

As I said in another “why don’t we support integer indices for String” thread:

Strings are vectors (as in the C++ type) of Characters,
which are vectors of code points,
which are vectors of code units
(which are vectors of octets, if not already byte-sized).

A vector-of-vectors, where you store memory in terms of the inner Element type, precludes both RandomAccessCollection and MutableCollection. And we have (at least) two layers. So the design doesn’t make it practical, no matter how many wish for it.

Other languages have it easier because they punt on at least one Unicode issue.

Use a vector of 32-bit code points, wasting memory and punting on organizing code points into the larger grapheme concept.
Use a vector of 16-bit code units, wasting less memory but still punting on code-point-to-grapheme conversion. Worse, unlike the 32-bit case, a code point is usually formed from a single code unit, but sometimes two are used instead. Ignoring this risks reading (or worse, editing) in the middle of a code point.
Use a vector of 8-bit code units. Minimal memory waste, but still all the other problems of the 16-bit case. Worse, code points can be up to four (formerly six) code units long.
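
The layers can be observed directly through String's views. The flag below is an example chosen for this sketch:

```swift
// One grapheme built from two regional-indicator code points (the flag of Canada).
let flag = "\u{1F1E8}\u{1F1E6}"

let graphemes = flag.count                   // 1 Character
let codePoints = flag.unicodeScalars.count   // 2 code points
let utf16Units = flag.utf16.count            // 4 sixteen-bit code units
let utf8Units = flag.utf8.count              // 8 eight-bit code units
```

An Int index into any one of the lower layers can land in the middle of an element of the layer above it.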

Lantua · September 1, 2019, 2:50pm

I'm sure this discussion could become another thread altogether.

The pitch is about Collection in general, with String only as the main motivator. Should we fix String, this pitch remains relevant (though it may receive lower priority).

Lantua · September 1, 2019, 3:39pm

FWIW, the gist is using SE-0255 (Implicit returns from single-expression functions), which is not implemented in Swift 5.0.

DeFrenZ · September 1, 2019, 5:06pm

This looks like a very nice end result of the long discussions had over the topic, and I love it.

Imho myArray[(.start + 3)...].prefix(3) and myArray[...(.last - 2)].suffix(3) are sufficient for the task. Though I do wonder what the “write” story will be, and whether there will be one.

I’m not fully convinced for the “middle of collection” case, but assuming it’s a less common case doing myArray[myIndex ... myArray.index(at: .last - 1)!] might be a decent solution...

Great work!

bzamayo · September 1, 2019, 7:19pm

I really like this proposal and would 100% support it being implemented. It's a lot nicer than the ++ -- symbol soup of the last effort.

That being said, I personally think the decision to return an optional element for subscript(bound: OffsetBound) is a mistake. I think this should either be non-optional and trapping, or alternately labelled — like collection[at: .start + 5] — which would return Element?. We could also offer subscript(at index: Index) -> Element?, which has been asked for many times before, at the same time. (I wouldn't mind if we offered both labelled and unlabelled versions, but that is more debatable.)

I think it is weird that writing array[5] traps but writing array[.start + 5] returns an optional.
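
For reference, the labelled, optional-returning variant on Index could look roughly like this. It is a sketch of the often-requested subscript, not an existing standard library API:

```swift
extension Collection {
  // Hypothetical labelled subscript: returns nil instead of trapping
  // when the index is out of bounds.
  subscript(at position: Index) -> Element? {
    guard position >= startIndex && position < endIndex else { return nil }
    return self[position]
  }
}

let array = [10, 20, 30]
let present = array[at: 1]   // Optional(20)
let missing = array[at: 5]   // nil
// The unlabelled array[5] would trap at runtime instead.
```

With the label carrying the optionality, the unlabelled spelling stays free to trap, consistent with array[5].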

Also, in many cases, I believe something like string[.start + 5] would actually be known to exist a lot of the time; the optionality would just get in the way.

For instance, in the code quiz example from the proposal, the coding test does not require users to handle out-of-bounds cases. The ! force-unwraps are really just noise.

I also think, for better or worse, this API is going to be used a lot in 'scripting' or quick-hack contexts on strings. I know I'd use it in a couple of command-line tools rather than reaching for the overly long index manipulation APIs. Again, here, the optionality would get in the way.

(I think returning an optional for the index(at:) method is fine.)

That's my nitpick – let me be clear I'd still support the proposal regardless.