Others have addressed various reasons why Swift doesn't do this, but I want to highlight why making this "just work" is not necessarily a practical solution, from a very realistic standpoint. Human text is incredibly complex, and the details are very intricate. @Michael_Ilseman can corroborate (or refute) what I have to say here, but let's explore what an approach that makes `String` indexing O(1) might entail.
To begin with, I'd like to point out that Unicode allows, right off the bat, for arbitrarily long grapheme clusters. This means I can construct a string containing only a single character that is too large to represent in the sum total of all computer memory ever constructed, or that ever will be constructed. I bring this up not to be facetious, but to show that we cannot make any assumptions about indexing given any string up-front: no fixed-size storage could represent a single character without chunking regularly. For example, we could try to split the character up into its constituent UTF-32 components, but at the end of the day, we cannot guess how to assign individual indices to the UTF-32 storage without actually traversing the whole thing. I can create a string containing various ASCII characters and insert a huge grapheme cluster right in the middle, and we would not be able to make any assumptions about indices without processing the whole thing.
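As a quick, runnable illustration (the count of 10,000 combining marks here is arbitrary; Unicode imposes no upper bound):

```swift
// One base scalar plus an arbitrary number of combining marks still
// forms a single grapheme cluster, i.e., a single Swift Character.
let marks = String(repeating: "\u{0301}", count: 10_000) // COMBINING ACUTE ACCENT
let monster = "e" + marks

print(monster.count)      // 1 (a single Character)
print(monster.utf8.count) // 20001 (1 byte for "e" plus 2 bytes per mark)
```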
Given, then, that calculating the mapping between character indices and byte offsets is an O(n) operation, we have two choices:
- We can perform this mapping up-front, on the creation of every string. This means we will pay the time and storage costs to calculate this on the creation of string literals, file names, user input, etc., every single time. This is almost certainly a non-starter: very few of those strings will ever be indexed via integer subscripts, so this would be very wasteful for little benefit. The alternative is:
- Lazily calculate indices when subscripting actually happens. This means that we won't pay the cost up-front, but will instead construct the indices as we access them. Note that we can start optimizing this (e.g., given `string[i]`, only calculate indices up to `i` so we don't pay an enormous cost on `entireContentsOfWarAndPeace[0]`; we can also improve storage by storing runs of like indices, e.g., if we know characters `i₁` through `i₂` are ASCII, we don't have to calculate intermediate indices; etc.), but the more optimizations we apply, the harder it will be to reason about performance. The first subscript access to a string may or may not end up being expensive, as would later accesses depending on the subscript, and it would be hard to optimize around the mysterious costs of doing so. A sketch of what such lazy indexing might look like follows this list.
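To make that concrete, here is a minimal sketch of the lazy approach, assuming a wrapper type; `LazilyIndexedString`, its members, and the `entireContentsOfWarAndPeace` stand-in are all hypothetical names invented for illustration, not stdlib API:

```swift
// Hypothetical sketch: memoize Character-boundary indices as integer
// subscripts are actually used. None of this is real stdlib API.
struct LazilyIndexedString {
    private let base: String
    // knownIndices[k] is the String.Index of character k, grown on demand.
    private var knownIndices: [String.Index]

    init(_ base: String) {
        self.base = base
        self.knownIndices = [base.startIndex]
    }

    // O(1) if we've already walked this far; otherwise O(i - cached).
    // Precondition: i < base.count.
    mutating func character(at i: Int) -> Character {
        while knownIndices.count <= i {
            knownIndices.append(base.index(after: knownIndices.last!))
        }
        return base[knownIndices[i]]
    }
}

let entireContentsOfWarAndPeace = "Well, Prince, so Genoa and Lucca..." // stand-in
var text = LazilyIndexedString(entireContentsOfWarAndPeace)
_ = text.character(at: 0)  // cheap: no extra boundaries computed
_ = text.character(at: 10) // walks boundaries 1 through 10, then caches them
```

Even this toy version hints at the trouble discussed next: the cache costs storage proportional to how far you've indexed, and any mutation of the underlying string can invalidate it.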
Let's take option 2 above and run with it. What do we do about:
- String mutation? Removing the first character of a string could easily invalidate all of the indexing work we've done already, considering that the character may be arbitrarily large. We could optimize this by modeling it as a decrement of all indices that come after, for instance, but for a very long string with many indices stored, this starts increasing the cost of string mutation. (The same goes for insertion, or replacement of a subrange: the more we have to store, the more bookkeeping we have to do.)
- String concatenation? Strings can technically begin with degenerate combining characters, such that `(s1 + s2).count < s1.count + s2.count` (see the demonstration after this list). How do we model this efficiently? (We could recalculate all of the indices provided by `s2`, or just throw that work away if it's too expensive.)
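To make the concatenation case concrete, here's a small demonstration (the strings are arbitrary examples):

```swift
// s2 begins with U+0301 COMBINING ACUTE ACCENT, which has no base
// character of its own; on concatenation it merges into s1's final "e".
let s1 = "cafe"
let s2 = "\u{0301}!"

print(s1.count)        // 4
print(s2.count)        // 2 (a lone combining mark forms its own cluster)
print((s1 + s2).count) // 5 ("café!"), not 6
```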
The further we follow this down, the more non-deterministic performance gets based on the specific operations you perform, and this can easily become mysterious. If you get two strings from an API that happened to subscript them with random-access indices, they might carry a lot of baggage that you, as an API client, are not aware of. Even if you never subscript them again, they come with potential storage and performance costs that you never cared about to begin with!
I'll echo what @jrose said above: very often, the use cases I've seen for random access into `String`s have been pretty much contrived. It's extremely rare to need to actually jump around inside of a `String`, and very often, by the time you're doing that, you're either not really benefitting from the semantic correctness that `String` offers, or you really need a different data structure.
Papering over the complexity here would negatively impact the many consumers of `String` who don't care about this and don't need it; the tradeoffs are not weighted in their favor.
As others above-thread have pointed out: you're always welcome to add the extension to `String` to allow this if you want, but a lot of hard work and effort has gone into not introducing exactly this type of pitfall into `String`.
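For completeness, such an extension can be as small as this; note that each access is an honest O(n) walk from `startIndex`, with no hidden state (a sketch, not an endorsement):

```swift
extension String {
    // Integer subscripting, at an explicit O(n) cost per access.
    subscript(offset: Int) -> Character {
        self[index(startIndex, offsetBy: offset)]
    }
}

let s = "Hello, 👨‍👩‍👧‍👦!"
print(s[7]) // 👨‍👩‍👧‍👦: the family emoji is a single Character
```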