Why are String offsets so complicated?

I get the feeling that people who complain about the inconvenience of string offsets simply don't understand the challenges involved.

If anyone has a suggestion for a more ergonomic, Unicode-correct String implementation with constant-time indexing, I would love to hear about it.

And let's be clear: Unicode correctness is not about emoji support. It's about supporting every non-latin alphabet. Try telling a Chinese user that Unicode correctness is not important.

18 Likes

Others have addressed various reasons for why Swift doesn't do this, but I want to highlight why making this "just work" is not necessarily a practical solution, from a very realistic standpoint. Human text is incredibly complex, and the details are very intricate. @Michael_Ilseman can corroborate (or refute) what I have to say here, but let's explore taking an approach of making String indexing O(1) to the end might entail.

To begin with, I'd like to point out that Unicode off the bat allows for arbitrarily long grapheme clusters, meaning that I can construct a string too large to represent in the sum total of all computer memory ever constructed, and which will ever be constructed, containing only a single character. I bring this up not to be facetious, but to show that we cannot make any assumptions about indexing given any string up-front: no fixed-size storage could represent a single character without chunking regularly — e.g., we can try to split up the character into constituent UTF-32 components, but at the end of the day, we cannot guess how to assign individual indices to the UTF-32 storage without actually traversing the whole thing. I can create a string containing various ASCII characters and insert a huge grapheme cluster right in the middle, and we would not be able to make any assumptions about indices without processing the whole thing.

Given, then, that calculating indices → byte offsets is an O(n) operation, we have two choices:

  1. We can perform this mapping up-front, on the creation of every string. This means we will spend the time and storage costs to calculate this on the creation of string literals, file names, user input, etc., every single time. Almost certainly, this is a non-starter, since very few of those strings will be indexed via integer subscripts, so this would be very wasteful for little benefit. The alternative is,
  2. Lazily calculate indices when subscripting actually happens. This means that we won't pay the cost up-front, but instead, construct the indices as we access them. Note that we can start optimizing this (e.g., given string[i], only calculate indices up to i so we don't pay an enormous cost on entireContentsOfWarAndPeace[0]; we can also improve storage by storing runs of like indices, e.g. if we know characters iₙ through iₘ are ASCII, we don't have to calculate intermediate indices; etc.), but the more optimizations we apply, the harder it will be to reason about performance. The first subscript access to a string may or may not end up being expensive, as would be latter accesses depending on the subscript, and it would be hard to optimize around the mysterious costs of doing so.

Let's take option 2 above and run with it — what do we do about:

  1. String mutation? Removing the first character of a string could easily invalidate all of the indexing work we've done already, considering that the character may be arbitrarily large. We could optimize this by modeling this as the decrement of all indices coming after, for instance, but for a very long string with many indices stored, this starts increasing the cost of string mutation. (The same goes for insertion, or replacement of a subrange — the more we have to store, the more bookkeeping we have to do)
  2. String concatenation? Strings can technically begin with degenerate combining characters such that (s1 + s2).count < s1.count + s2.count — how do we model this efficiently? (We could recalculate all of the indices provided by s2, or just throw that work away if it's too expensive.)

The further we follow this down, the more non-deterministic performance can get based on the specific operations you perform, and this can easily become mysterious. If you get two strings from an API which happened to subscript them with random-access indices, they might come with a lot of baggage you as an API client are not aware of. Even if you never subscript them again, they come with potential storage and performance costs that you never cared about to begin with!


I'll echo what @jrose said above — very often, the use cases I've seen for random access into Strings have been pretty much contrived. It's extremely rare to need to actually jump around inside of a String and very often, by the time you're doing that, you're either not really benefitting from the semantic correctness that String offers, or you really need a different data structure.

Papering over the complexity here would negatively impact many consumers of String who don't care about this, and don't need it, and the tradeoffs are not weighted in their favor.

As others above-thread have pointed out: you're always welcome to add the extension to String to allow this if you want, but a lot of hard work and effort has gone into not introducing exactly this type of pitfall into String.

19 Likes

Offset-based subscripting was deferred during the push for ABI stability in Swift 5, but will be re-visited for the next release.

Yup, this is a glaring omission and a continual source of pain, hence why it always comes up again.

I completely agree. The philosophical goal of String is to find the right set of tradeoffs balancing correctness, performance, and ease of use. The reality of String and the history of Swift is that the early push for source-stability prioritized correctness, before it was too late. The push for ABI stability prioritized fundamental implementation decisions focusing on performance, before it was too late. The next push should be for ease of use, because developers have suffered long enough.

edit: I accidentally hit "save" before I was done writing.

7 Likes

If you want the best of both worlds, you can do let myIndices = Array(myString.indices), which you can random-access to get the String.Index for String's subscript. If you do this a lot, you can make a IndexedString which holds both for you and gives random access, at the cost of additional space complexity and recalculation on mutation. Perhaps in the future the standard library or a popular package should provide this, unifying implementation and providing better performance.

6 Likes

Indeed — this is a good compromise; what's nice about it is that the owner of Array(string.indices) can be aware of the cost of storage and performance. Perhaps we can offer this in some more automated way in the standard library, but performing this calculation and storage implicitly for all Strings I think is prohibitive. (All I meant to express above is that "simply" allowing subscript[offset: Int] is not so simple given the costs.)

just saying, the upcoming Integer-convertible character literals proposal should make a lot of these problems moot. Usually when you want to get the nth character in a string, it's a specially formatted ASCII string like MM/DD/YYYY which are much better represented as [UInt8] (or even a tuple or fixed size array, if we ever get those) instead of String. Then you could write things like

if datestring[6 ..< 10] == ['1', '9', '8', '9'] 
{
    print("i was born in 1989!")
}
4 Likes

N’avez‐vous jamais demand ce qui se trouve l’autre ct de l’horizon ?
Have you never asked yourself what’s on the other side of the horizon?

Denn ich finde, dass Graphemhaufen auch im alltglichen recht oft auftauchen.
For I find that grapheme clusters appear quite often in everyday use.

Πρέπει να τα χρησημοποιήσω σε τρεις γλώσσες στις πέντε που μιλώ.
I need to use them in three of the five languages I speak.

⁧(וארבא אם אני כותב עם נְקֻדּוֹת.)⁩
(And four if I write with pointing.)

English is really the odd one out, and even it still borrows words like “caf”, “faade” and “jalapeo”.

18 Likes

to be fair, most of those accented characters have single-codepoint representations, and the rest probably could have if Unicode hadn’t embarked on the “elegant” orthogonal composition path later. Extended Grapheme Clusters are an emoji-centric concept. The only benefit exclusive to them is to allow lots of emoji permutations without having emoji eat up a disproportionate amount of the codepoint space.

3 Likes

I don’t know guys. I can imagine some of you justifying lack of automatic memory management in the exact same way.

„No, it’s impossible.”
„You must educate yourself how memory works.”
„I don’t think that’s super useful, it’s just a matter of telling compiler where to put values in memory.”

And somehow, Swift made it possible.

Saying that random access to string in not useful enough to put that in the language is just wrong. Is just taking a character, based on - let’s say - cursor position so exotic?

As for the accents - many fast languages handle them efficiently with indexing. It’s because Swift decided to handle entire Unicode standard efficiently why String API is so painful to use. A rare use case shaped this API this way. Yes, it’s correct and powerful, but it’s also unfriendly to use. And I am unable to express my human thoughts about text without sounding like a robot.

Huh, I didn’t realize it’s almost to the point of review already, thanks for reminding!

No, that's a great idea! Cursor position should be an Index, not an Int.

4 Likes

Stop taking this discussion personally and try a little harder to have a valuable contribution here. So far you are just rude. Should I explain what I meant with the cursor to you?

I'm sorry, you're right. That was a pithy, brusque, and dismissive response, and it would have been better for me to not say anything.

3 Likes

I’m pretty sure he’s referring to Text Editor example a few posts back, like how he wants to convert cursor position (row and column) back to proper index.

I think it’s legit misunderstanding, it tooks me to read a few times too to realize what it would mean (and I almost post a similar thing).

1 Like

Of course I meant text editor use case. Believe it or not, mr. Rose, but I am not an idiot. And I’m not a programming dummy neither.

English is not my first language and I know my tone is offensive sometimes. It’s a pity that makes some of you not willing to understand me.

The post by MrBee above said the exact same thing I’m saying here, without single addition, put in better English and perhaps nicer and got a few upvotes. That made me think that maybe I’m not clear enough. I will continue trying to be understood.

2 Likes

It’s not as simple as that. There’re still many languages that depend on composition, most notably, those of alphasyllabary system.
An example would be Thai which has a very compact unicode block, but consonants, a few sets of vowels and tone marks can be combined independently, resulting in a combinatorial explosion (albiet on a small scale). Adding all possible combinations would be impractical (or we could be competing with Chinese blocks in term of size).

That is fair, tough in case of String, it’s a little trickier than that (see below).

Nobody ever say that, just that there’re also many features that compete for spotlight:

  1. Supporting Unicode
  2. Compact Size
  3. Fast walking
  4. Fast Random Access
  5. ...

And the fact that we choose 1 and 2 forces 3 and 4 to compete (as no data structure could do both).

IMO, slow code should look slow, and if developer still want to use it, they can make a function for that (which in case of subscript, they totally can).

P.S. This variance of „quotation mark”, is not ASCII and would take up 3 bytes each in UTF8 :slight_smile:.

1 Like

my guess is you're dealing with an api that takes a string and gives you an integer character offset back? that's the API's fault,, it should be returning a String.Index or if it's a C API, a byte offset.

I mean returning a character on n position, just like in any other programming language. For example, my cursor is at 200px from left, so I calculate it as my 27th character. Now I need to insert something at this position, or remove that character, or display it somewhere else or anything else.

Well, @sveinhal said it’s not useful and wanted me to give examples.

1 Like

are you implementing an app console with a monospace font?