Why are String offsets so complicated?

Swift uses String.Index to navigate through character positions in a String. As I understand it, this is done for performance reasons, because of the use of Extended Grapheme Clusters. Am I correct?

I was thinking, isn't there a better way to do it? We can still find a character by numeric offset with myString[myString.index(myString.startIndex, offsetBy: 7)]. I'm new to Swift, so please forgive my question, which may be naive. But I don't quite get why Swift, a language with such a smart compiler, can't deal with efficient String offsetting and asks developers to work on it at a low level. I mean, it's such a basic thing that we do on a daily basis. The current implementation just doesn't match the philosophy of Swift and feels like some C++ feature.
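For reference, here is a minimal sketch of the offset-based lookup mentioned above, next to a non-trapping variant. The sample string and the helper name character(in:atOffset:) are made up for illustration; index(_:offsetBy:) and index(_:offsetBy:limitedBy:) are the standard library calls.

```swift
let myString = "Hello, Swift!"

// Trapping form: this crashes at runtime if the offset runs past endIndex.
let seventh = myString.index(myString.startIndex, offsetBy: 7)
let ch = myString[seventh]

// Non-trapping form (hypothetical helper): returns nil for out-of-range offsets.
// Both forms walk the string from the start, so each lookup is O(n).
func character(in s: String, atOffset n: Int) -> Character? {
    guard n >= 0,
          let idx = s.index(s.startIndex, offsetBy: n, limitedBy: s.endIndex),
          idx < s.endIndex   // endIndex itself is not a valid subscript position
    else { return nil }
    return s[idx]
}

print(ch)
print(character(in: myString, atOffset: 100) as Any)
```

The verbosity is arguably the point: the call makes the walk from startIndex visible at the call site.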

Why is it such a complex problem?

5 Likes

My understanding is that there is a conscious decision not to paper over the performance pitfalls of Unicode.

3 Likes

This decision affects everyday operations, while 99% of the time it's just a matter of finding a character in a short string that doesn't include any extended grapheme clusters. I love many of the things Swift simplifies, but this one is a surprise to me. I think it would make more sense to implement a simple Int-based subscript, optimize it as much as possible in the compiler, and maybe warn about the possible pitfalls when using long strings with heavy Unicode characters. But that's quite a niche use case anyway, compared to String operations in general.

1 Like

Maybe you'll like this method more.
myString.dropFirst(7).first

I think the idea is to keep people from treating strings as if they were a simple array of characters, which unfortunately they're not.
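A small sketch of how that one-liner behaves, using two made-up sample strings. dropFirst(_:) returns a Substring and first returns an optional Character, so an offset past the end gives nil instead of a crash.

```swift
let long = "Hello, Swift!"
let short = "Swift"

// dropFirst(7) skips seven Characters (an O(n) walk), then first reads the next one.
let found = long.dropFirst(7).first

// The string has fewer than eight Characters, so the result is nil rather than a trap.
let missing = short.dropFirst(7).first

print(found as Any, missing as Any)
```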

7 Likes

How come dropFirst() is fast enough to be included in the language, but offset lookup is not?

Why are Int-based offset lookups not a niche use case?

2 Likes

Because I find myself and other people using it on a daily basis as one of the most basic string operations. That’s why all other programming languages support it. And how often does my string contain emoji?

I thought that it would be possible to optimize this in a compiled language. And even if it can't be optimized, it is still not expensive enough on average to justify leaving it out, in my opinion.

1 Like

I'm not convinced that your algorithms really need it. Most languages either represent strings as collections of code points, have buggy implementations, or have really problematic performance. What is your use case?

Emojis really aren't the crux here. Many natural language strings have the same issues.
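To illustrate with ordinary text rather than emoji, here is a rough sketch using an accented letter and a Windows-style line break. The strings are just examples; the scalars are written as escapes so the composition is unambiguous.

```swift
// "café" written two ways: with a precomposed é, and with e + combining acute accent.
let composed = "caf\u{E9}"
let decomposed = "cafe\u{301}"

// Both are four Characters and compare equal, but the storage differs.
print(composed.count, decomposed.count)
print(composed.unicodeScalars.count, decomposed.unicodeScalars.count)
print(composed.utf8.count, decomposed.utf8.count)
print(composed == decomposed)

// A CR-LF pair is a single Character as well.
let lineBreak = "\r\n"
print(lineBreak.count, lineBreak.unicodeScalars.count)
```

Since the number of bytes per Character varies even here, the position of the nth Character cannot be computed without scanning.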

2 Likes

And if you do, I'm almost certain that your use case is far more niche than parsing user input strings.

The dropFirst() method is inherited from a protocol conformance, and it carries no particular performance expectation. Everyone expects subscripting by an integer to be O(1), though, and for String an offset-based lookup never is.

1 Like

Don’t joke. Swift implements dozens of more exotic use cases. Being able to easily tell which character is at position n just feels natural and is very helpful even for things like debugging or parsing custom syntax. This String.Index type makes a basic thing difficult. Saying it’s niche is a little funny.

2 Likes

No, I don’t like it. It’s not clear what this line does when you read it.

This comes up about once every 6 months. Here are some previous discussions:

6 Likes

I don't think anyone is joking? You claim that non-trivial strings are niche, and that random access to specific offsets is not, without justifying your position. I have yet to see a relevant use case that isn't more niche.

4 Likes

For the simple case where you just want to find the character at a certain position, it may be all right (the performance cost of scanning through the string cannot be avoided). But if you write an algorithm that iterates from 0 to theString.count and indexes into the String on each pass, the performance penalty is far too high: O(n^2), compared to O(n) for a single scan through the string.
This is the kind of situation that the current API tries to prevent.

In other words: index lookups are expected to be constant in time, and looking up by an integer offset is not constant in time.
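A rough sketch of the two loop shapes, with a made-up sample string. The first recomputes the index from startIndex on every pass; the second walks the string's own indices once.

```swift
let text = "naïve café"

// Quadratic: every pass walks from startIndex to the requested offset.
var slow: [Character] = []
for offset in 0..<text.count {
    let idx = text.index(text.startIndex, offsetBy: offset)
    slow.append(text[idx])
}

// Linear: indices yields each String.Index in order, so no position is rescanned.
var fast: [Character] = []
for idx in text.indices {
    fast.append(text[idx])
}

// If an integer position is needed alongside each Character, enumerated() provides it.
for (offset, ch) in text.enumerated() {
    print(offset, ch)
}
```

Both loops produce the same Characters; only the amount of rescanning differs.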

4 Likes

Why isn’t this discouraged for arrays then? Why can’t strings internally store the index of each character efficiently and make lookups as fast as array indexing?

Any simple slicing, like skipping the first character and trimming the string to a certain position. You are trying too hard to justify your point. If you keep explaining everything like that, you will reinvent C or Assembler. Swift has a beautiful syntax that is clearly aimed at humans. But string operations are designed for machines at the moment; their API just sucks.

I'm sure that the answer is performance. Such an index would probably be costly to create and would take up space as well.

If you wish, you can convert a string into an array of characters by hand:

let a = "myString"
let b = Array(a)
let c = b[1]

Here you only pay the penalty of scanning through the string once - and can index into the array as many times as you wish.

You can also create a subscript extension for String yourself. But then you have to remember that it is costly and take steps to avoid using this subscript inside of a loop over the integer offsets.
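A minimal sketch of such an extension. The label characterAtOffset is made up for this example and is not part of the standard library; the label and the optional result are meant as a reminder that the lookup is O(n) and can fail.

```swift
extension String {
    /// Hypothetical helper: the Character at the given offset, or nil when out of range.
    /// Walks from startIndex on every call, so avoid it inside a loop over offsets.
    subscript(characterAtOffset offset: Int) -> Character? {
        guard offset >= 0,
              let idx = index(startIndex, offsetBy: offset, limitedBy: endIndex),
              idx < endIndex
        else { return nil }
        return self[idx]
    }
}

let word = "myString"
let second = word[characterAtOffset: 1]
let outOfRange = word[characterAtOffset: 8]
print(second as Any, outOfRange as Any)
```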

5 Likes

Strings ARE arrays of characters, in abstraction. A string is just an ordered collection of characters, where each character has its position.

Languages evolve and Swift makes it possible to work conveniently with many different structures. In the past, even objects were difficult to work with in most languages. The reason was always performance.

I can think of 3 possible explanations:

  1. Hardware power is still not good enough to handle Unicode characters efficiently.
  2. The Swift team hasn’t found a good algorithm yet.
  3. The Swift team made a poor choice by making the entire String API annoying just because it would be slow for very long strings. If so, maybe a separate type would be a better choice?

String is NOT an array of Characters; Array<Character> is! You can convert a String to an Array<Character> by doing

let arrayOfCharacter = Array(stringValue)

String is a Collection of Character (a BidirectionalCollection, not a RandomAccessCollection), mostly for performance reasons.
Because a Character doesn’t have a uniform size, an Array of Character will either waste a lot of space or be suboptimal for traversing in sequential order. That’s why it’s never what String is trying to be.

Note that Array(stringValue) needs to traverse the string once, which takes O(n) time and likely O(n) extra space.
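To make the non-uniform size concrete, here is a small sketch that prints how many UTF-8 bytes a few single-Character strings occupy. The samples are arbitrary, and it assumes a Swift 5 toolchain, where a thumbs-up with a skin-tone modifier counts as one Character.

```swift
// Each sample is exactly one Character, written with escapes to pin down the scalars.
let samples = [
    "a",                    // ASCII letter
    "\u{E9}",               // é, precomposed
    "\u{65E5}",             // 日
    "\u{1F44D}\u{1F3FD}",   // thumbs up + skin-tone modifier
]

// Number of UTF-8 bytes behind each single Character.
let byteCounts = samples.map { $0.utf8.count }

for (sample, bytes) in zip(samples, byteCounts) {
    print(sample, "characters:", sample.count, "utf8 bytes:", bytes)
}

// Converting once pays the O(n) scan up front; integer subscripting is O(1) afterwards.
let characters = Array(samples.joined())
print(characters[2])
```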

13 Likes