Why are String offsets so complicated?

Bear · January 24, 2019, 8:24am

Swift uses String.Index to navigate through character positions in String. As I understand it, this is done for performance, because of the use of Extended Grapheme Clusters. Am I correct?

I was thinking, isn't there a better way to do it? We can still find a character by numeric index with myString[myString.index(myString.startIndex, offsetBy: 7)]. I'm new to Swift, so please forgive my question, which may be naive. But I don't quite get why Swift, a language with such a smart compiler, can't deal with efficient String offsetting and asks developers to work on it at a low level. I mean, it's such a basic thing we do on a daily basis. The current implementation just doesn't match the philosophy of Swift and feels like some C++ feature.

Why is it such a complex problem?

Avi · January 24, 2019, 9:34am

My understanding is that there is a conscious decision not to paper over the performance pitfalls of Unicode.

Bear · January 24, 2019, 9:40am

This decision affects everyday operations, while 99% of the time it's just a matter of finding a character in a short string that doesn't include any extended grapheme clusters. I love many of the things Swift simplifies, but this one is a surprise to me. I think it would make more sense to implement a simple Int-based subscript, optimize it as much as possible in the compiler, and maybe warn about the possible pitfalls when using long strings with heavy Unicode characters. But that's quite a niche use case anyway, compared to String operations in general.

Daniel_Mullenborn · January 24, 2019, 11:34am

Maybe you'll like this method more.
myString.dropFirst(7).first

I think the idea is to keep people from treating strings as if they were a simple array of characters, which unfortunately they're not.
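For comparison, here is a minimal sketch of the two lookups mentioned so far, side by side. The sample string is made up for illustration. Both walk the string from the start, so neither is constant time; the practical difference is that dropFirst(_:).first yields an Optional and returns nil for a string that is too short, while index(_:offsetBy:) can trap at run time if the offset is out of bounds.

```swift
// Sample value chosen for illustration only.
let myString = "Hello, Swift!"

// Index-based lookup: walk 7 characters forward from startIndex.
let viaIndex = myString[myString.index(myString.startIndex, offsetBy: 7)]

// Sequence-based lookup: skip 7 characters, then take the next one.
// The result is Character?, so a short string gives nil instead of a crash.
let viaDrop = myString.dropFirst(7).first

print(viaIndex)             // S
print(viaDrop == viaIndex)  // true
```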

Bear · January 24, 2019, 11:59am

How come dropFirst() is fast enough to be included in the language, but offset lookup is not?

sveinhal · January 24, 2019, 12:00pm

Why are Int-based offset lookups not a niche use case?

Bear · January 24, 2019, 12:09pm

Because I find myself and other people using it on a daily basis as one of the most basic string operations. That’s why all other programming languages support it. And how often does my string contain Emoji?

I thought that for a compiled language it would be possible to optimize this. And even if it can't be optimized, on average it still isn't heavy enough to justify leaving it out, in my opinion.

sveinhal · January 24, 2019, 12:17pm

I'm not convinced that your algorithms really need it. Most languages either represent strings as collections of code points, have buggy implementations, or have really problematic performance. What is your use case?

Emojis really aren't the crux here. Many natural language strings have the same issues.
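To make that point concrete, here is a small sketch using an ordinary accented word rather than an emoji. The sample word is an assumption picked for illustration. One user-visible character can be made of several Unicode scalars and even more bytes, which is why a character offset cannot be turned into a storage offset without scanning.

```swift
// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
let decomposed = "cafe\u{301}"
// The same word using the single precomposed scalar U+00E9.
let precomposed = "caf\u{E9}"

print(decomposed.count)                 // 4 Characters (grapheme clusters)
print(decomposed.unicodeScalars.count)  // 5 Unicode scalars
print(decomposed.utf8.count)            // 6 UTF-8 bytes

// String equality is based on canonical equivalence, so these compare equal
// even though their underlying storage differs.
print(decomposed == precomposed)        // true
```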

sveinhal · January 24, 2019, 12:17pm

And if you do, I'm almost certain that your use case is far more niche than parsing user input strings.

Avi · January 24, 2019, 12:21pm

The dropFirst() method is inherited from a protocol conformance, and there's no particular performance expectation attached to it. Everyone expects integer subscripting to be O(1), though, and for String it never is.

Bear · January 24, 2019, 12:24pm

Don’t joke. Swift implements dozens of more exotic use cases. Being able to easily tell what the character at position n is just feels natural and is very helpful even for things like debugging or parsing custom syntax. This String.Index type makes a basic thing difficult. Saying it’s niche is a little funny.

Bear · January 24, 2019, 12:25pm

No, I don’t like it. It’s not clear what this line does when you read it.

nick.keets · January 24, 2019, 12:32pm

This comes up about once every 6 months. Here are some previous discussions:

sveinhal · January 24, 2019, 12:56pm

I don't think anyone is joking? You claim non-trivial strings are niche, and that random access to specific offsets is not, without justifying your position. I have yet to see a relevant use case that isn't more niche.

Morten_Bek_Ditlevsen · January 24, 2019, 12:56pm

For the simple use case where you just want to find the character in a certain position it may be alright (the performance cost of scanning through the string cannot be avoided), but if you write an algorithm that iterates from 0 to theString.count and indexes into the String on each iteration, then the performance penalty is way too high: O(n^2), compared to the O(n) of just scanning through the string once.
This is the kind of situation that the current API tries to prevent.

In other words: index lookups are expected to be constant in time, and looking up by an integer offset is not constant in time.
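The pattern described above can be sketched roughly as follows. The sample text is made up, and the two loops are only meant to contrast the access patterns, not to serve as a benchmark.

```swift
// Sample value chosen for illustration only.
let text = "a fairly short string"

// Quadratic pattern: every index(_:offsetBy:) call walks from startIndex
// again, so the loop does about 0 + 1 + ... + (n - 1) steps in total.
// (text.count is itself an O(n) scan.)
var slow: [Character] = []
for offset in 0..<text.count {
    let i = text.index(text.startIndex, offsetBy: offset)
    slow.append(text[i])
}

// Linear pattern: move through the string's own indices exactly once.
var fast: [Character] = []
for i in text.indices {
    fast.append(text[i])
}

print(slow == fast)  // true
```

Both loops produce the same characters; only the amount of work differs.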

Bear · January 24, 2019, 1:00pm

Why isn’t this discouraged for arrays then? Why can’t strings internally store the index of each character efficiently and make a lookup as fast as array indexing?

Bear · January 24, 2019, 1:03pm

Any simple slicing, like skipping the first character and trimming the string to a certain position. You are trying too hard to justify your point. If you keep explaining everything like that you will reinvent C or Assembler. Swift has a beautiful syntax that is clearly aimed at humans. But string operations are designed for machines at the moment; their API just sucks.
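For reference, that kind of slicing can be written without touching String.Index at all, using the collection methods String already has. This is only one possible spelling, with a made-up sample string:

```swift
// Sample value chosen for illustration only.
let input = "Hello, Swift!"

// Skip the first character, then keep at most 4 characters.
// Both calls return a Substring that shares storage with `input`.
let slice = input.dropFirst().prefix(4)

// Convert to String when the result needs to outlive the original.
let trimmed = String(slice)

print(trimmed)  // ello
```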

Morten_Bek_Ditlevsen · January 24, 2019, 1:12pm

I'm sure that the answer is performance. Such an index would probably be costly to create and would take up space as well.

If you wish, you can convert a string into an array of characters by hand:

let a = "myString"
let b = Array(a)
let c = b[1]

Here you only pay the penalty of scanning through the string once - and can index into the array as many times as you wish.

You can also create a subscript extension for String yourself. But then you have to remember that it is costly and take steps to avoid using this subscript inside of a loop over the integer offsets.
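A minimal sketch of such an extension might look like the following. The subscript and its offset: label are hypothetical, not part of the standard library; the label is there only to make the cost visible at the call site.

```swift
extension String {
    /// Hypothetical convenience subscript, not part of the standard library.
    /// Each call walks `offset` characters from the start, so it is O(n),
    /// and it traps if `offset` is out of bounds.
    subscript(offset offset: Int) -> Character {
        return self[index(startIndex, offsetBy: offset)]
    }
}

let word = "myString"
print(word[offset: 1])  // y
```

As noted above, calling this inside a loop over integer offsets brings back the quadratic behaviour, so it is best kept for one-off lookups.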

Bear · January 24, 2019, 1:18pm

Strings ARE arrays of characters, in the abstract. A string is just an ordered collection of characters, where each character has its position.

Languages evolve and Swift makes it possible to work conveniently with many different structures. In the past, even objects were difficult to work with in most languages. The reason was always performance.

I can think of 3 possible explanations:

1. Hardware power is still not good enough to handle Unicode characters efficiently.
2. The Swift team hasn’t found a good algorithm yet.
3. The Swift team made a poor choice by making the entire String API annoying just because it would be slow for very long strings. If so, maybe a separate type would be a better choice?

Lantua · January 24, 2019, 1:44pm

String is NOT an array of Characters, Array<Character> is! You can convert a String to an Array<Character> by doing

let arrayOfCharacter = Array(stringValue)

String is a Sequence of Character, mostly for performance reasons.
Because Character itself doesn’t have a uniform size, an Array of Character will either waste a lot of space or be suboptimal for traversing in sequential order. That’s why it’s never what String is trying to be.

Note that Array(stringValue) needs to traverse the string once, which takes O(n) time and likely O(n) extra space.