Why are String offsets so complicated?

I'm not convinced that your algorithms really need it. Most languages either represent strings as collections of code points, have buggy implementations, or have really problematic performance. What is your use case?

Emojis really aren't the crux here. Many natural language strings have the same issues.

2 Likes

And if you do, I'm almost certain that your use case is far more niche than parsing user input strings.

The dropFirst() method is inherited from a protocol conformance. It also carries no particular performance expectation, whereas everyone expects indexing to be O(1), and by integer offset it never is for String.

1 Like

Don’t joke. Swift implements dozens of more exotic use cases. Being able to easily tell what the character at position n is just feels natural and is very helpful even for things like debugging or parsing custom syntax. This String.Index type makes basic things difficult. Saying it’s niche is a little funny.

2 Likes

No, I don’t like it. It’s not clear what this line does when you read it.

This comes up about once every 6 months. Here are some previous discussions:

6 Likes

I don't think anyone is joking? You claim that non-trivial strings are niche, and that random access to specific offsets is not, without justifying your position. I have yet to see a relevant use case that isn't more niche.

4 Likes

For the simple use case where you just want to find the character at a certain position it may be alright (the performance cost of scanning through the string cannot be avoided), but if you write an algorithm that iterates from 0 to theString.count and indexes into the String on each pass, then the performance penalty is way too high: O(n^2), compared to O(n) for just scanning through it once.
This is the kind of situation that the current API tries to prevent.

In other words: index lookups are expected to be constant in time, and looking up by an integer offset is not constant in time.
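To make the cost difference concrete, here is a minimal sketch. The helper names `countVowelsByOffset` and `countVowelsByIteration` are made up for this illustration; only `index(_:offsetBy:)`, `count`, and plain iteration come from the standard library.

```swift
// Hypothetical helpers, named here only for illustration.

// Quadratic: every index(_:offsetBy:) call walks forward from startIndex again.
func countVowelsByOffset(_ s: String) -> Int {
    var total = 0
    for offset in 0..<s.count {                          // s.count is itself O(n)
        let i = s.index(s.startIndex, offsetBy: offset)  // O(offset) walk
        if "aeiou".contains(s[i]) { total += 1 }
    }
    return total
}

// Linear: visit each Character exactly once.
func countVowelsByIteration(_ s: String) -> Int {
    var total = 0
    for ch in s where "aeiou".contains(ch) {
        total += 1
    }
    return total
}

let sample = "string offsets are complicated"
let slow = countVowelsByOffset(sample)
let fast = countVowelsByIteration(sample)
```

Both functions return the same answer; the difference is only in how much work is done per character, which is what the integer-free API is nudging you to notice.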

4 Likes

Why isn’t this discouraged for arrays then? Why can’t strings internally store the index of each character efficiently and make a lookup as fast as array indexing?

Any simple slicing, like skipping the first character and trimming the string to a certain position. You are trying too hard to justify your point. If you keep explaining everything like that you will reinvent C or Assembler. Swift has a beautiful syntax that is clearly aimed at humans. But string operations are designed for machines at the moment; their API just sucks.

I'm sure that the answer is performance. Such an index would probably be costly to create and would take up space as well.

If you wish, you can convert a string into an array of characters by hand:

let a = "myString"
let b = Array(a)   // one O(n) pass over the string, giving [Character]
let c = b[1]       // O(1) lookup into the array, here "y"

Here you only pay the penalty of scanning through the string once - and can index into the array as many times as you wish.

You can also create a subscript extension for String yourself. But then you have to remember that it is costly and take steps to avoid using this subscript inside of a loop over the integer offsets.
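A minimal sketch of such an extension, assuming you want `nil` instead of a crash for out-of-range offsets. The `offset:` label is my own choice (not a standard library API), picked so that the O(n) cost stays visible at the call site.

```swift
extension String {
    // Hypothetical convenience subscript, not part of the standard library.
    // Each call walks the string from startIndex, so it is O(offset).
    subscript(offset offset: Int) -> Character? {
        guard offset >= 0,
              let i = index(startIndex, offsetBy: offset, limitedBy: endIndex),
              i < endIndex   // limitedBy still allows landing exactly on endIndex
        else { return nil }
        return self[i]
    }
}

let name = "caf\u{E9}"          // "café"
let fourth = name[offset: 3]    // the accented "é"
let missing = name[offset: 4]   // nil: one past the last character
```

This is fine for a one-off lookup. Inside a loop over `0..<name.count` it brings back exactly the quadratic behaviour described above.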

5 Likes

Strings ARE arrays of characters, in abstraction. A string is just an ordered collection of characters, where each character has its position.

Languages evolve and Swift makes it possible to work conveniently with many different structures. In the past, even objects were difficult to work with in most languages. The reason was always performance.

I can think of 3 possible explanations:

  1. Hardware power is still not good enough to handle Unicode characters efficiently.
  2. Swift team hasn’t found a good algorithm yet.
  3. Swift team made poor choices by making the entire String API annoying just because it would be slow for very long strings. If so, maybe a separate type would be a better choice?

String is NOT an array of Characters, Array<Character> is! You can convert a String to an Array<Character> by doing

let arrayOfCharacter = Array(stringValue)

String is a Collection of Character (bidirectional, but not random-access), mostly for performance reasons.
Because Character itself doesn’t have a uniform size, an Array of Character will either waste a lot of space or be suboptimal for traversing in sequential order. That’s why it’s never what String is trying to be.

Note that Array(stringValue) will need to traverse the string once which takes O(n) time and likely extra O(n) space.
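A small sketch of the non-uniform size point, using escapes so the exact code points are unambiguous. The variable names are just for illustration.

```swift
// Four Characters: "a", "é" (U+00E9), "€" (U+20AC), and a thumbs-up emoji (U+1F44D).
let text = "a\u{E9}\u{20AC}\u{1F44D}"

// How many UTF-8 bytes each Character occupies in the string's storage.
let utf8Lengths = text.map { String($0).utf8.count }

// Converting once costs O(n) time and extra space...
let chars = Array(text)
// ...but afterwards integer indexing is a plain O(1) array lookup.
let third = chars[2]
```

Here `utf8Lengths` comes out as `[1, 2, 3, 4]`: four characters, four different sizes, which is why a compact string cannot jump straight to the nth one.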

13 Likes

You are correct that in abstraction they are. And in implementation both a String and an Array are ordered collections. But while Array conforms to RandomAccessCollection (the conformance that promises constant-time movement to any position), String cannot, since that protocol is intended exactly for the types that can guarantee constant-time lookups.
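One way to see the protocol difference is a generic function that demands RandomAccessCollection. This is only a sketch; `middle` is a hypothetical helper, not something from the standard library.

```swift
// Hypothetical helper: picks the middle element, relying on the
// RandomAccessCollection promise that count and index offsetting are O(1).
func middle<C: RandomAccessCollection>(_ c: C) -> C.Element? {
    guard !c.isEmpty else { return nil }
    return c[c.index(c.startIndex, offsetBy: c.count / 2)]
}

let fromArray = middle([10, 20, 30])        // Array qualifies
// middle("abc")                            // does not compile: String is not random-access
let fromCharacters = middle(Array("abc"))   // works after an explicit O(n) conversion
```

Note that `index(_:offsetBy:)` itself exists on String too; it just does not come with the constant-time guarantee there.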

Each of your points is valid, but perhaps there are more considerations involved:

1: Even if hardware were more powerful, you would perhaps still want to handle everything as efficiently as possible. So an increase in power would perhaps not be adequate to solve the issue.

2: Perhaps it is less a matter of finding good algorithms and more a known and well-understood tradeoff: being correct (with respect to the Unicode standard) and efficient, at the 'cost' of having an API that may be slightly different from what you may be used to (in languages that perhaps do not care about Unicode correctness or efficiency).

3: Perhaps it is again not so much a poor choice as a tradeoff. Every other way of modelling a String API would have different issues. For instance, in C you could have a byte buffer, and you would be responsible for knowing whether its contents are ASCII or UTF-8, or even sequences of multi-byte UTF-8 code points that combine into single visible entities (the concept that is exactly modelled by the Character type, also referred to as an Extended Grapheme Cluster). Indexing here is fast, but handling the contents of the buffer is now entirely up to the user. There are a great many benefits to the way that String is modelled in Swift, and the way I see it, the language is completely taking care of issues that are very, very hard to deal with.
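A rough sketch of that contrast, using the string's UTF-8 view as a stand-in for the C-style byte buffer (the variable names are illustrative only):

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
let accented = "e\u{301}"

// Byte-buffer view: fast integer indexing, but you only get raw bytes back.
let bytes = Array(accented.utf8)
let firstByte = bytes[0]          // 0x65, a bare "e" without its accent

// Character view: one extended grapheme cluster.
let characterCount = accented.count

// Comparison follows Unicode canonical equivalence, so the decomposed
// form equals the precomposed U+00E9.
let sameAsPrecomposed = (accented == "\u{E9}")
```

With the buffer you get O(1) access to `bytes[0]`, but it is on you to notice that the byte is only part of a character. With String, the count is 1 and the two spellings of "é" compare equal without any extra work.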

But there is a downside: you are forced to consider the implications of referencing a Character inside a String. Although grasping this is not trivial, I still think that it is a good tradeoff, because day to day I don't have to deal with Unicode, character encoding or anything like that. Strings simply do the work for me!

3 Likes

It’s not about whether it is trying to be one or not. Good parts of programming languages make it convenient to express your thoughts in a natural way. Bad parts sacrifice syntax to make it easy for the machine to work efficiently.

Which one is most natural?

  1. What’s the third character of my name?
  2. What’s the third character after the starting index of my name?
  3. What’s the next character after the next character of the starting index?

Programming languages aim for natural syntax and Swift is one of the greatest I have seen so far. But I don’t like the excuses that strings are not trying to be like arrays (because that’s not true, strings are NATURALLY like arrays) or that it’s a niche problem to grab the nth character of a text. I can understand that the hardware is still slow and making it indexed would be too heavy, but that’s all it is. Call it by name: it’s a flaw. Not a design choice, not conscious ignorance of a niche problem, but a sacrifice. And in the future it should be possible to address strings by position, simply.

I think that a real jump in processing power would make this optimization marginal. Just like today you won’t name your files short to save memory and performance, because it’s marginal. It’s always relative. Optimizations become micro-optimizations when hardware becomes better. That’s why languages become more and more natural and just fun to write and read.

When it comes to Unicode, that statement is false. Even in UTF-32, a single character can span multiple code units. Arrays (in programming) contain fixed-size elements. Unicode characters are not fixed-length. It is exactly your misconception that Swift is trying to avoid.
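For example, a flag emoji is one character built from two code points. Swift has no separate UTF-32 view, so this sketch uses the Unicode scalar values, each of which corresponds to one 32-bit unit:

```swift
// Regional indicator symbols D + E, which render as a single flag.
let flag = "\u{1F1E9}\u{1F1EA}"

// One user-perceived Character...
let flagCharacters = flag.count

// ...but two 32-bit code points.
let flagScalars = flag.unicodeScalars.map { $0.value }
```

So even with four bytes per code point, "element number n" of the buffer is not "character number n".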

Edit: Strings are like arrays, but they aren't arrays. This difference is important enough that the Swift core devs feel it should not be papered over.

13 Likes

Oh really? So arrays of strings don’t exist, because arrays always contain same-sized elements?

1 Like

At this point I think it's best to drop the topic or revive (with good effort put forth in the revival) an old discussion around improving the ergonomics of the String API.

Swift's string API was intentionally designed around Unicode correctness and hiding potential pitfalls related to working with Unicode strings. To say it is flawed is flat out wrong and hurts your argument. It might not be an API you're used to from other languages, but it is a good API for working with strings in a Unicode-safe manner. And part of that is giving up the notion that indexing into strings to get the n-th character is always going to be a constant-time operation; it is not.

7 Likes

The String struct is fixed size, and contains a pointer to its storage.
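A quick way to check this yourself, as a sketch: the in-memory size of a String value does not depend on how much text it holds, which is why `[String]` still has fixed-size elements. (As far as I know, very short strings may be stored inline rather than behind a pointer, but the struct size stays the same either way.)

```swift
let short = "hi"
let long = String(repeating: "x", count: 10_000)

// Size of the String value itself, not of the text it refers to.
let shortSize = MemoryLayout.size(ofValue: short)
let longSize = MemoryLayout.size(ofValue: long)

// The distance between consecutive elements in a [String].
let elementStride = MemoryLayout<String>.stride
```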

2 Likes