Why are String offsets so complicated?

Morten_Bek_Ditlevsen · January 24, 2019, 1:49pm

You are correct that in abstraction they are. And in implementation both a String and an Array are ordered collections. But while Array conforms to RandomAccessCollection (the conformance that promises constant-time indexing by offset), String cannot, since that protocol is intended only for types that can guarantee constant-time lookups.
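To make the conformance point concrete, here is a minimal sketch. The helper `element(_:atOffset:)` is a hypothetical name, not a standard library API; it accepts only collections that promise constant-time index movement, so an Array passes and a String is rejected at compile time.

```swift
// Hypothetical helper: accepts only collections that guarantee O(1) index movement.
func element<C: RandomAccessCollection>(_ c: C, atOffset n: Int) -> C.Element? {
    guard n >= 0, n < c.count else { return nil }
    return c[c.index(c.startIndex, offsetBy: n)]
}

let numbers = [10, 20, 30]
let fromArray = element(numbers, atOffset: 1)           // Array is a RandomAccessCollection

let text = "h\u{E9}llo"                                 // "héllo"
// element(text, atOffset: 1)                           // does not compile: String is only a BidirectionalCollection
let fromCharacters = element(Array(text), atOffset: 1)  // explicit conversion to [Character]
```

The explicit `Array(text)` is where the O(n) cost is paid, once and visibly.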

Each of your points is valid, but perhaps there are more considerations involved:

1: Even if hardware were more powerful, you would perhaps still want to handle everything as efficiently as possible. So an increase in power would perhaps not be enough to solve the issue.

2: Perhaps it is less a matter of finding good algorithms and more a known and well understood tradeoff between being correct (with respect to the Unicode standard) and efficient at the 'cost' of having an API that may be slightly different from what you may be used to (in languages that perhaps do not care about Unicode correctness or efficiency).

3: Perhaps it is again not so much a poor choice as a tradeoff. For each other possible way of modelling a String API there would be different issues. For instance, in C you could have a byte buffer, and you would be responsible for knowing whether the bytes are ASCII or UTF-8, or even sequences of multibyte UTF-8 code points that combine into single visible entities (the concept that is exactly modelled by the Character type - also referred to as an Extended Grapheme Cluster). Indexing here is fast, but handling the contents of the buffer is now entirely up to the user. There are really, really many benefits to the way that String is modelled in Swift - and the way I see it, the language is completely taking care of issues that are very, very hard to deal with.

But there is a downside - namely that you are forced to consider the implications of referencing a Character inside a String. Although grasping this is not trivial, I still think that it is a good tradeoff, because day to day I don't have to deal with Unicode, character encoding or anything like that. Strings simply do the work for me!
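As a small illustration of what "Strings simply do the work" can mean in practice, the sketch below builds the same visible character in two ways. The variable names are only for illustration.

```swift
let precomposed = "\u{E9}"     // é as a single Unicode scalar
let decomposed = "e\u{301}"    // e followed by a combining acute accent

let precomposedCount = precomposed.count               // counts Characters (grapheme clusters)
let decomposedCount = decomposed.count                 // still one Character
let decomposedScalars = decomposed.unicodeScalars.count
let decomposedBytes = decomposed.utf8.count
let sameText = (precomposed == decomposed)             // String comparison uses canonical equivalence
```

Both strings count as one Character and compare equal, even though the second one is made of two scalars and three UTF-8 bytes.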

Bear · January 24, 2019, 1:54pm

It’s not about it trying to be it or not. Good parts of programming languages make it convenient to express your thoughts in a natural way. Bad parts sacrifice syntax to make it easy for the machine to work efficiently.

Which one is most natural?

What’s the third character of my name?
What’s the third character after the starting index of my name?
What’s the next character after the next character of the starting index?

Programming languages aim for natural syntax and Swift is one of the greatest I have seen so far. But I don’t like the excuses that strings are not trying to be like arrays (because that’s not true, strings are NATURALLY like arrays) or that grabbing the nth character of a text is a niche problem. I can understand that the hardware is still slow and making it indexed would be too heavy, but that’s all it is. Call it by name: it’s a flaw. Not a design choice, not conscious ignorance of a niche problem, but a sacrifice. And in the future it should be possible to talk to strings by positions, simply.

Bear · January 24, 2019, 1:59pm

I think that a real jump in processing power would make this optimization marginal. Just like today you won’t name your files short to save memory and performance, because it’s marginal. It’s always relative. Optimizations become micro-optimizations when hardware becomes better. That’s why languages become more and more natural and just fun to write and read.

Avi · January 24, 2019, 2:01pm

When it comes to Unicode, that statement is false. Even UTF-32 has multi-word characters. Arrays (in programming) contain fixed-sized elements. Unicode characters are not fixed-length. It is exactly your misconception that Swift is trying to avoid.

Edit: Strings are like arrays, but they aren't arrays. This difference is important enough that the Swift core devs feel it should not be papered over.
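A minimal sketch of the UTF-32 point, using a flag emoji as the example (my choice of example, not from the post above). The `unicodeScalars` view yields one 32-bit value per scalar, which is effectively the UTF-32 form of the string.

```swift
let flag = "\u{1F1E9}\u{1F1F0}"                    // regional indicators D + K, shown as one flag
let utf32 = flag.unicodeScalars.map { $0.value }   // [UInt32], one fixed-width unit per scalar

let unitCount = utf32.count                        // number of 32-bit units
let characterCount = flag.count                    // number of user-visible Characters
```

Even with fixed-width 32-bit units, the single visible character spans two of them, so a fixed-width encoding alone does not give random access to Characters.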

Bear · January 24, 2019, 2:03pm

Oh really? So arrays of strings don’t exist, because arrays always contain same-sized elements?

nuclearace · January 24, 2019, 2:04pm

At this point I think it's best to drop the topic or revive (with good effort put forth in the revival) an old discussion around improving the ergonomics of the String API.

Swift's string API was intentionally designed around Unicode correctness and hiding potential pitfalls related to working with Unicode strings. To say it is flawed is flat out wrong and hurts your argument. It might not be an API you're used to from working in other languages, but it is a good API for working with strings in a Unicode-safe manner. And part of that is giving up the notion that indexing into strings to get the n-th character is always going to be a constant-time operation. It is not.

Avi · January 24, 2019, 2:04pm

The String struct is fixed size, and contains a pointer to its storage.
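One way to check the fixed-size claim is to compare the in-memory size of two String values of very different lengths. The exact byte count, and the fact that short strings may be stored inline rather than behind a pointer, are implementation details, so this sketch only compares sizes and does not assume a particular number.

```swift
let short = "hi"
let long = String(repeating: "x", count: 10_000)

// Size of the String value itself, not of the text it refers to.
let shortSize = MemoryLayout.size(ofValue: short)
let longSize = MemoryLayout.size(ofValue: long)

// Because every element has the same stride, [String] can index in constant time.
let names = ["Al", "Bartholomew"]
let secondName = names[1]
```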

michelf · January 24, 2019, 2:05pm

String chooses to keep its internal representation compact, which means characters have a variable width, which means it cannot really offer random-access. It's a bit like trying to get the 100th byte of a file that has been zipped: you need to unzip the file first, but you can stop unzipping once you reach the 100th byte. Unzipping like this every time you try to access a byte is a bad idea though, even on a powerful computer.

If you want the unzipped version, you can easily store it in an array of characters. It'll take more space in memory (4 to 8x) and require extra allocations for representing complex grapheme clusters, but you'll have quick access to all the characters. Those who made the language decided it wasn't worth it to impose those costs on everyone working with strings, so you have to request it explicitly.

Lantua · January 24, 2019, 2:09pm

As I mentioned, you can do Array(someString).

If elements are not of the same size, how do you random access it? If you want the 19th element, you need to know the size of 1st element, 2nd element, and so on until 18th element. This takes provably O(n) time. We can sugar around this, but people will just use it without realising the O(n) implication and be bewildered why it’s so slow.
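The scan described above can be sketched by hand over raw UTF-8 bytes. The function `byteOffset(ofScalar:in:)` is a hypothetical helper written for this illustration. It locates Unicode scalars, not full grapheme clusters, which would need even more work per element.

```swift
// Hypothetical helper: byte offset where the n-th Unicode scalar (0-based) starts.
// There is no way to jump ahead; every earlier byte has to be inspected first.
func byteOffset(ofScalar n: Int, in bytes: [UInt8]) -> Int? {
    var scalarsSeen = 0
    for (offset, byte) in bytes.enumerated() {
        // Continuation bytes look like 10xxxxxx; any other byte starts a new scalar.
        let startsScalar = (byte & 0b1100_0000) != 0b1000_0000
        if startsScalar {
            if scalarsSeen == n { return offset }
            scalarsSeen += 1
        }
    }
    return nil   // fewer than n + 1 scalars in the buffer
}

let bytes = Array("h\u{E9}llo".utf8)                     // 6 bytes: é takes two
let offsetOfThird = byteOffset(ofScalar: 2, in: bytes)   // first "l"
let pastTheEnd = byteOffset(ofScalar: 5, in: bytes)
```

The third scalar starts at byte 3 rather than byte 2, because the two-byte é before it shifts everything that follows.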

Bear · January 24, 2019, 2:10pm

But the thing is, 99.99% of use cases are short strings. I’m not saying that Swift got it wrong for what it does. I’m saying that it’s not what would be welcomed in a language that tries to be developer-friendly, which Swift claims to be. Maybe it’s too soon to do it, maybe hardware is still too slow. That’s all okay. I just want to be sure that’s the case and that at this point there is no better way to deal with it.

Morten_Bek_Ditlevsen · January 24, 2019, 2:10pm

This is a really nice analogy.

Bear · January 24, 2019, 2:12pm

Nice analogy, but not proportional in the real world. It’s like saying that you can’t eat a ham, because the entire pig is too much.

Lantua · January 24, 2019, 2:15pm

That’s what it is, a trade-off, a sacrifice. You get to traverse back and forth super, super, super efficiently, and have it be Unicode-correct, but you lose the ability to get to a specific location without traversing to it first.
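Here is a short sketch of that back-and-forth traversal, using only standard String index operations. The example word is arbitrary.

```swift
let word = "h\u{E9}llo"                            // "héllo"

let second = word.index(after: word.startIndex)    // one step forward from the start
let last = word.index(before: word.endIndex)       // one step backward from the end

let secondCharacter = word[second]
let lastCharacter = word[last]
let reversedWord = String(word.reversed())         // walks the string from the back
```

Each step moves by one whole Character, however many bytes it occupies.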

Bear · January 24, 2019, 2:18pm

And I like that explanation. Instead of deluding ourselves into thinking that it wouldn’t be a better API, let’s just say that for now it would be too costly.

By flawed I mean flawed for human interaction, not flawed as in a bad idea compared to what could have been done instead. To make it clear, because some people got hurt here.

sveinhal · January 24, 2019, 2:28pm

Nope. But I can see why you're confused about my position. I'm advocating against using the cumbersome syntax altogether. I'm suggesting you instead use the iterator, or functions such as prefix and family. I'm suggesting that if you're trying to access specific indices of the character sequence, you're probably doing something wrong, and your algorithm is probably ported from some pseudo code that assumes both trivial encoding and presentation.

If you for some niche reason really need random access to a collection of characters, you should first convert your string into such a view, like so: let characters = Array(string) and use that for random access lookup or mutation.
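Both suggestions can be sketched together. The sample text and variable names are only for illustration.

```swift
let string = "h\u{E9}llo w\u{F6}rld"          // "héllo wörld"

// Sequence-style access: no explicit indices at all.
let firstWord = String(string.prefix(5))
let secondWord = String(string.dropFirst(6))

// Explicit conversion for real random access and mutation.
var characters = Array(string)                // [Character], O(n) once
let seventh = characters[6]                   // O(1) from here on
characters[0] = "H"
let capitalized = String(characters)
```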

Bear · January 24, 2019, 2:33pm

For the same reason, array random access is as niche as that. I understand why you call it niche, but I don’t agree that it’s not useful.

If you want a hardcore explanation, let’s think of making a tool for text processing. Either a production one or just something for you to quickly transform text for later usage. I really sometimes need to access a certain position, based on a position from another string, for example. Or, as someone said in another topic, slicing text is not niche for sure. Or if I have my own naming convention for IDs of anything and there are different sections in it, random access is natural too.

sveinhal · January 24, 2019, 2:59pm

Slicing is surely a common use case. However, doing it by integer offsets is error-prone for most strings and neither more readable nor more performant than walking the sequence.

But I'll let this discussion be now. I've made my point: If you want a random access view into the string, the conversion is trivial and more performant than walking the string over and over from the beginning.
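As one possible illustration of slicing without integer offsets, take an ID with sections like the ones mentioned earlier in the thread. The format `USR-2019-0042` is made up for this sketch.

```swift
let id = "USR-2019-0042"                       // hypothetical ID format

// Split on the separator instead of counting positions.
let sections = id.split(separator: "-").map { String($0) }

// Or slice at an index found by searching.
let firstDash = id.firstIndex(of: "-")
let kind = firstDash.map { String(id[..<$0]) }
let remainder = firstDash.map { String(id[id.index(after: $0)...]) }
```

Neither version depends on how wide each section is, so it keeps working if a section grows or contains non-ASCII text.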

MrBee · January 24, 2019, 3:34pm

I get the point of the thread starter. In this modern age, programming languages are becoming more like human language than machine language, because machines are getting smarter and faster, so compilers are also getting smarter and faster. Programmers today "speak" to machines like they're speaking to another "human". Machines today are better at understanding humans than the other way around.

To human eyes, a héllo text is an array of 5 characters, no matter how the machine encodes it. Humanly speaking, we know the second character is an é. Whether technically it's stored as 6 bytes in UTF-8, or as some other number of code units in UTF-16 or UTF-32 or whatever, we don't care. Text encoding is a technical problem that is the machine's problem, not the human's problem. So, let the machine handle it, not us.

We expect the string API to reflect the way we see text naturally. We see héllo as simply a text, not as UTF-8 or UTF-16 etc. If we store a héllo text in a string variable named s, we surely know that s[1] equals é (counting from zero). We don't need to write it as s[s.index(s.startIndex, offsetBy: 1)] because we don't naturally read a text that way. Even if s[1] is translated to s[s.index(s.startIndex, offsetBy: 1)] under the hood, so be it. If the performance is slow, then it's the duty of the Swift developers to make it fast.

It's the same reason why today memory management is mostly done automatically. Because it's the machine's problem, not the human's problem, so let the machine handle it for us. Unless we have a special case where we need to go deeper into the technical details ourselves, we don't need to be that technically verbose.
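The sugar described above can be written in a few lines. The `offset:` subscript below is a hypothetical extension, not part of the standard library, and it hides an O(n) walk behind array-like syntax.

```swift
extension String {
    /// Hypothetical convenience, not a standard library API.
    /// Walks from startIndex on every call, so the cost grows with n.
    subscript(offset n: Int) -> Character {
        return self[index(startIndex, offsetBy: n)]
    }
}

let s = "h\u{E9}llo"            // "héllo"
let second = s[offset: 1]       // é

// Looks like a linear loop, but every lookup restarts from the beginning,
// so the total work is quadratic in the length of the string.
var collected = ""
for i in 0..<s.count {
    collected.append(s[offset: i])
}
```

For short strings the extra cost is unlikely to matter; the loop shows the kind of code where it quietly adds up.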

Avi · January 24, 2019, 4:00pm

That's kind of the point, though. There's no way to make it faster and still be Unicode-correct. And the Swift devs want you (us) to know that.

Tino · January 24, 2019, 4:28pm

Other languages may have "broken" string implementations that produce errors when confronted with the pitfalls of Unicode, but if someone is used to handling strings like this:

String s = "hello";  
System.out.println(s.substring(0, 2)); // -> he

Swift might really look like a step in the wrong direction, even if it actually has the better model: It isn't obvious why suddenly you need to type s three times, instead of just specifying two integers, and without knowing quite a lot of details, it hardly makes sense.

Our current API is superior when you need fast iteration over a string, but I don't think that is enough justification to force everyone to use it: High performance is rarely needed, and if someone realizes that his string processing is too slow, it can be optimized easily.
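For comparison with the Java snippet, here are two Swift spellings of the same operation, as a rough sketch. The first avoids indices entirely; the second is the index-based form in which `s` appears three times.

```swift
let s = "hello"

// Sequence-style: take the first two Characters.
let viaPrefix = String(s.prefix(2))

// Index-style: build the end index, then slice.
let end = s.index(s.startIndex, offsetBy: 2)
let viaIndices = String(s[s.startIndex..<end])
```

Both produce "he".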