my guess is you're dealing with an API that takes a string and gives you an integer character offset back? that's the API's fault, it should be returning a String.Index
or if it's a C API, a byte offset.
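For what it's worth, bridging either kind of offset into a `String.Index` is only a couple of lines. A minimal sketch, with the offsets and the sample string invented purely for illustration:

```swift
let name = "café au lait"

// Hypothetical API result: a *character* offset.
let characterOffset = 3
if let idx = name.index(name.startIndex, offsetBy: characterOffset, limitedBy: name.endIndex),
   idx != name.endIndex {
    print(name[idx])  // "é"
}

// Hypothetical C API result: a UTF-8 *byte* offset.
let byteOffset = 6
let utf8Index = name.utf8.index(name.utf8.startIndex, offsetBy: byteOffset)
if let charIndex = utf8Index.samePosition(in: name) {
    print(name[charIndex])  // "a"
}
```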
I mean returning the character at position n, just like in any other programming language. For example, my cursor is 200px from the left, so I calculate that it's at my 27th character. Now I need to insert something at this position, or remove that character, or display it somewhere else, or anything else.
Well, @sveinhal said it's not useful and wanted me to give examples.
are you implementing an app console with a monospace font?
Nope, let's say I implement a game with a name-typing feature in a monospace font.
Does your font really support all 137,439 Unicode characters with uniform spacing? What happens when the user enters a name that runs right-to-left?
Maybe I will filter the input to a set of allowed characters. Since it's a game, I will probably just use clickable graphic letters and allow typing them with the keyboard as well. But it doesn't mean I will store the name in anything other than a String. I mean, we are talking about one of the most basic data types and it's like I asked about using it to fry an egg. If I need to avoid using String in cases like that, or to go out of my way to put a character at position n, this is an unfriendly API, no matter how hard you bend over backwards to defend it.
fonts have a flag that declares if all their glyphs are of equal advance width. also, you are presuming way too much about the character sets the game industry supports.
I think one of the reasons you're getting somewhat negative responses is because I don't think you've demonstrated your understanding of the problem Swift's String design is trying to solve.
What you're saying is akin to "Why can't I just index a linked list to easily get the nth element?" You have two options:
- If you're doing it once, you can deal with the syntactic salt of dealing with `String.Index`.
- If you're doing it more than once, `LinkedList` (and by analogy, `String`) is the wrong choice of data structure. Both of them optimize for particular use cases (`LinkedList` optimizes for fast prefix insert/delete and rearrange operations, and `String` optimizes for compact storage of performantly readable text) which are incompatible with your desire for frequent random access. Switching to a contiguous data structure like an `Array` is fast and easy in both cases (see the sketch below), and is better suited to your use case. And once you're done with your random accessing, you can easily hop back into the world of `String`/`LinkedList` to get back the benefits they provide.
- "Well, why isn't `String` just implemented as an `Array<Character>` then?" Because that makes trade-offs that are inappropriate for the most common use cases:
  - Contiguous data structures can only provide constant-time indexing because they make assumptions about element size (that all elements are the same size), so that the position of any given element is predictable from its index alone. To do this for Unicode strings would waste a lot of memory on padding; in effect, every character would need to be padded out to the size of the largest character. Alternatively (and even worse), each character's data could be referenced indirectly, which would incur absolutely crippling heap-allocation/ARC costs.
  - The current compact representation of Swift's String uses less RAM and, more importantly, less cache space, so String algorithms can work faster by virtue of causing less cache thrashing. This is much more important than providing constant-time indexing, because it's a much more common use case. (All strings are printed/saved/transmitted, but only a small number of them are ever indexed directly.)
I think you would incite better responses if you do a better job in understanding the problem, and importantly, demonstrating your understanding so that others will take you seriously.
What are you talking about? If I demonstrate understanding of the problem, I will be like some of you here, saying "it can't be done". I do understand Unicode and that this is a difficult problem to solve, but what does that change? My topic is about the language API, which is, in my opinion, too complex for what it is meant to do in most cases.
Memory management is a perfect example of the same thing, which I raised a few times before. If I had asked a question about the possibilities of automatic memory management a few years ago, I would have gotten the same responses, and would have been called out for "not demonstrating understanding of the problem". There are many possible solutions for this string case, including having a separate API for what it does now and keeping the simple API, slower, because it's what we usually do with strings. Or just being open to the idea that maybe one day a better algorithm will be born in someone's mind.
I will demonstrate my understanding for you to feel better: oh my gosh, this Unicode is so difficult to work with efficiently. Am I good enough to ask my question now?
In my experience, code that needs random-access to the characters of a string is very common in programming exercises and toy problems encountered by people learning, where a string is nearly always equivalent to an array of ASCII characters and the goal is to display simple manipulations of it. However, random access to characters is quite rare in real-world programs that instead need to handle blobs of text, sometimes quite large, and nearly always internationalized into many languages that use the full expressivity of Unicode.
Swift has made the design decision to optimize for correctness in the real-world case at the expense of ease-of-use for the learning case, which I fully support, but it's also clear why that's frustrating to beginners. When doing coding exercises in Swift, it's frequently the right thing to do to convert the string input to an array of characters for the manipulations and then back to a string at the end.
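For instance, a sketch of that round trip; the helper name and behavior are invented for illustration:

```swift
// Replace the character at a given offset, exercise-style.
func replacingCharacter(at offset: Int, with newChar: Character, in input: String) -> String {
    var chars = Array(input)                          // into Array<Character> for random access
    guard chars.indices.contains(offset) else { return input }
    chars[offset] = newChar
    return String(chars)                              // and back to String at the end
}

print(replacingCharacter(at: 4, with: "0", in: "swift"))  // "swif0"
```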
Maybe I don't feel like it's rare because I come from the web dev world, where you read and manipulate user input all the time and have many stringy identifiers like tokens, which sometimes carry pieces of information at certain positions. So it's not toy programming nor beginner's programming there.
That way you can call many of Swift's features toys. Inferring types is also kind of a toy, and that's actually great. The language feels fun to use because of that.
If someone finds a feature unintuitive to use, it doesn't necessarily mean he's a noob or doesn't understand the problem. He just wants the language to become as cool as possible.
Tbh, we suggested that you convert them to an array and work with that (twice by me, a few times by others), yet you never say what's wrong with it.
There are several things you said that indicated to me that you weren't sufficiently familiar with this problem to be able to understand the tradeoffs made in the current design. Such as:
The majority of text worldwide uses extended grapheme clusters to encode all non-Latin alphabets (and even some Latin alphabets, those with accented characters).
This is a runtime feature that the compiler can't particularly help with. Also, the `O(n)` cost of extended grapheme traversal is paid on all strings, even those without any "heavy Unicode characters".
It's no faster; it just happens to be shorter to spell.
Because arrays store elements of constant size. You'll never see an array with some rule like "if `2` appears after `1`, the two together are actually a single element". It's always a simple 1:1 mapping between integer indexes and consistently sized positions. Yet that's the essence of exactly what Unicode does.
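That "elements combine" behaviour is easy to observe on real strings; a small illustration, with arbitrary sample strings:

```swift
// One Character, two Unicode scalars: "e" followed by a combining acute accent.
let accented = "e\u{301}"
print(accented.count)                 // 1
print(accented.unicodeScalars.count)  // 2
print(accented.utf8.count)            // 3 bytes

// One Character, two regional-indicator scalars, eight UTF-8 bytes.
let flag = "🇨🇦"
print(flag.count)       // 1
print(flag.utf8.count)  // 8 bytes
```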
Because the cost would be absolutely unbearable. The vast majority of strings are never subscripted, so you're always paying a cost for which you only sometimes see benefit. Plus, every mutation of the string would invalidate the "index" cache and require an `O(n)` rebuilding process. Just think about what that would be like in a situation where someone builds up a string using a `for` loop.
No, they're not. An "array" is a term of art for a contiguous collection of consistently sized elements that can provide `O(1)` random access. String is not that, at all.
Ding ding ding, correct. But "ordered collection" ≠ "array", and the implications are very different.
Nor will it ever. No amount of hardware improvement turns an `O(n)` operation into `O(1)`. And many strings are still sufficiently large that the distinction is very pronounced.
Or the Unicode Consortium, for that matter. `O(1)` grapheme breaking simply isn't possible. It would be akin to figuring out `O(1)` random access into a `LinkedList`, so good luck with that.
It would be slow for even surprisingly short strings. A harmless `s[i]` operation within a `for i in 0..<s.count` loop turns `O(n^2)`, without you even noticing. Considering that Swift's most popular target platform is a low-energy, low-power CPU with heat constraints, making willy-nilly quadratic string algorithms has the potential for a large negative customer impact by causing hotter phones, faster battery drain, and lower responsiveness.
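Roughly what that trap looks like in practice, as a sketch (the string length is arbitrary):

```swift
let s = String(repeating: "a", count: 20_000)

// Looks innocent, but every iteration walks from startIndex again: O(n²) overall.
var quadratic = 0
for i in 0..<s.count {
    if s[s.index(s.startIndex, offsetBy: i)] == "a" {
        quadratic += 1
    }
}

// The linear alternative: just iterate the characters (or use s.indices).
var linear = 0
for c in s where c == "a" {
    linear += 1
}
```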
Syntax is not sacrificed to make the "machine work efficiently." You could very easily write an extension on `String` to add `Int`-based subscripts (sketched below). The syntax does absolutely nothing to help the machine. It's syntactic salt. It's made to work as an eye-catcher for developers to say: "something suspicious is happening here."
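Something along these lines compiles and works; it just hides an `O(n)` walk behind `O(1)`-looking syntax (a sketch, not a recommendation):

```swift
extension String {
    // Reads like constant-time array indexing, but each call is an O(n) traversal.
    subscript(offset: Int) -> Character {
        self[index(startIndex, offsetBy: offset)]
    }
}

let greeting = "héllo 👋"
print(greeting[1])  // "é"
print(greeting[6])  // "👋"
```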
Already addressed this, but again, it's simply not true.
Yes, it is marginal; that's only a constant-factor change in problem size. That's a completely different kind of optimization compared to one that changes `O(1)` to `O(n)`.
Try telling the traveling salesman that.
This quite clearly shows a lack of understanding of how `Array` works.
You don't see the difference between abstraction and implementation. You justify concepts by what they are internally. That's a common syndrome among programmers. It's the same issue as designing UI by what it does technically instead of how it interacts with a human being.
By array I mean an ordered collection. Fixed-size elements are just Swift's implementation; many languages have arrays of varying-size elements. Conceptually they are the same, but the implementation is different.
Alright, look, this conversation got pretty heated an hour ago and hasn't cooled off, and there's a lot of defensiveness and aggression flying in all directions. I do not want to be in a position where I feel like I'm cutting off someone's ability to make a point, but I also think this thread is starting to run on pure heat. I am locking this until when-I-feel-like-it tomorrow.
If an abstraction in a programming language papers over big performance differences, it's a bad abstraction. Performance matters; in a programming language, it's not an implementation detail.