Why are String offsets so complicated?

What are you talking about? If I demonstrate understanding the problem, I will be like some of you here, saying "it can’t be done". I do understand Unicode and that this is a difficult problem to solve, but what does it change? My topic is aboult language API, which is too complex in my opinion for what it is meant to do in most cases.

Memory management is perfect example of the same thing, which I raised a few times before. If I asked a question about possibilities of automatic memory management a few years ago, I would get the same responses. And would have been called out for "not demonstrating understanding of the problem". There are many possible solutions for this string case, including having separate API for what it does now and keeping the simple API, slower, because it’s what we usually do with strings. Or just being opened for idea that maybe one day, a better algorithm will be born on someone’s mind.

I will demonstrate my understanding for you to feel better: oh my gosh, this Unicode is so difficult to work with efficiently. Am I good enough to ask my question now?

1 Like

In my experience, code that needs random-access to the characters of a string is very common in programming exercises and toy problems encountered by people learning, where a string is nearly always equivalent to an array of ASCII characters and the goal is to display simple manipulations of it. However, random access to characters is quite rare in real-world programs that instead need to handle blobs of text, sometimes quite large, and nearly always internationalized into many languages that use the full expressivity of Unicode.

Swift has made the design decision to optimize for correctness in the real-world case at the expense of ease-of-use for the learning case, which I fully support, but it's also clear why that's frustrating to beginners. When doing coding exercises in Swift, it's frequently the right thing to do to convert the string input to an array of characters for the manipulations and then back to a string at the end.

6 Likes

Maybe I don’t feel like it’s rare, because I come from a web dev world, where you read and manipulate user’s input all the time, have many stringy identifiers like tokens, which sometimes have parts of information at certain positions. So it’s not toy programming nor beginner’s programming there.

1 Like

That way you can call many Swift’s features toys. Inferring types is also kind of a toy and that’s actually great. Language feels fun to use because of that.

If someone finds a feature not intuitive to use, it doesn’t necessarily mean he’s a noob or doesn’t understand the problem. He’s just looking forward for the language to become as cool as possible.

1 Like

Tbh, we suggested that you convert them to array and work with that (twice by me, a few times by others), yet you never say what’s wrong with it.

1 Like

There are several things you said that indicated to me that you weren't sufficiently familiar with this problem to be able to understand the tradeoffs made in the current design. Such as:

The majority of the text world-wide uses extended grapheme clusters, to encode all non-latin alphabets. (and even some latin alphabets, those with accented chars).

This is a runtime feature that the compiler can't particularly help with. Also, the O(n) cost of extended grapheme traversal is paid on all strings. Even those without any "heavy Unicode characters"

It's no faster, it just happens to be a shorter to spell

Because arrays store elements of constant size. You'll never see an array with some rule like "if 2 appears after 1, the two together are actually a single element". It's always a simple 1:1 mapping between integer indexes and consistently-sized positions. Yet, that's the essence of exactly what Unicode does.

Because the cost would be absolutely unbearable. The vast majority of strings are never subscripted, so you're always paying a cost for which you only sometimes see benefit. Plus, every mutation of the string would invalidate the "index" cache, and require an O(n) rebuilding process. Just think about what that would be like in a situation where someone builds up a string using a for loop.

No, they're not. An "array" is a term-of-art for a contiguous collection of consistently sized elements, that can provide O(1) random access. String is not that, at all.

Ding ding ding, correct. But "ordered collection" ≠ "array", and the implications are very different.

Nor will it ever. No amount of hardware improvement turns a O(n) operation into O(1). And many strings are still sufficiently large that the distinction is very pronounced.

Or the unicode consortium, for that matter. O(1) grapheme breaking simply isn't possible. It would be akin to figuring out O(1) random access into a LinkedList, so good luck with that.

It would be slow for even surprisingly short strings. A harmless s[i] operation within a for i in 0...s.count loop turns O(n^2), without you even noticing. Considering that Swift's most popular target platform is a low-energy, low power CPU with heat constraints, making willy-nilly quadratic string algorithms has the potential for a large negative customer impact by causing hotter phones, faster battery drainer, and lower responsiveness.

Syntax is not sacrificed to make the "machine work efficiently." You could very easily write an extension on String to add Int based subscripts. The syntax does absolutely nothing to help the machine. It's syntactic salt. It's made to work as an eye-catcher for developers to say: "something suspicious is happening here"

Already addressed this, but again, it's simply not true.

Yes, it is marginal, that's only a constant-factor change in problem size. That's a completely different kind of optimization compared to one that changes O(1) to O(n)

Try telling the traveling salesman that.

This quite clearly shows a lack of understanding into how Array works.

11 Likes

You don’t see the difference between abstraction and implementation. You justify concepts by what it is internally. That’s a common syndrome of a programmer. Same issue like designing ui by what it does technically instead of how it interacts with human being.

By array I mean an ordered collection. Fixed sized elements are just implementation of Swift, many languages have arrays of varying sized elements. Conceptually they are the same, but the implementation is different.

Alright, look, this conversation got pretty heated an hour ago and hasn't cooled off, and there's a lot of defensiveness and aggression flying in all directions. I do not want to be in a position where I feel like I'm cutting off someone's ability to make a point, but I also think this thread is starting to run on pure heat. I am locking this until when-I-feel-like-it tomorrow.

22 Likes

If an abstraction in a programming language papers over big performance differences, it’s a bad abstraction. Performance matters, in a programming language it’s not an implementation detail.

4 Likes

The bottom line is that the philosophy of the core developers is substantially different than yours in this area. This has already been explained, and I don't think there's anything to add. You can add your own extension to String (I did), but you're not going to change the minds of those responsible for the standard library.

Agree to disagree, and let's all move on.

4 Likes
  • A string is a sequence of characters
  • Which are sequences of Unicode code points
  • Which are sequences of UTF-8 expansions

If you are familiar with C++ std::vector, its direct memory probably contains a length and a pointer to the elements, which are in the free store. Our Array works similarly. My description for a string would end up with a vector to vectors to vectors. That's a lot of pointers, lengths, and free-store blocks being thrown around in a logically single object. A massive memory savings would be to consolidate all the free-store blocks to a single block of all the lowest-level elements, and just have a pair of bounding pointers in the object's direct memory.

This consolidation has consequences, which are the causes of your concerns.

  • Although a vector is random-access, a vector of vectors can't be if inner vectors can be of arbitrary length. The only way around it would to have a random-access map to the bounds of each inner vector, which add a lot more to the total memory footprint.
  • Although a vector normally can have one of its elements splatted with a new value, it can't be done with a vector of vectors if a replacement inner vector doesn't implement the same length as the departing value. The only way around this would be to reallocate and copy each time, along with updating all the offsets in the inner vector bounds map.

These consequences are why String stops at BidirectionalCollection for traversal and only does RangeReplaceableCollection for mutability. A RandomAccessCollection needs to do its speciality in O(1) time, not O(n). And MutableCollection operations would essentially be RRC ops, which wouldn't be allowed due to indexing resets.

4 Likes

It is a data structure problem, with algorithmic trade-offs. A similar effect could be seen in Java (in java.util.collections) where there are multiple types of lists, multiple types of maps (dictionaries), and multiple kinds of sets. One set may have constant-time insertion, one might sort the entries, one might preserve insertion ordering, and one might be designed for concurrent usage.

For many languages, characters started off a being a byte and expanded later to support multiple single byte character sets, short and word width characters, variable-length codepoints, and now (with Swift) extended grapheme clusters. These all have their own trade-offs, with one being that once you have variable length codepoints or extended graphemes, you can't take a counter, add it to the starting pointer, and know you are pointing to the nth character (or even that you are pointing to the start of a code point vs the middle of one).

The default shipping String in Swift is a sequence extended grapheme cluster, which basically means it represents a single printable character, whitespace, or control code. An individual printable character in unicode has no fixed binary size, indeed a single character has no maximum binary length (outside practical limits which may be enforced to prevent abuse).

As others have mentioned, you can have string-like forms other than the String type:

  • A byte string in some byte-based encoding, such as latin1
  • An array of characters (where each character may be a pointer to a sequence if it is an extended grapheme cluster that exceeds a certain length)
  • A UTF-8 binary sequence that does not enforce correctness

There are benefits and trade-offs to each of these. All are relatively easily possible in Swift, but none are the String class - String has a design focused on handling characters for all users, whether they prefer English or Tagalog.

As per why Swift doesn't detect that you want to do offsetting and automatically switch - it is because a poor guess could have strong negative consequences. Better to give a function-rich default, and allow people to choose another option if they need it.

NSString (for what it is worth) is a cluster class, and may very well give different implementations based on how it is initialized. This can negatively affect performance, both in that string operations are harder to optimize and that the string operation may be a lot more expensive (such as performing encoding translation and/or making a full copy) than would be expected by the developer.

The reality is that nobody has come up with an ingenious way to do this yet such that it would be implementable in Swift as a default. You can either take a huge space hit, big performance hits on lookup, big performance hits in other parts of the algorithm like string mutations, limit your text to something fixed width (so not full Unicode), or make it based on some unsafe concept like codepoints or bytes rather than characters.

FWIW, my personal experience has been that when people strongly need numeric indexing into characters, they most often are working with characters as data rather than as text. For instance, HTML headers are all text but are limited to US-ASCII, which does have a strict one-byte-per-character limitation that means you can do integer offsets. There are proposals to make this sort of work easier, likely by being able to easily convert US-ASCII text to [UInt8] or Data types.

14 Likes

Data, yes. As I mentioned before, I come from web dev, where there are tons of textual data and you need to parse and slice them a lot. String is a string though and that’s the point. There is no „human text” data type we are talking about, but the old, good String, which is used for many things. I agree that random access on human text is less commonly needed, although not useless unlike some tried to prove here before, but for textual data it is the only way to conveniently parse it sometimes. And Swift has its use as a backend web language too, so it will be a little painful to parse some data structures, especially some sort of tokens, url parameters etc.

I’m not sure what you’re saying in this part.

Curiously enough I have done a fair amount of this kind of parsing as well (before the dawn of Server-size Swift ;-), though most of them were for pure experimentation. Usually what ends up happening is that I tokenize the string first, which only require one walk-through of the string, then work with the array of tokens I just created.
One could argue that we do it this way because of the lack of random access, but also that we don’t need random access in this specific use case because we have a better alternative as well.

2 Likes

Okay guys, I’m done with this topic. I see you tend to love everything that contradict my point in any way, so I’m outta here. But decide whether random access is not needed or there is no solution found yet, because some of you just looove two explanations contradicting themselves. And people saying it’s not needed should try doing something else than simple iOS tiny apps. You just didn’t expose yourself in situations where parsing by index is the only option. Bye, Felicia.

I should've left this closed.

9 Likes

I've talked to a few people individually, but some of this needs to be said publicly. I will not be re-opening this thread. That is not a judgment on Bear; it just seems to me that this thread has reached a point where it's become nothing but argument for its own sake.

Everyone in this community has a responsibility to treat one another with respect. That includes the way we say things to each other — rudeness is not acceptable — but it also includes the meaning of the things we say, no matter how politely we say them. It is a very tricky thing to question someone else's motivations and to tell them that they are wrong to want the things they want. Fundamentally, by doing so, we are questioning their judgment, and that is innately at least a little bit disrespectful. Now maybe their judgment needs to be questioned: sometimes, people don't understand what they're doing, or they don't understand that there are better alternatives available. If we couldn't "show disrespect" by asking any questions at all, we'd never be able to help people who are genuinely stuck. But often people do know their own problems better than we do, and it's we who are misunderstanding something vital, perhaps about their constraints or their priorities. To decline to recognize that possibility, or to call it irrelevant, is to repeatedly question their judgment, amplifying our disrespect.

Acknowledging that someone's concern is legitimate doesn't mean that we have to change the language to support them. Swift is not a pot of wishes; trade-offs have to be considered and decisions have to be made. But you cannot justify those decisions honestly without understanding their impact as best as you can, in concrete terms and not just in the abstract. And the steps necessary to do that — treating someone's concerns as legitimate until you're certain you understand that they're already well-addressed — are also the steps necessary for treating them with respect as a member of our community.

42 Likes