Why are String offsets so complicated?

Avi · January 26, 2019, 6:16pm

The bottom line is that the philosophy of the core developers is substantially different than yours in this area. This has already been explained, and I don't think there's anything to add. You can add your own extension to String (I did), but you're not going to change the minds of those responsible for the standard library.
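For anyone wondering what such an extension might look like, here is a minimal sketch. The method name `character(atOffset:)` is made up for illustration and is not part of the standard library; it only wraps the existing `index(_:offsetBy:limitedBy:)` call, so every lookup still walks the string from the start.

```swift
extension String {
    /// Hypothetical helper, not a standard library API: returns the character
    /// at an integer offset, or nil when the offset is out of range.
    /// Each call walks from startIndex, so the cost is O(n), not O(1).
    func character(atOffset offset: Int) -> Character? {
        guard offset >= 0,
              let i = index(startIndex, offsetBy: offset, limitedBy: endIndex),
              i < endIndex else { return nil }
        return self[i]
    }
}

let greeting = "h\u{E9}llo"                      // "héllo"
let second = greeting.character(atOffset: 1)     // the "é"
let missing = greeting.character(atOffset: 99)   // nil, no trap
```

Calling this in a loop over every offset is quadratic, which is presumably part of why the standard library does not offer it by default.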

Agree to disagree, and let's all move on.

CTMacUser · January 26, 2019, 9:40pm

A string is a sequence of characters
Which are sequences of Unicode code points
Which are sequences of UTF-8 code units

If you are familiar with C++ std::vector, its direct memory probably contains a length and a pointer to the elements, which are in the free store. Our Array works similarly. My description of a string would end up as a vector of vectors of vectors. That's a lot of pointers, lengths, and free-store blocks being thrown around in a logically single object. A massive memory saving would be to consolidate all the free-store blocks into a single block of all the lowest-level elements, and just have a pair of bounding pointers in the object's direct memory.

This consolidation has consequences, which are the causes of your concerns.

Although a vector is random-access, a vector of vectors can't be if inner vectors can be of arbitrary length. The only way around it would be to have a random-access map to the bounds of each inner vector, which adds a lot more to the total memory footprint.
Although a vector normally can have one of its elements splatted with a new value, it can't be done with a vector of vectors if a replacement inner vector doesn't have the same length as the departing value. The only way around this would be to reallocate and copy each time, along with updating all the offsets in the inner vector bounds map.

These consequences are why String stops at BidirectionalCollection for traversal and only does RangeReplaceableCollection for mutability. A RandomAccessCollection needs to do its speciality in O(1) time, not O(n). And MutableCollection operations would essentially be RRC ops, which wouldn't be allowed due to indexing resets.
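A small sketch of what those two consequences look like from the caller's side, using only standard `String` APIs. The sample text is my own; the point is that reaching a position means walking to it, and replacing one character goes through `replaceSubrange` because the new value may need a different number of bytes.

```swift
// "café au lait", with the é written as "e" plus a combining acute accent.
var s = "cafe\u{301} au lait"
let bytesBefore = s.utf8.count

// There is no integer subscript: you walk to a position, which is O(n).
let i = s.index(s.startIndex, offsetBy: 3)
let accented = String(s[i])   // the whole "é" cluster

// Replacement is a RangeReplaceableCollection operation, since the new
// character occupies fewer bytes than the old one and the tail must shift.
s.replaceSubrange(i...i, with: "e")
let bytesAfter = s.utf8.count

// `i` was obtained before the mutation and should not be reused now.
print(accented.unicodeScalars.count, bytesBefore, bytesAfter)
```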

dwaite · January 27, 2019, 10:24am

It is a data structure problem, with algorithmic trade-offs. A similar effect can be seen in Java (in the java.util collections), where there are multiple types of lists, multiple types of maps (dictionaries), and multiple kinds of sets. One set may have constant-time insertion, one might sort the entries, one might preserve insertion ordering, and one might be designed for concurrent usage.

For many languages, characters started off as being a byte and expanded later to support multiple single-byte character sets, short- and word-width characters, variable-length code points, and now (with Swift) extended grapheme clusters. These all have their own trade-offs, with one being that once you have variable-length code points or extended graphemes, you can't take a counter, add it to the starting pointer, and know you are pointing to the nth character (or even that you are pointing to the start of a code point vs the middle of one).
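To make the "middle of a code point" problem concrete, here is a short sketch using the standard views on `String`. The flag is my own example: it is one character made of two code points, each of which takes four bytes in UTF-8.

```swift
// The Philippine flag: two regional indicator code points, one character.
let flag = "\u{1F1F5}\u{1F1ED}"

print(flag.count)                 // 1 character
print(flag.unicodeScalars.count)  // 2 code points
print(flag.utf16.count)           // 4 UTF-16 code units
print(flag.utf8.count)            // 8 UTF-8 bytes

// Adding 1 to the start of the byte buffer lands inside the first code point:
// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
let bytes = Array(flag.utf8)
let landsMidCodePoint = (bytes[1] & 0xC0) == 0x80
```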

The default shipping String in Swift is a sequence of extended grapheme clusters, each of which basically represents a single printable character, whitespace, or control code. An individual printable character in Unicode has no fixed binary size; indeed, a single character has no maximum binary length (outside practical limits which may be enforced to prevent abuse).
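A quick way to see the "no maximum length" point for yourself: keep appending combining marks to a base letter. This is just an illustrative sketch; the count of 50 is arbitrary.

```swift
// Start with a plain "e" and pile 50 combining acute accents on top of it.
var stacked = "e"
for _ in 0..<50 {
    stacked += "\u{301}"
}

print(stacked.count)                 // still a single Character
print(stacked.unicodeScalars.count)  // 51 code points
print(stacked.utf8.count)            // 101 bytes: 1 for "e", 2 per accent
```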

As others have mentioned, you can have string-like forms other than the String type:

A byte string in some byte-based encoding, such as latin1
An array of characters (where each character may be a pointer to a sequence if it is an extended grapheme cluster that exceeds a certain length)
A UTF-8 binary sequence that does not enforce correctness
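Roughly, the three forms in this list could be sketched in Swift as follows. This is only an illustration with a sample word of my choosing; the Latin-1 conversion here silently drops any code point above U+00FF, which a real implementation would want to handle explicitly.

```swift
let text = "na\u{EF}ve"   // "naïve" with a precomposed ï (U+00EF)

// 1. Byte string in Latin-1: one byte per code point, O(1) indexing,
//    but only U+0000...U+00FF can be represented (others are dropped here).
let latin1: [UInt8] = text.unicodeScalars.compactMap { UInt8(exactly: $0.value) }

// 2. Array of characters: O(1) indexing by character, at a higher
//    memory cost per element.
let characters: [Character] = Array(text)

// 3. Raw UTF-8 bytes: compact and O(1) by byte, but nothing stops you
//    from slicing or mutating it into an invalid sequence.
let utf8Bytes: [UInt8] = Array(text.utf8)

print(latin1.count, characters.count, utf8Bytes.count)
```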

There are benefits and trade-offs to each of these. All are relatively easy to build in Swift, but none are the String type - String has a design focused on handling characters for all users, whether they prefer English or Tagalog.

As for why Swift doesn't detect that you want to do offsetting and automatically switch - it is because a poor guess could have strong negative consequences. Better to give a function-rich default, and allow people to choose another option if they need it.

NSString (for what it is worth) is a class cluster, and may very well give different implementations based on how it is initialized. This can negatively affect performance, both in that string operations are harder to optimize and in that a string operation may be a lot more expensive (such as performing encoding translation and/or making a full copy) than the developer would expect.

The reality is that nobody has come up with an ingenious way to do this yet such that it would be implementable in Swift as a default. You can either take a huge space hit, big performance hits on lookup, big performance hits in other parts of the algorithm like string mutations, limit your text to something fixed width (so not full Unicode), or make it based on some unsafe concept like codepoints or bytes rather than characters.

FWIW, my personal experience has been that when people strongly need numeric indexing into characters, they most often are working with characters as data rather than as text. For instance, HTTP headers are all text but are limited to US-ASCII, which does have a strict one-byte-per-character limitation that means you can do integer offsets. There are proposals to make this sort of work easier, likely by being able to easily convert US-ASCII text to [UInt8] or Data types.
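Even without those proposals, something along these lines already works today. This is a sketch under the assumption that the input is pure US-ASCII; `splitHeader` is a hypothetical helper name, and the sample header line is only for illustration.

```swift
/// Hypothetical helper: splits an ASCII "Name: value" line using integer
/// offsets into its bytes. Returns nil for non-ASCII input or a missing colon.
func splitHeader(_ line: String) -> (name: String, value: String)? {
    let bytes = Array(line.utf8)   // [UInt8], integer-indexed in O(1)
    guard bytes.allSatisfy({ $0 < 0x80 }),
          let colon = bytes.firstIndex(of: UInt8(ascii: ":")) else { return nil }
    let name = String(decoding: bytes[..<colon], as: UTF8.self)
    // Skip the colon, then any spaces before the value.
    let rest = bytes[(colon + 1)...].drop(while: { $0 == UInt8(ascii: " ") })
    return (name, String(decoding: rest, as: UTF8.self))
}

let parsed = splitHeader("Content-Length: 348")
```

Because every ASCII character is exactly one byte, an offset into `bytes` is also a character offset, which is the property the integer-indexing use cases rely on.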

Bear · January 27, 2019, 8:44pm

Data, yes. As I mentioned before, I come from web dev, where there is tons of textual data and you need to parse and slice it a lot. String is a string though, and that’s the point. There is no "human text" data type we are talking about, but the good old String, which is used for many things. I agree that random access on human text is less commonly needed (though not useless, contrary to what some tried to prove here before), but for textual data it is sometimes the only convenient way to parse it. And Swift has its use as a backend web language too, so it will be a little painful to parse some data structures, especially some sorts of tokens, URL parameters, etc.

Lantua · January 27, 2019, 9:16pm

I’m not sure what you’re saying in this part.

Curiously enough I have done a fair amount of this kind of parsing as well (before the dawn of Server-side Swift ;-), though most of it was for pure experimentation. Usually what ends up happening is that I tokenize the string first, which only requires one walk-through of the string, then work with the array of tokens I just created.
One could argue that we do it this way because of the lack of random access, but also that we don’t need random access in this specific use case because we have a better alternative as well.
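As a rough sketch of that tokenize-first approach: the `Token` type and `tokenize` function below are made up for illustration, and the rules (runs of letters or digits form a token, anything else is a one-character symbol) are a simplifying assumption rather than any real grammar.

```swift
// Hypothetical token type for illustration.
enum Token: Equatable {
    case word(String), number(Int), symbol(Character)
}

/// Walks the string once, front to back, using only index(after:).
func tokenize(_ input: String) -> [Token] {
    var tokens: [Token] = []
    var i = input.startIndex
    while i < input.endIndex {
        let c = input[i]
        if c.isWhitespace {
            i = input.index(after: i)
        } else if c.isLetter || c.isNumber {
            let start = i
            while i < input.endIndex, input[i].isLetter || input[i].isNumber {
                i = input.index(after: i)
            }
            let text = String(input[start..<i])
            // Runs that parse as an integer become numbers, the rest words.
            tokens.append(Int(text).map(Token.number) ?? .word(text))
        } else {
            tokens.append(.symbol(c))
            i = input.index(after: i)
        }
    }
    return tokens
}

let tokens = tokenize("page=2&sort=asc")
let third = tokens[2]   // the array of tokens is random-access
```

The string itself is only traversed sequentially; any jumping around happens in the `[Token]` array afterwards.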

Bear · January 27, 2019, 9:21pm

Okay guys, I’m done with this topic. I see you tend to love everything that contradicts my point in any way, so I’m outta here. But decide whether random access is not needed or whether no solution has been found yet, because some of you just looove two explanations that contradict each other. And people saying it’s not needed should try doing something other than simple, tiny iOS apps. You just haven’t put yourself in situations where parsing by index is the only option. Bye, Felicia.

John_McCall · January 28, 2019, 5:27am

I should've left this closed.

John_McCall · January 28, 2019, 9:40am

I've talked to a few people individually, but some of this needs to be said publicly. I will not be re-opening this thread. That is not a judgment on Bear; it just seems to me that this thread has reached a point where it's become nothing but argument for its own sake.

Everyone in this community has a responsibility to treat one another with respect. That includes the way we say things to each other — rudeness is not acceptable — but it also includes the meaning of the things we say, no matter how politely we say them. It is a very tricky thing to question someone else's motivations and to tell them that they are wrong to want the things they want. Fundamentally, by doing so, we are questioning their judgment, and that is innately at least a little bit disrespectful. Now maybe their judgment needs to be questioned: sometimes, people don't understand what they're doing, or they don't understand that there are better alternatives available. If we couldn't "show disrespect" by asking any questions at all, we'd never be able to help people who are genuinely stuck. But often people do know their own problems better than we do, and it's we who are misunderstanding something vital, perhaps about their constraints or their priorities. To decline to recognize that possibility, or to call it irrelevant, is to repeatedly question their judgment, amplifying our disrespect.

Acknowledging that someone's concern is legitimate doesn't mean that we have to change the language to support them. Swift is not a pot of wishes; trade-offs have to be considered and decisions have to be made. But you cannot justify those decisions honestly without understanding their impact as best as you can, in concrete terms and not just in the abstract. And the steps necessary to do that — treating someone's concerns as legitimate until you're certain you understand that they're already well-addressed — are also the steps necessary for treating them with respect as a member of our community.