Removing CharacterSet characters from a string seems hard

Karl · April 11, 2024, 6:22pm

I think this is a bit too strongly worded - It’s Not Wrong that "🤦🏼‍♂️".length == 7. There are multiple levels of interpreting a string's contents.

Originally, String didn't even conform to Collection (so it didn't have a .count member), and instead had a .characters view. That was considered too cumbersome and so characters was made the "default" view. Other languages might decide to use a different default view, but I wouldn't say that they "got it wrong" for deciding differently from Swift in that respect.
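To make the "multiple levels" concrete, here is a minimal sketch that counts the same emoji through each of String's views. The variable names are illustrative, and the emoji is spelled out as scalars only so that its parts are visible. The Character count assumes a reasonably recent Swift toolchain with up-to-date grapheme breaking rules.

```swift
// The facepalm emoji from the post, written scalar by scalar:
// FACE PALM + skin tone modifier + ZERO WIDTH JOINER + MALE SIGN + variation selector.
let facepalm = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}"

let characterCount = facepalm.count               // extended grapheme clusters (the default view)
let scalarCount = facepalm.unicodeScalars.count   // Unicode code points
let utf16Count = facepalm.utf16.count             // UTF-16 code units (what JavaScript's .length reports)
let utf8Count = facepalm.utf8.count               // UTF-8 bytes

print(characterCount, scalarCount, utf16Count, utf8Count)  // 1 5 7 17
```

None of the four numbers is wrong; each answers a different question about the same text.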

But on the broader point, you are correct. Unicode is difficult, it has evolved over time and gained additional complexity, and developers and programming languages in general have struggled to manage that. In my opinion, Swift has done an excellent job at presenting an API which effectively manages that inherent complexity.

Many (most?) things in programming look simple on the surface but have a surprising amount of complexity once you look deeper. Strings are one such example. Floating-point numbers are another. For all of the complaints about the hidden complexities of strings, you could direct very similar complaints at floats:

print(0.1 + 0.1 + 0.1)
// 0.30000000000000004
// developer: "floats are flat-out unpleasant to use"

Software engineering is all about this kind of thing - you don't just hand-wavingly tell a computer what to do and it does it; you have a set of tools, they have different characteristics, and you have to understand what their relative strengths are and when to use (or not use) each tool.

If you're not prepared to do that, to gain that understanding and appreciate those differences, you're going to find software engineering a struggle. Regardless of which language you use.

bdkjones · April 11, 2024, 6:35pm

Being able to handle complexity when required and being forced to deal with complexity when none exists are two different things.

String surfaces complexity at all times. And to use it effectively, you need a deep knowledge of Unicode as well as ancillary types like Substring, String.Index and more. If you don’t have that expertise (or it’s been a while since you used it, so you’ve forgotten the particulars), you end up googling how to do the most mundane manipulations. That’s unpleasant.

That perception harms Swift in the long-run because developers, like all living creatures, flock to the path of least resistance.

Dmitriy_Ignatyev · April 11, 2024, 7:14pm

This complexity comes from Unicode, which has to support many scripts and languages; it is not a problem of Swift or String. To do things right, you need to dive into Unicode anyway, no matter what language is used.
In many other languages string processing is even harder. In Swift we at least have reasonable, well-defined behaviour that is fully compatible with Unicode. I remember a lot of bugs in previous Obj-C, C and Python codebases caused by an incorrect understanding of the nature of strings.
It is reasonable to say that simple things can be done much more easily with Python and are harder with Swift.String. But as soon as something more difficult than the plain English alphabet needs to be processed, Swift.String wins in terms of correctness and predictability of the result. Thus, Swift.String has a higher entry threshold. It's a compromise between simplicity and correctness.
My observation is that when people meet String.Index for the first time they are confused, in the same way they are confused when first dealing with optionals, a strict type system, inheritance, generics, concurrency etc. String.Index is not as simple as Int-based indices, but it encourages people to ask questions and figure out why it works that way, which points them in the right direction. Once you understand the idea, everything becomes much easier.

Karl · April 11, 2024, 7:27pm

I don't think it does; I think String is quite a simple API to use. If you know how to work with Collections, most things can be done at that level, without needing a deep understanding of the complexities of Unicode. It also has excellent Regex support for pattern matching, including a very nice DSL.
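As a rough illustration of working at the Collection level, the sketch below removes unwanted characters from a string without touching String.Index. The sample text, the `unwanted` set and the variable names are made up for the example; it uses only standard library calls (`filter`, `removeAll(where:)` and the `Character` properties).

```swift
let input = "Hello, world! (Really?)"

// String is a Collection of Character, so `filter` applies directly
// and returns a String.
let withoutPunctuation = input.filter { !$0.isPunctuation }

// Removing an arbitrary set of characters works the same way.
let unwanted: Set<Character> = ["l", "o"]
let withoutUnwanted = input.filter { !unwanted.contains($0) }

// In-place variant, from RangeReplaceableCollection.
var mutable = input
mutable.removeAll(where: { $0.isWhitespace })

print(withoutPunctuation)  // Hello world Really
print(withoutUnwanted)     // He, wrd! (Reay?)
print(mutable)             // Hello,world!(Really?)
```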

If your argument is that people can forget how to use it, well that applies to just about everything. Personally, I'm able to write Python and JavaScript, but I don't use them very often these days, so on the rare occasions that I do, I often "end up googling how to do the most mundane manipulations". But I don't blame Python or JavaScript for that or call them inherently bad or unpleasant.

bdkjones · April 11, 2024, 7:36pm

The conclusion that “developers are motivated to dig in and understand the complexity” doesn’t fit the long, continuous decline in Swift’s popularity:

Certainly this decline can’t be laid at the feet of String alone, but the general theme is what I’m harping on: the more Swift forces developers to jump through hoops, the more developers are going to flee to other languages. And at some point, Swift loses the “critical mass” necessary to be relevant—the features available on Apple platforms won’t be gatekept by Apple, they’ll be gatekept by the third-party frameworks that choose if and when to adopt them. That. Is. Bad.

If you totally discount “ease of use for developers” as a non-goal and insist that the only thing that matters is semantic correctness, you risk driving developers to other languages that DO care about ease-of-use.

I love Swift. I enjoy writing it. But that graph worries me and I get the feeling that Swift is turning a blind eye to it.

bdkjones · April 11, 2024, 7:40pm

This is accurate. If your deployment target allows, the new Regex stuff introduced in 5.7 is very nice. It’s well done.

itaiferber · April 11, 2024, 7:49pm

The trouble with human text is that the complexity is always there. Unless you're working with purely-ASCII strings, which don't even always cover the full breadth of day-to-day English writing, there is inherent complexity in the representation of text, and dealing with it.

All string libraries need to make trade-offs in dealing with this inherent complexity. Some make certain common operations simpler and more terse, at the cost of utterly incorrect behavior in less simple cases. Very often, this translates to "this operation is correct in a subset of English, and broken in many scripts/languages you'll never bother testing".

Swift decided on a different tradeoff, preferring to prioritize that correctness for the benefit of a global population used to dealing with utterly broken string handling on a regular basis.
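A small sketch of what that tradeoff looks like in practice, using string reversal as the example operation (my choice of example, not one from the thread). Reversing by Character keeps a combining accent attached to its base letter, while reversing code point by code point, roughly what a naive implementation over a code-point-indexed string would do, detaches it.

```swift
// "café" written with a combining acute accent: "e" followed by U+0301.
let cafe = "cafe\u{301}"

// Reversing by Character keeps the accent on the "e".
let reversedByCharacter = String(cafe.reversed())

// Reversing by Unicode scalar moves the accent to the front of the string,
// where it no longer has a base letter to attach to.
var reversedByScalar = ""
for scalar in cafe.unicodeScalars.reversed() {
    reversedByScalar.unicodeScalars.append(scalar)
}

print(reversedByCharacter == "\u{E9}fac")                          // true
print(Array(reversedByScalar.unicodeScalars)[0].value == 0x301)    // true
```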

The trouble is that developers who come from other languages tend to be used to working with strings the "simple" way (e.g., indexing into strings), and often reach for more complex tools without knowing.

If you're used to indexing into strings with integers, String.Index is a horribly verbose and complex mire - agreed 100%. The thing is: if you reach for String.Index, you're likely holding it wrong (only in that you're reaching for a more complicated solution), and you have no way of knowing.

Swift has really powerful Sequence and Collection algorithms that make working with strings so much nicer:

// Let's extract 'world' by offsets.
let s = "Hello, world!"

// Agreed: this is awful to write and awful to read
let startIdx = s.index(s.startIndex, offsetBy: 7)
let endIdx = s.index(startIdx, offsetBy: 5)
print(s[startIdx ..< endIdx]) // world

// Wouldn't you rather:
print(s.dropFirst(7).prefix(5)) // world
// ^gets you much closer to "Hello, world!"[7:12] in Python, while maintaining correctness

String.Index has to exist and be exposed in order for these algorithms to be possible to use, but that exposure means that people search for "swift string index" and fall down the rabbit-hole of a complex API in places where generic Sequence and Collection algorithms are a huge improvement in readability (and writability).

It would be great to try to figure out ways to make these operations easier to discover by default, and have the Swift compiler be able to better help surface these in some form or fashion (along with more-discoverable documentation). But we don't have to throw out the baby with the bathwater.

QuinceyMorris · April 11, 2024, 7:57pm

This is, in effect, misinformation.

For NSString, a "character" is a Unicode [UTF-16] code unit — not even a code point, but a code unit. It's trivially indexable via an integer offset.

You don't even have to go to NSString for that. Swift can give you a String as an array of code points, or as an array of UTF-32, UTF-16 or UTF-8 code units.

However, Swift Characters aren't any of those things. They're extended grapheme clusters, which are amenable to convenient access or performant access, but not really (so far) both at the same time.
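For illustration, a minimal sketch of getting integer-indexable arrays out of a String. Copying a view into an Array costs one O(n) pass, after which integer subscripting is O(1). The same trick applies to Characters when repeated random access is really needed. The names below are illustrative.

```swift
let word = "cafe\u{301}"   // "café" with a combining accent

// Code units and code points, each copied once into an Array.
let utf8Units = Array(word.utf8)            // [UInt8]
let utf16Units = Array(word.utf16)          // [UInt16]
let scalars = Array(word.unicodeScalars)    // [Unicode.Scalar]

// Extended grapheme clusters, also copied once.
let characters = Array(word)                // [Character]

print(utf8Units.count, utf16Units.count, scalars.count, characters.count)  // 6 5 5 4
print(scalars[4].value == 0x301)   // true: the combining accent is its own code point
print(characters[3])               // é: the accent stays with its base letter
```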

Karl · April 11, 2024, 8:01pm

Before you get too concerned about Swift's position on the TIOBE index, it's probably worth examining the methodology of how that index is calculated.

Basically, it's the number of results when querying +"<language> programming" on various popular search engines. That's all it is.

Would you judge anything based on that kind of metric? Google has more results for "New York" than it does for "London", so is that some kind of conclusive evidence that New York is a nicer place to live than London? Of course not.

Even the index's own website says it has nothing to do with the quality of languages:

Popular web sites Google, Amazon, Wikipedia, Bing and more than 20 others are used to calculate the ratings. It is important to note that the TIOBE index is not about the best programming language or the language in which most lines of code have been written.

They purport that search engine results may be used as an indicator of popularity, but in today's world of AI-generated SEO-optimised garbage, I would contest that assertion.

bdkjones · April 11, 2024, 8:07pm

Do you have an alternative approach that would be more valid? Without one, I think Tiobe (while certainly not perfect) is a decent canary in the coal mine.

Is there evidence that Swift is gaining in popularity against alternatives? We might argue about the magnitude of the decline, but I think it’s dangerous to turn a blind eye and say, “It’s fine. Everything is fine.”

I see a very large threat from the cross-platform tools and I think Swift would be wise to meet that threat head-on.

John_McCall · April 11, 2024, 9:39pm

I think it'd be best to keep the conversation technically focused.

tera · April 11, 2024, 9:55pm

I trust the PYPL index more (but probably that's just because it gives a higher score for Swift).

I appreciate that (and that's a great article linked BTW). However, in practical terms, if I as a user expect the third character of the "à🏆💩🎬" string to be a poop and it is not, what am I doing wrong in those other languages, and how do I get the poop right if not by doing this?

val c = s[2]

Should it be something like val c = s[from:to], where I get the exact values for the from and to indices from some higher-level library?

val (from, to) = s.getIndicesOfExtendedGraphemeCluster(at: 2)

(which, BTW, started to sound even more complicated than it is in Swift).
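For comparison, a sketch of the Swift side of this lookup, along with what integer index 2 points at in the string's UTF-16 code units (the representation that UTF-16 based string types index into). The emoji are written as scalar escapes only so the example survives copy and paste; the variable names are illustrative.

```swift
// The string from the question: "à", trophy, pile of poo, clapper board.
let s = "\u{E0}\u{1F3C6}\u{1F4A9}\u{1F3AC}"

// Third Character, counting grapheme clusters: an O(n) walk from the start.
let third = s.dropFirst(2).first

// The same lookup spelled with String.Index.
let viaIndex = s[s.index(s.startIndex, offsetBy: 2)]

// UTF-16 code unit 2 is the second half of the trophy's surrogate pair,
// not a whole character.
let utf16Unit = Array(s.utf16)[2]

print(third == Character("\u{1F4A9}"))   // true
print(viaIndex == third)                 // true
print(String(utf16Unit, radix: 16))      // dfc6
```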

Consider a simpler example (which obviously works right in Swift).

Kotlin:

Python:

C-sharp:

Am I doing something wrong in those languages? Or is it considered a tolerable behaviour by the communities of those other languages?

tera · April 12, 2024, 1:33am

In this case hiding it under the carpet would make it cleaner:

extension String {
    subscript<T: RangeExpression>(bounds: T) -> Substring where T.Bound == Int {
        let range = bounds.relative(to: 0 ..< .max)
        return dropFirst(range.lowerBound).prefix(range.upperBound - range.lowerBound)
    }
}

// usage:
string[7..<12] // Swift
string[7 : 12] // Python

Playing devil's advocate though... Whether this is a longer s.dropFirst(7).prefix(5) form or a shorter string[7..<12] form this opens a door for a potential abuse:

precondition(string.count > 1)
for i in 0 ..< string.count - 1 {
    let char = string[i ..< i + 1]
    // or equally:
    // let char = string.dropFirst(i).prefix(1)
}

resulting in a quadratic time complexity...

wadetregaskis · April 12, 2024, 3:55am

itaiferber:

// Let's extract 'world' by offsets.
let s = "Hello, world!"

// Agreed: this is awful to write and awful to read
let startIdx = s.index(s.startIndex, offsetBy: 7)
let endIdx = s.index(startIdx, offsetBy: 5)
print(s[startIdx ..< endIdx]) // world

// Wouldn't you rather:
print(s.dropFirst(7).prefix(5)) // world
// ^gets you much closer to "Hello, world!"[7:12] in Python, while maintaining correctness

It might just be that I'm stuck in my ways, but I find it continuously difficult to adopt this 'style' - even though I'm well aware that it often performs not merely as good as the 'straightforward' version but sometimes better, in Swift.

I think it's because it's fundamentally unintuitive from a performance perspective - dropFirst, prefix et al. return intermediary data structures (formally), which is way more complicated, conceptually, than simple subscripting & indexing. It also relies on the compiler optimising out effectively all that formality to generate code that actually does just the indexing, essentially.

It's both impressive to me that the compiler can do this relatively reliably, and frustrating that by doing so it encourages such roundabout designs.

In the same way that I don't think it's ideal to force e.g. functional vs imperative programming styles, I'm not thrilled with things like this in Swift which seem to be forcing - because of performance (and readability) problems - what should be merely a stylistic choice.

sjavora · April 12, 2024, 7:46am

itaiferber:

// Let's extract 'world' by offsets.
let s = "Hello, world!"

// Agreed: this is awful to write and awful to read
let startIdx = s.index(s.startIndex, offsetBy: 7)
let endIdx = s.index(startIdx, offsetBy: 5)
print(s[startIdx ..< endIdx]) // world

// Wouldn't you rather:
print(s.dropFirst(7).prefix(5)) // world
// ^gets you much closer to "Hello, world!"[7:12] in Python, while maintaining correctness

At the risk of being reductive, it seems like the above is saying "no, you can't use Ints as indices, but you can use Ints in these methods that look a lot like indices, but actually aren't."

Why is one ok while the other is not? Is it just that the collection methods operate on characters? If that's the right thing to do there, why isn't it the same for indices...?

I don't do a lot of String processing in my code, so this doesn't come up often, so sorry if I'm missing something.

FlorianPircher · April 12, 2024, 8:58am

I think the assumption is that code like dropFirst(n) or prefix(n) is not that surprising to run in O(n) whereas collection[n] people generally expect to run in O(1).

Consider if the code was used inside a loop. When you see something like dropFirst(someCount) inside a loop, it feels like something that one might want to move outside the loop if possible. On the other hand, code like collection[someIndex] is frequently used inside of loops (especially in other programming languages), so it does not look out of place.

Signaling runtime performance using different syntax is not perfect, but it is a good way for an API to communicate a general sense of how fast code will run (subscript access? near-instant. property access? probably very fast. function call? may take some time).

itaiferber · April 12, 2024, 11:14am

+1 to everything Florian said. From their past experience in other contexts, many developers tend to expect subscripting with integers specifically to be an O(1) operation, leading to code like

for i in 0 ..< string.count { // O(n)
    let c = string[i] // O(n)
    // ...
}
// ^oops, this loop is accidentally O(n²)

// as opposed to
for c in string {
    // ...
}

// or if you need `i`,
for (i, c) in string.enumerated() {
    // ...
}

This is a slightly more difficult mistake to make with a function call. Like Florian says, this isn't perfect, but setting expectations is important where possible.

Which is exactly why I think sweeping it under the rug like @tera mentions, while cleaner in certain circumstances, is an anti-goal of these APIs. Instead, in fact, promoting higher level APIs tends to lead to more performant, more correct, and often more readable code.

If we find a way to make it significantly easier to reach for these higher-level APIs—despite being a different approach than what developers might normally reach for—I think we'll find fewer folks dissatisfied with String processing.
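As one hedged example of what reaching for a higher-level API can look like: a task that tempts people into index arithmetic is comparing each character with its neighbour. The sketch below does that in a single O(n) pass with `zip` and `dropFirst`, with no integer offsets and no String.Index. The task and names are made up for illustration.

```swift
let text = "bookkeeper"

// Pair each Character with its successor: (b,o), (o,o), (o,k), ...
let adjacentPairs = zip(text, text.dropFirst())

// Count positions where a Character is immediately repeated.
let doubledCount = adjacentPairs.filter { pair in pair.0 == pair.1 }.count

print(doubledCount)  // 3 ("oo", "kk", "ee")
```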

tera · April 12, 2024, 11:42am

++1

Maybe there's a middle ground?

for i in 0 ..< string.count - 1 {
    let char = string[slow: i ..< i + 1]
}

with the intention that "slow" would have users think twice before they do that in a loop.

PS. re the below comment, "slow" is a bike shed name. Could well be "slowToUseInLoops", "clipTo", "extractSubRange", etc. The subscript to get a single character could be in a form: string[scanTo: index]

FlorianPircher · April 12, 2024, 11:51am

“Slow” is somewhat of a misnomer because the implementation would not be purposefully slow; the implementation would still be the fastest possible for the given parameters. It just happens to be slow in context compared to other ways of approaching the same task.

A method like string.fetchCharacter(atOffset: 5) would be a more accurate name, but also not great.

The benefit of dropFirst and friends is that they also work with arrays and many other collections. That way, you can transfer knowledge from one specific type to another without having to memorize slightly different names for essentially the same operations across the language.
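A minimal sketch of that transferability: one generic helper, written once against Collection, that works for both String and Array. The function name `slice(_:from:length:)` is hypothetical, invented for this example rather than taken from the standard library.

```swift
// Hypothetical helper: takes `length` elements starting at `offset`
// from any Collection, clamping at the end instead of trapping.
func slice<C: Collection>(_ collection: C, from offset: Int, length: Int) -> C.SubSequence {
    return collection.dropFirst(offset).prefix(length)
}

let fromString = slice("Hello, world!", from: 7, length: 5)      // Substring
let fromArray = slice([10, 20, 30, 40, 50], from: 1, length: 3)  // ArraySlice<Int>

print(fromString)        // world
print(Array(fromArray))  // [20, 30, 40]
```

The cost differs by type: dropFirst is O(1) on an Array but O(n) on a String, since a String has to walk its characters to find the offset.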

wadetregaskis · April 12, 2024, 3:06pm

Of course, in time a lot of these presumptions about people's presumptions will likely become incorrect. That strings are actually just byte arrays is a hangover from earlier languages that will [hopefully] become increasingly archaic and bizarre to new generations of programmers (and the newer languages they focus on).

So, one could argue that it would merely be future-facing for Swift to simplify the String API and allow intuitive, clean code (like simple subscripting with integer ranges).

In any case, it might be better to have tools provide guidance rather than deliberately make APIs uglier and harder to read. People still write inefficient code around String (and Collections in general) even despite the current API. You can cover all bases by having the compiler (or similar) tell them about that. e.g.:

for i in 0 ..< string.count {
    let c = string[i]
    ...
}

Enumerating characters in a String via subscripting is very slow.

FixIt: replace with direct enumeration:
for c in string {
  …
}
FixIt: replace with indexed enumeration:
for (i, c) in string.enumerated() {
  …
}

It doesn't have to be perfect - even just a couple of occurrences of this, to any given person, is likely enough to change their patterns. And going to be much more effective, in that regard, than random forum or blog posts that most Swift programmers will never, ever see.