Removing CharacterSet characters from a string seems hard

I want to strip an arbitrary string down to something that can exist in a URL path component. It seems rather difficult to do this:

title = title.removing(charactersIn: .urlPathAllowed)

Instead, the best I’ve been able to do is

title = title.filter { !$0.unicodeScalars.contains(where: { !CharacterSet.urlPathAllowed.contains($0) }) }

I cribbed that from Stack Overflow, and I don’t even really know what it’s doing. But the comments on the answer tell me I’m not alone in my disbelief that this sort of thing is so unintuitive.
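For anyone equally unsure what that one-liner does, here is a minimal sketch that unpacks it into named steps. The helper name isAllowedInPath and the sample title are made up for illustration; the logic is meant to be equivalent to the filter above, not a recommended API.

```swift
import Foundation

// Hypothetical helper: true when every Unicode scalar that makes up
// the Character is a member of CharacterSet.urlPathAllowed.
func isAllowedInPath(_ character: Character) -> Bool {
    character.unicodeScalars.allSatisfy { CharacterSet.urlPathAllowed.contains($0) }
}

var sampleTitle = "Hello world? #1"
// String.filter visits each Character and keeps those for which the closure returns true.
sampleTitle = sampleTitle.filter(isAllowedInPath)
print(sampleTitle) // Helloworld1
```

The double negation in the original (not contains where not allowed) says the same thing as allSatisfy: keep a Character only if none of its scalars is disallowed.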

3 Likes

tl;dr Removing Characters from String is easy, but that's not what you're doing.


That's because, despite the name, CharacterSet isn't a set of Characters. (The name was chosen before Swift even existed, and before people thought about Unicode correctness as much as they do today.)

A more accurate name would be UnicodeScalarSet, because it contains UnicodeScalars, not Characters. If you work with scalars, it's easy, just a single method call:

import Foundation
var scalars = "Hello, world! zażółć gęślą jaźń".unicodeScalars
// Removes every scalar that IS in the set, so only the disallowed ones remain.
scalars.removeAll(where: CharacterSet.urlUserAllowed.contains)
print(scalars) //  żółć ęśą źń

If you want to work with Characters all the way through, Swift also makes that easy:

import Foundation
var string = "hello world zażółć gęślą jaźń"
let realCharacterSet: Set<Character> = Set("qwertyuiopasdfghjklzxcvbnm")
string.removeAll(where: realCharacterSet.contains)
print(string) //  żółć ęśą źń

The problem you have is that you want to use two different types, UnicodeScalar and Character, which forces you to convert back and forth. :(
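To see why the two types don't line up, here is a small sketch. It assumes the letter é is written in decomposed form (e followed by the combining accent U+0301), which is a single Character made of two scalars; the sample phrase is made up.

```swift
import Foundation

let phrase = "cafe\u{0301} au lait" // "café au lait", with é = e + U+0301

// One Character, two Unicode scalars.
let accented: Character = "e\u{0301}"
print(accented.unicodeScalars.count) // 2

// Scalar-level filtering keeps the "e" and drops only the combining accent.
let byScalar = String(phrase.unicodeScalars.filter { CharacterSet.urlPathAllowed.contains($0) })
print(byScalar) // cafeaulait

// Character-level filtering drops the whole "é", because one of its scalars is disallowed.
let byCharacter: String = phrase.filter { character in
    character.unicodeScalars.allSatisfy { CharacterSet.urlPathAllowed.contains($0) }
}
print(byCharacter) // cafaulait
```

Which result is "right" depends on the use case; the point is only that the two levels can disagree.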

Anyway, it's still possible to make it simpler than what Stack Overflow suggested:

import Foundation
var title = "Hello, world! zażółć gęślą jaźń"
// Keeps the scalars that are NOT in the set; drop the `!` to keep only the allowed ones.
title = String(title.unicodeScalars.filter({ !CharacterSet.urlUserAllowed.contains($0) }))
print(title) //  żółć ęśą źń
13 Likes

The unicodeScalars property is mutable:

title.unicodeScalars.removeAll(where: { !CharacterSet.urlUserAllowed.contains($0) })
5 Likes

I posted an improved answer to Stack Overflow too. Upvoting it would help Swift’s reputation by guiding people to the ergonomic solution instead of the extremely verbose one in the currently accepted answer. It’s too bad it sat in that condition for two and a half years.

2 Likes

That's definitely better, but it's still not what I would expect from a clean-sheet language design. Perhaps that should've never been renamed CharacterSet and stayed as NSCharacterSet to go with NSString. I upvoted your answer, though.

And yes, obviously I can create an extension to clean this up in the rest of my code, but this is the kind of thing people do over and over again. The nice thing about [NS]CharacterSet is the set of in-built sets for common things (like .urlPathAllowed).

3 Likes

The struct wrapper does need a different name to distinguish it from the class, but I would definitely have voted for UnicodeScalarSet instead if it had undergone the evolution process.


One other thing this demonstrates is a need for an in‐place version of filter. It would be even more succinct to write something like this:

title.unicodeScalars.keepOnly(where: CharacterSet.urlUserAllowed.contains)

(And yes, I know the method pair should have been filter and filtered, but it’s probably too late to fix that.)
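A minimal sketch of how such a method could be written today as an extension. keepOnly(where:) is the hypothetical name from the post above and is not part of the standard library; this version simply forwards to removeAll(where:) with the predicate inverted.

```swift
import Foundation

extension RangeReplaceableCollection {
    // Hypothetical in-place counterpart of filter: keeps only the
    // elements for which the predicate returns true.
    mutating func keepOnly(where shouldKeep: (Element) throws -> Bool) rethrows {
        try removeAll(where: { try !shouldKeep($0) })
    }
}

var title = "Hello world? #1"
title.unicodeScalars.keepOnly(where: CharacterSet.urlUserAllowed.contains)
print(title) // Helloworld1
```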

1 Like

Or maybe replac[e|ing]Occurrences(of:with:).

You say you want to “strip” the string, which is typically a synonym for “trim” (i.e. removing from both ends, but not from the middle). None of the suggestions here, using removeAll or filter, will do that - instead, you’d need something like trim from the algorithms package.
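The difference between the two readings can be sketched with Foundation's trimmingCharacters(in:) on one side and removeAll(where:) on the other. The sample string is made up, and whitespace stands in for "disallowed characters".

```swift
import Foundation

let padded = "  hello  world  "

// Trimming only touches the two ends; the inner spaces survive.
let trimmed = padded.trimmingCharacters(in: .whitespaces)
print(trimmed) // hello  world

// Removing all occurrences also affects the middle of the string.
var removedEverywhere = padded
removedEverywhere.unicodeScalars.removeAll(where: CharacterSet.whitespaces.contains)
print(removedEverywhere) // helloworld
```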

But actually, it seems strange to me that you’d want to remove all characters which aren’t allowed in a URL’s path (what about if the component really does contain a disallowed character?). Are you sure you wouldn’t rather percent-encode the path component?
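For comparison, percent-encoding a path component might look like the following sketch. It uses Foundation's addingPercentEncoding(withAllowedCharacters:), which returns an optional; the sample component is made up.

```swift
import Foundation

let component = "hello world? #1"

// Every character outside .urlPathAllowed is replaced by its %XX escape
// instead of being dropped.
let encoded = component.addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)
print(encoded ?? "nil") // hello%20world%3F%20%231
```

Unlike removal, this keeps the original text recoverable via removingPercentEncoding.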

I want to strip all the disallowed characters from the entirety of the string. I don't think "strip" implies that. And yes, I'm sure I want to remove them and not encode them.

Well, it does, but okay - at least it’s clear that you mean to remove all occurrences.

@JetForMe, I agree with you above. Actually, String has func trimmingCharacters(in: CharacterSet) -> String, which removes characters at both ends (despite the UnicodeScalar pseudo-problem). Why can't we have something like func removingCharacters(in: CharacterSet) -> String? Of course, everyone will just write an extension.
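Such an extension could be sketched as follows. removingCharacters(in:) is the hypothetical signature proposed above, not an existing Foundation method, and this version works at the Unicode scalar level, so it shares the caveats about combining characters discussed earlier in the thread.

```swift
import Foundation

extension String {
    // Hypothetical helper: returns a copy with every Unicode scalar
    // that is a member of `set` removed.
    func removingCharacters(in set: CharacterSet) -> String {
        var scalars = unicodeScalars
        scalars.removeAll(where: set.contains)
        return String(scalars)
    }
}

// To strip everything that is NOT allowed in a URL path, pass the inverted set.
let cleaned = "Hello world? #1".removingCharacters(in: CharacterSet.urlPathAllowed.inverted)
print(cleaned) // Helloworld1
```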

We do have exceptions, though, e.g. UICellConfigurationState being a struct in Swift and a class in Obj-C.

Swift has a (well-deserved) reputation for godawfully-painful String manipulation. And the reason that reputation exists is neatly encapsulated in this thread: the "natural" way to do something to a String is rarely the correct way to do it.

In short, Swift makes HUMANS bend to fit the language instead of having the language bend to fit how humans work.

Human: "The third character in this string is a poop emoji."
Swift: "AcTuAlLy, these are Unicode multibyte characters and you can't just count them like that. You have to use String.Index."
Human: "I'm looking right at it, man. The third character is a pile of poop."

And this has resulted in a decade+ of developers Googling, "Ugh, how do I do [entirely trivial thing] to a String in Swift, again?" after trying the "natural" way and running into a completely opaque compiler gibberish error. (Hence how I ended up on this thread!)

String is flat-out unpleasant to use if you have to do any manipulations beyond concatenation. I wish Apple would take a second crack at a more natural, intuitive implementation of it.

6 Likes

That feedback made me smile, and not in a bad way!


Unfortunately, Unicode is complex and requires complicated and (at first) non-obvious APIs to deal with. And Swift, being relatively the newest of the tribe, is far ahead of the game in regard to correctness in Unicode handling. Have a look at this fragment:

let s = "a\u{0300}🏆💩🎬"
let c = s[s.index(s.startIndex, offsetBy: 2)] // third character (offset 2, counting from zero)
print("string: \(s) length: \(s.count) thirdChar: \(c)")

It outputs:

string: à🏆💩🎬 length: 4 thirdChar: 💩

which is what you would expect (as a human). The string is indeed 4 characters long and the third one is indeed a pile of poo.
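The same string can also be measured through its other views, which is one plausible reason different tools report different lengths. A small sketch using only standard-library properties:

```swift
let s = "a\u{0300}🏆💩🎬"

print(s.count)                // 4  Characters (grapheme clusters)
print(s.unicodeScalars.count) // 5  Unicode scalars: a, U+0300, and three emoji
print(s.utf16.count)          // 8  UTF-16 code units: 1 + 1 + 2 + 2 + 2
print(s.utf8.count)           // 15 UTF-8 bytes: 1 + 2 + 4 + 4 + 4
```

An API that counts UTF-16 code units or code points would therefore report 8 or 5 for this string, not 4.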

Competitors? They all got it wrong, on both "length" and "poo":


[Screenshots of the Kotlin, Python, and C# results were attached here.]


Yes, you could have this helper in a toy project:

// DON'T DO THAT IN PRODUCTION GRADE CODE
extension String {
    subscript(_ i: Int) -> Character {
        self[index(startIndex, offsetBy: i)]
    }
}

to get nicer call sites and write this:

let c = s[2]

instead of:

let c = s[s.index(s.startIndex, offsetBy: 2)]

but Swift doesn't want to encourage that approach because once we have it we would be thoughtlessly writing:

for i in 0 ..< string.count {
    string[i]
}

which would result in a quadratic time complexity (and probably even worse).


Edit: for the benefit of future readers: this is the version of the code that gives correct results in all the languages mentioned.

15 Likes

Hmm, I'm not sure that argument is strong enough. If the author decides to write code to iterate over the characters, and finds they have to write

for i in 0 ..< string.count {
    let c = string[string.index(string.startIndex, offsetBy: i)]
}

they're still going to do that. It will be seen as an irritating need to work with String.Index, and it doesn't indicate anything about the runtime complexity. Presumably .forEach and .enumerated() are more performant? The docs don’t say. (As a side note, .underestimatedCount has a note: “Complexity: O(1) if the collection conforms to RandomAccessCollection; otherwise, O(n), where n is the length of the collection,” which raises the question: what is the collection type of a String?)

What’s the “fast” way to iterate over the characters of a string that lets you abandon the iteration if some condition is met?

Other encumbrances of String include working with substrings. The various functions for working with substrings return Substring or String.SubSequence (not sure what the difference is), and they are not Strings, so you can't use them where a String is expected. Sure, you can construct a new String from them, but it's more friction.
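On the Substring question, a short sketch may help. String.SubSequence is a typealias for Substring, so the two names refer to the same type. The function shout below is a hypothetical example of writing an API against StringProtocol so that it accepts both String and Substring without a conversion.

```swift
let greeting = "Hello, world"

// prefix(_:) returns a Substring, i.e. String.SubSequence.
let head: Substring = greeting.prefix(5)
print(String.SubSequence.self == Substring.self) // true

// Hypothetical helper: generic over StringProtocol, so both types are accepted.
func shout<S: StringProtocol>(_ text: S) -> String {
    text.uppercased()
}

let loudWhole = shout(greeting) // String argument
let loudHead = shout(head)      // Substring argument, no conversion needed
print(loudWhole, loudHead)      // HELLO, WORLD HELLO

// When a String is strictly required, the copy is explicit.
let owned = String(head)
print(owned) // Hello
```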

I get that there were compromises and tradeoffs when designing String, and it has improved since the early days, but I certainly wouldn’t call it intuitive or obvious.

1 Like
for c in string {
    if predicate(c) {
        break
    }
    // do work
}

I enjoy string processing so very much in Swift. Swift is one of very few languages that give you an API that matches the complexity that the diversity of languages and thus Unicode requires. Almost every other programming language makes it easy to handle English text and painfully difficult to do the right thing for the rest of the world.

12 Likes

The extra work required should prompt them to look for alternatives, e.g. for c in string, or starting at startIndex and advancing the index until it reaches endIndex †.

String is not RandomAccessCollection. You can't do string[1000] in O(1) time.
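A minimal sketch of the index-walking approach, with an early exit. The sample string and the stop condition are made up for illustration.

```swift
let string = "a\u{0300}🏆💩🎬"

// Walk from startIndex to endIndex. Each step uses index(after:),
// so one full pass over the string is linear.
var index = string.startIndex
var seen: [Character] = []
while index != string.endIndex {
    let c = string[index]
    if c == "🎬" { break } // abandon the iteration once a condition is met
    seen.append(c)
    index = string.index(after: index)
}
print(seen.count) // 3

// By contrast, index(_:offsetBy:) has to walk from the given index every time,
// which is what makes calling it inside a loop quadratic.
let third = string[string.index(string.startIndex, offsetBy: 2)]
print(third) // 💩
```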

See the † above.

Again, this is deliberate friction: it makes you pause and think, "Do I really need a String here?" (since creating one takes extra time and space). More often than not, a Substring will be enough for the immediate task at hand.

Nobody that could be attributed to the "Swift Team" is saying that in this thread, and this isn't a terribly productive way of making the argument for improvements.

5 Likes

Being easy to use is not the sole or primary criterion for whether an API is good or not. English might be widely used, but there are billions of people not using it. And they suffer from software not handling their names, messages, notes, etc. correctly. The right thing to do is to prioritize user experience over developer experience.

Swift is full of friction nudging or forcing you to do the right thing. You must initialize all values. You must unwrap values if they can be nil before accessing them. There are countless others.

The Substring type offers immense benefits for performance and memory usage. It exists for a reason, not just to add needless friction.

That is insulting, reductive, and exaggerated. Handling strings in a Unicode-conformant way is not as easy as processing ASCII, but you do not need to study it for years before you can be successful at it. The Swift documentation for strings is a great starting point and covers many common use cases. You can read it in a few hours and learn more nuances as needed, like any domain knowledge in programming.

8 Likes

Look, here’s the reality: for any given String question on Stack Overflow, there are at least two answers saying, “Just cast it to NSString and then it’s an easy one-liner.” You can defend String until you’re blue in the face, but the large chorus of developers isn’t wrong: it’s a painful API that’s unintuitive and frustrating.

It’s certainly not the only wart in Swift, but it’s the one that pops up most often because String is such a fundamental part of programming.

Big Picture

Thinking more broadly, Swift’s existential threat is cross-platform development: React Native (JavaScript), Flutter (Dart), and all the frameworks that promise cross-platform binaries. If Swift is perceived as difficult and pedantic (it is), that’s just more of a reason for developers to choose competing languages. Swift has been falling down the TIOBE index for a while and it remains a fairly niche language in the grand scheme of things.

If you enjoy native apps and all their advantages, you should support making Swift THE language that all developers love to use. The one they can’t get enough of. Otherwise, Swift becomes irrelevant and all your apps will be JavaScript monstrosities. So, yea, Developer ease-of-use REALLY does matter.

1 Like