Removing CharacterSet characters from a string seems hard

I want to strip an arbitrary string down to something that can exist in a URL path component. It seems rather difficult to do this:

title = title.removing(charactersIn: .urlPathAllowed)

Instead, the best I’ve been able to do is

title = title.filter { !$0.unicodeScalars.contains(where: { !CharacterSet.urlPathAllowed.contains($0) }) }

I cribbed that from Stack Overflow, and I don’t even really know what it’s doing. But the comments on the answer tell me I’m not alone in my disbelief that this sort of thing is so unintuitive.
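For reference, here is a rough sketch of what that one-liner does, unrolled into named steps. The function name `keepingOnlyURLPathAllowed` is made up for illustration and is not a Foundation API; the logic is meant to be equivalent to the snippet above.

```swift
import Foundation

// Hypothetical helper: the same logic as the one-liner, spelled out.
func keepingOnlyURLPathAllowed(_ input: String) -> String {
    let allowed = CharacterSet.urlPathAllowed
    // String.filter visits each Character (an extended grapheme cluster).
    return input.filter { character in
        // A Character can be made of several Unicode scalars;
        // keep it only if every one of them is in the set.
        character.unicodeScalars.allSatisfy { allowed.contains($0) }
    }
}

print(keepingOnlyURLPathAllowed("Hello, world!"))
```

The double negation in the original (`!contains(where: { !... })`) is just another way of writing `allSatisfy`.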

tl;dr Removing Characters from String is easy, but that's not what you're doing.

That's because, despite the name, CharacterSet isn't a set of Characters. (The name was chosen before Swift even existed, and before people thought about Unicode correctness as much as they do today.)

The more correct name would be UnicodeScalarSet, because it contains UnicodeScalars, not Characters. If you work with scalars, it's easy, just a single method call:

import Foundation
var scalars = "Hello, world! zażółć gęślą jaźń".unicodeScalars
// Removes every scalar that IS in the set, so only the other scalars remain.
scalars.removeAll(where: CharacterSet.urlUserAllowed.contains)
print(scalars) //  żółć ęśą źń

If you want to work with Characters all the way through, Swift also makes that easy:

import Foundation
var string = "hello world zażółć gęślą jaźń"
let realCharacterSet: Set<Character> = Set("qwertyuiopasdfghjklzxcvbnm")
string.removeAll(where: realCharacterSet.contains)
print(string) //  żółć ęśą źń

The problem you have is that you want to use two different types, UnicodeScalar and Character, which forces you to convert back and forth. :(
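To make the difference concrete, here is a small sketch (the variable names are just for illustration) in which one Character is built from two scalars. Filtering at the scalar level and at the Character level then gives different results:

```swift
import Foundation

let allowed = CharacterSet.urlPathAllowed
// "e" followed by U+0301 COMBINING ACUTE ACCENT: two scalars, one Character.
let word = "cafe\u{301}"

// Scalar level: only the combining accent is dropped, leaving a bare "e".
let byScalar = String(word.unicodeScalars.filter { allowed.contains($0) })

// Character level: the whole accented "e" is dropped,
// because one of its scalars is not in the set.
let byCharacter = word.filter { $0.unicodeScalars.allSatisfy { allowed.contains($0) } }

print(byScalar)
print(byCharacter)
```

Which of the two you want depends on whether a half-stripped character is acceptable for your use case.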

Anyway, it's still possible to make it simpler than what Stack Overflow suggested:

import Foundation
var title = "Hello, world! zażółć gęślą jaźń"
title = String(title.unicodeScalars.filter({ !CharacterSet.urlUserAllowed.contains($0) }))
print(title) //  żółć ęśą źń

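The unicodeScalars property is mutable, so you can also do it in place (this version removes the disallowed scalars, which is what the original question asks for):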

title.unicodeScalars.removeAll(where: { !CharacterSet.urlUserAllowed.contains($0) })

I posted an improved answer to Stack Overflow too. Upvoting it would help Swift’s reputation by guiding people to the ergonomic solution instead of the extremely verbose one in the currently accepted answer. It’s too bad it sat in that condition for two and a half years.


That's definitely better, but it's still not what I would expect from a clean-sheet language design. Perhaps it should never have been renamed CharacterSet and should have stayed NSCharacterSet, to go with NSString. I upvoted your answer, though.

And yes, obviously I can create an extension to clean this up in the rest of my code, but this is the kind of thing people do over and over again. The nice thing about [NS]CharacterSet is the set of in-built sets for common things (like .urlPathAllowed).
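Such an extension might look roughly like this. Both method names are hypothetical (neither exists in Foundation or the standard library), and they operate on Unicode scalars rather than Characters:

```swift
import Foundation

extension String {
    // Hypothetical helpers, not part of Foundation or the standard library.

    /// Returns a copy with every Unicode scalar that is in `set` removed.
    func removing(charactersIn set: CharacterSet) -> String {
        var scalars = unicodeScalars
        scalars.removeAll(where: { set.contains($0) })
        return String(scalars)
    }

    /// Returns a copy containing only the Unicode scalars that are in `set`.
    func keeping(charactersIn set: CharacterSet) -> String {
        var scalars = unicodeScalars
        scalars.removeAll(where: { !set.contains($0) })
        return String(scalars)
    }
}

let cleaned = "Hello, world!".keeping(charactersIn: .urlPathAllowed)
print(cleaned)
```

Note that for the URL use case it is `keeping(charactersIn: .urlPathAllowed)` that does the job; `removing(charactersIn: .urlPathAllowed)` would throw away exactly the characters that are allowed.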

The struct wrapper does need a different name to distinguish it from the class, but I would definitely have voted for UnicodeScalarSet instead if it had undergone the evolution process.

One other thing this demonstrates is a need for an in‐place version of filter. It would be even more succinct to write something like this:

title.unicodeScalars.keepOnly(where: CharacterSet.urlUserAllowed.contains)

(And yes, I know the method pair should have been filter and filtered, but it’s probably too late to fix that.)
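Until something like that exists, it can be approximated with a small extension. This is only a sketch: `keepOnly(where:)` is a hypothetical name, not a standard library method, and it simply forwards to `removeAll(where:)` with the predicate negated.

```swift
import Foundation

extension RangeReplaceableCollection {
    // Hypothetical method; the standard library has no `keepOnly(where:)`.
    /// Removes, in place, every element that does not satisfy `predicate`.
    mutating func keepOnly(where predicate: (Element) throws -> Bool) rethrows {
        try removeAll(where: { try !predicate($0) })
    }
}

var title = "Hello, world!"
title.unicodeScalars.keepOnly(where: CharacterSet.urlUserAllowed.contains)
print(title)
```

Because it is declared on RangeReplaceableCollection, the same method also works on Array, String, and other mutable collections.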

Or maybe replac[e|ing]Occurrences(of:with:).

You say you want to “strip” the string, which is typically a synonym for “trim” (i.e. removing from both ends, but not from the middle). None of the suggestions here, using removeAll or filter, will do that - instead, you’d need something like trim from the algorithms package.
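A quick sketch of the difference, using Foundation's `trimmingCharacters(in:)` in place of the algorithms package's trim so that the example has no external dependency (the sample string is made up):

```swift
import Foundation

// Everything that is NOT allowed in a URL path.
let disallowed = CharacterSet.urlPathAllowed.inverted
let raw = "  my title, draft 2  "

// Trimming only touches the two ends of the string.
let trimmed = raw.trimmingCharacters(in: disallowed)

// Removing all occurrences also drops the interior spaces.
let removed = String(raw.unicodeScalars.filter { !disallowed.contains($0) })

print(trimmed)
print(removed)
```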

But actually, it seems strange to me that you’d want to remove all characters which aren’t allowed in a URL’s path (what about if the component really does contain a disallowed character?). Are you sure you wouldn’t rather percent-encode the path component?
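For comparison, percent-encoding with Foundation looks roughly like this (the sample string is made up):

```swift
import Foundation

// "\u{E9}" is a precomposed "é", written as an escape to keep the bytes unambiguous.
let component = "my title \u{E9}"

// The result is optional, so the failure case has to be handled.
let encoded = component.addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)

print(encoded ?? "encoding failed")
```

One caveat: `.urlPathAllowed` includes "/", so a slash inside the component is left as-is rather than encoded.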

I want to strip all the disallowed characters from the entirety of the string. I don't think "strip" implies that. And yes, I'm sure I want to remove them and not encode them.

Well, it does, but okay - at least it’s clear that you mean to remove all occurrences.
