Removing CharacterSet characters from a string seems hard

xwu · April 13, 2024, 2:12pm

Ah, not so: If your haystack is arbitrary Unicode text, then you do have to care about Unicode very much even if your needle is an ASCII string. For example, you must decide if "@imporṭ" (note the combining dot below) contains a match for your search query. You can decide that it does count by dropping down to the level of UTF8 code units, or you can decide that it doesn't count by sticking to extended grapheme clusters.
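
To make the two choices concrete, here is a minimal sketch. The haystack and needle are hypothetical values chosen for illustration; the point is only that the same prefix test gives different answers at the Character level and at the UTF-8 code unit level.

```swift
// Hypothetical haystack: "@import" followed by U+0323 COMBINING DOT BELOW.
let haystack = "@import\u{0323} Foo"
let needle = "@import"

// Extended grapheme cluster level: "t" + U+0323 forms a single Character,
// which is not equal to the plain "t" that ends the needle.
let matchesByCharacter = Array(haystack).starts(with: Array(needle))

// UTF-8 code unit level: the needle's bytes are a prefix of the haystack's
// bytes, so this counts as a match.
let matchesByUTF8 = haystack.utf8.starts(with: needle.utf8)

print(matchesByCharacter, matchesByUTF8) // prints "false true"
```

Neither answer is wrong; which one you want depends on whether the combining mark should be treated as part of the keyword.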

tera · April 13, 2024, 2:43pm

And whether to treat these two valid UTF-8 sequences the same way or not:

E1B9AD     // `LATIN SMALL LETTER T WITH DOT BELOW`
74CCA3     // t + `COMBINING DOT BELOW`

Both sequences visually look like ṭ
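
A small sketch of what this looks like from Swift, using the two byte sequences listed above. The constant names are made up for the example.

```swift
let precomposed = "\u{1E6D}" // LATIN SMALL LETTER T WITH DOT BELOW
let decomposed = "t\u{0323}" // t + COMBINING DOT BELOW

// String equality is based on canonical equivalence, so these are equal...
let equalAsStrings = (precomposed == decomposed)

// ...even though the underlying UTF-8 code units differ.
let precomposedBytes = Array(precomposed.utf8) // E1 B9 AD
let decomposedBytes = Array(decomposed.utf8)   // 74 CC A3

print(equalAsStrings, precomposedBytes == decomposedBytes) // prints "true false"
```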

Here's a more convoluted example that has five different "spellings" for the same character: Ṩ

Try searching for one version – you’ll find all versions
Try searching for S, Ṡ, Ṣ – there won’t be a match
(that's if you are using a unicode aware text editor, like macOS TextEdit)
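
The same behaviour can be sketched in Swift. The five spellings below are my assumption of what the five versions are (the post does not list them explicitly); they are the canonically equivalent ways of writing S with a dot below and a dot above.

```swift
// Assumed spellings of Ṩ (not listed in the post itself).
let spellings = [
    "\u{1E68}",          // precomposed S WITH DOT BELOW AND DOT ABOVE
    "\u{1E62}\u{0307}",  // S WITH DOT BELOW + COMBINING DOT ABOVE
    "\u{1E60}\u{0323}",  // S WITH DOT ABOVE + COMBINING DOT BELOW
    "S\u{0323}\u{0307}", // S + dot below + dot above
    "S\u{0307}\u{0323}", // S + dot above + dot below
]

// Every spelling is a single Character and compares equal to every other one.
let allEqual = spellings.allSatisfy { $0 == spellings[0] && $0.count == 1 }

// The "partial" characters S, Ṡ and Ṣ are equal to none of them.
let partials = ["S", "\u{1E60}", "\u{1E62}"]
let anyPartialMatch = partials.contains { partial in
    spellings.contains { $0 == partial }
}

print(allEqual, anyPartialMatch) // prints "true false"
```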

bdkjones · April 13, 2024, 5:51pm

Nope, I don’t. Because the haystack is a file that’s meant to be compiled and no variation in the needle will be tolerated. (Just like I can’t write ǐñț in a Swift file instead of the ASCII Int—I mean, you can do that, but it’s junk input.) And because I’m merely scanning files to build a dependency graph between them, I’m not worried about the @imporț that literally no one is going to write, ever.

tera · April 13, 2024, 6:12pm

FWIW, this is valid Swift:

@propertyWrapper struct imporṭ {
    var wrappedValue: Int
}
struct S {
    @imporṭ var foo: Int
}

You'd probably need to skip @import in comments and "strings", but that's easy. #if's could be more problematic.
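
A rough sketch of the "skip comments and strings" part, working on raw UTF-8 bytes. `importOffsets` is a hypothetical helper, not anything from the thread, and it assumes `//` line comments and double-quoted strings with backslash escapes only. Block comments, multi-line strings and #if are deliberately not handled.

```swift
/// Hypothetical helper: byte offsets where the ASCII `needle` occurs outside
/// of `//` line comments and double-quoted string literals.
func importOffsets(of needle: String, in source: String) -> [Int] {
    let bytes = Array(source.utf8)
    let pattern = Array(needle.utf8)
    guard !pattern.isEmpty else { return [] }

    let quote = UInt8(ascii: "\""), slash = UInt8(ascii: "/")
    let backslash = UInt8(ascii: "\\"), newline = UInt8(ascii: "\n")

    var offsets: [Int] = []
    var inString = false
    var i = 0
    while i < bytes.count {
        let byte = bytes[i]
        if inString {
            if byte == backslash { i += 2; continue } // skip the escaped byte
            if byte == quote { inString = false }
            i += 1
        } else if byte == quote {
            inString = true
            i += 1
        } else if byte == slash, i + 1 < bytes.count, bytes[i + 1] == slash {
            // Line comment: skip ahead to the end of the line.
            while i < bytes.count, bytes[i] != newline { i += 1 }
        } else if bytes[i...].starts(with: pattern) {
            offsets.append(i)
            i += pattern.count
        } else {
            i += 1
        }
    }
    return offsets
}

let sample = "// @import a\nlet s = \"@import b\"\n@import c\n"
print(importOffsets(of: "@import", in: sample).count) // prints "1"
```

Because this compares bytes, it would also report a match for "@import" followed by a combining mark, which is exactly the trade-off being debated above.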

bdkjones · April 13, 2024, 6:17pm

The correct comparison would be strucț. That’s disallowed. You can technically write that sequence of characters in a Swift file, but nobody is ever going to do it (outside of a string literal or, if you despise your coworkers, a variable name.)

tera · April 13, 2024, 6:25pm

No bug above, all according to the rules. This could be called a bug:

let imporṭ = 1
let imporṭ = 2
// no error

Although Xcode treats the two strings as the same (when you search, for example), the compiler treats them as different because it uses the same kind of "shortcuts" you are advocating for... (There are also other technical reasons why it can't use proper Unicode rules.)

bdkjones · April 13, 2024, 6:31pm

The compiler isn’t wrong. It simply made the valid assumption that it’s not worth the complexity and performance regression to worry about edge cases that will never occur in real usage. That’s all I’ve done as well.

And that assumption has proven wise in practice, because outside of this academic forum post, I’d wager exactly zero times has a Swift variable been declared with two different code point combinations within the same file.

tera · April 13, 2024, 7:00pm

I wouldn't be so sure, it could happen by mistake!

typealias Composante = String

protocol Pourvoyeur {
    func prémunir() -> Composante
}
extension Pourvoyeur {
    // default implementation
    func prémunir() -> Composante { "default" }
}
struct MonFournisseur: Pourvoyeur {
    func prémunir() -> Composante { "pourquoi ça ne s'appelle pas ?" }
}

var fournisseur = MonFournisseur()
print(fournisseur.prémunir()) // default
print(fournisseur.prémunir()) // "pourquoi ça ne s'appelle pas ?"

The "prémunir" could be copied from somewhere, or typed with the keyboard, or I could type the "é" in it via Unicode Hex Input (type e, then hold Option and type 0 3 0 1).
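
One way a tool could catch this kind of mismatch is to normalize identifiers before comparing them. The sketch below assumes Foundation is available and uses its `precomposedStringWithCanonicalMapping` property (NFC); the constant names are made up for the example, and this is not a claim about how the compiler or Xcode works internally.

```swift
import Foundation

let typedPrecomposed = "pr\u{00E9}munir" // é as one scalar (U+00E9)
let typedDecomposed = "pre\u{0301}munir" // e + COMBINING ACUTE ACCENT

// Compared scalar by scalar, with no normalization, the names differ.
let sameScalars = typedPrecomposed.unicodeScalars
    .elementsEqual(typedDecomposed.unicodeScalars)

// After NFC normalization both spellings have identical scalars.
let nfcA = typedPrecomposed.precomposedStringWithCanonicalMapping
let nfcB = typedDecomposed.precomposedStringWithCanonicalMapping
let sameAfterNFC = nfcA.unicodeScalars.elementsEqual(nfcB.unicodeScalars)

print(sameScalars, sameAfterNFC) // prints "false true"
```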

FWIW, if we had a "conformance" (bike-shed name) keyword, we would be alerted to such a mistake:

struct MonFournisseur: Pourvoyeur {
    conformance func prémunir() -> Composante {...} // 🛑 "prémunir" not found
}

bdkjones · April 13, 2024, 7:20pm

This won’t win me any friends, but I think it’s a bad idea to leave the ASCII range for programming keywords. And I think there’s an unwritten convention as such, because we very rarely see any non-ASCII characters used for variable/function names in any major libraries in any language. (NB: I’m sure there are exceptions that prove the rule!)

There are zero ambiguities in ASCII. There are countless ambiguities in extended Unicode. The fewer opportunities for ambiguity that you have in your code, the more resilient and robust that code is. So, yep, I wouldn’t use diacritics in token names.

The Brits would love an NSColour, though, I’m sure.

itaiferber · April 13, 2024, 7:24pm

Out of curiosity, but serious question: it sounds like you want a type that's just a bag-of-bytes, which is not what String is; why not use a bag-of-bytes type like Data? String has human-oriented semantics applied to it, and that's never going to change, but if you want a type that has no semantics, there are others you can reach for, and you'll be much less frustrated.

(This isn't to be pithy; serious suggestion, and one I would make to anyone in the same type of situation.)
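
As a rough illustration of the bag-of-bytes route, here is a sketch using Foundation's `Data` and its `range(of:)` search. The file contents are invented for the example; in practice the bytes would come from reading the file.

```swift
import Foundation

// Hypothetical file contents, held as raw bytes rather than as a String.
let fileBytes = Data("body { }\n@import \"theme\";\n".utf8)
let needleBytes = Data("@import".utf8)

// A plain byte-for-byte search: no grapheme breaking, no normalization.
let found = fileBytes.range(of: needleBytes)

if let found {
    print("match at bytes \(found.lowerBound)..<\(found.upperBound)")
}
```

Integer offsets into `Data` are constant-time, which sidesteps the String.Index friction, at the cost of giving up any Unicode awareness.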

bdkjones · April 13, 2024, 7:31pm

No. Someone just asked for an example of how I deal with Strings in other languages and I explained one approach I used in pure C where Unicode was irrelevant.

Someone above had a really good point: the String docs are fairly bad and do a very poor job of explaining why things are as they are and what the correct way to accomplish fairly routine manipulations is.

That, combined with compiler error messages that might as well be Egyptian hieroglyphics, makes it very frustrating to use String.

I liked the suggestion of allowing the “stupid” approach that results in quadratic time complexity but having the compiler show a warning that offers the better way. THAT would be useful. Because a human reaches for the “natural” (but naive) approach and then the tooling provides a nudge to the more performant path.

tera · April 13, 2024, 7:37pm

I am with you if I were creating a programming language of my own. Swift is what it is though.

Exceptions don't prove rules... they prove the rules are wrong!

Avi · April 13, 2024, 7:38pm

Is that because it's an inherently bad idea, or because before languages like Swift, it wasn't even possible?

bdkjones · April 13, 2024, 7:42pm

It’s an inherently bad idea. With ASCII, what you see is what you get. e is e is e.

If you start using characters outside ASCII, é could be X or it could be X+Y, where X and Y are code points. That’s just asking for trouble, especially if you’re writing code that is going to be consumed by others.

Swift has been around for a decade. There are zero APIs that use extended Unicode names, even though it’s been possible for that entire time.

vns · April 13, 2024, 7:44pm

+1. When parsing files like this, I would expect reading them as String to bring undesired overhead in processing time and complexity if Unicode support is not needed. An array of bytes, or a pointer to them, should be sufficient here, eliminating all the friction with String and the capabilities it carries.

bdkjones · April 13, 2024, 7:47pm

An analogy. Why doesn’t Swift allow this?

if foo == true
    return 42

return 100

Because it’s a really bad idea and an easy source of bugs. So Swift requires braces. So it is with Unicode names.

Avi · April 13, 2024, 7:55pm

let שם = "אבי"

print(שם)

Xcode and Discourse's poor formatting aside, this is valid Swift and no Hebrew speaker would ever be confused, nor does Hebrew suffer from the combining character problem with lookalikes.

bdkjones · April 13, 2024, 8:13pm

We’re getting into very academic territory. Best estimate according to Babel is that there are about 9 million Hebrew speakers in the world. A tiny percentage of those are programmers and a tiny percentage of that tiny percentage use Swift.

The code sample you posted is completely impenetrable to anyone outside of that tiny sliver of a tiny sliver. Which is why it isn’t done in practice.

Avi · April 13, 2024, 8:19pm

Your opinions are very anglocentric. Swift's support for unicode tokens is precisely to combat this view and to allow speakers of other languages to work with tokens that are meaningful to them without having to transliterate or translate to what is to them a foreign language.

The ability to do this is very new and Swift itself is still a very niche language. Without insight into private codebases that are not shared outside of a company or monolingual community, it is rather impossible to do anything but speculate as to how widely the capability is used.

bdkjones · April 13, 2024, 8:33pm

Accurate. And, again, this won’t win me any friends, but I think one of the best side effects of technology’s rise has been an informal standardization on a single lingua franca.

It is very convenient for me that, because the tech boom originated in the United States, the standard language happens to be English. I am grateful that I don’t have to know Mandarin to write apps. And I can appreciate the extra difficulties involved on the opposite side, where a Mandarin speaker must learn English.

But, on the largest time scale, I think humanity is better off when we are less siloed. When there is a standard language. It unites rather than divides. And programming, like aviation (all ATC facilities and pilots worldwide are required to be English-proficient), is a step in the correct direction.