Removing CharacterSet characters from a string seems hard

FlorianPircher · April 13, 2024, 8:59pm

Whether a person stops using their native language and replaces it with English should be up to that person, not up to programmers who refuse to properly handle international strings. And as repeatedly stated above, English text makes use of Unicode, too. There is no path towards the future that will lead to an all-ASCII land.

bdkjones · April 13, 2024, 9:09pm

That’s a deliberate misconstruction of my opinion to attribute malice where none was intended.

I’m not suggesting Swift should be ASCII only. There is a difference between can do and should do—I’m suggesting that even though Swift can have non-ASCII text in variable/function names, that’s not something we should do. I am NOT suggesting removing the ability and turning that “should” into “MUST”. It’s developer discretion; not a dictate from Cupertino.

I’m also not suggesting we never need care about Unicode. There is a time and place where it matters.

taylorswift · April 13, 2024, 9:20pm

i think this is a great example of the issue i described earlier, because it is very hard to learn this method even exists. why? because it doesn’t show up anywhere in the documentation!

why doesn’t it show up in the documentation? because String.contains(_:) doesn’t actually exist; this method is actually called StringProtocol.contains(_:). and this method is not part of the StringProtocol interface either; it is an overlay vended by the _StringProcessing module that extends StringProtocol. (StringProtocol is indigenous to Swift.) as it has no direct relationship to String, or StringProtocol even, it doesn’t show up in the docs.

once you’ve internalized that this method exists and can remember its name, none of this really matters, everything will Just Work as the compiler will resolve all the API layers at compile time. but the circuitous way in which String actually inherits much of its functionality can be confusing even to seasoned developers.

tera · April 13, 2024, 9:55pm

Perl, Java, JavaScript, Python, Ruby, Scala, Haskell, and Clojure support Unicode identifiers, to name a few languages; there are more. Java has had it since its first version, Perl since around 2000.

tera · April 13, 2024, 10:37pm

I figured out how to do the previous tests properly in the languages I tried before.

Swift

Swift Code

let c = "A\u{0300}"
let d = "\u{00C0}"
let eq = c == d ? "equal" : "not equal"
print(c, eq, d)

let a = "a\u{0300}🏆💩🎬"
for c in a {
    print(c, terminator: ".")
}
print()
let thirdChar = a[a.index(a.startIndex, offsetBy: 2)]
print("thirdChar: \(thirdChar)")

Kotlin

In Kotlin I had to import two modules (one for normalisation and another for grapheme cluster breaking) and create a helper for the breaking. Note that, as in Swift, there is no integer subscript to get the third character (or I haven't figured it out yet), so I iterated over the clusters and recorded the third one along the way.

Kotlin Code

import java.text.BreakIterator
import java.text.Normalizer

fun String.graphemeClusterSequence() = sequence {
    val iterator = BreakIterator.getCharacterInstance()
    iterator.setText(this@graphemeClusterSequence)
    var start = iterator.first()
    var end = iterator.next()
    while (end != BreakIterator.DONE) {
        yield(this@graphemeClusterSequence.substring(start, end))
        start = end
        end = iterator.next()
    }
}

fun main() {
    val c = Normalizer.normalize("A\u0300", Normalizer.Form.NFC)
    val d = Normalizer.normalize("\u00C0", Normalizer.Form.NFC)
    val eq = if (c == d) "equal" else "not equal"
    println("$c $eq $d")

    val a = Normalizer.normalize("a\u0300🏆💩🎬", Normalizer.Form.NFC)
    var i = 0
    var thirdChar = ""
    for (ch in a.graphemeClusterSequence()) {
        print(ch)
        print(".")
        if (i == 2) { thirdChar = ch }
        i += 1
    }
    println()
    // a.graphemeClusterSequence()[2] // not available, see how I calculated thirdChar above
    print("thirdChar: $thirdChar")
}

Python

In Python I had to import an extra module; other than that it was similar to Swift and quite short. Similar to Kotlin, I had to normalise strings explicitly. Note that in Python we get the third character of a string with an integer subscript.

Python Code

import unicodedata

def main():
    c = unicodedata.normalize('NFC', "A\u0300")
    d = unicodedata.normalize('NFC', "\u00C0")
    eq = "equal" if c == d else "not equal"
    print(c, eq, d)
    a = unicodedata.normalize('NFC', "a\u0300🏆💩🎬")
    for ch in a:
        print(ch, end=".")
    print()
    print("thirdChar:", a[2])
main()
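One caveat about the integer subscript, as a minimal sketch: Python's str is indexed by code point, not by grapheme cluster, so a[2] lands on the expected emoji only because NFC composed "a" + U+0300 into the single code point U+00E0. Without the normalisation step the same subscript returns a different character. The variable names below are made up for illustration:

```python
import unicodedata

# Decomposed "à" (a + combining grave accent) followed by three emoji.
raw = "a\u0300\U0001F3C6\U0001F4A9\U0001F3AC"
nfc = unicodedata.normalize("NFC", raw)

# Code point counts differ because NFC merges a + U+0300 into U+00E0.
print(len(raw), len(nfc))

# Index 2 is the trophy in the raw string, but the third visible
# character (U+1F4A9) in the normalised one.
print(raw[2] == "\U0001F3C6")
print(nfc[2] == "\U0001F4A9")
```

For text whose clusters have no precomposed form (for example many emoji sequences), normalisation alone would not be enough and a grapheme-aware library would be needed.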

C#

Similar to Kotlin I had to call Normalize explicitly, and there is no integer subscript to get to the third character.

C# Code

using System;

public class Program {
    public static void Main() {
        string c = "A\u0300".Normalize(System.Text.NormalizationForm.FormC);
        string d = "\u00C0".Normalize(System.Text.NormalizationForm.FormC);
        string eq = c == d ? "equal" : "not equal";
        Console.WriteLine($"{c} {eq} {d}");

        string a = "a\u0300🏆💩🎬".Normalize(System.Text.NormalizationForm.FormC);
        var i = 0;
        string thrd = "";
        // EnumerateRunes walks Unicode scalar values, so this relies on the NFC step above
        foreach (var ch in a.EnumerateRunes()) {
            Console.Write(ch);
            Console.Write(".");
            if (i == 2) {
                thrd = $"{ch}";
            }
            i += 1;
        }
        Console.WriteLine();
        // char thirdChar = a.EnumerateRunes()[2]; // can't do that, see how I calculated thrd above
        Console.WriteLine($"thirdChar: {thrd}");
    }
}

Now in all languages the output is as expected:

À equal À
à.🏆.💩.🎬.
thirdChar: 💩

FlorianPircher · April 13, 2024, 10:52pm

One of the nice things with Swift strings is that you can skip the normalization step. If you write the string back to a file, the file will have changed only where you made edits to the string. With normalization, you might introduce a lot of undesired changes outside your edits as well.
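To illustrate the concern in a language that needs the explicit step, here is a minimal Python sketch. The sample text and variable names are made up for illustration. Decoding and re-encoding valid UTF-8 leaves the bytes as they were, whereas normalising to NFC rewrites a decomposed "é" on a line the edit never touched:

```python
import unicodedata

# Hypothetical file contents: "cafe" + combining acute accent (decomposed é),
# followed by the line we actually want to edit.
original = "cafe\u0301\nhello\n".encode("utf-8")

text = original.decode("utf-8")
edited = text.replace("hello", "world")  # the only intended edit

plain = edited.encode("utf-8")
normalized = unicodedata.normalize("NFC", edited).encode("utf-8")

# The first line occupies 6 bytes in the original (c, a, f, e, 0xCC, 0x81).
print(plain[:6] == original[:6])       # first line untouched without normalisation
print(normalized[:6] == original[:6])  # differs: é was recomposed to U+00E9
```

A diff of the normalised output against the original file would therefore show a change on the first line as well as the intended one.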

wadetregaskis · April 13, 2024, 11:36pm

Or maliciously, to misdirect what the program is actually doing.

Tangentially, the Swift compiler also has very weird opinions about what characters can appear in operators versus what can be in names (types, members, variables, etc). And it changes between Swift versions (e.g. in earlier versions the emoji set was seemingly arbitrarily split between operators and names).

It's not a big deal, but it'd certainly be nice if the compiler weren't so quirky about these things.

I struggle with this quite often, and [for me] it's usually because the necessary APIs (a) are missing and (b) don't support it (and converting between String and Data / [UInt8] is annoyingly difficult - if not sometimes impossible - to do efficiently).

(a) can be fixed by me re-inventing the missing wheels, which I've done to a limited extent thus far, but it's tedious and disappointing to do so.

Tangentially, I kinda wish Data didn't exist, and was at best just a typealias for [UInt8]. That's another source of a thousand cuts of API mismatches because most APIs support only one of the two (and there's no broad consensus on which). And both are used all over the place - but rarely compatibly - when dealing with a lot of string stuff (e.g. serdes).

tera · April 14, 2024, 1:23am

Does Swift String guarantee that no change will happen?

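import Foundation

// Hypothetical path and encoding, chosen only to make the sketch concrete.
let data = try Data(contentsOf: URL(fileURLWithPath: "input.txt"))
let fileEncoding = String.Encoding.utf8
guard let string = String(data: data, encoding: fileEncoding) else {
    fatalError("file is not valid in the assumed encoding")
}
// no changes to the string
let newData = string.data(using: fileEncoding)
precondition(data == newData)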

BTW, this is probably a toy text editor we are talking about. Serious plain text editors won't load the whole file into a string, methinks.

taylorswift · April 14, 2024, 2:02am

after doing a little more investigation, i believe (at least for String.contains) that there is no real API deficiency. rather, i believe some of the folks in this thread are experiencing a bug in lib/SymbolGraphGen, namely that extensions to protocols do not propagate synthetic members to conforming types unless the extension lives in the same module as the protocol.

Avi · April 14, 2024, 5:01am

Are you certain?

$שם = "אבי";
Unrecognized character \xA9; marked by <-- HERE after $?<-- HERE near column 3 at - line 1.

tera · April 14, 2024, 12:36pm

I am not sure about the other languages of the list, as I don't use them, could be an error in that list. Just tested these three (the second "café" uses a different "spelling"):

Kotlin:
    val café = 42     // ✅
    // val café = 24  // ✅ Conflicting declarations: val café: Int, val café: Int
Python:
    café = 42          # ✅
    café = 24          # ✅
    print(café, café)  # 24 24
C#:
    int café = 42;    // ✅
    // int café = 24; // ✅ A local variable or function named 'café' is already defined in this scope

and this is Swift:

Swift:
    let café = 42     // ✅
    let café = 24     // 🤔 oops

Notably:

unlike in Swift, they treated different spellings of é the same (correctly)
unlike in Swift, none of the three allowed emojis in identifiers

Edit: interestingly Go allows one "spelling" only:

func foo() {
	var café = 42     // ✅
}
func bar() {
	var café = 24     // 🛑 invalid character U+0301 '́' in identifier
}

No emojis here either.
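The Python result can be checked directly, since the Python language reference specifies that identifiers are converted to NFKC while parsing. A small sketch, assuming the two spellings are the precomposed U+00E9 and "e" followed by U+0301 (escapes are used so the difference is visible; the variable names are made up for illustration):

```python
import unicodedata

composed = "caf\u00e9"     # é as a single code point
decomposed = "cafe\u0301"  # e + combining acute accent

# Run two assignments, one per spelling, in a fresh namespace.
namespace = {}
exec(f"{composed} = 42\n{decomposed} = 24", namespace)

# Ignore the __builtins__ entry that exec adds.
names = [k for k in namespace if not k.startswith("__")]

print(names == [composed])  # both spellings ended up as one variable
print(namespace[composed])  # the second assignment overwrote the first
print(unicodedata.normalize("NFKC", decomposed) == composed)
```

This only shows what Python does; whether Kotlin and C# reach the same result through normalisation or some other rule is not something the sketch establishes.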

allevato · April 14, 2024, 1:57pm

I suspect that this should be easier to fix now—modulo concerns about source compatibility (but I hope nobody is depending on the current inequality of canonically equivalent identifiers)—given that the parser and other parts of the compiler are being reimplemented in Swift, because they'll have access to the Unicode functionality in the standard library. The C++ implementation would have had to take its own ICU dependency or some subset of it, but now we just need to make sure the right normalization APIs are available.

tera · April 14, 2024, 6:08pm

@bdkjones was talking about ASCII with regard to programming languages only (e.g. identifier names), and there's nothing wrong with that approach. The importance of non-ASCII identifiers in programming languages is overrated.