Removing CharacterSet characters from a string seems hard

tera · April 13, 2024, 10:37pm

I figured out how to run the previous tests properly in the languages I tried before.

Swift

Swift Code

let c = "A\u{0300}"
let d = "\u{00C0}"
let eq = c == d ? "equal" : "not equal"
print(c, eq, d)

let a = "a\u{0300}🏆💩🎬"
for ch in a {
    print(ch, terminator: ".")
}
print()
let thirdChar = a[a.index(a.startIndex, offsetBy: 2)]
print("thirdChar: \(thirdChar)")

Kotlin

In Kotlin I had to import two classes from java.text (Normalizer for normalisation and BreakIterator for grapheme cluster breaking) and write a helper for the breaking. Note that, as in Swift, there is no integer subscript to get the third character (or I haven't figured it out yet), so I iterated over the clusters and remembered the third one.

Kotlin Code

import java.text.BreakIterator
import java.text.Normalizer

fun String.graphemeClusterSequence() = sequence {
    val iterator = BreakIterator.getCharacterInstance()
    iterator.setText(this@graphemeClusterSequence)
    var start = iterator.first()
    var end = iterator.next()
    while (end != BreakIterator.DONE) {
        yield(this@graphemeClusterSequence.substring(start, end))
        start = end
        end = iterator.next()
    }
}

fun main() {
    val c = Normalizer.normalize("A\u0300", Normalizer.Form.NFC)
    val d = Normalizer.normalize("\u00C0", Normalizer.Form.NFC)
    val eq = if (c == d) "equal" else "not equal"
    println("$c $eq $d")

    val a = Normalizer.normalize("a\u0300🏆💩🎬", Normalizer.Form.NFC)
    var i = 0
    var thirdChar = ""
    for (ch in a.graphemeClusterSequence()) {
        print(ch)
        print(".")
        if (i == 2) { thirdChar = ch }
        i += 1
    }
    println()
    // a.graphemeClusterSequence()[2] // not available, see how I calculated thirdChar above
    print("thirdChar: $thirdChar")
}

Python

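In Python I had to import an extra module; other than that it was similar to Swift and quite short. As in Kotlin, I had to normalise strings explicitly. Note that in Python we can get the third character of a string with an integer subscript. The subscript counts code points, so it gives the third character here only because NFC folds the letter and U+0300 into a single code point.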

Python Code

import unicodedata

def main():
    c = unicodedata.normalize('NFC', "A\u0300")
    d = unicodedata.normalize('NFC', "\u00C0")
    eq = "equal" if c == d else "not equal"
    print(c, eq, d)
    a = unicodedata.normalize('NFC', "a\u0300🏆💩🎬")
    for ch in a:
        print(ch, end=".")
    print()
    print("thirdChar:", a[2])
main()
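
A side note on the integer subscript: it indexes code points, so it stops matching "characters" as soon as a combining sequence has no precomposed form. Below is a minimal sketch of that, assuming a base letter such as q, for which Unicode has no precomposed "q with grave". The `clusters` helper is hypothetical, written only for this illustration; it glues combining marks onto the preceding code point and is a rough approximation of real grapheme cluster breaking (it ignores ZWJ emoji sequences, flags and the like).

```python
import unicodedata

def clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified stand-in for grapheme cluster breaking (UAX #29),
    not a full implementation.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch   # attach the mark to the previous cluster
        else:
            out.append(ch)  # start a new cluster
    return out

# NFC cannot merge q + U+0300, so the string keeps 5 code points.
s = unicodedata.normalize('NFC', "q\u0300🏆💩🎬")
print(len(s))          # 5
print(s[2] == "🏆")    # True: the subscript counts code points
print(clusters(s)[2])  # 💩
```

For the test string in this post the two views agree, because NFC turns a + U+0300 into the single code point à.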

C#

As in Kotlin, I had to call Normalize explicitly, and there is no integer subscript that returns the third character: the string indexer returns UTF-16 code units, so I enumerated runes instead.

C# Code

using System;

public class Program {
    public static void Main() {
        // Normalize to NFC so the decomposed and precomposed forms compare equal.
        string c = "A\u0300".Normalize(System.Text.NormalizationForm.FormC);
        string d = "\u00C0".Normalize(System.Text.NormalizationForm.FormC);
        string eq = c == d ? "equal" : "not equal";
        Console.WriteLine($"{c} {eq} {d}");

        string a = "a\u0300🏆💩🎬".Normalize(System.Text.NormalizationForm.FormC);
        var i = 0;
        string thrd = "";
        // EnumerateRunes yields Unicode scalar values, not UTF-16 code units.
        foreach (var ch in a.EnumerateRunes()) {
            Console.Write(ch);
            Console.Write(".");
            if (i == 2) {
                thrd = $"{ch}";
            }
            i += 1;
        }
        Console.WriteLine();
        // char thirdChar = a.EnumerateRunes()[2]; // can't do that, see how I calculated thrd above
        Console.WriteLine($"thirdChar: {thrd}");
    }
}
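
To see why indexing the C# string directly would not give the third character, it helps to look at the test string as UTF-16 code units, which is how .NET strings are stored. The following is a small sketch in Python (not C#) that encodes the NFC form of the string to UTF-16 and inspects the code unit at index 2; the variable names are just for illustration.

```python
# NFC form of the test string: à (U+00E0) followed by three emoji.
a = "\u00E0🏆💩🎬"

# Encode to UTF-16 (little endian, no BOM): two bytes per code unit.
units = a.encode("utf-16-le")
print(len(a))           # 4 code points
print(len(units) // 2)  # 7 code units: 1 for à, 2 for each emoji

# The code unit at index 2 is the trailing surrogate of 🏆,
# not a complete character.
unit2 = int.from_bytes(units[4:6], "little")
print(hex(unit2))                 # 0xdfc6
print(0xDC00 <= unit2 <= 0xDFFF)  # True: it is a low surrogate
```

Each emoji here takes a surrogate pair, so counting runes (or grapheme clusters) is needed to reach 💩 as the third character.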

Now in all languages the output is as expected:

À equal À
à.🏆.💩.🎬.
thirdChar: 💩