Correct me if I'm wrong, but it sounds like what you're proposing is to normalize every input string and match by scalar by default. This model actually works pretty well until you come across a language whose Characters (graphemes) are composed of more than one scalar. Notoriously this is the case for emoji, but Indic sequences follow this rule as well (and there may be other cases!). Consider the following:
// This is a single grapheme composed of 5 scalars.
let string1 = "क़्त"
// This is a single grapheme composed of 7 scalars.
let string2 = "👨‍👩‍👧‍👦"
// Prints: 1
print(string1.count)
// Prints: 1
print(string2.count)
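// For reference, the scalar counts behind those single graphemes.
// (This is just the stock unicodeScalars view, nothing from the pitch.)
// Prints: 5
print(string1.unicodeScalars.count)
// Prints: 7 (four emoji scalars joined by three U+200D ZWJs)
print(string2.unicodeScalars.count)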
// Match any single character
let regex = /./
// Proposed: grapheme-based matching semantics
// Prints: क़्त
print(try regex.firstMatch(in: string1)!.output)
// Prints: 👨‍👩‍👧‍👦
print(try regex.firstMatch(in: string2)!.output)
// Your proposal: normalize, then use scalar-based matching semantics
// Prints: क
print(try regex.firstMatch(in: string1)!.output)
// Prints: 👨
print(try regex.firstMatch(in: string2)!.output)
Without grapheme-based matching semantics we fail to match the String's single Character; instead we get something that doesn't even appear to be in our original input. For someone who has never used a regex before, these results seem bizarre!
This is pretty inconsistent with the rest of our language model, where we want to be Unicode-correct. Having different outputs by default between string.count and the regex /./ seems like a step away from Unicode correctness, even if it buys compatibility with classical regex engines. From the get-go, our string model has been vastly different from other languages' strings, so we have to diverge from 100% compatibility with other engines to stay consistent with the model we already have and to continue being as Unicode-correct as possible.
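To make that consistency argument concrete, here's the invariant I'd expect to hold by default, with scalar behavior as an explicit opt-in. This is only a sketch: I'm assuming a semantic-level switch along the lines of the pitched matchingSemantics(_:) and RegexSemanticLevel API, and the exact spelling may well end up different.

let input = "👨‍👩‍👧‍👦 café"
let anyCharacter = /./

// Default grapheme-cluster semantics: one match per Character,
// so the match count lines up with String.count.
print(input.count)                           // Prints: 6
print(input.matches(of: anyCharacter).count) // Prints: 6

// Explicit opt-in to scalar semantics: one match per Unicode.Scalar,
// lining up with the unicodeScalars view instead.
let anyScalar = anyCharacter.matchingSemantics(.unicodeScalar)
print(input.unicodeScalars.count)            // Prints: 12
print(input.matches(of: anyScalar).count)    // Prints: 12

Under that model, string.count and the match count of /./ always agree unless the user explicitly asks for scalar-level matching, which keeps the default consistent with the rest of String's API.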