The goal in @taylorswift's example seems to be reading some kind of binary format. You don't want to risk weird Unicode character equivalence meddling with that.
Even among textual file formats, many are defined in terms of code points (like XML or JSON). Parsing them grapheme by grapheme is asking for trouble. For instance, a combining character right after the opening quote of an XML attribute (as in attr="⃠⃠value") is well-formed XML and must be parsed as a value starting with a combining character. If you parse by grapheme, the quote and the combining character merge into a single Character, and you're out of spec.
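To make the problem concrete, here is a minimal sketch of how the opening quote disappears at the grapheme level. The function names are made up for illustration, and the combining mark is assumed to be U+20E0 (COMBINING ENCLOSING CIRCLE BACKSLASH); any combining mark should behave the same way.

```swift
// Attribute text as a parser would see it: an opening quote immediately
// followed by two combining marks (U+20E0 assumed here), then the value.
let attributeText = "\"\u{20E0}\u{20E0}value\""

// Hypothetical helper: checks the first grapheme cluster (Character).
func firstCharacterIsQuote(_ text: String) -> Bool {
    let quote: Character = "\""
    return text.first == quote
}

// Hypothetical helper: checks the first code point (Unicode.Scalar).
func firstScalarIsQuote(_ text: String) -> Bool {
    let quote: Unicode.Scalar = "\""
    return text.unicodeScalars.first == quote
}

// The grapheme view glues the quote and the combining marks together,
// so a Character-based parser never sees a bare quote here.
print(firstCharacterIsQuote(attributeText))
// The scalar view still sees the quote as its own code point.
print(firstScalarIsQuote(attributeText))
```

A parser that compares Character values would miss the start of the attribute value, while one working on unicodeScalars (or on UTF-8/UTF-16 code units) sees exactly what the spec describes.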
So you need to express characters as code points, or sometimes as lower-level integers, in the parser. If it's a complicated mess to express this, then the parser becomes a complicated mess. Here's a function in one of my parsers (old-style plist, parsed in UTF-16):
    func skipOneUnquotedStringCharacter() -> Bool {
        switch utf[pos] {
        case "a".utf16Head ... "z".utf16Head,
             "A".utf16Head ... "Z".utf16Head,
             "0".utf16Head ... "9".utf16Head,
             "_".utf16Head, "$".utf16Head, "/".utf16Head, ":".utf16Head, ".".utf16Head, "-".utf16Head:
            pos = utf.index(after: pos)
            return true
        default:
            return false
        }
    }
That utf16Head custom property? It's some weird contraption of mine that I hope the optimizer is capable of seeing through.
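For anyone who wants to try the function above in isolation, here is a minimal sketch of what such a property could look like. This is an illustrative guess, not the actual definition from that parser: the utf16Head body, the UnquotedStringScanner wrapper, and the unquotedPrefix helper are all assumptions added so the snippet runs on its own. The end-of-input guard is also an addition.

```swift
extension String {
    /// Hypothetical stand-in for the custom property: the first UTF-16
    /// code unit of the string. Traps on an empty string, so it is only
    /// meant for short literals like "a".
    var utf16Head: UInt16 {
        return utf16.first!
    }
}

/// Hypothetical minimal scanner holding the state the function relies on.
struct UnquotedStringScanner {
    let utf: String.UTF16View
    var pos: String.UTF16View.Index

    init(_ text: String) {
        utf = text.utf16
        pos = utf.startIndex
    }

    /// Advances past one code unit if it is allowed in an unquoted string.
    mutating func skipOneUnquotedStringCharacter() -> Bool {
        // Added guard: stop cleanly at the end of the input.
        guard pos < utf.endIndex else { return false }
        switch utf[pos] {
        case "a".utf16Head ... "z".utf16Head,
             "A".utf16Head ... "Z".utf16Head,
             "0".utf16Head ... "9".utf16Head,
             "_".utf16Head, "$".utf16Head, "/".utf16Head,
             ":".utf16Head, ".".utf16Head, "-".utf16Head:
            pos = utf.index(after: pos)
            return true
        default:
            return false
        }
    }
}

/// Hypothetical helper: returns the leading run of unquoted-string
/// characters at the start of `text`.
func unquotedPrefix(of text: String) -> String {
    var scanner = UnquotedStringScanner(text)
    while scanner.skipOneUnquotedStringCharacter() {}
    return String(text[..<scanner.pos])
}

print(unquotedPrefix(of: "key.name = 1"))
```

If the force unwrap is a concern, UInt16(UInt8(ascii: "a")) builds the same value from the standard library's UInt8(ascii:) initializer, at the cost of being even more verbose in the case patterns.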