Why is Character not Codable?

I'm dealing with Characters, and would prefer not to use String, for type correctness. But then my struct cannot be automatically Codable, because Character is not Codable?

What's the reasoning for Character not being Codable?

2 Likes

Theoretically, Character could be Codable, though there are some big pitfalls to consider when doing so:

The stability of grapheme clusters across Unicode versions isn't guaranteed, so you'd be liable to get different results with the same data across versions of Swift. For example:

  • Version Sᵢ of Swift has knowledge of Unicode version Uᵢ, which defines a sequence of Unicode code points P to be multiple grapheme clusters
  • A later version Sⱼ of Swift has knowledge of Unicode version Uⱼ, which redefines P to be a single grapheme cluster
  • You build a program with version Sⱼ of Swift and encode your single Character(P). You then try to decode that same data with the same program built with version Sᵢ of Swift and :boom:. Well, maybe not :boom:, but it would depend on what happens if Character.init(from:) encounters encoded data representing more than a single Character. Likely it would throw an error, but the data remains undecodable

(This might sound contrived, but Unicode 11 updated the definition of grapheme clusters to be defined by a regular expression, and extended the definitions of what might be considered an "allowable" grapheme cluster, especially around pictographic sequences [emoji]. The yellow in that link shows a diff from the previous version of the report [TR29-31] to the next [TR29-33]. So what might be considered a single grapheme cluster in Unicode 11 may be considered multiple in earlier versions of Unicode.)

As @SDGGiesbrecht notes in the other thread, it's likely you'll want to encode your Character as a String explicitly so that at decode time, you can figure out how you might want to deal with the failure modes directly.

18 Likes

So Unicode grapheme clusters for a character can change between versions, and a given Swift version only understands a specific version of Unicode. There is no backward compatibility.

  1. How is using a single-character String better than just a Character? Aren't they the same "on the wire"?

  2. Then how do we deal with old data/files? For example, how does a web browser display an old page written in an older Unicode version? How can the browser tell what version of Unicode the page uses?

  3. My use case: save a Swift struct out to a JSON file, then load the JSON file and decode it back into the struct. At the moment, since I'm compiling the writing and the reading with the same Swift version, this works fine. But what happens in the future, when Unicode changes once again, Swift moves to that version, and I recompile my code? Will I no longer be able to load my JSON?

You're absolutely correct that the underlying representation would likely end up being the same. The benefit to decoding the data as a String specifically is that String has no specific requirements on length or what it might contain. If you tried to decode "ab" as a Character, it would fail; if you try to decode it as a String it will succeed, and you can inspect the contents to figure out what to do next.

If you decode a String whose .count > 1, you know that either someone has messed with your data, or that the Character potentially came from a newer version of Unicode that considered that String to be a single Character.
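
For illustration, here's a minimal sketch of that approach (the wrapper type, its name, and the choice to throw an error are all illustrative, not a prescribed pattern):

// A hypothetical wrapper that stores a Character but encodes/decodes via String,
// surfacing the "more than one grapheme cluster" case as an explicit error.
struct StrictCharacter: Codable {
    var value: Character

    init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        let string = try container.decode(String.self)
        // Inspect the decoded String before committing to a Character.
        guard string.count == 1, let character = string.first else {
            throw DecodingError.dataCorruptedError(
                in: container,
                debugDescription: "Expected one Character, got \"\(string)\" (\(string.count) grapheme clusters)")
        }
        value = character
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        try container.encode(String(value))
    }
}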

Unicode maintains very strict backwards compatibility (e.g. old data carried forwards still has the same meaning), so this is rarely an issue. It's the other direction that's the problem (e.g. new definitions making it to older code which doesn't know about them).

Typically, software that receives unknown Unicode characters falls back to some default behavior, depending on the context. For instance:

  1. The Unicode 15 draft currently adds 4488 new characters (i.e. assigns semantic meaning to 4488 Unicode code points which were previously unassigned). One of these characters is U+1E030 MODIFIER LETTER CYRILLIC SMALL A. If I modify the HTML of a page to contain that character, this is what I see right now:
    [screenshot: the unrecognized character rendered as a code-point box]
    (Firefox 98.0.1 on Windows 10)

    Since the browser doesn't know how to handle this character, and the fonts I have loaded don't have a glyph, it falls back to displaying the code point number in a box. (The browser truly doesn't know how to handle this: this character is a combining character, but the browser doesn't know this, because it doesn't have knowledge of Unicode 15, so it doesn't actually combine with anything.)

  2. Unicode sometimes assigns semantic meaning to certain combinations of character sequences via Zero-Width-Joiners (ZWJ); for instance, Unicode 14 added "face with spiral eyes" out of U+1F635 DIZZY FACE (:dizzy_face:) + U+200D ZWJ + U+1F4AB DIZZY SYMBOL (:dizzy:). On your machine, this character sequence might appear correctly: :face_with_spiral_eyes: On mine, however, it does not: although the character sequence is recognized as a single character (I can put my cursor on either side of it, but not in the middle), the Microsoft emoji font on my machine doesn't have a glyph for it, so the characters render side-by-side (:dizzy_face::dizzy:):

    [screenshot: the two emoji rendered side-by-side]

Various software has different failure modes when it comes to this stuff.
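
You can observe that clustering behavior directly in Swift; the ZWJ grapheme-breaking rule keeps pictograph + ZWJ + pictograph sequences together as a single Character even when no font can render the combined glyph (a quick check, not tied to any particular app):

// U+1F635 DIZZY FACE + U+200D ZERO WIDTH JOINER + U+1F4AB DIZZY SYMBOL
let spiralEyes = "\u{1F635}\u{200D}\u{1F4AB}"

print(spiralEyes.unicodeScalars.count)  // 3 (three Unicode scalars...)
print(spiralEyes.count)                 // 1 (...but a single Character)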

Edit: case in point, the above screenshot is what I saw while typing up this response. However, after posting, here's what I see:

[screenshot: the same emoji, now replaced by Discourse's substituted images]

Discourse has replaced the display of the default Windows emoji font with images of what appear to be the Apple emoji symbols for :dizzy_face: and :dizzy:, and falls back to some other emoji font image for :face_with_spiral_eyes:. As said: various software has different failure modes when it comes to this stuff...

(I know Slack also tries to make emoji appear "consistently" across OSes in similar fashion, with sometimes similar failure modes; e.g., if I type :face_with_spiral_eyes: into Slack, it actually decomposes it into :dizzy_face::dizzy:...)

This isn't something you'll typically need to deal with, but the opposite: you run your app on macOS with a newer version of Swift, which knows about newer Unicode definitions. You produce data with that app on that OS, and try to send it over to an older computer with a much older version of Swift in the OS. You launch the same app and try to decode, but the Character you encoded is no longer recognized as a single Character with the older Unicode definitions. The end result is entirely dependent on the failure mode of how you decode.

This sort of backwards compatibility is why decoding typically needs to be very permissive, because once you release a very strict version of your app, you need to maintain that strictness or else it'll choke on newer data.

In general, if your decode fails very gracefully, or tries to recover in some meaningful way, at least the experience won't degrade. There isn't necessarily anything you can do about this, but it's a principle to keep in mind as you write your code.
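
One hedged sketch of what "recover in some meaningful way" could look like: keep the raw decoded String around, and only surface a Character when this system agrees it's a single grapheme cluster (the type and its name are made up for illustration):

// Lenient wrapper: never throws away data it can't interpret.
struct LenientCharacter: Codable {
    // The string exactly as it appeared in the encoded data.
    var raw: String

    // nil when the data isn't a single grapheme cluster under this
    // system's Unicode tables (e.g. it came from a newer Unicode version).
    var character: Character? { raw.count == 1 ? raw.first : nil }

    init(from decoder: Decoder) throws {
        raw = try decoder.singleValueContainer().decode(String.self)
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        try container.encode(raw)  // round-trips the original data unchanged
    }
}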

14 Likes

:pray: Thank you for the explanation! I think I am clear on compatibility:

  • Code compiled with a newer Swift version can process older Unicode data, because Unicode is backward compatible.
  • Code compiled with an older Swift version may not be able to process newer Unicode data, because a grapheme cluster sequence may be unknown to it.
  • Encoding/decoding a Character as a String is better, because it lets you handle "unknown" characters yourself.

So if I want to use Character, not String in my own type that's Codable, can I do this:

extension Character: Codable {
    public init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        let s = try container.decode(String.self)
        // if it's not a single character, use code FFFF to indicate illegal value
        self = s.count == 1 ? s.first! : "\u{FFFF}"
    }

    public func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        try container.encode(String(describing: self))
    }
}
I tested it and it seems to work:
func decodeToChar(_ input: String) {
    let decoder = JSONDecoder()
    if let data = input.data(using: .utf8), let result = try? decoder.decode(Character.self, from: data) {
        print(result)   // works
    } else {
        print("No dice!")
    }
}

func encodeChar(_ value: Character) {
    let encoder = JSONEncoder()
    if let jsonData = try? encoder.encode(value) {
        let output = String(data: jsonData, encoding: .utf8) ?? "Not valid json"
        print("Json:", output)
    } else {
        print("encode something went wrong")
    }
}

let goodInput = "\"🦀\""

decodeToChar(goodInput) // prints: 🦀

let badInput = "\"xx\""
decodeToChar(badInput)  // prints the U+FFFF fallback, which has no visible glyph

let char: Character = "🦀"
encodeChar(char)    // prints: Json: "🦀"

There is a wrinkle to this: until Swift 5.6, grapheme cluster breaking relied on ICU, so Swift versions and Unicode versions were not bound together. In the most recent release of Swift, a native implementation of the most recent Unicode grapheme breaking rules was introduced.

I have not thought it through entirely but it is conceivable that one could contrive a scenario where an app compiled with Swift 5.5 uses a newer Unicode grapheme cluster breaking algorithm in the future (via an updated version of ICU) than another app compiled with Swift 5.6.

3 Likes

For the most part, yes, though see @xwu's comment for a very niche edge case. I don't think it's something you'll need to worry about in practice as long as your code is permissive, but it's something to be generally aware of.

Yes, though some minor notes:

  1. Along with U+FFFF, which is a Unicode "noncharacter" (a code point reserved for internal use and not intended for interchange), you may also consider using U+FFFD REPLACEMENT CHARACTER, which indicates that the underlying character was unrecognized or unrepresentable in Unicode
  2. Instead of String(describing: self), you can just use String.init(_:) which takes a Character directly
  3. It's not recommended to conform a type you don't own (Character) to a protocol you don't own (Codable); I'd suggest inlining this implementation into the actual struct you mentioned which encodes and decodes the Character (i.e., implement init(from:) and encode(to:) on that struct rather than on Character)
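
A minimal sketch of that third suggestion, assuming a hypothetical struct with a single Character property:

// Hypothetical owning type; the conformance lives here, not on Character.
struct Glyph: Codable {
    var symbol: Character

    init(symbol: Character) {
        self.symbol = symbol
    }

    private enum CodingKeys: String, CodingKey {
        case symbol
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        let string = try container.decode(String.self, forKey: .symbol)
        // Fall back to U+FFFD REPLACEMENT CHARACTER if the decoded String
        // isn't exactly one grapheme cluster on this system.
        symbol = string.count == 1 ? string.first! : "\u{FFFD}"
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(String(symbol), forKey: .symbol)
    }
}
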
4 Likes

A side question: can you recommend a resource that has a bunch of weird strings for testing correct Unicode handling? (This is in relation to that thread, for which I have code that seems to work fine, but my fear is that it could break for us on some platform when some unusually weird string comes along.)

It depends on what you mean by "weird strings", and what sort of handling you're looking to test. If you want to stress test some code with strings that you might find "validly" in the wild, I'd start with Unicode's own text files for various tests and definitions they offer:

  • emoji-variation-sequences.txt lists all emoji sequences marked with U+FE0E VARIATION SELECTOR-15 (i.e., display this character as text, e.g. ":open_umbrella:︎" ← edit: Discourse "helpfully" replaces this text with an emoji image of an umbrella, despite the explicit variation selector :man_facepalming:t2:) or U+FE0F VARIATION SELECTOR-16 (i.e., display this character as emoji, e.g. ":open_umbrella:"); each of these is a single grapheme cluster you can use for testing
  • emoji-zwj-sequences.txt contains recommendations for semantically-defined sequences of emoji characters separated with zero-width-joiners (e.g. U+1F468 MAN :man: + U+200D + U+1F468 MAN :man: + U+200D + U+1F467 GIRL :girl: + U+200D + U+1F466 BOY :boy: → :family_man_man_girl_boy: "family: man, man, girl, boy"; or U+1F469 WOMAN :woman: + U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4 🏽 + U+200D + U+1F680 ROCKET :rocket: → :woman_astronaut:t4: "woman astronaut: medium skin tone"); each of these is also a single grapheme cluster you can use for testing, and some get pretty long
  • GraphemeBreakTest.txt, WordBreakTest.txt, SentenceBreakTest.txt, and LineBreakTest.txt contain strings used to test the various ICU "break" iterators for iterating over graphemes, words, sentences, and lines, but you can repurpose these strings as various valid combinations of text
  • Similarly, BidiCharacterTest.txt contains various non-display strings that exercise bidirectional embeddings of bidi control characters, which might also be useful for testing

These files have somewhat different formats, but they are pretty standardized and don't change much between Unicode versions, so it shouldn't be difficult to parse them.
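
As an example, here's a sketch of a parser for the GraphemeBreakTest.txt line format as I read it: "÷" marks a permitted break, "×" a forbidden one, and the hex numbers are scalar values (double-check against the header comments of the file you download):

import Foundation

// Turn one data line into the test string plus the grapheme clusters
// the line claims it should break into. Returns nil for comments/blanks.
func parseGraphemeBreakTestLine(_ line: String) -> (text: String, clusters: [String])? {
    // Strip the trailing "# ..." comment, then skip empty lines.
    let data = line.split(separator: "#", maxSplits: 1, omittingEmptySubsequences: false)[0]
        .trimmingCharacters(in: .whitespaces)
    guard !data.isEmpty else { return nil }

    var clusters: [String] = []
    var current = ""
    for token in data.split(separator: " ") {
        switch token {
        case "÷":  // break opportunity: close out the current cluster
            if !current.isEmpty { clusters.append(current); current = "" }
        case "×":  // no break: keep extending the current cluster
            continue
        default:   // a hex scalar value such as "0308"
            guard let value = UInt32(token, radix: 16),
                  let scalar = Unicode.Scalar(value) else { return nil }
            current.unicodeScalars.append(scalar)
        }
    }
    return (clusters.joined(), clusters)
}

// Swift's own grapheme breaking can then be compared against the expectation:
if let test = parseGraphemeBreakTestLine("÷ 0061 × 0308 ÷ 0062 ÷") {
    print(test.text.map(String.init) == test.clusters)  // true
}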

If you're looking for abusive strings that would likely need to be manually and/or maliciously constructed, I don't know of a resource off the top of my head. You could try out a "Zalgo Text" generator which adds many combining characters to strings to create c̟̱͓͕̮̬̖u͆̓ͦrs͙͙͙̦͖̳͉ë̫d̫͒ text, but otherwise, I've put together test cases in the past manually.

If you do want to put together test cases manually, I'd start with the regex definitions of grapheme clusters and manually construct some out of various components in the regex. In particular, hangul-syllable and xpicto-sequence are unbounded, so you can also construct grapheme clusters that are arbitrarily long and test those too.
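
A sketch of that idea, assuming current grapheme-breaking rules (pictograph + ZWJ + pictograph chains and runs of leading hangul jamo both stay within one cluster):

// An arbitrarily long emoji ZWJ chain: pictograph (ZWJ pictograph)*
let link = "\u{200D}\u{1F469}"  // ZERO WIDTH JOINER + WOMAN
let longEmoji = "\u{1F469}" + String(repeating: link, count: 50)
print(longEmoji.unicodeScalars.count)  // 101 scalars
print(longEmoji.count)                 // 1 Character

// A hangul syllable with many leading jamo is also a single cluster:
let jamo = String(repeating: "\u{1100}", count: 20) + "\u{1161}"  // L x20 + V
print(jamo.count)  // 1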

2 Likes

For that task specifically: cases that are suspected of changing Unicode composition, which could lead to corresponding changes of UTF-8 byte counts/offsets. Thank you.

My test using this string:

"๐Ÿ˜€\u{FE0E}"

shows this selector is not obeyed anywhere. The only place it works is the Xcode playground console output area. Everywhere else I tested shows the regular emoji.

Variation selectors don't apply to all emoji; in particular, U+1F600 GRINNING FACE 😀 doesn't have a "textual" presentation.

Unicode scalars have a boolean property called Is_Emoji, which indicates whether the scalar counts as emoji at all, and another called Emoji_Presentation, which indicates whether it displays as emoji by default. For some characters, variation selectors allow you to request display of one of these two presentations. These apply when:

  1. A "base" Unicode character introduced in one version of Unicode got an emoji variant in a later version of Unicode, e.g.,
    • Unicode version 1.1.0 added U+2603 SNOWMAN ☃
    • Unicode version 9.0.0 added the emoji variant: U+2603 SNOWMAN ☃ + U+FE0F VARIATION SELECTOR-16 → ☃️
    • These characters usually default to a textual presentation, with U+FE0F requesting the emoji variant
  2. An emoji character introduced in one version of Unicode got a textual variant in a later version of Unicode, e.g.
    • Unicode version 6.0.0 added U+1F3C4 SURFER 🏄
    • A later Unicode version (I can't find the specific one at the moment) added a non-emoji variant: U+1F3C4 SURFER 🏄 + U+FE0E VARIATION SELECTOR-15 → 🏄︎
    • These characters usually default to an emoji presentation, with U+FE0E requesting the textual variant

Unicode spells out the rules around these properties and how they're treated, but a visual presentation works too:

func variations(_ base: Character) {
    if base.unicodeScalars.first!.properties.isEmoji {
        print(base, "(emoji presentation: \(base.unicodeScalars.first!.properties.isEmojiPresentation))", "\(base)\u{FE0E}", "<=>", "\(base)\u{FE0F}")
    } else {
        print(base, "(not emoji)")
    }
}

variations("a") // => a (not emoji)
variations("\u{2603}") // => โ˜ƒ (emoji presentation: false) โ˜ƒ๏ธŽ <=> โ˜ƒ๏ธ
variations("\u{1F3C4}") // => ๐Ÿ„ (emoji presentation: true) ๐Ÿ„๏ธŽ <=> ๐Ÿ„๏ธ

Some characters default to presenting as emoji, so the variation selector can act as an override to explicitly request text, and vice versa. Note that these selectors are effectively a request to display text a certain way; on macOS 12.3, no system font supports a textual representation of :surfing_man:, so I still see a pictorial representation of the surfer in all cases.


In any case, U+1F600 GRINNING FACE 😀 doesn't have a textual presentation to begin with, so the variation selector is ignored.

3 Likes
😀 GRINNING FACE isEmoji: true, isEmojiPresentation: true
🏄 SURFER isEmoji: true, isEmojiPresentation: true

If a scalar's isEmoji == true and isEmojiPresentation == true, how can I know whether it has a textual form or not?

Unfortunately, I don't believe there's currently a programmatic way to know. emoji-variation-sequences.txt covers all of the options for a given version of Unicode (the TR51 Data Files table lists the contents of that file as "All permissible emoji presentation sequences and text presentation sequences"), and you can process that file into a list, but Unicode doesn't otherwise offer a semantic property name for "has a textual presentation".
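
If it helps, here's a sketch of processing that file into such a list. The format assumption (fields separated by ";", comments after "#", data lines like "0023 FE0E ; text style; # ...") is my reading of the published file and worth verifying against the copy you download:

import Foundation

// Collect the base scalars that have a registered *text* presentation
// sequence (base scalar followed by U+FE0E) in emoji-variation-sequences.txt.
func scalarsWithTextPresentation(in fileContents: String) -> Set<Unicode.Scalar> {
    var result: Set<Unicode.Scalar> = []
    for line in fileContents.split(separator: "\n") {
        // Strip the trailing "# ..." comment, then split the data fields.
        let data = line.split(separator: "#", maxSplits: 1, omittingEmptySubsequences: false)[0]
        let fields = data.split(separator: ";").map { $0.trimmingCharacters(in: .whitespaces) }
        guard fields.count >= 2, fields[1] == "text style" else { continue }
        // The first field is the sequence, e.g. "1F3C4 FE0E"; keep the base scalar.
        if let first = fields[0].split(separator: " ").first,
           let value = UInt32(first, radix: 16),
           let scalar = Unicode.Scalar(value) {
            result.insert(scalar)
        }
    }
    return result
}

// Usage, with the file downloaded from unicode.org:
// let contents = try String(contentsOfFile: "emoji-variation-sequences.txt", encoding: .utf8)
// let textCapable = scalarsWithTextPresentation(in: contents)
// textCapable.contains("\u{1F3C4}")  // SURFER: true
// textCapable.contains("\u{1F600}")  // GRINNING FACE: false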

1 Like

Thank you very much. You have taught me a whole lot about Unicode and emoji characters. I could never have learned this through my own research!

:pray:

1 Like