Why is Character not Codable?

I'm dealing with Characters, and would prefer not to use String, for type correctness. But then my struct cannot be automatically Codable, because Character is not Codable?

What's the reasoning for Character not being Codable?

2 Likes

Theoretically, Character could be Codable, though there are some big pitfalls to consider when doing so:

The stability of grapheme clusters across Unicode versions isn't guaranteed, so you'd be liable to get different results with the same data across versions of Swift. For example:

  • Version Sᵢ of Swift has knowledge of Unicode version Uᵢ, which defines a sequence of Unicode code points P to be multiple grapheme clusters
  • A later version Sⱼ of Swift has knowledge of Unicode version Uⱼ, which redefines P to be a single grapheme cluster
  • You build a program with version Sⱼ of Swift and encode your single Character(P). You then try to decode that same data with the same program built with version Sᵢ of Swift and :boom:. Well, maybe not :boom:, but it would depend on what happens if Character.init(from:) encounters encoded data representing more than a single Character. Likely it would throw an error, but the data remains undecodable

(This might sound contrived, but Unicode 11 updated the definition of grapheme clusters to be defined by a regular expression, and extended the definitions of what might be considered an "allowable" grapheme cluster, especially around pictographic sequences [emoji]. The yellow in that link shows a diff from the previous version of the report [TR29-31] to the next [TR29-33]. So what might be considered a single grapheme cluster in Unicode 11 may be considered multiple in earlier versions of Unicode.)

As @SDGGiesbrecht notes in the other thread, it's likely you'll want to encode your Character as a String explicitly so that at decode time, you can figure out how you might want to deal with the failure modes directly.

18 Likes

So Unicode grapheme clusters for a character can change between versions, and a given Swift version only understands a specific version of Unicode. There is no backward compatibility.

  1. How is using a single-character String better than just a Character? Aren't they the same "on the wire"?

  2. Then how do we deal with old data/files? For example, how does a web browser display an old page written in an older Unicode version? How can the browser tell what version of Unicode the page uses?

  3. My use case: save a Swift struct out to a JSON file, then load the JSON file and decode it back into the struct. At the moment, since I'm compiling the writing and the reading with the same Swift version, this works fine. But what happens in the future, when Unicode changes once again, Swift moves to that version, and I recompile my code? Will I no longer be able to load my JSON?

You're absolutely correct that the underlying representation would likely end up being the same. The benefit to decoding the data as a String specifically is that String has no specific requirements on length or what it might contain. If you tried to decode "ab" as a Character, it would fail; if you try to decode it as a String it will succeed, and you can inspect the contents to figure out what to do next.

If you decode a String whose .count > 1, you know that either someone has messed with your data, or that the Character potentially came from a newer version of Unicode that considered that String to be a single Character.
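
For illustration, here's a minimal sketch of that approach (the wrapper type, its name, and the choice to throw an error are all illustrative, not a prescribed pattern):

// A hypothetical wrapper that stores a Character but encodes/decodes via String,
// surfacing the "more than one grapheme cluster" case as an explicit error.
struct StrictCharacter: Codable {
    var value: Character

    init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        let string = try container.decode(String.self)
        // Inspect the decoded String before committing to a Character.
        guard string.count == 1, let character = string.first else {
            throw DecodingError.dataCorruptedError(
                in: container,
                debugDescription: "Expected one Character, got \"\(string)\" (\(string.count) grapheme clusters)")
        }
        value = character
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        try container.encode(String(value))
    }
}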

Unicode maintains very strict backwards compatibility (e.g. old data carried forwards still has the same meaning), so this is rarely an issue. It's the other direction that's the problem (e.g. new definitions making it to older code which doesn't know about them).

Typically, software that receives unknown Unicode characters falls back to some default behavior, depending on the context. For instance:

  1. The Unicode 15 draft currently adds 4488 new characters (i.e. assigns semantic meaning to 4488 Unicode code points which were previously unassigned). One of these characters is U+1E030 MODIFIER LETTER CYRILLIC SMALL A. If I modify the HTML of a page to contain that character, this is what I see right now:
    [screenshot: the unrecognized character rendered as a code-point box]
    (Firefox 98.0.1 on Windows 10)

    Since the browser doesn't know how to handle this character, and the fonts I have loaded don't have a glyph, it falls back to displaying the code point number in a box. (The browser truly doesn't know how to handle this: this character is a combining character, but the browser doesn't know this, because it doesn't have knowledge of Unicode 15, so it doesn't actually combine with anything.)

  2. Unicode sometimes assigns semantic meaning to certain combinations of character sequences via Zero-Width-Joiners (ZWJ); for instance, Unicode 14 added "face with spiral eyes" out of U+1F635 DIZZY FACE (:dizzy_face:) + U+200D ZWJ + U+1F4AB DIZZY SYMBOL (:dizzy:). On your machine, this character sequence might appear correctly: :face_with_spiral_eyes: On mine, however, it does not: although the character sequence is recognized as a single character (I can put my cursor on either side of it, but not in the middle), the Microsoft emoji font on my machine doesn't have a glyph for it, so the characters render side-by-side (:dizzy_face::dizzy:):

    [screenshot: the two emoji rendered side-by-side]

Various software has different failure modes when it comes to this stuff.
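
You can observe that clustering behavior directly in Swift; the ZWJ grapheme-breaking rule keeps pictograph + ZWJ + pictograph sequences together as a single Character even when no font can render the combined glyph (a quick check, not tied to any particular app):

// U+1F635 DIZZY FACE + U+200D ZERO WIDTH JOINER + U+1F4AB DIZZY SYMBOL
let spiralEyes = "\u{1F635}\u{200D}\u{1F4AB}"

print(spiralEyes.unicodeScalars.count)  // 3 (three Unicode scalars...)
print(spiralEyes.count)                 // 1 (...but a single Character)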

Edit: case in point, the above screenshot is what I saw while typing up this response. However, after posting, here's what I see:

[screenshot: the same emoji, now replaced by Discourse's substituted images]

Discourse has replaced the display of the default Windows emoji font with images of what appear to be the Apple emoji symbols for :dizzy_face: and :dizzy:, and falls back to some other emoji font image for :face_with_spiral_eyes:. As said: various software has different failure modes when it comes to this stuff...

(I know Slack also tries to make emoji appear "consistently" across OSes in similar fashion, with sometimes similar failure modes; e.g., if I type :face_with_spiral_eyes: into Slack, it actually decomposes it into :dizzy_face::dizzy:...)

This isn't something you'll typically need to deal with, but the opposite: you run your app on macOS with a newer version of Swift, which knows about newer Unicode definitions. You produce data with that app on that OS, and try to send it over to an older computer with a much older version of Swift in the OS. You launch the same app and try to decode, but the Character you encoded is no longer recognized as a single Character with the older Unicode definitions. The end result is entirely dependent on the failure mode of how you decode.

This sort of backwards compatibility is why decoding typically needs to be very permissive, because once you release a very strict version of your app, you need to maintain that strictness or else it'll choke on newer data.

In general, if your decode fails very gracefully, or tries to recover in some meaningful way, at least the experience won't degrade. There isn't necessarily anything you can do about this, but it's a principle to keep in mind as you write your code.
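
One hedged sketch of what "recover in some meaningful way" could look like: keep the raw decoded String around, and only surface a Character when this system agrees it's a single grapheme cluster (the type and its name are made up for illustration):

// Lenient wrapper: never throws away data it can't interpret.
struct LenientCharacter: Codable {
    // The string exactly as it appeared in the encoded data.
    var raw: String

    // nil when the data isn't a single grapheme cluster under this
    // system's Unicode tables (e.g. it came from a newer Unicode version).
    var character: Character? { raw.count == 1 ? raw.first : nil }

    init(from decoder: Decoder) throws {
        raw = try decoder.singleValueContainer().decode(String.self)
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        try container.encode(raw)  // round-trips the original data unchanged
    }
}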

14 Likes

:pray: Thank you for the explanation! I think I am clear on compatibility:

  • Code compiled with a newer Swift version can process older Unicode data, because Unicode is backward compatible.
  • Code compiled with an older Swift version may not be able to process newer Unicode data, because a grapheme cluster sequence may be unknown to it.
  • Encoding/decoding a Character as a String is better, because it lets you handle "unknown" characters yourself.

So if I want to use Character, not String in my own type that's Codable, can I do this:

extension Character: Codable {
    public init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        let s = try container.decode(String.self)
        // if it's not a single character, use code FFFF to indicate illegal value
        self = s.count == 1 ? s.first! : "\u{FFFF}"
    }

    public func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        try container.encode(String(describing: self))
    }
}
I tested it and it seems to work:
func decodeToChar(_ input: String) {
    let decoder = JSONDecoder()
    if let data = input.data(using: .utf8), let result = try? decoder.decode(Character.self, from: data) {
        print(result)   // works
    } else {
        print("No dice!")
    }
}

func encodeChar(_ value: Character) {
    let encoder = JSONEncoder()
    if let jsonData = try? encoder.encode(value) {
        let output = String(data: jsonData, encoding: .utf8) ?? "Not valid json"
        print("Json:", output)
    } else {
        print("encode something went wrong")
    }
}

let goodInput = "\"🦀\""

decodeToChar(goodInput) // prints: 🦀

let badInput = "\"xx\""
decodeToChar(badInput)  // prints the U+FFFF fallback, which has no visible glyph

let char: Character = "🦀"
encodeChar(char)    // prints: Json: "🦀"

There is a wrinkle to this: until Swift 5.6, grapheme cluster breaking relied on ICU, so Swift versions and Unicode versions were not bound together. In the most recent release of Swift, a native implementation of the most recent Unicode grapheme breaking rules was introduced.

I have not thought it through entirely but it is conceivable that one could contrive a scenario where an app compiled with Swift 5.5 uses a newer Unicode grapheme cluster breaking algorithm in the future (via an updated version of ICU) than another app compiled with Swift 5.6.

3 Likes

For the most part, yes, though see @xwu's comment for a very niche edge case. I don't think it's something you'll need to worry about in practice as long as your code is permissive, but it's something to be generally aware of.

Yes, though some minor notes:

  1. Along with U+FFFF, which is a Unicode "noncharacter" (a code point reserved for internal use and not intended for interchange), you may also consider using U+FFFD REPLACEMENT CHARACTER, which indicates that the underlying character was unrecognized or unrepresentable in Unicode
  2. Instead of String(describing: self), you can just use String.init(_:) which takes a Character directly
  3. It's not recommended to conform a type you don't own (Character) to a protocol you don't own (Codable); I'd suggest inlining this implementation into the actual struct you mentioned which encodes and decodes the Character (i.e., implement init(from:) and encode(to:) on that struct rather than on Character)
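
A minimal sketch of that third suggestion, assuming a hypothetical struct with a single Character property:

// Hypothetical owning type; the conformance lives here, not on Character.
struct Glyph: Codable {
    var symbol: Character

    init(symbol: Character) {
        self.symbol = symbol
    }

    private enum CodingKeys: String, CodingKey {
        case symbol
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        let string = try container.decode(String.self, forKey: .symbol)
        // Fall back to U+FFFD REPLACEMENT CHARACTER if the decoded String
        // isn't exactly one grapheme cluster on this system.
        symbol = string.count == 1 ? string.first! : "\u{FFFD}"
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(String(symbol), forKey: .symbol)
    }
}
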
4 Likes

A side question: can you recommend a resource that has a bunch of weird strings for testing correct Unicode handling? (This is in relation to that thread, for which I have code that seems to work fine, but my fear is that it could break for us on some platform when some unusually weird string comes along.)

It depends on what you mean by "weird strings", and what sort of handling you're looking to test. If you want to stress test some code with strings that you might find "validly" in the wild, I'd start with Unicode's own text files for various tests and definitions they offer:

  • emoji-variation-sequences.txt lists all emoji sequences marked with U+FE0E VARIATION SELECTOR-15 (i.e., display this character as text, e.g. ":open_umbrella:︎" ← edit: Discourse "helpfully" replaces this text with an emoji image of an umbrella, despite the explicit variation selector :man_facepalming:t2:) or U+FE0F VARIATION SELECTOR-16 (i.e., display this character as emoji, e.g. ":open_umbrella:"); each of these is a single grapheme cluster you can use for testing
  • emoji-zwj-sequences.txt contains recommendations for semantically-defined sequences of emoji characters separated with zero-width-joiners (e.g. U+1F468 MAN :man: + U+200D + U+1F468 MAN :man: + U+200D + U+1F467 GIRL :girl: + U+200D + U+1F466 BOY :boy: → :family_man_man_girl_boy: "family: man, man, girl, boy"; or U+1F469 WOMAN :woman: + U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4 🏽 + U+200D + U+1F680 ROCKET :rocket: → :woman_astronaut:t4: "woman astronaut: medium skin tone"); each of these is also a single grapheme cluster you can use for testing, and some get pretty long
  • GraphemeBreakTest.txt, WordBreakTest.txt, SentenceBreakTest.txt, and LineBreakTest.txt contain strings used to test the various ICU "break" iterators for iterating over graphemes, words, sentences, and lines, but you can repurpose these strings as various valid combinations of text
  • Similarly, BidiCharacterTest.txt contains various non-display strings that exercise bidirectional embeddings of bidi control characters, which might also be useful for testing

These files have somewhat different formats, but they are pretty standardized and don't change much between Unicode versions, so it shouldn't be difficult to parse them.
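
As an example, here's a sketch of a parser for the GraphemeBreakTest.txt line format as I read it: "÷" marks a permitted break, "×" a forbidden one, and the hex numbers are scalar values (double-check against the header comments of the file you download):

import Foundation

// Turn one data line into the test string plus the grapheme clusters
// the line claims it should break into. Returns nil for comments/blanks.
func parseGraphemeBreakTestLine(_ line: String) -> (text: String, clusters: [String])? {
    // Strip the trailing "# ..." comment, then skip empty lines.
    let data = line.split(separator: "#", maxSplits: 1, omittingEmptySubsequences: false)[0]
        .trimmingCharacters(in: .whitespaces)
    guard !data.isEmpty else { return nil }

    var clusters: [String] = []
    var current = ""
    for token in data.split(separator: " ") {
        switch token {
        case "÷":  // break opportunity: close out the current cluster
            if !current.isEmpty { clusters.append(current); current = "" }
        case "×":  // no break: keep extending the current cluster
            continue
        default:   // a hex scalar value such as "0308"
            guard let value = UInt32(token, radix: 16),
                  let scalar = Unicode.Scalar(value) else { return nil }
            current.unicodeScalars.append(scalar)
        }
    }
    return (clusters.joined(), clusters)
}

// Swift's own grapheme breaking can then be compared against the expectation:
if let test = parseGraphemeBreakTestLine("÷ 0061 × 0308 ÷ 0062 ÷") {
    print(test.text.map(String.init) == test.clusters)  // true
}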

If you're looking for abusive strings that would likely need to be manually and/or maliciously constructed, I don't know of a resource off the top of my head. You could try out a "Zalgo Text" generator which adds many combining characters to strings to create c̟̱͓͕̮̬̖u͆̓ͦrs͙͙͙̦͖̳͉ë̫d̫͒ text, but otherwise, I've put together test cases in the past manually.

If you do want to put together test cases manually, I'd start with the regex definitions of grapheme clusters and manually construct some out of various components in the regex. In particular, hangul-syllable and xpicto-sequence are unbounded, so you can also construct grapheme clusters that are arbitrarily long and test those too.
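
A sketch of that idea, assuming current grapheme-breaking rules (pictograph + ZWJ + pictograph chains and runs of leading hangul jamo both stay within one cluster):

// An arbitrarily long emoji ZWJ chain: pictograph (ZWJ pictograph)*
let link = "\u{200D}\u{1F469}"  // ZERO WIDTH JOINER + WOMAN
let longEmoji = "\u{1F469}" + String(repeating: link, count: 50)
print(longEmoji.unicodeScalars.count)  // 101 scalars
print(longEmoji.count)                 // 1 Character

// A hangul syllable with many leading jamo is also a single cluster:
let jamo = String(repeating: "\u{1100}", count: 20) + "\u{1161}"  // L x20 + V
print(jamo.count)  // 1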

2 Likes

For that task specifically: cases that are suspected of changing Unicode composition, which could lead to corresponding changes of UTF-8 byte counts/offsets. Thank you.

My test using this string:

"๐Ÿ˜€\u{FE0E}"

shows this selector is not obeyed anywhere. The only place it works is the Xcode playground console output area. Everywhere else I tested shows the regular emoji.

Variation selectors don't apply to all emoji; in particular, U+1F600 GRINNING FACE 😀 doesn't have a "textual" presentation.

Unicode scalars have a boolean property called Is_Emoji, which indicates whether the scalar counts as emoji at all, and another called Emoji_Presentation, which indicates whether it displays as emoji by default. For some characters, variation selectors allow you to request display of one of these two presentations. These apply when:

  1. A "base" Unicode character introduced in one version of Unicode got an emoji variant in a later version of Unicode, e.g.,
    • Unicode version 1.1.0 added U+2603 SNOWMAN ☃
    • Unicode version 9.0.0 added the emoji variant: U+2603 SNOWMAN ☃ + U+FE0F VARIATION SELECTOR-16 → ☃️
    • These characters usually default to a textual presentation, with U+FE0F requesting the emoji variant
  2. An emoji character introduced in one version of Unicode got a textual variant in a later version of Unicode, e.g.
    • Unicode version 6.0.0 added U+1F3C4 SURFER 🏄
    • A later Unicode version (I can't find the specific one at the moment) added a non-emoji variant: U+1F3C4 SURFER 🏄 + U+FE0E VARIATION SELECTOR-15 → 🏄︎
    • These characters usually default to an emoji presentation, with U+FE0E requesting the textual variant

Unicode spells out the rules around these properties and how they're treated, but a visual presentation works too:

func variations(_ base: Character) {
    if base.unicodeScalars.first!.properties.isEmoji {
        print(base, "(emoji presentation: \(base.unicodeScalars.first!.properties.isEmojiPresentation))", "\(base)\u{FE0E}", "<=>", "\(base)\u{FE0F}")
    } else {
        print(base, "(not emoji)")
    }
}

variations("a") // => a (not emoji)
variations("\u{2603}") // => โ˜ƒ (emoji presentation: false) โ˜ƒ๏ธŽ <=> โ˜ƒ๏ธ
variations("\u{1F3C4}") // => ๐Ÿ„ (emoji presentation: true) ๐Ÿ„๏ธŽ <=> ๐Ÿ„๏ธ

Some characters default to presenting as emoji, so the variation selector can act as an override to explicitly request text, and vice versa. Note that these selectors are effectively a request to display text a certain way; on macOS 12.3, no system font supports a textual representation of :surfing_man:, so I still see a pictorial representation of the surfer in all cases.


In any case, U+1F600 GRINNING FACE 😀 doesn't have a textual presentation to begin with, so the variation selector is ignored.

3 Likes
😀 GRINNING FACE isEmoji: true, isEmojiPresentation: true
🏄 SURFER isEmoji: true, isEmojiPresentation: true

If a scalar's isEmoji == true and isEmojiPresentation == true, how can I know whether it has a textual form or not?

Unfortunately, I don't believe there's currently a programmatic way to know. emoji-variation-sequences.txt covers all of the options for a given version of Unicode (the TR51 Data Files table lists the contents of that file as "All permissible emoji presentation sequences and text presentation sequences"), and you can process that file into a list, but Unicode doesn't otherwise offer a semantic property name for "has a textual presentation".
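
If it helps, here's a sketch of processing that file into such a list. The format assumption (fields separated by ";", comments after "#", data lines like "0023 FE0E ; text style; # ...") is my reading of the published file and worth verifying against the copy you download:

import Foundation

// Collect the base scalars that have a registered *text* presentation
// sequence (base scalar followed by U+FE0E) in emoji-variation-sequences.txt.
func scalarsWithTextPresentation(in fileContents: String) -> Set<Unicode.Scalar> {
    var result: Set<Unicode.Scalar> = []
    for line in fileContents.split(separator: "\n") {
        // Strip the trailing "# ..." comment, then split the data fields.
        let data = line.split(separator: "#", maxSplits: 1, omittingEmptySubsequences: false)[0]
        let fields = data.split(separator: ";").map { $0.trimmingCharacters(in: .whitespaces) }
        guard fields.count >= 2, fields[1] == "text style" else { continue }
        // The first field is the sequence, e.g. "1F3C4 FE0E"; keep the base scalar.
        if let first = fields[0].split(separator: " ").first,
           let value = UInt32(first, radix: 16),
           let scalar = Unicode.Scalar(value) {
            result.insert(scalar)
        }
    }
    return result
}

// Usage, with the file downloaded from unicode.org:
// let contents = try String(contentsOfFile: "emoji-variation-sequences.txt", encoding: .utf8)
// let textCapable = scalarsWithTextPresentation(in: contents)
// textCapable.contains("\u{1F3C4}")  // SURFER: true
// textCapable.contains("\u{1F600}")  // GRINNING FACE: false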

1 Like

Thank you very much. You have taught me a whole lot about Unicode and emoji characters. I could never have learned this through my own research!

:pray:

1 Like