I'm dealing with Characters and would prefer not to use String, for type correctness. But then my struct cannot be automatically Codable because Character is not Codable?
What's the reasoning for Character not being Codable?
Theoretically, Character could be Codable, though there are some big pitfalls to consider when doing so:
The stability of grapheme clusters across Unicode versions isn't guaranteed, so you'd be liable to get different results with the same data across versions of Swift. For example:
1. Version Sᵢ of Swift has knowledge of Unicode version Uᵢ, which defines a sequence of Unicode code points P to be multiple grapheme clusters.
2. Version Sⱼ of Swift has knowledge of Unicode version Uⱼ, which redefines P to be a single grapheme cluster.
3. You build a program with version Sⱼ of Swift and encode your single Character(P). You then try to decode that same data with the same program built with version Sᵢ of Swift and it blows up. Well, maybe not blows up, but it would depend on what happens if Character.init(from:) encounters encoded data representing more than a single Character. Likely it would throw an error, but the data remains undecodable.
(This might sound contrived, but Unicode 11 updated the definition of grapheme clusters to be defined by a regular expression, and extended the definitions of what might be considered an "allowable" grapheme cluster, especially around pictographic sequences [emoji]. The yellow in that link is showing a diff from the previous version of the report [TR29-31] to the next [TR29-33]. So what might be considered a single grapheme cluster in Unicode 11 may be considered multiple in earlier versions of Unicode.)
As @SDGGiesbrecht notes in the other thread, it's likely you'll want to encode your Character as a String explicitly so that at decode time, you can figure out how you might want to deal with the failure modes directly.
So Unicode grapheme clusters for a character can change between versions. And Swift can only decipher a specific version of Unicode. There is no backward compatibility.
How is using a single-character String better than just a Character? Aren't they the same "on the wire"?
Then how do you deal with old data/files? For example, how does a web browser display an old page written in an older Unicode version? How can the browser tell what version of Unicode the page is in?
My use case: save a Swift struct out to a JSON file, then load the JSON file and decode it back to the Swift struct. At this moment, since I'm compiling the writing and reading code with the same Swift version, this works fine. But what happens in the future, when Unicode changes once again, Swift moves to that version, and I recompile my code? Can I no longer load my JSON?
You're absolutely correct that the underlying representation would likely end up being the same. The benefit to decoding the data as a String specifically is that String has no specific requirements on length or what it might contain. If you tried to decode "ab" as a Character, it would fail; if you try to decode it as a String it will succeed, and you can inspect the contents to figure out what to do next.
If you decode a String whose .count > 1, you know that either someone has messed with your data, or that the Character potentially came from a newer version of Unicode that considered that String to be a single Character.
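To make the "decode as String, then inspect" idea concrete, here is a minimal sketch. The names DecodedSymbol and classify(_:) are hypothetical and only illustrate one way to separate the cases described above; what you do with each case is up to your app.

```swift
import Foundation

// Hypothetical helper type: the outcomes worth distinguishing after
// decoding a would-be Character as a String.
enum DecodedSymbol: Equatable {
    case character(Character)   // exactly one grapheme cluster
    case empty                  // nothing was stored
    case multiple(String)       // tampered data, or a cluster this runtime splits
}

func classify(_ raw: String) -> DecodedSymbol {
    switch raw.count {
    case 0: return .empty
    case 1: return .character(raw[raw.startIndex])
    default: return .multiple(raw)
    }
}

// Decode plain strings (wrapped in an array so the JSON has a container
// at the top level), then decide per value what to do next.
let json = Data(#"["a", "ab", ""]"#.utf8)
let decoded = try JSONDecoder().decode([String].self, from: json)
let outcomes = decoded.map(classify)
print(outcomes)
```

Because the String always decodes, the .multiple case still carries the original text, so it can be logged, shown to the user, or kept around for a newer version of the app to reinterpret.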
Unicode maintains very strict backwards compatibility (e.g. old data carried forwards still has the same meaning), so this is rarely an issue. It's the other direction that's the problem (e.g. new definitions making it to older code which doesn't know about them).
Typically, software that receives unknown Unicode characters falls back to some default behavior somehow, depending on the context. For instance:
The Unicode 15 draft currently adds 4488 new characters (i.e. assigns semantic meaning to 4488 Unicode code points which were previously unassigned). One of these characters is U+1E030 MODIFIER LETTER CYRILLIC SMALL A. If I modify the HTML of a page to contain that character, this is what I see right now:
(Firefox 98.0.1 on Windows 10)
Since the browser doesn't know how to handle this character, and the fonts I have loaded don't have a glyph, it falls back to displaying the code point number in a box. (The browser truly doesn't know how to handle this: this character is a combining character, but the browser doesn't know this, because it doesn't have knowledge of Unicode 15, so it doesn't actually combine with anything.)
Unicode sometimes assigns semantic meaning to certain combinations of character sequences via Zero-Width-Joiners (ZWJ); for instance, Unicode 14 added "face with spiral eyes" out of U+1F635 DIZZY FACE (😵) + U+200D ZWJ + U+1F4AB DIZZY SYMBOL (💫). On your machine, this character sequence might appear correctly: 😵‍💫. On mine, however, it does not: although the character sequence is recognized as a single character (I can put my cursor on either side of it, but not in the middle), the Microsoft emoji font on my machine doesn't have a glyph for it, so the characters render side-by-side (😵💫):
Various software has different failure modes when it comes to this stuff.
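If you want to see how your own Swift runtime treats that ZWJ sequence, a small sketch like the following can help. It spells the sequence with escapes so the result doesn't depend on how an editor or forum renders it. The scalar count is fixed by the data; the Character count is deliberately not annotated, since it depends on the grapheme breaking rules the running Swift/OS combination implements.

```swift
// The ZWJ sequence from the example above: DIZZY FACE + ZWJ + DIZZY SYMBOL.
let spiralEyes = "\u{1F635}\u{200D}\u{1F4AB}"

// The scalars are part of the data itself, so this is the same everywhere.
let scalars = Array(spiralEyes.unicodeScalars)
print(scalars.map { "U+" + String($0.value, radix: 16, uppercase: true) })

// The number of Characters depends on the Unicode rules known to the
// runtime, so no particular value is promised here.
print("Character count:", spiralEyes.count)
```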
Edit: case in point, the above screenshot is what I saw while typing up this response. However, after posting, here's what I see:
Discourse has replaced the display of the default Windows emoji font with images of what appear to be the Apple emoji symbols for 😵 and 💫, and falls back to some other emoji font image for 😵‍💫. As said: various software has different failure modes when it comes to this stuff...
(I know Slack also tries to make emoji appear "consistently" across OSes in similar fashion, with sometimes similar failure modes; e.g., if I type 😵‍💫 into Slack, it actually decomposes it into 😵💫...)
This isn't something you'll typically need to deal with, but the opposite: you run your app on macOS with a newer version of Swift, which knows about newer Unicode definitions. You produce data with that app on that OS, and try to send it over to an older computer with a much older version of Swift in the OS. You launch the same app and try to decode, but the Character you encoded is no longer recognized as a single Character with the older Unicode definitions. The end result is entirely dependent on the failure mode of how you decode.
This sort of backwards compatibility is why decoding typically needs to be very permissive, because once you release a very strict version of your app, you need to maintain that strictness or else it'll choke on newer data.
In general, if your decode fails very gracefully, or tries to recover in some meaningful way, at least the experience won't degrade. There isn't necessarily anything you can do about this, but it's a principle to keep in mind as you write your code.
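As an illustration of a decode that recovers instead of failing, here is a minimal sketch of a struct that keeps a Character in memory but puts a String on the wire. Tile and its symbol property are hypothetical names, and substituting U+FFFD REPLACEMENT CHARACTER for unexpected input is just one possible policy, assumed here for the sake of the example.

```swift
import Foundation

// Hypothetical model type; only `symbol` needs special handling.
struct Tile: Codable, Equatable {
    var symbol: Character

    init(symbol: Character) { self.symbol = symbol }

    private enum CodingKeys: String, CodingKey { case symbol }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        let raw = try container.decode(String.self, forKey: .symbol)
        // Permissive policy (an assumption of this sketch): anything other
        // than exactly one grapheme cluster becomes U+FFFD instead of
        // failing the whole decode.
        symbol = raw.count == 1 ? raw[raw.startIndex] : "\u{FFFD}"
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        // Store a String so any reader can decode and inspect it.
        try container.encode(String(symbol), forKey: .symbol)
    }
}

let original = Tile(symbol: "é")
let data = try JSONEncoder().encode(original)
let restored = try JSONDecoder().decode(Tile.self, from: data)

// Input that is not a single Character still decodes, with a placeholder.
let odd = try JSONDecoder().decode(Tile.self, from: Data(#"{"symbol":"xx"}"#.utf8))
print(restored.symbol, odd.symbol)
```

One trade-off to be aware of: the placeholder replaces the original text, so re-saving the value loses it. If that matters, keeping the raw String alongside the Character is an alternative.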
Thank you for the explanation! I think I am clear on compatibility: using the String type to encode/decode a Character is better because it can handle an "unknown" character.
So if I want to use Character, not String, in my own type that's Codable, can I do this:
import Foundation

extension Character: Codable {
    public init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        let s = try container.decode(String.self)
        // if it's not a single character, use code FFFF to indicate illegal value
        self = s.count == 1 ? s.first! : "\u{FFFF}"
    }

    public func encode(to encoder: Encoder) throws {
        // encode the Character as a single-character String
        var container = encoder.singleValueContainer()
        try container.encode(String(describing: self))
    }
}

// Decode a JSON string into a Character and print it
func decodeToChar(_ input: String) {
    let decoder = JSONDecoder()
    if let data = input.data(using: .utf8), let result = try? decoder.decode(Character.self, from: data) {
        print(result) // works
    } else {
        print("No dice!")
    }
}

// Encode a Character to JSON and print the result
func encodeChar(_ value: Character) {
    let encoder = JSONEncoder()
    if let jsonData = try? encoder.encode(value) {
        let output = String(data: jsonData, encoding: .utf8) ?? "Not valid json"
        print("Json:", output)
    } else {
        print("encode something went wrong")
    }
}

let goodInput = "\"๐ฆ\""
decodeToChar(goodInput) // prints: ๐ฆ
let badInput = "\"xx\""
decodeToChar(badInput) // prints:
let char: Character = "๐ฆ"
encodeChar(char) // prints: Json: "๐ฆ"
There is a wrinkle to this: Until Swift 5.6, grapheme cluster breaking relied on ICU. Therefore, Swift versions and Unicode versions were not bound together. In the most recent release of Swift, a native implementation of the most recent Unicode grapheme breaking rules was implemented.
I have not thought it through entirely but it is conceivable that one could contrive a scenario where an app compiled with Swift 5.5 uses a newer Unicode grapheme cluster breaking algorithm in the future (via an updated version of ICU) than another app compiled with Swift 5.6.
For the most part, yes, though see @xwu's comment for a very niche edge case. I don't think it's something you'll need to worry about in practice as long as your code is permissive, but it's something to be generally aware of.
Yes, though some minor notes:
- Instead of U+FFFF, which is a Unicode "non-character" (an internal-use character not typically intended for usage), you may also consider using U+FFFD REPLACEMENT CHARACTER, which indicates that the underlying character was unrecognized or unrepresentable in Unicode
- Instead of String(describing: self), you can just use String.init(_:) which takes a Character directly
- It's generally not recommended to conform a type you don't own (Character) to a protocol you don't own (Codable); I'd suggest inlining this implementation into the actual struct you mentioned which encodes and decodes the Character (i.e., implement init(from:) and encode(to:) on that struct rather than on Character)
A side question: can you recommend a resource that has a bunch of weird strings to test correct Unicode handling? (This is in relation to that thread, for which I have code that seems to work fine, but my fear is that it can break on some platforms when some unusual string comes along.)
It depends on what you mean by "weird strings", and what sort of handling you're looking to test. If you want to stress test some code with strings that you might find "validly" in the wild, I'd start with Unicode's own text files for various tests and definitions they offer:
- emoji-variation-sequences.txt: emoji sequences which contain U+FE0E VARIATION SELECTOR-15 (i.e., display this character as text, e.g. "☂︎"; edit: Discourse "helpfully" replaces this text with an emoji image of an umbrella, despite the explicit variation selector) or U+FE0F VARIATION SELECTOR-16 (i.e., display this character as emoji, e.g. "☂️"). Each of these is a single grapheme cluster you can use for testing
- The emoji ZWJ sequence data: emoji sequences joined by U+200D ZERO WIDTH JOINER (e.g. U+1F468 MAN + U+200D + U+1F468 MAN + U+200D + U+1F467 GIRL + U+200D + U+1F466 BOY → "family: man, man, girl, boy"; or U+1F469 WOMAN + U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4 🏽 + U+200D + U+1F680 ROCKET → "woman astronaut: medium skin tone"). Each of these is also a single grapheme cluster you can use for testing, and some get pretty long
These files have somewhat different formats, but they are pretty standardized and don't change much between Unicode versions, so it shouldn't be difficult to parse them.
If you're looking for abusive strings that would likely need to be manually and/or maliciously constructed, I don't know of a resource off the top of my head. You could try out a "Zalgo Text" generator which adds many combining characters to strings to create c̸̛u̷͝r̵̕s̶͠e̴͘d̷͝ text, but otherwise, I've put together test cases in the past manually.
If you do want to put together test cases manually, I'd start with the regex definitions of grapheme clusters and manually construct some out of various components in the regex. In particular, hangul-syllable and xpicto-sequence are unbounded, so you can also construct grapheme clusters that are arbitrarily long and test those too.
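As one way to build such a test case by hand, here is a small sketch that stacks combining marks onto a base letter, in the spirit of the Zalgo text mentioned earlier, rather than using the Hangul or pictographic routes. U+0301 COMBINING ACUTE ACCENT is an arbitrary choice of combining mark, and markCount is a hypothetical knob; the point is that the Character count stays at one while the scalar and UTF-8 counts grow.

```swift
// One base letter followed by many combining marks is still one Character.
// U+0301 COMBINING ACUTE ACCENT is used as an arbitrary combining mark.
let markCount = 50
let stacked = "e" + String(repeating: "\u{0301}", count: markCount)

print(stacked.count)                 // Characters (grapheme clusters)
print(stacked.unicodeScalars.count)  // Unicode scalars
print(stacked.utf8.count)            // UTF-8 bytes

// Cutting by UTF-8 offset can land inside the cluster, which is the kind
// of byte-count/offset hazard worth covering in tests.
let firstThreeBytes = Array(stacked.utf8.prefix(3))
print(firstThreeBytes)
```

Raising markCount gives arbitrarily long single-Character strings for stress tests of anything that assumes a Character has a small, bounded byte length.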
For that task, I'm specifically after cases that are suspected of changing Unicode composition in ways that could change the corresponding UTF-8 byte counts / offsets. Thank you.
My test using this string:
"😀\u{FE0E}"
shows this selector is not obeyed anywhere. The only place where it works is the Xcode playground console output area. Everywhere else I tested shows the regular emoji.
Variation selectors don't apply to all emoji; in particular, U+1F600 GRINNING FACE 😀 doesn't have a "textual" presentation.
Unicode scalars have a property called Emoji (exposed in Swift as isEmoji), which indicates whether the scalar has an emoji presentation, and a boolean property called Emoji_Presentation which defines whether it displays as emoji by default. For some characters, variation selectors allow you to request display of one of these two presentations. These apply when:
- A scalar is an emoji but defaults to text presentation, e.g. U+2603 SNOWMAN ☃: U+2603 SNOWMAN ☃ + U+FE0F VARIATION SELECTOR-16 → ☃️, with U+FE0F requesting the emoji variant
- A scalar defaults to emoji presentation, e.g. U+1F3C4 SURFER 🏄: U+1F3C4 SURFER 🏄 + U+FE0E VARIATION SELECTOR-15 → 🏄︎, with U+FE0E requesting the textual variant
Unicode spells out the rules around these properties and how they're treated, but a visual presentation works too:
func variations(_ base: Character) {
    // Look at the properties of the character's first scalar.
    let properties = base.unicodeScalars.first!.properties
    if properties.isEmoji {
        print(base, "(emoji presentation: \(properties.isEmojiPresentation))", "\(base)\u{FE0E}", "<=>", "\(base)\u{FE0F}")
    } else {
        print(base, "(not emoji)")
    }
}

variations("a") // => a (not emoji)
variations("\u{2603}") // => ☃ (emoji presentation: false) ☃︎ <=> ☃️
variations("\u{1F3C4}") // => 🏄 (emoji presentation: true) 🏄︎ <=> 🏄️
Some characters default to presenting as emoji, so the variation selector can act as an override to explicitly request text, and vice versa. Note that these selectors are effectively a request to display text a certain way; on macOS 12.3, no system font supports a textual representation of 🏄, so I still see a pictorial representation of the surfer in all cases.
In any case, U+1F600 GRINNING FACE 😀 doesn't have a textual presentation to begin with, so the variation selector is ignored.
😀 GRINNING FACE isEmoji: true, isEmojiPresentation: true
🏄 SURFER isEmoji: true, isEmojiPresentation: true
If a scalar has isEmoji == true and isEmojiPresentation == true, how do I know whether it has a textual form or not?
Unfortunately, I don't believe there's currently a programmatic way to know. emoji-variation-sequences.txt covers all of the options for a given version of Unicode (the TR51 Data Files table lists the contents of that file as All permissible emoji presentation sequences and text presentation sequences), and you can process that file into a list, but Unicode doesn't otherwise offer a semantic property name for "has a textual presentation".
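Processing that file into a list could look roughly like the sketch below. It assumes each data line has the shape "code points ; description ; # comment", with the base scalar followed by U+FE0E for a text presentation sequence; the sample text is a made-up excerpt in that assumed layout, not a copy of the real file, and textPresentationBases(in:) is a hypothetical helper name.

```swift
// Collect the base scalars that have a text presentation sequence,
// i.e. lines whose code point field is "<base> FE0E".
func textPresentationBases(in contents: String) -> Set<Unicode.Scalar> {
    var result = Set<Unicode.Scalar>()
    for line in contents.split(separator: "\n") {
        // Drop any trailing comment, then take the field before the first ';'.
        let data = line.split(separator: "#", maxSplits: 1, omittingEmptySubsequences: false)[0]
        guard let sequence = data.split(separator: ";").first else { continue }
        let codePoints = sequence.split(separator: " ").compactMap { UInt32($0, radix: 16) }
        guard codePoints.count == 2, codePoints[1] == 0xFE0E,
              let base = Unicode.Scalar(codePoints[0]) else { continue }
        result.insert(base)
    }
    return result
}

// Illustrative excerpt in the assumed layout (not the real file contents).
let sample = """
# sample data
2603 FE0E  ; text style;  # SNOWMAN
2603 FE0F  ; emoji style; # SNOWMAN
1F3C4 FE0E ; text style;  # SURFER
1F3C4 FE0F ; emoji style; # SURFER
"""

let bases = textPresentationBases(in: sample)
print(bases.map { String($0.value, radix: 16, uppercase: true) }.sorted())
```

With the real file loaded into a String, checking whether a scalar "has a textual presentation" becomes a set lookup, valid for whichever Unicode version that copy of the file describes.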
Thank you very much. You have taught me a whole lot about Unicode and emoji characters. I could never have learned this through my own research!