String: what's a good unicode character (code point) to indicate "invalid"?

  1. short — best if it is a single character, for fast comparison
  2. it is never displayed to the user
  3. best if it cannot be typed into a text field by the user

Does Unicode have a code point for "invalid" I can use for this?

Maybe U+FFFD? That is easily used in Swift with "\u{FFFD}". But in any use case I can dream up, it would be better to just use a String? set to nil.

P.S. There is no such thing as an untypable character.
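A minimal Swift illustration of both options (the variable names are just for illustration):

```swift
// The replacement character is easy to spell in Swift source:
let invalidMarker = "\u{FFFD}"  // U+FFFD REPLACEMENT CHARACTER

// But an optional usually models "no valid value" more directly:
var name: String? = nil
let display = name ?? "(invalid)"  // display == "(invalid)"
```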

I don’t know your use case but I agree that a String? with nil would generally be a better way to handle an ‘invalid’ value.

If you absolutely need a one character sentinel value in a string, you might consider one of the unused ASCII terminal control characters.

0x15 is ‘NEGATIVE ACKNOWLEDGE’ (NAK), which doesn’t have an active use in Unicode text. I haven’t done this myself, so I don’t know of possible undesirable side effects.

U+FFFD � REPLACEMENT CHARACTER is ideal for this, yes—however it is entirely possible that a user might paste U+FFFD in from somewhere else.

If it is important to you that the character cannot be typed, it is best to use a “noncharacter”, the full list of which is here: Q: Which code points are noncharacters?. It doesn’t matter which you pick (although probably don’t use U+FFFE or U+FFFF): noncharacters are explicitly reserved for internal application usage, so you can assign them whatever meaning you want. In the extremely rare case that a user pastes your chosen noncharacter into the text field, you should replace it with U+FFFD to ensure the noncharacter retains your specialized meaning.
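A sketch of that sanitizing step in Swift, assuming U+FDD0 as the chosen noncharacter (the sentinel choice and the function name are illustrative, not a standard API):

```swift
// U+FDD0 is one of the noncharacters reserved for internal use.
let invalidSentinel: Unicode.Scalar = "\u{FDD0}"

/// Replaces any occurrence of the sentinel noncharacter in user input
/// with U+FFFD, so the sentinel keeps its internal meaning.
func sanitize(_ input: String) -> String {
    String(String.UnicodeScalarView(input.unicodeScalars.map {
        $0 == invalidSentinel ? "\u{FFFD}" : $0
    }))
}

let cleaned = sanitize("abc\u{FDD0}def")  // "abc\u{FFFD}def"
```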

U+FFFE is the byte‐reversed representation of the byte order mark. If you put that in a string or file, Unicode‐compliant applications and APIs may interpret that as a flag that the string was corrupted by mismatched endianness and try to repair it. If you write that into a database and reload it, all its strings might become byte‐reversed gibberish.

I do not recommend using noncharacters unless you are certain they never leave the memory of your program, and you never pass them into APIs outside your control. To be sure of that, you would pretty much have to use [Unicode.Scalar] the whole time instead of String.
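Keeping the data as scalar arrays might look like this (a sketch; the choice of U+FDD0 as the sentinel is arbitrary):

```swift
// Holding text as [Unicode.Scalar] so the sentinel never passes
// through String APIs outside your control.
let sentinel: [Unicode.Scalar] = ["\u{FDD0}"]

var fields: [[Unicode.Scalar]] = [
    Array("hello".unicodeScalars),
    sentinel,  // marks an invalid entry
]

let validCount = fields.filter { $0 != sentinel }.count  // 1
```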

In fact, passing to APIs outside your control is precisely why one would want to use noncharacters (instead of a Swift native type like String?)—to quote the Unicode FAQ:

But the intended use of noncharacters requires the ability to exchange them in a limited context, at least across APIs and even through data files and other means of "interchange", so that they can be processed as intended.

Although better options most definitely should be used if available, needing to represent a String? value unambiguously via a plaintext, non‐public‐facing API (for example, a save file format) would be an entirely appropriate use of noncharacters. (Of course, said use should be clearly documented.)
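As a sketch of that save‐file scenario, assuming U+FDD0 as the documented nil marker (the marker choice and function names are hypothetical):

```swift
// U+FDD0 stands for "no value" in this hypothetical plain-text format.
// A real format would also need to escape or reject the marker inside
// valid values; this sketch ignores that.
let nilMarker = "\u{FDD0}"

func serialize(_ value: String?) -> String {
    value ?? nilMarker
}

func deserialize(_ text: String) -> String? {
    text == nilMarker ? nil : text
}

deserialize(serialize(nil))   // nil
deserialize(serialize("hi"))  // "hi"
```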

I agree that U+FFFE (and U+FFFF) are not ideal. For one thing, these characters are impossible to represent in XML. And as stated, U+FFFE might be misinterpreted if it is the first character of a UTF-16 document without a BOM. However, this still leaves 64 noncharacters, 32 of which (U+FDD0..U+FDEF) are otherwise unremarkable code points in the middle of the BMP.

For the avoidance of doubt, noncharacters are something of a “last‐resort” feature of Unicode—you should use better mechanisms when better mechanisms exist. But sometimes you just really need to serialize to a string, and you have no say in the matter. In that case, noncharacters are, while maybe not good per se, certainly better than any other character you could choose for that purpose.

If you are processing code points as integers, any value outside the range 0...0x10FFFF will work. I guess 0xFFFFFFFF is an aesthetically pleasing one.
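In Swift, this is easy to check, since `Unicode.Scalar` has a failable initializer that rejects out-of-range values:

```swift
// Any value above 0x10FFFF can never be a real Unicode scalar,
// so it makes a safe integer-level sentinel.
let invalidCodePoint: UInt32 = 0xFFFF_FFFF

let scalar = Unicode.Scalar(invalidCodePoint)  // nil
let letterA = Unicode.Scalar(0x41 as UInt32)   // "A"
```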