Prepitch: Character integer literals

Would the UInt16 character literal be a 21-bit scalar value that can be represented in 16 bits, or would you also allow high-surrogate and low-surrogate code points?

A 21-bit Unicode scalar that needs to be split across two surrogate code points will overflow a UInt16, so you won’t be able to search for it directly using character literal syntax in a UInt16 buffer, in the same way you can’t directly search for ‘é’ in an Int8 buffer encoded as UTF-8. UTF-16 was always a compromise.
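
To make that concrete with a quick sketch (valid Swift today): a supplementary-plane scalar such as U+1300E occupies two UTF-16 code units, so no single UInt16 value can stand for it.

let units = Array("\u{1300E}".utf16)        // a scalar outside the BMP
print(units.map { String($0, radix: 16) })  // ["d80c", "dc0e"], a surrogate pair
print(units.count)                          // 2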

By surrogate code points, I meant literals in the range "\u{D800}" ... "\u{DFFF}".

let unicodeScalar: Unicode.Scalar = "\u{D800}" // error: invalid unicode scalar

let utf16CodeUnit: Unicode.UTF16.CodeUnit = "\u{D800}" // alternative to 0xD800

I don't think this is a good idea, so I'm glad that it was only my misunderstanding.

Inevitably, the new integer conformances to ExpressibleByUnicodeScalarLiteral surface to developers some of the intricacies of character encoding that Swift has studiously sought to avoid with its modern, high-level String model. The advantages of this proposal are twofold. First, it introduces a new class of literal for Character (with a capital C), as distinct from String, which has ergonomic advantages. Second, once that distinction has been made, the new conformances can be introduced with a minimum of disruption, with helpful diagnostics provided when they are misused.

1 Like

Hm, I agree: UCS-2 has the significant benefit that it's compatible with UTF-16 -- unlike Latin-1 vs UTF-8. So, at the very least, allowing UInt16 to be initialized with any BMP character wouldn't add glaring inconsistencies with Swift's "native" String encodings.
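
That compatibility is easy to check in today's Swift (a small sketch): for a BMP character, the single UTF-16 code unit is the scalar value itself, whereas the UTF-8 bytes of a non-ASCII character are not.

print(Array("\u{82B1}".utf16))  // [33457], i.e. the scalar value of 花 (U+82B1)
print(Array("\u{E9}".utf8))     // [195, 169], not the scalar value 0xE9 of é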

However, UCS-2 and UTF-16 aren't the only two char encodings with 16-bit units -- see GB2312, Big5, JIS X 0208, etc. I don't have the cultural background to judge whether these standards cause the same sort of confusion as legacy 8-bit codes do. (To be honest, I have not even wrapped my head around how these work in practice. ISO-2022? DBCS? SBCS? Oh my...) It may be a good idea to find someone with relevant experience and see if this gives them the heebie-jeebies like 'é' as UInt8 does to me:

let maybeNotSoBadAfterAll: UInt16 = '啊'

I think supporting Int16 would be a stretch, though.

5 Likes

I really don’t think any of this makes sense above ASCII; it is absolutely fraught with peril.

let number: UInt32 = ';' // U+037E GREEK QUESTION MARK, which NFC maps to U+003B (an ordinary semicolon)
assert(number == 0x37E, "My number changed?!?")

If a source editor, pasteboard, file sync, checkout, or download performs NFx normalization, the above code will break.

1 Like

This is already the case with String literals. Source processing tools seem to be generally ready for this -- I can't recall any corruption issues.

I think it is better to have a simpler and more consistent rule, with fewer special cases. In particular, it would be unexpected and surprising if the behavior for character-literals-as-integers were different from the following:

// Interprets a single-scalar Character as its Unicode scalar value.
// Traps if the character comprises more than one scalar, or if the value
// does not fit in the destination integer type.
func scalarValue<T: BinaryInteger>(_ c: Character) -> T {
  precondition(c.unicodeScalars.count == 1)
  return numericCast(c.unicodeScalars.first!.value)
}
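
For example (valid Swift today; ✓ is a single-scalar character, U+2713):

let check: UInt32 = scalarValue("✓")  // 10003, i.e. 0x2713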

Swift is already Unicode aware, as a deliberate design choice. We can already write "\u{n}" to turn an integer literal into a Character. With this proposal to go the other way, writing '\u{n}' should give back the same value of n that went in, or produce an overflow error if the destination type is too small.

Where and how would such a "truncated UTF-32" encoding be useful in actual practice?

This is not the case with String literals.

This will never fail, no matter whether the file’s encoding or normalization changes:

let string = ";"
assert(string == "\u{37E}", "My character changed?!?")

Sure, but this will:

let scalar: Unicode.Scalar = ";"
assert(scalar == Unicode.Scalar(0x37e))

…anywhere that someone wants to represent an integer in source code by using the corresponding Unicode character surrounded by single quotes.

That is the entire point of this proposal after all.

• • •

Personally, I am skeptical about whether we need such a feature in the language, but seeing how this discussion has proceeded for hundreds of posts, I’m willing to accept that some people find it beneficial to write integers that way.

And if we’re going to allow it, we might as well do it the obvious way that works as expected.

• • •

Let me put it this way:

If someone needs to be a Unicode expert to understand how the feature works and use it correctly, then it definitely does not belong in a high-visibility part of the Swift language. That would make it an attractive nuisance.

On the other hand, if the feature is intended for use by non-experts in Unicode, then it needs to work the way those people want and expect it to. Specifically:

If the character is a single scalar, then turning it into an integer should give that scalar. And if the character is not a single scalar, then turning it into an integer should fail.

2 Likes

LOL. You’re right. And this would no longer even compile with NFx:

let scalar: Unicode.Scalar = "שּׂ"

There is a reason Swift steers people toward String and Character, and leaves Unicode.Scalar hidden away for those who know what they’re doing.

These are safe:

let string: String = "שּׂ"
let character: Character = "שּׂ"

These are dangerous and not Unicode‐compliant at the source level:

let scalar: Unicode.Scalar = "שּׂ"
let number: UInt32 = 'שּׂ'

They are only safely expressed like this:

let scalar: Unicode.Scalar = "\u{FB2D}"
let number: UInt32 = 0xFB2D
let scalar2 = Unicode.Scalar(0xFB2D as UInt32)!
let number2: UInt32 = '\u{FB2D}'

I don’t see how the last line is an improvement.


Interestingly—and very relevant—I had trouble posting this properly, because at first I was copying and pasting the character to reuse it, and Safari (or else the forum interface) was performing NFC on either the copy or the paste. In the end I had to stop copying and keep going back to the character palette to make it work properly.

Try it yourself. Copy and paste it to the text field of your own post (but don’t actually post it, or you’ll annoy us all). Then compare the bytes in the text field and the preview to the bytes where you copied it from. Not the same.

If you copy from my post to a playground, it will compile and run, but if you copy from my post to your own and then copy that to the playground, it will not compile.

(Edit: * This assumes you are using Safari and Xcode.)

I think that demonstrates fairly soundly that tools are not prepared for this sort of thing. (Even though they are Unicode‐compliant.)

1 Like

You can easily do this by simply using the appropriate integer type, i.e., UInt32.

let fine: UInt32 = '✓'

Or, even better, you can use Unicode.Scalar to do this in a more type-safe way.

let terrific: Unicode.Scalar = '✓'

Unicode.Scalar is a tiny wrapper type around a UInt32 value; using it makes it clear that we're dealing with Unicode code points, not random integers.

There is simply no way to represent a random Unicode character in less than 32 bits without dealing with encodings. Amazingly, during the last two decades, the world has somehow mostly agreed to use a sensible, universal 8-bit encoding, UTF-8. UTF-8 is already deeply integrated into Swift as the default text encoding; I would prefer to go forward by extending this support even further, rather than reverting back to anything else.

In particular, this would be entirely unacceptable to me:

let café1: [UInt8] = ['c', 'a', 'f', 'é']
let café2: [UInt8] = "café".utf8.map { $0 }

print(café1) // [99, 97, 102, 233]
print(café2) // [99, 97, 102, 195, 169]

The Swift-native way to encode the character é into 8-bit bytes is to use UTF-8, resulting in [195, 169]. Introducing a language feature that does anything else looks like a bad idea to me.

I agree; however, this feature (including support for UInt8) wasn't sold as a Unicode-enabling feature -- it's supposed to make it easier to do things like parsing binary file formats that include some ASCII bits.

I don't think limiting UInt8/Int8 initialization to ASCII characters is a particularly difficult thing to accept, given that it simplifies Swift's text processing story. Reviving the legacy Latin-1 encoding would have huge potential for accidental misuse, as demonstrated in the above snippet. This is especially the case for people not well-versed in Unicode matters.
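
For illustration, here is roughly the use case as I understand it, sketched with the proposed single-quote syntax (the helper and its name are made up for illustration, and the single-quote literals do not compile today). Only ASCII literals appear, and they read better than bare hex for the letter bytes:

func looksLikePNG(_ bytes: [UInt8]) -> Bool {
    // PNG signature: 0x89 'P' 'N' 'G' 0x0D 0x0A 0x1A 0x0A
    return bytes.count >= 8 &&
        bytes[0] == 0x89 &&
        bytes[1] == 'P' && bytes[2] == 'N' && bytes[3] == 'G' &&
        bytes[4] == 0x0D && bytes[5] == 0x0A &&
        bytes[6] == 0x1A && bytes[7] == 0x0A
}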

6 Likes

I think we're debating these 3 options, but please let me know if I missed something.

Option 1: Simple truncation

            | UInt8 | UInt16 | UInt32 | Unicode.Scalar | Character
'x'         | 120   | 120    | 120    | U+0078         | x
'©'         | 169   | 169    | 169    | U+00A9         | ©
'花'        | error | 33457  | 33457  | U+82B1         | 花
'𓀎'         | error | error  | 77838  | U+1300E        | 𓀎
'👩‍👩‍👦‍👦'    | error | error  | error  | error          | 👩‍👩‍👦‍👦
'ab'        | error | error  | error  | error          | error

(What about signed integers? Should they refuse to set the sign bit, or are they bitwise-equivalent to unsigned?)

Option 2: Valid Encoding Width

            | UInt8 | UInt16 | UInt32 | Unicode.Scalar | Character
'x'         | 120   | 120    | 120    | U+0078         | x
'©'         | error | 169    | 169    | U+00A9         | ©
'花'        | error | 33457  | 33457  | U+82B1         | 花
'𓀎'         | error | error  | 77838  | U+1300E        | 𓀎
'👩‍👩‍👦‍👦'    | error | error  | error  | error          | 👩‍👩‍👦‍👦
'ab'        | error | error  | error  | error          | error

Option 3: ASCII or UTF32

            | UInt8 | UInt16 | UInt32 | Unicode.Scalar | Character
'x'         | 120   | 120    | 120    | U+0078         | x
'©'         | error | error  | 169    | U+00A9         | ©
'花'        | error | error  | 33457  | U+82B1         | 花
'𓀎'         | error | error  | 77838  | U+1300E        | 𓀎
'👩‍👩‍👦‍👦'    | error | error  | error  | error          | 👩‍👩‍👦‍👦
'ab'        | error | error  | error  | error          | error

Option 1 feels the least desirable to me. To write correct code, the developer has to understand Unicode encoding details and be vigilant against bugs. This means they should pretty much never use non-ASCII character literals, unless they're really interested in processing Latin-1 or a wider integer type is out of the question. That seems very niche to me, and the wrong tradeoff.

Option 2 fixes this, and would be convenient when processing UTF-16 encoded data. Scanning UTF-8 for specific ASCII characters is tolerant of multi-byte scalar values because we reject multi-code-unit literals. Similarly, scanning UTF-16 for certain BMP scalars is tolerant of non-BMP scalars because we reject surrogate literals. But it does have the effect of further baking encoding details into our model.
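
As a concrete sketch of that Option 2 case, written with the proposed single-quote syntax (so it does not compile today): a non-surrogate BMP scalar fits in a UInt16, and because surrogate code units never equal such a value, a naive scan cannot match in the middle of a surrogate pair.

let needle: UInt16 = '花'                 // 0x82B1 under Option 2 semantics
let haystack = Array("百花齐放 𓀎".utf16)  // a UTF-16 buffer containing non-BMP content
let found = haystack.contains(needle)     // true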

Option 3 seems the simplest model for users who don't want or need to worry much about Unicode or encodings. It also avoids any concern about the signedness of types, as no sign bit is ever set. ASCII is special in computing, will likely remain so for the rest of our lifetimes, and is the most common use case anyway. Everything else is a Unicode scalar value, and you need the full width to accommodate it. But it is less useful than Option 2 for users who want non-ASCII BMP scalars width-restricted to 16 bits.

I feel like Option 3 is a tiny bit better than Option 2. Both Options 2 and 3 seem significantly better than Option 1.

11 Likes

The three tables look like an excellent summary to me. Any of the three options would suit me. I'm not too sure how much harm it'd cause to have character literals be simply truncated Unicode scalars. I find String and Character are pretty good at providing a high-level interface matching general expectations, while working with anything lower level is already full of pitfalls you need to be aware of. I'd probably favor Option 2 if I had to explain to someone how UTF-8 and UTF-16 work.

I think character literals should work for signed integers. I can't see how it could be harmful, and I can see a couple of ways it may be useful. For instance, it's not uncommon to do ASCII arithmetic when implementing parsers for Base64, numbers, or other binary-to-text schemes; having the ability to go negative might be useful in some cases. Also, CChar is signed on many platforms, so it might also help when interfacing with some C APIs.
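
For instance, something like the following (again written with the proposed syntax, so purely a sketch; the helper is made up for illustration) is the kind of ASCII arithmetic a hand-rolled parser does, and it would work just as well with Int8/CChar, since every ASCII value fits in 7 bits:

func hexDigitValue(_ byte: UInt8) -> UInt8? {
    switch byte {
    case '0'...'9': return byte - '0'
    case 'a'...'f': return byte - 'a' + 10
    case 'A'...'F': return byte - 'A' + 10
    default:        return nil
    }
}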

Text (even down to an individual scalar’s worth of text) has multiple, interchangeable byte representations, even within a single encoding (i.e. without leaving UTF‐8). Anything written in a source file is text, and may switch forms without asking the user. But the different forms produce different numbers, or even different numbers of numbers. Two equal source files should not produce two different programs. Nor should a source file produce a different program than its copy produces. But this is what starts to happen when Unicode characters are used as something other than Unicode characters.
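
To make that concrete (valid Swift today): two canonically equivalent spellings of é compare equal as Strings, yet encode to different bytes, and to a different number of bytes.

let nfc = "\u{E9}"       // precomposed é
let nfd = "e\u{301}"     // e followed by a combining acute accent
print(nfc == nfd)        // true: String comparison uses canonical equivalence
print(Array(nfc.utf8))   // [195, 169]
print(Array(nfd.utf8))   // [101, 204, 129]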

Option 4: Only What Is Safe

       | UInt8 | UInt16 | UInt32 | Unicode.Scalar  | Character | Notes
'x'    | 120   | 120    | 120    | U+0078          | x         | ASCII scalar
'©'    | error | error  | error  | U+00A9*         | ©         | Latin‐1 scalar
'é'    | error | error  | error  | U+00E9/error*   | é         | Latin‐1 scalar which expands under NFD
'花'   | error | error  | error  | U+82B1*         | 花        | BMP scalar
';'    | error | error  | error  | U+037E/U+003B*† | ;         | BMP scalar which changes under NFx
'שּׂ'    | error | error  | error  | U+FB2D/error*   | שּׂ         | BMP scalar which expands under NFx
'𓀎'    | error | error  | error  | U+1300E*        | 𓀎         | Supplemental plane scalar
'ē̱'    | error | error  | error  | error           | ē̱         | Character with no single‐scalar representation
'ab'   | error | error  | error  | error           | error     | Multiple characters

* Cells with an asterisk are existing behaviour that is dangerous. As you can see, some characters in these ranges are safe to initialize as literals and others aren’t. You have to know a lot about Unicode to know which are which. At least if you have gone to the effort to call Unicode.Scalar out from under Character, then you must already know you are dealing with Unicode, and you’ve taken a certain level of responsibility for what happens. But that waiving of safety is not as clear when you are thinking of them as numbers (UIntx).

† These are equal strings.

10 Likes

Nor should there be. The compiler simply raises an overflow error if the value does not fit in the destination type. This is exactly the same as with integer literals:

101 as Int8     // success
101 as UInt8    // success

233 as Int8     // error
233 as UInt8    // success

Thus:

'e' as Int8     // success
'e' as UInt8    // success

'é' as Int8     // error
'é' as UInt8    // success

• • •

Encodings are for strings. We are dealing with numbers. Specifically, using a Unicode character to represent an integer in source code. Whatever encoding the source code uses, the compiler can check for the presence of exactly one Unicode scalar in the character literal, and see if its value fits in the contextual type.

1 Like

No, a Unicode scalar value has a single byte representation in any given encoding. Scalar equality does not model canonical equivalence. Furthermore, if you ask for String.UnicodeScalarView, you will get the scalars encoded by the String, not some set of canonically-equivalent scalars (though this behavior has yet to be formalized).

Canonical equivalence does not apply to character literals, which exist at the scalar level. But it can inform these concerns, and I think you bring up an interesting point:

More concretely, normalization during copy-paste (fairly common) could break source code. For String literals, this wasn't as serious a concern, since it would only cause issues with hard-coded assumptions about the contents of the code unit or scalar views.
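
For example, this is the kind of hard-coded assumption that silent normalization can break today (valid Swift): if a tool rewrites the literal below from NFC to NFD, the scalar count changes from 1 to 2 and the assertion starts failing, even though the String still compares equal to what it was before.

let e = "é"                          // as typed: the single scalar U+00E9
assert(e.unicodeScalars.count == 1)  // fails if the literal is rewritten as "e\u{301}"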

No strong opinions; this also looks great to me.

I’m trying to decide in my mind between Option 2 and Option 4, and it feels like a classic flexibility vs. safety debate to me. I’m generally in favour of enabling developers rather than molly-coddling them, but in practical terms not much would be lost if we opted for safety and restricted all integer conversions to ASCII. Apart from introducing Character literals to the language, which is worthwhile in itself, the integer conversions are a niche feature, and 90% of the usage will likely be the literal ‘\n’ anyway. One political approach would be to restrict it to ASCII for now, avoiding a whole lot of documentation burden/debate, with a view to opening it up later if further use cases turn up. Though to be clear, I’d personally prefer Option 2.

2 Likes