Prepitch: Character integer literals

I think it is better to have a simpler and more consistent rule, with fewer special cases. In particular, it would be unexpected and surprising if the behavior for character-literals-as-integers were different from the following:

func scalarValue<T: BinaryInteger>(_ c: Character) -> T {
  precondition(c.unicodeScalars.count == 1)
  return numericCast(c.unicodeScalars.first!.value)
}

Swift is already Unicode aware, as a deliberate design choice. We can already write "\u{n}" to turn an integer literal into a Character. With this proposal to go the other way, writing '\u{n}' should give back the same value of n that went in, or produce an overflow error if the destination type is too small.
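The round trip already works with today's double-quoted syntax (the single-quoted '\u{n}' form is only the proposal's hypothetical spelling):

```swift
// Round-tripping \u{n} through Character with today's syntax.
let c: Character = "\u{37E}"                          // GREEK QUESTION MARK
precondition(c.unicodeScalars.count == 1)
precondition(c.unicodeScalars.first!.value == 0x37E)  // the same n comes back out
print("round-trip ok")
```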

Where and how would such a "truncated UTF-32" encoding be useful in actual practice?

This is not the case with String literals.

This will never fail, no matter whether the file’s encoding or normalization changes:

let string = ";"
assert(string == "\u{37E}", "My character changed?!?")

Sure, but this will:

let scalar: Unicode.Scalar = ";"
assert(scalar == Unicode.Scalar(0x37e))

…anywhere that someone wants to represent an integer in source code by using the corresponding Unicode character surrounded by single quotes.

That is the entire point of this proposal after all.

• • •

Personally, I am skeptical about whether we need such a feature in the language, but seeing how this discussion has proceeded for hundreds of posts, I’m willing to accept that some people find it beneficial to write integers that way.

And if we’re going to allow it, we might as well do it the obvious way that works as expected.

• • •

Let me put it this way:

If someone needs to be a Unicode expert to understand how the feature works and use it correctly, then it definitely does not belong in a high-visibility part of the Swift language. That would make it an attractive nuisance.

On the other hand, if the feature is intended for use by non-experts in Unicode, then it needs to work the way those people want and expect it to. Specifically:

If the character is a single scalar, then turning it into an integer should give that scalar. And if the character is not a single scalar, then turning it into an integer should fail.
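That rule can be sketched with today's APIs as a small helper (`integerValue` is a hypothetical name, and it returns nil where the proposed compiler feature would diagnose an error):

```swift
// Hypothetical helper modeling the rule above: a single-scalar Character
// yields its scalar value; anything else fails.
func integerValue<T: FixedWidthInteger>(_ c: Character) -> T? {
  guard c.unicodeScalars.count == 1 else { return nil }  // not a single scalar: fail
  return T(exactly: c.unicodeScalars.first!.value)       // overflow: fail
}

print(integerValue("e") as UInt8?)       // Optional(101)
print(integerValue("\u{E9}") as Int8?)   // nil: 0xE9 = 233 does not fit in Int8
```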

2 Likes

LOL. You’re right. And this would no longer even compile with NFx:

let scalar: Unicode.Scalar = "שּׂ"

There is a reason Swift steers people toward String and Character, and leaves Unicode.Scalar hidden away for those who know what they’re doing.

These are safe:

let string: String = "שּׂ"
let character: Character = "שּׂ"

These are dangerous and not Unicode‐compliant at the source level:

let scalar: Unicode.Scalar = "שּׂ"
let number: UInt32 = 'שּׂ'

They are only safely expressed like this:

let scalar: Unicode.Scalar = "\u{FB2D}"
let number: UInt32 = 0xFB2D
let scalar2 = Unicode.Scalar(0xFB2D)!
let number2: UInt32 = '\u{FB2D}'

I don’t see how the last line is an improvement.


Interestingly—and very relevant—I had trouble posting this properly, because at first I was copying and pasting the character to reuse it, and Safari (or else the forum interface) was performing NFC on either the copy or the paste. In the end I had to stop copying and keep going back to the character palette to make it work properly.

Try it yourself. Copy and paste it to the text field of your own post (but don’t actually post it, or you’ll annoy us all). Then compare the bytes in the text field and the preview to the bytes where you copied it from. Not the same.

If you copy from my post to a playground, it will compile and run, but if you copy from my post to your own and then copy that to the playground, it will not compile.

(Edit: * This assumes you are using Safari and Xcode.)

I think that demonstrates fairly soundly that tools are not prepared for this sort of thing. (Even though they are Unicode‐compliant.)

1 Like

You can easily do this by simply using the appropriate integer type, i.e., UInt32.

let fine: UInt32 = '✓'

Or, even better, you can use Unicode.Scalar to do this in a more type-safe way.

let terrific: Unicode.Scalar = '✓'

Unicode.Scalar is a tiny wrapper type around a UInt32 value; using it makes it clear that we're dealing with Unicode code points, not random integers.

There is simply no way to represent a random Unicode character in less than 32 bits without dealing with encodings. Amazingly, during the last two decades, the world has somehow mostly agreed to use a sensible, universal 8-bit encoding, UTF-8. UTF-8 is already deeply integrated into Swift as the default text encoding; I would prefer to go forward by extending this support even further, rather than reverting to anything else.

In particular, this would be entirely unacceptable to me:

let café1: [UInt8] = ['c', 'a', 'f', 'é']
let café2: [UInt8] = "café".utf8.map { $0 }

print(café1) // [99, 97, 102, 233]
print(café2) // [99, 97, 102, 195, 169]

The Swift-native way to encode the character é into 8-bit bytes is to use UTF-8, resulting in [195, 169]. Introducing a language feature that does anything else looks like a bad idea to me.

I agree; however, this feature (including support for UInt8) wasn't sold as a Unicode-enabling feature -- it's supposed to make it easier to do things like parsing binary file formats that include some ASCII bits.

I don't think limiting UInt8/Int8 initialization to ASCII characters is a particularly difficult thing to accept, given that it simplifies Swift's text processing story. Reviving the legacy Latin-1 encoding would have huge potential for accidental misuse, as demonstrated in the snippet above. This is especially the case for people not well-versed in Unicode matters.
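For what it's worth, Character already exposes an ASCII-only accessor along these lines (asciiValue, added in Swift 5): non-ASCII characters yield nil rather than a byte from some legacy encoding.

```swift
// Character.asciiValue is UInt8? — ASCII characters only, no Latin-1 fallback.
let a: Character = "a"
let copyright: Character = "\u{A9}"   // ©
precondition(a.asciiValue == 97)
precondition(copyright.asciiValue == nil)
print("asciiValue is ASCII-only")
```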

6 Likes

I think we're debating these 3 options, but please let me know if I missed something.

Option 1: Simple truncation

| Literal | UInt8 | UInt16 | UInt32 | Unicode.Scalar | Character |
|---|---|---|---|---|---|
| 'x' | 120 | 120 | 120 | U+0078 | x |
| '©' | 169 | 169 | 169 | U+00A9 | © |
| '花' | error | 33457 | 33457 | U+82B1 | 花 |
| '𓀎' | error | error | 77838 | U+1300E | 𓀎 |
| '👩‍👩‍👦‍👦' | error | error | error | error | 👩‍👩‍👦‍👦 |
| 'ab' | error | error | error | error | error |

(What about signed integers? Should they refuse to set the sign bit, or are they bitwise-equivalent to unsigned?)

Option 2: Valid Encoding Width

| Literal | UInt8 | UInt16 | UInt32 | Unicode.Scalar | Character |
|---|---|---|---|---|---|
| 'x' | 120 | 120 | 120 | U+0078 | x |
| '©' | error | 169 | 169 | U+00A9 | © |
| '花' | error | 33457 | 33457 | U+82B1 | 花 |
| '𓀎' | error | error | 77838 | U+1300E | 𓀎 |
| '👩‍👩‍👦‍👦' | error | error | error | error | 👩‍👩‍👦‍👦 |
| 'ab' | error | error | error | error | error |

Option 3: ASCII or UTF32

| Literal | UInt8 | UInt16 | UInt32 | Unicode.Scalar | Character |
|---|---|---|---|---|---|
| 'x' | 120 | 120 | 120 | U+0078 | x |
| '©' | error | error | 169 | U+00A9 | © |
| '花' | error | error | 33457 | U+82B1 | 花 |
| '𓀎' | error | error | 77838 | U+1300E | 𓀎 |
| '👩‍👩‍👦‍👦' | error | error | error | error | 👩‍👩‍👦‍👦 |
| 'ab' | error | error | error | error | error |

Option 1 feels the least desirable to me. To write correct code, the developer has to understand Unicode encoding details and be vigilant against bugs. This means they should pretty much never use non-ASCII character literals, unless they're really interested in processing Latin-1 or using a higher-width type is unfathomable. This seems very niche to me, and the wrong tradeoff.

Option 2 fixes this, and would be convenient when processing UTF-16 encoded data. Scanning UTF-8 looking for specific ASCII characters is tolerant of multi-byte scalar values because we reject multi-code-unit literals. Similarly, scanning UTF-16 for certain BMP scalars is tolerant of non-BMP scalars because we reject surrogate literals. But it does have the effect of baking encoding details further into our model.

Option 3 seems the simplest model for users who don't want or need to worry much about Unicode or encoding. It also avoids any concern about signed-ness of types, as no sign bit is ever set. ASCII is special in computing, and will likely be so for the rest of our lifetimes, and is the most common use case anyways. Everything else is a Unicode scalar value and you need the full width to accommodate them. But, it is less useful than Option 2 for users who want to have non-ASCII BMP scalars width-restricted to 16 bits.

I feel like Option 3 is a tiny bit better than Option 2. Both Options 2 and 3 seem significantly better than Option 1.
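Option 3's rule is simple enough to sketch as a hypothetical helper (the real feature would be compile-time, with errors instead of nil):

```swift
// Option 3 sketch: sub-32-bit types accept ASCII only;
// 32-bit-and-wider types accept any single scalar whose value fits.
func option3Value<T: FixedWidthInteger>(_ c: Character) -> T? {
  guard c.unicodeScalars.count == 1 else { return nil }
  let scalar = c.unicodeScalars.first!
  guard T.bitWidth >= 32 || scalar.isASCII else { return nil }
  return T(exactly: scalar.value)
}

print(option3Value("x") as UInt8?)        // Optional(120)
print(option3Value("\u{A9}") as UInt16?)  // nil: © is not ASCII
print(option3Value("\u{A9}") as UInt32?)  // Optional(169)
```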

11 Likes

The three tables look like an excellent summary to me. Any of the three options would suit me. I'm not too sure how much harm it'd cause to have character literals simply be truncated Unicode scalars. I find String and Character are pretty good at providing a high-level interface matching general expectations, while working with anything else is full of pitfalls you need to be aware of already. I'd probably favor Option 2 if I had to explain to someone how UTF-8 and UTF-16 work.

I think character literals should work for signed integers. I can't see how it can be harmful and I can see a couple of ways it may be useful. For instance, it's not uncommon to play ASCII arithmetic when implementing parsers for Base64, numbers, or other binary to text schemes; having the ability to go negative might be useful in some cases. Also, CChar is signed on many platforms so it might also help interfacing with some C APIs.

Text (even down to an individual scalar’s worth of text) has multiple, interchangeable byte representations, even within a single encoding (i.e. without leaving UTF‐8). Anything written in a source file is text, and may switch forms without asking the user. But the different forms produce different numbers, or even different numbers of numbers. Two equal source files should not produce two different programs. Nor should a source file produce a different program than its copy produces. But this is what starts to happen when Unicode characters are used as something other than Unicode characters.
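The instability described above is easy to demonstrate with today's String: two canonically equivalent forms compare equal, yet contain different numbers of scalars.

```swift
let precomposed = "\u{E9}"     // é as a single scalar
let decomposed  = "e\u{301}"   // e + COMBINING ACUTE ACCENT
precondition(precomposed == decomposed)              // equal as Swift Strings
precondition(precomposed.unicodeScalars.count == 1)
precondition(decomposed.unicodeScalars.count == 2)   // but different scalar counts
print("canonically equivalent, numerically different")
```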

Option 4: Only What Is Safe

| Literal | UInt8 | UInt16 | UInt32 | Unicode.Scalar | Character | Notes |
|---|---|---|---|---|---|---|
| 'x' | 120 | 120 | 120 | U+0078 | x | ASCII scalar |
| '©' | error | error | error | U+00A9\* | © | Latin‐1 scalar |
| 'é' | error | error | error | U+00E9/error\* | é | Latin‐1 scalar which expands under NFD |
| '花' | error | error | error | U+82B1\* | 花 | BMP scalar |
| ';' | error | error | error | U+037E/U+003B\*† | ; | BMP scalar which changes under NFx |
| 'שּׂ' | error | error | error | U+FB2D/error\* | שּׂ | BMP scalar which expands under NFx |
| '𓀎' | error | error | error | U+1300E\* | 𓀎 | Supplemental plane scalar |
| 'ē̱' | error | error | error | error | ē̱ | Character with no single‐scalar representation |
| 'ab' | error | error | error | error | error | Multiple characters |

* Cells with an asterisk are existing behaviour that is dangerous. As you can see, some characters in these ranges are safe to initialize as literals and others aren’t. You have to know a lot about Unicode to know which are which. At least if you have gone to the effort of pulling Unicode.Scalar out from under Character, then you must already know you are dealing with Unicode, and you’ve taken a certain level of responsibility for what happens. But that waiving of safety is not as clear when you are thinking of them as numbers (UIntx).

† These are equal strings.

10 Likes

Nor should there be. The compiler simply raises an overflow error if the value does not fit in the destination type. This is exactly the same as with integer literals:

101 as Int8     // success
101 as UInt8    // success

233 as Int8     // error
233 as UInt8    // success

Thus:

'e' as Int8     // success
'e' as UInt8    // success

'é' as Int8     // error
'é' as UInt8    // success

• • •

Encodings are for strings. We are dealing with numbers. Specifically, using a Unicode character to represent an integer in source code. Whatever encoding the source code uses, the compiler can check for the presence of exactly one Unicode scalar in the character literal, and see if its value fits in the contextual type.

1 Like

No, a Unicode scalar value has a single byte representation in any given encoding. Scalar equality does not model canonical equivalence. Furthermore, if you ask for String.UnicodeScalarView, you will get the scalars encoded by the String, not some set of canonically-equivalent scalars (though this behavior has yet to be formalized).

Canonical equivalence does not apply to character literals, which exist at the scalar level. But it can inform concerns, and I think you bring up an interesting point:

More concretely, normalization during copy-paste (fairly common) could break source code. For String literals, this wasn't as serious of a concern as it would only cause issues with hard-coded assumptions about the contents of the code unit or scalar views.

No strong opinions; this also looks great to me.

I’m trying to decide in my mind between Option 2 and Option 4, and it feels like a classic flexibility-vs.-safety debate to me. I’m generally in favour of enabling developers rather than molly-coddling them, but in practical terms not much would be lost if we opted for safety and restricted all integer conversion to only be possible from ASCII. Apart from introducing Character literals to the language, which is worthwhile in itself, the integer conversions are a niche feature, and 90% of the usage will likely be the literal ‘\n’ anyway. One political approach would be to restrict it to ASCII for now and avoid a whole lot of documentation burden/debate, with a view to opening it up later on if further use cases turn up. Though to be clear, I’d personally prefer Option 2.

2 Likes

I think we all understand how Option 1 works. The only thing going for it is that it is slightly easier to implement than the other options. As others and I have repeatedly tried to show, it leads to significant user-side complexity. It’s not at all simple:

  • It doesn’t integrate well with text processing features in the rest of the language. Swift’s stdlib prefers Unicode’s popular encodings (UTF-8, UTF-16, UTF-32), plus ASCII. You’re effectively arguing for the introduction of one more preferred encoding. At minimum, this requires you to demonstrate some highly convincing use cases.
  • It significantly deepens confusion about character encodings by hardwiring some implicit assumptions about specific 8-bit and 16-bit encodings directly into the language syntax. String’s API is careful to make such decisions explicit — we have dedicated utf8 and utf16 views, and they need to be explicitly spelled out in code that deals with them.*
  • The choice of the proposed 8-bit encoding is highly questionable. Option 1 (and to a much lesser degree, Option 2) effectively hardwires obsolete character encodings (ISO 8859-1 and UCS-2) directly into the language syntax.

 * Unfortunately, this middle argument also applies against Options 2–4. This proposal assumes that it is okay to simply define the byte value of ‘A’ as 65, ignoring those of us who prefer to use 193 instead. (Swift does run on platforms where that would be the natural choice.)

Characters aren’t integers, although there are many ways to map between the two. To convert a character (i.e., Unicode scalar) to an integer, you have to choose a character set and a corresponding encoding. There are many competing encodings, and choosing the wrong one will invariably lead to data loss or corruption. Ideally even ASCII should be an explicit choice.

If n is an integer written in hexadecimal, then "\u{n}" either uniquely determines a one-character string, or it is an error.

If it is not an error, then the resulting one-character string is uniquely determined by the integer n.

• • •

If c is a one-character string, then either it is uniquely determined by an integer n such that "\u{n}" produces c, or there is no such integer.

If there is an integer n such that "\u{n}" produces c, then that integer is unique.

• • •

The mapping between characters and integers is not ambiguous—it already exists in Swift code. This proposal is to provide a convenient compile-time way to utilize that mapping.

Deliberately hamstringing the feature to only work for ASCII would be antithetical to Swift’s support for Unicode.

Nevin, with all due respect, it looks like you may have missed my café1/café2 example above. We’re arguing to restrict the 8-bit (and possibly 16-bit) parts of this feature to ASCII in order to improve how it interacts with Unicode.

A fundamental idea of Unicode is that there isn’t a one-to-one correspondence between characters and particular bit sequences (a.k.a. integers). The character “é” is designated U+00E9 in the Unicode character set (although there are other ways to represent it, too). For convenience, Swift allows you to write this as "\u{00E9}" as part of its support for Unicode. But “00E9” is just an arbitrary identifier, like “LATIN SMALL LETTER E ACUTE”. It does not necessarily have anything to do with the numerical value of a particular encoding of this character.

Luckily, Unicode also defines several useful encodings to map its characters into actual bits that can be transmitted/stored/processed. UTF-8, UTF-16 and UTF-32 are the most frequently used encodings (for 8-bit, 16-bit, and 32-bit code units), but there are many more. UTF-32 is particularly nice in that it uses a trivial one-to-one mapping between code points and their encoded forms; this isn’t true for 8-bit or 16-bit encodings.
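Swift's views make the distinction concrete; only the scalar (UTF-32-style) view mirrors the code point directly:

```swift
let e = "\u{E9}"   // é, code point U+00E9 (233)
precondition(Array(e.utf8)  == [195, 169])  // UTF-8: two code units
precondition(Array(e.utf16) == [0xE9])      // UTF-16: one code unit, equal to the code point
precondition(e.unicodeScalars.map { $0.value } == [0xE9])  // scalar value == code point
print("only UTF-32 is a trivial mapping")
```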

Which Unicode encoding do you think encodes “é” as a single 8-bit value of 233?

4 Likes

Given the exact words I wrote, you are technically correct. My intended meaning was more like “text that is only one scalar long”, which is the way it lives in a source file. I fixed my statement for clarity. Thanks for the heads up.

Okay, now I think we’re finally getting somewhere productive. In my understanding, 00E9 is the hexadecimal representation of the number 233. It is exactly and exclusively the single numerical value which Unicode associates with the character “LATIN SMALL LETTER E ACUTE”.

If a programmer asks the Swift compiler for the numerical value of “LATIN SMALL LETTER E ACUTE”, the only sensible answer is the hexadecimal number 00E9, which is 233. Unicode assigns a unique hexadecimal number to each character, and from the perspective of a person who is not a Unicode expert, that hex number is the numerical value for that character.

All the talk about encodings is missing the point completely. The only place where encoding enters the mix at all, is the format in which a Swift source file is stored. As long as that encoding properly represents the character literal 'é', then when the Swift compiler processes the source file, it will recognize that character as representing the integer 233.

…what?

00E9 is 233, which is the number I used in my examples.

assert("\u{3B}" != "\u{37E}", "...oh...")
1 Like

Good catch. 👍

I love this question anyway. ❤️