Prepitch: Character integer literals

LOL. You’re right. And this would no longer even compile with NFx:

let scalar: Unicode.Scalar = "שּׂ"

There is a reason Swift steers people toward String and Character, and leaves Unicode.Scalar hidden away for those who know what they’re doing.

These are safe:

let string: String = "שּׂ"
let character: Character = "שּׂ"

These are dangerous and not Unicode‐compliant at the source level:

let scalar: Unicode.Scalar = "שּׂ"
let number: UInt32 = 'שּׂ'

They can only be safely expressed like this:

let scalar: Unicode.Scalar = "\u{FB2D}"
let number: UInt32 = 0xFB2D
let scalar2 = Unicode.Scalar(0xFB2D as UInt32)!   // the UInt32 initializer is failable
let number2: UInt32 = '\u{FB2D}'

I don’t see how the last line is an improvement.


Interestingly (and very relevantly), I had trouble posting this properly: at first I was copying and pasting the character to reuse it, and Safari (or else the forum interface) was performing NFC on either the copy or the paste. In the end I had to stop copying and keep going back to the character palette to make it work properly.

Try it yourself. Copy and paste it to the text field of your own post (but don’t actually post it, or you’ll annoy us all). Then compare the bytes in the text field and the preview to the bytes where you copied it from. Not the same.

If you copy from my post to a playground, it will compile and run, but if you copy from my post to your own and then copy that to the playground, it will not compile.

(Edit: this assumes you are using Safari and Xcode.)

I think that demonstrates fairly soundly that tools are not prepared for this sort of thing. (Even though they are Unicode‐compliant.)
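A minimal way to check this for yourself in a playground, assuming Foundation is available for the NFC conversion:

import Foundation

let original = "\u{FB2D}"                                  // one scalar: U+FB2D
let nfc = original.precomposedStringWithCanonicalMapping   // the NFC form
print(original.unicodeScalars.count)   // 1
print(nfc.unicodeScalars.count)        // 3: U+FB2D is excluded from recomposition
print(original == nfc)                 // true: the two strings are canonically equal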

1 Like

You can easily do this by simply using the appropriate integer type, i.e., UInt32.

let fine: UInt32 = '✓'

Or, even better, you can use Unicode.Scalar to do this in a more type-safe way.

let terrific: Unicode.Scalar = '✓'

Unicode.Scalar is a tiny wrapper type around a UInt32 value; using it makes it clear that we're dealing with Unicode code points, not random integers.
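A quick check of what Unicode.Scalar already provides in today's Swift (spelled with double quotes, since the single-quoted form is what's being proposed):

let check: Unicode.Scalar = "✓"            // U+2713
print(check.value)                         // 10003 (0x2713), the underlying UInt32
print(MemoryLayout<Unicode.Scalar>.size)   // 4 bytes, same as UInt32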

There is simply no way to represent an arbitrary Unicode character in fewer than 32 bits without dealing with encodings. Amazingly, over the last two decades the world has somehow mostly agreed on a sensible, universal 8-bit encoding: UTF-8. UTF-8 is already deeply integrated into Swift as the default text encoding; I would prefer to go forward by extending this support even further, rather than reverting to anything else.

In particular, this would be entirely unacceptable to me:

let café1: [UInt8] = ['c', 'a', 'f', 'é']
let café2: [UInt8] = "café".utf8.map { $0 }

print(café1) // [99, 97, 102, 233]
print(café2) // [99, 97, 102, 195, 169]

The Swift-native way to encode the character é into 8-bit bytes is to use UTF-8, resulting in [195, 169]. Introducing a language feature that does anything else looks like a bad idea to me.

I agree. However, this feature (including support for UInt8) wasn't sold as a Unicode-enabling feature; it's meant to make it easier to do things like parse binary file formats that include some ASCII bits.

I don't think limiting UInt8/Int8 initialization to ASCII characters is a particularly difficult thing to accept, given that it simplifies Swift's text processing story. Reviving the legacy Latin-1 encoding would have huge potential for accidental misuse, as demonstrated in the above snippet. This is especially the case for people not well-versed in Unicode matters.
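For comparison, the standard library already offers a runtime spelling of exactly this ASCII-only conversion, UInt8(ascii:); the proposal would essentially move that check to compile time:

let newline = UInt8(ascii: "\n")   // 10
let letterA = UInt8(ascii: "A")    // 65
// UInt8(ascii: "é")               // traps at runtime: U+00E9 is not ASCII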

6 Likes

I think we're debating these 3 options, but please let me know if I missed something.

Option 1: Simple truncation

          UInt8   UInt16   UInt32   Unicode.Scalar   Character
'x'       120     120      120      U+0078           x
'©'       169     169      169      U+00A9           ©
'花'      error   33457    33457    U+82B1           花
'𓀎'      error   error    77838    U+1300E          𓀎
'👩‍👩‍👦‍👦'      error   error    error    error            👩‍👩‍👦‍👦
'ab'      error   error    error    error            error

(What about signed integers? Should they refuse to set the sign bit, or are they bitwise-equivalent to unsigned?)

Option 2: Valid Encoding Width

          UInt8   UInt16   UInt32   Unicode.Scalar   Character
'x'       120     120      120      U+0078           x
'©'       error   169      169      U+00A9           ©
'花'      error   33457    33457    U+82B1           花
'𓀎'      error   error    77838    U+1300E          𓀎
'👩‍👩‍👦‍👦'      error   error    error    error            👩‍👩‍👦‍👦
'ab'      error   error    error    error            error

Option 3: ASCII or UTF32

          UInt8   UInt16   UInt32   Unicode.Scalar   Character
'x'       120     120      120      U+0078           x
'©'       error   error    169      U+00A9           ©
'花'      error   error    33457    U+82B1           花
'𓀎'      error   error    77838    U+1300E          𓀎
'👩‍👩‍👦‍👦'      error   error    error    error            👩‍👩‍👦‍👦
'ab'      error   error    error    error            error

Option 1 feels the least desirable to me. To write correct code, the developer has to understand Unicode encoding details and be vigilant against bugs. This means they should pretty much never use non-ASCII character literals, unless they're really interested in processing Latin-1 or a wider type is somehow out of the question. This seems very niche to me, and the wrong tradeoff.

Option 2 fixes this, and would be convenient when processing UTF-16 encoded data. Scanning UTF-8 for specific ASCII characters is tolerant of multi-byte scalars because we reject multi-code-unit literals; similarly, scanning UTF-16 for certain BMP scalars is tolerant of non-BMP scalars because we reject surrogate literals. But it does have the effect of baking encoding details further into our model.

Option 3 seems the simplest model for users who don't want or need to worry much about Unicode or encodings. It also avoids any concern about the signedness of types, since no sign bit is ever set. ASCII is special in computing, will likely remain so for the rest of our lifetimes, and is the most common use case anyway; everything else is a Unicode scalar value and needs the full 32 bits. But it is less useful than Option 2 for users who want non-ASCII BMP scalars width-restricted to 16 bits.

I feel like Option 3 is a tiny bit better than Option 2. Both Options 2 and 3 seem significantly better than Option 1.
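For concreteness, this is the kind of mixed binary/ASCII scanning all three options are meant to serve, written here with today's UInt8(ascii:) spelling rather than the proposed single-quoted literals:

let bytes = Array("GET /index.html HTTP/1.1\r\n".utf8)
let space = UInt8(ascii: " ")
let method = bytes.prefix { $0 != space }
print(String(decoding: method, as: UTF8.self))   // "GET"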

11 Likes

The three tables look like an excellent summary to me. Any of the three options would suit me. I'm not too sure how much harm it'd cause to have character literals be simply truncated Unicode scalars. I find String and Character are pretty good at providing a high-level interface matching general expectations, while working with anything else is already full of pitfalls you need to be aware of. I'd probably favor Option 2 if I had to explain to someone how UTF-8 and UTF-16 work.

I think character literals should work for signed integers. I can't see how it could be harmful, and I can see a couple of ways it may be useful. For instance, it's not uncommon to do ASCII arithmetic when implementing parsers for Base64, numbers, or other binary-to-text schemes; having the ability to go negative might be useful in some cases. Also, CChar is signed on many platforms, so it might also help when interfacing with some C APIs.
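A small example of the ASCII arithmetic and the CChar case mentioned above, using today's spellings:

let digits = Array("2024".utf8)
let zero = UInt8(ascii: "0")
let number: Int = digits.reduce(0) { $0 * 10 + Int($1 - zero) }
print(number)                      // 2024

let cString = "OK\n".utf8CString   // ContiguousArray<CChar>
print(cString)                     // [79, 75, 10, 0]; CChar is signed on many platforms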

Text (even down to an individual scalar’s worth of text) has multiple, interchangeable byte representations, even within a single encoding (i.e. without leaving UTF‐8). Anything written in a source file is text, and may switch forms without asking the user. But the different forms produce different numbers, or even different numbers of numbers. Two equal source files should not produce two different programs. Nor should a source file produce a different program than its copy produces. But this is what starts to happen when Unicode characters are used as something other than Unicode characters.

Option 4: Only What Is Safe

       UInt8   UInt16   UInt32   Unicode.Scalar    Character   Notes
'x'    120     120      120      U+0078            x           ASCII scalar
'©'    error   error    error    U+00A9*           ©           Latin‐1 scalar
'é'    error   error    error    U+00E9/error*     é           Latin‐1 scalar which expands under NFD
'花'   error   error    error    U+82B1*           花          BMP scalar
';'    error   error    error    U+037E/U+003B*†   ;           BMP scalar which changes under NFx
'שּׂ'    error   error    error    U+FB2D/error*     שּׂ           BMP scalar which expands under NFx
'𓀎'   error   error    error    U+1300E*          𓀎          Supplemental plane scalar
'ē̱'    error   error    error    error             ē̱           Character with no single‐scalar representation
'ab'   error   error    error    error             error       Multiple characters

* Cells with an asterisk are existing behaviour that is dangerous. As you can see, some characters in these ranges are safe to initialize as literals and others aren’t. You have to know a lot about Unicode to know which are which. At least if you have gone to the effort to call Unicode.Scalar out from under Character, then you must already know you are dealing with Unicode, and you’ve taken a certain level of responsibility for what happens. But that waiving of safety is not as clear when you are thinking of them as numbers (UIntx).

† These are equal strings.
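The 'é' row can be verified directly in a playground, assuming Foundation is available for the NFD conversion:

import Foundation

let nfc = "\u{E9}"                                   // one scalar: U+00E9
let nfd = nfc.decomposedStringWithCanonicalMapping   // the NFD form
print(nfc.unicodeScalars.map { $0.value })   // [233]
print(nfd.unicodeScalars.map { $0.value })   // [101, 769]
print(nfc == nfd)                            // true, yet only one form fits in a single scalar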

10 Likes

Nor should there be. The compiler simply raises an overflow error if the value does not fit in the destination type. This is exactly the same as with integer literals:

101 as Int8     // success
101 as UInt8    // success

233 as Int8     // error
233 as UInt8    // success

Thus:

'e' as Int8     // success
'e' as UInt8    // success

'é' as Int8     // error
'é' as UInt8    // success

• • •

Encodings are for strings. We are dealing with numbers: specifically, using a Unicode character to represent an integer in source code. Whatever encoding the source code uses, the compiler can check for the presence of exactly one Unicode scalar in the character literal, and see whether its value fits in the contextual type.
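A rough model of that compiler check, written as ordinary Swift for illustration (the function name is made up):

func characterLiteralValue<T: FixedWidthInteger>(_ text: String, as type: T.Type) -> T? {
    // Exactly one Unicode scalar, and a value that fits the destination type.
    guard text.unicodeScalars.count == 1, let scalar = text.unicodeScalars.first else {
        return nil
    }
    return T(exactly: scalar.value)
}

print(characterLiteralValue("e", as: UInt8.self) as Any)   // Optional(101)
print(characterLiteralValue("é", as: UInt8.self) as Any)   // Optional(233), if "é" is stored as the single scalar U+00E9
print(characterLiteralValue("é", as: Int8.self) as Any)    // nil: 233 does not fit in Int8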

1 Like

No, a Unicode scalar value has a single byte representation in any given encoding. Scalar equality does not model canonical equivalence. Furthermore, if you ask for String.UnicodeScalarView, you will get the scalars encoded by the String, not some set of canonically-equivalent scalars (though this behavior has yet to be formalized).

Canonical equivalence does not apply to character literals, which exist at the scalar level. But it can inform concerns, and I think you bring up an interesting point:

More concretely, normalization during copy-paste (fairly common) could break source code. For String literals, this wasn't as serious a concern, as it would only cause issues with hard-coded assumptions about the contents of the code unit or scalar views.
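The distinction being drawn here, canonical equivalence at the String level versus exact scalar values at the scalar level, is easy to see with the semicolon pair from the table above:

let a = "\u{37E}"   // GREEK QUESTION MARK
let b = "\u{3B}"    // SEMICOLON
print(a == b)                                               // true: Strings compare by canonical equivalence
print(a.unicodeScalars.first! == b.unicodeScalars.first!)   // false: scalars compare by exact value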

No strong opinions; this also looks great to me.

I’m trying to decide in my mind between Option 2 and Option 4, and it feels like a classic flexibility-versus-safety debate to me. I’m generally in favour of enabling developers rather than molly-coddling them, but in practical terms not much would be lost if we opted for safety and restricted all integer conversion to ASCII. Apart from introducing Character literals to the language, which is worthwhile in itself, the integer conversions are a niche feature, and 90% of the usage will likely be the literal ‘\n’ anyway. One political approach would be to restrict it to ASCII for now and avoid a whole lot of documentation burden/debate, with a view to opening it up later on if further use cases turn up. Though to be clear, I’d personally prefer Option 2.

2 Likes

I think we all understand how Option 1 works. The only thing going for it is that it is slightly easier to implement than the other options. As others and I have repeatedly tried to show, it leads to significant user-side complexity. It’s not at all simple:

  • It doesn’t integrate well with text processing features in the rest of the language. Swift’s stdlib prefers Unicode’s popular encodings (UTF-8, UTF-16, UTF-32), plus ASCII. You’re effectively arguing for the introduction of one more preferred encoding. At minimum, this requires you to demonstrate some highly convincing use cases.
  • It significantly deepens confusion about character encodings by hardwiring some implicit assumptions about specific 8-bit and 16-bit encodings directly into the language syntax. String’s API is careful to make such decisions explicit — we have dedicated utf8 and utf16 views, and they need to be explicitly spelled out in code that deals with them.*
  • The choice of the proposed 8-bit encoding is highly questionable. Option 1 (and to a much lesser degree, Option 2) effectively hardwires obsolete character encodings (ISO 8859-1 and UCS-2) directly into the language syntax.

 * Unfortunately, this middle argument also applies against Options 2–4. This proposal assumes that it is okay to simply define the byte value of ‘A’ as 65, ignoring those of us who prefer to use 193 instead. (193 is ‘A’ in EBCDIC; Swift does run on platforms where that would be the natural choice.)

Characters aren’t integers, although there are many ways to map between the two. To convert a character (i.e., Unicode scalar) to an integer, you have to choose a character set and a corresponding encoding. There are many competing encodings, and choosing the wrong one will invariably lead to data loss or corruption. Ideally even ASCII should be an explicit choice.
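The point about choosing an encoding can be made concrete; the same character corresponds to different integers depending on which encoding you pick:

let s = "\u{E9}"                       // LATIN SMALL LETTER E WITH ACUTE
print(Array(s.utf8))                   // [195, 169]  (UTF-8: two code units)
print(Array(s.utf16))                  // [233]       (UTF-16: one code unit)
print(s.unicodeScalars.first!.value)   // 233         (UTF-32 / the scalar value)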

If n is an integer written in hexadecimal, then "\u{n}" either uniquely determines a one-character string, or it is an error.

If it is not an error, then the resulting one-character string is uniquely determined by the integer n.

• • •

If c is a one-character string, then either it is uniquely determined by an integer n such that "\u{n}" produces c, or there is no such integer.

If there is an integer n such that "\u{n}" produces c, then that integer is unique.

• • •

The mapping between characters and integers is not ambiguous—it already exists in Swift code. This proposal is to provide a convenient compile-time way to utilize that mapping.

Deliberately hamstringing the feature to only work for ASCII would be antithetical to Swift’s support for Unicode.

Nevin, with all due respect, it looks like you may have missed my café1/café2 example above. We’re arguing to restrict the 8-bit (and possibly 16-bit) parts of this feature to ASCII in order to improve how it interacts with Unicode.

A fundamental idea of Unicode is that there isn’t a one-to-one correspondence between characters and particular bit sequences (a.k.a. integers). The character “é” is designated U+00E9 in the Unicode character set (although there are other ways to represent it, too). For convenience, Swift allows you to write this as "\u{00E9}" as part of its support for Unicode. But “00E9” is just an arbitrary identifier, like “LATIN SMALL LETTER E ACUTE”. It does not necessarily have anything to do with the numerical value of a particular encoding of this character.

Luckily, Unicode also defines several useful encodings to map its characters into actual bits that can be transmitted/stored/processed. UTF-8, UTF-16 and UTF-32 are the most frequently used encodings (for 8-bit, 16-bit, and 32-bit code units), but there are many more. UTF-32 is particularly nice in that it uses a trivial one-to-one mapping between code points and their encoded forms; this isn’t true for 8-bit or 16-bit encodings.

Which Unicode encoding do you think encodes “é” as a single 8-bit value of 233?

5 Likes

Given the exact words I wrote, you are technically correct. My intended meaning was more like “text that is only one scalar long”, which is the way it lives in a source file. I fixed my statement for clarity. Thanks for the heads up.

Okay, now I think we’re finally getting somewhere productive. In my understanding, 00E9 is the hexadecimal representation of the number 233. It is exactly and exclusively the single numerical value which Unicode associates with the character “LATIN SMALL LETTER E ACUTE”.

If a programmer asks the Swift compiler for the numerical value of “LATIN SMALL LETTER E ACUTE”, the only sensible answer is the hexadecimal number 00E9, which is 233. Unicode assigns a unique hexadecimal number to each character, and from the perspective of a person who is not a Unicode expert, that hex number is the numerical value for that character.

All the talk about encodings is missing the point completely. The only place where encoding enters the mix at all, is the format in which a Swift source file is stored. As long as that encoding properly represents the character literal 'é', then when the Swift compiler processes the source file, it will recognize that character as representing the integer 233.

…what?

00E9 is 233, which is the number I used in my examples.

assert("\u{3B}" != "\u{37E}", "...oh...")
2 Likes

Good catch. 👍

I love this question anyway. ❤️

Yes, the semicolon and the Greek question-mark are different Unicode characters, even though they compare equal in Swift.

I’m not sure what point you are trying to make.

It is the wrong question. We are not trying to encode “é” at all. We are trying to treat it as an integer. And there is only one integer which canonically maps to it, namely 233.

I have now taken the time to read the entire thread from the very beginning. There are two comments I would like to call out:

  1. @xwu said this, and sparked a long back‐and‐forth over whether it was a good idea:

He made several direct quotations from Unicode Technical Reports, which demonstrate that he is no newcomer to Unicode concepts and knows where to look for answers. But he still got it wrong, and the code he suggested is unreliable. The statements in the technical reports mean that ASCII strings will never change into something else under NFx, and that Latin‐1 strings will never do so under NFC. They do not mean that nothing else will ever change into an ASCII string under NFx, or into a Latin‐1 string under NFC. The assertion example I used a few posts ago demonstrates clearly that something ASCII can be equal to something non‐ASCII. So for @xwu’s code, something could appear in signature which is not an ASCII string (such as the Greek question mark, U+037E) but is canonically equivalent to an ASCII string (a semicolon, U+003B), causing the check to go in an unintended direction. (Though I don’t think anything folds down to the “J”, “F”, or “I” he actually used.)

I say this because the fact that someone knowledgeable can make such a mistake, and that several knowledgeable people can then argue about it for such a long time without being able to demonstrate the problem clearly, shows just how easy such mistakes are to make, and just how much safer a direct‐to‐integer ASCII literal would be. Until I read that, I was generally against this entire pitch, but it, and it alone, switched me completely around to seeing it as a significant safety improvement for the ASCII range. This is precisely because it insulates you against Unicode pitfalls.

I do share most of @xwu’s actual Unicode‐related concerns, though, and I stand by all my previous statements, which demonstrate how unsafe such an idea is in the Unicode domain, where it instead makes you more vulnerable to Unicode pitfalls.
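The failure mode described above, in miniature; the check here is hypothetical, not @xwu’s actual code:

let expectedSignature = ";"    // ASCII semicolon
let received = "\u{37E}"       // GREEK QUESTION MARK, not ASCII
print(received == expectedSignature)                      // true, so the comparison succeeds...
print(received.unicodeScalars.allSatisfy { $0.isASCII })  // ...even though received contains no ASCII at all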

That brings me to the second comment I would like to call out:

  2. @johnno1962 said this:

This seems to be a very wise suggestion. The ASCII stuff may succeed or fail during review based solely on its own merits, dangers, and use cases, which are vastly different from the merits, dangers, and use cases of the Unicode realm. In fact, I think the two are opposites. So let Unicode be a separate and distinct round two.

11 Likes

I have bad news: that is, by definition, an encoding. To treat a character as a number, you need to select an encoding. What you are arguing for is simply this particular set of encodings:

UInt8 : Latin-1
UInt16 : UCS-2
UInt32 : UTF-32

It’s not a great choice.

4 Likes

Nevin is arguing for truncated Unicode scalars, which I know gives the same result as Latin-1/UCS-2/UTF-32 but is still more principled than seemingly picking encodings at random. I think it can be useful in some situations, and it makes the programming model simpler because it can reuse the same protocol as Unicode scalar literals (if I understood right). So it's not without its advantages, but I agree there is more potential for harm by misuse for those who haven't memorized the ASCII table or aren't perfectly aware of the encoding they're working with. How much harm this represents, and whether it is acceptable or not, is the real debate here.

3 Likes

I'd like to propose a radically different approach, inspired by the concerns about non-ASCII characters:

  • double quotes mean Unicode string, character, or scalar
  • single quotes mean ASCII string or scalar

So whenever you need to be sure something is purely ASCII you use single-quotes:

let a: String = "planté" // unicode string
let b: String = 'planté' // ERROR: é is not ascii
let c: String = 'plante' // ascii string (no é in this one)
let a: Character = "é" // U+00E9
let b: Character = 'é' // ERROR: U+00E9 is not ascii
let c: Character = 'p' // U+0070
let a: UnicodeScalar = "é" // U+00E9
let b: UnicodeScalar = 'é' // ERROR: U+00E9 is not ascii
let c: UnicodeScalar = 'p' // U+0070

We can then allow ASCII literals to initialize any numeric type:

let a: UInt32 = "é" // ERROR: UInt32 does not conform to ExpressibleByUnicodeScalarLiteral
let b: UInt32 = 'é' // ERROR: U+00E9 is not ascii
let c: UInt32 = 'p' // 0x00000070
let a: UInt16 = "é" // ERROR: UInt16 does not conform to ExpressibleByUnicodeScalarLiteral
let b: UInt16 = 'é' // ERROR: U+00E9 is not ascii
let c: UInt16 = 'p' // 0x0070
let a: UInt8 = "é" // ERROR: UInt8 does not conform to ExpressibleByUnicodeScalarLiteral
let b: UInt8 = 'é' // ERROR: U+00E9 is not ascii
let c: UInt8 = 'p' // 0x70

And you can also initialize an array of numbers from an ASCII string:

let a: [UInt8] = "plante" // ERROR: Array does not conform to ExpressibleByStringLiteral
let b: [UInt8] = 'plante' // ascii
let a: [UInt16] = "plante" // ERROR: Array does not conform to ExpressibleByStringLiteral
let b: [UInt16] = 'plante' // ascii
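For reference, the closest spelling available today performs the ASCII check at run time rather than compile time; this is only a sketch, not part of the proposal:

let ascii = "plante".unicodeScalars.map { scalar -> UInt8 in
    precondition(scalar.isASCII, "\(scalar) is not ASCII")
    return UInt8(ascii: scalar)
}
print(ascii)   // [112, 108, 97, 110, 116, 101]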

Of course, this approach completely flips on its head the current character literal proposal.

10 Likes