Prepitch: Character integer literals

To be clear, let buggy: UInt8 = 'é' should not be reported as an overflow error -- it's an encoding issue.

2 Likes

As another Central European engineer, I couldn't agree more.
I struggled with 8-bit encoding issues for many years, and am only starting to get some relief now that most of the world uses UTF-8.

Introducing support for 8-bit encodings in Swift would be a huge step backward, and would be a real source of bugs and frustration.

Just let the compiler forbid anything but ASCII for 8-bit character literals.
Accepting UCS-2 would probably also be a source of bugs; many developers don't even know that 16 bits are not enough to represent every existing character.

Can we find a realistic situation where someone would wrongly use let buggy: UInt8 = 'é'? It won't bother me if literals are restricted to the ASCII range, but I think we need a better justification than a one-line statement which, by itself, does nothing.

I suspect it'll not be that obvious. It's not like strings in Swift are simply a bag of integers like in some other languages. If you construct a String from integers you need to specify an encoding; if you are looking into a string as a sequence of integers you need to specify which view you want (utf8, utf16, unicodeScalars) or which encoding to use to get the bytes. Only when reading or writing data directly as bytes would those character literals go unchecked.
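
For example (just a sketch; the byte values shown here are for 'é'):

let bytes: [UInt8] = [0xC3, 0xA9]                     // "é" encoded as UTF-8
let s = String(decoding: bytes, as: UTF8.self)        // the encoding is named at the call site

let utf8Units = Array(s.utf8)                         // [0xC3, 0xA9] as UInt8
let utf16Units = Array(s.utf16)                       // [0xE9] as UInt16
let scalarValues = s.unicodeScalars.map { $0.value }  // [0xE9] as UInt32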

I'm still undecided and can see arguments either way. But one difference here is that buggy doesn't correspond to any other construct present in the stdlib. Unicode.UTF8.CodeUnit is a UInt8, but buggy isn't a UTF-8 code unit; it's more like a truncated Unicode.UTF32.CodeUnit.

This argument isn't relevant to whether we should allow any BMP scalar as a UInt16.

Right, but that's also a major benefit of this proposal. It enhances the ergonomics of efficiently processing the raw contents of a String as though it were a bag of code units (i.e. integers). For that use case, let buggy: UInt8 = 'é' would be a strong code smell if you're processing the UTF-8 code units of a String.

The UTF8View is special, as it can in many cases provide access to contiguously-stored validly-encoded contents via withContiguousStorageIfAvailable. This is part of why Latin1 and UCS2 literals can feel a little off.
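
A rough sketch of what that fast path looks like (the names here are only illustrative):

let text = "hello, é"
// The closure only runs when contiguous UTF-8 storage is available, hence the optional
// result and the fallback to the ordinary element-by-element path.
let asciiCount = text.utf8.withContiguousStorageIfAvailable { buffer in
  buffer.lazy.filter { $0 < 0x80 }.count
} ?? text.utf8.lazy.filter { $0 < 0x80 }.count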

I should add that the main use case I see for character literals outside the ASCII range is font tools and similar code that deals with Unicode scalars in a numerical sense. Overflow checking there is pretty much irrelevant, since you’re probably going to be using Int32 or above. I would be fine with limiting the proposal to UInt8, and to UInt32, Int32, … Int, but then we’re in the weirdly inconsistent place where UInt16 and Int16 are the only Swift integer types excluded from the proposal.

ASCII will work for UTF-16 code units the same way it does for UTF-8, so 16-bit integers should at the very least allow character literals in the ASCII range.
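
A quick illustration (nothing proposal-specific, just the standard views):

let letter = "A"
assert(letter.utf8.first! == 0x41)   // UInt8
assert(letter.utf16.first! == 0x41)  // UInt16
// For ASCII, the UTF-8 and UTF-16 code units carry the same numeric value,
// so an ASCII-only literal is equally unambiguous as a UInt8 or a UInt16.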

This kind of splits the proposal into two pieces: a below-32-bit part which supports ASCII, and a 32-bit-and-above part which supports all Unicode code points and can already be (clumsily) implemented using double-quoted literals.

I’ve come to agree that the implementation should restrict the characters that can express an 8-bit integer to ASCII. The balance of probabilities is that, in the modern world, a developer is far more likely to be making a mistake checking against the value ‘é’ than actually scanning through a Latin-1-encoded byte sequence.

I’m not sure the same argument applies to UTF-16 literals, however. There are not two competing encodings or interpretations, and if a character does not fit into 16 bits the compiler will give an error. In the interests of the consistency and documentability of the proposal, my vote is that we keep all integer types expressible by a character literal, including UInt16, not least because this is the most likely array-of-integers external buffer format a developer is likely to encounter.

4 Likes

Would the UInt16 character literal be a 21-bit scalar value that can be represented in 16-bits, or would you also allow high-surrogate and low-surrogate code points?

A 21-bit Unicode scalar that needs to be split across two surrogate code points will overflow a UInt16, so you won’t be able to search for it directly using character literal syntax with a UInt16 buffer, in the same way you can’t directly search for ‘é’ in an Int8 buffer encoded with UTF-8. UTF-16 was always a compromise.
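
For example (values shown for U+1F602, purely illustrative):

let emoji = "😂"                     // U+1F602, outside the BMP
let utf16Units = Array(emoji.utf16)  // [0xD83D, 0xDE02] -- two surrogates
let utf8Units = Array(emoji.utf8)    // [0xF0, 0x9F, 0x98, 0x82]
// No single UInt16 value corresponds to the scalar, just as no single UInt8 corresponds to 'é'.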

By surrogate code points, I meant literals in the range "\u{D800}" ... "\u{DFFF}".

let unicodeScalar: Unicode.Scalar = "\u{D800}" // error: invalid unicode scalar

let utf16CodeUnit: Unicode.UTF16.CodeUnit = "\u{D800}" // alternative to 0xD800

I don't think this is a good idea, so I'm glad that it was only my misunderstanding.

Inevitably, the new integer conformances to ExpressibleByUnicodeScalarLiteral surface some of the intricacies of character encoding that Swift has studiously sought to shield developers from with its modern, high-level String model. The advantages of this proposal are twofold: it introduces a new class of literal for Character (with a capital C) as distinct from String, which has ergonomic advantages; and once that distinction has been made, these new conformances can be introduced with a minimum of disruption, with helpful diagnostics provided when they are misused.

1 Like

Hm, I agree: UCS-2 has the significant benefit that it's compatible with UTF-16 -- unlike Latin-1 vs UTF-8. So, at the very least, allowing UInt16 to be initialized with any BMP character wouldn't add glaring inconsistencies with Swift's "native" String encodings.

However, UCS-2 and UTF-16 aren't the only two char encodings with 16-bit units -- see GB2312, Big5, JIS X 0208, etc. I don't have the cultural background to judge whether these standards cause the same sort of confusion as legacy 8-bit codes do. (To be honest, I have not even wrapped my head around how these work in practice. ISO-2022? DBCS? SBCS? Oh my...) It may be a good idea to find someone with relevant experience and see if this gives them the heebie-jeebies like 'é' as UInt8 does to me:

let maybeNotSoBadAfterAll: UInt16 = '啊'

I think supporting Int16 would be a stretch, though.

5 Likes

I really don’t think any of this makes sense above ASCII; it is absolutely fraught with peril.

let number: UInt32 = ';'
assert(number == 0x37E, "My number changed?!?")

If a source editor, pasteboard, file sync, checkout, or download performs NFx normalization, the above code will break.

1 Like

This is already the case with String literals. Source processing tools seem to be generally ready for this -- I can't recall any corruption issues.

I think it is better to have a simpler and more consistent rule, with fewer special cases. In particular, it would be unexpected and surprising if the behavior for character-literals-as-integers were different from the following:

func scalarValue<T: BinaryInteger>(_ c: Character) -> T {
  precondition(c.unicodeScalars.count == 1)
  return numericCast(c.unicodeScalars.first!.value)
}

Swift is already Unicode aware, as a deliberate design choice. We can already write "\u{n}" to turn an integer literal into a Character. With this proposal to go the other way, writing '\u{n}' should give back the same value of n that went in, or produce an overflow error if the destination type is too small.
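
Using the scalarValue helper above (the values are purely illustrative):

let greekQuestionMark: UInt32 = scalarValue("\u{37E}")  // 0x37E -- the scalar comes back unchanged
let ascii: UInt8 = scalarValue("A")                     // 0x41 fits in 8 bits
// let tooSmall: UInt8 = scalarValue("啊")               // traps: 0x554A doesn't fit in a UInt8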

Where and how would such a "truncated UTF-32" encoding be useful in actual practice?

This is not the case with String literals.

This will never fail, no matter whether the file’s encoding or normalization changes:

let string = ";"
assert(string == "\u{37E}", "My character changed?!?")

Sure, but this will:

let scalar: Unicode.Scalar = ";"
assert(scalar == Unicode.Scalar(0x37e))

…anywhere that someone wants to represent an integer in source code by using the corresponding Unicode character surrounded by single quotes.

That is the entire point of this proposal after all.

• • •

Personally, I am skeptical about whether we need such a feature in the language, but seeing how this discussion has proceeded for hundreds of posts, I’m willing to accept that some people find it beneficial to write integers that way.

And if we’re going to allow it, we might as well do it the obvious way that works as expected.

• • •

Let me put it this way:

If someone needs to be a Unicode expert to understand how the feature works and use it correctly, then it definitely does not belong in a high-visibility part of the Swift language. That would make it an attractive nuisance.

On the other hand, if the feature is intended for use by non-experts in Unicode, then it needs to work the way those people want and expect it to. Specifically:

If the character is a single scalar, then turning it into an integer should give that scalar. And if the character is not a single scalar, then turning it into an integer should fail.
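
A sketch of that rule with a hypothetical failable helper (not part of the proposal, just to make the expectation concrete):

func integerValue<T: BinaryInteger>(of c: Character) -> T? {
  guard c.unicodeScalars.count == 1,
        let value = T(exactly: c.unicodeScalars.first!.value) else { return nil }
  return value
}

let single: UInt32? = integerValue(of: "é")        // 0xE9 -- one scalar, succeeds
let multi: UInt32? = integerValue(of: "e\u{301}")  // nil -- two scalars (e + combining acute)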

2 Likes