Prepitch: Character integer literals

18.04, but the toolchain you linked to is an osx toolchain…
i’ll just build it from source from your fork, but i figured i’d let you know lol

I don’t have 18.04 setup alas. Curious what the problem is when you try the mac binaries. The compiler doesn’t even work?

Linux 18.04 toolchain available http://johnholdsworth.com/swift-LOCAL-2019-01-02-a-linux.tar.gz

As a Central European engineer who lived through the Dark Age of 8-bit code pages, I would strongly prefer limiting this feature to US-ASCII (for UInt8 and Int8) and full Unicode (for Int32+).

Assuming/defaulting to any particular 8-bit encoding is just asking for trouble. For example, in an 8-bit context, 0xE6 isn't at all the same as 'æ' -- depending on encoding, it can mean any of 'W', 'Ê', 'ć', 'ĉ', 'Š', 'ц', 'ن', 'ζ', 'ז', 'ๆ', 'و', 'µ', 'φ', etc. etc. etc.
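For illustration, here is roughly how that plays out with Foundation's legacy codecs (a sketch only; which conversions are actually available depends on the platform's tables):

import Foundation

let byte: [UInt8] = [0xE6]
String(bytes: byte, encoding: .isoLatin1)      // Optional("æ") -- ISO 8859-1
String(bytes: byte, encoding: .isoLatin2)      // Optional("ć") -- ISO 8859-2
String(bytes: byte, encoding: .windowsCP1253)  // Optional("ζ") -- Greek
String(bytes: byte, encoding: .windowsCP1251)  // Optional("ж") -- Cyrillic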

In the vast majority of contexts, the old single-byte 8-bit encodings shouldn't be used at all. In the (legacy) cases where their use is unavoidable, it should not be possible to encode non-ASCII characters like 'æ' or 'é' etc. to a single byte without also specifying an explicit encoding, in a highly visible way.

Implicitly hardwiring Latin-1 into Swift's syntax seems completely unnecessary and wildly anachronistic to me. Code like let obviousBug: UInt8 = 'é' raises alarms in my brain that haven't triggered for years and years -- I associate it with the Macarena and Tamagotchis.

In our brave new UTF-8 world, the default assumption should be that the 8-bit value 0xE6 is the first of three bytes in the UTF-8 encoding of '歹', or one of 4095 other ideographs.
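A quick check in a playground (the specific ideograph doesn't matter; it's just the one mentioned above):

Array("歹".utf8)   // [230, 173, 185], i.e. 0xE6 0xAD 0xB9
Array("æ".utf8)    // [195, 166], i.e. 0xC3 0xA6 -- 'æ' itself is two bytes in UTF-8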

I'm not as strongly opinionated on 16-bit encodings, but it seems to me UCS-2 has also lost most of its lustre by now. I'd prefer to not support initializing UInt16/Int16 with character literals at all, or to limit the feature to the 128 characters available in ASCII.

18 Likes

what if we just got rid of the high bit in UInt8? then 'é' would trigger an overflow and you would have to choose a wider type like UInt16.
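for comparison, ordinary integer literals already get rejected this way today; the single-quoted line is the proposed syntax, not something that compiles yet:

let n: UInt8 = 256   // error: integer literal '256' overflows when stored into 'UInt8'
// proposed: let c: UInt8 = 'é' would be rejected the same way, pushing you to a wider type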

That's what I'm suggesting for UInt8 and Int8.

As I said, I see no reason to add UCS-2 support directly into Swift's syntax, either. What would be the rationale for supporting let dubious: UInt16 = 'é'?

2 Likes

To be clear, let buggy: UInt8 = 'é' should not be reported as an overflow error -- it's an encoding issue.

2 Likes

As another Central European engineer, I couldn't agree more.
I struggled with 8-bit encoding issues for many years, and I'm only starting to get some relief now that most of the world has moved to UTF-8.

Introducing support for 8-bit encodings in Swift would be a huge step backward, and it would be a real source of bugs and frustration.

Just let the compiler forbid anything but ASCII in 8-bit character literals.
Accepting UCS-2 would probably be a source of bugs too; many developers don't even know that 16 bits are not enough to represent every existing character.

Can we find a realistic situation where someone could wrongly write let buggy: UInt8 = 'é'? It won't bother me if literals are restricted to the ASCII range, but I think we need a better justification than a one-line statement which by itself does nothing.

I suspect it won't be that obvious. Strings in Swift aren't simply a bag of integers, as they are in some other languages. If you construct a String from integers you need to specify an encoding; if you look into a string as a sequence of integers you need to specify which view you want (utf8, utf16, unicodeScalars) or which encoding to use to get the bytes. Only when reading or writing data directly as bytes would those character literals go unchecked.
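A rough sketch of that point with today's standard library (nothing here is proposed syntax):

let bytes: [UInt8] = [0xC3, 0xA9]                // UTF-8 encoding of "é"
let s = String(decoding: bytes, as: UTF8.self)   // the encoding is always spelled out
Array(s.utf8)            // [195, 169] -- code units of whichever view you ask for
Array(s.utf16)           // [233]
Array(s.unicodeScalars)  // a single scalar, U+00E9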

I'm still undecided and can see arguments either way. But one difference here is that buggy doesn't correspond to any other construct present in the stdlib. Unicode.UTF8.CodeUnit is a UInt8, but buggy isn't a UTF-8 code unit; it's more like a truncated Unicode.UTF32.CodeUnit.

This argument isn't relevant to whether we should allow any BMP scalar as a UInt16.

Right, but that's also a major benefit of this proposal. It enhances the ergonomics of efficiently processing the raw contents of a String as though it were a bag of code units (i.e. integers). For that use case, let buggy: UInt8 = 'é' would be a strong code smell if you're processing the UTF-8 code units of a String.

The UTF8View is special, as it can in many cases provide access to contiguously stored, validly encoded contents via withContiguousStorageIfAvailable. This is part of why Latin-1 and UCS-2 literals can feel a little off.
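For example, byte-level scanning over the UTF8View might look something like this today (countSlashes is just a made-up helper; the ASCII constant is spelled numerically here, which is the ergonomic gap the proposal targets):

// Hypothetical helper: count '/' characters by scanning UTF-8 code units directly.
func countSlashes(in s: String) -> Int {
    let slash: UInt8 = 0x2F   // ASCII '/'; today 0x2F or UInt8(ascii: "/"), with the proposal just '/'
    return s.utf8.withContiguousStorageIfAvailable { buffer in
        buffer.reduce(0) { $0 + ($1 == slash ? 1 : 0) }
    } ?? s.utf8.reduce(0) { $0 + ($1 == slash ? 1 : 0) }
}

countSlashes(in: "a/b/c")   // 3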

I should add that the main use case I see for character literals outside the ASCII range is font tools and similar programs that deal with Unicode scalars in a numerical sense. Overflow checking there is pretty much irrelevant, since you’re probably going to be using Int32 or above. I would be fine with limiting the proposal to UInt8, Int8, and UInt32, Int32, … Int, but then we’re in the weirdly inconsistent place where UInt16 and Int16 are the only Swift integer types excluded from the proposal.

ASCII will work for UTF-16 code units the same way it does for UTF-8. So 16-bit integers should at the very least allow character literals in the ASCII range.
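That is, for pure-ASCII content the code units are numerically identical across the two views:

let ascii = "Swift"
Array(ascii.utf8)    // [83, 119, 105, 102, 116]
Array(ascii.utf16)   // [83, 119, 105, 102, 116]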

This kind of splits the proposal into two pieces: a below-32-bit part which supports ASCII, and a 32-bit-and-above part which supports all Unicode code points and can already be (clumsily) expressed using double-quoted literals.
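For reference, the clumsy spelling that already works today goes through Unicode.Scalar (double-quoted) and an explicit conversion:

let newline = UInt8(ascii: "\n")              // works today, ASCII only
let acute   = ("é" as Unicode.Scalar).value   // UInt32 for any scalar, but verbose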

I’ve come to agree that the implementation should restrict the characters that can express an 8-bit integer to ASCII. On the balance of probabilities, a developer checking against the value ‘é’ is far more likely to be making a mistake than to actually be scanning through a Latin-1-encoded byte sequence in the modern world.

I’m not sure the same argument applies to UTF-16 literals, however. There are not two competing encodings or interpretations, and if a character does not fit into 16 bits the compiler will give an error. In the interests of the consistency and documentability of the proposal, my vote is that we keep all integer types expressible by a character literal, including UInt16, not least because an array of 16-bit integers is the external buffer format a developer is most likely to encounter.

4 Likes

Would the UInt16 character literal be a 21-bit scalar value that can be represented in 16 bits, or would you also allow high-surrogate and low-surrogate code points?

A 21-bit Unicode scalar that needs to be split across two surrogate code points will overflow a UInt16, so you won’t be able to search for it directly using character literal syntax with a UInt16 buffer, in the same way you can’t directly search for ‘é’ in an Int8 buffer encoded with UTF-8. UTF-16 was always a compromise.
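Concretely (the clef is just an arbitrary non-BMP scalar):

let clef = "𝄞"                       // U+1D11E MUSICAL SYMBOL G CLEF, a single scalar
clef.unicodeScalars.first!.value     // 119070 (0x1D11E) -- doesn't fit in 16 bits
Array(clef.utf16)                    // [55348, 56606], i.e. the surrogate pair 0xD834, 0xDD1E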

By surrogate code points, I meant literals in the range "\u{D800}" ... "\u{DFFF}".

let unicodeScalar: Unicode.Scalar = "\u{D800}" // error: invalid unicode scalar

let utf16CodeUnit: Unicode.UTF16.CodeUnit = "\u{D800}" // alternative to 0xD800

I don't think this is a good idea, so I'm glad that it was only my misunderstanding.

Inevitably, the new integer conformances to ExpressibleByUnicodeScalarLiteral surface some of the intricacies of character encoding that Swift has studiously sought to shield developers from with its modern, high-level String model. The advantages of this proposal are twofold: it introduces a new class of literal for Character (with a capital C) as distinct from String, which has ergonomic advantages; and once that distinction has been made, these new conformances can be introduced with a minimum of disruption, with helpful diagnostics provided when they are misused.

1 Like

Hm, I agree: UCS-2 has the significant benefit that it's compatible with UTF-16 -- unlike Latin-1 vs UTF-8. So, at the very least, allowing UInt16 to be initialized with any BMP character wouldn't add glaring inconsistencies with Swift's "native" String encodings.

However, UCS-2 and UTF-16 aren't the only character encodings with 16-bit units -- see GB2312, Big5, JIS X 0208, etc. I don't have the cultural background to judge whether these standards cause the same sort of confusion as legacy 8-bit code pages do. (To be honest, I haven't even wrapped my head around how they work in practice. ISO-2022? DBCS? SBCS? Oh my...) It may be a good idea to find someone with relevant experience and see if this gives them the heebie-jeebies the way 'é' as a UInt8 does to me:

let maybeNotSoBadAfterAll: UInt16 = '啊'

I think supporting Int16 would be a stretch, though.

5 Likes

I really don’t think any of this makes sense above ASCII; it is absolutely fraught with peril.

let number: UInt32 = ';'
assert(number == 0x37E, "My number changed?!?")

If a source editor, pasteboard, file sync, checkout, or download performs NFx normalization, the above code will break.
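You don't even need a file round-trip to see the hazard: the two characters are canonically equivalent, so String comparison can't tell them apart, while the scalar values differ (a sketch using Foundation's NFC normalization):

import Foundation

let greekQuestionMark = "\u{37E}"                 // looks exactly like ";"
greekQuestionMark == ";"                          // true -- canonical equivalence
greekQuestionMark.unicodeScalars.first!.value     // 0x37E (894)
greekQuestionMark.precomposedStringWithCanonicalMapping
    .unicodeScalars.first!.value                  // 0x3B (59) -- a plain semicolon after NFC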

1 Like