Prepitch: Character integer literals

(^) #117

here’s the thing: for something to be harmful, it has to contribute to a user making some kind of mistake. I’m sorry, but I just can’t foresee where someone could make a mistake with -. No one writes string1 - string2 and expects it to mean anything. I have seen + used on strings, I have seen * used on strings, I have even seen w^2 used on strings. But I have to admit, I have never seen a - used on a string. The fact that the two operations produce results of totally different types makes the chance of a mistake even smaller.

(Xiaodi Wu) #118

Then substitute * for - in the discussion; the same issue applies.

At base the problem is that unless you are totally inured to the strangeness of this C-ism, performing math on literal characters is an extremely odd thing to do, because characters are not numbers (although they are represented as such) any more than arrays are pointers. Propagating this throughout Swift is not a mere implementation detail of having character literals, it's a fundamental philosophical shift in the direction of the language.

My problem with the design proposed has nothing to do with the addition of syntax to distinguish single characters from sequences of them. It has everything to do with the other piece of the proposal, which is to make numbers expressible by characters (rather than allowing explicit conversion between them, which it is already possible to do).

(^) #119

I don’t know if C pointers are a valid comparison, because unless you are writing a kernel, you almost never need to (or should) hardcode pointer addresses in source code. Interestingly, the exception is the null pointer, which we do have an ExpressibleBy conformance for: nil! On the other hand, switching on hardcoded ASCII characters is extremely common.

Beyond that though, C pointers are actually a really interesting analogy. All of us who work on/in the language agree that C pointers are problematic, and Swift offers superior tools that don’t have the same safety hazards and performance traps that C pointers have, such as Array, for _ in, map, etc. As a result, 95% of the time in Swift, you never need to touch a pointer, and probably 75% of the time you don’t even need to use indices. This is a good thing.

That being said, sometimes pointers and memory are unavoidable. And Swift does its best to provide good support for unsafe pointers to make pointer code safe and readable. For example, we check for null pointers like this

guard let buffer: UnsafePointer<Int> = foo() else { ... }

not like this

let buffer: UnsafePointer<Int> = foo()
guard buffer != .init(bitPattern: 0) else { ... }

We provide UnsafeBufferPointer and friends with Collection conformance. We even have integer subscripts on UnsafePointer, a feature I lobbied to remove from the language because I thought it was too C-ish, and was unsuccessful because people wanted to write p[0] instead of p.pointee!

Now, consider unicode codepoints. All of us who work on/in the language agree they are problematic, and Swift offers superior tools that model human-readable text better, such as Character and String. As a result, 85% of the time in Swift, you never need to touch an ASCII scalar. This is a good thing.

That being said, sometimes unicode codepoints are unavoidable. And Swift really should do its best to provide good support to make ASCII code safe and readable. Right now we have to spell them in decimal (or, slightly better, hex), or use workarounds like wrapping a higher-level construct such as a Unicode.Scalar in a function that you know returns the integer value you want. Neither of these is very readable, especially if you have a lot of them in one place together, which you frequently do.

if signature == UInt32(truncatingIfNeeded: UInt8(ascii: "J")) << 24 | 
                UInt32(truncatingIfNeeded: UInt8(ascii: "F")) << 16 | 
                UInt32(truncatingIfNeeded: UInt8(ascii: "I")) <<  8 | 
                UInt32(truncatingIfNeeded: UInt8(ascii: "F"))

If we had codepoint literals we could write this as

if signature == 'J' << 24 | 'F' << 16 | 'I' << 8 | 'F'
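Until then, the best you can do today is hide the noise in a helper. This is a sketch; `fourCC` is a hypothetical name, not a standard library function:

```swift
// Hypothetical helper: packs up to four ASCII characters into a
// big-endian UInt32, the usual layout for magic numbers.
func fourCC(_ string: String) -> UInt32 {
    var value: UInt32 = 0
    for byte in string.utf8 {
        value = value << 8 | UInt32(byte)
    }
    return value
}

let jfif: UInt32 = fourCC("JFIF") // 0x4A46_4946
```

which at least lets the call site read as `if signature == fourCC("JFIF")`, though without any compile-time checking of the literal.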

I take issue with this because characters (with a lowercase ‘c’) are numbers. Most programmers have internalized that A–Z and 0–9 are contiguous numbers, that most common symbols can be encoded in a fixed-width 8-bit integer, and engineers have taken to using this mapping between letters and numbers as useful mnemonics to encode integers. So we get ('J', 'F', 'I', 'F') == (74, 70, 73, 70) instead of four arbitrary numbers. The problem is that people tried to use this system to encode characters of human text, which doesn’t work so well. So Swift created the world of Character and String to handle human text correctly. But the machine-readable world still exists, and there, this system does work well. And using Character and all the higher-level unicode constructs in this context is as clunky and inappropriate as using UInt8 and all the lower-level unicode constructs for human text.
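The mnemonic claim is easy to verify with the standard library’s UInt8(ascii:) initializer:

```swift
// The ASCII values behind the "JFIF" mnemonic.
let scalars: [Unicode.Scalar] = ["J", "F", "I", "F"]
let values = scalars.map { UInt8(ascii: $0) }
// values == [74, 70, 73, 70]
```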


I guess I'm kind of in the middle here. I think that single-quoted literals should be able to represent any Character, but I also don't think it's a huge deal if these literals can be inferred to be the various integer types. This is going to potentially lead to some edge cases (e.g. 'f' - 1 being some integer), but I don't know that these will be frequently encountered and I think at least some of these edge cases could be handled with targeted warnings or similar (e.g. perhaps warning on use of operators that Character doesn't support when the default fallback to Character would otherwise have been used). There are already somewhat similar edge cases possible with the other literal forms and type inference.

(Xiaodi Wu) #121

Assuming signature is an array of CChar, why would anyone want to write either of these instead of String(signature) == "JFIF"?

No, I'm not making a comparison of the relationship between pointers and numbers to the relationship between characters and numbers; I am making a comparison of the relationship between pointers and arrays to the relationship between characters and numbers. Specifically, how C treats each of the two relationships as one of interchangeability and Swift does not.

No, characters don't have the semantics of numbers. They might be stored as numbers, but that doesn't make them numbers any more than arrays are pointers.

(Michel Fortin) #122

The goal in @taylorswift's example seems to be reading some kind of binary format. You don't want to risk weird Unicode character equivalences meddling with that.

Even for textual file formats, many are defined in term of code points (like XML or JSON). Parsing characters by grapheme is asking for trouble. For instance, a combining character after the quote of an XML attribute (as in attr="⃠⃠value") is well-formed XML and must be parsed as a value starting with a combining character. If you parse by grapheme, you're out of spec.
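You can see the grapheme merging directly in Swift. This sketch uses U+0301 COMBINING ACUTE ACCENT as the combining character:

```swift
// A combining mark immediately after a quote merges with it into one
// grapheme cluster, so a Character-level parser never sees a bare `"`
// at that position.
let attrFragment = "\"\u{0301}value\""
let first = attrFragment.first!  // one Character: `"` plus U+0301
assert(String(first).unicodeScalars.count == 2)
```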

So you need to express characters as code points or sometimes lower-level integers in the parser. If it's a complicated mess to express this, then the parser becomes a complicated mess. Here's a function in one of my parsers (old-style plist, parsed in UTF-16):

func skipOneUnquotedStringCharacter() -> Bool {
	switch utf[pos] {
		case "a".utf16Head ... "z".utf16Head,
			"A".utf16Head ... "Z".utf16Head,
			"0".utf16Head ... "9".utf16Head,
			"_".utf16Head, "$".utf16Head, "/".utf16Head, ":".utf16Head, ".".utf16Head, "-".utf16Head:
			pos = utf.index(after: pos)
			return true
		default:
			return false
	}
}

That utf16Head custom property? It's some weird contraption of mine I hope the optimizer is capable of seeing through.

(^) #123

String(signature) == "JFIF" is an example of what you should never do, and while Michel probably explained it better than I can I just want to reiterate so visitors don’t get misled and perpetuate this. Yes, you can probably get away with this in this situation because the people who choose these ASCII mnemonics are pretty good about picking “not-weird” characters that are fairly resilient to unicode normalization so that implementers can shoot themselves in the foot and still walk to the hospital. But this is still bad practice on a good day, a weird bug on a normal day, and a security hole on a bad day.

(Xiaodi Wu) #124

Huh? The RHS, "JFIF", is an ASCII sequence. Do you know of any "weird Unicode character equivalence" that I don't know of? What weird bugs or security holes arise from comparison of a byte sequence to a hardcoded ASCII literal that involve Unicode processing?

(^) #125

In general,

a: String == b: String   =>   a: [UInt8] == b: [UInt8]

does not hold. The converse holds, but that doesn’t really help us unless you don’t care about validation at all.
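A minimal counterexample, using a combining sequence:

```swift
let a = "caf\u{E9}"     // "café" with é as one scalar (U+00E9)
let b = "cafe\u{301}"   // "café" as e + U+0301 COMBINING ACUTE ACCENT
assert(a == b)                           // equal under canonical equivalence
assert(Array(a.utf8) != Array(b.utf8))   // but the byte sequences differ
```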

A real world example:

The JPEG magic file signature is the sequence

'ÿ', 'Ø', 'ÿ', 'Û'

Three of these four characters have composed forms that compare equal under unicode normalization rules. If you use String comparison, you are potentially accepting files as JPEGs that are not actually JPEGs.

But wait, you say! Won’t the composed form take up more code units and you could catch this by comparing String.count?† No, because 1), as you’re probably aware, String counts Characters, and 2) Unicode normalization includes singleton cases where a single scalar can be aliased by other single scalars. 'Å' is an example.
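The singleton case is observable directly in Swift, comparing U+212B ANGSTROM SIGN with U+00C5:

```swift
let angstromSign = "\u{212B}"  // ANGSTROM SIGN
let aWithRing    = "\u{00C5}"  // LATIN CAPITAL LETTER A WITH RING ABOVE
assert(angstromSign == aWithRing)  // one scalar aliases another under NFC
assert(Array(angstromSign.utf8) != Array(aWithRing.utf8))  // bytes differ
```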

In reality, real parsing libraries have lots of other components that serve as sanity checks (your file system API probably counts in bytes, not Characters, for example) so the chance of this not getting caught is low. But, is this something you are really comfortable with considering Swift’s emphasis on safety and correctness? That was the rationale for unicode-correct Strings themselves after all.

† By the way, this even happens with CRLF vs LF which is an entirely ASCII phenomenon. PNG is a real-world example of a popular file format which has a newline in its magic header. (In fact, the newline is there exactly to catch this sort of problem!)
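For reference, here is PNG’s eight-byte signature, with the CR LF pair that catches line-ending translation and the non-ASCII 0x89 byte that catches 7-bit channels:

```swift
// PNG file signature: \x89, "PNG", CR, LF, ^Z, LF.
let pngSignature: [UInt8] = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]
```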

(Michel Fortin) #126

The rules for comparing strings are complicated and aren't stable; they are evolving as Unicode evolves, and perhaps as Swift evolves. Probably it's fine in this case, and perhaps it'll be fine forever in this case. But in general there are inherently more pitfalls in doing something complicated that is thought to be equivalent than in just comparing byte-for-byte. Better to compare byte-for-byte when that is what you need than to have to prove the string comparison is equivalent.

We still haven't proven it's fine for this case, by the way. We're just assuming it'll work since it's ASCII and we can't find a reason it'll break using our knowledge of today's Unicode.

Also, it's worth mentioning that doing a Unicode comparison when you don't need one is rather inefficient.

(Xiaodi Wu) #127

It does hold for ASCII (to be more specific, when comparing any string to an ASCII string), which is what your example is about, and what the use cases shown above are all about. Yes, even "\r\n" is distinguished from "\n".
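Both halves of the CRLF claim check out in Swift: CR LF forms a single grapheme cluster, yet it still compares unequal to a bare LF:

```swift
let crlf = "\r\n"
assert(crlf != "\n")     // String.== does not conflate the two
assert(crlf.count == 1)  // even though CR LF is one Character
```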

My question was not a rhetorical one. Do you know something I don't which suggests it won't be fine? Because my understanding is that Unicode guarantees that, for all ASCII strings, it will be fine forever. See, for example, this explanation in UTR#15:

Note: Text containing only ASCII characters (U+0000 to U+007F) is left unaffected by all of the normalization forms. This is particularly important for programming languages.

So, again, I pose the question:

Assuming signature is an array of CChar, why would anyone want to write either of [those examples] instead of String(signature) == "JFIF"?

…because the whole point of Unicode making particular guarantees about the ASCII range, and of Swift faithfully implementing Unicode-safe strings, is to make String(signature) == "JFIF" do exactly what you mean it to on a good day, a normal day, and a bad day. If there's some pitfall that means you shouldn't write this in some situation, then that is a massive failure on the part of Swift and/or Unicode.

Actually, comparisons of known ASCII strings are quite efficient in Swift. Do you have data to suggest otherwise? Meanwhile, if it's not a known ASCII string, then you actually need to incur the cost of (still a rather efficient) comparison operation.

(^) #128

String.init(cString:) parses the C string as UTF-8, not ASCII. There is a method init(cString:encoding:) which can import it as an ASCII string, but that is a Foundation method, not a Swift method.
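A sketch of the practical difference: String(cString:) decodes as UTF-8 and repairs ill-formed sequences, so a lone 0xFF byte does not survive the round trip:

```swift
let cString: [CChar] = [-1, 0]           // 0xFF, then the NUL terminator
let decoded = String(cString: cString)   // invalid UTF-8 is repaired…
assert(decoded == "\u{FFFD}")            // …to U+FFFD REPLACEMENT CHARACTER
```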

I probably didn’t make this distinction clear, but technically UInt8s include the region 0x80-0xFF, we just don’t have a good name for it so I’ve been saying “ASCII”. While most formats explicitly disallow names from including those characters, they are very common in file headers and magic signatures.

(Xiaodi Wu) #129

I think, then, that you have a good argument to support bringing an ASCII-specific version of that method into the standard library.

But in the case of this example (which has to do with literals, the topic of this thread after all) even that is not necessary because you are totally in control of the literal itself being entirely ASCII, and only one of the two operands needs to be entirely ASCII for the relation in question to hold.

(^) #130

…the entire point of such code is to check if the bytes we read in, which could be anything, match a particular sequence, and only that sequence. If something matches that shouldn’t match, that’s a bug.

(Michel Fortin) #131

My point is I don't know everything about Unicode, even though I know more than many, and comparing Unicode strings is far removed from the problem. It doesn't make sense to have to depend on a shortcut path in a complex algorithm if all you need is to compare bytes.

All this doesn't mean you can't use a string for checking this four-char signature. You can write this pretty neat code:

let signature = bytes[0..<4]
if signature.elementsEqual("JFIF".utf8) { ... }

Here it's clear the code does a byte-by-byte comparison and nothing else. This expresses the intent much better than a normal string comparison.

That said, I'm not sure what @taylorswift actually wants is ASCII (ASCII stops at 128, so there's no ÿØÿÛ in ASCII). If you want Unicode scalars encoded on one byte (equivalent to ISO 8859-1), you'll need something like this:

signature.elementsEqual("ÿØÿÛ".unicodeScalars.map { UInt8($0.value) })

Especially in this case, doing a byte-by-byte comparison is what you must do.

(Xiaodi Wu) #132

I understand you; I think you're misunderstanding me. When you use String.==, only a string encoded by a sequence of exactly the same bytes will match a string literal with only ASCII characters. Unless you know something different, in which case I am grossly misunderstanding the Unicode standard...

You could do this, but in the case of two ASCII strings, String.== performs a byte-by-byte comparison. That's not a happy accident but by design.

Ah but Unicode treats the Latin-1 range in a special way too. The latest versions of UAX #15 expand on what I quoted above:

Text exclusively containing ASCII characters (U+0000..U+007F) is left unaffected by all of the Normalization Forms. This is particularly important for programming languages.... Text exclusively containing Latin-1 characters (U+0000..U+00FF) is left unaffected by NFC. This is effectively the same as saying that all Latin-1 text is already normalized to NFC.

(^) #133

There are Latin-1 characters that have aliases, lots of them actually: áàâäã…

I believe the document is referring to the fact that 0x00–0xFF text can never collapse to other 0x00–0xFF characters. This corresponds to the hypothetical String.init(cString:encoding:) method you mentioned. It provides absolutely no guarantees for any real String initializers. But adding such an initializer would be rather silly, because what would be the point of having the String instead of a buffer in the first place? We would effectively be turning off all the functionality that makes String, String, without gaining any functionality useful for processing byte strings in return. A String wrapper is also just the wrong tool for the job: how should you subscript the kth codepoint in the C string? Integer subscripts make no sense in the context of String, yet we commonly need them in the context of byte strings.
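The subscripting point, concretely:

```swift
let bytes: [UInt8] = Array("JFIF".utf8)
let third = bytes[2]  // 73: an O(1) integer subscript on a byte buffer

let string = "JFIF"
// No string[2]; you must manufacture a String.Index instead:
let index = string.index(string.startIndex, offsetBy: 2)
let thirdCharacter = string[index]  // "I"
```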

This is true of the strict ASCII subrange (0x00 ... 0x7F) of UInt8. It is not true for the upper half of UInt8’s range (whose elements are just as common in this context as the lower half), and it is not true of wider formats like UInt16 or UInt32, which this proposal also seeks to cover. You also haven’t addressed Michel’s original point about why parsing by grapheme can be incorrect, or about how awkward writing codepoint-based parsers is under the current system. And you haven’t proposed an alternative way to get compile-time overflow checks for whether Unicode.Scalar literals are representable in a certain range.

(Xiaodi Wu) #134

Again, I think you're not understanding me. We are discussing literals, meaning you control what you put inside them as the author of the code. When you use an ASCII string literal on one side of a comparison, it does not matter what string you compare that literal to: that other string can be in the ASCII subrange, it can be Latin-1, it can be a string composed entirely of emoji. The only thing that will compare equal is a string of exactly the same ASCII characters.

There's a lot more about NFC than just that, but this fact alone is sufficient if you're working with bytes. When you compare a known number of bytes of unknown value to a Latin-1 string literal of the same number of characters, then once again String.== would only evaluate to true for one specific sequence of bytes.

So again you can see how Unicode has made certain design choices specifically to accommodate programmers' use of byte strings.

I have no idea what this has to do with a proposal for character literals here. Remember what you are arguing:

You define this so-called 'world' as one where some programmers associate numbers with characters. No one associates integers past 256 with characters, and certainly not any values in the range where one has to wonder if it'll overflow UInt32.

You produced an example involving "JFIF" which you argued demonstrated why a character literal would make the code more readable, and I demonstrated how Unicode strings do perfectly fine; you argued that such an approach should never be used for safety reasons and I showed how in fact it can always be used.


Code is read more often than it is written. So, when you’re reading some code, how can you tell, say, which of these literal strings is ASCII?

let a = "ЈFІF"
let b = "JFⅠF"
let c = "JꓝIꓝ"
let d = "JFIF"
let e = "JϜΙϜ"
let f = "JF‌IF"
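None of these can be told apart reliably by eye; programmatically, only one passes an ASCII check. (A sketch reusing the strings above, assuming the invisible character in f is U+200C ZERO WIDTH NON-JOINER:)

```swift
// Only candidate d consists entirely of ASCII scalars.
let candidates = ["ЈFІF", "JFⅠF", "JꓝIꓝ", "JFIF", "JϜΙϜ", "JF\u{200C}IF"]
let asciiOnly = candidates.filter { $0.unicodeScalars.allSatisfy(\.isASCII) }
assert(asciiOnly == ["JFIF"])
```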

(Xiaodi Wu) #136

If the ease of picking out such differences written by others is important to you, then that's a great argument that integers should never be expressible by character or string literals: put another way, if you find that to be persuasive, that's an argument against the use case advanced by @taylorswift for character literals.