Prepitch: Character integer literals

The goal in @taylorswift's example seems to be reading some kind of binary format. You don't want to risk weird Unicode character equivalences meddling with that.

Even for textual file formats, many are defined in terms of code points (like XML or JSON). Parsing characters by grapheme is asking for trouble. For instance, a combining character after the quote of an XML attribute (as in attr="⃠⃠value") is well-formed XML and must be parsed as a value starting with a combining character. If you parse by grapheme, you're out of spec.
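Here's roughly what that looks like at the grapheme versus scalar level (the identifiers are just for illustration):

let attributeText = "\"\u{20E0}value\""   // attribute content starting with U+20E0

let quoteCharacter: Character = "\""
print(attributeText.first == quoteCharacter)              // false: the first grapheme is '"' followed by U+20E0

let quoteScalar: Unicode.Scalar = "\""
print(attributeText.unicodeScalars.first == quoteScalar)  // true: the scalar view still sees the plain quote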

So you need to express characters as code points, or sometimes even lower-level integers, in the parser. If it's a complicated mess to express this, then the parser becomes a complicated mess. Here's a function in one of my parsers (old-style plist, parsed in UTF-16):

func skipOneUnquotedStringCharacter() -> Bool {
	switch utf[pos] {
		case "a".utf16Head ... "z".utf16Head,
			"A".utf16Head ... "Z".utf16Head,
			"0".utf16Head ... "9".utf16Head,
			"_".utf16Head, "$".utf16Head, "/".utf16Head, ":".utf16Head, ".".utf16Head, "-".utf16Head:
			pos = utf.index(after: pos)
			return true
		default:
			return false
	}
}

That utf16Head custom property? It's some weird contraption of mine I hope the optimizer is capable of seeing through.
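It could be as small as something like this (a guess at the shape, not the actual implementation):

extension String {
    // First UTF-16 code unit; only meaningful here because the literals involved are single-unit ASCII characters.
    var utf16Head: UInt16 {
        return utf16.first!
    }
}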

7 Likes

String(signature) == "JFIF" is an example of what you should never do, and while Michel probably explained it better than I can I just want to reiterate so visitors don’t get misled and perpetuate this. Yes, you can probably get away with this in this situation because the people who choose these ASCII mnemonics are pretty good about picking “not-weird” characters that are fairly resilient to unicode normalization so that implementers can shoot themselves in the foot and still walk to the hospital. But this is still bad practice on a good day, a weird bug on a normal day, and a security hole on a bad day.

5 Likes

Huh? The RHS, "JFIF", is an ASCII sequence. Do you know of any "weird Unicode character equivalence" that I don't know of? What weird bugs or security holes arise from comparison of a byte sequence to a hardcoded ASCII literal that involve Unicode processing?

2 Likes

In general,

a':String == b':String => a:[UInt8] == b:[UInt8]

does not hold. The converse holds, but that doesn’t really help us unless you just don’t care about validation at all.

A real world example:

The JPEG magic file signature is the sequence

'ÿ', 'Ø', 'ÿ', 'Û'

Three of these four characters have decomposed forms that compare equal to them under Unicode normalization rules. If you use String comparison, you are potentially accepting files as JPEGs that are not actually JPEGs.

But wait, you say! Won’t the decomposed form take up more code units, so you could catch this by comparing String.count?† No, because 1) as you’re probably aware, String counts Characters, and 2) Unicode normalization includes singleton cases where a single scalar can be aliased by another single scalar. 'Å' is an example.
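To make the singleton case concrete, here's a quick check using nothing beyond the standard library:

let precomposed = "\u{00C5}"   // 'Å' LATIN CAPITAL LETTER A WITH RING ABOVE
let angstrom    = "\u{212B}"   // 'Å' ANGSTROM SIGN, a normalization singleton
let decomposed  = "A\u{030A}"  // 'A' followed by COMBINING RING ABOVE

print(precomposed == angstrom, precomposed == decomposed)   // true true
print(precomposed.count, angstrom.count, decomposed.count)  // 1 1 1
print(Array(precomposed.utf8) == Array(angstrom.utf8))      // false: the underlying bytes differ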

In reality, real parsing libraries have lots of other components that serve as sanity checks (your file system API probably counts in bytes, not Characters, for example), so the chance of this not getting caught is low. But is this something you are really comfortable with, considering Swift’s emphasis on safety and correctness? That was the rationale for Unicode-correct Strings themselves, after all.

† By the way, this even happens with CRLF vs LF, which is an entirely ASCII phenomenon. PNG is a real-world example of a popular file format which has a newline in its magic header. (In fact, the newline is there exactly to catch this sort of problem!)
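For reference, here is the PNG signature as bytes; the CRLF plus the trailing bare LF mean that line-ending translation in either direction corrupts it:

let pngSignature: [UInt8] = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]  // 0x89 'P' 'N' 'G' CR LF 0x1A LF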

6 Likes

The rules for comparing strings are complicated and aren't stable; they evolve as Unicode evolves, and perhaps as Swift evolves. Probably it's fine in this case, and perhaps it'll be fine forever in this case. But in general there are inherently more pitfalls in doing something complicated that is thought to be equivalent than in just comparing byte-for-byte. Better to compare byte-for-byte when that is what you need than to have to prove the string comparison is equivalent.

We still haven't proven it's fine for this case by the way. We're just assuming it'll work since it's ASCII and we can't find a reason it'll break using our knowledge of today's Unicode.

Also, it's worth mentioning that doing a Unicode comparison when you don't need one is rather inefficient.

4 Likes

It does hold for ASCII (to be more specific, when comparing any string to an ASCII string), which is what your example is about, and what the use cases shown above are all about. Yes, even "\r\n" is distinguished from "\n".
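You can verify both points quickly:

print("\r\n" == "\n")   // false: CRLF and LF are not canonically equivalent
print("\r\n".count)     // 1: CRLF is a single Character, yet the strings still compare unequal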

My question was not a rhetorical one. Do you know something I don't which suggests it won't be fine? Because my understanding is that Unicode guarantees that, for all ASCII strings, it will be fine forever. See, for example, this explanation in UTR#15:

Note: Text containing only ASCII characters (U+0000 to U+007F) is left unaffected by all of the normalization forms. This is particularly important for programming languages.

So, again, I pose the question:

Assuming signature is an array of CChar, why would anyone want to write either of [those examples] instead of String(signature) == "JFIF"?

...because the whole point of Unicode making particular guarantees about the ASCII range, and of Swift faithfully implementing Unicode-safe strings, is to make String(signature) == "JFIF" do exactly what you mean it to on a good day, normal day, and bad day. If there's some pitfall that means you shouldn't write this in some situation, then that is a massive failure on the part of Swift and/or Unicode.


Actually, comparisons of known ASCII strings are quite efficient in Swift. Do you have data to suggest otherwise? Meanwhile, if it's not a known ASCII string, then you actually need to incur the cost of a (still rather efficient) comparison operation.

3 Likes

String.init(cString:) parses the C string as UTF-8, not ASCII. There is a method init(cString:encoding:) which can import it as an ASCII string, but that is a Foundation method, not a Swift method.

I probably didn’t make this distinction clear, but technically UInt8s include the region 0x80–0xFF; we just don’t have a good name for it, so I’ve been saying “ASCII”. While most formats explicitly disallow names from including those characters, they are very common in file headers and magic signatures.

I think, then, that you have a good argument to support bringing an ASCII-specific version of that method into the standard library.

But in the case of this example (which has to do with literals, the topic of this thread after all) even that is not necessary because you are totally in control of the literal itself being entirely ASCII, and only one of the two operands needs to be entirely ASCII for the relation in question to hold.

…the entire point of such code is to check if the bytes we read in, which could be anything, match a particular sequence, and only that sequence. If something matches that shouldn’t match, that’s a bug.

My point is that I don't know everything about Unicode, even though I know more than many, and comparing Unicode strings is far removed from the problem. It doesn't make sense to have to depend on a shortcut path in a complex algorithm when all you need is to compare bytes.

All this doesn't mean you can't use a string for checking this four-char signature. You can write this pretty neat code:

let signature = bytes[0..<4]
signature.elementsEqual("JFIF".utf8)

Here it's clear the code does a byte-by-byte comparison and nothing else. This expresses the intent much better than a normal string comparison.


That said, I'm not sure that what @taylorswift actually wants is ASCII (ASCII stops at 0x7F, so there's no ÿØÿÛ in ASCII). If you want Unicode scalars that fit in one byte (equivalent to ISO 8859-1), you'll need something like this:

signature.elementsEqual("ÿØÿÛ".unicodeScalars.map { UInt8($0.value) })

Especially in this case, doing a byte-by-byte comparison is what you must do.

I understand you; I think you're misunderstanding me. When you use String.==, only a string encoded by a sequence of exactly the same bytes will match a string literal with only ASCII characters. Unless you know something different, in which case I am grossly misunderstanding the Unicode standard...


You could do this, but in the case of two ASCII strings, String.== performs a byte-by-byte comparison. That's not a happy accident but by design.

Ah but Unicode treats the Latin-1 range in a special way too. The latest versions of UAX #15 expand on what I quoted above:

Text exclusively containing ASCII characters (U+0000..U+007F) is left unaffected by all of the Normalization Forms. This is particularly important for programming languages.... Text exclusively containing Latin-1 characters (U+0000..U+00FF) is left unaffected by NFC. This is effectively the same as saying that all Latin-1 text is already normalized to NFC.

There are Latin-1 characters that have aliases, lots of them actually: áàâäã…

I believe the document is referring to the fact that text in the 0x00–0xFF range can never collapse to other 0x00–0xFF characters under NFC. This corresponds to the hypothetical String.init(cString:encoding:) method you mentioned. It provides absolutely no guarantees for any real String initializers. But adding such an initializer would be rather silly, as what would be the point of having the String instead of a buffer in the first place? We would effectively be turning off all the functionality that makes String, String, without gaining any functionality useful for processing byte strings in return. A String wrapper is also just the wrong tool for the job: how should you subscript the kth code point in the C string? Integer subscripts make no sense in the context of String, yet we commonly need them in the context of byte strings.

This is true of the strict ASCII subrange of UInt8.min ... UInt8.max. It is not true for the upper half of UInt8’s range (whose elements are just as common as the lower half), and it is not true of wider formats like UInt16 or UInt32, which this proposal also seeks to cover. You also haven’t addressed Michel’s original point about why parsing by grapheme can be incorrect, or about how awkward writing codepoint-based parsers is under the current system. You haven’t proposed an alternative way we can get compile-time overflow checks for whether Unicode.Scalar literals are representable by a certain range.

Again, I think you're not understanding me. We are discussing literals, meaning you control what you put inside them as the author of the code. When you use an ASCII string literal on one side of a comparison, it does not matter what string you compare that literal to: that other string can be in the ASCII subrange, it can be Latin-1, it can be a string composed entirely of emoji. The only thing that will compare equal is a string of exactly the same ASCII characters.

There's a lot more to NFC than just that, but this fact alone is sufficient if you're working with bytes. When you compare a known number of bytes of unknown value to a Latin-1 string literal of the same number of characters, then once again String.== would only evaluate to true for one specific sequence of bytes.

So again you can see how Unicode has made certain design choices specifically to accommodate programmers' use of byte strings.


I have no idea what this has to do with a proposal for character literals here. Remember what you are arguing:

You define this so-called 'world' as one where some programmers associate numbers with characters. No one associates integers past 256 with characters, and certainly not any values in the range where one has to wonder if it'll overflow UInt32.

You produced an example involving "JFIF" which you argued demonstrated why a character literal would make the code more readable, and I demonstrated how Unicode strings do perfectly fine; you argued that such an approach should never be used for safety reasons and I showed how in fact it can always be used.

Code is read more often than it is written. So, when you’re reading some code, how can you tell, say, which of these literal strings is ASCII?

let a = "ЈFІF"
let b = "JFⅠF"
let c = "JꓝIꓝ"
let d = "JFIF"
let e = "JϜΙϜ"
let f = "JF‌IF"

3 Likes

If the ease of picking out such differences written by others is important to you, then that's a great argument that integers should never be expressible by character or string literals: put another way, if you find that to be persuasive, that's an argument against the use case advanced by @taylorswift for character literals.

2 Likes

A lot of the use cases in this thread might be solved by an AsciiString* type, which would be initializable with a string literal and fail for any non-ASCII characters.

let a: AsciiString = "JFIF"

The use cases where someone wants to compare strings byte-wise with incoming data don't really line up with needing Unicode support, so perhaps a separate type is in order? It could also be integer-subscriptable since each character would be fixed-width. Reasonable use cases for integer subscripting of strings usually involve an assumption that the string will be ASCII-compatible.
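Roughly, such a type might look like this (the name and the runtime precondition are placeholders; ideally the non-ASCII check would happen at compile time):

struct AsciiString: Equatable, ExpressibleByStringLiteral {
    var bytes: [UInt8]

    init(stringLiteral value: String) {
        precondition(value.unicodeScalars.allSatisfy { $0.isASCII },
                     "AsciiString literals must be pure ASCII")
        bytes = Array(value.utf8)
    }

    // Fixed-width elements make integer subscripting well-defined.
    subscript(index: Int) -> UInt8 { return bytes[index] }
}

let jfif: AsciiString = "JFIF"
print(jfif[0]) // 74, the byte value of 'J'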

Anyway, I'm not sure exactly how such a type would affect the desire for Character integer literals, but it seems like it might address a need here.


*Technically, "ASCII" only refers to the bottom 128 characters of the 256 possibilities in a single byte. It would be nice to have a better name to indicate that it covers the top half as well, but I'm not sure what an appropriate name would be.

3 Likes

I’m sad to see this pitch get mired in specifics and falter again. For a pithy high-level summary of the issues in play, you need look no further than this:

@xwu, if it is your opinion that no attempt should be made to build a bridge between character literals and their ASCII/codepoint values, then this was never a pitch you were going to find attractive; but character literals are of little use without this feature. It’s a debate that needed to be had, and we all understand you’re trying to protect the Swift string model from dilution, but would it be possible to sit this one out for now rather than saying the same thing again and again, so we can see where this pitch leads? I for one think this is a worthwhile idea that I would like to see go to review, where you’ll have a chance to have your say.

To that end, I’ve created a “Detailed Design” paragraph with an example use case to drop into Chris’ draft.

Detailed Design

These new character literals will have default type Character and be statically checked to contain only a single extended grapheme cluster. They will be processed largely as if they were a short String.

When the Character is representable by a single Unicode codepoint (a 21-bit number), however, it will also be able to express a Unicode.Scalar or any of the integer types, provided the codepoint value fits into that type.

As an example:

let a = 'a' // This will have type Character
let s: Unicode.Scalar = 'a' // is also possible
let i: Int8 = 'a' // takes the ASCII value

In order to implement this, a new protocol ExpressibleByCodepointLiteral is created and used, instead of ExpressibleByExtendedGraphemeClusterLiteral, for character literals that are also a single codepoint. Conformances to this protocol for Unicode.Scalar, Character and String will be added to the standard library so these literals can operate in any of those roles. In addition, conformances to ExpressibleByCodepointLiteral will be added to all integer types in the standard library so a character literal can initialize variables of integer type (subject to a compile-time range check) or satisfy the type checker for arguments of these integer types.
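Roughly, the protocol could take the following shape (a sketch only; the actual prototype may differ, and the real feature would perform the range check at compile time rather than trapping):

protocol ExpressibleByCodepointLiteral {
    init(codepointLiteral value: Unicode.Scalar)
}

// Example integer conformance; the compile-time range check is approximated
// here by a trapping conversion.
extension UInt8: ExpressibleByCodepointLiteral {
    init(codepointLiteral value: Unicode.Scalar) {
        self.init(value.value)  // UInt8(UInt32) traps if the codepoint doesn't fit in 8 bits
    }
}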

These new conformances and the existing operators defined in the Swift language will make the following code possible out of the box:

func unhex(_ hex: String) throws -> Data {
    guard hex.utf16.count % 2 == 0 else {
        throw NSError(domain: "odd characters", code: -1, userInfo: nil)
    }

    func nibble(_ char: UInt16) throws -> UInt16 {
        switch char {
        case '0' ... '9':
            return char - '0'
        case 'a' ... 'f':
            return char - 'a' + 10
        case 'A' ... 'F':
            return char - 'A' + 10
        default:
            throw NSError(domain: "bad character", code: -1, userInfo: nil)
        }
    }

    var out = Data(capacity: hex.utf16.count/2)
    var chars = hex.utf16.makeIterator()
    while let next = chars.next() {
        try out.append(UInt8((nibble(next) << 4) + nibble(chars.next()!)))
    }

    return out
}

One area which may involve ambiguity is the + operator, which can mean either String concatenation or addition of integer values. Generally this wouldn't be a problem, as most reasonable contexts will provide the type checker with the information to make the correct decision.

print('1' + '1' as String) // prints 11
print('1' + '1' as Int) // prints 98 (49 + 49; '1' has ASCII value 0x31)

10 Likes

This implies that ("1" as Character) + ("1" as Character) as String should work too. Unless you can also write var str: String = '1', which would again fold together string literals and character literals, which is the status quo. No?

Thanks @taylorswift for the proposal and thanks @johnno1962 for the insightful summary.

Being able to write let i: UInt8 = 'a' would be a game changer for people working a lot with embedded devices and serial/Bluetooth/BLE communication.

We basically just send & receive bytes, and sometimes it makes a lot of sense to use the ASCII representation to make the code clearer.

One super simple example using Arduino. Imagine you have the following code running on your board:

  while (Serial.available() > 0) {
    uint8_t c = Serial.read();

    if (c == '+') {
      digitalWrite(LED_BUILTIN, HIGH);
    }
    else if (c == '-') {
      digitalWrite(LED_BUILTIN, LOW);
    }

    delay(250);
  }

Right now on the Swift side, I need to write this to make the LED blink:

let buffer: [UInt8] = [43, 45, 43, 45, 43, 45, 43, 45, 43, 45]
Darwin.write(fd, buffer, buffer.count)

It would be much clearer with:

let buffer: [UInt8] = ['+', '-', '+', '-', '+', '-', '+', '-', '+', '-']
Darwin.write(fd, buffer, buffer.count)

It's just an example, but as things get more complex, it could really be helpful.

8 Likes

'1' + '1' as String works because String has an ExpressibleByCodepointLiteral conformance in the prototype, so the expression uses the String + String operator. Perhaps I should remove that conformance, and then there wouldn’t be any ambiguity.