SE-0243: Codepoint and Character Literals

taylorswift · March 11, 2019, 7:08pm

isASCII is fundamentally different. It just tells you if the object satisfies some property or not. It doesn’t make an assumption that it does, and crash the program if it doesn’t.

An analogy is Array.first. We have Array.isEmpty, which tells you if Array.first exists or not, but Array.first doesn’t and shouldn’t crash the program if you invoke it on an empty array.

Swift is Unicode, and Unicode is ASCII. We do not need to specify the encoding, because the encoding has already been specified by the language.

Maybe not on Character but certainly 65 as Unicode.Scalar makes sense. This is the whole point of explicit coercible-but-not-expressible as expressions. They allow us to express literal coercions without opening up the whole pandora’s box that ExpressibleBy brings with itself. Keep in mind we still don’t have a way of going the other way, i.e. converting an integer literal to a codepoint, that is validated at compile-time. (this is a bit of a rarified point though, we already have "\u{XX}" syntax which works just as well, except only in hex.)

lorentey · March 11, 2019, 8:21pm

Code points are an extension of ASCII, but there are ASCII-incompatible Unicode encoding forms.

More importantly, the fact that Swift standardizes on Unicode strings does not mean that we can ignore internationalization best practices in our API design. "Unicode" is not a meaningful encoding when it comes to byte sequences. Even if we restrict ourself to UTF-8, UTF-16-LE, UTF-16-BE, UTF-32-LE, UTF-32-BE, we do need to keep track of which one we're using at any given point.

michelf · March 11, 2019, 10:00pm

We could also say that "A" does not specify its encoding. We know it's unicode because Swift defines it that way, and Swift could define 'A' to mean ASCII. [To be clear, an ASCII literal is not exactly what the current proposal is trying to achieve, just what I wish it to be.]

ASCII is meaningful for byte sequences however. So we just need to define the literal as an ASCII literal.

Only around half the strings are meant for internationalization though. The rest includes identifiers, data formatted in specific ways (number and dates for instance) that must be parsed, signature bytes in binary structures, etc. All of this is predominantly ASCII.

Having to write "ascii" around every literal is a bit like using naturalLogarithm(of: 5) for math. It might be helpful if you aren't familiar the context, but you'll tire out quickly of writing equations using such verbose tools. It'd be good if people did not feel C has better ergonomics for writing efficient parsers, because this is a huge source of security bugs in C.

lorentey · March 11, 2019, 11:07pm

String is a high-level abstraction. The string value "Café" is a value of an abstract data type representing a Unicode string. It's true, String specifies nothing about how it internally encodes its value -- indeed, the internal encoding depends on the stdlib version used and the provenance of the value. However, String's internal encoding is irrelevant during its use. String's API is carefully defined in terms of abstract concepts like grapheme clusters, not any particular encoding.

In order to convert a String value into a series of bytes like [UInt8], you need to encode it. This is never done automatically. The APIs we provide to do this explicitly specify the encoding being used.

One simple way of encoding Strings is to use the encoded views provided by the stdlib. "A".utf8 gives you a collection containing the single UInt8 value 65. This is perfectly fine -- the encoding used is explicit in the name of the view, and the potential for confusion is minimal.

We deliberately do not provide "A" as [UInt8], however useful it might seem like for the common UTF-8 case. It would be a bad idea to introduce such a language feature.

I fail to see how Unicode scalars are in any way different. Unicode.Scalar is an abstraction for a grapheme. It has an assigned number, but it isn't the same as that number. For any Unicode scalar, there are many ways to encode it into a series of bytes, and the encoding usually has no obvious relationship to the numerical value of its code point. Implicitly defaulting to any particular encoding (even for a relatively(!) common subset like ASCII) would be a mistake.

Edit:

That would indeed work, as long as the literal syntax would only accept ASCII characters. In this case, the literal syntax itself would be a shorthand for the encoding specification.

let a: Character = 'e'
let b: Unicode.Scalar = 'e'
let c: UInt8 = 'e'

let d: Character = 'é' // error: 'é' is not an ASCII character
let e: Unicode.Scalar = 'é' // error: 'é' is not an ASCII character
let d: UInt8 = 'é' // error: 'é' is not an ASCII character

I don't think that would fly, though.

plorenzi · March 11, 2019, 11:37pm

I think the ' ' notation is not pulling its weight. Anyway I would never use it outside ASCII, because if I write 'é' in a source code I'm not sure if it has been encoded as one scalar (U+00E9) or two (U+0301 and U+0065), and I'd fall back on an explicit code '\u{E9}', and in that case the notation is hardly better than the constructor Unicode.Scalar(0xE9)!.

The only use cases I had with this notation was to more easily manipulate Character and Int values within the ASCII range. But if it can't be both a Character and an Int, as the discussion showed, I'm not sure it is still worth considering.

xwu · March 12, 2019, 12:18am

And this is precisely why the Unicode scalar literal idea is so powerful in terms of making it easier to work with Unicode scalars, because the compiler would enforce the literal being one and only one scalar—so that you can be sure!

plorenzi · March 12, 2019, 12:40am

Oh yes, right, but still, I don't think I would use it outside ASCII if it's a Unicode scalar. I'm French and I maybe could use 'é' in a source code, but in that case I'd need it as a Character because I'd compare it to characters in French strings. The notation is supposed to be short and simple, I don't see the point if I have to do the conversion Character('é').

SDGGiesbrecht · March 12, 2019, 2:21am

It would enforce that it is only one scalar, which would catch many problems such as é. But in general it would not enforce that it is the intended scalar, since not all canonical equivalence involves expansion.

Even with 'é', someone still has to go back and retype it now that the program refuses to compile, (or use some tool to normalize the file to NFC, which might break some other literal the same way). It really is best to do '\u{E9}', as @plorenzi said. — P.S. : Vive la francophonie !

Of course, when working with text as text, equivalence is desired and Character is the tool for the job, regardless of which quotation marks end up being necessary to express it, 'é' or "é". Then it does not matter what happens to the file, because its source code is under the same equivalence rules as the text value the literal represents.

In case it has not been said yet, 'é' as Character should not be a thing if it has to be converted through Unicode.Scalar to get there. It would be very surprising that it sometimes plays out like 'e' as Character, which would always succeed; and other times like 'ȩ̨̛̇̉̆̃̂́̀̈̌̄̊' as Character, which would always fail even though it could be represented by a single Character. It would mirror the large UInt literals that used to overflow Int along their way to becoming actual UInt instances.

xwu · March 12, 2019, 3:16am

Right, which is why we would propose one syntax as a Unicode scalar literal. In other words, it is to be used precisely for those uses where canonical equivalence is unwanted. Currently, since Unicode scalar literals are initialized using double quotation marks, there is no such visual indicator.

There is no conversion involved. This would be a coercion. No such thing as overflow would occur.

If single quotes are reserved for Unicode scalar literals, then this would be invalid syntax just as let x: Int = 123xyz would be invalid syntax for an integer literal. There would be no failure in coercion or conversion.

dwaite · March 12, 2019, 8:24am

I feel if ‘x’ is a syntax for a Unicode scalar literal, then multi-point EGC like ‘‘ should be compile errors.

johnno1962 · March 12, 2019, 2:48pm

A couple of days ago I posted a couple of initialisers and four operators that resolve the ergonomic issues I personally was looking to see solved in working with buffers of ints. I’ve prepared a patch and this change passes all the Swift project tests without requiring the new literal syntax. As these operators are targeted, this new approach does do not allow nonsense expressions such as '1' / ‘1’ or 'x'.isMultipleOf('a’) as was the case when we were exploring using the ExpressibleBy protocols to make Character literals convertible to integers which seems to have been a misstep.

This nicely decouples the decision on whether Swift needs a new literal syntax from our ergonomic goals and we can make a decision about it separately. For me it is still worth pursuing as people have slipped into the new syntax in posts indicating it seems very familiar and a distinction worth making. This will be a source breaking change (why introduce a new syntax without deprecating the old one) for which the bar has to be set high. To give an idea of the amount of disruption, this would require changes in 96 places in all of the standard library and 58 changes in the Swift project tests. We seem to be headed towards Unicode.Scalar literals and not Character literals this would be a retrograde step IMO. Character is the element type of Swift strings and what constitutes a single visual entity on the screen. Character is not a particularly useful type but subtle distinction between Unicode.Scalar and Character is not one we want users to have to be mindful of to use the new literal.

Looking back through the posts it seems at times we took the fork to Unicode.Scalar literals to facilitate implementing the trapping .ascii property. This is not a consideration with the patch I propose as this property is not required. This implicit choice of the encoding may be viewed as a mis-feature by some but for me ASCII is so pervasive and would be most peoples understanding (particularly with the new literal syntax) encountering an expression such as:

if cString.advanced(by: 2).pointee == ‘i' {

SDGGiesbrecht · March 12, 2019, 8:01pm

Perhaps my terms were imprecise. Sorry.

What I meant was that if the 'e' syntax takes on the semantics of Unicode.Scalar, then it should not be coercible to a Character. Someone who writes 'é' as Character would be asking for a valid character no matter what, but it may or may not be valid syntax after the source file is handled in any way. The developer should be forced to use something like "é" as Character instead, which is universally stable. As concisely as I can put it: Coercion to a Character should only be possible away from a String^† and not possible away from a Unicode.Scalar.

(† or from a Character like in the original proposal.)

xwu · March 12, 2019, 8:03pm

This would be a justifiable choice but is not possible with ABI stability, because ExpressibleByUnicodeScalarLiteral is refined by ExpressibleByExtendedGraphemeClusterLiteral and in turn by ExpressibleByStringLiteral. However, I am not sure I see in what way an explicit coercion would be harmful.

SDGGiesbrecht · March 12, 2019, 8:30pm

It could still be rejected by the compiler when it processes the coercion attempt: “Error: Coercion to a type conforming to ExpressibleByExtendedGraphemeClusterLiteral requires a double‐quoted String literal.”

Runtime usage of the protocol would be “fine”. If someone calls init(unicodeScalarLiteral: someScalarProducedAtRuntime), it is “okay”, because they can only get there with a valid Unicode.Scalar instance.

It is only at the pre‐compiler, literal‐in‐a‐source‐file level that the character is volatile and may not stay in a single‐scalar form. I would rather that be rejected immediately, than left to become a sudden surprise down the road. "e" as Character is just as succinct anyway.

xwu · March 13, 2019, 1:48am

Yes, we could have the compiler reject it, but then the protocol conformance and protocol hierarchy would be a lie. Swift would affirm that String: ExpressibleByUnicodeScalarLiteral and then bark an error at you when you actually attempt to express a string by a Unicode scalar literal. You still haven't demonstrated where the harm lies that requires such a prohibition which would violate promised semantics.

SDGGiesbrecht · March 13, 2019, 3:51am

All sorts of things handle Swift source, not just the compiler, and they are fully compliant with Unicode if they normalize the source in some way or another. Where Swift syntax plays by the rules of Unicode equivalence, all is well. Such is the case with "x̱̄" as Character now. But wherever Swift syntax relies on a particular scalar representation—effectively ignoring unicode equivalence—then the source code is volatile, because any tool which (rightly) assumes unicode equivalence may be unknowingly changing the semantics of the Swift source code, sometimes even breaking the syntax rules so it no longer compiles.

Are there Swift‐related tools already doing this sort of thing? Yes.

This forum performs NFC on copy‐and‐paste, and as demonstrated in the pitch thread here, can render valid code uncompilable.
While Git does not perform Unicode normalization on file contents yet, it does already normalize filenames to NFD on macOS and NFC elsewhere. (Since it already normalizes line endings and whitespace, it is not much of a stretch to imagine it in the future.)

Those are just the two I have already run into. As Unicode awareness continues to spread, the number is likely to grow.

Currently, only constructions which deliberately call out Unicode.Scalar are vulnerable to this, such as "é" as Unicode.Scalar. I do not even like that this problem is already reachable, but at least right now you have to ask for it with as Unicode.Scalar, and as stated in an earlier post, it could not be protected against anyway without also blocking "\u{E8}" as Unicode.Scalar.

As for future syntax, I am strongly against anything which increases the number of places in the language where these vulnerabilities can be encountered—especially where it can be encountered without some explicit reminder that “This is lower‐level Unicode; think about what you are doing.”

Once the originally proposed deprecations had come into effect and 'é' as Unicode.Scalar completely replaced "é" as Unicode.Scalar, the net effect would only have been that the danger zone had moved, not that it had increased. And I could live with that.

But if 'é' as Character—where it must actually be a scalar—is proposed to be indefinitely possible (deprecation is fine), then it would be a new place to run into errors. That new zone feels much worse because (1) there is no nearby reminder of Unicode, let alone Scalar; (2) someone can know enough to write that code without ever having heard of Unicode whatsoever; (3) there is a very simple, safer alternative available: "é" as Character.

I do not really care how the vulnerability is prevented, only that it is. Many new alternative designs have already been mentioned in this review, and it is not evident to me which direction this will go if it is returned for revision. Some of the proposed alternatives do not even have to deal with this issue in the first place, some have simple fixes, others could be solved with some work, and still others are basically irreparable in this respect. I have not thought through each to know which will have to tiptoe around the ABI to get it right and which will have easy solutions. That is why my last few posts were littered with the word “if”. I am only trying to say that whichever way development continues, I think this needs to be a design consideration.

jawbroken · March 13, 2019, 5:03am

I acknowledge the issues but I don't think find any of them particularly important (though the Discourse one would be mildly embarrassing because these are the official forums). Various text processing can and always has been able to destroy code (e.g. something that doesn't preserve newlines like HTML would join consecutive lines of Swift code together) and there is a whole language, Python, where the code is so whitespace sensitive that any form of whitespace processing/normalisation breaks code.

xwu · March 13, 2019, 5:11am

Swift does not give license for a tool to replace string literal contents with their canonical equivalents. This has not, and has never been, the case; it would break behavior in all sorts of observable ways, and it would do so for more languages than just Swift. That said, if a tool did cause such breakage, you would want the compiler to catch it as a syntax error rather than facing unexpected behavior at runtime. That is a feature, not a bug.

(As it happens, though, there are only about 1000 code points that change on NFC normalization, and of these only about 80 are normalized to more than one code point. In the rare circumstance that you are working with HEBREW LIGATURE YIDDISH YOD YOD PATAH, I am sure you'd appreciate knowing that it's been decomposed without your knowing.)

This forum regularly breaks Swift code on copy-and-paste.

If you have a tool that decomposes the character 'é' in your code, then 'é' as Character with a compile-time error is safer than "é" as Character, because in the first case you are warned of it immediately, and in the second case you get observable unexpected behavior.

michelf · March 13, 2019, 12:15pm

I just want to throw an aside related to this.

Edit: decided this should be a separate topic: String Comparison for Identifiers

SDGGiesbrecht · March 13, 2019, 6:25pm

Are we are talking about the same thing?

For either 'é' as Unicode.Scalar or "é" as Unicode.Scalar in NFD, a compile time error would be exactly what I want. So maybe we are arguing about something we agree on? (In NFC, it is fine. The extra effort to switch to Unicode.Scalar accepts responsibility for any Unicode mistakes.)

But I have been trying to talk about 'é' as Character vs "é" as Character, where by asking for a Character, the developer demonstrates that he does not care about its scalar representation (and may not even know it can have more than one). The second variant always succeeds and gives no unexpected behaviour (which is why your comment confuses me). The first variant may result in surprises in the future if NFD happens (if and only if 'x' syntax is defined as a scalar literal). Given that the two variants are virtually identical in intended functionality, I would prefer only the second, resilient variant be available insofar as it is possible to design that way.

Phrased in a completely different way: In a world where 'x' means a scalar literal, please make 'x' as Character just as discouraged as 'x' as String, because Character is much more like a string than it is like a scalar. Is that really all that weird of a request?