SE-0243: Codepoint and Character Literals

It could still be rejected by the compiler when it processes the coercion attempt: “Error: Coercion to a type conforming to ExpressibleByExtendedGraphemeClusterLiteral requires a double‐quoted String literal.”

Runtime usage of the protocol would be “fine”. If someone calls init(unicodeScalarLiteral: someScalarProducedAtRuntime), it is “okay”, because they can only get there with a valid Unicode.Scalar instance.

It is only at the pre‐compiler, literal‐in‐a‐source‐file level that the character is volatile and may not stay in a single‐scalar form. I would rather it be rejected immediately than left to become a sudden surprise down the road. "e" as Character is just as succinct anyway.

4 Likes

Yes, we could have the compiler reject it, but then the protocol conformance and protocol hierarchy would be a lie. Swift would affirm that String: ExpressibleByUnicodeScalarLiteral and then bark an error at you when you actually attempt to express a string by a Unicode scalar literal. You still haven't demonstrated where the harm lies that requires such a prohibition which would violate promised semantics.

All sorts of things handle Swift source, not just the compiler, and they are fully compliant with Unicode even if they normalize the source in some way or another. Where Swift syntax plays by the rules of Unicode equivalence, all is well. Such is the case with "x̱̄" as Character now. But wherever Swift syntax relies on a particular scalar representation—effectively ignoring Unicode equivalence—the source code is volatile, because any tool which (rightly) assumes Unicode equivalence may unknowingly change the semantics of the Swift source code, sometimes even breaking the syntax rules so that it no longer compiles.

Are there Swift‐related tools already doing this sort of thing? Yes.

Those are just the two I have already run into. As Unicode awareness continues to spread, the number is likely to grow.

Currently, only constructions which deliberately call out Unicode.Scalar are vulnerable to this, such as "é" as Unicode.Scalar. I do not even like that this problem is already reachable, but at least right now you have to ask for it with as Unicode.Scalar, and as stated in an earlier post, it could not be protected against anyway without also blocking "\u{E9}" as Unicode.Scalar.
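
To make the volatility concrete, here is a minimal sketch (assuming this file is currently stored in NFC):

let e1 = "é" as Unicode.Scalar       // compiles only while this é stays
                                     // the single scalar U+00E9
let e2 = "\u{E9}" as Unicode.Scalar  // immune: the escape survives any
                                     // normalization a tool applies
// If a tool rewrites the first line's é into NFD (e followed by
// U+0301), that literal no longer denotes a single scalar and the
// coercion stops compiling.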

As for future syntax, I am strongly against anything which increases the number of places in the language where these vulnerabilities can be encountered—especially where they can be encountered without some explicit reminder that “This is lower‐level Unicode; think about what you are doing.”

Once the originally proposed deprecations had come into effect and 'é' as Unicode.Scalar completely replaced "é" as Unicode.Scalar, the net effect would only have been that the danger zone had moved, not that it had increased. And I could live with that.

But if 'é' as Character—where it must actually be a scalar—is proposed to be indefinitely possible (deprecation is fine), then it would be a new place to run into errors. That new zone feels much worse because (1) there is no nearby reminder of Unicode, let alone Scalar; (2) someone can know enough to write that code without ever having heard of Unicode whatsoever; (3) there is a very simple, safer alternative available: "é" as Character.

I do not really care how the vulnerability is prevented, only that it is. Many new alternative designs have already been mentioned in this review, and it is not evident to me which direction this will go if it is returned for revision. Some of the proposed alternatives do not even have to deal with this issue in the first place, some have simple fixes, others could be solved with some work, and still others are basically irreparable in this respect. I have not thought through each to know which will have to tiptoe around the ABI to get it right and which will have easy solutions. That is why my last few posts were littered with the word “if”. I am only trying to say that whichever way development continues, I think this needs to be a design consideration.

2 Likes

I acknowledge the issues but I don't find any of them particularly important (though the Discourse one would be mildly embarrassing, because these are the official forums). Various kinds of text processing can, and always have been able to, destroy code (e.g. something that doesn't preserve newlines, like HTML, would join consecutive lines of Swift code together), and there is a whole language, Python, whose code is so whitespace-sensitive that any form of whitespace processing/normalisation breaks it.

Swift does not give license for a tool to replace string literal contents with their canonical equivalents. This is not, and has never been, the case; it would break behavior in all sorts of observable ways, and it would do so for more languages than just Swift. That said, if a tool did cause such breakage, you would want the compiler to catch it as a syntax error rather than face unexpected behavior at runtime. That is a feature, not a bug.

(As it happens, though, there are only about 1000 code points that change under NFC normalization, and of these only about 80 normalize to more than one code point. In the rare circumstance that you are working with HEBREW LIGATURE YIDDISH YOD YOD PATAH, I am sure you'd appreciate knowing that it's been decomposed without your knowledge.)
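
A rough way to check that claim with Foundation, whose precomposedStringWithCanonicalMapping property applies NFC (U+FB1F is composition-excluded, which makes it one of those rare multi-scalar cases):

import Foundation

let ligature = "\u{FB1F}" // HEBREW LIGATURE YIDDISH YOD YOD PATAH
let nfc = ligature.precomposedStringWithCanonicalMapping
print(nfc.unicodeScalars.map { String($0.value, radix: 16) })
// expected: ["5f2", "5b7"] (yiddish double yod, then patah)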

This forum regularly breaks Swift code on copy-and-paste.

If you have a tool that decomposes the character 'é' in your code, then 'é' as Character with a compile-time error is safer than "é" as Character, because in the first case you are warned of it immediately, and in the second case you get observable unexpected behavior.
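
A sketch of that second case (assuming the é below is currently one scalar):

// If an outside tool rewrites the é in this literal into NFD
// (e followed by U+0301), the line still compiles, but its value
// silently changes:
let c = "é" as Character
print(c.unicodeScalars.count) // 1 today; 2 after such a rewrite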

1 Like

I just want to throw an aside related to this.

Edit: decided this should be a separate topic: String Comparison for Identifiers

Are we talking about the same thing?

For either 'é' as Unicode.Scalar or "é" as Unicode.Scalar in NFD, a compile time error would be exactly what I want. So maybe we are arguing about something we agree on? (In NFC, it is fine. The extra effort to switch to Unicode.Scalar accepts responsibility for any Unicode mistakes.)

But I have been trying to talk about 'é' as Character vs "é" as Character, where by asking for a Character, the developer demonstrates that he does not care about its scalar representation (and may not even know it can have more than one). The second variant always succeeds and gives no unexpected behaviour (which is why your comment confuses me). The first variant may result in surprises in the future if NFD happens (if and only if 'x' syntax is defined as a scalar literal). Given that the two variants are virtually identical in intended functionality, I would prefer only the second, resilient variant be available insofar as it is possible to design that way.
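
To illustrate with current Swift: both spellings below are the same Character, because Character equality honors canonical equivalence, even though they contain different numbers of scalars:

let nfc: Character = "\u{E9}"   // é as a single scalar
let nfd: Character = "e\u{301}" // e plus combining acute accent
assert(nfc == nfd)              // equal under canonical equivalence
assert(nfc.unicodeScalars.count == 1)
assert(nfd.unicodeScalars.count == 2)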

Phrased in a completely different way: In a world where 'x' means a scalar literal, please make 'x' as Character just as discouraged as 'x' as String, because Character is much more like a string than it is like a scalar. Is that really all that weird of a request?

2 Likes

Apparently I have been repeatedly incorrect when I said scalar literals were the only place Swift syntax was not Unicode compliant. @michelf is right, we have much bigger fish to fry. Sorry for the misinformation in that regard.

They would be just as discouraged—in that neither would be discouraged.

Just because a developer is asking for a Character does not mean that they do not care about the number of Unicode scalars that make up that character. They may want to compare the input to another Character taking into account canonical equivalence, but they could be working with input and output that expects a single Unicode scalar while doing so.
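
As a sketch of that situation, code can accept a Character for comparison purposes while still checking, at run time, that it occupies a single scalar (the helper name here is hypothetical):

func singleScalarValue(of character: Character) -> UInt32? {
    // Pass through only Characters made of exactly one scalar:
    let scalars = character.unicodeScalars
    guard scalars.count == 1, let scalar = scalars.first else { return nil }
    return scalar.value
}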

A literal value should never be normalized by tooling. That goes against what a “literal” value is. How would you feel if a tool converted all your integer literals to hexadecimal notation?

1 Like

Also, it is recommended to use the "\u{...}" notation if you're doing things like this.

Alternatively (not sure it's a good idea), literals that vary under NFC could trigger a compilation error and a fix-it that used "\u{...}" notation.

6 Likes

It's a fantastic linter rule, I'd say.

Not sure how I'd feel with it as a compiler warning or error. For one thing, it'd disproportionately affect certain scripts and could conceivably render some of them unreadable, which would be a rather unforgivable sin for a literal notation.

1 Like

Just as a String is a vector of Characters, a Character is a vector of code points. If you mean to compare (ASCII) code points to bytes within binary data, then wouldn't Unicode.Scalar be the correct abstraction? A scalar value always maps to exactly one integer, while a character may be a vector of scalars.

For a text file mangling your representation, wouldn't scalars be better? The "é" abstract character may have two Character representations: U+00E9 as a single scalar, or U+0065 & U+0301 as a primary and secondary code-point pair. If the characters within single quotes must always be Unicode scalars, then only one interpretation is allowed in the object code for "é", no matter which way it's stored in the source code file. It does mean that the compiler must have ICU or an equivalent to find valid recombinations that can resolve to a single scalar. If we allow "\u{}" Unicode escapes within single quotes, we can mandate that they always use the single-scalar version and never a decomposed form. (In other words, recomposition is allowed during translation from the source file's encoding to object code, and never from the user deliberately splitting a single-scalar character into an official decomposed form.)

1 Like

This isn’t quite true. Characters can contain other Characters, so it’s not a neat 3-level hierarchy. It’s probably better to think of Character boundaries as maximal, context-dependent intervals calculated on a String object as a whole, and an individual Character object as a very short String whose largest interval (among many shorter choices) extends across its entire length.

Also, Unicode.Scalar isn’t entirely the right abstraction when comparing with UInt8s, since Unicode.Scalar can and will assume every 8-bit character it’s compared against is encoded in the Latin-1 encoding, and compared with 7-bit encodings where ASCII is queen, there are just too many alternative 8-bit character encodings for me to be comfortable here.
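
A small sketch of that Latin-1 assumption (the byte value is arbitrary, and the comparison assumes this file is stored in NFC):

let byte: UInt8 = 0xE9
let scalar = Unicode.Scalar(byte) // always succeeds; the byte is read
                                  // as Latin-1, where 0xE9 is é
assert(scalar == "é")             // true only if the byte really was
                                  // Latin-1 text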

1 Like

It is difficult to see why single-quoted literals should be presumed to default to Character, as no language offers such a syntax.

Here's how popular programming languages make use of single quotation marks:

String

  • Delphi/Object Pascal
  • JavaScript
  • MATLAB (char array)
  • Python
  • R
  • SQL

'Raw' string

  • Groovy
  • Perl
  • PHP
  • Ruby

Code unit/code point/Unicode scalar

  • C: int
  • C++: char (if literal is prefixed, it can be char8_t, char16_t, char32_t, or wchar_t)
  • C#: char (16-bit)
  • Java: char (16-bit)
  • Kotlin: Char (16-bit)
  • Go: rune (32-bit)
  • Rust: char (32-bit)

In Go, a Unicode code point is known as a rune (a term now also adopted in .NET). In Rust, a Unicode scalar value is known as a character; in Swift, it is known as a Unicode scalar. (A Unicode scalar value is any Unicode code point except high- and low-surrogate code points.)

As can be seen, Go and Rust use single quotation marks for what in Swift is known as a Unicode scalar literal.

No language uses this notation for what in Swift is known as an extended grapheme cluster literal (i.e., character literal).

The version of Unicode supported, and therefore grapheme breaking, is a runtime concept. In other words, it is the version of the standard library linked at run time that determines whether a string's contents are one extended grapheme cluster (i.e., Character) or not.

Adding syntax to distinguish between a single character and a string that may contain zero or more such characters will enable only best-effort diagnostics at compile time. In other words, a dedicated extended grapheme cluster literal syntax can provide users no guarantees about grapheme breaking as it relates to the contents of the literal, because such knowledge cannot be "baked in" statically into the code.
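
For example, the scalar count below is fixed by the literal's contents, but the Character count depends on the grapheme-breaking rules of whatever standard library the program links against:

let flag = "🇺🇸" // two regional-indicator scalars
print(flag.unicodeScalars.count) // 2, knowable at compile time
print(flag.count)                // 1, decided at run time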

6 Likes

I think so far there have been five serious alternatives if this does get returned for revision, so I figured it's worth summarizing the pros, cons, and implications of each so we can settle on a design moving forward.

1. 'a'.ascii, callable member

let codepoint: Unicode.Scalar = 'a'
return codepoint.ascii

Single-quoted literals default to: Unicode.Scalar
Implementation difficulty: Easy
Compile-time validation? No

Summary:
The Unicode.Scalar type will get an .ascii computed property, which provides its value with the trapping precondition that value < 0x80.
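
A minimal sketch of that property, which is implementable in today's Swift (only the single-quoted literal syntax would be new):

extension Unicode.Scalar {
    // The scalar's value as an ASCII byte, trapping when the scalar
    // falls outside the 7-bit ASCII range.
    var ascii: UInt8 {
        precondition(value < 0x80, "scalar is not ASCII")
        return UInt8(value)
    }
}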

Pros:

  • Readable, concise, and clearly indicates encoding used.
  • Has high discoverability as a callable member.
  • No new compiler or language features needed.
  • No new syntax or semantics.

Cons:

  • Character literals will continue to require type context.
  • Character literals either cannot be expressed with single quotes, or would result in ambiguous expressions like 'é' as Character.
  • Impossible to provide compile-time validation guarantees. (The best we can do is a warning heuristic.)
  • Member .ascii would be available on all Unicode.Scalar values, including run-time values (foo.ascii), which doesn’t seem appropriate from an API standpoint.
  • Exposes users to run-time trapping.
  • Privileges ASCII subset of Unicode.Scalar.
  • Overloads on return type.
  • Strongly ABI-coupled.

2. 'a'.ascii, “literal-bound” member

return 'a'.ascii

Single-quoted literals default to: Character
Implementation difficulty: Hard
Compile-time validation? Yes

Summary:
Swift will support a new method attribute @literalself, essentially a more restrictive version of @constexpression on self. The Character type will get an .ascii computed property which is @literalself, and provides its ASCII value subject to the compile-time condition that it consists of a single codepoint within the ASCII range. Note that this would still be vulnerable to '\r\n' folding.

Pros:

  • Readable, concise, and clearly indicates encoding used.
  • Provides compile-time validation guarantee.
  • Decoupled from ABI.

Cons:

  • Extremely magical, could be considered an abuse of dot . notation.
  • Effectively introduces entire new kind of instance method to the language, depends on @constexpression to generalize into a language feature.
  • Very low discoverability.
  • Privileges ASCII subset of Unicode.Scalar.
  • Overloads on return type.

3. 'a' as UInt8

return 'a' as UInt8

Single-quoted literals default to: Character
Implementation difficulty: Hard
Compile-time validation? Yes

Summary:
Swift will introduce the concept of non-expressible literal coercions, which would allow “opt-in” literal coercions through the use of the as operator. (Note that this is not an overload on the as operator, it merely makes this operator mandatory if requested.) Contrast with Swift’s existing expressible literal coercions, which are “opt-out”, and make the as operator optional. All FixedWidthInteger types would receive a non-expressible literal conformance to unicode scalar literals. This is essentially identical to the proposal as written, except it requires an explicit as (U)Int8 everywhere a codepoint literal→ASCII coercion takes place.

Pros:

  • Readable (though not as concise).
  • Makes it obvious that a literal coercion is taking place.
  • Provides compile-time validation guarantee.
  • Decoupled from ABI.

Cons:

  • Does not indicate ASCII as the specific encoding used.
  • Effectively adds a new feature to the literals system (see this post), depends on @constexpression to generalize into a language feature.

4. a'a'

return a'a'

Single-quoted literals default to: Character (u'a' defaults to Unicode.Scalar)
Implementation difficulty: Medium
Compile-time validation? Yes

Summary:
Single-quoted literals will be subdivided into multiple prefixed literal sorts. Unprefixed tokens will be parsed as character literals, u-prefixed tokens will be parsed as unicode scalar literals, and a-prefixed tokens will be parsed as integer literals, constrained to the ASCII-range.

Pros:

  • Readable, highly concise, indicates encoding used.
  • Very few compiler modifications needed, no new language features needed.
  • Provides compile-time validation guarantee.
  • Decoupled from ABI.
  • Easily extensible to provide unambiguous syntaxes for Unicode.Scalar (u'a') and Character ('a') literals, as well as alternative character encodings.
  • No new semantics.

Cons:

  • Introduces new syntax to the language. (As opposed to 2 and 3, which only introduce new semantics.)
  • Users need to remember single-character abbreviations for each prefix (“a for ascii”, “u for unicode scalar”, etc).
  • Low discoverability.

5. Full ASCII struct

return ('a' as ASCII).value

Single-quoted literals default to: Character
Implementation difficulty: Medium
Compile-time validation? Yes

Summary:
The standard library will gain a full 7-bit ASCII type which is expressible by unicode scalar literals. Compile-time validation will be performed in the compiler in a semi-magical fashion, just like the current proposal as written.

Pros:

  • The most conservative and strongly-typed design.
  • Few compiler modifications needed, no new language features needed.
  • Provides compile-time validation guarantee.
  • No new syntax or semantics.
  • Mid–high discoverability.

Cons:

  • Limited utility. (Useful for generating outputs, but useless for processing input bytestrings.)
  • May encourage users to bind raw buffers to this type, which is incorrect. (An arbitrary (U)Int8 cannot be safely reinterpreted as 7-bit ASCII value.)
  • Member .value effectively overloads on return type.
  • Strongly ABI-coupled.

5 Likes

In the languages you cite that use a character literal, only Rust and Go don't use outdated representations of characters. And in both of these languages, the single-quote literal represents a character, that is, an element of a string, because they consider code points to be string elements, unlike Swift, which uses extended grapheme clusters.

The primary and core use case of the ' ' literal is to inspect strings.

In Swift, that means it should be able to represent all Character values; otherwise it is too limited for that task.

1 Like

I'm not sure if this was being somewhat disingenuous, because I know you're well aware of the particular focus on Unicode-correctness for strings in Swift, but I'll take it at face value. It would be great if you would follow up on the category of “Code unit/code point/Unicode scalar” by specifying what the default “atom” of a string is in these languages, e.g. something like what you get when you index into a string, or what the string's length is calculated in terms of. A quick skim and spot check of a couple of them didn't reveal anything similar to Swift in this respect. Languages which are less interested in Unicode-correct strings are of course going to have a different idea of what this “atom” or “character” is, and that is generally reflected in their “character” syntax, so I don't find this survey very relevant here.

Edit: And of course, this delusion about the obvious default type for single quoted literals isn't unique to me and @RMJay:

This is a logical fallacy. The primary and core use case of single-quoted literals depends on how we design it.

It is important to emphasize that there isn't one thing that is an "atom" of a string. All Unicode-conscious languages start with that caveat in their documentation.

There is no reason whatsoever to tie single-quoted literals to the language's choice of element when indexing (which is also not necessarily a language's choice of element when iterating). Indeed, as I wrote above, because Swift chooses to make the extended grapheme cluster the element type, it is not possible to have compile-time validation of such syntax if we chose to do so.

In Go, indexing into a string yields its bytes, and a string is an arbitrary sequence of UTF-8 bytes; its "length" is the length in bytes. As a special exception, iteration occurs over a string's Unicode code points, or runes.

In Rust, indexing into a string slice (str) yields its bytes, and a string slice is an arbitrary sequence of UTF-8 bytes; its "length" is the length in bytes. It is not possible to iterate over a string slice; one must explicitly ask for its UTF-8 byte view or Unicode scalar view, and the documentation additionally notes that the user may actually want to iterate over extended grapheme clusters instead.

Swift strings can be encoded in UTF-8 or UTF-16, so they cannot be designed as in Rust or Go. In Swift 3, as in Rust today, it is not possible to iterate over String. To improve ergonomics, it was later decided to model Swift strings as a BidirectionalCollection of Characters despite violating some of the semantic guarantees of that protocol.
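
Concretely, a sketch of the resulting model:

let s = "café" // é stored as a single scalar in NFC source
print(s.count)                // 4 Characters (grapheme clusters)
print(s.unicodeScalars.count) // 4 scalars (5 if a tool decomposed é)
print(s.utf8.count)           // 5 UTF-8 code units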

What this survey shows is that these modern languages do not tie their single-quote literal syntax with the "atom" of their string type, and in fact they divorce iteration over a string from the "atom" of their string type as well.

The Unicode scalar type, as has been mentioned, has gained some fantastic and useful APIs in Swift but lags in ergonomics due to the difficulty of expressing a literal of that type. Even though .NET strings are sequences of UTF-16 code units, they have recently adopted a new Unicode scalar type (named Rune) to improve Unicode string handling ergonomics.
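
For instance, the scalar properties API added by SE-0211 is easy to reach only once you have spelled the scalar, which today requires explicit type context:

let a = "a" as Unicode.Scalar // type context needed to get a scalar
print(a.properties.isAlphabetic)      // true
print(a.properties.name ?? "unknown") // "LATIN SMALL LETTER A"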

5 Likes

This sounds like an argument for removing the ExpressibleByExtendedGraphemeClusterLiteral protocol entirely, because the compiler cannot validate that the contents of a literal will in fact contain exactly one extended grapheme cluster at runtime.

Since we are obviously not going to do that, and we already have dedicated syntax for specifying a Character literal—"x" as Character—the line of reasoning you describe here is inapplicable.

3 Likes

I'm not sure why that is "obvious." We can deprecate that protocol, and in fact that might be a good thing to do unless there is a clear use case for it currently.

That said, the protocol itself is fine as it makes no guarantees about compile-time validation, and the syntax for an extended grapheme cluster literal is identical to that of a string literal. "x" as Character and 123 as Character are both syntactically well-formed, and neither is a "dedicated" syntax for a character literal.

1 Like