SE-0243: Codepoint and Character Literals

Also, it is recommended to use the "\u{...}" notation if you're doing things like this.

Alternatively (not sure it's a good idea), literals that vary under NFC could trigger a compilation error and a fix-it that used "\u{...}" notation.

6 Likes

It's a fantastic linter rule, I'd say.

Not sure how I'd feel with it as a compiler warning or error. For one thing, it'd disproportionately affect certain scripts and could conceivably render some of them unreadable, which would be a rather unforgivable sin for a literal notation.

1 Like

Just as a String is a vector of Characters, a Character is a vector of code points. If you mean to compare (ASCII) code points to bytes within binary data, then wouldn't Unicode.Scalar be the correct abstraction? A scalar value always maps to exactly one integer, while a character may be a vector of scalars.

As for a text file mangling your representation, wouldn't scalars be better? The "é" abstract character has two possible representations: U+00E9 as a single scalar, or U+0065 & U+0301 as a primary and secondary code-point pair. If the characters within single quotes must always be Unicode scalars, then only one interpretation is allowed in the object code for "é", no matter which way it's stored in the source code file. It does mean that the compiler must have ICU or an equivalent to find valid recombinations that can resolve to a single scalar. If we allow "\u{}" Unicode escapes within single quotes, we can mandate that they always use the single-scalar form and never a decomposed one. (In other words, recomposition is allowed when translating from the source file's encoding to object code, and never from the user deliberately splitting a single-scalar character into an official decomposed form.)
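For concreteness, here's a quick snippet (in today's double-quote syntax) showing the two spellings; they compare equal as Strings, but only one of them is a single scalar:

let precomposed = "\u{E9}"     // a single scalar, U+00E9
let decomposed  = "e\u{301}"   // U+0065 followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)          // true — canonically equivalent
print(precomposed.unicodeScalars.count)   // 1
print(decomposed.unicodeScalars.count)    // 2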

1 Like

This isn’t quite true. Characters can contain other Characters, so it’s not a neat 3-level hierarchy. It’s probably better to think of Character boundaries as maximal, context-dependent intervals calculated on a String object as a whole, and an individual Character object as a very short String whose largest interval (among many shorter choices) extends across its entire length.
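As a quick illustration of that context dependence (regional-indicator scalars pair up into flag Characters depending on what precedes them):

let us: Character = "\u{1F1FA}\u{1F1F8}"              // 🇺🇸 — one Character made of two scalars
print(us.unicodeScalars.count)                        // 2

let flags = "\u{1F1FA}\u{1F1F8}\u{1F1EB}\u{1F1F7}"    // 🇺🇸🇫🇷
print(flags.count)                                    // 2 Characters from four scalars, paired by position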

Also, Unicode.Scalar isn’t entirely the right abstraction when comparing with UInt8s, since Unicode.Scalar can and will assume every 8-bit character it’s compared against is encoded in the Latin-1 encoding, and compared with 7-bit encodings where ASCII is queen, there are just too many alternative 8-bit character encodings for me to be comfortable here.
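For reference, the standard library already leans this way: Unicode.Scalar's UInt8 initializer treats the byte as Latin-1, since U+0000...U+00FF coincide with that encoding. A byte from any other 8-bit encoding silently gets the wrong meaning:

let byte: UInt8 = 0xE9
let scalar = Unicode.Scalar(byte)   // U+00E9, i.e. "é" under Latin-1
print(scalar)                       // é
// Under Windows-1251 the same byte 0xE9 means "й" — a mismatch nothing here will catch.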

1 Like

It is difficult to see why single-quoted literals should be presumed to default to Character, as no language offers such a syntax.

Here's how popular programming languages make use of single quotation marks:

String

  • Delphi/Object Pascal
  • JavaScript
  • MATLAB (char array)
  • Python
  • R
  • SQL

'Raw' string

  • Groovy
  • Perl
  • PHP
  • Ruby

Code unit/code point/Unicode scalar

  • C: int
  • C++: char (if literal is prefixed, it can be char8_t, char16_t, char32_t, or wchar_t)
  • C#: char (16-bit)
  • Java: char (16-bit)
  • Kotlin: Char (16-bit)
  • Go: rune (32-bit)
  • Rust: char (32-bit)

In Go, a Unicode code point is known as a rune (a term now also adopted in .NET). In Rust, a Unicode scalar value is known as a character; in Swift, it is known as a Unicode scalar. (A Unicode scalar value is any Unicode code point except high- and low-surrogate code points.)

As can be seen, Go and Rust use single quotation marks for what in Swift is known as a Unicode scalar literal.

No language uses this notation for what in Swift is known as an extended grapheme cluster literal (i.e., character literal).

The version of Unicode supported, and therefore grapheme breaking, is a runtime concept. In other words, it is the version of the standard library linked at run time that determines whether a string's contents are one extended grapheme cluster (i.e., Character) or not.

Adding syntax to distinguish between a single character and a string that may contain zero or more such characters will enable only best-effort diagnostics at compile time. In other words, a dedicated extended grapheme cluster literal syntax can provide users no guarantees about grapheme breaking as it relates to the contents of the literal, because such knowledge cannot be "baked in" statically into the code.

6 Likes

I think so far there have been five serious alternatives if this does get returned for revision, so I figured it’s worth summarizing the pros, cons, and implications of each so we can settle on a design moving forward.

1. 'a'.ascii, callable member

let codepoint:Unicode.Scalar = 'a'
return codepoint.ascii

Single quoted literals default to: Unicode.Scalar
Implementation difficulty: Easy
Compile-time validation? No

Summary:
The Unicode.Scalar type will get an .ascii computed property, which provides its value with the trapping precondition that value < 0x80.
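A rough sketch of the member being described (using today's double-quote spelling, since single-quoted Unicode.Scalar literals don't exist yet):

extension Unicode.Scalar {
    /// The scalar's ASCII value; traps if the scalar is outside the ASCII range.
    var ascii: UInt8 {
        precondition(value < 0x80, "scalar is not ASCII")
        return UInt8(value)
    }
}

let slash: Unicode.Scalar = "/"   // with this alternative: '/'
let byte = slash.ascii            // 47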

Pros:

  • Readable, concise, and clearly indicates encoding used.
  • Has high discoverability as a callable member.
  • No new compiler or language features needed.
  • No new syntax or semantics.

Cons:

  • Character literals will continue to require type context.
  • Character literals either cannot be expressed with single quotes, or would result in ambiguous expressions like 'é' as Character.
  • Impossible to provide compile-time validation guarantees. (The best we can do is a warning heuristic.)
  • Member .ascii would be available on all Unicode.Scalar values, including run-time values (foo.ascii), which doesn’t seem appropriate from an API standpoint.
  • Exposes users to run-time trapping.
  • Privileges ASCII subset of Unicode.Scalar.
  • Overloads on return type.
  • Strongly ABI-coupled.

2. 'a'.ascii, “literal-bound” member

return 'a'.ascii

Single quoted literals default to: Character
Implementation difficulty: Hard
Compile-time validation? Yes

Summary:
Swift will support a new method attribute @literalself, essentially a more restrictive version of @constexpression on self. The Character type will get an .ascii computed property which is @literalself, and provides its ASCII value subject to the compile-time condition that it consists of a single codepoint within the ASCII range. Note that this would still be vulnerable to '\r\n' folding.

Pros:

  • Readable, concise, and clearly indicates encoding used.
  • Provides compile-time validation guarantee.
  • Decoupled from ABI.

Cons:

  • Extremely magical, could be considered an abuse of dot . notation.
  • Effectively introduces entire new kind of instance method to the language, depends on @constexpression to generalize into a language feature.
  • Very low discoverability.
  • Privileges ASCII subset of Unicode.Scalar.
  • Overloads on return type.

3. 'a' as UInt8

return 'a' as UInt8

Single quoted literals default to: Character
Implementation difficulty: Hard
Compile-time validation? Yes

Summary:
Swift will introduce the concept of non-expressible literal coercions, which would allow “opt-in” literal coercions through the use of the as operator. (Note that this is not an overload on the as operator, it merely makes this operator mandatory if requested.) Contrast with Swift’s existing expressible literal coercions, which are “opt-out”, and make the as operator optional. All FixedWidthInteger types would receive a non-expressible literal conformance to unicode scalar literals. This is essentially identical to the proposal as written, except it requires an explicit as (U)Int8 everywhere a codepoint literal→ASCII coercion takes place.

Pros:

  • Readable (though not as concise).
  • Makes it obvious that a literal coercion is taking place.
  • Provides compile-time validation guarantee.
  • Decoupled from ABI.

Cons:

  • Does not indicate ASCII as the specific encoding used.
  • Effectively adds a new feature to the literals system (see this post), depends on @constexpression to generalize into a language feature.

4. a'a'

return a'a'

Single quoted literals default to: Character (u'a' defaults to Unicode.Scalar)
Implementation difficulty: Medium
Compile-time validation? Yes

Summary:
Single-quoted literals will be subdivided into multiple prefixed literal sorts. Unprefixed tokens will be parsed as character literals, u-prefixed tokens will be parsed as unicode scalar literals, and a-prefixed tokens will be parsed as integer literals, constrained to the ASCII range.

Pros:

  • Readable, highly concise, indicates encoding used.
  • Very few compiler modifications needed, no new language features needed.
  • Provides compile-time validation guarantee.
  • Decoupled from ABI.
  • Easily extensible to provide unambiguous syntaxes for Unicode.Scalar (u'a') and Character ('a') literals, as well as alternative character encodings.
  • No new semantics.

Cons:

  • Introduces new syntax to the language. (As opposed to 2 and 3, which only introduce new semantics.)
  • Users need to remember single-character abbreviations for each prefix (“a for ascii”, “u for unicode scalar”, etc).
  • Low discoverability.

5. Full ASCII struct

return ('a' as ASCII).value

Single quoted literals default to: Character
Implementation difficulty: Medium
Compile-time validation? Yes

Summary:
The standard library will gain a full 7-bit ASCII type which is expressible by unicode scalar literals. Compile-time validation will be performed in the compiler in a semi-magical fashion, just like the current proposal as written.
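A rough sketch of the shape such a type might take (runtime-trapping only; the compile-time validation described above would be compiler magic layered on top):

struct ASCII: ExpressibleByUnicodeScalarLiteral {
    var value: UInt8
    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        precondition(scalar.isASCII, "not an ASCII scalar")
        self.value = UInt8(scalar.value)
    }
}

let slash: ASCII = "/"   // with this alternative: '/' as ASCII
let byte = slash.value   // 47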

Pros:

  • The most conservative and strongly-typed design.
  • Few compiler modifications needed, no new language features needed.
  • Provides compile-time validation guarantee.
  • No new syntax or semantics.
  • Mid–high discoverability.

Cons:

  • Limited utility. (Useful for generating outputs, but useless for processing input bytestrings.)
  • May encourage users to bind raw buffers to this type, which is incorrect. (An arbitrary (U)Int8 cannot be safely reinterpreted as a 7-bit ASCII value.)
  • Member .value effectively overloads on return type.
  • Strongly ABI-coupled.

5 Likes

In the languages you cite that use a character literal, only Rust and Go don't use outdated representations of characters. And in both of these languages, the single-quote literal represents a character, that is, an element of a string, because they consider code points to be string elements, unlike Swift, which uses extended grapheme clusters.

The primary and core use case of the ' ' literal is to inspect strings.

In Swift, that means it should be able to represent all Character values; otherwise, it is too limited for that task.

1 Like

I'm not sure if this was being somewhat disingenuous, because I know you're well aware of the particular focus on Unicode-correctness for strings in Swift, but I'll take it at face value. It would be great if you would follow up on the category of “Code unit/code point/Unicode scalar” by specifying what the default “atom” of a string is in these languages, e.g. something like what you get when you index into a string, or what the string's length is calculated in terms of. A quick skim and spot check of a couple of them didn't reveal anything similar to Swift in this respect. Languages which are less interested in Unicode-correct strings are of course going to have a different idea of what this “atom” or “character” is, and that is generally reflected in their “character” syntax, so I don't find this survey very relevant here.

Edit: And of course, this delusion about the obvious default type for single quoted literals isn't unique to me and @RMJay:

This is a logical fallacy. The primary and core use case of single-quoted literals depends on how we design it.

It is important to emphasize that there isn't one thing that is an "atom" of a string. All Unicode-conscious languages start with that caveat in their documentation.

There is no reason whatsoever to tie single-quoted literals to the language's choice of element when indexing (which is also not necessarily a language's choice of element when iterating). Indeed, as I wrote above, because Swift chooses to make the extended grapheme cluster the element type, it is not possible to have compile-time validation of such syntax if we chose to do so.

In Go, indexing into a string yields its bytes, and a string is an arbitrary sequence of UTF-8 bytes; its "length" is the length in bytes. As a special exception, iteration occurs over a string's Unicode code points, or runes.

In Rust, indexing into a string slice (str) yields its bytes, and a string slice is an arbitrary sequence of UTF-8 bytes; its "length" is the length in bytes. It is not possible to iterate over a string slice; one must explicitly ask for its UTF-8 byte view or Unicode scalar view, and the documentation additionally notes that the user may actually want to iterate over extended grapheme clusters instead.

Swift strings can be encoded in UTF-8 or UTF-16, so they cannot be designed as in Rust or Go. In Swift 3, as in Rust today, it is not possible to iterate over String. To improve ergonomics, it was later decided to model Swift strings as a BidirectionalCollection of Characters despite violating some of the semantic guarantees of that protocol.

What this survey shows is that these modern languages do not tie their single-quote literal syntax with the "atom" of their string type, and in fact they divorce iteration over a string from the "atom" of their string type as well.

The Unicode scalar type, as has been mentioned, has gained some fantastic and useful APIs in Swift but lags in ergonomics due to the difficulty of expressing a literal of that type. Even though .NET strings are sequences of UTF-16 code units, they have recently adopted a new Unicode scalar type (named Rune) to improve Unicode string handling ergonomics.

5 Likes

This sounds like an argument for removing the ExpressibleByExtendedGraphemeClusterLiteral protocol entirely, because the compiler cannot validate that the contents of a literal will in fact contain exactly one extended grapheme cluster at runtime.

Since we are obviously not going to do that, and we already have dedicated syntax for specifying a Character literal—"x" as Character—the line of reasoning you describe here is inapplicable.

3 Likes

I'm not sure why that is "obvious." We can deprecate that protocol, and in fact that might be a good thing to do unless there is a clear use case for it currently.

That said, the protocol itself is fine as it makes no guarantees about compile-time validation, and the syntax for an extended grapheme cluster literal is identical to that of a string literal. "x" as Character and 123 as Character are both syntactically well-formed, and neither is a "dedicated" syntax for a character literal.

1 Like

I think that is basically a given at this point. There are a handful of individuals here who will be able to follow the entire discussion and all of the intricate Unicode details, but it's not reasonable to expect the majority of the community to do that.

One of the things I find quite difficult about this discussion is that we seem to have lost track of the original problem this thing was supposed to solve. As I understand it, we want to be able to match ASCII sequences (like IHDR) in byte sequences, right?

And the reason we don't care about non-ASCII sequences is because their byte representations are not obvious in the face of normalisation and combining characters and whatnot.

Why don't we just stick to the actual problem instead of getting bogged down in syntax and integer conversions?

7 Likes

ExpressibleByExtendedGraphemeClusterLiteral has always been an oddity, not least because ExtendedGraphemeClusterLiteralType (and UnicodeScalarLiteralType) are currently unreachable.

If/when we move to a static literal model, we will not have “unicode scalar literals” or “character literals” or “string literals”; we will just have @stringLiterals and @stringElementLiterals, both of which would be represented by [Unicode.Scalar].

Surely we can’t avoid working on an area of the language just because many community members lack the background expertise to understand the problem? We don’t declare all of FloatingPoint a no-go zone just because Steve is the only person here who understands floats.

The problem is we don’t have a way to express integer values with textual semantics, with an appropriate textual literal syntax, that doesn’t cause additional issues in the rest of the language (i.e., x.isMultiple(of: 'a')). Or more broadly, we don’t have a “safe” and “readable” way to process and generate ASCII bytestrings. I don’t think anyone has lost track of that.

IHDR is just a concrete example of something that is very difficult to safely and efficiently express with existing language tools. If you want a sampling of “pain points”, I would say that any proposed solution must address the following in a safe manner:

// storing a bytestring value 
static 
var liga:(Int8, Int8, Int8, Int8) 
{
    return (108, 105, 103, 97) // ('l', 'i', 'g', 'a')
}
// storing an ASCII scalar to mixed utf8-ASCII text
var xml:[UInt8] = ...
xml.append(47) // '/'
xml.append(62) // '>'
// ASCII range operations 
let current:UnsafePointer<Int8> = ...
if 97 ... 122 ~= current.pointee // 'a' ... 'z'
{
    ...
}
// ASCII arithmetic operations 
let year:ArraySlice<Int8> = ...
var value:Int = 0
for digit:Int8 in year 
{
    guard 48 ... 57 ~= digit // '0' ... '9'
    else 
    {
        ...
    }

    value = value * 10 + .init(digit - 48) // digit - '0'
}
// reading an ASCII scalar from mixed utf8-ASCII text 
let xml:[Int8] = ... 
if let i:Int = xml.firstIndex(of: 60) // '<'
{
    ...
}
// matching ASCII signatures 
let c:UnsafePointer<UInt8> = ...
if (c[0], c[1], c[2], c[3]) == (80, 76, 84, 69) // ('P', 'L', 'T', 'E')
{
    ...
}

There is no reason whatsoever to introduce new literal syntax for this; String’s UTF-8 view already provides a succinct and highly efficient way to represent ASCII byte sequences.

let needle = “PNG89a”.utf8
// needle is a sequence of bytes corresponding to the 
// UTF-8 encoding of “PNG89a”. For ASCII strings,
// this is exactly the same as their 7-bit ASCII encoding 
// zero-extended to 8-bit bytes.

If the standard library doesn’t provide convenient enough APIs to match such byte sequences, then that can and should be remedied by introducing new APIs in stdlib. Inventing new syntax won’t help.

(The fact that this also works for non-ASCII characters still seems like a great feature to me. UTF-8 is the new ASCII.)

We do not have a similarly succinct syntax to express individual bytes between 0 and 127 by their corresponding ASCII character. If this is an important use case, then Unicode scalar literal syntax would give us that by allowing ’a’.ascii. (Character is on the wrong abstraction level for this; its asciiValue property is broken.)
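To see why, with today's APIs (the CR-LF folding is the documented behavior of asciiValue):

let crlf: Character = "\r\n"                 // one Character made of two scalars
print(crlf.asciiValue as Any)                // Optional(10) — CR-LF is folded to LF
print(("é" as Character).asciiValue as Any)  // nil — not ASCII

let slash: Unicode.Scalar = "/"              // no such ambiguity at the scalar level
print(slash.isASCII, slash.value)            // true 47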

Support for other legacy encodings (ISO 8859-x, EBCDIC variants, etc.) can be provided by external packages, by simply defining similar properties on Unicode.Scalar. These would work just as nicely as .ascii.
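For example, a Latin-1 property in such a package could be tiny, since Latin-1 is the identity mapping for U+0000...U+00FF (other encodings would need a lookup table); the name latin1 below is just a placeholder:

extension Unicode.Scalar {
    /// The scalar's ISO 8859-1 (Latin-1) code, if it has one.
    var latin1: UInt8? {
        return value <= 0xFF ? UInt8(value) : nil
    }
}

let eAcute: Unicode.Scalar = "\u{E9}"
print(eAcute.latin1 as Any)   // Optional(233), i.e. 0xE9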

let hello = “I’m an ASCII bytestring”.utf8

Is this unsafe or unreadable? Why?

Yes, it is unsafe. The reason why is located immediately after the I.

3 Likes

Great point. I’ve been typing most of my posts directly into a poor web emulation of a text editor, not a code editor. Most of my apostrophes and quotes have been converted to the proper punctuation marks for English text.

There is no need for any additional compile-time checks, though: my code above (and throughout this discussion) already won’t compile because it uses English left and right quotation marks, not the ASCII approximation that Swift requires for String literals.

(Note how the corruption exhibited in these forum posts is not related to Unicode normalization. It’s the browser trying to be helpful and work around the limitations of my keyboard, which has fewer keys than English text requires.)

If such corruption is likely enough in practical contexts to deserve special treatment, then there is a wide spectrum of possible approaches to detect it. Adding dedicated language syntax for ASCII literals to protect against these seems like severe overreaction to me; the same practical effect can be achieved by runtime checks, possibly combined with special-cased warning diagnostics.

1 Like

If the problem were just limited to generating arrays of ASCII characters, I might agree with you, but as I said in my other post, there are a lot of other use cases that .utf8 doesn’t solve. I also don’t think new syntax should be the cost we need to be worried about; I’m a lot more concerned with potential solutions like the “literal-bound” 'a'.ascii, which overload existing syntax with new semantics.

These are really useful! I really don't see your point, though -- String.utf8 (and Unicode.Scalar.ascii) seem to provide perfectly elegant, safe and efficient solutions to all of them:

// storing a bytestring value
static var liga = "liga".utf8
// storing an ASCII scalar to mixed utf8-ASCII text
var xml: [UInt8] = ...
xml.append('/'.ascii)
xml.append('>'.ascii)
// ASCII range operations 
let current: UnsafePointer<UInt8> = ...
if 'a'.ascii ... 'z'.ascii ~= current.pointee {
    ...
}
// ASCII arithmetic operations 
let year: ArraySlice<UInt8> = ...
var value: Int = 0
for digit: UInt8 in year {
    guard '0'.ascii ... '9'.ascii ~= digit else {
        ...
    }
    value *= 10
    value += Int(digit - '0'.ascii)
}
// reading an ASCII scalar from mixed utf8-ASCII text 
let xml: [UInt8] = ... 
if let i: Int = xml.firstIndex(of: '<'.ascii) {
    ...
}
// matching ASCII signatures 
let c: UnsafeRawBufferPointer = ...
if c.starts(with: "PLTE".utf8) {
    ...
}

Note: I took the liberty of replacing Int8 above with UInt8. As far as I know, Int8 data typically comes from C APIs imported as CChar, which is a truly terrible type: it's documented to be either UInt8 or Int8, depending on the platform. Any code that doesn't immediately and explicitly rebind C strings to UInt8 is arguably broken.

1 Like