SE-0243: Codepoint and Character Literals

This is a logical fallacy: what the primary and core use case of single-quoted literals turns out to be depends on how we design them.

It is important to emphasize that there isn't one thing that is an "atom" of a string. All Unicode-conscious languages start with that caveat in their documentation.

There is no reason whatsoever to tie single-quoted literals to the language's choice of element when indexing (which is also not necessarily a language's choice of element when iterating). Indeed, as I wrote above, because Swift chooses to make the extended grapheme cluster the element type, compile-time validation of such a syntax would not even be possible if we did tie the two together.

In Go, indexing into a string yields its bytes, and a string is an arbitrary sequence of UTF-8 bytes; its "length" is the length in bytes. As a special exception, iteration occurs over a string's Unicode code points, or runes.

In Rust, a string slice (str) is an arbitrary sequence of UTF-8 bytes, and its "length" is the length in bytes; integer indexing is not allowed, and slicing operates on byte offsets. It is not possible to iterate over a string slice directly; one must explicitly ask for its UTF-8 byte view or Unicode scalar (char) view, and the documentation additionally notes that the user may actually want to iterate over extended grapheme clusters instead.

Swift strings can be encoded in UTF-8 or UTF-16, so they cannot be designed as in Rust or Go. In Swift 3, as in Rust today, it is not possible to iterate over String. To improve ergonomics, it was later decided to model Swift strings as a BidirectionalCollection of Characters despite violating some of the semantic guarantees of that protocol.
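For concreteness, here is a minimal illustration (using only existing standard library API) of how Swift already divorces the element type from the encoded representation; the three views of the same string have different "atoms" and therefore different counts:

let s = "cafe\u{301}"            // "café", spelled with a combining acute accent

print(s.count)                   // 4 Characters (extended grapheme clusters)
print(s.unicodeScalars.count)    // 5 Unicode scalars
print(s.utf8.count)              // 6 UTF-8 code units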

What this survey shows is that these modern languages do not tie their single-quote literal syntax with the "atom" of their string type, and in fact they divorce iteration over a string from the "atom" of their string type as well.

The Unicode scalar type, as has been mentioned, has gained some fantastic and useful APIs in Swift but lags in ergonomics due to the difficulty of expressing a literal of that type. Even though .NET strings are sequences of UTF-16 code units, they have recently adopted a new Unicode scalar type (named Rune) to improve Unicode string handling ergonomics.
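To illustrate the ergonomics gap with existing API only: the scalar-level functionality is all there, but the literal itself always needs an annotation or coercion, because a bare double-quoted literal defaults to String.

let a: Unicode.Scalar = "a"                // without the annotation, "a" would be a String
print(a.value)                             // 97
print(a.properties.name ?? "?")            // "LATIN SMALL LETTER A" (SE-0211 scalar properties)
print(("a" as Unicode.Scalar).isASCII)     // true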

5 Likes

This sounds like an argument for removing the ExpressibleByExtendedGraphemeClusterLiteral protocol entirely, because the compiler cannot validate that the contents of a literal will in fact contain exactly one extended grapheme cluster at runtime.

Since we are obviously not going to do that, and we already have dedicated syntax for specifying a Character literal—"x" as Character—the line of reasoning you describe here is inapplicable.

3 Likes

I'm not sure why that is "obvious." We can deprecate that protocol, and in fact that might be a good thing to do unless there is a clear use case for it currently.

That said, the protocol itself is fine as it makes no guarantees about compile-time validation, and the syntax for an extended grapheme cluster literal is identical to that of a string literal. "x" as Character and 123 as Character are both syntactically well-formed, and neither is a "dedicated" syntax for a character literal.
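To make that concrete with existing behaviour: both coercions are syntactically valid, but only one type-checks, and the validation that does happen comes from the type, not from a dedicated literal form.

let c = "x" as Character       // compiles: Character conforms to ExpressibleByExtendedGraphemeClusterLiteral
let s = "x"                    // the same spelling defaults to String
// let n = 123 as Character    // parses, but is rejected during type checking,
//                             // because Character is not ExpressibleByIntegerLiteral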

1 Like

I think that is basically a given at this point. There are a handful of individuals here who will be able to follow the entire discussion and all of the intricate Unicode details, but it's not reasonable to expect the majority of the community to do that.

One of the things I find quite difficult about this discussion is that we seem to have lost track of the original problem this thing was supposed to solve. As I understand it, we want to be able to match ASCII sequences (like IHDR) in byte sequences, right?

And the reason we don't care about non-ASCII sequences is because their byte representations are not obvious in the face of normalisation and combining characters and whatnot.

Why don't we just stick to the actual problem instead of getting bogged-down in syntax and integer conversions?

7 Likes

ExpressibleByExtendedGraphemeClusterLiteral has always been an oddity, not least because ExtendedGraphemeClusterLiteralType (and UnicodeScalarLiteralType) are currently unreachable.

If/when we move to a static literal model, we will not have “unicode scalar literals” or “character literals” or “string literals”; we will just have @stringLiterals and @stringElementLiterals, both of which would be represented by [Unicode.Scalar].

Surely we can’t avoid working on an area of the language just because many community members lack the background expertise to understand the problem? We don’t declare all of FloatingPoint a no-go zone just because Steve is the only person here who understands floats.

The problem is we don’t have a way to express integer values with textual semantics, with an appropriate textual literal syntax, that doesn’t cause additional issues in the rest of the language (i.e., x.isMultiple(of: 'a')). Or more broadly, we don’t have a “safe” and “readable” way to process and generate ASCII bytestrings. I don’t think anyone has lost track of that.
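For reference, the closest existing tool is UInt8.init(ascii:), which rejects non-ASCII scalars at runtime; it is safe and readable at a single call site, but verbose enough in ranges and tuples that codebases tend to fall back to bare numbers, as the examples below show.

let slash   = UInt8(ascii: "/")                           // 47
let digits  = UInt8(ascii: "0") ... UInt8(ascii: "9")     // 48 ... 57
let isDigit = digits.contains(UInt8(ascii: "7"))          // true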

IHDR is just a concrete example of something that is very difficult to safely and efficiently express with existing language tools. If you want a sampling of “pain points”, I would say that any proposed solution must address the following in a safe manner:

// storing a bytestring value 
static 
var liga:(Int8, Int8, Int8, Int8) 
{
    return (108, 105, 103, 97) // ('l', 'i', 'g', 'a')
}
// storing an ASCII scalar to mixed utf8-ASCII text
var xml:[UInt8] = ...
xml.append(47) // '/'
xml.append(62) // '>'
// ASCII range operations 
let current:UnsafePointer<Int8> = ...
if 97 ... 122 ~= current.pointee // 'a' ... 'z'
{
    ...
}
// ASCII arithmetic operations 
let year:ArraySlice<Int8> = ...
var value:Int = 0
for digit:Int8 in year 
{
    guard 48 ... 57 ~= digit // '0' ... '9'
    else 
    {
        ...
    }

    value = value * 10 + .init(digit - 48) // digit - '0'
}
// reading an ASCII scalar from mixed utf8-ASCII text 
let xml:[Int8] = ... 
if let i:Int = xml.firstIndex(of: 60) // '<'
{
    ...
}
// matching ASCII signatures 
let c:UnsafePointer<UInt8> = ...
if (c[0], c[1], c[2], c[3]) == (80, 76, 84, 69) // ('P', 'L', 'T', 'E')
{
    ...
}

There is no reason whatsoever to introduce new literal syntax for this; String’s UTF-8 view already provides a succinct and highly efficient way to represent ASCII byte sequences.

let needle = “PNG89a”.utf8
// needle is a sequence of bytes corresponding to the 
// UTF-8 encoding of “PNG89a”. For ASCII strings,
// this is exactly the same as their 7-bit ASCII encoding 
// zero-extended to 8-bit bytes.

If the standard library doesn’t provide convenient enough APIs to match such byte sequences, then that can and should be remedied by introducing new APIs in stdlib. Inventing new syntax won’t help.

(The fact that this also works for non-ASCII characters still seems like a great feature to me. UTF-8 is the new ASCII.)
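As a sketch of what such an API could look like, here is a naive generic subsequence search; the name firstIndex(ofSubsequence:) is made up for illustration and is not an existing standard library method.

extension Collection where Element: Equatable {
    /// Returns the index where `needle` first occurs as a contiguous subsequence,
    /// or nil if it never does. Naive O(n·m) search; fine for short needles.
    func firstIndex<Needle: Collection>(ofSubsequence needle: Needle) -> Index?
        where Needle.Element == Element
    {
        guard !needle.isEmpty else { return startIndex }
        var start = startIndex
        while start != endIndex {
            if self[start...].starts(with: needle) { return start }
            formIndex(after: &start)
        }
        return nil
    }
}

let png: [UInt8] = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]
let match = png.firstIndex(ofSubsequence: "PNG".utf8)    // Optional(1)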

We do not have a similarly succinct syntax to express individual bytes between 0 and 127 by their corresponding ASCII character. If this is an important use case, then Unicode scalar literal syntax would give us that by allowing ’a’.ascii. (Character is on the wrong abstraction level for this; its asciiValue property is broken.)

Support for other legacy encodings (ISO 8859-x, EBCDIC variants, etc.) can be provided by external packages, by simply defining similar properties on Unicode.Scalar. These would work just as nicely as .ascii.
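For example, a package could vend something like this hypothetical property; ISO 8859-1 maps the scalars U+0000 through U+00FF directly to single bytes, so the sketch is short:

extension Unicode.Scalar {
    /// The ISO 8859-1 (Latin-1) code unit for this scalar, or nil if it has none.
    var latin1: UInt8? { value <= 0xFF ? UInt8(value) : nil }
}

let e = ("e" as Unicode.Scalar).latin1    // Optional(101)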

let hello = “I’m an ASCII bytestring”.utf8

Is this unsafe or unreadable? Why?

Yes, it is unsafe. The reason why is located immediately after the I.

3 Likes

Great point. I’ve been typing most of my posts directly into a poor web emulation of a text editor, not a code editor. Most of my apostrophes and quotes have been converted to the proper punctuation marks for English text.

There is no need for any additional compile-time checks, though: my code above (and throughout this discussion) already won’t compile because it uses English left and right quotation marks, not the ASCII approximation that Swift requires for String literals.

(Note how the corruption exhibited in these forum posts is not related to Unicode normalization. It’s the browser trying to be helpful and work around the limitations of my keyboard, which has fewer keys than English text requires.)

If such corruption is likely enough in practical contexts to deserve special treatment, then there is a wide spectrum of possible approaches to detect it. Adding dedicated language syntax for ASCII literals to protect against these seems like severe overreaction to me; the same practical effect can be achieved by runtime checks, possibly combined with special-cased warning diagnostics.
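A sketch of the kind of runtime check meant here, using only existing API (the function name is made up):

func asciiBytes(_ string: String) -> [UInt8] {
    precondition(string.unicodeScalars.allSatisfy { $0.isASCII },
                 "string contains non-ASCII scalars (for example, curly quotes)")
    return Array(string.utf8)
}

let tag = asciiBytes("PLTE")    // [80, 76, 84, 69]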

1 Like

If the problem were just limited to generating arrays of ASCII characters, I might agree with you, but as I said in my other post, there are a lot of other use cases that .utf8 doesn’t solve. I also don’t think new syntax should be the cost we need to be worried about; I’m a lot more concerned with potential solutions like a “literal-bound” 'a'.ascii, which overloads existing syntax with new semantics.

These are really useful! I really don't see your point, though -- String.utf8 (and Unicode.Scalar.ascii) seem to provide perfectly elegant, safe and efficient solutions to all of them:

// storing a bytestring value
static var liga = "liga".utf8
// storing an ASCII scalar to mixed utf8-ASCII text
var xml: [UInt8] = ...
xml.append('/'.ascii)
xml.append('>'.ascii)
// ASCII range operations 
let current: UnsafePointer<UInt8> = ...
if 'a'.ascii ... 'z'.ascii ~= current.pointee {
    ...
}
// ASCII arithmetic operations 
let year: ArraySlice<UInt8> = ...
var value: Int = 0
for digit: UInt8 in year {
    guard '0'.ascii ... '9'.ascii ~= digit else {
        ...
    }
    value *= 10
    value += Int(digit - '0'.ascii)
}
// reading an ASCII scalar from mixed utf8-ASCII text 
let xml: [UInt8] = ... 
if let i: Int = xml.firstIndex(of: '<'.ascii) {
    ...
}
// matching ASCII signatures 
let c: UnsafeRawBufferPointer = ...
if c.starts(with: "PLTE".utf8) {
    ...
}

Note: I took the liberty of replacing Int8 above with UInt8. As far as I know, Int8 data typically comes from C APIs imported as CChar, which is a truly terrible type: it's documented to be either UInt8 or Int8, depending on the platform. Any code that doesn't immediately and explicitly rebind C strings to UInt8 is arguably broken.

1 Like

I fully agree with this part; there is no need for any language feature beyond Unicode.Scalar literals. A regular .ascii property would work just fine.

In this particular topic, we keep trying to come up with needlessly complicated syntax-level solutions to a rather niche problem that'd be much better resolved through a bit of careful API design.

1 Like

I agree CChar is horrible, but I don’t know enough about Swift’s C interop to say whether getting rid of it is possible. If so, we can probably drop the “overloads on return type” issue with options 1, 2, and 5.

This isn’t exactly ideal, since we’d get a heap allocation. That’s why I’ve used (UInt8, UInt8, UInt8, UInt8) for all these 32-bit ASCII string examples so far. (These four-byte tuples are very popular in binary file formats, since they’re the same size as a C int or float.)
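(For reference, the tuple form can be built from a string with a helper like the hypothetical sketch below, though that moves the check to runtime, which is exactly the kind of trade-off at issue here.)

func fourCC(_ tag: String) -> (UInt8, UInt8, UInt8, UInt8) {
    let bytes = Array(tag.utf8)
    precondition(bytes.count == 4 && bytes.allSatisfy { $0 < 128 },
                 "expected a four-byte ASCII tag")
    return (bytes[0], bytes[1], bytes[2], bytes[3])
}

let liga = fourCC("liga")    // (108, 105, 103, 97)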

I don’t think all of these issues can be solved at the standard library level. Ultimately, there are three requirements that come into play for all users of ASCII bytestrings:

  1. I don’t want my code to crash.

  2. I don’t want my code to be cryptic and indecipherable.

  3. I don’t want my code to be a sprawling inefficient mess.

API design will only give you two out of three.

If you care about 2 and 3, but not 1, then 'a'.ascii is the right solution for you. But then you’ll be vending a trapping API on all Unicode.Scalar values, regardless of context. And even though you could sweep all misuse under the “programmer error” rug, it would still be a questionable addition to the standard library, just like a trapping .first property on Array would be.

If you care about 1 and 2, but not 3, then 'a'.ascii would be the right solution for you, but you would want it to return an Optional, just like the (very problematic) .asciiValue property on Character. It doesn’t take a lot of imagination to see how cumbersome careful usage of this API could get.

If you care about 1 and 3, but not 2, then no solution is needed; you’ve probably memorized the ASCII table by now, and you should just go on plugging in hex or decimal values everywhere a UInt8 ASCII value is needed. But I don’t think anyone is advocating for that.
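To make the trade-off concrete, here are sketches of the two API shapes being weighed; neither exists in the standard library today, and the usage line spells the scalar with an explicit coercion in place of the proposed single-quote syntax.

extension Unicode.Scalar {
    // Cares about 2 and 3, but not 1: concise at the use site, but misuse crashes at runtime.
    var ascii: UInt8 {
        precondition(isASCII, "scalar is not ASCII")
        return UInt8(value)
    }
    // Cares about 1 and 2, but not 3: never crashes, but the careful unwrapping
    // required at every use site is where the sprawl comes from.
    var asciiIfValid: UInt8? { isASCII ? UInt8(value) : nil }
}

let lowercase = ("a" as Unicode.Scalar).ascii ... ("z" as Unicode.Scalar).ascii    // 97 ... 122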

You definitely could insist on a standard library-only solution to this problem, but it’d be pretty suboptimal. I ask that you keep an open mind to syntax- and feature-based solutions, as they’re not all as complicated as you make them sound. Option 4 (a'a') is a relatively superficial change in the compiler that would touch nothing below the lexer/parser level. Option 3 ('a' as UInt8) changes neither syntax nor semantics; it just makes it possible for certain types to make existing syntax explicit and mandatory.

If I were to post to this thread, I’d just be restating what I wrote 50 posts ago, so I’ll just post a link: SE-0243: Codepoint and Character Literals - #252 by johnno1962

TL;DR: a few well-chosen operators added to the standard library can solve the problem without trying to tie down a new literal syntax. One opinion of mine that has changed: I’m now leaning toward @michelf’s suggestion that single quotes be reserved for ASCII-only literals, to allow compile-time validation and put an end to any Unicode shenanigans.

3 Likes

One problem with the full "ASCII literal" suggestion I made in the pitch thread is that it allows single quotes to represent both a String and a UInt8, and thus suffers from the same problem I pointed out earlier, where UInt8('8') != ('8' as UInt8). To fix this, we'd have to amend it by either:

  1. disallowing single quote literals for integer types, or
  2. disallowing single quote literals for String
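For reference, essentially the same confusion is demonstrable today with double quotes and existing initializers, which is why the two meanings cannot be allowed to share one spelling:

let parsed  = UInt8("8")          // Optional(8): parses the string "8" as the number eight
let encoded = UInt8(ascii: "8")   // 56: the ASCII code unit for the character "8"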

I’ve given up on integer types being expressible by quoted literals and am currently putting forward a different approach involving targeted operators, so this shouldn’t be an issue. I’m just saying that a literal form for ASCII-only strings, which I think is currently being floated as a'a', might be worthwhile.

Sure, but we are discussing the default type here, not the sole type, and there is definitely an obvious default in Swift.

It's similarly difficult to express a Character, which should be much more commonly used than a Unicode.Scalar, so I still hold the position that if it's not important to have a literal form designed primarily for Character (i.e., one that defaults to Character), then how can it be important to have one designed for Unicode.Scalar? And if I'm wrong, and Unicode.Scalar really is more important than Character in this sense, then it seems to me that the whole Swift string design must have failed.

There is no heap allocation in "liga".utf8. It’s either an immortal string or a small string. If this isn’t the case, then that’s a bug!

I’m not saying we can change it; that’s a different discussion. I’m saying that Int8 byte sequences tend to originate from CChar, but there is no reason code should keep them in that form. Swift APIs have standardized on UInt8; imported data in any other format should be converted to match.

Meh. Array vends a trapping subscript on all array values, regardless of context. The utility of a trapping .ascii property seems clear to me, and it feels similar to how and why Array.subscript doesn’t return an optional value. If ASCII-ness is non-obvious in a particular context, the (already existing) .isASCII property can be used to test for it before accessing .ascii.

Yes, a trapping property is somewhat unusual. But in this case I feel it’s the right trade off.

1 Like

After close to 300 messages on this thread alone, it’s still not clear to me why that must be the case if there aren’t any clear use cases. Is Character used in any context where the lack of a syntactic shortcut is actively hurting Swift’s usability?