Apparently I have been repeatedly incorrect when I said scalar literals were the only place Swift syntax was not Unicode compliant. @michelf is right, we have much bigger fish to fry. Sorry for the misinformation in that regard.
They would be just as discouraged, in that neither would be discouraged.
Just because a developer is asking for a Character does not mean that they do not care about the number of Unicode scalars that make up that character. They may want to compare the input to another Character taking into account canonical equivalence, but they could be working with input and output that expects a single Unicode scalar while doing so.
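This distinction can be sketched with existing standard library APIs: Character equality already honors canonical equivalence, while the unicodeScalars view exposes the underlying scalar makeup.

```swift
// Character equality is canonical, but the scalar makeup still differs.
let precomposed: Character = "\u{E9}"   // é as a single scalar (U+00E9)
let decomposed: Character = "e\u{301}"  // 'e' + combining acute accent

print(precomposed == decomposed)         // true: canonical equivalence
print(precomposed.unicodeScalars.count)  // 1
print(decomposed.unicodeScalars.count)   // 2
```

So a caller comparing Characters can still detect that the input arrived in decomposed form, even though the two values compare equal.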
A literal value should never be normalized by tooling. That goes against what a "literal" value is. How would you feel if a tool converted all your integer literals to hexadecimal notation?
Also, it is recommended to use the "\u{...}" notation if you're doing things like this.
Alternatively (not sure it's a good idea), literals that vary under NFC could trigger a compilation error and a fix-it that used "\u{...}" notation.
It's a fantastic linter rule, I'd say.
Not sure how I'd feel with it as a compiler warning or error. For one thing, it'd disproportionately affect certain scripts and could conceivably render some of them unreadable, which would be a rather unforgivable sin for a literal notation.
As a String is a vector of Character, the latter is a vector of code points. If you mean to compare (ASCII) code points to bytes within binary data, then wouldn't Unicode.Scalar be the correct abstraction? A scalar value always maps to exactly one integer, while a character may be a vector of scalars.
For text file mangling your representation, wouldn't scalars be better? The "é" abstract character may have two Character representations: U+00E9 as a single scalar, or U+0065 & U+0301 as a primary and secondary code-point pair. If the characters within single quotes must always be Unicode scalars, then only one interpretation is allowed in the object code for "é", no matter which way it's stored in the source code file. It does mean that the compiler must have ICU or an equivalent to find valid recombinations that can resolve to a single scalar. If we allow "\u{}" Unicode escapes within single quotes, we can mandate that they always use the single-scalar version and never a decomposed form. (In other words, recomposition is allowed in translation from the source file's encoding to object code, and never from the user deliberately splitting a single-scalar character into an official decomposed form.)
This isn't quite true. Characters can contain other Characters, so it's not a neat 3-level hierarchy. It's probably better to think of Character boundaries as maximal, context-dependent intervals calculated on a String object as a whole, and an individual Character object as a very short String whose largest interval (among many shorter choices) extends across its entire length.
Also, Unicode.Scalar isn't entirely the right abstraction when comparing with UInt8s, since Unicode.Scalar can and will assume every 8-bit character it's compared against is encoded in the Latin-1 encoding, and compared with 7-bit encodings where ASCII is queen, there are just too many alternative 8-bit character encodings for me to be comfortable here.
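The Latin-1 point can be demonstrated with the existing Unicode.Scalar(UInt8) initializer, which maps byte values onto the first 256 code points. Those coincide with ISO 8859-1 but not with Windows-1252, MacRoman, or any other 8-bit encoding:

```swift
// A byte of 0xE9 decodes to "é" only if the data really is Latin-1.
let byte: UInt8 = 0xE9
let scalar = Unicode.Scalar(byte)  // maps to U+00E9 unconditionally
print(scalar == "\u{E9}")          // true: an implicit Latin-1 assumption
```

If the source bytes were in some other 8-bit encoding, this mapping silently produces the wrong character.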
It is difficult to see why single-quoted literals should be presumed to default to Character, as no language offers such a syntax.
Here's how popular programming languages make use of single quotation marks:
String
- Delphi/Object Pascal
- JavaScript
- MATLAB (char array)
- Python
- R
- SQL

'Raw' string
- Groovy
- Perl
- PHP
- Ruby

Code unit/code point/Unicode scalar
- C: int
- C++: char (if the literal is prefixed, it can be char8_t, char16_t, char32_t, or wchar_t)
- C#: char (16-bit)
- Java: char (16-bit)
- Kotlin: Char (16-bit)
- Go: rune (32-bit)
- Rust: char (32-bit)
In Go, a Unicode code point is known as a rune (a term now also adopted in .NET). In Rust, a Unicode scalar value is known as a character; in Swift, it is known as a Unicode scalar. (A Unicode scalar value is any Unicode code point except high- and low-surrogate code points.)
As can be seen, Go and Rust use single quotation marks for what in Swift is known as a Unicode scalar literal.
No language uses this notation for what in Swift is known as an extended grapheme cluster literal (i.e., character literal).
The version of Unicode supported, and therefore grapheme breaking, is a runtime concept. In other words, it is the version of the standard library linked at run time that determines whether a string's contents are one extended grapheme cluster (i.e., Character) or not.
Adding syntax to distinguish between a single character and a string that may contain zero or more such characters will enable only best-effort diagnostics at compile time. In other words, a dedicated extended grapheme cluster literal syntax can provide users no guarantees about grapheme breaking as it relates to the contents of the literal, because such knowledge cannot be "baked in" statically into the code.
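As an illustration, a multi-scalar emoji sequence shows how the Character count depends on the grapheme-breaking rules of the stdlib linked at run time, while the scalar count is fixed by the contents:

```swift
// Four emoji joined by three zero-width joiners: seven scalars total.
let family = "👩\u{200D}👩\u{200D}👧\u{200D}👦"
print(family.unicodeScalars.count)  // 7, regardless of stdlib version
// On stdlibs whose Unicode tables recognize this ZWJ sequence it is a
// single grapheme cluster; an older stdlib could report a larger count.
print(family.count)
```

Whether this string would satisfy a hypothetical "exactly one Character" literal rule thus cannot be decided once and for all at compile time.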
I think so far there have been five serious alternatives if this does get returned for revision, so I figured it's worth summarizing the pros, cons, and implications of each so we can settle on a design moving forward.
1. 'a'.ascii, callable member
let codepoint:Unicode.Scalar = 'a'
return codepoint.ascii
Single quoted literals default to | Unicode.Scalar |
Implementation difficulty | Easy |
Compile-time validation? | No |
Summary:
The Unicode.Scalar type will get an .ascii computed property, which provides its value with the trapping precondition that value < 0x80.
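A minimal sketch of what this alternative's API might look like (hypothetical: .ascii is not in the standard library, and single-quoted literals do not exist today, so a double-quoted literal stands in here):

```swift
// Hypothetical extension sketch for alternative 1.
extension Unicode.Scalar {
    /// The ASCII code of this scalar; traps if the scalar is not ASCII.
    var ascii: UInt8 {
        precondition(value < 0x80, "scalar is outside the ASCII range")
        return UInt8(value)
    }
}

let codepoint: Unicode.Scalar = "a"
print(codepoint.ascii)  // 97
```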
Pros:
- Readable, concise, and clearly indicates encoding used.
- Has high discoverability as a callable member.
- No new compiler or language features needed.
- No new syntax or semantics.
Cons:
- Character literals will continue to require type context.
- Character literals either cannot be expressed with single quotes, or would result in ambiguous expressions like 'é' as Character.
- Impossible to provide compile-time validation guarantees. (The best we can do is a warning heuristic.)
- Member .ascii would be available on all Unicode.Scalar values, including run-time values (foo.ascii), which doesn't seem appropriate from an API standpoint.
- Exposes users to run-time trapping.
- Privileges ASCII subset of Unicode.Scalar.
- Overloads on return type.
- Strongly ABI-coupled.
2. 'a'.ascii, "literal-bound" member
return 'a'.ascii
Single quoted literals default to | Character |
Implementation difficulty | Hard |
Compile-time validation? | Yes |
Summary:
Swift will support a new method attribute @literalself, essentially a more restrictive version of @constexpression on self. The Character type will get an .ascii computed property which is @literalself, and provides its ASCII value subject to the compile-time condition that it consists of a single codepoint within the ASCII range. Note that this would still be vulnerable to '\r\n' folding.
Pros:
- Readable, concise, and clearly indicates encoding used.
- Provides compile-time validation guarantee.
- Decoupled from ABI.
Cons:
- Extremely magical; could be considered an abuse of dot (.) notation.
- Effectively introduces an entire new kind of instance method to the language; depends on @constexpression to generalize into a language feature.
- Very low discoverability.
- Privileges ASCII subset of Unicode.Scalar.
- Overloads on return type.
3. 'a' as UInt8
return 'a' as UInt8
Single quoted literals default to | Character |
Implementation difficulty | Hard |
Compile-time validation? | Yes |
Summary:
Swift will introduce the concept of non-expressible literal coercions, which would allow "opt-in" literal coercions through the use of the as operator. (Note that this is not an overload on the as operator; it merely makes this operator mandatory if requested.) Contrast with Swift's existing expressible literal coercions, which are "opt-out" and make the as operator optional. All FixedWidthInteger types would receive a non-expressible literal conformance to unicode scalar literals. This is essentially identical to the proposal as written, except it requires an explicit as (U)Int8 everywhere a codepoint-literal-to-ASCII coercion takes place.
Pros:
- Readable (though not as concise).
- Makes it obvious that a literal coercion is taking place.
- Provides compile-time validation guarantee.
- Decoupled from ABI.
Cons:
- Does not indicate ASCII as the specific encoding used.
- Effectively adds a new feature to the literals system (see this post); depends on @constexpression to generalize into a language feature.
4. a'a'
return a'a'
Single quoted literals default to | Character (u'a' defaults to Unicode.Scalar) |
Implementation difficulty | Medium |
Compile-time validation? | Yes |
Summary:
Single-quoted literals will be subdivided into multiple prefixed literal sorts. Unprefixed tokens will be parsed as character literals, u-prefixed tokens will be parsed as unicode scalar literals, and a-prefixed tokens will be parsed as integer literals, constrained to the ASCII range.
Pros:
- Readable, highly concise, indicates encoding used.
- Very few compiler modifications needed, no new language features needed.
- Provides compile-time validation guarantee.
- Decoupled from ABI.
- Easily extensible to provide unambiguous syntaxes for Unicode.Scalar (u'a') and Character ('a') literals, as well as alternative character encodings.
- No new semantics.
Cons:
- Introduces new syntax to the language. (As opposed to 2 and 3, which only introduce new semantics.)
- Users need to remember single-character abbreviations for each prefix ("a for ASCII", "u for unicode scalar", etc.).
- Low discoverability.
5. Full ASCII struct
return ('a' as ASCII).value
Single quoted literals default to | Character |
Implementation difficulty | Medium |
Compile-time validation? | Yes |
Summary:
The standard library will gain a full 7-bit ASCII type which is expressible by unicode scalar literals. Compile-time validation will be performed in the compiler in a semi-magical fashion, just like the current proposal as written.
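A rough sketch of such a type (hypothetical: the name, conformance, and API are assumptions, and the compile-time validation described above is not modeled here, only a run-time precondition):

```swift
// Hypothetical ASCII struct for alternative 5.
struct ASCII: ExpressibleByUnicodeScalarLiteral {
    let value: UInt8
    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        precondition(scalar.value < 0x80, "not an ASCII scalar")
        self.value = UInt8(scalar.value)
    }
}

let slash: ASCII = "/"
print(slash.value)  // 47
```

With single-quoted literals defaulting to Character, the usage in the header would read ('a' as ASCII).value instead.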
Pros:
- The most conservative and strongly-typed design.
- Few compiler modifications needed, no new language features needed.
- Provides compile-time validation guarantee.
- No new syntax or semantics.
- Midâhigh discoverability.
Cons:
- Limited utility. (Useful for generating outputs, but useless for processing input bytestrings.)
- May encourage users to bind raw buffers to this type, which is incorrect. (An arbitrary (U)Int8 cannot be safely reinterpreted as a 7-bit ASCII value.)
- Member .value effectively overloads on return type.
- Strongly ABI-coupled.
In the languages you cite that use a character literal, only Rust and Go avoid outdated representations of characters. And in both of these languages, the single-quote literal represents a character, that is, an element of a string, because they consider code points to be string elements, unlike Swift, which uses extended grapheme clusters.
The primary and core use case of the ' ' literal is to inspect strings. In Swift, that means it should be able to represent all Character values; otherwise it is too limited for that task.
I'm not sure if this was being somewhat disingenuous, because I know you're well aware of the particular focus on Unicode-correctness for strings in Swift, but I'll take it at face value. It would be great if you would follow up on the category of "Code unit/code point/Unicode scalar" by specifying what the default "atom" of a string is in these languages, e.g. something like what you get when you index into a string, or what the string's length is calculated in terms of. A quick skim and spot check of a couple of them didn't reveal anything similar to Swift in this respect. Languages which are less interested in Unicode-correct strings are of course going to have a different idea of what this "atom" or "character" is, and that is generally reflected in their "character" syntax, so I don't find this survey very relevant here.
Edit: And of course, this delusion about the obvious default type for single quoted literals isn't unique to me and @RMJay:
This is a logical fallacy. The primary and core use case of single-quoted literals depends on how we design it.
It is important to emphasize that there isn't one thing that is an "atom" of a string. All Unicode-conscious languages start with that caveat in their documentation.
There is no reason whatsoever to tie single-quoted literals to the language's choice of element when indexing (which is also not necessarily a language's choice of element when iterating). Indeed, as I wrote above, because Swift chooses to make the extended grapheme cluster the element type, it is not possible to have compile-time validation of such syntax if we chose to do so.
In Go, indexing into a string yields its bytes, and a string is an arbitrary sequence of UTF-8 bytes; its "length" is the length in bytes. As a special exception, iteration occurs over a string's Unicode code points, or runes.
In Rust, indexing into a string slice (str) yields its bytes, and a string slice is an arbitrary sequence of UTF-8 bytes; its "length" is the length in bytes. It is not possible to iterate over a string slice; one must explicitly ask for its UTF-8 byte view or Unicode scalar view, and the documentation additionally notes that the user may actually want to iterate over extended grapheme clusters instead.
Swift strings can be encoded in UTF-8 or UTF-16, so they cannot be designed as in Rust or Go. In Swift 3, as in Rust today, it was not possible to iterate over String. To improve ergonomics, it was later decided to model Swift strings as a BidirectionalCollection of Characters despite violating some of the semantic guarantees of that protocol.
What this survey shows is that these modern languages do not tie their single-quote literal syntax with the "atom" of their string type, and in fact they divorce iteration over a string from the "atom" of their string type as well.
The Unicode scalar type, as has been mentioned, has gained some fantastic and useful APIs in Swift but lags in ergonomics due to the difficulty of expressing a literal of that type. Even though .NET strings are sequences of UTF-16 code units, they have recently adopted a new Unicode scalar type (named Rune) to improve Unicode string handling ergonomics.
This sounds like an argument for removing the ExpressibleByExtendedGraphemeClusterLiteral protocol entirely, because the compiler cannot validate that the contents of a literal will in fact contain exactly one extended grapheme cluster at runtime.
Since we are obviously not going to do that, and we already have dedicated syntax for specifying a Character literal ("x" as Character), the line of reasoning you describe here is inapplicable.
I'm not sure why that is "obvious." We can deprecate that protocol, and in fact that might be a good thing to do unless there is a clear use case for it currently.
That said, the protocol itself is fine as it makes no guarantees about compile-time validation, and the syntax for an extended grapheme cluster literal is identical to that of a string literal. "x" as Character and 123 as Character are both syntactically well-formed, and neither is a "dedicated" syntax for a character literal.
I think that is basically a given at this point. There are a handful of individuals here who will be able to follow the entire discussion and all of the intricate Unicode details, but it's not reasonable to expect the majority of the community to do that.
One of the things I find quite difficult about this discussion is that we seem to have lost track of the original problem this was supposed to solve. As I understand it, we want to be able to match ASCII sequences (like IHDR) in byte sequences, right?
And the reason we don't care about non-ASCII sequences is because their byte representations are not obvious in the face of normalisation and combining characters and whatnot.
Why don't we just stick to the actual problem instead of getting bogged-down in syntax and integer conversions?
ExpressibleByExtendedGraphemeClusterLiteral has always been an oddity, not least because ExtendedGraphemeClusterLiteralType (and UnicodeScalarLiteralType) are currently unreachable.
If/when we move to a static literal model, we will not have "unicode scalar literals" or "character literals" or "string literals"; we will just have @stringLiterals and @stringElementLiterals, both represented by [Unicode.Scalar].
Surely we can't avoid working on an area of the language just because many community members lack the background expertise to understand the problem? We don't declare all of FloatingPoint a no-go zone just because Steve is the only person here who understands floats.
The problem is we don't have a way to express integer values with textual semantics, with an appropriate textual literal syntax, that doesn't cause additional issues in the rest of the language (i.e., x.isMultiple(of: 'a')). Or more broadly, we don't have a "safe" and "readable" way to process and generate ASCII bytestrings. I don't think anyone has lost track of that.
IHDR is just a concrete example of something that is very difficult to safely and efficiently express with existing language tools. If you want a sampling of "pain points", I would say that any proposed solution must address the following in a safe manner:
// storing a bytestring value
static var liga:(Int8, Int8, Int8, Int8)
{
return (108, 105, 103, 97) // ('l', 'i', 'g', 'a')
}
// storing an ASCII scalar to mixed utf8-ASCII text
var xml:[UInt8] = ...
xml.append(47) // '/'
xml.append(62) // '>'
// ASCII range operations
let current:UnsafePointer<Int8> = ...
if 97 ... 122 ~= current.pointee // 'a' ... 'z'
{
...
}
// ASCII arithmetic operations
let year:ArraySlice<Int8> = ...
var value:Int = 0
for digit:Int8 in year
{
guard 48 ... 57 ~= digit // '0' ... '9'
else
{
...
}
value = value * 10 + .init(digit - 48) // digit - '0'
}
// reading an ASCII scalar from mixed utf8-ASCII text
let xml:[Int8] = ...
if let i:Int = xml.firstIndex(of: 60) // '<'
{
...
}
// matching ASCII signatures
let c:UnsafePointer<UInt8> = ...
if (c[0], c[1], c[2], c[3]) == (80, 76, 84, 69) // ('P', 'L', 'T', 'E')
{
...
}
There is no reason whatsoever to introduce new literal syntax for this; String's UTF-8 view already provides a succinct and highly efficient way to represent ASCII byte sequences.
let needle = “PNG89a”.utf8
// needle is a sequence of bytes corresponding to the
// UTF-8 encoding of “PNG89a”. For ASCII strings,
// this is exactly the same as their 7-bit ASCII encoding
// zero-extended to 8-bit bytes.
If the standard library doesnât provide convenient enough APIs to match such byte sequences, then that can and should be remedied by introducing new APIs in stdlib. Inventing new syntax wonât help.
(The fact that this also works for non-ASCII characters still seems like a great feature to me. UTF-8 is the new ASCII.)
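For instance, matching the IHDR example from earlier in the thread needs nothing beyond existing stdlib APIs:

```swift
// Matching an ASCII signature against raw bytes via the UTF-8 view.
let header: [UInt8] = [0x49, 0x48, 0x44, 0x52, 0x0D, 0x00]  // "IHDR" + payload
if header.starts(with: "IHDR".utf8) {
    print("found IHDR chunk")
}
```

starts(with:) compares the UInt8 elements of the array directly against the UInt8 elements of the UTF-8 view, so no conversions or new syntax are involved.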
We do not have a similarly succinct syntax to express individual bytes between 0 and 127 by their corresponding ASCII character. If this is an important usecase, then Unicode scalar literal syntax would give us that by allowing ‘a’.ascii. (Character is on the wrong abstraction level for this; its asciiValue property is broken.)
Support for other legacy encodings (ISO 8859-x, EBCDIC variants, etc.) can be provided by external packages, by simply defining similar properties on Unicode.Scalar. These would work just as nicely as .ascii.
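Such a package extension might look like this sketch (hypothetical property name; only Latin-1 is shown, where scalar values below 0x100 coincide with the encoding):

```swift
// Hypothetical package extension for a legacy encoding.
extension Unicode.Scalar {
    /// The ISO 8859-1 (Latin-1) code for this scalar, or nil if unrepresentable.
    var latin1: UInt8? {
        value < 0x100 ? UInt8(value) : nil
    }
}

print(("\u{E9}" as Unicode.Scalar).latin1 as Any)  // Optional(233)
print(("€" as Unicode.Scalar).latin1 as Any)       // nil (U+20AC is not in Latin-1)
```

An EBCDIC variant would need a lookup table rather than a range check, but the shape of the API would be the same.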
let hello = “I’m an ASCII bytestring”.utf8
Is this unsafe or unreadable? Why?
Yes, it is unsafe. The reason why is located immediately after the I.
Great point. I've been typing most of my posts directly into a poor web emulation of a text editor, not a code editor. Most of my apostrophes and quotes have been converted to the proper punctuation marks for English text.
There is no need for any additional compile-time checks, though: my code above (and throughout this discussion) already won't compile because it uses English left and right quotation marks, not the ASCII approximation that Swift requires for String literals.
(Note how the corruption exhibited in these forum posts is not related to Unicode normalization. It's the browser trying to be helpful and work around the limitations of my keyboard, which has fewer keys than English text requires.)
If such corruption is likely enough in practical contexts to deserve special treatment, then there is a wide spectrum of possible approaches to detect it. Adding dedicated language syntax for ASCII literals to protect against these seems like severe overreaction to me; the same practical effect can be achieved by runtime checks, possibly combined with special-cased warning diagnostics.