Also, it is recommended to use the "\u{...}" notation if you're doing things like this.
Alternatively (not sure it's a good idea), literals that vary under NFC could trigger a compilation error and a fix-it that used "\u{...}" notation.
It's a fantastic linter rule, I'd say.
Not sure how I'd feel with it as a compiler warning or error. For one thing, it'd disproportionately affect certain scripts and could conceivably render some of them unreadable, which would be a rather unforgivable sin for a literal notation.
As a `String` is a vector of `Character`, the latter is a vector of code points. If you mean to compare (ASCII) code points to bytes within binary data, then wouldn't `Unicode.Scalar` be the correct abstraction? A scalar value always maps to exactly one integer, while a character may be a vector of scalars.
For text file mangling your representation, wouldn't scalars be better? The "é" abstract character may have two `Character` representations: U+00E9 as a single scalar, or U+0065 & U+0301 as a primary and secondary code-point pair. If the characters within single quotes must always be Unicode scalars, then only one interpretation is allowed in the object code for "é", no matter which way it's stored in the source code file. It does mean that the compiler must have ICU or an equivalent to find valid recombinations that can resolve to a single scalar. If we allow "\u{}" Unicode escapes within single quotes, we can mandate that they always use the single-scalar version and never a decomposed form. (In other words, recomposition is allowed in translation from the source file's encoding to object code, and never from the user deliberately splitting a single-scalar character into an official decomposed form.)
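For illustration, here is how those two spellings of "é" behave in present-day Swift, using only the standard `String` and `unicodeScalars` APIs:

```swift
// Two source spellings of "é": precomposed vs. decomposed.
let composed = "\u{E9}"      // U+00E9 LATIN SMALL LETTER E WITH ACUTE
let decomposed = "e\u{301}"  // U+0065 followed by U+0301 COMBINING ACUTE ACCENT

// Swift compares Strings by canonical equivalence, so the spellings are equal...
print(composed == decomposed)          // true

// ...even though their underlying scalar sequences differ.
print(composed.unicodeScalars.count)   // 1
print(decomposed.unicodeScalars.count) // 2
```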
This isn’t quite true. `Character`s can contain other `Character`s, so it’s not a neat 3-level hierarchy. It’s probably better to think of `Character` boundaries as maximal, context-dependent intervals calculated on a `String` object as a whole, and an individual `Character` object as a very short `String` whose largest interval (among many shorter choices) extends across its entire length.
Also, `Unicode.Scalar` isn’t entirely the right abstraction when comparing with `UInt8`s, since `Unicode.Scalar` can and will assume every 8-bit character it’s compared against is encoded in the Latin-1 encoding. Compared with 7-bit encodings, where ASCII is queen, there are just too many alternative 8-bit character encodings for me to be comfortable here.
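To make the Latin-1 point concrete: the standard library's `Unicode.Scalar.init(_: UInt8)` maps a byte directly to the scalar with the same value, which bakes in the Latin-1 interpretation. A byte taken from, say, a KOI8-R file would be silently misread:

```swift
// Unicode.Scalar(_: UInt8) maps 0x00...0xFF straight to U+0000...U+00FF.
let byte: UInt8 = 0xE9
let scalar = Unicode.Scalar(byte)

print(scalar)       // "é" -- correct only if the byte really was Latin-1;
                    // in KOI8-R the same byte encodes a different character
print(scalar.value) // 233
```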
It is difficult to see why single-quoted literals should be presumed to default to `Character`, as no language offers such a syntax.
Here's how popular programming languages make use of single quotation marks:

| Language | Type of `'a'` |
| --- | --- |
| C | `int` (a double-quoted literal is a `char` array) |
| C++ | `char` (if literal is prefixed, it can be `char8_t`, `char16_t`, `char32_t`, or `wchar_t`) |
| Java | `char` (16-bit) |
| C# | `char` (16-bit) |
| Kotlin | `Char` (16-bit) |
| Go | `rune` (32-bit) |
| Rust | `char` (32-bit) |

In Go, a Unicode code point is known as a rune (a term now also adopted in .NET). In Rust, a Unicode scalar value is known as a character; in Swift, it is known as a Unicode scalar. (A Unicode scalar value is any Unicode code point except high- and low-surrogate code points.)
As can be seen, Go and Rust use single quotation marks for what in Swift is known as a Unicode scalar literal.
No language uses this notation for what in Swift is known as an extended grapheme cluster literal (i.e., character literal).
The version of Unicode supported, and therefore grapheme breaking, is a runtime concept. In other words, it is the version of the standard library linked at run time that determines whether a string's contents are one extended grapheme cluster (i.e., `Character`) or not.
Adding syntax to distinguish between a single character and a string that may contain zero or more such characters will enable only best-effort diagnostics at compile time. In other words, a dedicated extended grapheme cluster literal syntax can provide users no guarantees about grapheme breaking as it relates to the contents of the literal, because such knowledge cannot be "baked in" statically into the code.
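As a concrete illustration, whether the five scalars below coalesce into a single `Character` is decided by the grapheme-breaking rules of whichever standard library the program links against at run time (the counts shown assume a reasonably modern Unicode version):

```swift
// man + ZWJ + woman + ZWJ + girl: one family emoji under current rules.
let family = "👨\u{200D}👩\u{200D}👧"

print(family.unicodeScalars.count) // 5, on any version
print(family.count)                // 1 with a modern stdlib; older grapheme
                                   // rules produced more Characters
```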
I think so far there have been five serious alternatives if this does get returned for revision, so I figured it’s worth summarizing the pros, cons, and implications of each so we can settle on a design moving forward.
**`'a'.ascii`, callable member**

```swift
let codepoint: Unicode.Scalar = 'a'
return codepoint.ascii
```

| | |
| --- | --- |
| Single quoted literals default to | `Unicode.Scalar` |
| Implementation difficulty | Easy |
| Compile-time validation? | No |
**Summary:** The `Unicode.Scalar` type will get an `.ascii` computed property, which provides its `value` with the trapping precondition that `value < 0x80`.
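A minimal sketch of what that could look like (hypothetical; `.ascii` is not an actual standard-library property), expressed with today's double-quote literal syntax since single-quote literals don't exist yet:

```swift
// Hypothetical .ascii property with the trapping precondition described above.
extension Unicode.Scalar {
    var ascii: UInt8 {
        precondition(value < 0x80, "scalar is outside the ASCII range")
        return UInt8(value)
    }
}

let codepoint: Unicode.Scalar = "a" // type context required today
print(codepoint.ascii) // 97
```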
**Pros:**

**Cons:**

- `Character` literals will continue to require type context.
- `Character` literals either cannot be expressed with single quotes, or would result in ambiguous expressions like `'é' as Character`.
- `.ascii` would be available on all `Unicode.Scalar` values, including run-time values (`foo.ascii`), which doesn’t seem appropriate from an API standpoint.
- `Unicode.Scalar`.
**`'a'.ascii`, “literal-bound” member**

```swift
return 'a'.ascii
```

| | |
| --- | --- |
| Single quoted literals default to | `Character` |
| Implementation difficulty | Hard |
| Compile-time validation? | Yes |
**Summary:** Swift will support a new method attribute `@literalself`, essentially a more restrictive version of `@constexpression` on `self`. The `Character` type will get an `.ascii` computed property which is `@literalself`, and provides its ASCII value subject to the compile-time condition that it consists of a single codepoint within the ASCII range. Note that this would still be vulnerable to `'\r\n'` folding.
**Pros:**

**Cons:**

- Overloads the existing `.` notation with new semantics.
- Requires `@constexpression` to generalize into a language feature.
- `Unicode.Scalar`.

**`'a' as UInt8`**
```swift
return 'a' as UInt8
```

| | |
| --- | --- |
| Single quoted literals default to | `Character` |
| Implementation difficulty | Hard |
| Compile-time validation? | Yes |
**Summary:** Swift will introduce the concept of non-expressible literal coercions, which would allow “opt-in” literal coercions through the use of the `as` operator. (Note that this is not an overload on the `as` operator; it merely makes this operator mandatory if requested.) Contrast with Swift’s existing expressible literal coercions, which are “opt-out” and make the `as` operator optional. All `FixedWidthInteger` types would receive a non-expressible literal conformance to unicode scalar literals. This is essentially identical to the proposal as written, except it requires an explicit `as (U)Int8` everywhere a codepoint literal → ASCII coercion takes place.
**Pros:**

**Cons:**

- Requires `@constexpression` to generalize into a language feature.

**`a'a'`**

```swift
return a'a'
```

| | |
| --- | --- |
| Single quoted literals default to | `Character` (`u'a'` defaults to `Unicode.Scalar`) |
| Implementation difficulty | Medium |
| Compile-time validation? | Yes |
**Summary:** Single-quoted literals will be subdivided into multiple prefixed literal sorts. Unprefixed tokens will be parsed as character literals, `u`-prefixed tokens will be parsed as unicode scalar literals, and `a`-prefixed tokens will be parsed as integer literals, constrained to the ASCII range.
**Pros:**

- Provides distinct syntax for both `Unicode.Scalar` (`u'a'`) and `Character` (`'a'`) literals, as well as alternative character encodings.

**Cons:**

- Requires users to learn the prefixes (“`a` for ascii”, “`u` for unicode scalar”, etc.).

**`ASCII` struct**

```swift
return ('a' as ASCII).value
```
| | |
| --- | --- |
| Single quoted literals default to | `Character` |
| Implementation difficulty | Medium |
| Compile-time validation? | Yes |
**Summary:** The standard library will gain a full 7-bit ASCII type which is expressible by unicode scalar literals. Compile-time validation will be performed in the compiler in a semi-magical fashion, just like the current proposal as written.
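A rough sketch of such a type (hypothetical: the name `ASCII` and the conformance shown here are assumptions, and the runtime `precondition` stands in for the compile-time validation the summary describes):

```swift
// Hypothetical 7-bit ASCII type expressible by unicode scalar literals.
struct ASCII: ExpressibleByUnicodeScalarLiteral {
    let value: UInt8
    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        precondition(scalar.value < 0x80, "not an ASCII scalar")
        self.value = UInt8(scalar.value)
    }
}

let slash: ASCII = "/" // would be ('/' as ASCII) under the pitched syntax
print(slash.value) // 47
```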
**Pros:**

**Cons:**

- A `(U)Int8` cannot be safely reinterpreted as a 7-bit `ASCII` value.
- `.value` effectively overloads on return type.

In the languages you cite that use a character literal, only Rust and Go avoid outdated representations of characters. And in both of these languages, the single-quote literal represents a character, that is, an element of a string: they consider code points to be string elements, unlike Swift, which uses extended grapheme clusters.
The primary and core use case of the `' '` literal is to inspect strings. In Swift that means that it should be able to represent all the `Character` values; otherwise it is too limited for that task.
I'm not sure if this was being somewhat disingenuous, because I know you're well aware of the particular focus on Unicode-correctness for strings in Swift, but I'll take it at face value. It would be great if you would follow up on the category of “Code unit/code point/Unicode scalar” by specifying what the default “atom” of a string is in these languages, e.g. something like what you get when you index into a string, or what the string's length is calculated in terms of. A quick skim and spot check of a couple of them didn't reveal anything similar to Swift in this respect. Languages which are less interested in Unicode-correct strings are of course going to have a different idea of what this “atom” or “character” is, and that is generally reflected in their “character” syntax, so I don't find this survey very relevant here.
Edit: And of course, this delusion about the obvious default type for single quoted literals isn't unique to me and @RMJay:
This is a logical fallacy. The primary and core use case of single-quoted literals depends on how we design it.
It is important to emphasize that there isn't one thing that is an "atom" of a string. All Unicode-conscious languages start with that caveat in their documentation.
There is no reason whatsoever to tie single-quoted literals to the language's choice of element when indexing (which is also not necessarily a language's choice of element when iterating). Indeed, as I wrote above, because Swift chooses to make the extended grapheme cluster the element type, it is not possible to have compile-time validation of such syntax if we chose to do so.
In Go, indexing into a string yields its bytes, and a string is an arbitrary sequence of UTF-8 bytes; its "length" is the length in bytes. As a special exception, iteration occurs over a string's Unicode code points, or runes.
In Rust, indexing into a string slice (`str`) yields its bytes, and a string slice is an arbitrary sequence of UTF-8 bytes; its "length" is the length in bytes. It is not possible to iterate over a string slice directly; one must explicitly ask for its UTF-8 byte view or Unicode scalar view, and the documentation additionally notes that the user may actually want to iterate over extended grapheme clusters instead.
Swift strings can be encoded in UTF-8 or UTF-16, so they cannot be designed as in Rust or Go. In Swift 3, as in Rust today, it was not possible to iterate over `String`. To improve ergonomics, it was later decided to model Swift strings as a `BidirectionalCollection` of `Character`s despite violating some of the semantic guarantees of that protocol.
What this survey shows is that these modern languages do not tie their single-quote literal syntax with the "atom" of their string type, and in fact they divorce iteration over a string from the "atom" of their string type as well.
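In Swift's case, the three "atoms" a single string exposes look like this (using a decomposed "é"):

```swift
let s = "e\u{301}" // "e" + combining acute accent

print(s.count)                // 1 Character (extended grapheme cluster)
print(s.unicodeScalars.count) // 2 Unicode scalars
print(s.utf8.count)           // 3 UTF-8 code units
```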
The Unicode scalar type, as has been mentioned, has gained some fantastic and useful APIs in Swift but lags in ergonomics due to the difficulty of expressing a literal of that type. Even though .NET strings are sequences of UTF-16 code units, they have recently adopted a new Unicode scalar type (named `Rune`) to improve Unicode string handling ergonomics.
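A couple of the `Unicode.Scalar` APIs in question, which today require an explicit type annotation to reach from a literal:

```swift
let scalar: Unicode.Scalar = "a" // needs the type context

print(scalar.properties.isAlphabetic)        // true
print(scalar.properties.name ?? "<unnamed>") // LATIN SMALL LETTER A
```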
This sounds like an argument for removing the `ExpressibleByExtendedGraphemeClusterLiteral` protocol entirely, because the compiler cannot validate that the contents of a literal will in fact contain exactly one extended grapheme cluster at runtime.
Since we are obviously not going to do that, and we already have dedicated syntax for specifying a `Character` literal (`"x" as Character`), the line of reasoning you describe here is inapplicable.
I'm not sure why that is "obvious." We can deprecate that protocol, and in fact that might be a good thing to do unless there is a clear use case for it currently.
That said, the protocol itself is fine as it makes no guarantees about compile-time validation, and the syntax for an extended grapheme cluster literal is identical to that of a string literal. `"x" as Character` and `123 as Character` are both syntactically well-formed, and neither is a "dedicated" syntax for a character literal.
I think that is basically a given at this point. There are a handful of individuals here who will be able to follow the entire discussion and all of the intricate Unicode details, but it's not reasonable to expect the majority of the community to do that.
One of the things I find quite difficult about this discussion is that we seem to have lost track of the original problem this thing was supposed to solve. As I understand it, we want to be able to match ASCII sequences (like `IHDR`) in byte sequences, right?
And the reason we don't care about non-ASCII sequences is because their byte representations are not obvious in the face of normalisation and combining characters and whatnot.
Why don't we just stick to the actual problem instead of getting bogged down in syntax and integer conversions?
`ExpressibleByExtendedGraphemeClusterLiteral` has always been an oddity, not least because `ExtendedGraphemeClusterLiteralType` (and `UnicodeScalarLiteralType`) are currently unreachable.
If/when we move to a static literal model, we will not have “unicode scalar literals” or “character literals” or “string literals”; we will just have `@stringLiteral`s and `@stringElementLiteral`s, both of which are represented by `[Unicode.Scalar]`.
Surely we can’t avoid working on an area of the language just because many community members lack the background expertise to understand the problem? We don’t declare all of `FloatingPoint` a no-go zone just because Steve is the only person here who understands floats.
The problem is we don’t have a way to express integer values with textual semantics, with an appropriate textual literal syntax, that doesn’t cause additional issues in the rest of the language (e.g., `x.isMultiple(of: 'a')`). Or more broadly, we don’t have a “safe” and “readable” way to process and generate ASCII bytestrings. I don’t think anyone has lost track of that.
`IHDR` is just a concrete example of something that is very difficult to safely and efficiently express with existing language tools. If you want a sampling of “pain points”, I would say that any proposed solution must address the following in a safe manner:
```swift
// storing a bytestring value
static var liga: (Int8, Int8, Int8, Int8) {
    return (108, 105, 103, 97) // ('l', 'i', 'g', 'a')
}

// storing an ASCII scalar to mixed utf8-ASCII text
var xml: [UInt8] = ...
xml.append(47) // '/'
xml.append(62) // '>'

// ASCII range operations
let current: UnsafePointer<Int8> = ...
if 97 ... 122 ~= current.pointee // 'a' ... 'z'
{
    ...
}

// ASCII arithmetic operations
let year: ArraySlice<Int8> = ...
var value: Int = 0
for digit: Int8 in year
{
    guard 48 ... 57 ~= digit // '0' ... '9'
    else
    {
        ...
    }
    value = value * 10 + .init(digit - 48) // digit - '0'
}

// reading an ASCII scalar from mixed utf8-ASCII text
let xml: [Int8] = ...
if let i: Int = xml.firstIndex(of: 60) // '<'
{
    ...
}

// matching ASCII signatures
let c: UnsafePointer<UInt8> = ...
if (c[0], c[1], c[2], c[3]) == (80, 76, 84, 69) // ('P', 'L', 'T', 'E')
{
    ...
}
```
There is no reason whatsoever to introduce new literal syntax for this; String’s UTF-8 view already provides a succinct and highly efficient way to represent ASCII byte sequences.
```swift
let needle = “PNG89a”.utf8
// needle is a sequence of bytes corresponding to the
// UTF-8 encoding of “PNG89a”. For ASCII strings,
// this is exactly the same as their 7-bit ASCII encoding
// zero-extended to 8-bit bytes.
```
If the standard library doesn’t provide convenient enough APIs to match such byte sequences, then that can and should be remedied by introducing new APIs in stdlib. Inventing new syntax won’t help.
(The fact that this also works for non-ASCII characters still seems like a great feature to me. UTF-8 is the new ASCII.)
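A sketch of that approach applied to the earlier `IHDR` matching example, using only the UTF-8 view and a manual scan so it runs on any recent Swift (newer standard libraries also provide dedicated search APIs):

```swift
// Locate the ASCII signature "IHDR" inside raw bytes via String's UTF-8 view.
let needle = Array("IHDR".utf8)
let bytes: [UInt8] = [0x00, 0x0D, 0x49, 0x48, 0x44, 0x52, 0x00] // ...IHDR...

let found = bytes.indices.contains { bytes[$0...].starts(with: needle) }
print(found) // true
```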
We do not have a similarly succinct syntax to express individual bytes between 0 and 127 by their corresponding ASCII character. If this is an important use case, then Unicode scalar literal syntax would give us that by allowing `’a’.ascii`. (`Character` is on the wrong abstraction level for this; its `asciiValue` property is broken.)
Support for other legacy encodings (ISO 8859-x, EBCDIC variants, etc.) can be provided by external packages, by simply defining similar properties on `Unicode.Scalar`. These would work just as nicely as `.ascii`.
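For instance, a package could add a hypothetical `latin1` property (the name and shape here are illustrative, not an existing API) in exactly the same spirit:

```swift
extension Unicode.Scalar {
    /// The scalar's Latin-1 (ISO 8859-1) byte, if it has one. Hypothetical sketch.
    var latin1: UInt8? {
        value < 0x100 ? UInt8(value) : nil
    }
}

let e: Unicode.Scalar = "é"
print(e.latin1 as Any) // Optional(233)
```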
let hello = “I’m an ASCII bytestring”.utf8
Is this unsafe or unreadable? Why?
Yes, it is unsafe. The reason why is located immediately after the `I`.
Great point. I’ve been typing most of my posts directly into a poor web emulation of a text editor, not a code editor. Most of my apostrophes and quotes have been converted to the proper punctuation marks for English text.
There is no need for any additional compile-time checks, though: my code above (and throughout this discussion) already won’t compile because it uses English left and right quotation marks, not the ASCII approximation that Swift requires for String literals.
(Note how the corruption exhibited in these forum posts is not related to Unicode normalization. It’s the browser trying to be helpful and work around the limitations of my keyboard, which has fewer keys than English text requires.)
If such corruption is likely enough in practical contexts to deserve special treatment, then there is a wide spectrum of possible approaches to detect it. Adding dedicated language syntax for ASCII literals to protect against these seems like severe overreaction to me; the same practical effect can be achieved by runtime checks, possibly combined with special-cased warning diagnostics.
If the problem were just limited to generating arrays of ASCII characters, I might agree with you, but as I said in my other post, there are a lot of other use cases that `.utf8` doesn’t solve. I also don’t think new syntax should be the cost we need to be worried about; I’m a lot more concerned with potential solutions like “literal-bound `'a'.ascii`”, which overload existing syntax with new semantics.
These are really useful! I really don't see your point, though -- `String.utf8` (and `Unicode.Scalar.ascii`) seem to provide perfectly elegant, safe and efficient solutions to all of them:
```swift
// storing a bytestring value
static var liga = "liga".utf8

// storing an ASCII scalar to mixed utf8-ASCII text
var xml: [UInt8] = ...
xml.append('/'.ascii)
xml.append('>'.ascii)

// ASCII range operations
let current: UnsafePointer<UInt8> = ...
if 'a'.ascii ... 'z'.ascii ~= current.pointee {
    ...
}

// ASCII arithmetic operations
let year: ArraySlice<UInt8> = ...
var value: Int = 0
for digit: UInt8 in year {
    guard '0'.ascii ... '9'.ascii ~= digit else {
        ...
    }
    value *= 10
    value += Int(digit - '0'.ascii)
}

// reading an ASCII scalar from mixed utf8-ASCII text
let xml: [UInt8] = ...
if let i: Int = xml.firstIndex(of: '<'.ascii) {
    ...
}

// matching ASCII signatures
let c: UnsafeRawBufferPointer = ...
if c.starts(with: "PLTE".utf8) {
    ...
}
```
Note: I took the liberty of replacing `Int8` above with `UInt8`. As far as I know, `Int8` data typically comes from C APIs imported as `CChar`, which is a truly terrible type: it's documented to be either `UInt8` or `Int8`, depending on the platform. Any code that doesn't immediately and explicitly rebind C strings to `UInt8` is arguably broken.
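A sketch of that normalization step, assuming a platform where `CChar` is `Int8` (on a platform where it is already `UInt8`, the `bitPattern` conversion would be unnecessary and this exact code would not compile):

```swift
// Rebind a NUL-terminated CChar buffer to UInt8 before doing byte-level work.
let cString: [CChar] = [72, 105, 33, 0] // "Hi!" plus NUL
let bytes = cString.prefix(while: { $0 != 0 })
                   .map { UInt8(bitPattern: $0) }

print(String(decoding: bytes, as: UTF8.self)) // Hi!
```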