Unicode scalar literals

At that point I can just write ascii("a") and get the same effect using contextual types (and a helper function in my project, not a proposed addition to the stdlib). I want something more compact.
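A helper of the kind described might look something like this (a hypothetical sketch; the name `ascii` and the generic shape are assumptions, not the poster's actual code):

```swift
// Hypothetical helper: contextual typing picks the destination integer type.
func ascii<T: FixedWidthInteger>(_ scalar: Unicode.Scalar) -> T {
    precondition(scalar.isASCII, "ascii(_:) requires an ASCII scalar")
    return T(scalar.value)
}

let u: UInt8 = ascii("a")  // 97
let i: Int8  = ascii("a")  // 97
```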


Yes, to be clear, this proposal is fundamentally addressed at improving facilities for low-level Unicode-aware text processing. We see .NET moving in the direction of adopting Go's Rune type for much the same reasons.

If the most you would want to do is to search through ASCII bytes, then indeed the proposed solution (in isolation, as per core team feedback) does not offer that at the maximum compactness; the same can be said for pretty much all of String, though, for that use case.


That’s fair. Neglecting to mention the name of the encoding would be a step too far for me.

Imho this is the most convincing example that there is some issue with the status quo:
case Int8(UInt8(ascii: "a"))
looks really cumbersome compared to case 'a'.
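To spell out the comparison, here is roughly what that pattern looks like in full today (illustrative; the buffer and values are made up):

```swift
// Matching ASCII bytes in an Int8 buffer (e.g. bytes from a C string):
let bytes: [Int8] = [97, 32, 98]
for byte in bytes {
    switch byte {
    case Int8(UInt8(ascii: "a")):
        print("matched 'a'")
    case Int8(UInt8(ascii: " ")):
        print("matched space")
    default:
        print("other")
    }
}
```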

But maybe there are other ways to solve this?
I recently had the odd idea of adding static computed properties to UInt8 - one for each character.
That would allow you to write
case .a:
instead, which is even more concise than single quotes.

It is probably a weird concept, but the good thing with ASCII is that it's rather limited compared to Unicode.
Because of the restrictions for variable names, you couldn't represent all characters in this way - but actually, I think using textual descriptions for special characters (UInt8.space instead of ' ') could even make the code easier to understand.
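A sketch of that idea (the property names here, including `space`, are assumptions; nothing like this exists in the standard library):

```swift
// Hypothetical extension: one static property per ASCII character,
// with a spelled-out name where the character can't be an identifier.
extension UInt8 {
    static var a: UInt8 { 0x61 }
    static var z: UInt8 { 0x7A }
    static var space: UInt8 { 0x20 }
}

func describe(_ byte: UInt8) -> String {
    switch byte {
    case .a:     return "letter a"
    case .space: return "space"
    default:     return "other"
    }
}
```

`case .a:` works because the expression pattern gets `UInt8` type context from the value being matched.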


This is certainly an interesting direction.

The core team's feedback from SE-0243 is to keep this issue separated from the topic of single-quoted literals and to tackle it second, so this pitch makes no mention of it or any alternatives here.


I’ve tried something like this for handling keypresses in an application, and what I’ve found is it’s surprisingly hard to remember what the non-alphanumeric symbol name is.

'!' -> 'exclam', 'exclamation', 'bang', 'exclamationMark'?
'@' -> 'at', 'atSign', 'atSymbol'?
'#' -> 'hashtag', 'hashTag', 'number', 'numberSign', 'pound', 'poundSymbol', 'poundSign'?
'$' -> 'dollar', 'dollarSymbol', 'dollarSign'?
'%' -> 'percent', 'percentage', 'percentSign', 'percentageSign', 'percentSymbol'? 
'^' -> ????????? (discoverability = 0)
'&' -> 'and', 'andSymbol', 'ampersand'?
'*' -> 'asterisk', 'star', 'starSymbol'?
'(' -> 'lparen', 'lParentheses', 'leftParentheses', 'openParentheses'?
')' -> 'rparen', 'rParentheses', 'rightParentheses', 'closeParentheses'?
'-' -> 'dash', 'hyphen', 'minus', 'hyphenMinus'?
'_' -> 'underscore', 'underbar'?
'+' -> 'plus', 'plusSign', 'plusSymbol'?
'=' -> 'equal', 'equals', 'equalSign', 'equalsSign'?
'{' -> 'leftCurlyBrace', 'leftCurlyBracket', 'leftFrenchBrace', 'leftFrenchBracket', 'openCurlyBrace', ...?
'}' -> ...?
'[' -> 'leftBracket', 'leftSquareBracket', 'openBracket', 'openSquareBracket'?
'|' -> 'verticalBar', 'pipe'?
'\' -> 'backslash', 'backSlash'?
':' -> 'colon'
';' -> 'semi', 'semicolon', 'semiColon'?
'"' -> 'quot', 'quote', 'doubleQuote'?

'^' is "caret", but the point is still valid.

(For fun, take a look at the mnemonics @Joe_Groff came up with for mapping ASCII operator characters to the English alphabet. That's a slightly different problem because the names have to have a unique starting letter, but it's still fun.)


This argument isn’t relevant. You are (again) mixing up the concepts of literal conversion and literal coercion. Double-quoted literals are capable of providing the exact same static guarantees as single-quoted literals (as they do right now!), the difference is only in the amount of type context needed to accomplish it.
The sole benefit of single quoted literals for Unicode.Scalar only, is that this guarantee can now be provided:

let c:Character = 'a'

which guarantees that c is a single-codepoint Character. But we already decided this kind of expression is a bug of the design, not a feature, so I hardly see the utility.
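For comparison, the static guarantee in question is already available with double-quoted literals given type context (a minimal illustration):

```swift
let s: Unicode.Scalar = "a"  // compiles: the literal is a single scalar, checked at compile time
let c: Character = "a"       // compiles: the literal is a single grapheme cluster
// let bad: Unicode.Scalar = "ab"  // compile-time error: not a single Unicode scalar
print(s.value)  // 97
```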

This kind of discussion only proves that the entire literals system needs rework to present the concept of static coercion more correctly to users.

I am referring neither to coercion nor conversion. I am arguing that a literal syntax dedicated to mean "this is a character" is misleading because it cannot guarantee that what's contained is in fact a character. This has nothing to do with the implementation of literals and is exclusively about the surface syntax for end users.

Have we? I don't see why. This exactly parallels floating-point types being expressible by integers.

It is no more misleading than double quoted literals for Character. It sounds like an argument against any literal syntax at all for Character.

from the rejection rationale:

One concern raised during the review was that because ExpressibleByStringLiteral refines ExpressibleByExtendedGraphemeClusterLiteral, then type context will allow expressions like 'x' + 'y' == "xy". The core team agrees that this is unfortunate and that if these protocols were redesigned, this refinement would not exist. However, this is not considered enough of an issue to justify introducing new protocols to avoid the problem. Where practical, the implementation of single-quote literals should generate compile-time warnings for these kind of misuses – though this should not be done by adding additional deprecated operators to the standard library.

I believe this also covers the implicit promotion relationship between ExpressibleByExtendedGraphemeClusterLiteral and ExpressibleByUnicodeScalarLiteral.

Note that the specific concern about guaranteeing that "x" is a single character is because that can change based on what version of Unicode you're using, which Swift treats as a run-time decision. On old versions of macOS, for example, skin-tone-modified emoji are not considered valid Characters, because the version of Unicode that the system provided did not know about skin tone modifiers.


We don't have a dedicated syntax for Character: these are spelled identically to string literals and for good reason (i.e., because they cannot be distinguished at compile time). And yes, it is an argument against any dedicated literal syntax for Character.


This is not an "implicit promotion relationship" but one protocol refining the other. I'm not sure why you would conclude that this particular relationship is problematic; I have demonstrated where there would be a benefit, and I have seen no examples where it would be unfortunate in the way that 'x' + 'y' == "xy" might be.

Throughout the several threads leading up to this (previous pitch threads, the rejected proposal, this thread), I still remain unshaken in my core belief that if there's going to be a single-quoted literal for character-like things in Swift then it should naturally default to Character. I don't want to rehash all of my posts, but I'll summarise some of them as responses to this pitch.

Sure, but so is Unicode.Scalar. If the primary use case is low-level processing of ASCII strings, then perhaps single-quoted literals should be devoted to ASCII, as you mention in your alternatives. Processing at the level of Unicode.Scalar is a small niche of the already-niche use case of low-level string processing. It seems to me to be in no man's land, satisfying neither people processing strings in the natural way for Swift (i.e. by Character) nor people processing ASCII strings.

The same thing is true of Character, and it would be similarly great if people learning Swift could explore properties of Character, the type they are more likely to use, in a playground. The set of properties isn't currently as rich as it is on Unicode.Scalar, but that won't necessarily be true forever.
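For instance, both levels already expose some queryable properties today; this just contrasts the two (SE-0221 added the Character-level ones, SE-0211 the Unicode.Scalar ones):

```swift
let ch: Character = "a"
print(ch.isLetter)  // true: Character-level property (SE-0221)

let scalar: Unicode.Scalar = "\u{E9}"  // é
print(scalar.properties.isAlphabetic)  // true: Unicode.Scalar property (SE-0211)
print(scalar.properties.name ?? "?")   // "LATIN SMALL LETTER E WITH ACUTE"
```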

Your examples here really demonstrate to me how confusing mixing together UTF-8 bytes, Unicode scalars, Unicode code points, etc. can be in a language. But this suggests to me that adding a shorthand syntax for Unicode.Scalar may only bring some of this confusion to Swift as well. Are there any languages with a similar focus on Unicode-correctness to Swift that we could look at instead? I'm having trouble seeing “some other modern languages have seriously confusing string implementations” as a great argument for this pitch.

I don't believe this is the reason that there is currently not a dedicated literal syntax for Character (and would be an argument against even the current status of using double-quoted string literals with Character). If I recall correctly, core Swift developers have previously said that single-quoted literals were being reserved mostly in case they were going to be used for raw strings, which are now implemented with a different syntax. And I don't see this as a great argument against a literal syntax that defaults to Character, as the literals could be checked at compile-time in a best-effort way that will handle a lot of cases, with the rest verified at run-time (i.e. how double-quoted literals and Character currently work). A lot of other Swift features work in the same way.

I would personally replace Character with Unicode.Scalar in these sentences. If Character isn't important enough to be the default type for a literal (and it may not be), then I don't think Unicode.Scalar is.


This is explicitly the use case that this pitch seeks to make more ergonomic. There are not many users today who do this kind of processing with Swift because the language makes it very cumbersome. If you have concluded already that this use case is not worth addressing, then you have presupposed that this pitch should be rejected.

If you read the motivations in the document you link, you will see that one of the reasons why these properties were added to Character is that use of Unicode.Scalar properties is not ergonomic. This pitch addresses that problem directly.

It will necessarily be true forever. You will note, as written in the document you linked, that Unicode does not define these properties for extended grapheme clusters, only for Unicode scalars. That proposal makes best efforts at adding a small number of them for Character, and for reasons outlined here, at least one of these is a footgun for ASCII byte processing.

These languages are cited because they have a focus on Unicode correctness. In fact, the term “rune” adopted in Go was first used by Rob Pike and colleagues when they created UTF-8, and Rob Pike now works on Go.

Swift is ambitious in its Unicode support, but do not suppose that it has already achieved its ambitions. Unicode defines a Unicode string as a sequence of code units, which is modeled more explicitly in other languages than Swift.

When contributors to .NET considered whether to adopt a Go-like rune, they also surveyed Swift’s design choices and Miguel de Icaza wrote: “but also Swift is not a great model for handling strings.”

Of course, whether a model is great or not depends on use case, but I would put it to you that Go and .NET are not “a small niche.”

Because Unicode grapheme breaking changes from version to version, compile-time checking produces false positives and false negatives: by definition, it can “handle” zero cases.

So what happens if you backward-deploy a skin-tone-modified emoji on an older macOS:

let x = "🧒🏽" as Character

Just did the test by back-deploying to macOS 10.9 and 10.11. Nothing special happens: you just have a Character variable that looks like it contains two characters.

Maybe that's not relevant, but you can sort of do the same thing with Unicode scalars:

let x = "ﬂ" as Unicode.Scalar

This "ﬂ" ligature is meant to display as two characters, although it is (I believe) still a single grapheme.

I'd choose Unicode.Scalar for single quote literals because it is a different level of abstraction than String and Character. Using a different syntax would clarify we're working at the level where '\u{37e}' != ';'. But that thinking somewhat breaks if Character and String can be initialized using the same literals (the separation becomes fuzzy again), so I'm not too sure what to think.
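The '\u{37e}' != ';' point can be seen concretely today with double-quoted literals:

```swift
// U+037E GREEK QUESTION MARK canonically decomposes to U+003B SEMICOLON,
// so the two compare equal as Strings (canonical equivalence)
// but are distinct Unicode scalars.
let greekQuestionMark: Unicode.Scalar = "\u{37E}"
let semicolon: Unicode.Scalar = ";"

print(greekQuestionMark == semicolon)                  // false: distinct scalars
print(String(greekQuestionMark) == String(semicolon))  // true: equal as Strings
```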

Swift uses the term Character to refer to an extended grapheme cluster. This concept is distinct from what “looks like a character” although it deliberately approximates that.

This is, as I’m sure you see, problematic. You should not be able to instantiate a Character with "🧒🏽" on older systems where that is not a single extended grapheme cluster. If you instantiate a String with "🧒🏽" and deploy it, you will see that its count is 1 on newer systems and 2 on older systems.

This is a good point and would argue for compile-time warnings against mixing the notations.

This partial quote removes the context from my post here that most use cases from the previous threads and in this pitch are for processing ASCII not Unicode scalars. This pitch does essentially nothing to make the ASCII case more ergonomic, as @jrose points out. Hence what I said about being in no man's land, in my opinion.

It does not follow from “Unicode [currently] does not define these properties” that this will necessarily be true forever. And, as you note, Swift already defines properties on Character despite the lack of such definitions.

I'm having trouble seeing the relevance here. Swift has already made this choice for handling strings, and I presume you're not proposing to change it, as you support it in the Motivation section. And this decision was made in the context of a mature language with a 16-bit Char and very different backwards compatibility issues. And he also writes: “So the Swift character does not have a fixed 32-bit size, it is variable length (and we should also have that construct, but that belongs in a different data type)”.

I have no idea what you are responding to here. I said that low-level text processing on Unicode scalars was a small niche of a niche (again, the larger part being processing ASCII), but you've somehow interpreted that as me saying that Go and .NET are a small niche?

Clearly an exaggeration, and applies to the current use of double-quoted literals anyway. And arbitrary changes to the Unicode specification could invalidate any part of Swift's Unicode implementation. I'll defer to the experts here, but the recent changes to grapheme breaking that I'm aware of have been to broaden what counts as a single grapheme, which seems benign in this context. And the best-effort checking can be fairly broad, as I understand it currently is, while still catching most mistakes in practice.


For my part, I have a hard time seeing how Unicode scalars are "a niche within a niche" when I can't even see when I would want to use Character as my abstraction level.

I think we need to come with a list of string processing use cases and the levels of string representation adequate for each, otherwise it's just us arguing in a void.

If you can't see when you would ever want to process a string at the Character level then the Swift string design is a failure and we have bigger problems than single-quoted literals. If you're looking for low-level string processing use cases then see the previous threads but, as I said, I mostly (only?) recall seeing ASCII examples, and this proposal doesn't seem to make that case more ergonomic.