Prepitch: Character integer literals

but you see, that’s the point. we know that new users have a very hard time getting their heads around Swift’s three-level string model (four, if you count the UTF storage level), since most languages have a two-level model (“strings are arrays of chars”). then you get people asking questions like “why does the character count of a String not always change when I append new stuff to it?”. If you write a Character as "💁🏼", that kind of suggests the object you’ve really created is conceptually a ['💁', '🏼'], which is exactly what we’re trying to communicate.
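
to make those levels concrete, here’s a rough sketch of the counts I’d expect from the standard views (my own example, not something from the thread):

let tipper = "💁🏼"              // PERSON TIPPING HAND + skin-tone modifier
tipper.count                    // 1  -- one Character (grapheme cluster)
tipper.unicodeScalars.count     // 2  -- U+1F481, U+1F3FC
tipper.utf16.count              // 4  -- two surrogate pairs
tipper.utf8.count               // 8  -- four bytes per scalar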

We already have integer literals that fail depending on the inferred type:

$ swift
Welcome to Apple Swift version 4.1 (swiftlang-902.0.41 clang-902.0.31). Type :help for assistance.
  1> let a: Int = 300
a: Int = 300
  2> let b: Int8 = 300
b: Int8 = <extracting data from value failed>

error: repl.swift:2:15: error: integer literal '300' overflows when stored into 'Int8'
let b: Int8 = 300
              ^

Should we change integer literals to only support 8-bit values?

3 Likes

What is and is not truly a grapheme is only determinable at run time. The compiler makes a best-effort attempt when it comes to issuing diagnostics.

1 Like

fair enough

Maybe I'm misunderstanding you, but it should be a compile-time error to use a multi-grapheme sequence in single quotes. This isn't a failable initializer that returns an optional.

Swift didn't come up with these distinctions. This is the nature of Unicode and a direct consequence of modeling it faithfully.

"Most languages" matured in a time before everyday users of computing devices write each other using multi-code-point grapheme clusters in the form of emoji, and before human thoughts were required to be divided into 140-character chunks. Even non-technical users will care now if they press a single button on their software keyboard and their app says they've used "8 characters."

3 Likes

users shouldn’t care how many code points are in a 👩‍⚖️, but developers most certainly should. Graphemes have always been an abstraction we use to relate encoded text to what humans consider atomic characters. we should make it obvious to developers that what a user would consider an atomic letter is actually an array of logical code points; it’s just a coincidence that 99% of the time, the length of that array is 1. I don’t know why this is such an argument; it’s the whole reason Swift has separate concepts of Character and Unicode.Scalar, and why Swift has to do a linear amount of work to find the character boundaries in a String.
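
to put rough numbers on that (my own example, assuming the fully-qualified emoji sequence with the variation selector):

let plain = "swift"
plain.count                    // 5
plain.unicodeScalars.count     // 5  -- the usual case: one scalar per Character

let judge = "👩‍⚖️"
judge.count                    // 1
judge.unicodeScalars.count     // 4  -- U+1F469, U+200D, U+2696, U+FE0F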

btw, user-facing character counts have never been just a count of Characters; they’re almost always from some more sophisticated algorithm that does stuff like strip out whitespace and take into account the amount of information in the character (else Chinese tweets would be novels!)

also, this is 2018; we write in 280-character chunks now

It depends on what you're developing. If you're moving the caret in a text editing app you want to advance by a single grapheme cluster; drawing your caret in the middle of an emoji is pretty gauche regardless of how many code points it comprises.

But in any case, I'm not sure what your argument is here. Of course, developers should have access to information about code units and code points in their strings, and that's why Swift models Unicode so faithfully.

It's not exactly fun to index by code points when text is encoded in UTF-8 either. Strings are hard, because it's not just graphemes that are abstractions we use to relate to humans--text itself is an abstraction that we use to relate to humans, and humans are complex, illogical, messy, and squishy.
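
For what it's worth, here's a toy sketch (my own example, not anyone's production code) of what grapheme-based caret movement looks like in Swift:

let text = "a👩‍⚖️b"
var caret = text.startIndex
caret = text.index(after: caret)    // past "a"
caret = text.index(after: caret)    // past the whole judge emoji, never into its middle
print(text[caret])                  // "b"
print(text.unicodeScalars.count)    // 6 -- the landing spots a scalar-based caret would offer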

4 Likes

you should go drag Firefox then; it still draws the cursor down the middle of 🤷🏼 and all the other people emoji

this argument is literally over whether or not to use '' for grapheme clusters, or "". it’s not about changing any APIs. it’s not about changing anything about Swift’s string paradigm. it’s not a debate about whether a grapheme is a fundamentally different concept than a code point. (everyone agrees it is!!) this is literally about whether or not it is useful to use notation that makes that point clear, because that is the purpose of notation. I don’t think it’s useful to use the two types of quotes to differentiate based on how many Characters there are. everyone knows how to count characters; the definition of a Character is literally “the thing that human readers count”. taking up that syntactical space for that is a waste. however, I do think it’s useful to use the two types of quotes to differentiate based on Unicode representation.

the difference being that UTF-8 and UTF-16 are examples of ways to store Unicode.Scalar data, but the scalars themselves are the logical atomic units of information. I could store all the scalars in 128-bit integers, and I would still have Unicode.
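
for example (my own illustration of the point):

let pi = "π"                        // U+03C0 -- one scalar, one logical unit of information
pi.utf8.count                       // 2  -- stored as 0xCF 0x80 in UTF-8
pi.utf16.count                      // 1  -- stored as 0x03C0 in UTF-16
pi.unicodeScalars.count             // 1  -- the same scalar, whatever code units it happens to be stored in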

You could similarly say that a sequence of one or more Unicode scalars is a way to store single grapheme cluster so I'm not sure that is a useful distinction, or why scalars are automatically the logical atomic units. The logical atomic unit depends on the context of what you are doing in your code, which is fundamentally why there are multiple views into a String's contents.

1 Like

This is a perfect place to apply the swift-evolution process. Write the proposal in the way that makes sense to you, then run it in a pitch phase. You'll collect feedback from other people and will either confirm your beliefs or may lead you to change. By the time the proposal is officially reviewed and approved, it will have gotten a lot of careful consideration from many people.

-Chris

5 Likes

This is not true. Graphemes can contain other graphemes. Take the judge emoji (:woman_judge:) for example: it is built from the woman emoji (:woman:) and the scales emoji (:balance_scale:), both of which are valid emoji themselves. that’s why you can start with "👩‍" (one Character long) and append "⚖️" (also one Character long, plus a ZWJ) to it, and end up with one Character, "👩‍⚖️".

This is not true for code units. A UTF-8 continuation byte on its own has no meaning. You cannot break up a UTF-16 surrogate pair; they always travel together, i.e. they form a logically atomic unit.
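
here’s the append example written out (counts as I’d expect them under the current grapheme-breaking rules):

var judge = "👩\u{200D}"          // WOMAN + ZWJ
judge.count                       // 1
judge.append("⚖\u{FE0F}")         // SCALES + variation selector, itself one Character
judge.count                       // still 1 -- the pieces fused into 👩‍⚖️
judge.unicodeScalars.count        // 4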

Sure, that is true, but it's not related to what the logical atomic unit to operate on in a given piece of code is. The default atomic unit in Swift is a Character, regardless of the fact that you can break some Characters into multiple Characters, and there are three other built in views for code where the logical unit is different.

I wouldn’t say that so much as that Swift’s Strings are Collections of Characters because most of the String methods we use are high-level algorithms that think in terms of Characters, not Unicode.Scalars. I am not sure if having String be a Collection of Characters is entirely correct; in fact, in Swift 3 it wasn’t, for exactly this reason, but ergonomics proved way more salient than pedantry. but if you ask me, not having to hold down Shift to type a quotation mark is not such a big win that we should make that same compromise here.

A high-level algorithm that thinks in terms of Characters is exactly what Character being the logical atomic unit for those algorithms means to me.

1 Like

the thing about atoms is that, by definition, you can’t put two of them next to each other and call the result an atom too. better to think of these algorithms as operating on calculated intervals within a sequence of code points (themselves encoded in variable-length code units). it just so happens that those intervals are so useful that they are exposed as the default iteration unit in String.

That's what they thought about real atoms as well, until they discovered fusion and fission, so perhaps it's a better analogy than anyone thought :)

For anyone wondering about the similar issues with Strings, this provides good context for the Swift 4 decision to make Strings collections of characters again. A similar argument might apply here.

2 Likes

I'd just like to throw a wrench out here for consideration. The OP states that in C

Without reviewing the actual C standard I'm going to go out on a limb and say I don't believe this is necessarily true. On an EBCDIC platform char a = 'a' would set the value of the variable named a to 129. Or a bit more obtusely, since char is (I believe; I'm not a C programmer) signed, then a = -127. In any case, the hex value is 0x81.
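
The signed reinterpretation is easy to check in Swift itself, since it's just two's-complement arithmetic and nothing EBCDIC-specific:

let ebcdicA: UInt8 = 0x81            // 'a' in EBCDIC
print(Int(ebcdicA))                  // 129
print(Int8(bitPattern: ebcdicA))     // -127, if char happens to be signed on your platform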

I am beta testing IBM's port of Swift to the z/OS (mainframe) platform, so things like this weigh on my mind.

On a similar track, I was interested to see someone throw out a thought about a new type he called ASCIICodeUnit. Not sure if it should be discussed here or if a new thread should be used, but I'll start it here for the moment. It seems to me it might be useful to have some sort of CodeUnit type that is backed by a single byte of storage, but is neither an Int8 nor a UInt8; rather, something that is "C compatible" with both "[signed] char" and "unsigned char". I've had a heck of a time interfacing with some of z/OS's APIs because there seems to be no consistency with regard to whether character strings are declared as "char *" or "unsigned char *". Not to mention the fact that these APIs, in order to support the primary z/OS programming languages COBOL and PL/I, generally use fixed-length strings rather than null-terminated strings.
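
Purely hypothetically, the shape of the thing I have in mind is roughly this (names and details invented on the spot, not an existing or proposed API):

struct CCharCodeUnit {
    var bits: UInt8                                           // one byte of storage
    init(_ value: UInt8) { bits = value }                     // from "unsigned char"
    init(_ value: Int8)  { bits = UInt8(bitPattern: value) }  // from "[signed] char"
    var asSignedChar: Int8    { return Int8(bitPattern: bits) }
    var asUnsignedChar: UInt8 { return bits }
}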

I have many more thoughts on the issue of EBCDIC and z/OS and Swift, but they are not yet written down, and surely deserve a thread of their own. Thanks for listening. :-)

In hopefully the last contribution from my own personal bout of swift-evolution desk-clearing, I’d like to see if it is possible to get this pitch moving again. Single-quoted "character" literals are something I’d really like to see in Swift 5.0.

Let's return to the OP’s suggestion that a single-quoted literal 'a' be modelled as a [21-bit] integer literal to the compiler. If we accept this, the implementation is absolutely trivial. The only departure from some of the discussion above is that the default type of a single-quoted literal would be Int and not Character. If we make them an Int literal and accept that it’s not in the spirit of single-quoted literals to represent multi-codepoint graphemes (as we already have the Character type), everything else follows. Adding straightforward ExpressibleByIntegerLiteral conformances to Character and Unicode.Scalar then makes it possible to realise the following semantics:

// valid
let asInt = '🙂'
let scalar: Unicode.Scalar = '🙂'
let character: Character = '🙂'
// invalid
let invalidAscii: UInt8 = '🙂'
j.swift:24:28: error: integer literal '128578' overflows when stored into 'UInt8'
let multipleCodepoint = '👨🏼‍🚀'
j.swift:26:30: error: character not expressible as a single codepoint
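
For context, the conformances I have in mind are roughly the following (a sketch only; the compile-time range checking of the literal value would need compiler support rather than these runtime traps):

extension Unicode.Scalar: ExpressibleByIntegerLiteral {
    public init(integerLiteral value: UInt32) {
        // traps at runtime on surrogates and values above 0x10FFFF;
        // the pitch is that the compiler would reject such literals up front
        self.init(value)!
    }
}

extension Character: ExpressibleByIntegerLiteral {
    public init(integerLiteral value: UInt32) {
        self.init(Unicode.Scalar(value)!)
    }
}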

What am I missing? What are the real-world advantages of creating a Character literal?

9 Likes