SE-0243: Codepoint and Character Literals

Even further off-topic: sin(.pi) is almost exactly equal to (π - .pi), for any floating-point type.

In other words, you can get a highly accurate estimate of the error in an approximation of pi, just by taking its sine.
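For instance, a quick playground check (a minimal sketch; on Apple platforms, Foundation re-exports the C math functions such as sin):

import Foundation

// Double.pi is the closest Double to π, and sin(π - x) ≈ x for tiny x,
// so sin(.pi) returns (almost exactly) the representation error of Double.pi itself.
let approximationError = sin(Double.pi)
print(approximationError) // ~1.2246467991473532e-16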

4 Likes

Back on topic:

I'd like to add a few more points onto the plate:

Unicode is deeply integrated into Swift; the same cannot be said of ASCII, which I think is also why there's so much friction when the proposal tries to add compile-time ASCII validation.
Swift can bless ASCII in the future, but it's unclear whether Swift will take that direction, and perhaps that needs to be decided/declared first.

ASCII validation is a valid concern, but there are also other Unicode subsets (or even ASCII subsets) that see common usage, including safe HTML/URL/JSON characters, alphanumeric characters, etc. Many of these are strict subsets of ASCII. It's even arguable that the full range of ASCII itself is rarely used, only some subset of it, and that ASCII is just the unloved lowest common denominator. Which brings me to my point: compile-time ASCII checking may arise from a valid concern, but an ASCII character type isn't a very scalable solution, especially when we usually still need to validate the string further.

One more thing I can think of: we could do the checking (and warning) during the constant-folding phase, if constant folding happens across the init boundary, which I don't know whether it does. I'm no compiler expert, though, so I don't know how viable this is.

1 Like

I haven't found any argument throughout this review thread that has shaken my core belief that single quoted literals should default to Character, which is clearly the most natural interpretation and gives you a convenient way to work with what should be the most common character-like type in Swift code. If this isn't considered useful enough to create a new literal form, then I must conclude that the new literal form isn't useful enough at all. And if you think that Unicode.Scalar is much more useful than Character for Swift users in general, then perhaps that suggests that Character APIs need improvement.

In the pitch threads I was also supportive of the secondary goal of using this literal form to help solve some ergonomic issues with low-level string processing. This thread has raised/re-raised various issues and concerns that have made the best way to support this kind of processing unclear to me. Perhaps allowing single-quoted literals to express integers isn't the way forward, and this use case would be better supported through some other API.

2 Likes

Do you have an example of a piece of code that is difficult to express in today’s Swift, but that would be meaningfully improved by the introduction of Character literals? (I’m not trying to be difficult; I’m genuinely interested to see some actual use cases. We have seen plenty of arguments for Unicode scalars, but essentially none for Character.)

I guess I just don’t see an obvious reason why the single quote character must be dedicated to literals of String’s element type. We seem to be doing fine without such a notation.

We can use single quotes for whatever new syntax we like: grapheme clusters, Unicode scalars, localized text, regular expressions, metaprogramming, inline assembly, interned symbols, complex number literals, etc. etc. It seems to me that Unicode scalars are on the more pedestrian and least questionable side of this spectrum.

There is even a nice mnemonic in the difference between " and ': double quotes are thicker and imply heavier processing, while single quotes are thinner and closer to raw data.

1 Like

This isn't a problem with lack of APIs. The point here is that what a Character models semantically is appropriate for working with text in many ways, but it is inappropriate for low-level text processing that operates on values by which text is encoded. Take, for example, @lorentey's point that the Greek question mark is equivalent to semicolon after Unicode normalization (which is to say, in Swift, they are substitutable characters). You would not want that if you were, say, writing a parser for JavaScript.
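A minimal illustration of that point in today's Swift (U+037E GREEK QUESTION MARK canonically decomposes to U+003B SEMICOLON):

let greekQuestionMark: Character = "\u{37E}"
let semicolon: Character = ";"
print(greekQuestionMark == semicolon) // true: Characters compare under canonical equivalence

// As Unicode scalars, the two remain distinct values.
print(("\u{37E}" as Unicode.Scalar) == (";" as Unicode.Scalar)) // false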

(The weirdness about the Greek question mark was raised by @SDGGiesbrecht on the pitch thread, along with a number of other interesting points. His posts are very much worth a careful read.)

1 Like

To my eye, we've had plenty of arguments for ASCII and not much for Unicode scalars more generally. But I don't really want to extend this thread by discussing various examples and how important they may or may not be (it's already very dense for a proposal review, with a lot of pitch-phase discussions between a small number of people). There are a lot of very reasonable bottom-up arguments in this thread about why some sort of Unicode.Scalar/ASCII literal would make sense, but I'm arguing top-down here: if you tell me there's a single-quoted character-like literal in Swift I'm going to strongly presume it defaults to Character. If we can get by without a literal form designed primarily for Character then my assumption is that we can also get by without a literal form designed for the (hopefully) much more niche Unicode.Scalar type. We can live with the status quo of just using double quotes for both, and neither being a default type for a literal.

Agreed, but I didn't say “for low-level text processing” in that sentence (I had in mind things like @lorentey's example of typing '\u{301}'.name into a playground). My second paragraph admits I don't know what the best solution is for such text processing, but I think it's very unclear (at best) that pivoting to optimising single-quoted literals for this purpose is the right approach. Compile-time evaluation and some new APIs might give you a solution that is concise, efficient and has good compile-time errors, for either double-quoted literals or single-quoted literals that default to Character.
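(For the record, something close to that playground exploration is already possible today with double quotes, via the Unicode scalar properties API:)

let accent: Unicode.Scalar = "\u{301}"
print(accent.properties.name ?? "?") // COMBINING ACUTE ACCENT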

Here are some of my observations and perhaps a new avenue in the solution space.

Single quoted literals should clearly default to Character, which is the constituent element of a string at the level of human interpretation, and is the most common character-like type. This is a particularly beginner visible part of Swift.

It would be great to have improved ergonomics for working with low-level text processing, especially in ASCII.

Conversion from character literals to numbers should be explicit and mention the encoding used. If this were not the case, it would deviate substantially from the rest of the Swift language.

It seems that a branch of this discussion is exploring using single-quoted literals to represent a Unicode.Scalar, so that it becomes easier to have a trapping ascii property that lets us write let a: UInt8 = 'a'.ascii, bypassing weird Unicode normalisation edge cases such as '\r\n' becoming a single character. Using trapping for this use case doesn't seem ideal because it would be so easily triggered, and by making single-quoted literals default to Unicode.Scalar, this trap would exist in a prominent part of Swift.

Character already has an asciiValue property (let a = c.asciiValue // Optional<UInt8>), which is the correct form to use on an arbitrary character parameter. A trapping variant would only be useful when bound to a literal, i.e.

'a'.ascii       // useful
arbitrary.ascii // not useful

It seems to me that Swift is missing a language feature that would let us have the best overall outcome: literal-bound functions. These would allow us to write the following, which makes use of a hypothetical literal-bound ascii property on character literals.

let a  = 'A'        // inferred to be of type Character
let b  = 'B'.ascii  // b == 66
let c  = getArbitraryCharacter()
let c1 = c.ascii    // error: type Character has no such member
guard let c2 = c.asciiValue else { ... } // this is the way to do it

Literal-bound functions would be compiler-evaluated and would operate on the literal before Unicode normalisation, so all of the drawbacks of using Character vs. Unicode.Scalar could be caught. A literal-bound function could show up in code completion on a literal alongside the members of the literal's default type.

You could also have a similar literal-bound ascii property on string literals, which would be compiler-verified to be ASCII and not vulnerable to Unicode normalisation.

let tags = "IHDR".ascii //[73, 72, 68, 82]

For those complaining about optionals and performance/trapping: have you heard of the unsafelyUnwrapped property? It doesn't check or trap in optimised builds:

The unsafelyUnwrapped property provides the same value as the forced unwrap operator (postfix ! ). However, in optimized builds ( -O ), no check is performed to ensure that the current instance actually has a value. Accessing this property in the case of a nil value is a serious programming error and could lead to undefined behavior or a runtime error.

In debug builds ( -Onone ), the unsafelyUnwrapped property has the same behavior as using the postfix ! operator and triggers a runtime error if the instance is nil .

The unsafelyUnwrapped property is recommended over calling the unsafeBitCast(_:) function because the property is more restrictive and because accessing the property still performs checking in debug builds.

Sounds fine for use with String literals, unless your literal changes between Debug/Release modes.

Isn’t this one of those things they tell us never to use? Aiming a gun at your foot shouldn’t be the price you have to pay for an ASCII literal.

Counterpoint: beginner exercises tend to involve processing fixed-length, structured, machine-readable strings. Think of any old CS homework problem: does it involve UTF-8 continuation bytes or grapheme breaking? If you want to teach a beginner how to get the query out of a URL string, you don’t want to throw Unicode at them; you just want them to treat/convert the String as an array of ASCII characters. (ASCII characters are actually the only characters valid in a URL.) There’s been a lot of lament on the forum about how Swift Strings and their Unicode-correct indexing model get in the way of Swift education and adoption, and I’ve always considered better ASCII support to be the most appropriate solution to this problem.
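To make that concrete, here's a sketch of the kind of exercise I mean, written against the UTF-8 view and assuming plain ASCII input (the URL is just an example):

let url = "https://example.com/search?q=swift&page=1"
if let q = url.utf8.firstIndex(of: UInt8(ascii: "?")) {
    // Everything after the '?' byte is the query.
    print(url[url.index(after: q)...]) // q=swift&page=1
}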

Unicode is an advanced topic for advanced users, and Characters are Unicode, a lot more than Unicode.Scalar, ironically. Maybe placing this API on the higher shelf isn’t the worst idea in the world.

I agree, a trapping property is not appropriate on either Character or Unicode.Scalar. There is no language precedent: .first and .last are important, commonly used properties, and we do not make them trap.

I don’t know how I feel about abusing function-call syntax (yes, 'a'.ascii is a function call, just written without parentheses) for literal coercion. Why not the 'A' as UInt8 syntax? I think as is a natural fit for this role, given its current usage in the language, and the literal-as-type grammar is well established in Swift, so we wouldn’t have to teach users a new rule that “literal-bound functions” can only be called on literal values.

This is not the case. A trapping Unicode.Scalar.ascii is perfectly useful for arbitrary variables, not just literals. We already have the isASCII property.

func asciiDigitValue(_ digit: Unicode.Scalar) -> Int? {
  guard digit.isASCII else { return nil }
  let v = digit.ascii
  switch v {
  case '0'.ascii ... '9'.ascii:
    return Int(v - '0'.ascii)
  default:
    return nil
  }
}
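(For reference, here is roughly the same function written with only today's APIs, i.e. UInt8(ascii:) and Unicode.Scalar.value instead of the proposed .ascii property:)

func asciiDigitValueToday(_ digit: Unicode.Scalar) -> Int? {
  guard digit.isASCII else { return nil }
  let v = UInt8(digit.value)                      // safe: isASCII implies value < 128
  let zero = UInt8(ascii: "0"), nine = UInt8(ascii: "9")
  guard (zero...nine).contains(v) else { return nil }
  return Int(v - zero)
}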

This again? That's my point (2) above.

  1. Characters aren’t integers; integers aren’t characters. There must not be any implicit conversion between ‘charactery’ and integral types and values. The act of encoding a string/character needs to leave a mark in the source; the mark must unambiguously indicate the specific encoding used.

'A' as UInt8 is encoding the character 'A' without specifying the encoding. This is subpar API design. The word ASCII in 'A'.ascii serves a critical purpose.

One thing that hasn't come up yet: If we allow 'A' as UInt8, then obviously we'll want to also allow 65 as Character.

1 Like

I don't know - maybe? Depends who "they" are.

I agree that best practice is not to mix feet with guns, but I'm not sure why a programming language should care what people do with their guns/feet :wink:

isASCII is fundamentally different. It just tells you whether the object satisfies some property or not. It doesn’t assume that the property holds and crash the program if it doesn’t.

An analogy is Array.first. We have Array.isEmpty, which tells you if Array.first exists or not, but Array.first doesn’t and shouldn’t crash the program if you invoke it on an empty array.
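In code form:

let empty: [Int] = []
print(empty.isEmpty)      // true: tells you whether .first exists
print(empty.first as Any) // nil, not a crash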

Swift is Unicode, and Unicode is ASCII. We do not need to specify the encoding, because the encoding has already been specified by the language.

Maybe not on Character, but 65 as Unicode.Scalar certainly makes sense. This is the whole point of explicitly-coercible-but-not-expressible as expressions: they allow us to express literal coercions without opening up the whole Pandora’s box that ExpressibleBy brings with it. Keep in mind we still don’t have a way of going the other way, i.e. converting an integer literal to a codepoint with compile-time validation. (This is a bit of a rarefied point, though; we already have "\u{XX}" syntax, which works just as well, except only in hex.)
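For reference, the spellings available today (only the hex escape is validated at compile time):

let a: Unicode.Scalar = "\u{41}"           // compile-time checked, but hex only
let b = Unicode.Scalar(65 as UInt8)        // always succeeds for UInt8
let c = Unicode.Scalar(0x1F600 as UInt32)  // failable: yields Optional<Unicode.Scalar>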

Code points are an extension of ASCII, but there are ASCII-incompatible Unicode encoding forms.

More importantly, the fact that Swift standardizes on Unicode strings does not mean that we can ignore internationalization best practices in our API design. "Unicode" is not a meaningful encoding when it comes to byte sequences. Even if we restrict ourself to UTF-8, UTF-16-LE, UTF-16-BE, UTF-32-LE, UTF-32-BE, we do need to keep track of which one we're using at any given point.

We could also say that "A" does not specify its encoding. We know it's unicode because Swift defines it that way, and Swift could define 'A' to mean ASCII. [To be clear, an ASCII literal is not exactly what the current proposal is trying to achieve, just what I wish it to be.]

ASCII is meaningful for byte sequences however. So we just need to define the literal as an ASCII literal.

Only around half the strings are meant for internationalization, though. The rest includes identifiers, data formatted in specific ways (numbers and dates, for instance) that must be parsed, signature bytes in binary structures, etc. All of this is predominantly ASCII.

Having to write "ascii" around every literal is a bit like using naturalLogarithm(of: 5) for math. It might be helpful if you aren't familiar with the context, but you'll quickly tire of writing equations with such verbose tools. It would be good if people did not feel C has better ergonomics for writing efficient parsers, because this is a huge source of security bugs in C.

4 Likes

String is a high-level abstraction. The string value "Café" is a value of an abstract data type representing a Unicode string. It's true, String specifies nothing about how it internally encodes its value -- indeed, the internal encoding depends on the stdlib version used and the provenance of the value. However, String's internal encoding is irrelevant during its use. String's API is carefully defined in terms of abstract concepts like grapheme clusters, not any particular encoding.

In order to convert a String value into a series of bytes like [UInt8], you need to encode it. This is never done automatically. The APIs we provide to do this explicitly specify the encoding being used.

One simple way of encoding Strings is to use the encoded views provided by the stdlib. "A".utf8 gives you a collection containing the single UInt8 value 65. This is perfectly fine -- the encoding used is explicit in the name of the view, and the potential for confusion is minimal.
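For example (assuming the é in the literal is precomposed, i.e. a single scalar, U+00E9):

let s = "Café"
print(Array(s.utf8))                     // [67, 97, 102, 195, 169]
print(Array(s.utf16))                    // [67, 97, 102, 233]
print(s.unicodeScalars.map { $0.value }) // [67, 97, 102, 233]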

We deliberately do not provide "A" as [UInt8], however useful it might seem like for the common UTF-8 case. It would be a bad idea to introduce such a language feature.

I fail to see how Unicode scalars are in any way different. Unicode.Scalar is an abstraction too: it has an assigned number (its code point), but it isn't the same as that number. For any Unicode scalar, there are many ways to encode it into a series of bytes, and the encoding usually has no obvious relationship to the numerical value of its code point. Implicitly defaulting to any particular encoding (even for a relatively(!) common subset like ASCII) would be a mistake.

Edit:

That would indeed work, as long as the literal syntax would only accept ASCII characters. In this case, the literal syntax itself would be a shorthand for the encoding specification.

let a: Character = 'e'
let b: Unicode.Scalar = 'e'
let c: UInt8 = 'e'

let d: Character = 'é' // error: 'é' is not an ASCII character
let e: Unicode.Scalar = 'é' // error: 'é' is not an ASCII character
let f: UInt8 = 'é' // error: 'é' is not an ASCII character

I don't think that would fly, though.

2 Likes

I think the ' ' notation is not pulling its weight. In any case I would never use it outside ASCII, because if I write 'é' in source code I'm not sure whether it has been encoded as one scalar (U+00E9) or two (U+0065 followed by U+0301), so I'd fall back on an explicit code '\u{E9}', and at that point the notation is hardly better than the constructor Unicode.Scalar(0xE9)!.
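To spell out the ambiguity (both spellings display identically in an editor):

let precomposed = "\u{E9}"         // é as one scalar
let decomposed  = "\u{65}\u{301}"  // e followed by a combining acute accent
print(precomposed == decomposed)        // true: the Strings are canonically equivalent
print(precomposed.unicodeScalars.count) // 1
print(decomposed.unicodeScalars.count)  // 2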

The only use case I had for this notation was to more easily manipulate Character and Int values within the ASCII range. But if it can't be both a Character and an Int, as the discussion has shown, I'm not sure it's still worth considering.

2 Likes

And this is precisely why the Unicode scalar literal idea is so powerful in terms of making it easier to work with Unicode scalars, because the compiler would enforce the literal being one and only one scalar—so that you can be sure!

4 Likes

Oh yes, right, but still, I don't think I would use it outside ASCII if it's a Unicode scalar. I'm French, and I might use 'é' in source code, but in that case I'd need it as a Character because I'd be comparing it to characters in French strings. The notation is supposed to be short and simple; I don't see the point if I have to do the conversion Character('é').

It would enforce that it is only one scalar, which would catch many problems. But in general it would not enforce that it is the intended scalar, since not all canonical equivalence involves expansion.

Even with 'é', someone still has to go back and retype it now that the program refuses to compile (or use some tool to normalize the file to NFC, which might break some other literal the same way). It really is best to use '\u{E9}', as @plorenzi said. P.S.: Vive la francophonie! (Long live the French-speaking world!)

Of course, when working with text as text, equivalence is desired and Character is the tool for the job, regardless of which quotation marks end up being necessary to express it, 'é' or "é". Then it does not matter what happens to the file, because its source code is under the same equivalence rules as the text value the literal represents.

In case it has not been said yet, 'é' as Character should not be a thing if it has to be converted through Unicode.Scalar to get there. It would be very surprising for it to sometimes play out like 'e' as Character, which would always succeed, and other times like 'ȩ̨̛̇̉̆̃̂́̀̈̌̄̊' as Character, which would always fail even though it could be represented by a single Character. It would mirror the large UInt literals that used to overflow Int on their way to becoming actual UInt instances.

2 Likes