SE-0243: Codepoint and Character Literals

Congratulations on your eyeballs; you get the point. It’s really, really easy to accidentally type a ’ instead of a ' on a lot of platforms. Some of them will turn one into the other automatically. And in a lot of programming fonts, these characters look similar. It’s also not the only character susceptible to “unicode trespassing”: there’s - and –, ` and ˋ, etc. So yes, I think this is an issue.

Of course you need careful reading. The great thing about ASCII strings is the characters are all visually different enough that careful reading is all you need to do to verify your constants are correct. Unicode means you have to examine them byte by encoded byte.

It is an issue that one can make typos in constants. You need to test them. I'd say that adding new features to the language so that people feel comfortable not testing constants is an anti-goal.

My point is that it isn't all you need. Would you trust that the value of pi in hexadecimal is correct, and correctly rounded, by visual inspection only? I wouldn't. Not only isn't it "monumentally stupid" to test your constants, it's essential.

You’re mixing up content and encoding. It’s reasonable to have to verify that all the digits in a pi constant are correctly rounded. It’s not reasonable to, for example, have to verify that the digits are all actual ASCII digits.

This is a distinction without a difference in your use case. If you attempted to test the content even once, the encoding would be verified at runtime. There is no circumstance under which you can do the former without the latter. Ultimately, you are advocating for compile-time diagnosis of some typos while conceding that you need to write tests for typos anyway. I fail to see the purpose of this feature.

This is hyperbole. We’re talking about the addition of one property; Unicode.Scalar is obviously useful for much more than ASCII processing, and the addition of a trapping .ascii does not change that in the slightest.

It’s true that a property that traps is somewhat irregular. But I don’t think this is a huge deal. People stated their dislike of how Character.asciiValue returns an optional, and given that we already have Unicode.Scalar.isASCII, I don’t have a problem adding a trapping .ascii/.asASCII/whatever-we-like-to-call-it member.
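
A minimal sketch of what such a trapping member could look like, assuming it lives on Unicode.Scalar (the name and trap message are placeholders, not an agreed API):

extension Unicode.Scalar {
    /// The ASCII encoding of this scalar; traps if the scalar is not ASCII.
    var ascii: UInt8 {
        precondition(isASCII, "scalar is not ASCII")
        return UInt8(value)
    }
}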

While discussing these single-quoted literals, I have become increasingly convinced of several interesting things about them. Here is an opinionated point by point summary:

  1. The ergonomics of low-level string processing (most especially, ASCII processing) is a significant pain point for Swift users, especially when it comes to dealing with individual code points. We need to do something to fix this.

  2. Characters aren’t integers; integers aren’t characters. There must not be any implicit conversion between ‘charactery’ and integral types and values. The act of encoding a string/character needs to leave a mark in the source; the mark must unambiguously indicate the specific encoding used.

    This prohibition against implicit conversions applies to initialization, comparisons (==), pattern matching (~=), and any other way a number and a character could appear in an expression together.

  3. The consequences of allowing integer types to be directly initialized from single- or double-quoted stringy literals are unacceptable. Abominations like 'a' % '2' or 56 >> 'q' or 42.multipliedFullWidth(by: 'b') must not be allowed to compile. This is merely a corollary of the previous point, but it highlights a particularly obnoxious side effect.

  4. Character is on the wrong level of abstraction when it comes to processing ASCII bytes. ("\r\n" is a single Character that represents a sequence of two ASCII characters, and Character considers the ASCII semicolon ; to be substitutable with GREEK QUESTION MARK (U+037E). These are clearly inappropriate features for the byte processing usecase; a short demonstration follows this list.)

    Character.asciiValue is fundamentally broken — it can cause silent data loss, and therefore it needs to be deprecated. (Note: This is not to say the Character abstraction isn’t useful at all. On the contrary: it’s clearly the right choice for String’s element type.)

  5. Unicode.Scalar and its associated string view are much closer to the level of actual encodings, and they are much more appropriate abstractions for low-level text processing. This is particularly true for ASCII, but it also applies to any other context where equivalency under Unicode normalization would be inappropriate/unnecessary.

    Unicode.Scalar is a type that is crying out for its own literal syntax. It has grown an awesome set of APIs in Swift 5 — it’s a shame that its rich properties are locked away behind convoluted syntax. I want to be able to just type '\u{301}'.name into a playground to learn about a particular code point.

  6. There are no strong usecases for adding dedicated literal syntax for the Character type. "👨‍👩‍👧‍👦" as Character and '👨‍👩‍👧‍👦' as Character both seem acceptable spellings, with a preference for the first.

  7. Arranging things so that '\r' evaluates to the Unicode scalar U+000D and '\r'.ascii evaluates to the UInt8 value of 13 would resolve all of the issues above.

    As a best-effort quality-of-implementation improvement, the compiler should produce a diagnostic for obviously wrong cases like 'é'.ascii. However, if the diagnostic proves too difficult or objectionable, then it’s acceptable to merely rely on runtime traps provided by the stdlib. People processing ASCII text can be expected to know what characters are in ASCII.

    (AFAICT, the diagnostic would be straightforward to implement and relatively uncontroversial.)
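
Here is the short demonstration promised under point 4, using only today’s APIs (the Optional(10) result for CR-LF is the documented behavior of Character.asciiValue):

let crlf: Character = "\r\n"
print(crlf.unicodeScalars.count) // 2: one Character made of two ASCII scalars
print(crlf.asciiValue as Any)    // Optional(10): the carriage return is silently dropped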

(Note: I’m using strongly worded statements above, but these are merely my personal convictions. I might change my mind about some of them in the future, if I’m given good reason to do so. Indeed, I previously believed that 'a' should default to Character.)

Well, you’re probably not computing pi to test it; you look it up in hex format and check the digits against the reference source.

0x1.921fb54442d18p1
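
(In Swift, hexadecimal float literals make that check directly expressible; the following line is just a sanity check, not anything proposal-specific:)

0x1.921fb54442d18p1 == Double.pi // true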

You do that every time you commit the constants file.

The same goes for ASCII tags, like OpenType features.

I'm doing neither. When I test pi, I actually use the constant and verify common trigonometric identities.

Not only is "looking it up" using your link insufficient, in the case of Float.pi it would be wrong because that value in Swift is not intended to be rounded to nearest but rounded down (for reasons explained in the documentation).

Point is: test your constants.

off topic:

  1> _sin(Double.pi) == 0 
$R0: Bool = false

doubles are hard

Even further off-topic: sin(.pi) is almost exactly equal to (π - .pi), for any floating-point type.

In other words, you can get a highly accurate estimate of the error in an approximation of pi, just by taking its sine.
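
A quick way to see this in a playground, assuming Foundation’s sin is available (printed values are approximate):

import Foundation

let errorEstimate = sin(Double.pi)             // ≈ 1.2246467991473532e-16, roughly (true π) - .pi
print(abs(errorEstimate) < Double.pi.ulp / 2)  // true: the constant is correctly rounded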

Back on topic:

I'd like to add a few more points onto the plate:

Unicode is deeply integrated into Swift; the same cannot be said of ASCII, which I think is also why there’s quite a bit of friction when the proposal tries to add compile-time ASCII validation.
Swift could bless ASCII in the future, but it’s unclear whether Swift will take that direction, and perhaps that needs to be decided/declared first.

ASCII validation is a valid concern, but there are also other Unicode subsets (or even ASCII subsets) that see common usage, including safe HTML/URL/JSON characters, alphanumeric characters, etc. Many of these are strict subsets of ASCII. It’s even arguable that the full range of ASCII itself is rarely used, only some subset of it, and that ASCII is just the unloved lowest common denominator. Which brings me to the point: having ASCII checking at compile time may stem from a valid concern, but an ASCII character type isn’t a very scalable solution, especially when we usually still need to further validate the string.
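
For instance, runtime validation against an arbitrary allowed subset (here ASCII alphanumerics; the names are just for illustration) is something compile-time ASCII checking alone wouldn’t give you:

let allowedScalars = Set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789".unicodeScalars)

func isValidTag(_ tag: String) -> Bool {
    return tag.unicodeScalars.allSatisfy(allowedScalars.contains)
}

isValidTag("IHDR")  // true
isValidTag("IHDR!") // false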

One more thing I can think of: we could do the checking (and warning) during the constant-folding phase if it happens across an init boundary, which I don’t know whether it does. I’m no compiler expert, though, so I don’t know the viability of this.

I haven't found any argument throughout this review thread that has shaken my core belief that single quoted literals should default to Character, which is clearly the most natural interpretation and gives you a convenient way to work with what should be the most common character-like type in Swift code. If this isn't considered useful enough to create a new literal form, then I must conclude that the new literal form isn't useful enough at all. And if you think that Unicode.Scalar is much more useful than Character for Swift users in general, then perhaps that suggests that Character APIs need improvement.

In the pitch threads I was also supportive of the secondary goal of using this literal form to help solve some ergonomic issues with low-level string processing. This thread has raised/re-raised various issues and concerns that have made the best way to support this kind of processing unclear to me. Perhaps allowing single-quoted literals to express integers isn't the way forward, and this use case would be better supported through some other API.

Do you have an example for a piece of code that is difficult to express in today’s Swift, but it would be meaningfully improved by the introduction of Character literals? (I’m not trying to be difficult — I’m genuinely interested to see some actual usecases. We have seen plenty of arguments for Unicode scalars, but essentially none for Character.)

I guess I just don’t see an obvious reason why the single quote character must be dedicated to literals of String’s element type. We seem to be doing fine without such a notation.

We can use single quotes for whatever new syntax we like: grapheme clusters, Unicode scalars, localized text, regular expressions, metaprogramming, inline assembly, interned symbols, complex number literals, etc. etc. It seems to me that Unicode scalars are on the more pedestrian and least questionable side of this spectrum.

There is even a nice mnemonic in the difference between " and ' — double quotes are thicker and imply heavier processing, while single quotes are thinner and closer to raw data.

This isn't a problem with lack of APIs. The point here is that what a Character models semantically is appropriate for working with text in many ways, but it is inappropriate for low-level text processing that operates on values by which text is encoded. Take, for example, @lorentey's point that the Greek question mark is equivalent to semicolon after Unicode normalization (which is to say, in Swift, they are substitutable characters). You would not want that if you were, say, writing a parser for JavaScript.
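
To make that concrete, here is how the substitutability shows up in today’s Swift (Character equality uses canonical equivalence; Unicode.Scalar equality does not):

let semicolon: Character = ";"               // U+003B SEMICOLON
let greekQuestionMark: Character = "\u{37E}" // U+037E GREEK QUESTION MARK
semicolon == greekQuestionMark               // true: the two are canonically equivalent
semicolon.unicodeScalars.first == greekQuestionMark.unicodeScalars.first // false: different code points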

(The weirdness about the greek question mark was raised by @SDGGiesbrecht on the pitch thread, along with a number of other interesting points. His posts are very much worth a careful read.)

To my eye, we've had plenty of arguments for ASCII and not much for Unicode scalars more generally. But I don't really want to extend this thread by discussing various examples and how important they may or may not be (it's already very dense for a proposal review, with a lot of pitch-phase discussions between a small number of people). There are a lot of very reasonable bottom-up arguments in this thread about why some sort of Unicode.Scalar/ASCII literal would make sense, but I'm arguing top-down here: if you tell me there's a single-quoted character-like literal in Swift I'm going to strongly presume it defaults to Character. If we can get by without a literal form designed primarily for Character then my assumption is that we can also get by without a literal form designed for the (hopefully) much more niche Unicode.Scalar type. We can live with the status quo of just using double quotes for both, and neither being a default type for a literal.

Agreed, but I didn't say “for low-level text processing” in that sentence (I had in mind things like @lorentey's example of typing '\u{301}'.name into a playground). My second paragraph admits I don't know what the best solution is for such text processing, but I think it's very unclear (at best) that pivoting to optimising single-quoted literals for this purpose is the right approach. Compile-time evaluation and some new APIs might give you a solution that is concise, efficient and has good compile-time errors, for either double-quoted literals or single-quoted literals that default to Character.

Here are some of my observations and perhaps a new avenue in the solution space.

Single quoted literals should clearly default to Character, which is the constituent element of a string at the level of human interpretation, and is the most common character-like type. This is a particularly beginner-visible part of Swift.

It would be great to have improved ergonomics for low-level text processing, especially in ASCII.

Conversion from Character literals to numbers should be explicit and mention the encoding used. If this were not the case, it would deviate substantially from the rest of the Swift language.

It seems that a branch of this discussion is exploring using single quoted literals to represent a Unicode.Scalar so that it becomes easier to have a trapping ascii property that lets us write let a: UInt8 = 'a'.ascii, bypassing weird Unicode normalisation edge cases such as '\r\n' becoming a single character. Trapping doesn’t seem ideal for this use case because it would be so easily triggered, and by making single quoted literals default to Unicode.Scalar this would exist in a prominent part of Swift.

Character already has an asciiValue property (let a = c.asciiValue // UInt8?), which is the correct form to use on an arbitrary character parameter. A trapping variant would only be useful when bound to a literal, i.e.

'a'.ascii //useful
arbitrary.ascii //not useful

It seems to me that Swift is missing a language feature that would let us have the best overall outcome: literal-bound functions. These would allow us to write the following, which makes use of a hypothetical literal-bound ascii property on character-literal.

let a  = 'A' // inferred to be of type Character
let b  = 'B'.ascii // b == 66
let c  = getArbitraryCharacter()
let c1 = c.ascii // error: type Character has no such member
guard let c2 = c.asciiValue else { ... } // this is the way to do it

Literal bound functions would be compiler evaluated and would operate on the literal before unicode normalisation so all of the drawbacks of using Character vs. Unicode.Scalar could be caught. A literal bound function could show up in code completion on a literal alongside those functions belonging to the default type.

You could also have a similar literal bound ascii property on string-literal which would be compiler verified to be ascii and not be vulnerable to unicode normalisation.

let tags = "IHDR".ascii //[73, 72, 68, 82]
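
For comparison, one way to get the same bytes with today’s APIs; note that nothing here verifies at compile time that the literal is ASCII:

let tags = Array("IHDR".utf8) // [73, 72, 68, 82]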

For those complaining about optionals and performance/trapping: have you heard of the unsafelyUnwrapped property? It doesn't check or trap in optimised builds:

The unsafelyUnwrapped property provides the same value as the forced unwrap operator (postfix ! ). However, in optimized builds ( -O ), no check is performed to ensure that the current instance actually has a value. Accessing this property in the case of a nil value is a serious programming error and could lead to undefined behavior or a runtime error.

In debug builds ( -Onone ), the unsafelyUnwrapped property has the same behavior as using the postfix ! operator and triggers a runtime error if the instance is nil .

The unsafelyUnwrapped property is recommended over calling the unsafeBitCast(_:) function because the property is more restrictive and because accessing the property still performs checking in debug builds.
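
In code, and tying it back to asciiValue, that looks like this (a trivial illustration, not part of the proposal):

let a: Character = "a"
let v = a.asciiValue.unsafelyUnwrapped // 97; unchecked in -O builds, traps in -Onone if nil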

Sounds fine for use with String literals, unless your literal changes between Debug/Release modes.

Isn’t this one of those things they tell us never to use? Aiming a gun at your foot shouldn’t be the price you have to pay for an ASCII literal.

Counterpoint: beginner exercises tend to involve processing fixed-length, structured, machine-readable strings. Think of any old CS homework problem—does it involve UTF8 continuations or grapheme breaking? If you want to teach a beginner how to get the query out of a URL string, you don’t want to throw Unicode at them, you just want them to treat/convert the String as an array of ASCII characters. (ASCII characters are actually the only characters valid in a URL.) There’s been a lot of lament on the forum about how Swift Strings and its unicode-correct indexing model are getting in the way of Swift education and adoption, and I’ve always considered better ASCII support to be the most appropriate solution to this problem.

Unicode is an advanced topic for advanced users, and Characters are Unicode, a lot more than Unicode.Scalar, ironically. Maybe placing this API on the higher shelf isn’t the worst idea in the world.
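
As an illustration of the kind of beginner exercise in question, here is one way to pull the query out of a URL string by working directly on its UTF-8 bytes (a sketch; a real program would reach for URLComponents):

let url = "https://example.com/search?q=swift"
let bytes = Array(url.utf8) // all ASCII for a valid URL
if let i = bytes.firstIndex(of: UInt8(ascii: "?")) {
    let query = String(decoding: bytes[(i + 1)...], as: UTF8.self)
    print(query) // q=swift
}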

I agree, a trapping property is not appropriate on either Character or Unicode.Scalar. There would be no language precedent: .first and .last are important, commonly used properties, and we do not make them trap.

I don’t know how I feel about abusing function call syntax (yes, 'a'.ascii is a function call, just written without parentheses) for literal coercion. Why not the 'A' as UInt8 syntax? I think as is a natural fit for this role, given its current usage in the language, and the “literal as Type” grammar is well-established in Swift, so we wouldn’t have to teach users a new rule that “literal bound functions” can only be called on literal values.

This is not the case. A trapping Unicode.Scalar.ascii is perfectly useful for arbitrary variables, not just literals. We already have the isASCII property.

func asciiDigitValue(_ digit: Unicode.Scalar) -> Int? {
  guard digit.isASCII else { return nil }
  let v = digit.ascii
  switch v {
    case '0'.ascii ... '9'.ascii:
      return Int(v - '0'.ascii)
    default:
      return nil
  }
}
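
For contrast, roughly the same function written against today’s APIs, without the proposed literal syntax or trapping property (the name is just to distinguish it here):

func asciiDigitValueToday(_ digit: Unicode.Scalar) -> Int? {
    let zero: Unicode.Scalar = "0" // U+0030
    guard digit.isASCII, (zero.value ... zero.value + 9).contains(digit.value) else {
        return nil
    }
    return Int(digit.value - zero.value)
}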

This again? That's my point (2) above.

  1. Characters aren’t integers; integers aren’t characters. There must not be any implicit conversion between ‘charactery’ and integral types and values. The act of encoding a string/character needs to leave a mark in the source; the mark must unambiguously indicate the specific encoding used.

'A' as UInt8 is encoding the character 'A' without specifying the encoding. This is subpar API design. The word ASCII in 'A'.ascii serves a critical purpose.

One thing that hasn't come up yet: If we allow 'A' as UInt8, then obviously we'll want to also allow 65 as Character.
