SE-0243: Codepoint and Character Literals

It is difficult to see why single-quoted literals should be presumed to default to Character, as no popular language uses single quotation marks for an extended grapheme cluster.

Here's how popular programming languages make use of single quotation marks:

String

  • Delphi/Object Pascal
  • JavaScript
  • MATLAB (char array)
  • Python
  • R
  • SQL

'Raw' string

  • Groovy
  • Perl
  • PHP
  • Ruby

Code unit/code point/Unicode scalar

  • C: int
  • C++: char (if literal is prefixed, it can be char8_t, char16_t, char32_t, or wchar_t)
  • C#: char (16-bit)
  • Java: char (16-bit)
  • Kotlin: Char (16-bit)
  • Go: rune (32-bit)
  • Rust: char (32-bit)

In Go, a Unicode code point is known as a rune (a term now also adopted in .NET). In Rust, a Unicode scalar value is known as a character; in Swift, it is known as a Unicode scalar. (A Unicode scalar value is any Unicode code point except high- and low-surrogate code points.)

As can be seen, Go and Rust use single quotation marks for what in Swift is known as a Unicode scalar literal.

No language uses this notation for what in Swift is known as an extended grapheme cluster literal (i.e., character literal).
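To make the distinction concrete, here is a minimal sketch in current Swift, where both kinds of literal are written with double quotation marks and the type annotation selects the meaning. The variable names are only illustrative.

```swift
// A single Unicode scalar: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
let scalar: Unicode.Scalar = "\u{E9}"

// A single Character (extended grapheme cluster) made of two scalars:
// U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
let character: Character = "\u{65}\u{301}"

// The scalar literal maps to exactly one 21-bit value.
print(String(scalar.value, radix: 16))

// The character literal holds more than one scalar.
print(character.unicodeScalars.count)

// Character comparison uses canonical equivalence, so the precomposed
// and decomposed forms compare equal.
print(Character(scalar) == character)
```

A Go rune or Rust char corresponds to the first declaration here, not the second.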

The version of Unicode supported, and therefore grapheme breaking, is a runtime concept. In other words, it is the version of the standard library linked at run time that determines whether a string's contents are one extended grapheme cluster (i.e., Character) or not.

Adding syntax to distinguish between a single character and a string that may contain zero or more such characters will enable only best-effort diagnostics at compile time. In other words, a dedicated extended grapheme cluster literal syntax can give users no guarantees about grapheme breaking as it relates to the contents of the literal, because such knowledge cannot be statically "baked into" the code.
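As a rough illustration of the point about run time, the sketch below counts the same literal in two ways. The helper asSingleCharacter is hypothetical (it is not part of the standard library), and the emoji sequence is just one example of text whose segmentation depends on the grapheme-breaking rules in use.

```swift
// MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, GIRL.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}"

// The number of Unicode scalars is fixed by the literal itself.
let scalarCount = family.unicodeScalars.count

// The number of Characters is computed by the standard library that is
// linked at run time, using the grapheme-breaking rules it implements.
let characterCount = family.count

// Hypothetical helper: the single-character check has to happen at run time.
func asSingleCharacter(_ text: String) -> Character? {
    return text.count == 1 ? text.first : nil
}

print(scalarCount, characterCount, asSingleCharacter(family) != nil)
```

The scalar count is the same everywhere, but whether the helper returns a Character for this input is decided by the runtime, which is why a compile-time check on a literal can only be best-effort.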
