SE-0243: Codepoint and Character Literals

The reason is clarity at the point of use. If we introduce single-quoted literals as this proposal suggests, then the difference between them and double-quoted literals should be meaningful:

'a' can be a Character or Unicode.Scalar, but not a String nor StaticString

"a" can be a String or StaticString, but not a Character nor Unicode.Scalar

That way the meaning of character and string literals in source code is much clearer.

Seriously, please stop with the slippery-sloping. We are under no obligation to consider any such thing, and even if we did consider it we have no obligation to adopt it.

Off topic, but @johnno1962, why does let s: String = 'a' work again? String’s ExpressibleByExtendedGraphemeClusterLiteral conformance takes a String argument:

String.init(extendedGraphemeClusterLiteral: "a" as Character)
error: repl.swift:3:49: error: cannot convert value of type 'Character' to expected argument type 'String'

Way off topic: it works because 'a' is a character literal expression that can be expressed as a single Unicode scalar, so the compiler searches for ExpressibleByUnicodeScalarLiteral, which String must conform to by virtue of the protocol inheritance hierarchy from ExpressibleByStringLiteral (as opposed to Character, which is a type). I don’t know exactly why you're seeing that specific error.
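A small sketch in current Swift (double-quoted literals, no proposal syntax) may help with the error quoted above. It assumes only standard library initializers; the helper name requiresScalarLiteral is made up for illustration. For String, the literal initializers take a String argument, so passing a plain literal works while passing a Character does not.

```swift
// String refines the literal protocols all the way down, so each
// initializer exists, and each one takes a String as its argument.
let viaScalar = String(unicodeScalarLiteral: "a")
let viaCluster = String(extendedGraphemeClusterLiteral: "a")

// Passing a Character is what triggers the error quoted above:
// String(extendedGraphemeClusterLiteral: "a" as Character)
// error: cannot convert value of type 'Character' to expected argument type 'String'

// Hypothetical helper: it only compiles for types that conform to
// ExpressibleByUnicodeScalarLiteral, which String does.
func requiresScalarLiteral<T: ExpressibleByUnicodeScalarLiteral>(_: T.Type) {}
requiresScalarLiteral(String.self)
requiresScalarLiteral(Character.self)
requiresScalarLiteral(Unicode.Scalar.self)
```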

ExpressibleByUnicodeScalarLiteral does allow you to run into the same issues today. But there are three significant differences.

  1. To encounter most of the issues today, you have to compose separate concepts:

    This is dangerous code:

    let x = "é" // Creating a text literal.
    let n = x.unicodeScalars.first!.value // Getting the numeric encoding.
    

    Thanks to Unicode equivalence, after copy‐and‐paste, etc., this source might compile so that n is 0x65 or 0xE9.

    But each line on its own would have been perfectly reasonable and safe in different contexts:

    let x = "é" // Creating a text literal (like line 1 above).
    if let line = readLine(), !line.contains(x.first!) { // Using it safely as text.
        print("Your string has no “é” in it.")
    }
    
    print("Your string is composed of these characters:")
    for x in (readLine() ?? "").unicodeScalars {
        let n = x.value // Getting the numeric encoding (like line 2 above).
        print("U+\(String(n, radix: 16, uppercase: true))") // Safely expecting it to be absolutely anything.
    }
    

    But this proposal adds the combination as a single operation:

    let n: UInt8 = 'a'
    

    Thus it has the extra responsibility to consider where and when the entire combined operation as a whole is and isn’t reliable.

  2. The subset of issues you can already encounter today without composition fail immediately with a compiler error.

    This is dangerous code:

    let x = "é" as Unicode.Scalar
    

    That may or may not still fit into a Unicode.Scalar after copy‐and‐paste, but you will know immediately.

    The proposal, had it not limited itself, would have introduced instances of single operations that would be derailed silently:

    let x: UInt32 = 'Å' // Started out as 0x212B, might become 0xC5.
    

    Hidden, nebulous logic changes would be far worse than the sudden compiler failures we can encounter today.

  3. The most straightforward, safest way to get a vulnerable scalar from a literal currently requires all the same compiler and language functionality as the dangerous code:

    let x = "\u{E9}" as Unicode.Scalar
    

    Making the dangerous variant illegal would have made this safe way impossible as well.

    On the other hand, regarding the additions in the proposal, the most straightforward safe way to get a vulnerable integer is completely unrelated in both syntax and functionality, and is thus unaffected by the safety checks:

    let x = 0xE9
    

(For any new readers who want more information about this, the relevant discussion and explanations begin here in the pitch thread and run for about 20 posts until a consensus was reached around what was called “Option 4”. Please read those first if you want to ask a question or post an opinion about the restriction to ASCII.)
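The Unicode-equivalence hazard behind the three points above can be reproduced in current Swift. This is a minimal sketch: escapes stand in for what a copy-and-paste or a normalizing tool might do to the source file, and the variable names are chosen for illustration.

```swift
// Two canonically equivalent spellings of "é".
let precomposed = "\u{E9}"    // U+00E9 LATIN SMALL LETTER E WITH ACUTE
let decomposed = "e\u{301}"   // U+0065 followed by U+0301 COMBINING ACUTE ACCENT

// As text the two are interchangeable: String equality uses canonical equivalence.
let equalAsText = (precomposed == decomposed)

// As numbers they are not: the first scalar differs.
let firstPrecomposed = precomposed.unicodeScalars.first!.value
let firstDecomposed = decomposed.unicodeScalars.first!.value

// Only the precomposed form fits into a single scalar; the decomposed
// spelling would be rejected at compile time:
let scalar = "\u{E9}" as Unicode.Scalar
// let broken = "e\u{301}" as Unicode.Scalar   // does not compile

print(equalAsText, firstPrecomposed, firstDecomposed, scalar.value)
```

The text comparison stays stable under normalization, while anything that reads the numeric value silently depends on which spelling happens to be in the file.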


Oh. That's unfortunate. I actually did not realize the proposal would allow that. (This isn't immediately obvious from the proposal text, and it's only tangentially implied by a test in the implementation.)

Searching back, I now see this did come up during the pitch. I'm sorry I missed it:

I wholeheartedly agree with @xwu here.

I accept that there is some utility in using ASCII character literals for the purposes of integer initialization and pattern matching. But I absolutely do object to the prospect of character literals directly appearing in arithmetic expressions like 'a' + 12, much less abominations like 'a' * 'a'.

let m1 = ('a' as Int) + 12     // this is (barely) acceptable to me
let m2 = 'a' + 12              // ...but this seems extremely unwise
let wat = 'a' * 'b' / 'z'      // ...and this is just absurd

If we cannot prevent the last two lines from compiling, then in my opinion the proposed integer conformances should be scrapped altogether.

An argument can perhaps be made to keep the Int8/UInt8 conformances to cater for the parsing use case, however absurd these expressions might be. But this argument doesn't apply to any other FixedWidthInteger type, so I believe it'd be best if the generic initializer were removed from the proposal.


I agree with this feedback, and it is the main thrust of my review comment about this revised proposal:

In general, I thought that the proposal was acceptable in its previous incarnation, but removing conformances due to ABI limitations and then making them opt-in is a very strange way of vending a feature. It's not discoverable, and it's certainly not convenient.

Instead of trying to push ahead with the minimum possible modification to the design, we need to revisit this design more comprehensively because the pros and cons have changed dramatically.

(To make it explicit: The previous pro of the highest possible convenience in writing let x: Int = 'a' is very much tempered when one must write a conformance. We also introduce a new con that we are encouraging people to conform a type they don't own to a protocol they don't own, which is absolutely contrary to what we tell people otherwise.)

By contrast, init(ascii:) seems very straightforward, and if we care indeed about convenience for initializing arrays of integers in this way, we can vend also a similar initializer on extension Array where Element: FixedWidthInteger.

If compile-time checking is required, there is no reason why the compiler cannot be made to have special knowledge of these init(ascii:) methods to make this work until such time as the constexpr-like features mature and supersede that magic.
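As a rough sketch of the array initializer suggested above: the extension below is hypothetical (nothing like it exists in the standard library), and it checks for ASCII at run time only, since the compile-time checking discussed here would need compiler support.

```swift
extension Array where Element: FixedWidthInteger {
    // Hypothetical initializer, not part of the standard library.
    // Traps at run time if the string contains a non-ASCII byte.
    init(ascii string: String) {
        self = string.utf8.map { (byte: UInt8) -> Element in
            precondition(byte < 0x80, "non-ASCII byte passed to init(ascii:)")
            return Element(byte)
        }
    }
}

// Existing scalar initializer, for comparison.
let single = UInt8(ascii: "a")

// The hypothetical array counterpart works for any fixed-width integer type.
let hexDigits = [UInt8](ascii: "0123456789ABCDEF")
let tag = [Int32](ascii: "kern")
```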


-1. Main reason: unsigned integers are numbers, not codepoints. Codepoints are represented by unsigned integers, but that doesn't mean that UInt8 etc. need to be expressible by these kinds of literals. There should be a wrapper type such as

struct SomeKindOfCodepoint {
    var value: UInt8
}

instead of polluting integer interfaces. I would hate for code like this to become valid:

let a: UInt8 = 10

let b = a + 'x' //wat?

Do codepoints have multiplication? Do codepoints have division? No? Then they should be their own type that only wraps the underlying integer representation. Afterwards, this wrapper type can conform to a new type of ExpressibleBy...Literal as this proposal wants.

But the way it looks now, this is moving way too close into C-like unsigned char territory, where there is no difference between text and numbers, which is a poor choice for a high-level language with a strong type system.
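A minimal sketch of such a wrapper in current Swift, assuming the existing ExpressibleByUnicodeScalarLiteral protocol stands in for the new literal protocol. The type name ASCIICodepoint and the run-time ASCII check are illustrative choices, not part of the proposal.

```swift
// Hypothetical wrapper type: it stores a UInt8 but exposes no arithmetic.
struct ASCIICodepoint: Equatable, ExpressibleByUnicodeScalarLiteral {
    var value: UInt8

    // Traps at run time if the literal is outside the ASCII range.
    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        precondition(scalar.isASCII, "literal must be ASCII")
        self.value = UInt8(scalar.value)
    }
}

let x: ASCIICodepoint = "x"   // double quotes: the literal syntax available today
let a: UInt8 = 10

// let b = a + x              // does not compile: no + between UInt8 and ASCIICodepoint
let b = a + x.value           // the step from text to number is spelled out
```

With this shape, comparisons between codepoints still work through Equatable, while multiplication, division and mixed arithmetic are simply unavailable unless the caller reaches for .value.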


I'm -1 for any version that adds non-numeric literals to integers. It undermines Swift's approach to Unicode correctness. I can already imagine the all-too-easily copy-pasta'ed Stack Overflow posts that will recommend pattern matching over String.utf8.

The basic "Haven't you had this problem?" in the proposal isn't compelling. Nobody's forcing anybody to elide tons of UInt8(ascii:) calls into a single ugly array literal. Plenty of C parsers have a big chunk of #defines for the code points they care about, and I've seen similar approaches in Swift without any trouble.

Core Team members have often said self-hosting Swift is a non-goal because it biases the language towards what's good for a compiler. Likewise, I don't think the language needs to be unfairly biased in favor of ASCII as part of its string ergonomics story. That's every other programming language. We already have those languages.

I'm perfectly in favor of single-quote literals for Swift's UnicodeScalar, Character, and String. It notably improves on the status quo and aligns with other languages.


This paragraph makes no sense. C already has character literal <-> Int8 equivalence. C applications use #defines for Unicode codepoints beyond the ASCII range (e.g. '€'), because C does not have true Unicode scalar literals. This is essentially the opposite of the problem that Swift has and that this proposal is trying to solve. The difference is that C code usually only cares about a handful of “Unicode” characters, like the typographic punctuation characters {'‘', '’', '“', '”'}, so a block of uint32_t #defines is fine, even preferable, for that purpose. When you’re dealing with ASCII, you generally care about the whole alphabet, and #defines (or the Swift equivalent) start looking awfully cringy.

// ascii.swift.gyb
enum ASCII 
{
    %{
        specials    = [(0x00 - 0x20, 'null'), (0x09 - 0x20, 'tab'), 
            (0x0A - 0x20, 'newline'), (0x0D - 0x20, 'carriageReturn')
        ]
        identifiers = ['space', 'exclamation', 'doubleQuote', 'hashtag', 
            'dollarSign', 'percent', 'ampersand', 'singleQuote', 
            'leftParentheses', 'rightParentheses', 'asterisk', 
            'plus', 'comma', 'hyphenMinus', 'period', 'forwardSlash', 
            'zero', 'one', 'two', 'three', 'four', 'five', ...
        ]
    }%
    % for i, identifier in specials + list(enumerate(identifiers)):
        static let ${identifier}:UInt8 = ${i + 0x20}
    % end
}
// opentype.swift
extension FontFeature 
{
    var tag:(UInt8, UInt8, UInt8, UInt8)
    {
        switch self 
        {
        case .kern:
            return (ASCII.k, ASCII.e, ASCII.r, ASCII.n)
        case .calt:
            return (ASCII.c, ASCII.a, ASCII.l, ASCII.t)
        
        ... // like 20 more otf features
        
        case .c2sc:
            return (ASCII.c, ASCII.two, ASCII.s, ASCII.c)
        case .c2pc:
            return (ASCII.c, ASCII.two, ASCII.p, ASCII.c)
        }
    }
}

The thing about ASCII is that, by virtue of computing tradition, the textual mnemonics have become inseparable from the numeric representations. For example, in the PNG binary format, several chunk metadata flags are defined in terms of bits set in fixed-length ASCII strings embedded in the image. The safe-to-copy flag, for instance, is defined as bit 5 (value 32) of the fourth character of a chunk’s name. You check it in Swift like this:

if name.3 & (1 << 5) != 0 
{
    ...
}

Taken from the PNG specification:

   bLOb  <-- 32 bit chunk type code represented in text form
   ||||
   |||+- Safe-to-copy bit is 1 (lowercase letter; bit 5 is 1)
   ||+-- Reserved bit is 0     (uppercase letter; bit 5 is 0)
   |+--- Private bit is 0      (uppercase letter; bit 5 is 0)
   +---- Ancillary bit is 1    (lowercase letter; bit 5 is 1)
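To make the diagram concrete, here is a small sketch in current Swift that reads all four property bits of the example chunk type. The helper name bit5 and the variable names are illustrative; the chunk name is taken as raw UTF-8 bytes rather than as the tuple used above.

```swift
// The chunk type from the excerpt, as four raw bytes.
let name = Array("bLOb".utf8)   // [0x62, 0x4C, 0x4F, 0x62]

// Bit 5 (value 0x20) is exactly what separates lowercase from
// uppercase ASCII letters, which is why the flags read as letter case.
func bit5(_ byte: UInt8) -> Bool {
    return byte & (1 << 5) != 0
}

let ancillary = bit5(name[0])    // 'b' is lowercase, so the bit is 1
let isPrivate = bit5(name[1])    // 'L' is uppercase, so the bit is 0
let reserved = bit5(name[2])     // 'O' is uppercase, so the bit is 0
let safeToCopy = bit5(name[3])   // 'b' is lowercase, so the bit is 1
```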

Two wrong representations don't make a right.

constexpr/@compilerEvaluable is not a silver bullet that will magically give us all the compile-time guarantees we could ever want. It would at the minimum require a bunch of additional function annotations to make it possible for the compiler to see inside init(ascii:). Because the ascii: argument is an external parameter passed to the function, this becomes the textbook example of an expression which cannot be @compilerEvaluable. We would need to rework the type system to support @staticCall/@constArgument or something like that which would make it illegal to call such a function if certain arguments are not @compilerEvaluable. But since this would break both source and ABI (all of a sudden init(ascii:) on a local variable won’t work anymore, and you are effectively removing this method from ABI) we would have to introduce new “static” variants of these initializers on all the types we care about.

There actually is some utility in character literals appearing in arithmetic expressions:

let digit:UInt8 = '0' + n % 10

let letter:UInt8 = 'A' + (character - 'A' + 26 - cipher) % 26
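For comparison, this is roughly how the same two expressions can be written in current Swift with UInt8(ascii:). The function names and the assumption that cipher lies in 0...26 are mine, added so the sketch is self-contained.

```swift
// Last decimal digit of n, as an ASCII byte.
func asciiDigit(_ n: UInt8) -> UInt8 {
    return UInt8(ascii: "0") + n % 10
}

// Shifts an uppercase ASCII letter back by `cipher` positions,
// wrapping within A...Z. Assumes `character` is in A...Z and
// `cipher` is in 0...26, so the subtraction cannot underflow.
func decipher(_ character: UInt8, cipher: UInt8) -> UInt8 {
    let base = UInt8(ascii: "A")
    return base + (character - base + 26 - cipher) % 26
}

let seven = asciiDigit(47)                           // ASCII "7"
let plain = decipher(UInt8(ascii: "D"), cipher: 3)   // ASCII "A"
let wrapped = decipher(UInt8(ascii: "A"), cipher: 3) // ASCII "X"
```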

I agree 'a' * 'a' is just silly, but at some point I think we need to accept that people can write nonsense code with any syntax you give them, and the threshold for whether that’s okay or not has to come down to whether the nonsense expression has an alternative sensical meaning that the user might have intended instead. I don’t think anyone writes 'a' * 'a' and expects it to mean anything. Someone might write 'a' * 5 and expect it to mean "aaaaa", but it’d be pretty clear that’s not the case when the type checker complains about getting a UInt8 and not a [UInt8] or a String.

I would prefer compiler magic to error on the last two lines over compiler magic to make init(ascii:) statically checked. The first could be done in the parser before even getting to the type checker (filter out any AST that has a character literal inside an operator node that is not +), and it would give a pretty clear and simple error message.


If I understand the proposal, the eventual goal is to have code like:

let eight: UInt8 = '0' + 8

be legal. I think I would prefer this to be simplified over UInt8.init(ascii:), but still be explicit. Perhaps expanding the prefix operator system, something like:

let eight: UInt8 = x"0" + 8
// and perhaps
let hexLookup: [Int8] = x"0123456789ABCDEF"

(with 'x' being a bikesheddable value, be it a letter or symbol).

Since this would not be extending the existing ExpressibleBy... protocols, this avoids the versioned protocol conformance issues. It also clarifies that a system that supports values outside 7-bit ASCII can exist, but is deferred because of the issues on how to handle this across file encodings and tools performing text normalization.

-1
This proposal and the associated pitch emphasize that these are character literals. For me the character x is fundamentally different from the number 120.

I would be okay with them being ASCII literals and 'x' being just a fancy way to write 120 (and keeping "x" for the Character struct), but I don't like how 'x' can represent both a Character with all the Unicode complexity and the ASCII value...


I’m not sure what you mean by all of this. There are no especial barriers to evaluation of init(ascii:) at compile time when the argument is a literal, should that feature be desired.

Okay, so you're already able to use this with just integer APIs. How would character literals even help in the task of checking if it's uppercase? Unless you add an isUppercase or isLowercase to UInt8, which sadly would very much just continue going down the path of mixing up integers with codepoint/characters that this proposal is starting.


These both look horrible to me and I think even these "good" examples absolutely need some explicit conversions to make them less magical.


I would prefer neither. ASCII is a very low-level API that should be abstracted over as soon as possible anyways. No need to have tons of init(ascii:) calls in your code.

Also, there is another reason why I don't like this proposal, and that is that it forces ASCII on us. UInt8 doesn't know anything about encoding, and there are already lots of different string encodings, many of them 8-bit encodings.

So what if one day we introduce some new encoding that is not compatible with ASCII? Current literals would work just fine, because they create String and Character and UnicodeScalar instances that are encoding-independent, but these codepoint literals could potentially be used to operate on already encoded data, to replace characters etc.

That would work fine in UTF-8, but with our new encoding, even though it is also 8 bit, the literal 'a' could actually yield the letter "x" or something completely different, just because the literal is naively assuming that we always want ASCII.


There might, someday, be a common future 8-bit encoding where the ASCII literal 'a' might be misinterpreted as an 'x', but TODAY we live in a world where there is a ton of ASCII, especially at the lower levels.

If there is some future 8 bit encoding that makes this untenable, Swift could almost certainly find a way to make this clear.

But for now, we need a clear concise way to express single character ASCII values without a ton of boilerplate.

I’m seeing a surprising amount of pushback on integers being expressible by character literals and all that implies, namely that code like the following would compile:

let m1 = ('a' as Int) + 12     // this is (barely) acceptable to me
let m2 = 'a' + 12              // ...but this seems extremely unwise
let wat = 'a' * 'b' / 'z'      // ...and this is just absurd

In fact the last line gives an error but only because there are too many ways it could compile (it is ambiguous). This does compile:

let wat: Int = 'a' * 'b' / 'z'      // ...and this is just absurd

Arithmetic on character values is more useful than one might think; consider the hex decoding example:

    func nibble(_ char: UInt16) throws -> UInt16 {
        switch char {
        case '0' ... '9':
            return char - '0'
        case 'a' ... 'f':
            return char - 'a' + 10
        case 'A' ... 'F':
            return char - 'A' + 10
        default:
            throw NSError(domain: "bad character", code: -1, userInfo: nil)
        }
    }

Imagine you had to write a JSONDecoder.
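For readers who want to run the example today, here is a sketch of the same function in current Swift, without the proposed single-quoted literals. The nested helper ascii(_:) and the error type BadHexDigit are hypothetical names introduced for this sketch; the error type replaces NSError so that no Foundation import is needed.

```swift
// Hypothetical error type used in place of NSError.
struct BadHexDigit: Error {}

func nibble(_ char: UInt16) throws -> UInt16 {
    // Hypothetical helper: widens an ASCII scalar to UInt16.
    func ascii(_ scalar: Unicode.Scalar) -> UInt16 {
        return UInt16(UInt8(ascii: scalar))
    }

    switch char {
    case ascii("0") ... ascii("9"):
        return char - ascii("0")
    case ascii("a") ... ascii("f"):
        return char - ascii("a") + 10
    case ascii("A") ... ascii("F"):
        return char - ascii("A") + 10
    default:
        throw BadHexDigit()
    }
}

// UTF-16 code units of a hex string, decoded digit by digit.
let decoded = try! "7cF".utf16.map { try nibble($0) }
```

The helper call on every bound is the boilerplate that the character literal syntax is meant to remove.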

I'd thought that, through an accident of ABI stability history, we had found the Goldilocks point where this sort of behaviour wasn’t enabled by default but users had to opt in (by declaring an ExpressibleByUnicodeScalarLiteral conformance), but this seems to be sufficiently unintuitive to leave it in no man's land in people's minds, neither conservative nor convenient.

I’m fine with not deprecating UInt8(ascii: "a") as the alternative, but I’d have more time for it if it were generic over its return type, able to take its type from the expression context, since in reality you typically have to use something like Int8(UInt8(ascii: "a")), which is a bit of a handful.

I’ve seen surprisingly little pushback on the other aspect of the proposal, which is to deprecate "" in favour of '' for Unicode.Scalar and Character literals. This is a source-breaking change but seems to crop up in comparatively few places:

"string".split(separator: "a")
...: warning: double quotes deprecated in favour of single quotes to express 'Character'
// and ironically
UInt8(ascii: "a")
...: warning: double quotes deprecated in favour of single quotes to express 'Unicode.Scalar'

Is there any appetite for proceeding with this part of the proposal until it is possible to allow gated conformances to ExpressibleByUnicodeScalarLiteral to be added by default, so that we can judge the other part of the proposal on its end state rather than its awkward halfway point? IMO this is a worthwhile change in its own right in terms of the ergonomics of the language: it makes explicit the contexts where we are dealing with a single character, and starting this change now will make eventual adoption of integers being expressible by character literals easier if we take another look at it.

Then please show us this way. From my point of view, once the API of Integers has been polluted with the new protocol conformance, there is no way back, because code that makes use of it never explicitly mentions encodings. So how would the future compiler know if the collection of integers being worked on is supposed to eventually be interpreted as UTF-8 etc. or some other encoding? It quite obviously can't.
