Prepitch: Character integer literals

There is no semantic reason why UInt8(ascii:) requires the ICU runtime or could not be done at compile time.
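
For reference, a minimal illustration of today's spelling (the particular constants here are just examples, not part of any proposal):

// today's run-time spelling; the claim above is that the same values
// could be produced and range-checked at compile time
let colon = UInt8(ascii: ":")                                           // 58
let get: [UInt8] = [UInt8(ascii: "G"), UInt8(ascii: "E"), UInt8(ascii: "T")]
// [71, 69, 84]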

Okay, so it is mostly a limitation of the current implementation? Or something to do with the moving-target Unicode spec/ICU version/whatever? My main concern is that it won't make any sense from a user's perspective.

What's the ICU runtime?

Again, that’s still very niche. If I ever saw someone distinguish between ASCII and Latin with a mysterious let isAscii = i > 0, I would instantly reject the CR. No questions asked.

That sort of feature is much better to just be wrapped in a method or boolean computed property.
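
For instance, a sketch of what such a wrapper might look like (the extension and the 0x80 cutoff are my own illustration, not something from the pitch):

extension UInt8 
{
    // true for the 7-bit ASCII range, 0x00...0x7F
    var isASCII: Bool 
    {
        self < 0x80
    }
}

let byte: UInt8 = 0x41      // 'A'
print(byte.isASCII)         // true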

The single-quote syntax could be put to better use for regex literals or grammar generators, or be made into some sort of facility for supporting user DSLs.

I think Swift has more important things to do than mimic the look of a 50-year-old language feature* for supporting a largely defunct 52-year-old text encoding.

  • Single quotes for character constants date back at least to the B programming language, circa 1969.
3 Likes

i mean, it’s only niche to some people. If you do any sort of work with any kind of binary format, you’re gonna need these, and need them a lot

8 Likes

Yes, it’s mostly a limitation of the current implementation, though it may be difficult to remove the restriction that single-quoted literals be single code points. The key requirement is that you get a compile-time error when a character overflows the target type:

let d2: Int8 = '🙂'
k.swift:6:16: error: codepoint literal '128578' overflows when stored into 'Int8'

The easiest way to do this is to leverage the existing integer literal code. This reuse imposes the limitation that a single-quoted literal can only be a single code point, but it does mean you can give a meaningful error message early on:

let invalid = '👨🏼‍🚀'
k.swift:2:15: error: character not expressible as a single codepoint

Perhaps this limitation can be relaxed in the longer term if it turns out to be a problem, but for now the prototype, along with Chris’ draft, is probably adequate to move to a worthwhile review.

2 Likes

that proposal (which is mine btw, though chris contributed a lot) has an important difference, which is that the inferred type of 'a' is Character and you do 'a' as Int or let scalar:Int = 'a'. So we have a CodepointLiteral that defaults to Character, and we conform Int and friends to ExpressibleByCodepointLiteral.

Your idea if i get it right is that the inferred type of 'a' is Int and you do 'a' as Character or let character:Character = 'a' via ExpressibleByIntegerLiteral.

Personally i think the inferred type being Int makes more sense, but then we miss out on an opportunity to provide a syntax for Character literals that doesn’t need an as annotation, which we currently don’t have. Being able to write

let character:Character = 'a'

isn’t much of a win over the current

let character:Character = "a"

The drawback is we risk confusing people (even more) about Swift’s string model since not every valid Character can be written as a CodepointLiteral, but every valid Character can be written as a StringLiteral.

Sorry @taylorswift, I should have mentioned this was your proposal & I’m not trying to propose anything different, but I can provide you with an implementation to meet the review bar if it helps, as I want to see this pitch succeed. Initially I tried to get away with a model where these literals were of type Int, but on a second pass they are now created using a protocol ExpressibleByCodepointLiteral, which all the integer types conform to along with Unicode.Scalar and Character, so one of these literals can have any of those types.

The implementation imposes a limitation that these “codepoint” literals be only a single codepoint, as they are processed internally as integer literals. The default type is very easy to configure and should probably remain Character, to future-proof the model for a time when this restriction can perhaps be relaxed. This will confuse some people, but there is a clear error message, and we are still far better off than with UInt8(ascii:), since the literals cover the majority of common characters across the full 21-bit code point range.
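
A hypothetical sketch of the shape being described, just to make it concrete; neither the protocol nor the default-type typealias exists in today’s Swift, and the names simply follow the pitch:

// hypothetical: the protocol the pitch describes, mirroring the existing
// literal protocols; the compile-time narrowing itself needs compiler support
protocol ExpressibleByCodepointLiteral 
{
    associatedtype CodepointLiteralType    // e.g. UInt8 ... Int, Unicode.Scalar, Character
    init(codepointLiteral: CodepointLiteralType) 
}

// hypothetical: the default inferred type for 'a', configured the same way
// the other literal defaults are (compare IntegerLiteralType, StringLiteralType)
typealias CodepointLiteralType = Character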

2 Likes

the more i think about it, the more I think 'a' should default to Unicode.Scalar, because we can think of “charactery” constructs in Swift as lying on a spectrum from raw to cooked

 raw                       cooked
UInt32 → Unicode.Scalar → Character

so we have three choices for what 'a' should default to.

  • UInt32 is a perfectly sensible default type. Note that I really don’t think Int or UInt is a good idea, since this should never work: "\u{200000}". I’m fine with the default type having 11 more bits than it should be allowed to have since the purpose is just to remind people that these are not just “integer literals written with letters”. And I assume most people would be using these with explicit type annotations anyway.
    Just to be clear, we should be able to coerce 'a' to a 64-bit integer type, we just shouldn’t allow this method to set any bits higher than position 21.

  • Unicode.Scalar is also a perfectly sensible default type, since, well, that’s what these 'a' literals are. We also kill two birds with one stone by underloading UnicodeScalarLiteral from double quotes, so that Unicode.Scalars actually have their own literal syntax.

  • Character as a default type would have similar benefits to Unicode.Scalar in that we would get a way to write a CharacterLiteral without needing an explicit type annotation. However it’s not a good candidate because not every possible Character can be written as a codepoint literal. I don’t think '🇺🇸' should ever work. This means we don’t get to underload CharacterLiteral from double quotes, since we still need a way to express Characters like "🇺🇸".
    You can argue that maybe '🇺🇸' should work, and we should just apply the single-codepoint restriction to explicitly typed integers or Unicode.Scalars, but then single-quoted literals kind of just become double-quoted literals that work for integers, but don’t work for Strings. I don’t think this overlap helps us and I think this would only lead to confusion for the Unicode.Scalar and Character types, since we now have two ways of expressing these literals. It would be as if let n:Int = 1.0 became a thing.
    This also means we can’t call single-quoted literals “CodepointLiteral”s since, well, they’re not codepoints anymore.

I think most of us agree that it should default to the most cooked representation we can reasonably do, which is why Character as a default was so popular in this thread. But I think there’s a good reason to pull back one level and set the default at Unicode.Scalar. It also turns the problem from adding a whole new literal type to the language into just modifying the existing Unicode.Scalar literal type to have a different syntax.
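
To make the spectrum concrete, here is what moving between the three levels looks like in today’s Swift (the single-quote syntax itself is the only hypothetical part, so the snippet uses double quotes):

let scalar: Unicode.Scalar = "a"            // the proposed default for 'a'
let raw: UInt32 = scalar.value              // one step toward raw: 97
let cooked: Character = Character(scalar)   // one step toward cooked: "a"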

6 Likes

I'm in the niche you describe: I have an app that reads a binary format. I convert a String to Data using the ASCII encoding, which serves this purpose perfectly.
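
Something along these lines (a minimal sketch using Foundation; the string content is made up):

import Foundation

// encode an ASCII-only header string into raw bytes; returns nil if the
// string contains anything outside ASCII
let header = "RIFF"
let bytes: Data? = header.data(using: .ascii)   // 0x52 0x49 0x46 0x46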

In Swift, a string is now a collection of characters, as it should be in a modern language that has full, first-class support for Unicode. The currency ‘unit’ of a string is firmly the character (or extended grapheme cluster), and I think it would be quite unjustifiable if '🇺🇸' didn’t work.

3 Likes

Are you saying to draw the line between ' and " between String and Character instead of between Character and Unicode.Scalar? I’m not opposed to it but it would mean messing with the existing syntax for both Character and Unicode.Scalar (getting rid of " for those two) instead of just Unicode.Scalar. So a little more source breaking.

Sure, no problem if that's how you want to approach it. It does impact the naming here though. ExpressibleByCodepointLiteral isn't an accurate name if the limitation is expected to be relaxed. And in a formal review I would argue against it not being able to represent all Characters because I don't think it makes sense from a user's perspective.

I don't agree with this at all. I think '🇺🇸' should definitely work, and single quotes should be the preferred way to write all Character literals (I'm ambivalent about whether the double-quoted versions should be eventually deprecated). I don't find this let n: Int = 1.0 argument convincing. It's more like if let d = 1.0 and let d: Double = 1 both worked to write a Double, and the difference was just the default type inferred for the different literal forms. And hey, that's exactly how it does work.
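
For what it's worth, the Double analogy is directly checkable in today's Swift:

let d1 = 1.0            // float literal, inferred type Double
let d2: Double = 1      // integer literal, coerced to Double via ExpressibleByIntegerLiteral
// both give the same Double value, 1.0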

I don't see why anything has to be source breaking. The conformance to 'ExpressibleByStringLiteral' could be deprecated (if not strictly by annotating as deprecated, then by a custom warning built into the compiler) instead of removed.

I think there’s a better way to do this:

Swift has several existing protocols (ExpressibleByUnicodeScalarLiteral, ExpressibleByExtendedGraphemeClusterLiteral, and ExpressibleByStringLiteral) that all cohabit the double-quote literal space and look like this:

protocol ExpressibleByUnicodeScalarLiteral 
{
    // the associatedtype is chosen by the conformer from Unicode.Scalar, 
    // Character, String, or StaticString
    associatedtype UnicodeScalarLiteralType
    init(unicodeScalarLiteral: UnicodeScalarLiteralType) 
}

where the type of the unicodeScalarLiteral: argument is a user-chosen associatedtype drawn from Unicode.Scalar, Character, String, or StaticString. The compiler checks whether the double-quoted literal can be narrowed to a Unicode.Scalar, and then converts it to whichever of those types the initializer takes. This conversion always succeeds, because a Unicode.Scalar can always be widened to a Character or a String, so the compiler (or standard library) does it for you.
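
As a concrete (toy) example of that machinery, here is a conformance that picks Unicode.Scalar as its associatedtype; the type name is made up:

// toy conformance: the compiler narrows the double-quoted literal to a
// Unicode.Scalar before handing it to this initializer
struct ScalarBox: ExpressibleByUnicodeScalarLiteral 
{
    let scalar: Unicode.Scalar
    init(unicodeScalarLiteral: Unicode.Scalar) 
    {
        self.scalar = unicodeScalarLiteral
    }
}

let ok: ScalarBox = "a"       // fine: a single scalar
// let bad: ScalarBox = "ab"  // rejected at compile time: not a single scalar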

In contrast, the ExpressibleByExtendedGraphemeClusterLiteral requirement doesn’t let you write an initializer that takes a Unicode.Scalar, since in that case the compiler only checks whether the literal can be narrowed to a Character, not all the way down to a Unicode.Scalar.

protocol ExpressibleByExtendedGraphemeClusterLiteral 
{
    // the associatedtype is chosen by the conformer from Character, String, 
    // or StaticString
    associatedtype ExtendedGraphemeClusterLiteralType
    init(extendedGraphemeClusterLiteral: ExtendedGraphemeClusterLiteralType) 
}

These protocols filter out the double-quoted literals that don’t match their requirement at compile time. That’s why you get a nice error message when you try to do this:

struct S:ExpressibleByExtendedGraphemeClusterLiteral
{
    let value:UInt32
    init(extendedGraphemeClusterLiteral:String) 
    {
        self.value = extendedGraphemeClusterLiteral.unicodeScalars.first!.value
    }
}

let s1:S = "a", 
    s2:S = "aa"
literal.swift:11:12: error: cannot convert value of type 'String' to specified type 'S'
    s2:S = "aa"
           ^~~~

That sounds a lot like what we’re trying to do, except more extreme. What we want to do is add two protocols ExpressibleByUnicode16Literal and ExpressibleByUnicode8Literal just like the ones above, except they don’t just check if the double quoted literal can be downcast into a Character or a 21-bit Unicode.Scalar, they check if it can be downcast all the way to a 16 or 8 bit integer.
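
Sketched declarations of what those two protocols might look like, purely for illustration; neither name exists today, and the actual compile-time narrowing checks would need compiler support:

protocol ExpressibleByUnicode8Literal 
{
    associatedtype Unicode8LiteralType          // e.g. UInt8
    init(unicode8Literal: Unicode8LiteralType) 
}

protocol ExpressibleByUnicode16Literal: ExpressibleByUnicode8Literal 
{
    associatedtype Unicode16LiteralType         // e.g. UInt16
    init(unicode16Literal: Unicode16LiteralType) 
}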

There are a lot of benefits.

  • We get compile-time checking of whether a double-quoted literal overflows an {8, 16, 21}-bit integer, and the compiler emits the same helpful error it already gives for double-quoted literals that can’t be Characters or Unicode.Scalars.

  • To write let n:UInt8 = "a" all we have to do is conform UInt8 to ExpressibleByUnicode8Literal and the compiler will do the overflow checking. We can make all the ints do the same:

    • Int8 -> ExpressibleByUnicode8Literal
    • UInt16 -> ExpressibleByUnicode16Literal
    • Int16 -> ExpressibleByUnicode16Literal
    • UInt32 -> ExpressibleByUnicodeScalarLiteral
    • Int32 -> ExpressibleByUnicodeScalarLiteral
    • UInt64 -> ExpressibleByUnicodeScalarLiteral
    • Int64 -> ExpressibleByUnicodeScalarLiteral
    • UInt -> ExpressibleByUnicodeScalarLiteral
    • Int -> ExpressibleByUnicodeScalarLiteral
  • This means you can’t sneak something like this past the compiler:

literal.swift:1:25: error: invalid unicode scalar
let u:UInt = "\u{800000}"

which is a good thing.

Instead of using single quote literals for integers, we can just add 8-bit and 16-bit literals as a logical extension of the String -> Character -> Unicode.Scalar train. We can actually do Int32 and higher already:

extension Int32:ExpressibleByUnicodeScalarLiteral 
{
    public 
    init(unicodeScalarLiteral:Unicode.Scalar) 
    {
        self = .init(bitPattern: unicodeScalarLiteral.value)
    }
}

let integer:Int32 = "a"
// 97

let hex:[Int32] = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", 
                   "a", "b", "c", "d", "e", "f"]
//[48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102]
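
For the narrower integer types, the closest we can get today is a run-time check rather than the compile-time one being proposed; a rough sketch (the retroactive conformance and the trap-on-overflow behaviour are my own illustration):

extension UInt8: ExpressibleByUnicodeScalarLiteral 
{
    public 
    init(unicodeScalarLiteral: Unicode.Scalar) 
    {
        // narrowing is only checked at run time here; the proposed
        // ExpressibleByUnicode8Literal would move this check to compile time
        precondition(unicodeScalarLiteral.value <= UInt32(UInt8.max), 
                     "scalar does not fit in UInt8")
        self = UInt8(unicodeScalarLiteral.value)
    }
}

let newline: UInt8 = "\n"
// 10

// let euro: UInt8 = "€"    // compiles, but traps at run time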

Maybe we don’t need to use single quote literals after all.

2 Likes

Interesting approach, with the benefit of not requiring a new literal form. The benefits you list seem possible with either single- or double-quoted literals though. I think I would still prefer the single-quoted version because it cordons off a lot of the weird behaviour away from the double quoted literals, which are more commonly used. With your Int32 prototype, for example, you can write:

let x = "8" * "9" // 3192 (in a few (degenerate) languages this would give 72 or "72" or something)
let y = "=" * 5 // 305 (in some languages this is used to repeat a string, giving "=====")

And probably many other weird edge cases I haven't thought of. Most of these things will still be possible with single-quoted literals, but they will be encountered less often.

Single quotes also provide a more convenient way to write a Character, which seems like a nice minor benefit to me for some string processing (e.g. defining a Set of Characters to later filter/split with).

2 Likes

This is getting to the nub of the problem. Historically, the protocol ExpressibleByStringLiteral is a descendant of ExpressibleByUnicodeScalarLiteral, so we can’t make the Int types conform to ExpressibleByUnicodeScalarLiteral without giving undesirable behaviour to the String type. Therefore, we need a new protocol like ExpressibleByCodepointLiteral for the Int types to conform to for things to work.

I’d go further and say that if only the Int types conform to ExpressibleByCodepointLiteral, then it is only useful for these literals to represent integer code points; Character’s conformance to this protocol is a convenience and shouldn’t drive its semantics. It’s a bit of a break for Swift to concede that something exists in a string other than Characters, but I don’t see how to avoid it.

wait, I don’t understand: if ExpressibleByStringLiteral derives from ExpressibleByUnicodeScalarLiteral, then we can conform Int to its superprotocol without affecting stuff like String, which conforms to the subprotocols, right? What undesirable String behavior do you foresee?

I think we should separate the integer literal part of the problem from the single-quotes part of the problem. We can get integer literals just by extending the set of existing double-quoted literal protocols. Single-quoted literals are then a question of whether we want some of the types in the double-quoted literal space
(UInt8 → UInt16 → Unicode.Scalar → Character → String)
to have a different syntax. If we do that I think it should be a clean partitioning, and if something can be written with single-quotes, it shouldn’t be allowed to be written with double-quoted literals.

The design I would propose is

ExpressibleByUnicode8Literal                // adopted by: UInt8,  Int8
    ↓
ExpressibleByUnicode16Literal               // adopted by: UInt16, Int16 
    ↓
ExpressibleByUnicodeScalarLiteral           // adopted by: UInt32, Int32 
                                            //             UInt64, Int64
    ↓                                       //             UInt,   Int
ExpressibleByExtendedGraphemeClusterLiteral // adopted by: Character
    ↓
ExpressibleByStringLiteral                  // adopted by: String


// if we create a new single quoted literal type, we should make it the 
// sole literal type for `Character` and below, and set `Character` to be 
// its default inferred type.
typealias ExtendedGraphemeClusterType = Character
typealias UnicodeScalarType           = Character 
typealias Unicode16Type               = Character  
typealias Unicode8Type                = Character  

In your example:

extension Int32: ExpressibleByUnicodeScalarLiteral
{
    public init(unicodeScalarLiteral: Unicode.Scalar)
    {
        self = .init(bitPattern: unicodeScalarLiteral.value)
    }
}

let i: Int32 = "1" + "1"
print(UInt8(ascii: "1"), i)
// prints 49 98

This is what @jawbroken was trying to avoid.

I don’t think this is avoidable. When you use the + operator you’re basically signalling to Swift that you want this:

let i:Int32 = ("1" as Int32) + ("1" as Int32)

If you agree that

let i1:Int32 = "1", 
    i2:Int32 = "1"
let i:Int32  = i1 + i2

should give you 98, then the first one should too.

Keep in mind that this:

let string = "1" + "1"
// "11"

would still work as expected.

1 Like