Prepitch: Character integer literals

I hear legends of some (rare, old-school) platforms and contexts where ASCII data is stored in 16- or even 32-bit integers, with the high bits zeroed out or ignored, but I agree that UInt8 is the main one to support.

I see no real benefit to this; ASCII is usually stored in 8 bits, so you basically get the other 128-value range for free. Maybe we should use signed Int8 so you can use a sign test to distinguish the two.
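For illustration, a minimal sketch of that sign-test idea in today's Swift (the names here are just examples, and storing character data in Int8 is hypothetical):

let ascii: Int8 = 0x41             // 'A'; ASCII values occupy 0...127
let high = Int8(bitPattern: 0xB5)  // a value from the upper 128
print(ascii >= 0)  // true:  non-negative means ASCII
print(high >= 0)   // false: the high bit is set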

I feel like y'all are getting hung up on the single-quotes-versus-double-quotes thing, which is really its own side issue and not particularly relevant here. Again, I suggest single quotes because I think " is way too overloaded in the language and people are too stingy about reserving ' for something else, but there's no real problem with using " for everything.

1 Like

What I'm getting at is that this can be solved in multiple ways: 1) introducing new types, 2) introducing a new literal form, 3) allowing the existing string literal forms to work with existing integer types.

#1 is bad because it doesn't work with existing APIs and common cases, and it leads to syntactic bloat for the very cases we're trying to sugar. #3 leads to potential for confusion and "creep" problems, as I mentioned upthread.

I don't see the downside of #2. We didn't use single quotes initially because we thought that they could be used for multi-line string literals, but that has already been resolved. I don't see the downside to taking the (highly precedented! :-) approach of using them for character literals. Once you do that, then the behavior is immediately obvious:

  let a = 'c'  // Character
  let b : Int8 = 'x'  // ok
  let c : Int16 = '˚' // ok if it fits.
  let d : Int = '\u{123}'  // ok

  let e : Int8 = "x" // obviously not ok, "x" is a string.

etc. What's the bad thing about this approach? What possibility for confusion or other problem do you foresee?

-Chris

9 Likes

Going back to the OP, the C code quoted there would be expressible like this:

let hexcodes: [UInt8] = 
    ['0', '1', '2', '3', '4' ,'5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f']

which seems to directly address the problem at hand.
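For comparison, the closest spelling today, without single-quoted literals, is a sketch along these lines (relying only on the existing String.utf8 view):

let hexcodes = [UInt8]("0123456789abcdef".utf8)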

1 Like

Right now Unicode.Scalar is really bad to use just because it uses ", which means you have to write as Unicode.Scalar everywhere, and that's annoying. Making use of ' could make things better, and it's not like anyone else has any serious plans for this character anytime soon.
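For illustration, the ceremony being described looks something like this today (variable names are just examples):

let tab: Unicode.Scalar = "\t"        // fine, but the annotation is required
let newline = "\n" as Unicode.Scalar  // otherwise "\n" defaults to String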

2 Likes

I can't say that I'm against it, especially now that it's been fleshed out a bit more with the other types you gave examples of. 🙂

My initial hesitation was about the Character type specifically being treated differently from other strings or scalars, but I think if we partitioned the various types by their cardinality and said:

  • Single quotes represent literals that can be treated as a single ASCII code unit, UTF-16 code unit, Unicode.Scalar (edit: or Character), depending on the type they're being coerced to
  • Double quotes represent literals that may only be coerced to types that can contain multiple characters

...then we'd be telling a consistent and compelling story.

2 Likes

+1, with the extension that single quotes should allow a single grapheme cluster as well, e.g. when the type is Character.

I believe that this can all be added in an additive and compatible manner. We would want to eventually deprecate these though:

   UInt8(ascii: "x")  // should migrate to single quotes, and possibly eliminate the initializer outright.
   let x : Character = "y"  // should migrate to using single quotes instead of double.

I don't see a rush to doing this though.

-Chris

7 Likes

Indeed, I didn't mean to leave that one out! (Edited above.)

Also this one, correct?

let x: Unicode.Scalar = "y"  // should migrate to single quotes

1 Like

Yep, everything like that.

1 Like

Okay, if it's used consistently with Character etc., cleanly separating a single character or UTF-8/UTF-16 code unit from multiple, then I can see the argument for it. It's not that compelling to me personally, and I feel like single quotes would be more useful for, e.g., the raw string feature that has been discussed here previously, but it does provide a clean syntax for people who work with characters-as-integers a lot. I don't know how large that audience is, though.

In swift-corelibs-foundation, UInt16 is already ExpressibleByUnicodeScalarLiteral:

public typealias unichar = UInt16

extension unichar : ExpressibleByUnicodeScalarLiteral {
    public typealias UnicodeScalarLiteralType = UnicodeScalar
    
    public init(unicodeScalarLiteral scalar: UnicodeScalar) {
        self.init(scalar.value)
    }
}
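For context, that conformance is what already lets a double-quoted scalar literal produce a unichar on those platforms; a small sketch, assuming swift-corelibs-foundation:

import Foundation

let a: unichar = "A"     // stores the scalar's value, 0x0041
let tab: unichar = "\t"  // 0x0009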

@Tony_Parker, should this be removed for compatibility with Darwin platforms?

There was an issue with how to validate Character literals:

[SR-4546] Need more permissive ExtendedGraphemeClusterLiteral parsing

I don't know if this will be re-evaluated for Swift 5.

Just to be clear, what is the integer value of a single-quoted Character containing multiple Unicode.Scalars?

That should be a compile error.

Yep. I would expect the following behavior:

let x1: Character = 'a'  // ok
let x2: Int = 'a'  // ok, x2 == 97
let x3: Character = '🇨🇦'  // ok
let x4: Int = '🇨🇦'  // compile error, not a single scalar

This is the same behavior we already have today if you try to initialize a Unicode.Scalar from a double-quoted string literal with more than one scalar, or a Character from a double-quoted string literal with multiple grapheme clusters.
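For reference, a minimal sketch of that existing behavior with today's double-quoted literals (names are illustrative):

let scalar: Unicode.Scalar = "a"     // ok: a single scalar
let flag: Character = "🇨🇦"           // ok: a single grapheme cluster
// let bad1: Unicode.Scalar = "🇨🇦"   // error: more than one scalar
// let bad2: Character = "ab"        // error: more than one grapheme cluster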

5 Likes

I'm not sure what your question is. A single grapheme cluster (which is what Character models) is a single character that may be composed of multiple unicode scalars. It is all of the unicode scalars that make up that character, not a single one.

-Chris

1 Like

+1 to @allevato's post. Exactly.

That's exactly the point I'm making: now all of a sudden we have '' literals that fail depending on the inferred type.

Why is this a problem? Your original post was asking for a way to cleanly get the integer value of an ASCII code unit, and this does that, with the added benefit that it supports more than just ASCII and you get compile-time verification that the value can be represented by the inferred type.

I don't understand how that objection would be resolved unless you say that single quotes could only support the lowest common denominator, which would be 7-bit ASCII, and I think that would be far too restrictive a feature to spend single quotes on.

2 Likes

Can I suggest that Character instead still use double quotes, because Characters are really sequences of Unicode scalars rather than scalars themselves? We kind of have this weird hierarchy in our string model where there's Unicode.Scalar → Character → String.

While that seems like a reasonable technical model, it would provide an inconsistent user experience if we look at the whole picture. Let's consider a double-quoted string literal "abc" and how we would view its elements through the various string views under that model:

  • The UTF-8 code units of "abc" are equivalent to ['a', 'b', 'c']
  • The UTF-16 code units of "abc" are equivalent to ['a', 'b', 'c']
  • The scalars of "abc" are equivalent to ['a', 'b', 'c']
  • The characters of "abc" are equivalent to ["a", "b", "c"]

Keeping Character literals as double-quoted breaks what is otherwise a clean and consistent mapping between string elements and sequences of string elements. I can see both sides of this, but I have to say I prefer the solution that makes all four bullets above consistent because that's an easier relationship to explain to newcomers to Swift.
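For reference, a quick sketch of those four views in today's Swift (variable names are just examples):

let s = "abc"
let utf8    = Array(s.utf8)                      // [97, 98, 99] as [UInt8]
let utf16   = Array(s.utf16)                     // [97, 98, 99] as [UInt16]
let scalars = s.unicodeScalars.map { $0.value }  // [97, 98, 99] as [UInt32]
let chars   = Array(s)                           // ["a", "b", "c"] as [Character]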

3 Likes