Prepitch: Character integer literals

running -emit-assembly and counting the movs is a learning exercise, not a workflow.

2 Likes

Swift already has precedent for treating untyped double-quoted string literals as different types via the ExpressibleBy{String,Character,UnicodeScalar}Literal protocols, instead of introducing different quotes.

So, if we want to pursue this, the consistent way would be to define a new compiler-known protocol, ExpressibleByASCIICodeUnitLiteral, have UInt8 conform to it, and then update the handling of string literals in the compiler to make them type-check correctly, detecting errors at compile-time instead of runtime:

func foo(_ x: UInt8) { ... }

foo("a")   // works
foo("È")   // error: U+00C9 is not a valid ASCII code unit
foo("ab")  // error: the literal has more than one code unit
8 Likes

this is pretty close to what i’m proposing. i only use single quotes because i think double quotes are a little overloaded in the language currently

Still not sure what you're getting at. You asked a yes-or-no question and I answered it; I believe my answer is correct, and I showed my work.

1 Like

yes, and you’ve convinced me that the compiler knows that in this case (at least for now, barring a regression). but anyone else not reading this thread would have to repeat the experiment for themselves to figure this out, and you haven’t convinced me that this will happen in every context, since the compiler’s thought process is pretty much opaque to anyone who’s not a swift compiler dev, and we’re all too familiar with the compiler making strange and nonobvious decisions

either way, we’re making a mountain out of what was originally a small aside; this prepitch is really more about ergonomics

4 Likes

This can be made to work, but the logical endpoint is that UInt16 would allow any code point that fits in a UTF-16 code unit and UInt32 would allow any code point at all. Is that desirable? I could see how this could be confusing to some:

let x : Int = "f"

One nice thing about the C approach with single quotes is that it makes it clear what is going on, and it would allow defining a new default type for:

 let x = 'x'

which would clearly be Character.

-Chris

6 Likes

I think the argument could be made that we want to provide APIs that satisfy the most common needs of users without worrying about taking things to their logical end. Unscientifically, my gut tells me that processing ASCII text that arrives in the form of [UInt8] or Data probably qualifies as more valuable to have as shorthand than UTF-16 or UTF-32.

Another option would be to introduce a unique ASCIICodeUnit type that is essentially a 7-bit unsigned integer instead of extending UInt8 directly, but that might make interop with UInt8s more verbose elsewhere.
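
A minimal sketch of what that type might look like, assuming runtime validation through the existing ExpressibleByUnicodeScalarLiteral protocol (the name ASCIICodeUnit and both initializers are hypothetical; a compiler-known literal protocol could move the check to compile time):

    struct ASCIICodeUnit: Equatable, ExpressibleByUnicodeScalarLiteral {
        let rawValue: UInt8

        // Failable runtime construction from an arbitrary byte.
        init?(_ value: UInt8) {
            guard value < 0x80 else { return nil }
            self.rawValue = value
        }

        // Literal construction; traps on non-ASCII scalars at runtime.
        init(unicodeScalarLiteral scalar: Unicode.Scalar) {
            precondition(scalar.isASCII, "not an ASCII scalar")
            self.rawValue = UInt8(scalar.value)
        }
    }

    let newline: ASCIICodeUnit = "\n"
    let bytes: [UInt8] = [newline.rawValue]  // interop with UInt8 needs the extra step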

I wouldn't expect that to be supported; for clarity, we'd probably want to support only the fixed-size integer types, if anything beyond UInt8 were up for debate at all. But that's just my opinion and an attempt at drawing the line somewhere.

By itself, a new quoting scheme specifically for Character feels like a separate issue from the one in the OP about making it cleaner to write ASCII literals that can be inferred as UInt8. Or are you suggesting that the scheme could be stretched such that single quotes would default to Character but could also be inferred as other types representing "singular" text entities (ASCII UInt8, Unicode.Scalar), while double quotes would only be inferred as types representing sequences of those entities (String, StaticString)?

If we're going to enable uses like

func foo(_ x: UInt8) { ... }
foo("a")   // works

then I think that'd be the way to go.

It looks very confusing that you can supply what looks like a string literal to a function that takes a numeric argument. Unless I'm mistaken, the Swift compiler can already work with builtin types such as Int7 internally, and the LLVM intrinsics are there to support casting without issue. With such a type, your hypothetical function would be self-documenting:

func foo(_ x: ASCII.CodeUnit) { ... }
foo("a") // of course this works
1 Like

i hear legends of some (rare, old-school) platforms and contexts where ascii data is stored in 16- or even 32-bit integers, with the high bits zeroed out or ignored. but i agree that UInt8 is the main one to support.

i see no real benefit to this; ascii is usually stored in 8 bits, so you basically get the other 128-character half of the range for free. maybe we should use signed Int8 so you can use a sign test to distinguish the two
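
for instance, with Int8 the upper half of the byte range becomes negative, so the test is just a comparison (illustrative):

    let byte: Int8 = 0x41     // 'A', which fits since 0x41 < 0x80
    let isASCII = byte >= 0   // bytes with the high bit set come out negative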

i feel like y’all are getting hung up on the single quotes vs double quotes thing, which is really its own side issue and not particularly relevant here. again, i suggest single quotes because i think " is way too overloaded in the language and people are too stingy about reserving ' for something else, but there’s no real problem with using " for everything

1 Like

What I'm getting at is that this can be solved in multiple ways: 1) introducing new types, 2) introducing a new literal form, 3) allowing the existing string literal forms to work with existing integer types.

#1 is bad because it doesn't work with existing APIs and common cases; it leads to syntactic bloat for the common cases we're trying to sugar. #3 leads to potential for confusion and "creep" problems, as I mentioned upthread.

I don't see a downside to #2. We didn't use single quotes initially because we thought they could be used for multi-line string literals, but that has since been resolved, leaving them free for the (highly precedented! :-) approach of using them for character literals. Once you do that, the behavior is immediately obvious:

  let a = 'c'  // Character
  let b : Int8 = 'x'  // ok
  let c : Int16 = '˚' // ok if it fits.
  let d : Int = '\u{123}'  // ok

  let e : Int8 = "x" // obviously not ok, "x" is a string.

etc. What's the bad thing about this approach? What possibility for confusion or other problem do you foresee?

-Chris

9 Likes

Going back to the OP, the C hexcodes array would be expressible like this:

let hexcodes: [UInt8] =
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f']

which seems to directly address the problem at hand.
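
And the classic consumer of such a table, formatting a byte as two hex digits, would read just as naturally (a sketch under the proposed syntax, building on the array above; hexDigits is my name, not proposed API):

    func hexDigits(_ byte: UInt8) -> [UInt8] {
        [hexcodes[Int(byte >> 4)], hexcodes[Int(byte & 0x0f)]]
    }

    hexDigits(0x2a)  // [0x32, 0x61], i.e. "2a" in ASCII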

1 Like

right now Unicode.Scalar is really bad to use just because it uses ", which means you have to write as Unicode.Scalar everywhere, which is annoying. making use of ' could make things better, and it’s not like anyone else has any serious plans for this character anytime soon
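
for example, outside an annotated context the literal defaults to String, so the coercion has to be spelled out at every use site:

    let s = "é"                    // inferred as String
    let v = "é" as Unicode.Scalar  // without the coercion, this isn't a scalar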

2 Likes

I can't say that I'm against it, especially now that it's been fleshed out a bit more with the other types you gave examples of. :slightly_smiling_face:

My initial hesitation was about the Character type specifically being treated differently from other strings or scalars, but I think if we partitioned the various types by their cardinality and said:

  • Single quotes represent literals that can be treated as a single ASCII code unit, UTF-16 code unit, Unicode.Scalar (edit: or Character), depending on the type it's being coerced to
  • Double quotes represent literals that may only be coerced to types that support multiple characters

...then we'd be telling a consistent and compelling story.
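
Concretely, the partition might look like this (a sketch of the proposed inference, not code that compiles today):

    let a: UInt8          = 'x'    // single ASCII code unit
    let b: UTF16.CodeUnit = 'é'    // single UTF-16 code unit
    let c: Unicode.Scalar = '√'    // single scalar
    let d: Character      = '🇨🇦'   // single grapheme cluster
    let e: String         = "xyz"  // sequences keep double quotes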

2 Likes

+1 with the extension that single quotes should allow a single grapheme cluster as well, e.g. when of type Character.

I believe that this can all be done in an additive and compatible manner. We would want to eventually deprecate these, though:

   UInt8(ascii: "x")  // should migrate to single quotes, and possibly eliminate the initializer outright.
   let x : Character = "y"  // should migrate to using single quotes instead of double.

I don't see a rush to do this, though.

-Chris

7 Likes

Indeed, I didn't mean to leave that one out! (Edited above.)

Also this one, correct?

let x: Unicode.Scalar = "y"  // should migrate to single quotes
1 Like

Yep, everything like that.

1 Like

Okay, if it's used consistently with Character etc., cleanly separating a single character or UTF-8/UTF-16 code unit from multiple, then I can see the argument for it. It's not that compelling to me personally, and I feel like single quotes would be more useful for something like the raw string feature that has been discussed here previously, but it does provide a clean syntax for people who work with characters-as-integers a lot. I don't know how large that audience is, though.

In swift-corelibs-foundation, UInt16 is already ExpressibleByUnicodeScalarLiteral:

public typealias unichar = UInt16

extension unichar : ExpressibleByUnicodeScalarLiteral {
    public typealias UnicodeScalarLiteralType = UnicodeScalar
    
    public init(unicodeScalarLiteral scalar: UnicodeScalar) {
        self.init(scalar.value)
    }
}
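
That conformance is what lets literal coercions like these compile there today (illustrative):

    let u: unichar = "A"           // u == 65
    let hi: [unichar] = ["h", "i"]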

@Tony_Parker, should this be removed for compatibility with Darwin platforms?

There was an issue with how to validate Character literals:

[SR-4546] Need more permissive ExtendedGraphemeClusterLiteral parsing

I don't know if this will be re-evaluated for Swift 5.

just to be clear, what is the integer value of a single-quoted Character containing multiple Unicode.Scalars?