Prepitch: Character integer literals


(^) #202

wait exactly what does this sentence mean? Are you saying the ABI impact of an in-place implementation of this proposal is small enough that it could make it in after Swift 5?


(Douglas Gregor) #203

I mean that it’s ABI-additive, so we could stage it in after Swift 5. It’s also much smaller and therefore carries less risk.

Doug


(John Holdsworth) #204

Hi @Douglas_Gregor, I’ve come up with a new implementation using the existing protocols here:

It’s not perfect, but could you check whether it’s the sort of thing that could pass “fixed ABI” muster please. The only additive elements are a couple of new conformances to ExpressibleByUnicodeScalarLiteral for each integer type. Swift 5 toolchain here.

A few details on what the new implementation offers: it makes no real distinction between single- and double-quoted strings, except that single-quoted strings are checked to be a single extended grapheme cluster, and double-quoted strings cannot initialise an integer type without an error. Both can initialise String, Character or Unicode.Scalar variables, and the default type for both literals is String, as it is not possible to make a distinction while maintaining compatibility. The default type for '1' + '1' is String and its value "11", but '1' + '1' will give the ASCII value of '1' times two in an integer context. digit - '0' works if digit already has an integer type. The implementation checks for overflows when assigning into Int8 and Int16 but doesn't make any distinction between signed and unsigned values, so var acute: Int8 = 'é' gives a negative number.

So, all in all the new implementation has all the features of the previous implementation though it is a little looser. It is certainly a good deal simpler at only 60-odd changed lines.
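The behaviours described above can be approximated in today's Swift, since the single-quote syntax is what the proposal adds; this is a rough sketch using the existing string and scalar APIs in its place:

```swift
// Current-Swift equivalents of the described literal behaviour.
let one: Unicode.Scalar = "1"                   // U+0031, value 49
let zero: Unicode.Scalar = "0"                  // U+0030, value 48
let sum = Int(one.value) + Int(one.value)       // 98 — what '1' + '1' yields in an integer context
let digit = 55                                   // a value that already has an integer type
let numeric = digit - Int(zero.value)            // 7 — the digit - '0' idiom
let eAcute: Unicode.Scalar = "é"                 // U+00E9, value 233
let acute = Int8(truncatingIfNeeded: eAcute.value)  // -23, the negative number mentioned above
```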


(David Waite) #205

I don't think anyone suggested adding multiple encodings. @johnno1962's example includes a Latin 1 character simply because Unicode code points that fit into one byte are equivalent to Latin 1. It obviously won't work for any other encoding (unless you count ASCII) because Unicode only has this particular relationship with Latin 1.

Is that desirable though? All serialized text has an encoding of some sort - be it ASCII, Latin-1, EBCDIC, UTF-8, etc. As soon as you stop working in terms of characters, you have to start dealing with this.

If I had

let a: UInt8 = 'å'

in my code, the default would be to assume I'm doing a binary comparison as Latin-1 text?

If I was instead trying to compare against UTF-8 text, this code should not compile - but instead it will compile and work only when it matches portions of UTF-8 encoded codepoints.
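The mismatch can be seen directly in today's Swift: the single-byte (Latin-1/Unicode scalar) value of 'å' is not one of its UTF-8 code units:

```swift
// 'å' as a single Latin-1/Unicode byte value versus its UTF-8 encoding.
let aRing: Unicode.Scalar = "å"        // U+00E5
let latin1Byte = UInt8(aRing.value)    // 229 — fits in one byte because U+00E5 < 256
let utf8Bytes = Array("å".utf8)        // [195, 165] — two bytes in UTF-8
```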


#206

Disclaimer: I have been following the discussion, but I'm no expert.

I think that the intent is if you initialize a type of UInt8, it's an integer from that point on. It's not an a with a funny circle (sorry I don't know the character name). So you're not using it as a character of any sort after initialization. Initialization is just a funny spelling of whatever the Latin-1 or ASCII value of that character is, but it's not a representation of the character. The semantic meaning is lost to the compiler at that point.


(Michel Fortin) #207

Indeed, if you were comparing 'å' as UInt8 with a byte from a UTF-8 string, it wouldn't work because you are comparing a unicode scalar to a code unit and they aren't equivalent for UTF-8 code units above 127. We could make the character literal an error in this situation. That might help people working with UTF-8 code units, but it'll also be a hindrance to those working with binary formats where the Latin-1 visualisation is commonly used for byte-based signatures.

If we had a distinct type specifically meant to represent UTF-8 code units (similar to UnicodeScalar), then it'd certainly make sense to make this an error. But UInt8 is an integer and I'm not sure it'd be appropriate to restrict it to UTF-8 semantics.
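The failure mode being discussed is easy to demonstrate: searching UTF-8 code units for the scalar value of a non-ASCII character never matches, because the code units above 127 belong to multi-byte sequences:

```swift
// Why a UInt8 'å' never matches UTF-8 text: the scalar value 0xE5 is not
// among "å"'s UTF-8 code units, which are 0xC3 and 0xA5.
let scalarByte = UInt8(("å" as Unicode.Scalar).value)  // 0xE5
let utf8 = Array("å".utf8)                              // [0xC3, 0xA5]
let found = utf8.contains(scalarByte)                   // false
```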


(^) #208

tell me if i’m wrong but can we not do what you described and add back the restrictions at the type checker level? The type checker should have enough information to know if a single quoted literal or double quoted literal belongs in a given context, and that should all take place before ABI ever gets involved. Like I said before, we can keep all the old entry points but just make it so they never actually get used when they shouldn’t, because this proposal shouldn’t affect how Swift programs run at all, just how they’re written.


(John Holdsworth) #209

Absolutely, the changes required for character literals should be confined to compile time as much as possible, but I don’t think expressing integers can be done without the minor ABI change to add new conformances to ExpressibleByUnicodeScalarLiteral. It is these conformances which guide the type checker and provide the implementation to initialise an integer type from a unicode scalar. All the rest can now be implemented in CSApply.cpp, including gating invalid combinations, overflow detection, deprecation warnings etc. From the outside, the solution section of the proposal is not much changed:

Proposed solution

Let's do the obvious thing here, and re-use the existing ExpressibleByUnicodeScalarLiteral and ExpressibleByExtendedGraphemeClusterLiteral protocols. Character literals are essentially short strings (with default type String for compatibility), checked to contain only a single extended grapheme cluster, and will become the preferred (eventually only) syntax to express Unicode.Scalar and Character values.

In addition, new conformances to ExpressibleByUnicodeScalarLiteral will be added to all integer types so they can be initialised with the codepoint value when a character is a single unicode scalar. These new conformances are gated such that only single quoted character literals can be used and a compile-time check is made that the codepoint value fits into the target integer type.

| ExpressibleBy… | UnicodeScalarLiteral | ExtendedGraphemeClusterLiteral |
|----------------|----------------------|--------------------------------|
| UInt8, …, Int  | yes                  | no                             |
| Unicode.Scalar | yes                  | no                             |
| Character      | yes                  | yes                            |
| String         | yes                  | yes                            |

ExpressibleByUnicodeScalarLiteral will work essentially as it does today. This allows us to statically diagnose overflowing codepoint literals, just as the compiler and standard library already work together to detect overflowing integer literals:

|                 | 'a'    | 'é'    | 'β'    | '𓀎'    | '👩‍✈️'  | "ab" |
|-----------------|--------|--------|--------|---------|--------|------|
| :String         |        |        |        |         |        | "ab" |
| :Character      | 'a'    | 'é'    | 'β'    | '𓀎'    | '👩‍✈️'  |      |
| :Unicode.Scalar | U+0061 | U+00E9 | U+03B2 | U+1300E |        |      |
| :UInt32         | 97     | 233    | 946    | 77838   |        |      |
| :UInt16         | 97     | 233    | 946    |         |        |      |
| :UInt8          | 97     | 233    |        |         |        |      |
| :Int8           | 97     | −23    |        |         |        |      |

Note that when narrowing to a signed type such as Int8, the highest bit of the codepoint goes into the sign bit of the integer value. This makes processing C char buffers easier.
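The Int8 column above can be reproduced with a plain bit-pattern narrowing in today's Swift, shown here as a sketch of the intended semantics:

```swift
// 'é' is U+00E9; the high bit of 0xE9 becomes the sign bit when the
// value is narrowed to Int8, and the bit pattern round-trips exactly.
let codepoint: UInt32 = 0xE9
let signed = Int8(truncatingIfNeeded: codepoint)   // -23, matching the table
let roundTrip = UInt8(bitPattern: signed)          // 0xE9 again
```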


(^) #210

yes, but isn’t that ABI additive?


(John Holdsworth) #211

Looks like it should be allowable, as it adds a few conformances rather than new protocols. If I’m honest it works out better, being simpler, once I worked out how to detect overflows. From a functional point of view the new implementation is just about equivalent, apart from the default type being stuck as String for compatibility.


(^) #212

This kind of forfeits many of the (side) benefits of having a separate literal type though, mainly that Character would no longer be promoted to a first-class type with a literal syntax that doesn’t need an as coercion.


(John Holdsworth) #213

If that’s a deal breaker it’s possible to make a “tactical” patch to the type checker specific to character literals that I’m not proud of but it does solve the problem given the constraints. It relies on a new typealias in the stdlib but that shouldn’t be an ABI issue as it is only relevant at compile time.

let imACharacter = 'a'
let imAString = 'a' + 'b'
let im195 = 'a' + 'b' as Int

New toolchain uploaded if you want to kick the tires.


(^) #214

can you make a linux build? I can’t run the macOS binaries


(John Holdsworth) #215

16.04 Ubuntu? What’s the error you’re getting?


(^) #216

18.04, but the toolchain you linked to is an osx toolchain…
i’ll just build it from source from your fork, but i figured i’d let you know lol


(John Holdsworth) #217

I don’t have an 18.04 setup, alas. Curious what the problem is when you try the Mac binaries. The compiler doesn’t even work?


(John Holdsworth) #219

Linux 18.04 toolchain available: http://johnholdsworth.com/swift-LOCAL-2019-01-02-a-linux.tar.gz


(Karoy Lorentey) #220

As a Central European engineer who lived through the Dark Age of 8-bit code pages, I would strongly prefer limiting this feature to US-ASCII (for UInt8 and Int8) and full Unicode (for Int32+).

Assuming/defaulting to any particular 8-bit encoding is just asking for trouble. For example, in an 8-bit context, 0xE6 isn't at all the same as 'æ' -- depending on encoding, it can mean any of 'W', 'Ê', 'ć', 'ĉ', 'Š', 'ц', 'ن', 'ζ', 'ז', 'ๆ', 'و', 'µ', 'φ', etc. etc. etc.

In the vast majority of contexts, the old single-byte 8-bit encodings shouldn't be used at all. In the (legacy) cases where their use is unavoidable, it should not be possible to encode non-ASCII characters like 'æ' or 'é' etc. to a single byte without also specifying an explicit encoding, in a highly visible way.

Implicitly hardwiring Latin-1 into Swift's syntax seems completely unnecessary and wildly anachronistic to me. Code like let obviousBug: UInt8 = 'é' raises alarms in my brain that haven't triggered for years and years -- I associate it with the Macarena and tamagotchis.

In our brave new UTF-8 world, the default assumption should be that the 8-bit value 0xE6 is the first of three bytes in the UTF-8 encoding of '歹', or one of 4095 other ideographs.
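The two readings of 0xE6 mentioned here can be checked in today's Swift; the Latin-1 interpretation coincides with the first 256 Unicode scalars, while under UTF-8 the same byte is a lead byte:

```swift
// The byte 0xE6 read two ways: as a Latin-1 character, and as the lead
// byte of the three-byte UTF-8 sequence for '歹' (U+6B79).
let latin1 = Unicode.Scalar(UInt8(0xE6))   // 'æ', since Latin-1 maps onto U+0000–U+00FF
let utf8 = Array("歹".utf8)                // [0xE6, 0xAD, 0xB9]
```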

I'm not as strongly opinionated on 16-bit encodings, but it seems to me UCS-2 has also lost most of its lustre by now. I'd prefer to not support initializing UInt16/Int16 with character literals at all, or to limit the feature to the 128 characters available in ASCII.


(^) #221

what if we just got rid of the high bit in UInt8? then 'é' would trigger an overflow and you would have to choose a wider type like UInt16.
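One way to spell the suggested ASCII-only rule as a runtime check is sketched below; the `asciiByte` helper is hypothetical, standing in for what would really be a compile-time diagnostic:

```swift
// Hypothetical helper: a scalar may become a UInt8 only if it is ASCII;
// anything with the high bit set (like 'é') must go to a wider type.
func asciiByte(_ s: Unicode.Scalar) -> UInt8? {
    s.isASCII ? UInt8(s.value) : nil
}
```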


(Karoy Lorentey) #222

That's what I'm suggesting for UInt8 and Int8.

As I said, I see no reason to add UCS-2 support directly into Swift's syntax, either. What would be the rationale for supporting let dubious: UInt16 = 'é'?