Prepitch: Character integer literals


#135

Code is read more often than it is written. So, when you’re reading some code, how can you tell, say, which of these literal strings is ASCII?

let a = "ЈFІF"
let b = "JFⅠF"
let c = "JꓝIꓝ"
let d = "JFIF"
let e = "JϜΙϜ"
let f = "JF‌IF"

(Xiaodi Wu) #136

If being able to pick out such differences in code written by others is important to you, then that's a great argument that integers should never be expressible by character or string literals. Put another way, if you find that argument persuasive, it is an argument against the use case advanced by @taylorswift for character literals.


(BJ Homer) #137

A lot of the use cases in this thread might be solved by an AsciiString* type, which would be initializable with a string literal and fail for any non-ASCII characters.

let a: AsciiString = "JFIF"

The use cases where someone wants to compare strings byte-wise with incoming data don't really line up with needing Unicode support, so perhaps a separate type is in order? It could also be integer-subscriptable since each character would be fixed-width. Reasonable use cases for integer subscripting of strings usually involve an assumption that the string will be ASCII-compatible.
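
A minimal sketch of what I have in mind (the name and the details here are just placeholders, not a full design):

struct AsciiString: ExpressibleByStringLiteral {
    private var bytes: [UInt8]

    init(stringLiteral value: String) {
        // Fails loudly on anything outside ASCII; a real design might
        // prefer a failable initializer or a compile-time check instead.
        precondition(value.unicodeScalars.allSatisfy({ $0.isASCII }),
                     "AsciiString literal contains non-ASCII characters")
        bytes = value.unicodeScalars.map { UInt8($0.value) }
    }

    // Fixed-width storage makes integer subscripting well-defined.
    subscript(index: Int) -> UInt8 { return bytes[index] }

    var count: Int { return bytes.count }
}

let jfif: AsciiString = "JFIF"
print(jfif[0]) // 74, the byte value of "J"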

Anyway, I'm not sure exactly how such a type would affect the desire for Character integer literals, but it seems like it might address a need here.


*Technically, "ASCII" only refers to the bottom 128 characters of the 256 possibilities in a single byte. It would be nice to have a better name to indicate that it covers the top half as well, but I'm not sure what an appropriate name would be.


(John Holdsworth) #138

I’m sad to see this pitch get mired in specifics and falter again. For a pithy, high-level summary of the issues in play, you need look no further than this:

@xwu, if it is your opinion that no attempt should be made to build a bridge between character literals and their ASCII/codepoint values, then this was never a pitch you were going to find attractive, but character literals are of little use without this feature. It's a debate that needed to be had, and we all understand you're trying to protect the Swift string model from dilution, but would it be possible to sit this one out for now, rather than saying the same thing again and again, so we can see where this pitch leads us? I for one think this is a worthwhile idea that I would like to see go to review, where you'll have a chance to have your say.

To that end, I’ve created a “Detailed Design” paragraph with an example use case to drop into Chris’ draft.

Detailed Design

These new character literals will have default type Character and be statically checked to contain only a single extended grapheme cluster. They will be processed largely as if they were a short String.

When the Character is representable by a single Unicode codepoint (a 21-bit number), however, it will also be able to express a Unicode.Scalar or any of the integer types, provided the codepoint value fits into that type.

As an example:

let a = 'a' // This will have type Character
let s: Unicode.Scalar = 'a' // is also possible
let i: Int8 = 'a' // takes the ASCII value

In order to implement this, a new protocol, ExpressibleByCodepointLiteral, is created and used for character literals that contain only a single codepoint, in place of ExpressibleByExtendedGraphemeClusterLiteral. Conformances to this protocol for Unicode.Scalar, Character and String will be added to the standard library so these literals can operate in any of those roles. In addition, conformances to ExpressibleByCodepointLiteral will be added to all integer types in the standard library, so a character literal can initialize variables of integer type (subject to a compile-time range check) or satisfy the type checker for arguments of these integer types.
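
Roughly, the shape would be something like the sketch below. It is only illustrative: the real version needs compiler support for the literal syntax and for the compile-time range check, so the run-time trap here stands in for what would actually be a compile error.

protocol ExpressibleByCodepointLiteral {
    init(codepointLiteral value: Unicode.Scalar)
}

// An integer conformance would look roughly like this, except that the
// range check happens at compile time instead of trapping at run time.
extension UInt8: ExpressibleByCodepointLiteral {
    init(codepointLiteral value: Unicode.Scalar) {
        self.init(value.value) // UInt32 -> UInt8, traps if the codepoint doesn't fit
    }
}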

These new conformances and the existing operators defined in the Swift language will make the following code possible out of the box:

import Foundation // for Data and NSError

func unhex(_ hex: String) throws -> Data {
    guard hex.utf16.count % 2 == 0 else {
        throw NSError(domain: "odd characters", code: -1, userInfo: nil)
    }

    func nibble(_ char: UInt16) throws -> UInt16 {
        switch char {
        case '0' ... '9':
            return char - '0'
        case 'a' ... 'f':
            return char - 'a' + 10
        case 'A' ... 'F':
            return char - 'A' + 10
        default:
            throw NSError(domain: "bad character", code: -1, userInfo: nil)
        }
    }

    var out = Data(capacity: hex.utf16.count/2)
    var chars = hex.utf16.makeIterator()
    while let next = chars.next() {
        try out.append(UInt8((nibble(next) << 4) + nibble(chars.next()!)))
    }

    return out
}

One area which may involve ambiguity is the + operator, which can mean either String concatenation or addition on integer values. Generally this wouldn't be a problem, as most reasonable contexts will give the type checker enough information to make the correct decision.

print('1' + '1' as String) // prints 11
print('1' + '1' as Int) // prints 98

#139

This implies that ("1" as Character) + ("1" as Character) as String should work, unless you can also write var str: String = '1'. That would again fold together string literals and character literals, which is the status quo. No?


(Ladislas de Toldi) #140

Thanks @taylorswift for the proposal and thanks @johnno1962 for the insightful summary.

Being able to write let i: UInt8 = 'a' would be a game changer for people working a lot with embedded devices and serial/Bluetooth/BLE communication.

We basically just send and receive bytes, and sometimes it makes a lot of sense to use the ASCII representation to make the code clearer.

One super simple example using Arduino. Imagine you have the following code running on your board:

while (Serial.available() > 0) {
  uint8_t c = Serial.read();

  if (c == '+') {
    digitalWrite(LED_BUILTIN, HIGH);
  }
  else if (c == '-') {
    digitalWrite(LED_BUILTIN, LOW);
  }

  delay(250);
}

Right now on the Swift side, I need to write this to make the LED blink:

let buffer: [UInt8] = [43, 45, 43, 45, 43, 45, 43, 45, 43, 45]
Darwin.write(fd, buffer, buffer.count)

It would be much clearer with:

let buffer: [UInt8] = ['+', '-', '+', '-', '+', '-', '+', '-', '+', '-']
Darwin.write(fd, buffer, buffer.count)

It's just an example but as things get more complex, it could really be helpful.


(John Holdsworth) #141

'1' + '1' as String works because String has an ExpressibleByCodepointLiteral conformance in the prototype and uses the String + String operator. Perhaps I should remove that conformance, and then there wouldn’t be any ambiguity.


(Frank Swarbrick) #142

ByteChar? ByteValue?


(^) #143

I have taken to calling them “byte strings”, but it doesn’t really have a good name, as the upper half of that range wasn’t well standardized before Unicode came along.


(Alexander Momchilov) #144

Again, you could just do this:

let buffer = ["+", "-", "+", "-", "+", "-", "+", "-", "+", "-", ].map(UInt8.init(ascii:))
Darwin.write(fd, buffer, buffer.count)

or

let plusAndMinusAscii = ["+", "-"].map(UInt8.init(ascii:))
let buffer = (0..<500).flatMap { _ in plusAndMinusAscii }
Darwin.write(fd, buffer, buffer.count)

or even

extension StringProtocol {
	var ascii: [UInt8] {
		return unicodeScalars.map { unicodeScalar in 
			guard unicodeScalar.isASCII else {
				fatalError("Tried to get ascii code of non-ascii unicode scalars.")
			}
			return UInt8(unicodeScalar.value)
		}
	}
}

let buffer = "+-+-+-+-+-+-".ascii
print(buffer)

(^) #145

Because the codepoint version is transparent, whereas this one constructs Character objects and tries to transform them back into UInt8s at run time? It’s like initializing integers by casting Float literals. It’ll work, but why would that ever be considered ideal? Casting Float literals at the very least still has the possibility that the whole chain could be optimized out by the compiler, whereas such an optimization cannot be done for Character, since grapheme checking needs to happen at run time: since 3.0, Swift has linked against the system ICU library for this.

And don’t forget, a grapheme that is invalid for this purpose (but still a perfectly valid Character) is a run-time error in your example, whereas an invalid codepoint literal is a compile-time error.
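
To make that concrete (the last line is the pitched syntax, not current Swift, so it is commented out):

// Today: a non-ASCII scalar only blows up when the code actually runs.
let euro: Unicode.Scalar = "€"
let byte = UInt8(ascii: euro)   // run-time trap: U+20AC isn't ASCII

// Under the pitch: a codepoint that doesn't fit the target type is
// rejected by the compiler instead.
// let byte: UInt8 = '€'        // compile-time error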


(Alexander Momchilov) #146

The solution to that is to add compiler optimizations that allow these to be converted at compile time. I don't see why it needs new language syntax. And why stop at ASCII? What about ISO 8859? What about other character encodings?


(John Holdsworth) #147

The proposal does not stop at ASCII. Any codepoint is valid provided it fits into the target type.

let ascii: Int8 = 'a' // 97
let latin1: UInt8 = 'ÿ' // 255
let utf16: UInt16 = 'Ƥ' //  420
let emoji: UInt32 = '🙂' // 128578

(Michel Fortin) #148

At least this last one can sort of be written as:

let emoji: UnicodeScalar = "🙂"

... assuming you don't mind using UnicodeScalar in lieu of UInt32. They're both the same thing under the hood.
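
And if it's the UInt32 you're after, the scalar's value property gets you there:

let emoji: UnicodeScalar = "🙂"
print(emoji.value) // 128578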


(Alexander Momchilov) #149

So a multitude of encodings would be supported by this, and which encoding is used is an implicit function of the character and the destination datatype's size?

I'm sure there would be ambiguous cases where the same character exists in two encodings with the same size. How would such a case be disambiguated?

Again, all this complexity, just to avoid something like a UInt8.init(ascii:) call? No thanks.


(^) #150

What? There is no such thing as duplicate codepoints in Unicode, only equivalent encodings and equivalent grapheme compositions, which are precisely among the existing problems this proposal is designed to help solve. Perhaps you are confusing Unicode codepoints with Unicode code units or Unicode graphemes?
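
In Swift terms, the distinction looks roughly like this:

// Two different scalar sequences that form the same grapheme cluster:
let precomposed = "\u{E9}"   // "é" as the single codepoint U+00E9
let decomposed = "e\u{301}"  // "e" followed by a combining acute accent

print(precomposed == decomposed)        // true: the same grapheme, so the same String
print(precomposed.unicodeScalars.count,
      decomposed.unicodeScalars.count)  // 1 2: different codepoints
print(precomposed.utf8.count,
      decomposed.utf8.count)            // 2 3: different code units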


(Michel Fortin) #151

I don't think anyone suggested adding multiple encodings. @johnno1962's example includes a Latin 1 character simply because Unicode code points that fit into one byte are equivalent to Latin 1. It obviously won't work for any other encoding (unless you count ASCII) because Unicode only has this particular relationship with Latin 1.
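
A quick illustration of that one-byte overlap:

// Unicode assigns U+0000...U+00FF the same values as Latin 1, so a
// codepoint that fits in a byte is already its Latin 1 encoding.
let yDiaeresis: Unicode.Scalar = "ÿ" // U+00FF
print(yDiaeresis.value)              // 255, i.e. 0xFF in Latin 1 as well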


(Chris Lattner) #152

Thank you so much for driving this forward, John, and I apologize for abandoning you with this before. I would also really love to see this make progress and am thrilled you're pushing on it (I just don't have time to dedicate to it). Thank you thank you thank you! :slight_smile:

-Chris


(Chris Lattner) #153

+1 to this design.

-Chris


(Alexander Momchilov) #154

I was responding to this comment. Prepitch: Character integer literals

My question was: why do we highlight ASCII and Latin 1? Why should those two encodings get such special treatment from the language? If there is a need to initialize integers from characters, I would like to see a generic mechanism that supports arbitrary encodings, and one that doesn't waste the ' sigil.