SE-0243: Codepoint and Character Literals

If the problem were just limited to generating arrays of ASCII characters, I might agree with you, but as I said in my other post, there are a lot of other use cases that .utf8 doesn’t solve. I also don’t think new syntax should be the cost we need to be worried about; I’m a lot more concerned with potential solutions like a literal-bound 'a'.ascii, which overload existing syntax with new semantics.

These are really useful! I really don't see your point, though -- String.utf8 (and Unicode.Scalar.ascii) seem to provide perfectly elegant, safe and efficient solutions to all of them:

// storing a bytestring value
static var liga = "liga".utf8
// storing an ASCII scalar to mixed utf8-ASCII text
var xml: [UInt8] = ...
xml.append('/'.ascii)
xml.append('>'.ascii)
// ASCII range operations 
let current: UnsafePointer<UInt8> = ...
if 'a'.ascii ... 'z'.ascii ~= current.pointee {
    ...
}
// ASCII arithmetic operations 
let year: ArraySlice<UInt8> = ...
var value: Int = 0
for digit: UInt8 in year {
    guard '0'.ascii ... '9'.ascii ~= digit else {
        ...
    }
    value *= 10
    value += Int(digit - '0'.ascii)
}
// reading an ASCII scalar from mixed utf8-ASCII text 
let xml: [UInt8] = ... 
if let i: Int = xml.firstIndex(of: '<'.ascii) {
    ...
}
// matching ASCII signatures 
let c: UnsafeRawBufferPointer = ...
if c.starts(with: "PLTE".utf8) {
    ...
}

Note: I took the liberty of replacing Int8 above with UInt8. As far as I know, Int8 data typically comes from C APIs imported as CChar, which is a truly terrible type: it's documented to be either UInt8 or Int8, depending on the platform. Any code that doesn't immediately and explicitly rebind C strings to UInt8 is arguably broken.
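To be concrete about what I mean by converting at the boundary, here is a minimal sketch (the function name and shape are mine, not an existing API):

// convert an imported C string to [UInt8] at the API boundary, so nothing
// downstream ever has to touch CChar/Int8 again
func asciiBytes(_ cString: UnsafePointer<CChar>, count: Int) -> [UInt8] {
    return cString.withMemoryRebound(to: UInt8.self, capacity: count) {
        Array(UnsafeBufferPointer(start: $0, count: count))
    }
}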

I fully agree with this part; there is no need for any language feature beyond Unicode.Scalar literals. A regular .ascii property would work just fine.

In this particular topic, we keep trying to come up with needlessly complicated syntax-level solutions to a rather niche problem that'd be much better resolved through a bit of careful API design.

I agree CChar is horrible, but I don’t know enough about Swift’s C interop to say whether getting rid of it is possible. If so, we can probably drop the “overloads on return type” issue with options 1, 2, and 5.

This isn’t exactly ideal, since we’d get a heap allocation. That’s why I’ve used (UInt8, UInt8, UInt8, UInt8) for all these 32-bit ASCII string examples so far. (They’re very popular in binary file formats, since they’re the same size as a C int or float.)

I don’t think all of these issues can be solved at the standard library level. Ultimately, there are three requirements that come into play for all users of ASCII bytestrings:

  1. I don’t want my code to crash.

  2. I don’t want my code to be cryptic and indecipherable.

  3. I don’t want my code to be a sprawling inefficient mess.

API design will only give you two out of three.

If you care about 2 and 3, but not 1, then 'a'.ascii is the right solution for you. But then you’ll be vending a trapping API on all Unicode.Scalar values, regardless of context. And even though you could sweep all misuse under the “programmer error” rug, it would still be a questionable addition to the standard library, just like a trapping .first property on Array would be.

If you care about 1 and 2, but not 3, then 'a'.ascii would be the right solution for you, but you would want it to return an Optional, just like the (very problematic) .asciiValue property on Character. It doesn’t take a lot of imagination to see how cumbersome careful usage of this API could get.

If you care about 1 and 3, but not 2, then no solution is needed: you’ve probably memorized the ASCII table by now, and you should just go on plugging hex or decimal values in everywhere a UInt8 ASCII value is needed. But I don’t think anyone is advocating for that.
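To make the first two options concrete, here is roughly what the two API shapes under discussion would look like (neither exists today; this is only a sketch):

extension Unicode.Scalar {
    // cares about 2 and 3: traps on anything outside the ASCII range
    var ascii: UInt8 {
        precondition(isASCII, "scalar is not ASCII")
        return UInt8(value)
    }
    // cares about 1 and 2: returns nil instead, pushing an unwrap
    // onto every single call site
    var asciiOrNil: UInt8? {
        return isASCII ? UInt8(value) : nil
    }
}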

You definitely could insist on a standard-library-only solution to this problem, but it’d be pretty suboptimal. I ask that you keep an open mind to syntax and feature-based solutions, as they’re not all as complicated as you make them sound. Option 4 (a'a') is a relatively superficial change in the compiler that would touch nothing below the lexer/parser level. Option 3 ('a' as UInt8) changes neither syntax nor semantics; it just makes it possible for certain types to make existing syntax explicit and mandatory.

If I were to post to this thread, I’d just be restating what I wrote 50 posts ago, so I’ll post a link instead: SE-0243: Codepoint and Character Literals - #252 by johnno1962

TL;DR: a few well-chosen operators added to the stdlib can solve the problem without trying to tie down a new literal syntax. One opinion that has changed is that I’m leaning toward @michelf’s suggestion that single quotes be reserved for ASCII-only literals, to allow compile-time validation and put an end to any Unicode shenanigans.

One problem with the full "ASCII literal" suggestion I made in the pitch thread is that it allows single quotes to represent both a String and a UInt8, and thus suffers from the same problem I pointed out earlier, where UInt8('8') != ('8' as UInt8). To fix this, we'd have to amend it by either:

  1. disallowing single quote literals for integer types, or
  2. disallowing single quote literals for String

I’ve given up on integer types being expressible by quoted literals and am currently putting forward a different approach involving targeted operators, so this shouldn’t be an issue. I’m just saying that a literal form for ASCII-only strings, which I think is currently being floated as a'a', might be worthwhile.

Sure, but we are discussing the default type here, not the sole type, and there is definitely an obvious default in Swift.

It's similarly difficult to express a Character, which should be much more commonly used than a Unicode.Scalar. So I still hold the position that if it's not important to have a literal form designed primarily for Character (i.e. one that defaults to Character), then how can it be important to have one designed for Unicode.Scalar? And if I'm wrong, and Unicode.Scalar really is more important than Character in this sense, then it seems to me that the whole Swift string design must have failed.

There is no heap allocation in "liga".utf8. It’s either an immortal string or a small string. If this isn’t the case, then that’s a bug!

I’m not saying we can change it; that’s a different discussion. I’m saying that Int8 byte sequences tend to originate from CChar, but there is no reason code should keep them in that form. Swift APIs have standardized on UInt8; imported data in any other format should be converted to match.

Meh. Array vends a trapping subscript on all array values, regardless of context. The utility of a trapping .ascii property seems clear to me, and it feels similar to how and why Array.subscript doesn’t return an optional value. If ASCII-ness is non-obvious in a particular context, the (already existing) .isASCII property can be used to test for it before accessing .ascii.

Yes, a trapping property is somewhat unusual, but in this case I feel it’s the right trade-off.
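Concretely, a call site where ASCII-ness isn’t already guaranteed would look something like this (assuming the hypothetical trapping .ascii sketched earlier in the thread):

var output: [UInt8] = []
let scalar: Unicode.Scalar = "é"
if scalar.isASCII {
    output.append(scalar.ascii)                    // safe: we just checked
} else {
    output.append(contentsOf: String(scalar).utf8) // fall back to UTF-8
}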

After close to 300 messages on this thread alone, it’s still not clear to me why that must be the case, if there aren’t any clear use cases. Is Character used in any context where the lack of a syntactic shortcut is actively hurting Swift’s usability?

It’s a heap allocation because you’d have to materialize it into an [UInt8] array to be able to subscript it with an integer, and there’s nothing in the type system that tells you this array (or the original UTF-8 view) contains exactly 4 ASCII scalars. (String.UTF8View is also a really weird thing to vend in your API — aren’t types with names that end in “-View” supposed to be ephemeral by definition?)

Why do we care about subscripting and the exact 4-character length? Because the ultimate destination for these ASCII quadruples is usually a 32-bit integer slug. Code for packing these slugs gets a little more complicated once you throw String’s infamous indexing model into the mix.

// I would love to see the optimizer have a go at this, compared
// with a similar function that takes a `(UInt8, UInt8, UInt8, UInt8)`
// tuple
func slug(_ feature:String.UTF8View) -> UInt32 
{
    let index0:String.Index = feature.startIndex, 
        index1:String.Index = feature.index(after: index0), 
        index2:String.Index = feature.index(after: index1), 
        index3:String.Index = feature.index(after: index2)
    let slug:UInt32   = .init(feature[index0]) << 24
                      | .init(feature[index1]) << 16
                      | .init(feature[index2]) << 8
                      | .init(feature[index3])
    return slug
}
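For reference, the tuple-taking counterpart that the comment alludes to would be something like this (my sketch, same big-endian packing, no String indices involved):

func slug(_ feature:(UInt8, UInt8, UInt8, UInt8)) -> UInt32
{
    return UInt32(feature.0) << 24
         | UInt32(feature.1) << 16
         | UInt32(feature.2) << 8
         | UInt32(feature.3)
}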

Yes, this is a thing.
Yes, C is pretty far ahead of us in this regard.

This is not a valid comparison. As I’m sure you’re aware from the perennial “Array subscript should return optional?” pitches here, Array vends a trapping subscript because the performance penalty associated with having it return an optional would be unacceptable. This logic doesn’t apply to literals, where performance is an irrelevant concept, since (barring ICU dependencies, which don’t apply at the level we’re talking about) they are a purely compile-time construct.

There are tons of examples in the standard library of APIs which return optionals to the detriment of ergonomics, even when the optional case is a rare edge case. Unsafe${Mutable}BufferPointer.baseAddress comes to mind as an example. And I think a buffer pointer with a count of 0 is a far less common occurrence than a Unicode.Scalar outside of the ASCII range.
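For context, here is what that optionality costs at every use site, even when the buffer is known to be non-empty:

let bytes: [UInt8] = [0x50, 0x4C, 0x54, 0x45] // "PLTE"
bytes.withUnsafeBufferPointer { buffer in
    // baseAddress is Optional, so every caller has to unwrap it,
    // even though an empty buffer is a rare edge case
    guard let base = buffer.baseAddress else { return }
    print(base.pointee) // 80, i.e. 'P'
}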

Is Unicode.Scalar? Most of the use cases in the proposal and thread would favour a design where single-quoted literals were just ASCII integer literals. Fundamentally, Character and Unicode.Scalar are going to be used very similarly, and which one you use will just depend on the level at which you are processing/iterating through the string.

For Turing’s sake! Why would you ever do that? Just add your own property to UTF8View to extract a UInt32 from a four-byte ASCII sequence. You can go as wild as you like — go ahead and reinterpret the contents of the underlying contiguous buffer, if that’s what you want.
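For instance, a sketch of the kind of property I mean (the name and the validation strategy are up to you):

extension String.UTF8View {
    // pack a four-byte ASCII sequence into a big-endian UInt32 slug,
    // using the contiguous UTF-8 buffer when the string provides one
    var asciiSlug: UInt32? {
        return self.withContiguousStorageIfAvailable { buffer -> UInt32? in
            guard buffer.count == 4, buffer.allSatisfy({ $0 < 0x80 }) else { return nil }
            return UInt32(buffer[0]) << 24
                 | UInt32(buffer[1]) << 16
                 | UInt32(buffer[2]) << 8
                 | UInt32(buffer[3])
        } ?? nil
    }
}
// e.g. "liga".utf8.asciiSlug == 0x6C69_6761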

So far into the bowels of String we must go, and I thought the goal here was to make bytestrings easier.

The runtime performance penalty (if any) is irrelevant; the actual reason is that dealing with optionals would make Swift arrays highly inconvenient to use. It’s a pretty close match with .ascii.

(We should change the rationale in the commonly rejected changes list, then.)

I have been following the discussion since the original pitch. I won't pretend to understand all the nuances of Unicode and how it interacts with all the Swift character-ish types. I only want to say that, the way I see the proposal, we're not talking about strings at all. We are talking about byte sequences that happen to be both representable and represented as human-readable characters (lowercase c). The byte sequences may be chosen such that they form a mnemonic, but they are bytes first, characters second.

The way I see it, this proposal is about writing such characters in the language in a way that guarantees they map to the intended byte values (which happen to be given by the ASCII standard, but theoretically could come from anywhere). There is a strong preference for static checking of the representation to ensure the intended values are encoded.

I don't think this proposal has anything to do with strings, and changes to the language to accommodate the feature are in no way an indictment of string design. It's rather an acknowledgement of the fact that human communication is complicated, and that we use textual characters for different purposes at different times.

You insist on doing questionable micro-optimizations; I’m just showing you the way to do it right. String literals have contiguous storage, like arrays. Feel free to use it when you must.

I’d personally just use starts(with:), like I did above. Is it not good enough for your case?

No, because the case we are talking about is the

// storing a bytestring value 
static 
var liga:(Int8, Int8, Int8, Int8) 
{
    return (108, 105, 103, 97) // ('l', 'i', 'g', 'a')
}

one, not the signature matching one. (#6)

The count of 4 is part of the type information, which "liga".utf8 discards. OTF font features, like many other ASCII tags, are not stringly-typed identifiers, but strongly-typed values, whose “honest” type is actually an enum with a UInt32 rawValue backing.

(In this situation, you would also want the user to be able to supply their own (UInt8, UInt8, UInt8, UInt8) slugs in addition to the common ones defined as static vars, so String.UTF8View is absolutely the wrong currency type to use for this. For example, your library might not define CommonFeatures.scientificInferiors:UInt32/(UInt8, UInt8, UInt8, UInt8), but you’d want people to be able to supply ('s', 'i', 'n', 'f') just as easily in its place. I’m sure you’ll suggest changing the API to traffic in "liga" and "sinf" instead of ('l', 'i', 'g', 'a') and ('s', 'i', 'n', 'f'), but it would be a shame if the entire API had to be weakened to taking untyped Strings just to accommodate this.)
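To spell out what I mean by the “honest” type, something along these lines (names invented; I’m modeling the “enum” as a struct with static members precisely so that user-supplied tags fit alongside the predefined ones):

// a strongly-typed OTF feature tag backed by a UInt32 raw value
struct FeatureTag: RawRepresentable, Hashable {
    let rawValue: UInt32

    init(rawValue: UInt32) { self.rawValue = rawValue }

    // users can supply their own four-byte ASCII tags...
    init(_ bytes: (UInt8, UInt8, UInt8, UInt8)) {
        self.rawValue = UInt32(bytes.0) << 24
                      | UInt32(bytes.1) << 16
                      | UInt32(bytes.2) << 8
                      | UInt32(bytes.3)
    }

    // ...alongside the common ones the library predefines
    static let liga = FeatureTag((108, 105, 103, 97)) // ('l', 'i', 'g', 'a')
}

// a user-supplied tag the library doesn't predefine:
let sinf = FeatureTag((115, 105, 110, 102)) // ('s', 'i', 'n', 'f')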