SE-0243: Codepoint and Character Literals

There is no reason whatsoever to introduce new literal syntax for this; String’s UTF-8 view already provides a succinct and highly efficient way to represent ASCII byte sequences.

let needle = “PNG89a”.utf8
// needle is a sequence of bytes corresponding to the 
// UTF-8 encoding of “PNG89a”. For ASCII strings,
// this is exactly the same as their 7-bit ASCII encoding 
// zero-extended to 8-bit bytes.

If the standard library doesn’t provide convenient enough APIs to match such byte sequences, then that can and should be remedied by introducing new APIs in stdlib. Inventing new syntax won’t help.
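As a rough illustration of the point, matching such a byte sequence can already be done with existing standard library APIs. The sketch below assumes the input is an ordinary [UInt8] buffer; the name `header` and its contents are made up for the example.

```swift
// Match an ASCII signature at the start of a byte buffer, using only
// existing standard library APIs.
let needle = "PNG89a".utf8

// Illustrative input: the six signature bytes followed by one payload byte.
let header: [UInt8] = [0x50, 0x4E, 0x47, 0x38, 0x39, 0x61, 0x00]

// Sequence.starts(with:) compares element by element, so the UTF-8 view
// does not have to be copied into an array first.
let matches = header.starts(with: needle)
print(matches) // true
```

The same call works on any sequence of UInt8, including raw buffer pointers.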

(The fact that this also works for non-ASCII characters still seems like a great feature to me. UTF-8 is the new ASCII.)

We do not have a similarly succinct syntax to express individual bytes between 0 and 127 by their corresponding ASCII character. If this is an important use case, then Unicode scalar literal syntax would give us that by allowing ’a’.ascii. (Character is on the wrong abstraction level for this; its asciiValue property is broken.)

Support for other legacy encodings (ISO 8859-x, EBCDIC variants, etc.) can be provided by external packages, by simply defining similar properties on Unicode.Scalar. These would work just as nicely as .ascii.
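A minimal sketch of what such properties could look like. Neither `ascii` nor `latin1` exists in the standard library; both are hypothetical names, and making them trap on out-of-range input is an assumption of this sketch, not a settled design. Since single-quoted literals do not exist in Swift today, the example uses ordinary double-quoted literals with an explicit Unicode.Scalar type.

```swift
extension Unicode.Scalar {
    /// Hypothetical trapping property; not part of the standard library.
    var ascii: UInt8 {
        precondition(isASCII, "scalar is outside the ASCII range")
        return UInt8(truncatingIfNeeded: value)
    }

    /// Hypothetical ISO 8859-1 counterpart. Latin-1 maps bytes 0x00...0xFF
    /// directly onto U+0000...U+00FF, so a range check is all that is needed.
    var latin1: UInt8 {
        precondition(value <= 0xFF, "scalar is outside the ISO 8859-1 range")
        return UInt8(truncatingIfNeeded: value)
    }
}

let slash: Unicode.Scalar = "/"
let eAcute: Unicode.Scalar = "\u{E9}"
print(slash.ascii)   // 47
print(eAcute.latin1) // 233
```

Encodings without such a direct mapping (EBCDIC variants, for example) would need a lookup table instead of a range check.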

let hello = “I’m an ASCII bytestring”.utf8

Is this unsafe or unreadable? Why?

Yes, it is unsafe. The reason why is located immediately after the I.

Great point. I’ve been typing most of my posts directly into a poor web emulation of a text editor, not a code editor. Most of my apostrophes and quotes have been converted to the proper punctuation marks for English text.

There is no need for any additional compile-time checks, though: my code above (and throughout this discussion) already won’t compile because it uses English left and right quotation marks, not the ASCII approximation that Swift requires for String literals.

(Note how the corruption exhibited in these forum posts is not related to Unicode normalization. It’s the browser trying to be helpful and work around the limitations of my keyboard, which has fewer keys than English text requires.)

If such corruption is likely enough in practical contexts to deserve special treatment, then there is a wide spectrum of possible approaches to detect it. Adding dedicated language syntax for ASCII literals to protect against it seems like a severe overreaction to me; the same practical effect can be achieved by runtime checks, possibly combined with special-cased warning diagnostics.
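One possible shape for such a runtime check, as a sketch only: `asciiBytes(_:)` is a hypothetical helper, and returning nil (instead of trapping) is an arbitrary choice made for the example.

```swift
/// Hypothetical helper: returns the UTF-8 bytes of `text` only if every
/// byte is in the 7-bit ASCII range, and nil otherwise.
func asciiBytes(_ text: String) -> [UInt8]? {
    let bytes = Array(text.utf8)
    return bytes.allSatisfy { $0 < 0x80 } ? bytes : nil
}

// A plain ASCII apostrophe passes the check.
let clean = asciiBytes("I'm an ASCII bytestring")

// U+2019 (right single quotation mark) encodes as three non-ASCII bytes,
// so the "helpfully" corrected text is rejected.
let corrupted = asciiBytes("I\u{2019}m an ASCII bytestring")

print(clean != nil, corrupted == nil) // true true
```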

If the problem were just limited to generating arrays of ASCII characters, I might agree with you, but as I said in my other post, there are a lot of other use cases that .utf8 doesn’t solve. I also don’t think new syntax should be the cost we need to be worried about; I’m a lot more concerned with potential solutions like “literal-bound” 'a'.ascii, which overload existing syntax with new semantics.

These are really useful! I really don't see your point, though -- String.utf8 (and Unicode.Scalar.ascii) seem to provide perfectly elegant, safe and efficient solutions to all of them:

// storing a bytestring value
static var liga = "liga".utf8
// storing an ASCII scalar to mixed utf8-ASCII text
var xml: [UInt8] = ...
xml.append('/'.ascii)
xml.append('>'.ascii)
// ASCII range operations 
let current: UnsafePointer<UInt8> = ...
if 'a'.ascii ... 'z'.ascii ~= current.pointee {
    ...
}
// ASCII arithmetic operations 
let year: ArraySlice<UInt8> = ...
var value: Int = 0
for digit: UInt8 in year {
    guard '0'.ascii ... '9'.ascii ~= digit else {
        ...
    }
    value *= 10
    value += Int(digit - '0'.ascii)
}
// reading an ASCII scalar from mixed utf8-ASCII text 
let xml: [UInt8] = ... 
if let i: Int = xml.firstIndex(of: '<'.ascii) {
    ...
}
// matching ASCII signatures 
let c: UnsafeRawBufferPointer = ...
if c.starts(with: "PLTE".utf8) {
    ...
}

Note: I took the liberty of replacing Int8 above with UInt8. As far as I know, Int8 data typically comes from C APIs imported as CChar, which is a truly terrible type: it's documented to be either UInt8 or Int8, depending on the platform. Any code that doesn't immediately and explicitly rebind C strings to UInt8 is arguably broken.
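For comparison, the digit-parsing case can be written in Swift as it exists today using the standard library's trapping UInt8(ascii:) initializer. This is only a sketch: `parseDecimal(_:)` is an illustrative name, and it makes no attempt to guard against overflow on very long inputs.

```swift
/// Illustrative helper: parses a run of ASCII decimal digits.
/// Returns nil if any byte is not an ASCII digit.
func parseDecimal(_ digits: ArraySlice<UInt8>) -> Int? {
    let zero = UInt8(ascii: "0")
    let nine = UInt8(ascii: "9")
    var value = 0
    for digit in digits {
        guard zero ... nine ~= digit else { return nil }
        value = value * 10 + Int(digit - zero)
    }
    return value
}

let year = Array("2019".utf8)[...]
print(parseDecimal(year) as Any) // Optional(2019)
```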

I fully agree with this part; there is no need for any language feature beyond Unicode.Scalar literals. A regular .ascii property would work just fine.

In this particular topic, we keep trying to come up with needlessly complicated syntax-level solutions to a rather niche problem that'd be much better resolved through a bit of careful API design.

I agree CChar is horrible but I don’t know enough about Swift’s C interop to say if getting rid of it is possible. If so, we can probably drop the “overloads on return type” issue with options 1, 2, and 5.

This isn’t exactly ideal since we’d get a heap allocation. That’s why I’ve used (UInt8, UInt8, UInt8, UInt8) for all these 32-bit ASCII string examples so far. (They’re very popular in binary file formats since they’re the same size as a C int or float.)
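To make the tuple representation concrete, here is a small sketch of packing such a four-byte tag into a 32-bit integer. The `Tag` alias and the `pack(_:)` function are illustrative names, and big-endian byte order is assumed.

```swift
// A four-byte ASCII tag stored inline, with no heap allocation involved.
typealias Tag = (UInt8, UInt8, UInt8, UInt8)

/// Packs the tag into a UInt32, first byte in the most significant position.
func pack(_ tag: Tag) -> UInt32 {
    return UInt32(tag.0) << 24
         | UInt32(tag.1) << 16
         | UInt32(tag.2) << 8
         | UInt32(tag.3)
}

// "liga" spelled out as ASCII code units: l, i, g, a.
let liga: Tag = (0x6C, 0x69, 0x67, 0x61)
print(String(pack(liga), radix: 16)) // 6c696761
```

Spelling the tag as hex values is exactly the readability cost the thread is arguing about.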

I don’t think all of these issues can be solved at the standard library level. Ultimately, three concerns come into play for all users of ASCII bytestrings:

  1. I don’t want my code to crash.

  2. I don’t want my code to be cryptic and indecipherable.

  3. I don’t want my code to be a sprawling inefficient mess.

API design will only give you two out of three.

If you care about 2 and 3, but not 1, then 'a'.ascii is the right solution for you. But then you’ll be vending a trapping API on all Unicode.Scalar values, regardless of context. And even though you could sweep all misuse under the “programmer error” rug, it would still be a questionable addition to the standard library, just like a trapping .first property on Array would be.

If you care about 1 and 2, but not 3, then 'a'.ascii would be the right solution for you, but you would want it to return an Optional, just like the (very problematic) .asciiValue property on Character. It doesn’t take a lot of imagination to see how cumbersome careful usage of this API could get.

If you care about 1 and 3, but not 2, then no solution is needed; you’ve probably memorized the ASCII table by now, and you should just go on plugging in hex or decimal values everywhere a UInt8 ASCII value is needed. But I don’t think anyone is advocating for that.
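The three trade-offs can be approximated with APIs that exist today. In this sketch the standard library's UInt8(ascii:) initializer stands in for the proposed trapping 'a'.ascii, and Character.asciiValue stands in for the optional-returning variant; the variable names are illustrative.

```swift
let scalar: Unicode.Scalar = "a"
let character: Character = "a"

// Readable and efficient, but traps at run time on non-ASCII input.
let trapping = UInt8(ascii: scalar)

// Readable and crash-free, but every use site has to unwrap.
var unwrapped: UInt8 = 0
if let byte = character.asciiValue {
    unwrapped = byte
}

// Crash-free and efficient, but the reader has to know the ASCII table.
let hardCoded: UInt8 = 0x61

// One quirk of asciiValue: the single Character "\r\n" reports 0x0A.
let crlf: Character = "\r\n"
print(trapping, unwrapped, hardCoded, crlf.asciiValue as Any) // 97 97 97 Optional(10)
```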

You definitely could insist on a standard library-only solution to this problem, but it’d be pretty suboptimal. I ask that you keep an open mind to syntax and feature-based solutions, as they’re not all as complicated as you make them sound. Option 4 (a'a') is a relatively superficial change in the compiler that would touch nothing below the lexer/parser level. Option 3 ('a' as UInt8) changes neither syntax nor semantics, it just makes it possible for certain types to make existing syntax explicit and mandatory.

If I was to post to this thread I’d be restating what I wrote 50 posts ago so I’ll just post a link: SE-0243: Codepoint and Character Literals

TL;DR: a few well-chosen operators added to stdlib can solve the problem without trying to tie down a new literal syntax. One opinion that has changed is that I’m leaning toward @michelf’s suggestion that single quotes be reserved for ASCII-only literals to allow compile-time validation and put an end to any Unicode Shenanigans.

One thing that won't work if you take the full "ASCII literal" suggestion I made in the pitch thread is that it allows single quotes to represent both a String and a UInt8 and thus suffers from the same problem I pointed out earlier where UInt8('8') != ('8' as UInt8). To fix this we'd have to amend it by either:

  1. disallowing single quote literals for integer types, or
  2. disallowing single quote literals for String
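The ambiguity is easy to reproduce with existing APIs, which gives a feel for the problem. In this sketch the string-parsing initializer plays the role of UInt8('8') and the standard library's UInt8(ascii:) plays the role of '8' as UInt8; single-quoted literals themselves are not valid Swift today.

```swift
// Parsing the text "8" as a decimal number yields the integer 8.
let parsed = UInt8("8")

// Taking the ASCII code of the scalar "8" yields 56 (0x38).
let encoded = UInt8(ascii: "8")

print(parsed as Any, encoded) // Optional(8) 56
```

If both spellings accepted the same quoted literal, the two results would differ depending only on which initializer the type checker happened to pick.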

I’ve given up on Integer types being expressible by quoted literals and am currently putting forward a different approach involving targeted operators, so this shouldn’t be an issue. I’m just saying that a literal form for ASCII-only strings, which I think is currently being floated as a’a’, might be worthwhile.

Sure, but we are discussing the default type here, not the sole type, and there is definitely an obvious default in Swift.

It's similarly difficult to express a Character, which should be much more commonly used than a Unicode.Scalar, so I still hold the position that if it's not important to have a literal form designed primarily for Characters (i.e. defaults to Character) then how can it be important to have one designed for Unicode.Scalars? And if I'm wrong, and Unicode.Scalar really is more important than Character in this sense, then it seems to me that the whole Swift string design must have failed.

There is no heap allocation in "liga".utf8. It’s either an immortal string or a small string. If this isn’t the case, then that’s a bug!

I’m not saying we can change it; that’s a different discussion. I’m saying that Int8 byte sequences tend to originate from CChar, but there is no reason code should keep them in that form. Swift APIs have standardized on UInt8; imported data in any other format should be converted to match.
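A sketch of the kind of conversion meant here, written so that it compiles whether CChar is Int8 or UInt8 on the target platform. The sample data comes from String.utf8CString purely for illustration; real input would typically arrive from an imported C API.

```swift
// utf8CString yields CChar elements, including the trailing NUL.
// The sample text contains U+00E9, whose UTF-8 bytes (0xC3, 0xA9) are
// negative when CChar is a signed type.
let cString: [CChar] = Array("caf\u{E9}".utf8CString)

// Drop the terminator and reinterpret each element as an unsigned byte.
// truncatingIfNeeded keeps the bit pattern for both signed and unsigned CChar.
let bytes: [UInt8] = cString.dropLast().map { UInt8(truncatingIfNeeded: $0) }

print(bytes) // [99, 97, 102, 195, 169]
```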

Meh. Array vends a trapping subscript on all array values, regardless of context. The utility of a trapping .ascii property seems clear to me, and it feels similar to how/why Array.subscript doesn’t return an optional value. In case asciiness is non-obvious in a particular context, then the (already existing) .isASCII property can be used to test for it before accessing .ascii.
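A sketch of that check-then-access pattern, using the existing isASCII property and the standard library's trapping UInt8(ascii:) initializer in place of the proposed .ascii property. The function name `asciiByte(of:)` is illustrative.

```swift
/// Illustrative helper: tests for ASCII first, so the trapping
/// conversion below can never actually trap.
func asciiByte(of scalar: Unicode.Scalar) -> UInt8? {
    guard scalar.isASCII else { return nil }
    return UInt8(ascii: scalar)
}

let greaterThan = asciiByte(of: ">")      // ASCII, so a byte comes back
let eAcute = asciiByte(of: "\u{E9}")      // not ASCII, so nil

print(greaterThan as Any, eAcute as Any) // Optional(62) nil
```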

Yes, a trapping property is somewhat unusual. But in this case I feel it’s the right trade off.

After close to 300 messages on this thread alone, it’s still not clear to me why that must be the case, if there aren’t any clear use cases. Is Character used in any context where the lack of a syntactic shortcut is actively hurting Swift’s usability?

It’s a heap allocation because you’d have to materialize it to an [UInt8] array to be able to subscript it with an integer, and there’s nothing in the type system that tells you this array (or the original UTF-8 view) contains exactly 4 ASCII scalars. (String.UTF8View is also a really weird thing to vend in your API — aren’t types with names that end in “-View” supposed to be ephemeral by definition?)

Why do we care about subscripting and the exact 4-character length? Because the ultimate end destination for these ASCII quadruples is usually a 32-bit integer slug. Code for packing these slugs gets a little more complicated once you throw String’s infamous indexing model into the mix.

// i would love to see the optimizer have a go at this, compared 
// with a similar function that takes a `(UInt8, UInt8, UInt8, UInt8)` 
// tuple
func slug(_ feature:String.UTF8View) -> UInt32 
{
    let index0:String.Index = feature.startIndex, 
        index1:String.Index = feature.index(after: index0), 
        index2:String.Index = feature.index(after: index1), 
        index3:String.Index = feature.index(after: index2)
    let slug:UInt32   = .init(feature[index0]) << 24
                      | .init(feature[index1]) << 16
                      | .init(feature[index2]) << 8
                      | .init(feature[index3])
    return slug
}

Yes, this is a thing.
Yes, C is pretty far ahead of us in this regard.

This is not a valid comparison. As I’m sure you’re aware from the perennial “Array subscript should return optional?” pitches here, Array vends a trapping subscript because the performance penalty associated with having it return an optional would be unacceptable. This logic doesn’t apply to literals, where performance is an irrelevant concept, since (barring ICU dependencies, which don’t apply at the level we’re talking about) they are a purely compile-time construct.

There are tons of examples in the standard library of APIs which return optionals to the detriment of ergonomics, even when the optional case is a rare edge case. Unsafe${Mutable}BufferPointer.baseAddress comes to mind as an example. And I think a buffer pointer with a count of 0 is a far less common occurrence than a Unicode.Scalar outside of the ASCII range.
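For readers unfamiliar with the baseAddress example, this is roughly what the ergonomic cost looks like in practice. The function `firstByte(of:)` is a made-up example, not an API from the thread.

```swift
/// Made-up example: reads the first byte of an array through a buffer pointer.
func firstByte(of bytes: [UInt8]) -> UInt8? {
    return bytes.withUnsafeBufferPointer { buffer -> UInt8? in
        // baseAddress is Optional, so it has to be unwrapped even though
        // it is only ever nil for an empty buffer.
        guard let base = buffer.baseAddress, buffer.count > 0 else {
            return nil
        }
        return base.pointee
    }
}

print(firstByte(of: [0x3C, 0x2F]) as Any) // Optional(60)
print(firstByte(of: []) as Any)           // nil
```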

Is Unicode.Scalar? Most of the use cases in the proposal and thread would favour a design where single-quoted literals were just ASCII integer literals. Fundamentally, Character and Unicode.Scalar are going to be used very similarly, and which one you use will just depend on the level at which you are processing/iterating through the string.

For Turing’s sake! Why would you ever do that? Just add your own property to UTF8View to extract a UInt32 from a four-byte ASCII sequence. You can go as wild as you like: go ahead and reinterpret the contents of the underlying contiguous buffer, if that’s what you want.
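One way such a property might look, as a sketch under stated assumptions: the name `fourCC` is hypothetical, big-endian packing is assumed, and returning nil for input that is not exactly four ASCII bytes is a choice made for the example. It iterates the view directly instead of reinterpreting the underlying buffer.

```swift
extension String.UTF8View {
    /// Hypothetical property: packs exactly four ASCII bytes into a UInt32,
    /// first byte most significant. Returns nil for any other input.
    var fourCC: UInt32? {
        guard count == 4 else { return nil }
        var result: UInt32 = 0
        for byte in self {
            guard byte < 0x80 else { return nil }
            result = result << 8 | UInt32(byte)
        }
        return result
    }
}

print("liga".utf8.fourCC as Any)  // Optional(1818846049)
print("ligature".utf8.fourCC as Any) // nil
```

1818846049 is 0x6C696761, the bytes of "liga" in order.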

So far into the bowels of String we must go, and I thought the goal here was to make bytestrings easier.

The runtime performance penalty (if any) is irrelevant; the actual reason is that dealing with optionals would make Swift arrays highly inconvenient to use. It’s a pretty close match with .ascii.
