If the problem were just limited to generating arrays of ASCII characters, I might agree with you, but as I said in my other post, there are a lot of other use cases that .utf8 doesn’t solve. I also don’t think new syntax should be the cost we need to be worried about; I’m a lot more concerned with potential solutions like “literal-bound 'a'.ascii”, which overload existing syntax with new semantics.
These are really useful! I really don't see your point, though. String.utf8 (and Unicode.Scalar.ascii) seem to provide perfectly elegant, safe and efficient solutions to all of them:
// storing a bytestring value
static var liga = "liga".utf8

// storing an ASCII scalar to mixed utf8-ASCII text
var xml: [UInt8] = ...
xml.append('/'.ascii)
xml.append('>'.ascii)

// ASCII range operations
let current: UnsafePointer<UInt8> = ...
if 'a'.ascii ... 'z'.ascii ~= current.pointee {
    ...
}

// ASCII arithmetic operations
let year: ArraySlice<UInt8> = ...
var value: Int = 0
for digit: UInt8 in year {
    guard '0'.ascii ... '9'.ascii ~= digit else {
        ...
    }
    value *= 10
    value += Int(digit - '0'.ascii)
}

// reading an ASCII scalar from mixed utf8-ASCII text
let xml: [UInt8] = ...
if let i: Int = xml.firstIndex(of: '<'.ascii) {
    ...
}

// matching ASCII signatures
let c: UnsafeRawBufferPointer = ...
if c.starts(with: "PLTE".utf8) {
    ...
}
Note: I took the liberty of replacing Int8 above with UInt8. As far as I know, Int8 data typically comes from C APIs imported as CChar, which is a truly terrible type: it's documented to be either UInt8 or Int8, depending on the platform. Any code that doesn't immediately and explicitly rebind C strings to UInt8 is arguably broken.
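For illustration, a minimal sketch of normalizing imported CChar data to [UInt8]; the function name and parameters here are hypothetical stand-ins for whatever a C API hands back:

func normalized(cString: UnsafePointer<CChar>, count: Int) -> [UInt8] {
    let chars = UnsafeBufferPointer(start: cString, count: count)
    // UnsafeRawBufferPointer views the same memory as a collection of UInt8,
    // regardless of whether CChar is Int8 or UInt8 on the current platform.
    return Array(UnsafeRawBufferPointer(chars))
}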
I fully agree with this part; there is no need for any language feature beyond Unicode.Scalar literals. A regular .ascii property would work just fine.
In this particular topic, we keep trying to come up with needlessly complicated syntax-level solutions to a rather niche problem that'd be much better resolved through a bit of careful API design.
I agree CChar is horrible, but I don’t know enough about Swift’s C interop to say whether getting rid of it is possible. If so, we can probably drop the “overloads on return type” issue with options 1, 2, and 5.
This isn’t exactly ideal, since we’d get a heap allocation. That’s why I’ve used (UInt8, UInt8, UInt8, UInt8) for all these 32-bit ASCII string examples so far. (They’re very, very popular in binary file formats, since they’re the same size as a C int or float.)
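As a quick sanity check of the size claim, using the standard MemoryLayout API (the tuple holds the 'l' 'i' 'g' 'a' bytes used elsewhere in this thread):

let liga: (UInt8, UInt8, UInt8, UInt8) = (108, 105, 103, 97) // 'l', 'i', 'g', 'a'
assert(MemoryLayout.size(ofValue: liga) == 4) // same size as a 32-bit C int or float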
I don’t think all of these issues can be solved at the standard library level. Ultimately, there are three requirements that come into play for all users of ASCII bytestrings:

1. I don’t want my code to crash.
2. I don’t want my code to be cryptic and indecipherable.
3. I don’t want my code to be a sprawling, inefficient mess.

API design will only give you two out of three.
If you care about 2 and 3, but not 1, then 'a'.ascii is the right solution for you. But then you’ll be vending a trapping API on all Unicode.Scalar values, regardless of context. And even though you could sweep all misuse under the “programmer error” rug, it would still be a questionable addition to the standard library, just like a trapping .first property on Array would be.
If you care about 1 and 2, but not 3, then 'a'.ascii would be the right solution for you, but you would want it to return an Optional, just like the (very, very problematic) .asciiValue property on Character. It doesn’t take a lot of imagination to see how cumbersome careful usage of this API could get.
If you care about 1 and 3, but not 2, then no solution is needed: you’ve probably memorized the ASCII table by now, and you should just go on plugging hex or decimal values in everywhere a UInt8 ASCII value is needed. But I don’t think anyone is advocating for that.
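For concreteness, here is a minimal sketch of the two API shapes from the first two cases (the property names asciiTrapping and asciiOrNil are made up; neither exists in the standard library):

extension Unicode.Scalar {
    // the trapping shape (cases 2 and 3): concise, but crashes on non-ASCII input
    var asciiTrapping: UInt8 {
        precondition(self.isASCII, "scalar is not ASCII")
        return UInt8(self.value)
    }
    // the optional shape (cases 1 and 2): safe, but pushes an unwrap onto every call site
    var asciiOrNil: UInt8? {
        return self.isASCII ? UInt8(self.value) : nil
    }
}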
You definitely could insist on a standard-library-only solution to this problem, but it’d be pretty suboptimal. I ask that you keep an open mind to syntax- and feature-based solutions, as they’re not all as complicated as you make them sound. Option 4 (a'a') is a relatively superficial change in the compiler that would touch nothing below the lexer/parser level. Option 3 ('a' as UInt8) changes neither syntax nor semantics; it just makes it possible for certain types to make existing syntax explicit and mandatory.
If I were to post to this thread I’d just be restating what I wrote 50 posts ago, so I’ll post a link instead: SE-0243: Codepoint and Character Literals - #252 by johnno1962

TL;DR: a few well-chosen operators added to the stdlib can solve the problem without trying to tie down a new literal syntax. One opinion that has changed is that I’m leaning toward @michelf’s suggestion that single quotes be reserved for ASCII-only literals, to allow compile-time validation and put an end to any Unicode shenanigans.
One thing that won't work is taking the full "ASCII literal" suggestion I made in the pitch thread as-is: it allows single quotes to represent both a String and a UInt8, and thus suffers from the same problem I pointed out earlier, where UInt8('8') != ('8' as UInt8). To fix this we'd have to amend it by either:
- disallowing single-quote literals for integer types, or
- disallowing single-quote literals for String
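To spell out the mismatch with a sketch (the single-quoted spelling is hypothetical and appears only in a comment):

let parsed = UInt8("8")        // Optional(8): parses the decimal string "8"
// let coerced = '8' as UInt8  // would be 56, the ASCII code of the character 8,
//                             // so the two spellings would disagree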
I’ve given up on integer types being expressible by quoted literals and am currently putting forward a different approach involving targeted operators, so this shouldn’t be an issue. I’m just saying that a literal form for ASCII-only strings, which I think is currently being floated as a'a', might be worthwhile.
Sure, but we are discussing the default type here, not the sole type, and there is definitely an obvious default in Swift.
It's similarly difficult to express a Character, which should be much more commonly used than a Unicode.Scalar, so I still hold the position that if it's not important to have a literal form designed primarily for Characters (i.e. defaults to Character), then how can it be important to have one designed for Unicode.Scalars? And if I'm wrong, and Unicode.Scalar really is more important than Character in this sense, then it seems to me that the whole Swift string design must have failed.
There is no heap allocation in "liga".utf8. It’s either an immortal string or a small string. If this isn’t the case, then that’s a bug!
I’m not saying we can change it; that’s a different discussion. I’m saying that Int8 byte sequences tend to originate from CChar, but there is no reason code should keep them in that form. Swift APIs have standardized on UInt8; imported data in any other format should be converted to match.
Meh. Array vends a trapping subscript on all array values, regardless of context. The utility of a trapping .ascii property seems clear to me, and it feels similar to how and why Array’s subscript doesn’t return an optional value. If ASCII-ness is non-obvious in a particular context, the (already existing) .isASCII property can be used to test for it before accessing .ascii.

Yes, a trapping property is somewhat unusual. But in this case I feel it’s the right trade-off.
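That pattern would look roughly like the following sketch; the trapping .ascii accessor is hypothetical, so the line using it is commented out in favour of the equivalent that exists today:

var bytes: [UInt8] = []
let scalar: Unicode.Scalar = "/"
if scalar.isASCII {
    // bytes.append(scalar.ascii)      // the proposed trapping property
    bytes.append(UInt8(scalar.value))  // equivalent today, safe after the isASCII check
}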
After close to 300 messages on this thread alone, it’s still not clear to me why that must be the case if there aren’t any clear use cases. Is Character used in any context where the lack of a syntactic shortcut is actively hurting Swift’s usability?
It’s a heap allocation because you’d have to materialize it into a [UInt8] array to be able to subscript it with an integer, and there’s nothing in the type system that tells you this array (or the original UTF-8 view) contains exactly 4 ASCII scalars. (String.UTF8View is also a really weird thing to vend in your API; aren’t types with names that end in “-View” supposed to be ephemeral by definition?)

Why do we care about subscripting and the exact 4-character length? Because the ultimate destination for these ASCII quadruples is usually a 32-bit integer slug. Code for packing these slugs gets a little more complicated once you throw String’s infamous indexing model into the mix.
// i would love to see the optimizer have a go at this, compared
// with a similar function that takes a `(UInt8, UInt8, UInt8, UInt8)`
// tuple
func slug(_ feature:String.UTF8View) -> UInt32
{
    let index0:String.Index = feature.startIndex,
        index1:String.Index = feature.index(after: index0),
        index2:String.Index = feature.index(after: index1),
        index3:String.Index = feature.index(after: index2)

    let slug:UInt32 = .init(feature[index0]) << 24
                    | .init(feature[index1]) << 16
                    | .init(feature[index2]) << 8
                    | .init(feature[index3])
    return slug
}
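For comparison, a sketch of the tuple-taking version mentioned in the comment above (same packing logic, written in the same style; the tuple convention comes from earlier posts):

func slug(_ feature:(UInt8, UInt8, UInt8, UInt8)) -> UInt32
{
    let slug:UInt32 = .init(feature.0) << 24
                    | .init(feature.1) << 16
                    | .init(feature.2) << 8
                    | .init(feature.3)
    return slug
}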
Yes, this is a thing.
Yes, C is pretty far ahead of us in this regard.
This is not a valid comparison. As I’m sure you’re aware from the perennial “Array subscript should return optional?” pitches here, Array vends a trapping subscript because the performance penalty associated with having it return an optional would be unacceptable. This logic doesn’t apply to literals, where performance is an irrelevant concept, since (barring ICU dependencies, which don’t apply at the level we’re talking about) they are a purely compile-time construct.

There are tons of examples in the standard library of APIs which return optionals to the detriment of ergonomics, even when the optional case is a rare edge case. Unsafe${Mutable}BufferPointer.baseAddress comes to mind as an example. And I think a buffer pointer with a count of 0 is a far less common occurrence than a Unicode.Scalar outside of the ASCII range.
Is Unicode.Scalar? Most of the use cases in the proposal and thread would favour a design where single-quoted literals were just ASCII integer literals. Fundamentally, Character and Unicode.Scalar are going to be used very similarly, and which one you use will just depend on the level at which you are processing/iterating through the string.
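To illustrate the two levels with existing String views (a trivial sketch; nothing here depends on the proposal):

let text = "café"
for character in text {             // Character level: grapheme clusters
    _ = character
}
for scalar in text.unicodeScalars { // Unicode.Scalar level: Unicode scalar values
    _ = scalar
}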
For Turing’s sake! Why would you ever do that? Just add your own property to UTF8View to extract a UInt32 from a four-byte ASCII sequence. You can go as wild as you like: go ahead and reinterpret the contents of the underlying contiguous buffer, if that’s what you want.
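One possible shape of such a property, as a sketch (slug32 is a made-up name, and this version simply loops over the bytes rather than reinterpreting the underlying buffer):

extension String.UTF8View {
    // Packs exactly four bytes, big-endian, into a UInt32; nil for any other length.
    var slug32: UInt32? {
        guard self.count == 4 else { return nil }
        var slug: UInt32 = 0
        for byte in self {
            slug = slug << 8 | UInt32(byte)
        }
        return slug
    }
}

// "liga".utf8.slug32 == 0x6c69_6761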
So far into the bowels of String we must go, and I thought the goal here was to make bytestrings easier.
The runtime performance penalty (if any) is irrelevant; the actual reason is that dealing with optionals would make Swift arrays highly inconvenient to use. It’s a pretty close match with .ascii.
(We should change the rationale in the commonly rejected changes list, then.)
I have been following the discussion since the original pitch. I won't pretend to understand all the nuances of Unicode and how it interacts with all the Swift character-ish types. I only want to say that the way I see the proposal, we're not talking about strings at all. We are talking about byte sequences that happen to be both representable and represented as human-readable characters (lowercase c). The byte sequences may be chosen such that they form a mnemonic, but they are bytes first, characters second.
The way I see it, this proposal is about expressing such characters in the language so that they produce the intended byte values (which happen to be given by the ASCII standard, but could theoretically come from anywhere). There is a strong preference for static checking of the representation, to ensure the intended values are encoded.
I don't think this proposal has anything to do with strings, and changes to the language to accommodate the feature are in no way an indictment of string design. It's rather an acknowledgement of the fact that human communication is complicated, and that we use textual characters for different purposes at different times.
You insist on doing questionable micro-optimizations; I’m just showing you the way to do it right. String literals have contiguous storage, like arrays. Feel free to use it when you must.
I’d personally just use starts(with:), like I did above. Is it not good enough for your case?
No, because the case we are talking about is the
// storing a bytestring value
static
var liga:(Int8, Int8, Int8, Int8)
{
    return (108, 105, 103, 97) // ('l', 'i', 'g', 'a')
}
one, not the signature-matching one (#6).
The count of 4 is part of the type information, which "liga".utf8 discards. OTF font features, like many other ASCII tags, are not stringly-typed identifiers but strongly-typed values, whose “honest” type is actually an enum with a UInt32 rawValue backing.
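In other words, something along these lines (a sketch; the type and case names are made up):

enum FontFeature: UInt32 {
    case liga = 0x6c69_6761 // 'l' 'i' 'g' 'a'
    case sinf = 0x7369_6e66 // 's' 'i' 'n' 'f'
}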
(In this situation, you would also want the user to be able to supply their own (UInt8, UInt8, UInt8, UInt8) slugs in addition to the common ones defined as static vars, so String.UTF8View is absolutely the wrong currency type to use for this. For example, your library might not define CommonFeatures.scientificInferiors:UInt32/(UInt8, UInt8, UInt8, UInt8), but you’d want people to be able to supply ('s', 'i', 'n', 'f') just as easily in its place. I’m sure you’ll suggest changing the API to traffic in "liga" and "sinf" instead of ('l', 'i', 'g', 'a') and ('s', 'i', 'n', 'f'), but it would be a shame if the entire API had to be weakened to taking untyped Strings just to accommodate this.)
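A sketch of the kind of currency type being argued for, with both predefined slugs and caller-supplied tuples (all names are hypothetical, and the quadruples are spelled as raw bytes since the character-literal syntax doesn’t exist yet):

struct FeatureTag {
    var rawValue: UInt32

    init(_ bytes: (UInt8, UInt8, UInt8, UInt8)) {
        self.rawValue = UInt32(bytes.0) << 24
                      | UInt32(bytes.1) << 16
                      | UInt32(bytes.2) << 8
                      | UInt32(bytes.3)
    }

    // common features ship as static constants...
    static let liga = FeatureTag((0x6c, 0x69, 0x67, 0x61)) // 'l' 'i' 'g' 'a'
}

// ...but callers can still pass their own quadruples:
let sinf = FeatureTag((0x73, 0x69, 0x6e, 0x66)) // 's' 'i' 'n' 'f'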