one of the things i love about UTF-8 is that it was designed to do exactly that: you can feed UTF-8 strings into functions that only understand ASCII and get back a UTF-8 string with the non-ASCII codepoints preserved.
UTF-8 has a really simple contract:
leave the 1-prefixed bytes (0x80 ... 0xFF) alone, and you never have to worry about messing up unicode data.
so as long as a string algorithm that knows nothing about UTF-8 promises to round-trip the bytes it does not understand, it is forwards compatible with UTF-8.
this means it is always safe to:
replace an ASCII scalar with another ASCII scalar:
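for example, an ASCII-only uppercasing pass over raw UTF-8 bytes (a hypothetical sketch; asciiUppercased is just an illustrative name):

func asciiUppercased(utf8:[UInt8]) -> [UInt8]
{
    // only the ASCII range is touched; 1-prefixed bytes pass through
    // unchanged, so multi-byte scalars survive intact
    utf8.map
    {
        (0x61 ... 0x7A).contains($0) ? $0 - 0x20 : $0
    }
}

print(String(decoding: asciiUppercased(utf8: .init("naïve café".utf8)),
    as: Unicode.UTF8.self))
// NAïVE CAFé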
which is why i am not a fan of an Optional<ASCII> abstraction, because it is going out of its way to break the UTF-8 contract for no significant benefit, and because the designers of UTF-8 invested significant effort into ensuring that we would not need such an abstraction in the first place.
this also isn’t exclusive to UTF-8; UTF-16 has a similar contract for BMP algorithms:
leave the surrogates (0xD800 ... 0xDFFF) alone, and you never have to worry about messing up unicode data.
which means it is safe to run a BMP algorithm like:
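a fullwidth-to-ASCII narrowing pass over UTF-16 code units, say (another hypothetical sketch, just to illustrate the shape of such an algorithm):

func narrowed(utf16:[UInt16]) -> [UInt16]
{
    // fullwidth forms (U+FF01 ... U+FF5E) map down to their ASCII
    // counterparts; surrogate code units fall outside this range and
    // pass through unchanged, so non-BMP scalars survive intact
    utf16.map
    {
        (0xFF01 ... 0xFF5E).contains($0) ? $0 - 0xFEE0 : $0
    }
}

print(String(decoding: narrowed(utf16: .init("Ｈｅｌｌｏ 🌍".utf16)),
    as: Unicode.UTF16.self))
// Hello 🌍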
and i think it is precisely because many developers do not understand unicode encoding that people make a lot of mistakes with Unicode.Scalar, because we do not have ASCII/BMP literals, we only have Unicode.Scalar literals and people end up writing findSection(utf16:) like:
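(the following is an illustrative sketch; the point is the hand-conversion of the scalar literal through Unicode.Scalar)

func findSection<UTF16>(utf16:UTF16) -> String?
    where UTF16:BidirectionalCollection<UInt16>
{
    // '§' has to be spelled as a Unicode.Scalar literal and converted by hand
    utf16.firstIndex(of: UInt16(exactly: ("§" as Unicode.Scalar).value)!)
        .map
    {
        String(decoding: utf16.suffix(from: $0), as: Unicode.UTF16.self)
    }
}

print(findSection(utf16: "16 U.S.C. § 42".utf16) as Any)
// Optional("§ 42")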
but then if someone comes along and tries to port this algorithm to UTF-8, they might just replace the UInt16.init(exactly:) with UInt8.init(exactly:) and come up with something like:
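(again an illustrative sketch, with only the integer conversion changed)

func findSection<UTF8>(utf8:UTF8) -> String?
    where UTF8:BidirectionalCollection<UInt8>
{
    // UInt8(exactly: 0xA7)! succeeds, which makes this port look safe
    // at first glance
    utf8.firstIndex(of: UInt8(exactly: ("§" as Unicode.Scalar).value)!)
        .map
    {
        String(decoding: utf8.suffix(from: $0), as: Unicode.UTF8.self)
    }
}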
and worse, they would probably think they had already accounted for unicode weirdness, because the ! in UInt8.init(exactly:) doesn't trap when you run this. but this implementation is wrong, because you get:
print(findSection(utf8: "16 U.S.C. § 42".utf8) as Any)
// Optional("� 42")
print(findSection(utf8: "français".utf8) as Any)
// Optional("�ais")
and if we had ASCII/BMP literals you would be less likely to make this mistake because you would get a compiler error if you copy-and-paste the '§' literal into a UInt8 context:
func findSection<UTF8>(utf8:UTF8) -> String?
    where UTF8:BidirectionalCollection<UInt8>
{
    utf8.firstIndex(of: '§')
    //                  ^~~
    // error: character literal '§' encodes a UTF-8 continuation byte
    // when stored into type 'UInt8'
        .map
    {
        .init(decoding: utf8.suffix(from: $0), as: Unicode.UTF8.self)
    }
}
Last night I pushed a commit to the reference implementation that produces an error if you try to use a single-quoted literal in a String context, in an ABI-neutral way. So problematic conversions such as 'x' + 'y' == "xy" are now an error and must be expressed using double quotes, as was the case before. I'm of two minds as to whether this is a step forward conceptually, but I can see it may make the feature more focused and easier to explain in people's minds.
I've decided to redraft the proposal with this and other changes, with far more detailed explanations of the subtleties of the ExpressibleBy protocols, and in a manner that presents two separate questions for two separate reviews, as requested by the core team last time around. I'm not sure I view the chances of both reviews passing as particularly high, but at least I'll be out of the loop in time for Christmas.
No, the protocol hierarchy is still as I described, and the type checker still finds the ASCII-to-String conversions. There is just a hard-coded check in the C++ that flags them as an error now.
Having hopefully defused the debate about Single Quoted literals being convertible to Strings with a bit of hard-coded sleight of hand, I'd like to start rehearsing arguments for the over-generalised arithmetic that becomes available on the integer values Single Quoted literals can express, as it is coming up again and again.
The first thing to note is that this form of unconstrained arithmetic on code point values is a feature of four of the five most popular programming languages in use today according to TIOBE (aside: seriously? is Python really the most widely used programming language in the world??). It is seen as a legacy concept, however, and one that Swift is "above" with its strongly abstract String model.
Perhaps the better defence for trying to introduce integer-convertible single-quoted literals into Swift is the argument for the presence of UnsafePointers in the language: something many would never, or would rather not, use, but if you need it, it's critical that the escape hatch is available. Look at the code @beccadax mentions (in the Swift compiler project, no less).
Could we create a new ASCII type on which only the "good" operators are defined? I don't think so. Apart from the ABI issues involved in introducing a new type, you lose the interoperability with buffers of integers that is a primary motivation. And while we could clearly avoid defining the multiplication and division operators, some operators (+ and -) are useful for offsetting and taking the difference of code points, so how could we avoid 'x' + 'y' working? I don't see that the solution lies there.
I guess in the end you just have to roll with it and accept that it will not be possible to prevent people from writing some absurd expressions in their code if the integer conversions are allowed. I don't believe this form of permissiveness would have many negative consequences: it is unlikely to crash your app, and it is not the sort of thing one would type inadvertently. Note that one of the nonsense expressions much discussed during review, 'a'.isMultiple(of: 2), was never actually possible, as the default type of a Single Quoted literal is Character, and that determines which methods are available.
Anyway, let me know if you find these arguments convincing or not.
The simplest argument against allowing arithmetic on ASCII characters is “why just ASCII?” What about ISO Latin 1? Or Windows-1252, which is what most text that claims to be ISO Latin 1 is actually encoded in? All Unicode codepoints have a numeric value; why not allow arithmetic directly on UnicodeScalar? Why not EBCDIC for those folks at IBM writing Swift for z/OS?
All of the reasons Swift doesn’t have arithmetic on these character encodings apply equally to ASCII.
(Maybe this discussion should be split off from the pitch thread…)
My original preference was for all code points (up to 21 bits), but there was considerable push-back on this in the review thread due to the multiple encodings Unicode allows for "the same" character. I resisted for a while but conceded it was best to stick to "just ASCII". I don't know what the solution for EBCDIC would look like (where arithmetic on code points makes no sense at all, as the letters do not have sequential code points), and that's not a problem I'm trying to solve at this point.
The utility of Single Quoted literals is partly aesthetic, but it is primarily that they are a more convenient syntax for UInt8(ascii: "\n"). The arithmetic is an unfortunate side-effect of the integer conversions, though it has its uses.
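For example, compare the spelling required today with the proposed one (the single-quoted line is commented out, since it is not valid Swift today):

let newline = UInt8(ascii: "\n")   // the spelling required today
// let newline: UInt8 = '\n'       // the proposed single-quoted spelling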
A bigger issue IMO is that a major potential audience for this feature seems to be people working with binary formats, which often use legacy encodings. But the BMP is only compatible with one specific legacy encoding: ISO 8859-1. The aesthetic appeal of 'z' - 'a' masks the complexity of understanding necessary to reason about '€' + 1. Does it depend on the encoding of the source file? Of the host machine? Of the target? Of the user's current locale?
How many people would even think to ask these questions? No matter which behavior is chosen, some large number of Windows programmers will think there’s a bug, because half of them will be expecting it to behave like other Windows legacy APIs that assume CP1252, and the other half will expect it to behave like modern, Unicode-aware APIs that incorporate ISO Latin 1.
Bitwise operations do make sense on EBCDIC. But I’m not bringing up EBCDIC because I think the API design for EBCDIC needs to be solved; I’m bringing it up because it’s part of the design space and the API needs to be designed in a way that it can eventually be accommodated.
EBCDIC is like Walter's ex-wife's Pomeranian in The Big Lebowski. He's watching it while his ex is in Hawaii with her boyfriend, so he brings it to the bowling alley.
The Dude [incredulous]: “you brought the Pomeranian bowling?!”
Walter: “I brought it bowling; I’m not renting it shoes. I’m not buying it a beer. He’s not taking your turn.”
The Dude: “If my ex-wife asked me to watch her dog while she and her boyfriend went to Honolulu […]”
Walter: “First of all, Dude, you don’t have an ex.”
Some unfortunate folks out there have to carry EBCDIC the Pomeranian around, and the rest of us don’t understand why because we were never married to its owner, IBM.
My sympathies for the IBM folk. They also have the misfortune of maintaining one of the last big-endian 64-bit architectures, which must cause them no end of problems, but that doesn't mean they should share those problems with us. In the end, the lowest common denominator one can reach for is ASCII.
Here's a weird idea... since the idea is to produce integers, we could expand the current hex, octal, and binary literals to recognize unicode and/or ascii scalar values:
let x = 0uA // unicode scalar value of "A"
let y = 0aA // ascii value of "A"
But this wouldn't really work for tabs or space characters (among others), so a quoted version should be allowed too:
let x = 0u'A' // unicode scalar value of "A"
let y = 0a'A' // ascii value of "A"
This is just another syntax to write an integer literal. Examples above would produce values as Int, the default integer type, because there's no context to infer another type.
This is an interesting idea, if a little unprecedented. For non-printables you could use \. One problem, I guess, is that typically you'd be using character literals for symbols rather than letters, which isn't going to interact well with the lexer.
Re the other languages: one of the strongest arguments for Swift's strict typing is that it avoids the problems that arise in those languages precisely because they let you do the sorts of things this proposal would add.
While you see this as an argument for the proposal, I see it as an argument against the proposal, and one of the strongest. I have years of experience with C/C++ and don't want to go back there.
I haven't thought much about what set of characters would work without quotes. But the quoted version would be necessary for many character values:
let asciiSpace = 0a' '
Maybe it'd be better to always require quotes, I don't know. The basic idea was to make it look like an integer literal so you can't argue that arithmetic is unexpected. But maybe some people will object to this regardless:
Please don't be melodramatic, nobody is asking you to make that choice. Introducing this accommodation for a particular type of coding in Swift does not deprecate any other aspect of the powerfully abstract Swift String model. The suggested feature is opt-in for those that need it.
It doesn't work like that though. If a feature is added to the language then people will use it. Then, whether I want to or not, I will encounter it – in sample code, in Swift library code, in code written by other people that I'm asked to work on and have to reason about.
I don't understand why we need an expression like 'a' + 1 in the first place. Wouldn't an expression like 'a'.asciiOffset(1) be sufficient? If we did that, we could avoid 'x' + 'y' from the beginning.
Such operations appear natural only because our expected alphabetical order happens to match the order of the ASCII code. Expressions like '*' + '1' are never natural. I think using a method is clearer, because it makes explicit that the operation relies on ASCII order.
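A rough sketch of what such a method could look like (the name asciiOffset and the exact behaviour are only illustrative):

extension Character
{
    // offsets an ASCII character within the ASCII range, returning nil
    // for non-ASCII input or out-of-range results
    func asciiOffset(_ offset: Int) -> Character?
    {
        guard let code = self.asciiValue
        else { return nil }
        let shifted = Int(code) + offset
        guard (0 ... 127).contains(shifted)
        else { return nil }
        return Character(Unicode.Scalar(UInt8(shifted)))
    }
}

print(("a" as Character).asciiOffset(1) as Any)    // Optional("b")
print(("z" as Character).asciiOffset(-25) as Any)  // Optional("a")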
i’ve gone back and reworked the proposal based on the feedback from this thread. since it is a significant departure from the proposal in its current form, i decided to post it as a new thread. here is the link for anyone interested: