Single Quoted Character Literals (Why yes, again)

I think a struct is probably better, but I tried an enum ASCII: UInt8 with 128 cases. There's a spare bit, so that MemoryLayout<Optional<ASCII>>.size == 1. However, only the value UInt8(0x80) is reinterpreted as nil.

The wrapper type could also be ExpressibleByIntegerLiteral, so that both case "0"..."9" and case 0x30...0x39 are supported. It could have APIs such as isUppercase and uppercased().
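For concreteness, here is a minimal sketch of the struct flavour of that wrapper (hypothetical, not an existing standard library type; the string-literal and pattern-matching conformances are omitted):

struct ASCII: Hashable, ExpressibleByIntegerLiteral
{
    var value: UInt8

    // fails for any byte with the high bit set
    init?(_ value: UInt8)
    {
        guard value < 0x80 else { return nil }
        self.value = value
    }
    // traps on a non-ASCII integer literal
    init(integerLiteral value: UInt8)
    {
        precondition(value < 0x80, "not an ASCII code point")
        self.value = value
    }

    var isUppercase: Bool { (0x41 ... 0x5A).contains(value) }

    func uppercased() -> ASCII
    {
        guard (0x61 ... 0x7A).contains(value) else { return self }
        return ASCII(integerLiteral: value &- 0x20)
    }
}

Note that, unlike the 128-case enum, a plain struct wrapper has no spare bit patterns, so MemoryLayout<Optional<ASCII>>.size would be 2 rather than 1.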

1 Like

once again, i would like to encourage everyone to keep the discussion focused on the feature being pitched and avoid delving into the meta of who is sitting on which iron throne, because that is needlessly hurtful and really not relevant here.

i think the best way to illustrate the extent of ASCII’s interoperability problems is with examples.

you see, the problem is not UInt8 or the hypothetical ASCII, the problem is actually with Equatable. because in swift we do not have a way of expressing subtype relationships with Equatable. this means we cannot do:

func findTag(html:ByteBuffer) -> Tag
{
    let start:Int? = html.readableBytesView.firstIndex(of: '<')
    ...
}
func isHTTPS(url:[UInt8]) -> Bool
{
    url.starts(with: ['h', 't', 't', 'p', 's', ':', '/', '/'])
}
func csv(text:String.UTF8View) -> [Substring.UTF8View]
{
    text.split(separator: ',')
}

because all of these standard library APIs rely on Equatable, and Equatable relies on lhs and rhs both being of type Self.
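for contrast, here is roughly what the last two of those have to look like in swift today, spelling the ASCII scalars out through UInt8(ascii:) or a UTF-8 view of a string literal:

func isHTTPS(url:[UInt8]) -> Bool
{
    url.starts(with: "https://".utf8)
}
func csv(text:String.UTF8View) -> [Substring.UTF8View]
{
    text.split(separator: UInt8(ascii: ","))
}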

it becomes evident then that if ASCII does not get along with UInt8 and 'x' cannot become UInt8, then [UInt8], ByteBufferView, and String.UTF8View must learn to speak ASCII. which seems to be the direction you hint at with:

the first problem with this is that, in swift, we generally do not support “reinterpret casts” from [T] to [U], regardless of stride or layout compatibility. this kind of interface would have to be vended as some sort of mutable collection view, and [UInt8], ByteBuffer, String, Data, etc. would all have to grow their own “ASCIIView:BidirectionalCollection<ASCII?>” wrappers.

and even if all these types managed to adopt Optional<ASCII>, in my opinion, this would not be a particularly useful abstraction.

why? because pretty much all ASCII string processing today needs to interoperate with unicode code unit sequences, and abstracting the continuation bytes/surrogate pair elements as nil values is a really bad idea.

to understand how you might shoot yourself in the foot with Optional<ASCII>, consider the following function that replaces all the ASCII newlines, tabs, etc. in a UTF-8 encoded string with space characters:

extension ASCII
{
    var withNormalizedWhitespace:Self
    {
        switch self
        {
        case '\n', '\r', '\t', '\f':
            return ' '
        case let other:
            return other
        }
    }
}

func normalizeWhitespace(utf8:inout [UInt8])
{
    for index:Int in utf8.ascii.indices
    {
        utf8.ascii[index] = utf8.ascii[index]?.withNormalizedWhitespace
    }
}

the normalizeWhitespace(utf8:) function will work fine for strings that only contain ASCII codepoints, but it will clobber all non-ASCII characters, because UInt8 -> ASCII? -> UInt8 is a destructive transformation. so Optional<ASCII> is the wrong abstraction to use here.

what would be the right abstraction? probably for UTF-8, we would want something like:

enum _UTF8CodeUnit
{
    case ascii(lowBits: _Builtin.UInt7)
    case codepointFragment(lowBits: _Builtin.UInt7)
}

(UTF-16 would look similar, but instead of ASCII it would most likely segregate by BMP.)

but, if you take a step back and look at the shape of _UTF8CodeUnit, you will realize the enum tag bit is really nothing more than a sign bit, and all we have really done is reinvent UInt8. (or more pedantically, Int8.)

and that, i imagine, is why today in the standard library, we have

typealias Unicode.UTF8.CodeUnit = UInt8
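and indeed, the whitespace normalization above falls straight out of plain UInt8 today (spelling the scalars with UInt8(ascii:) in the absence of the pitched literals): bytes with the high bit set never match any of the ASCII cases, so the non-ASCII data survives untouched:

func normalizeWhitespace(utf8:inout [UInt8])
{
    for index:Int in utf8.indices
    {
        switch utf8[index]
        {
        case UInt8(ascii: "\n"), UInt8(ascii: "\r"), UInt8(ascii: "\t"), 0x0C: // 0x0C == form feed
            utf8[index] = UInt8(ascii: " ")
        default:
            continue
        }
    }
}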

moreover, thinking of ASCII as “UInt7” is just not correct, because ASCII not only exists in the UTF-8 (UInt8) world, it also exists in the UTF-16 (UInt16) world. we would really want to be able to use ASCII literals to express UInt16, like in the following example:

func isWindowsCDrive(path:[UInt16]) -> Bool
{
    path.starts(with: ['C', ':', '\\'])
}

so, to summarize:

  • ASCII isn’t a type, ASCII is an encoding. ASCII can be 8 or 16 bits long, depending on what kind of unicode text you have.

  • having a type that is statically guaranteed to contain an ASCII codepoint is not very useful. pretty much all ASCII string processing needs to interoperate safely with unicode text, and discarding unicode data with Optional.none is actively harmful. what is needed is a way to express a UTF-8 or UTF-16 code unit using a source literal that is statically guaranteed to represent an ASCII codepoint.

  • any effort to safely model a UTF-8 or UTF-16 code unit inevitably ends up re-inventing UInt8/UInt16. therefore, UInt8 and UInt16 are the right abstraction to use for the element type of UTF-8 and UTF-16-encoded text, respectively.

6 Likes

Nobody else has made references to iron thrones. I’ve been responding to @johnno1962’s frustration about not making progress on this feature, trying to help him understand why I believe that to be the case.

Yes—sorry if I was only hinting at this, it is explicitly what I would contemplate (or at least explore the feasibility of) if we were to introduce an ASCII type—we'd want ASCII views on all of these collections-of-possibly-ASCII-bytes.

That'd be a feature, not a bug. It is user error to feed a UTF-8 string into a function that operates only on ASCII strings and expect to get a UTF-8 string back with non-ASCII code points preserved, and by contrast, this function would do exactly what I'd expect if I wanted only the ASCII code points back.

Yes, it's true that many people only think they want to handle ASCII strings and inadvertently get other input, but if they want to commit to that assumption, then with this design Swift can at least assure them of ASCII-only output.

Which, not coincidentally, is how Character.asciiValue behaves today (although I did and still do have qualms about its handling of \r\n).
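For reference, this is the current behavior (real API), including the \r\n case I mentioned:

let letter: Character = "a"
let eAcute: Character = "é"
let crlf: Character   = "\r\n"
print(letter.asciiValue as Any) // Optional(97)
print(eAcute.asciiValue as Any) // nil
print(crlf.asciiValue as Any)   // Optional(10), i.e. the value of "\n"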

Right, but I'm not trying to model a UTF-8 or UTF-16 code unit: I'm trying to model a run of ASCII text. It sounds like you're describing a totally different use case altogether—which is fine, but I'm engaging with the content as discussed here; by the looks of it, the core team was thinking of the same thing back in the day when they described as alternative designs "a trapping or nil-returning ascii property".

1 Like

Context: review discussion around this point due to the limitations of Character.asciiValue:

one of the things i love about UTF-8 is it’s designed to do exactly that: feeding UTF-8 strings into functions that only understand ASCII and getting back a UTF-8 string with non-ASCII codepoints preserved.

UTF-8 has a really simple contract:

leave the 1-prefixed bytes alone, and you never have to worry about messing up unicode data.

so as long as a string algorithm that knows nothing about UTF-8 promises to round-trip the bytes it does not understand, it is forwards compatible with UTF-8.
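concretely, the entire check such an algorithm needs is one bit:

// a byte is a complete ASCII scalar exactly when its high bit is clear;
// everything else belongs to a multi-byte scalar and must be passed
// through unchanged.
func isASCII(_ byte:UInt8) -> Bool
{
    byte & 0x80 == 0
}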

this means it is always safe to:

  1. replace an ASCII scalar with another ASCII scalar:
if  utf8[index] == ' '
{
    utf8[index] =  '_'
}

UTF-8 is also self-synchronizing, which means it’s always safe to:

  2. remove an ASCII scalar (anywhere):
if  utf8[index] == '?'
{
    utf8.remove(at: index)
}
  3. insert an ASCII scalar before another ASCII scalar:
if  utf8[index] == '+'
{
    utf8.insert(' ', at: index)
}
  4. insert an ASCII scalar after another ASCII scalar:
if  utf8[index] == '+'
{
    utf8.insert(' ', at: utf8.index(after: index))
}
  5. insert an ASCII scalar at the beginning or end of a UTF-8 string:
utf8.insert('"', at: utf8.startIndex)
utf8.append('"')

which is why i am not a fan of an Optional<ASCII> abstraction, because it is going out of its way to break the UTF-8 contract for no significant benefit, and because the designers of UTF-8 invested significant effort into ensuring that we would not need such an abstraction in the first place.

this also isn’t exclusive to UTF-8; UTF-16 has a similar contract for BMP algorithms:

leave the surrogates (0xD800 ... 0xDFFF) alone, and you never have to worry about messing up unicode data.

which means it is safe to run a BMP algorithm like:

func findSection<UTF16>(utf16:UTF16) -> String?
    where UTF16:BidirectionalCollection<UInt16>
{
    utf16.firstIndex(of: '§').map
    {
        .init(decoding: utf16.suffix(from: $0), as: Unicode.UTF16.self)
    }
}

without knowing anything about UTF-16 encoding.
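the only check such an algorithm would ever need, if it did have to inspect a unit, is the surrogate range:

func isSurrogate(_ unit:UInt16) -> Bool
{
    (0xD800 ... 0xDFFF).contains(unit)
}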

and i think it is precisely because many developers do not understand unicode encodings that people make so many mistakes with Unicode.Scalar: since we do not have ASCII/BMP literals, only Unicode.Scalar literals, people end up writing findSection(utf16:) like:

func findSection<UTF16>(utf16:UTF16) -> String?
    where UTF16:BidirectionalCollection<UInt16>
{
    utf16.firstIndex(of: UInt16.init(
        exactly: ("§" as Unicode.Scalar).value)!)
        .map
    {
        .init(decoding: utf16.suffix(from: $0), as: Unicode.UTF16.self)
    }
}

print(findSection(utf16: "16 U.S.C. § 42".utf16) as Any)
// Optional("§ 42")

but then if someone comes along and tries to port this algorithm to UTF-8, they might just replace the UInt16.init(exactly:) with UInt8.init(exactly:) and come up with something like:

func findSection<UTF8>(utf8:UTF8) -> String?
    where UTF8:BidirectionalCollection<UInt8>
{
    utf8.firstIndex(of: UInt8.init(
        exactly: ("§" as Unicode.Scalar).value)!)
        .map
    {
        .init(decoding: utf8.suffix(from: $0), as: Unicode.UTF8.self)
    }
}

and worse, they would probably think they already accounted for unicode weirdness because the ! in UInt8.init(exactly:) doesn’t trap when you run this, but this implementation is wrong because you get:

print(findSection(utf8: "16 U.S.C. § 42".utf8) as Any)
// Optional("� 42")
print(findSection(utf8: "français".utf8) as Any)
// Optional("�ais")

and if we had ASCII/BMP literals you would be less likely to make this mistake because you would get a compiler error if you copy-and-paste the '§' literal into a UInt8 context:

func findSection<UTF8>(utf8:UTF8) -> String?
    where UTF8:BidirectionalCollection<UInt8>
{
    utf8.firstIndex(of: '§')
//                      ^~~
//  error: character literal '§' encodes a UTF-8 continuation byte
//         when stored into type 'UInt8'
        .map
    {
        .init(decoding: utf8.suffix(from: $0), as: Unicode.UTF8.self)
    }
}
4 Likes

Last night I pushed a commit to the reference implementation that will produce an error, in an ABI-neutral way, if you try to use a single quoted literal in a String context. So problematic conversions such as 'x' + 'y' == "xy" are now an error and must be expressed using double quotes, as was the case before. I'm of two minds as to whether this is a step forward conceptually, but I can see it may make the feature more focused and explicable in people's minds.

I've decided to redraft the proposal with this and other changes, with far more detailed explanations of the subtleties of the ExpressibleBy protocols, and in a manner that presents two separate questions for two separate reviews, as requested by the core team last time around. I'm not sure I view the chances of both reviews passing as particularly high, but at least I'll be out of the loop in time for Christmas.

1 Like

thank you John, i think that resolves one of the biggest pain points many people have anticipated with this proposal.

does this mean that ExpressibleByASCIILiteral no longer implies ExpressibleByStringLiteral?

No, the protocol hierarchy is still as I described, and the type checker still finds the ASCII-to-string conversions. There is just a hard-coded check in C++ that flags them as an error now.

I'm trying to find the pitch thread of this previous attempt.

I can only find Three outstanding proposals (August 2019), where someone asked:

Do you have link to the current thread on Single Quote Character Literal?

but didn't receive a reply.

I never re-pitched it, since it was a follow-up to a review that had only just happened. I should clearly have been consensus-building and advocating.

Having hopefully defused the debate about Single Quoted literals being convertible to Strings with a bit of hard-coded sleight of hand, I'd like to start rehearsing arguments for the over-generalised arithmetic available on the integer values Single Quoted literals can express, as it is coming up again and again.

The first thing to note is that this form of unconstrained arithmetic on code point values is a feature of four out of the five most popular computer languages used today according to TIOBE (aside: seriously? is Python really the most widely used programming language in the world??). It is seen as a legacy concept, however, and one that Swift is "above" with its strongly abstract String model.

Perhaps the better defence for trying to introduce integer-convertible single quoted literals into Swift is the argument for the presence of UnsafePointers in the language: something many would rather never use, but if you need it, it's critical that an escape hatch is available. Look at the code @beccadax mentions (in the Swift compiler project, no less).

Could we create a new ASCII type on which it is possible to define only the "good" operators? I don't think so. Apart from the ABI issues involved in introducing a new type, you lose the interoperability with buffers of integers that is a primary motivation. While we could clearly avoid defining multiplication and division operators, some operators (+ and -) are useful for offsetting and taking the difference of code points, so how can we avoid 'x' + 'y' working? I don't see the solution lying there.

I guess in the end you just have to roll with it and accept that it will not be possible to prevent people from writing some absurd expressions in their code if the integer conversions are allowed. I don't believe this form of permissiveness would have too many negative consequences: it is unlikely to crash your app, and it is not the sort of thing one would type inadvertently. Note: one of the nonsense expressions much discussed during review was 'a'.isMultiple(of: 2). This was never possible, as the default type of a Single Quoted literal is Character, and that determines which methods are available.
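To make that last point concrete with literals that exist today, it is the (default) type that decides which members are visible, not the literal syntax:

print(("a" as Character).isLetter)         // true: Character members are available
// ("a" as Character).isMultiple(of: 2)    // error: Character has no such member
print(UInt8(ascii: "a").isMultiple(of: 2)) // false: only available once you ask for an integer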

Anyway, let me know if you find these arguments convincing or not.

1 Like

The simplest argument against allowing arithmetic on ASCII characters is “why just ASCII?” What about ISO Latin 1? Or Windows-1252, which is what most text that claims to be ISO Latin 1 is actually encoded in? All Unicode codepoints have a numeric value; why not allow arithmetic directly on UnicodeScalar? Why not EBCDIC for those folks at IBM writing Swift for z/OS?

All of the reasons Swift doesn’t have arithmetic on these character encodings apply equally to ASCII.

(Maybe this discussion should be split off from the pitch thread…)

1 Like

My original preference was to allow all potential 21-bit code points, but there was considerable push-back on this in the review thread due to the multiple encodings Unicode allows for "the same" character. I resisted it for a while but conceded it was best to stick to "just ASCII". I don't know what the solution for EBCDIC would look like (an encoding for which arithmetic on code points makes no sense at all, as the letters do not have contiguous code points), and that's not a problem I'm trying to solve at this point.

The utility of Single Quoted literals is partly aesthetic, but primarily they offer a more convenient syntax for UInt8(ascii: "\n"). The arithmetic is an unfortunate side-effect of the integer conversions, though it has its uses.
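To make the comparison concrete (the single quoted forms below are the pitched syntax, not valid Swift today):

// what you write today:
let newline = UInt8(ascii: "\n")
let digit   = UInt8(ascii: "7") - UInt8(ascii: "0") // 7

// what the pitch would allow:
// let newline: UInt8 = '\n'
// let digit = '7' - '0'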

A bigger issue IMO is that a major potential audience for this feature seems to be people working with binary formats, which often use legacy encodings. But the BMP is only compatible with one specific legacy encoding: ISO 8859-1. The aesthetic appeal of 'z' - 'a' masks the complexity of understanding necessary to reason about '€' + 1. Does it depend on the encoding of the source file? Of the host machine? Of the target? Of the user’s current locale?

How many people would even think to ask these questions? No matter which behavior is chosen, some large number of Windows programmers will think there’s a bug, because half of them will be expecting it to behave like other Windows legacy APIs that assume CP1252, and the other half will expect it to behave like modern, Unicode-aware APIs that incorporate ISO Latin 1.

Bitwise operations do make sense on EBCDIC. But I’m not bringing up EBCDIC because I think the API design for EBCDIC needs to be solved; I’m bringing it up because it’s part of the design space and the API needs to be designed in a way that it can eventually be accommodated.

EBCDIC is like Walter’s ex-wife’s Pomeranian in the Big Lebowski. He’s watching it while his ex is in Hawaii with her boyfriend, so he brings it to the bowling alley.

The Dude [incredulous]: “you brought the Pomeranian bowling?!”
Walter: “I brought it bowling; I'm not renting it shoes. I'm not buying it a beer. He's not taking your turn.”
The Dude: “If my ex-wife asked me to watch her dog while she and her boyfriend went to Honolulu […]”
Walter: “First of all, Dude, you don't have an ex.”

Some unfortunate folks out there have to carry EBCDIC the Pomeranian around, and the rest of us don’t understand why because we were never married to its owner, IBM.

1 Like

My sympathies for the IBM folk. They also have the misfortune of working on one of the last big-endian 64-bit architectures, which must cause them no end of problems, but that doesn't mean they should share those problems with us. In the end, the lowest common denominator one can reach for is ASCII.

Here's a weird idea... since the idea is to produce integers, we could expand the current hex, octal, and binary literals to recognize unicode and/or ascii scalar values:

let x = 0uA // unicode scalar value of "A"
let y = 0aA // ascii value of "A"

But this wouldn't really work for tabs or space characters (among others), so a quoted version should be allowed too:

let x = 0u'A' // unicode scalar value of "A"
let y = 0a'A' // ascii value of "A"

This is just another syntax to write an integer literal. Examples above would produce values as Int, the default integer type, because there's no context to infer another type.

2 Likes

This is an interesting idea, if a little unprecedented. For non-printables you could use \. One problem, I guess, is that typically you'd be using character literals for symbols rather than letters, which isn't going to interact well with the lexer.

Re the other languages, one of the strongest arguments for Swift's strict typing is that it avoids the problems that arise in those languages precisely because they let you do the kinds of things this proposal would add.

While you see this as an argument for the proposal, I see it as an argument against the proposal, one of the strongest. I have years of experience with C/C++ and don't want to go back there.

I haven't thought much about what set of characters would work without quotes. But the quoted version would be necessary for many character values:

let asciiSpace = 0a' '

Maybe it'd be better to always require quotes, I don't know. The basic idea was to make it look like an integer literal so you can't argue that arithmetic is unexpected. But maybe some people will object to this regardless:

assert(0a1 + 0a2 == 99)
// or
assert(0a'1' + 0a'2' == 99)

Edit: arithmetic example changed to be easier to object to.

1 Like