Concise ASCII Usage

kyledr · May 2, 2020, 6:45pm

Forgive me if there's a better way.

Old way:

let c = Character("a").asciiValue!

New way:

let c: UInt8 = "a"

or

let c: UInt8 = 'a'

Why is the new way better?

In the old way, let c = Character("🐶🐮").asciiValue! crashes at runtime.
In the new way, let c: UInt8 = "🐶🐮" gives a compiler error.

A compiler error is preferred to a runtime crash (or failure to coalesce nil) for safety and code simplicity reasons.
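For context, here is a minimal sketch of how the "old way" behaves today. It assumes nothing beyond the standard library; `asciiOrZero` is a hypothetical helper name used only for illustration.

```swift
// Character.asciiValue is Optional: it is nil for non-ASCII characters,
// so force-unwrapping it (as in the "old way") can crash at runtime.
let a = Character("a").asciiValue      // Optional(97)
let dog = Character("🐶").asciiValue   // nil

// Handling the optional explicitly avoids the crash, at the cost of
// choosing some fallback value. `asciiOrZero` is a hypothetical helper.
func asciiOrZero(_ c: Character) -> UInt8 {
    return c.asciiValue ?? 0
}

print(a as Any, dog as Any, asciiOrZero("🐶"))
```

Either way the non-ASCII case is only discovered at runtime, which is the gap the pitch is trying to close.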

Open to thoughts on double quotes only, single quotes only, or supporting both.

woolsweater · May 2, 2020, 6:50pm

I'd like to see something like this as well.

It was pitched a while back: SE-0243: Codepoint and Character Literals, and eventually rejected as presented; the rejection suggests that it should be revised and reconsidered.

xwu · May 2, 2020, 6:52pm

There have been extensive discussions on this topic which you may be interested to read.

If, after reviewing these links, you feel like there's a new direction or perspective to be had on the topic, then by all means please do share! In that case, it can be helpful for the community if you'd write a short synopsis of what you learn from these readings so that we're not starting back at square one.

taylorswift · May 2, 2020, 7:50pm

this topic was discussed to death last year so i’m just gonna summarize the main issue with this idea, which is that this becomes allowed:

'a' + 'b' == 195

there are no good ways around this problem, since if you want to use an ascii literal as an integer, then you also have to be okay with ascii literals appearing anywhere you can use an integer

the right thing to do here is to have the 'a' literals be their own ASCII type, and then somehow have a way of safely rebinding a [UInt8] buffer to an [ASCII] buffer without copying it (since the underlying memory representation is the same). but we don’t have zero-cost abstractions in swift, so this isn’t going to work.

i’m nowadays in the camp that any solution to the 'a' as UInt8 problem is probably going to have to go through newtype or something like it
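A rough sketch of what a newtype-style wrapper could look like with today's tools. The `ASCII` type here is hypothetical (it is not in the standard library), and it borrows double-quoted scalar literals because single-quoted literals do not exist in Swift.

```swift
// Hypothetical wrapper type, for illustration only.
struct ASCII: Equatable, ExpressibleByUnicodeScalarLiteral {
    let value: UInt8

    // This sketch traps on non-ASCII scalars at runtime; a real language
    // feature could reject them at compile time instead.
    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        precondition(scalar.isASCII, "not an ASCII scalar")
        self.value = UInt8(scalar.value)
    }
}

let a: ASCII = "a"
let b: ASCII = "b"

// let bad = a + b            // does not compile: ASCII declares no `+`
let sum = a.value + b.value   // arithmetic needs an explicit opt-in
print(sum)
```

Because the literal produces an `ASCII` rather than an integer, it cannot silently appear wherever an integer is expected, which is the property asked for above.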

jrose · May 2, 2020, 8:01pm

"Zero-cost abstractions" usually refers to "zero runtime cost" (EDIT: although I suppose that's zero code cost and not zero data cost, because of Swift's runtime metadata), and we absolutely do have zero-cost abstractions in Swift—a struct with one member has the same layout as that member. I don't think that was the problem.

taylorswift · May 2, 2020, 8:05pm

what i mean is right now you have to do this

let buffer:[UInt8]
let ascii:[ASCII] = .init(buffer)

which involves copying the entire bytestring. ofc you can use the unsafe memory rebinding APIs to do this without the copy, but we shouldn’t expect people to have to drop down to withMemoryRebound(to:_:) to do ascii string processing
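To make the two options concrete, here is a sketch assuming a hypothetical single-member `ASCII` wrapper. The copying version uses `map`, since no `[ASCII]` initializer taking `[UInt8]` is assumed to exist; the non-copying version uses `withMemoryRebound(to:_:)` on the array's buffer pointer.

```swift
// Hypothetical single-member wrapper, for illustration only.
struct ASCII: Equatable {
    var codePoint: UInt8
}

// A struct with one member has the same layout as that member;
// the rebinding below is only sound because of that.
precondition(MemoryLayout<ASCII>.size == MemoryLayout<UInt8>.size)
precondition(MemoryLayout<ASCII>.stride == MemoryLayout<UInt8>.stride)

let buffer: [UInt8] = Array("hello".utf8)

// Copying conversion: builds a second array of the same length.
let copied: [ASCII] = buffer.map { ASCII(codePoint: $0) }

// Non-copying view: temporarily rebinds the array's storage to ASCII.
// The rebound pointer must not escape the closure.
let firstIsH: Bool = buffer.withUnsafeBufferPointer { bytes in
    bytes.withMemoryRebound(to: ASCII.self) { ascii in
        ascii.first == ASCII(codePoint: 0x68)   // 0x68 is "h"
    }
}
print(copied.count, firstIsH)
```

The second form avoids the copy but needs two nested unsafe closures, which is the ergonomic cost being objected to.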

Nickolas_Pohilets · May 3, 2020, 6:58am

@kyledr, would StaticString help in some of your use cases? Something like this:

let hexcodes: StaticString = "0123456789abcdef"
let buffer = UnsafeBufferPointer(start: hexcodes.utf8Start, count: hexcodes.utf8CodeUnitCount)
print(buffer[4])

Do you need to distinguish between ASCII and UTF8?
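A variation on the same idea, as a sketch: `withUTF8Buffer` scopes the pointer to a closure so it cannot escape, and `isASCII` reports whether the literal is pure ASCII, which may matter if the ASCII/UTF-8 distinction is important.

```swift
let hexcodes: StaticString = "0123456789abcdef"

// The buffer is only valid inside the closure; read what is needed and return it.
let byte: UInt8 = hexcodes.withUTF8Buffer { $0[4] }

// true when every code unit in the literal is ASCII.
let asciiOnly = hexcodes.isASCII

print(byte, asciiOnly)   // byte is the code unit for "4", i.e. 0x34
```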

Saklad5 · May 3, 2020, 5:49pm

Why do you need this, anyway? It shouldn’t be a common task.

As for your proposal, consider that there are multiple ways to interpret "a" as an integer, and if Swift was going to choose one of them it would be UTF-8.

I think the current approach is fine. Just use UInt8(ascii: "a"). If you don’t want that to fail, don’t explicitly restrict it to ASCII: UInt32("a" as Unicode.Scalar).
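The two existing spellings side by side, as a small sketch using only standard-library initializers:

```swift
// Restricted to ASCII: traps at runtime if the scalar is not ASCII.
let ascii = UInt8(ascii: "a")

// Works for any Unicode scalar, so it needs the wider UInt32.
let scalar = UInt32("a" as Unicode.Scalar)
let emoji = UInt32("🐶" as Unicode.Scalar)   // U+1F436

print(ascii, scalar, String(emoji, radix: 16))
```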

By the way, it is really weird that that UInt32 initializer doesn't have any argument labels. You'd expect it to be init(utf8:).

Lantua · May 3, 2020, 6:13pm

It is a lossless conversion. There are no labels for those cases.

Edit:

If you're feeling fancy, you can also do

.init("a") as UInt32

Saklad5 · May 3, 2020, 8:11pm

That explains a lot. Still, I would expect an exception to be made here, since there are multiple initializers taking a single parameter with a type that conforms to ExpressibleByUnicodeScalarLiteral.

taylorswift · May 3, 2020, 8:30pm

the weird thing about this is that

UInt32.init("a") as UInt32

gives you 97, but

UInt32.init("a")

gives you nil. yes i know technically this is overloading on argument type which is okay, but from the perspective of the user it’s effectively overloading on return type (init(_:) vs init?(_:)) which is just bad swift.
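The difference can be observed directly. In this sketch the only thing that changes between the two lines is the `as UInt32` coercion, which rules out the failable String-parsing initializer and leaves the Unicode.Scalar one:

```swift
// No context: the literal defaults to String, so this picks init?(_: String)
// and tries to parse "a" as a decimal number.
let viaString = UInt32.init("a")

// The coercion requires a non-optional UInt32, so only the
// init(_: Unicode.Scalar) overload fits.
let viaScalar = UInt32.init("a") as UInt32

print(type(of: viaString), viaString as Any)
print(type(of: viaScalar), viaScalar)
```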

Saklad5 · May 3, 2020, 8:46pm

That's my problem with it: the compiler happily assumes the (in this case) wrong thing, without any indication that there's an issue.

xwu · May 3, 2020, 8:54pm

Yeah this is not great and merits some reconsideration in my view.

Karl · May 3, 2020, 8:58pm

Personally, I find the idea of exposing conversions between characters and integers in the language just plain weird, especially with Swift's String model. It seems straightforward at a glance, but the more you think about it, the more confusing it becomes.

Instead, I use an ASCII struct (note: this is pretty rough, and I'm still trying to refine the API).

public struct ASCII: Equatable, Comparable {
    public var codePoint: UInt8
    @inlinable public init(_ v: UInt8) { self.codePoint = v }
}

public extension ASCII {
    // Control Characters.
    @inlinable static var null                  : ASCII { ASCII(0x00) }
    @inlinable static var startOfHeading        : ASCII { ASCII(0x01) }
    @inlinable static var startOfText           : ASCII { ASCII(0x02) }
    @inlinable static var endOfText             : ASCII { ASCII(0x03) }
    @inlinable static var endOfTransmission     : ASCII { ASCII(0x04) }
    @inlinable static var enquiry               : ASCII { ASCII(0x05) }
    @inlinable static var acknowledge           : ASCII { ASCII(0x06) }
...
    // Upper-case letters.
    @inlinable static var A: ASCII { ASCII(0x41) }
    @inlinable static var B: ASCII { ASCII(0x42) }
    @inlinable static var C: ASCII { ASCII(0x43) }
    @inlinable static var D: ASCII { ASCII(0x44) }
...
}

Additionally, I've got a bunch of heterogeneous operator overloads so you can use == between ASCII-type values, Character, and UInt8, as well as pattern-match between them, like this:

let input: String = ...

let c = input[idx]
switch c {
  case ASCII.questionMark:
    url.query = ""
    state     = .query
  case ASCII.numberSign:
    url.fragment = ""
    state        = .fragment

Like I said, it's still kind of rough, but I've found it pretty okay so far. You can try it out and see what you think: ascii prototype · GitHub
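For readers who want to see the shape of this approach without the prototype, here is a cut-down sketch. It is not the linked code: the operator set, the `State` enum, and `nextState(after:)` are assumptions made up for this example.

```swift
// Reduced, hypothetical version of the ASCII type described above.
struct ASCII: Equatable, Comparable {
    var codePoint: UInt8
    init(_ v: UInt8) { self.codePoint = v }

    static var questionMark: ASCII { ASCII(0x3F) }
    static var numberSign: ASCII { ASCII(0x23) }

    // Comparable needs `<`; the ASCII == ASCII case is synthesized.
    static func < (lhs: ASCII, rhs: ASCII) -> Bool { lhs.codePoint < rhs.codePoint }

    // Heterogeneous equality has to be written once per operand order.
    static func == (lhs: ASCII, rhs: UInt8) -> Bool { lhs.codePoint == rhs }
    static func == (lhs: UInt8, rhs: ASCII) -> Bool { lhs == rhs.codePoint }

    // Lets `case ASCII.x:` match a Character subject in a switch.
    static func ~= (pattern: ASCII, value: Character) -> Bool {
        value.asciiValue == pattern.codePoint
    }
}

// Hypothetical parser states, mirroring the switch above.
enum State { case path, query, fragment }

func nextState(after c: Character) -> State {
    switch c {
    case ASCII.questionMark: return .query
    case ASCII.numberSign:   return .fragment
    default:                 return .path
    }
}

print(nextState(after: "?"), nextState(after: "#"), nextState(after: "a"))
```

Note that each mixed-type `==` appears twice, once per operand order, and a full version would repeat that for every pair of types it supports.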

taylorswift · May 3, 2020, 9:06pm

the issue for me, which i brought up the last time this got slugged out, was that i can just never remember the full names of all the special characters

i cant find the post but i think this got ruled out pretty early on last time because it’s a very band-aidy solution and just exacerbated existing issues with over-overloaded operators

taylorswift · May 3, 2020, 9:10pm

i just don’t know what even justifies the existence of this method when you can just use Unicode.Scalar.value. it’s not like it’s that much shorter, and it’s way less clear what’s going on if you don’t include the name Unicode.Scalar

.init("a") as UInt32
("a" as Unicode.Scalar).value

Karl · May 3, 2020, 9:11pm

Oh, yeah - that happens to me quite often as well. I tried to be as objective as possible, so I took the list of names from somewhere official-looking. For example, I grew up calling # "hash" or "pound", but that list calls it "number sign" so that's what I use in that type, even though the British names are more familiar to me.

As for the operators - it works for me. Certainly the Character operators make sense IMO, but I could see some people taking issue with comparing an integer directly with an ASCII character.

taylorswift · May 3, 2020, 9:13pm

i really don’t like == with different types on the left and right hand sides. first off, it kind of contradicts the whole meaning of ==. second, it basically doubles the number of overloads you have to provide, since it has to work in both orders

Karl · May 3, 2020, 9:17pm

Does it? I'm not sure. That's certainly a matter for debate.

Besides, at some point you need to balance pedantry with usability. I certainly prefer doing this over the alternative of making character literals assignable to integer-type values!

Sure, but you do that once when you implement the type, and never again. I think it is worth paying the boilerplate cost for a more natural interface (assuming the type-checker can handle it).

xwu · May 3, 2020, 9:31pm

(Narrator voiceover: it can't.)

The longer reply to this is that, as we all know, literals + operators (see what I did there?) are a major bottleneck for type checking. Essentially any new operator overload that takes integer literals will throw some existing expressions that are just under the "too complex" limit over that limit. This is very much not a nice thing to do to users who write manifestly well-formed code; nor is the increase in compilation time. Until there is a dramatic reworking of this area of the compiler, vending additional overloads in the standard library is a big no-go.