Prepitch: Character integer literals


(Alexander Momchilov) #175

I'm familiar with Unicode, EGCs, and various encodings, and have used both high-level string implementations for Unicode-correct text and low-level raw ASCII for machine parsing. I've been around the block.

I'm trying my best to understand this proposal, and frankly, I have no idea what's going on. Here's what I understand so far, please do correct me if I'm wrong.

  • Single quotes become the preferred way of expressing X, while double quotes remain solely for strings.

    • What's "X" in this case? Character, Unicode.Scalar, (U?)Int(8|16|32|64)?? All of the above?
  • What's Unicode.Scalar for? If this single quote syntax can be used for both integer types and Unicode.Scalar, aren't they redundant to each other?

This is just dishonest. Multiple times now, I've very clearly shown that you can very simply do this:
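Roughly, for the hex-digit table that comes up again below (a sketch of the existing workaround, using only current API):

let digits: [Unicode.Scalar] = ["0", "1", "2", "3", "4", "5", "6", "7",
                                "8", "9", "a", "b", "c", "d", "e", "f"]
// Map each scalar to its ASCII value at run time.
let hexcodes: [UInt8] = digits.map { UInt8(ascii: $0) }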

What I'm most unclear about is: what problem does this solve?

Presumably, the answer is something like "to more easily initialize integers from explicitly spelt out and visibly identifiable characters". Sure, but then

  1. Why are such integers needed? What's being done with them? Understanding why people need to derive integers from characters would better inform a solution.
  2. What does this solution offer that isn't already covered by UInt8.init(ascii:)? Could those improvements be made as refinements to UInt8.init(ascii:), to avoid the need for language changes?

(^) #176

{Character, Unicode.Scalar}. Asking this of the integer types is an impossible question, as it depends on the context. Numeric literals are appropriate in a numeric context, and textual literals are appropriate in a textual context.

It is effectively a wrapper around a constrained UInt32 which provides unicode-specific functionality that would be nonsensical to put on UInt32 itself. This was the role of Unicode.Scalar before this proposal, and this proposal does not change that.
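For example, using the existing API:

let scalar: Unicode.Scalar = "é"     // U+00E9
scalar.value                         // 233: the underlying UInt32
scalar.isASCII                       // false: Unicode-specific API that would be odd on UInt32
Unicode.Scalar(0xD800 as UInt32)     // nil: surrogate code points are rejected, hence "constrained"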

If you wish to make personal attacks, you should know that Chris wrote that section, not me, so you can bring it up with him.

At any rate, you are introducing an additional layer of complexity, as now the mental model of this goes

  • Construct 16 Character objects corresponding to ['0'-'9'] + ['a'-'f']
  • Use a function UInt8.init(ascii:) to convert the Characters into UInt8 values
  • Apply this function to each Character object in the array using a map transformation.

Whereas it should have been as simple as

  • Hardcode the 16 integer values ['0'-'9'] + ['a'-'f'] as an array literal (sketched below).
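Under the pitched single-quote syntax (not current Swift), that could look something like:

// Hypothetical syntax from the pitch: each character literal converts
// directly to its ASCII value as a UInt8, checked at compile time.
let hexcodes: [UInt8] = ['0', '1', '2', '3', '4', '5', '6', '7',
                         '8', '9', 'a', 'b', 'c', 'd', 'e', 'f']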

You have also continually ignored other aspects of this problem, which, really, could be considered a form of dishonesty too. UInt8.init(ascii:) doesn’t work on codepoints like é. It won’t help you convert codepoints with higher scalar values, or give you any different integer type than UInt8. And you have still not answered the question of how we can get compile-time overflow checking by writing a run-time map transformation.

See above, as I have answered this question about 65,535 times.

Compile-time overflow checking can only be done on literal values. To allow diagnostics to “see through” multiple levels of function calls would be a major break with language precedent. Things like foo(nil) would become errors. It also would not work with ABI stability, as the compiler would have to see the implementation of everything in order to diagnose this.
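A small illustration of the distinction, using today's integer literals:

let a: UInt8 = 255   // OK: the literal is checked against UInt8's range at compile time
let b: UInt8 = 256   // error: integer literal '256' overflows when stored into 'UInt8'

// No literal is visible in a UInt8 context here, so the conversion can only
// be checked at run time (it traps on overflow):
let c = [300, 400].map { UInt8($0) }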


(^) #177

Why can’t we preserve source compatibility and just add derivation arrows ExpressibleByUnicodeScalarLiteral ← _LegacyExpressibleByUnicodeScalarLiteral and ExpressibleByCharacterLiteral ← _LegacyExpressibleByExtendedGraphemeClusterLiteral? We should probably remove the _LegacyExpressibleByUnicodeScalarLiteral → _LegacyExpressibleByExtendedGraphemeClusterLiteral relationship to keep the graph neat, but this should not affect any code, since currently Unicode.Scalar literals and Character literals look exactly the same.

Also, why have you renamed ExtendedGraphemeClusterLiteral to CharacterLiteral?


(John Holdsworth) #178

Because this could mean a double-quoted String literal could express an integer once we add the integer conformances to ExpressibleByUnicodeScalarLiteral, and because it keeps the old and new implementations separate.

It’s a better name, tied to surfaced Swift types rather than Unicode terminology.


(^) #179

This was the benefit of the extra ExpressibleBy???Literal protocol at the top of the tree.


(Michael Ilseman) #180

The protocols are a bit of a mess right now, especially the _Builtin ones, and we'd like to fix this all up prior to ABI stability to avoid compatibility hacks, or very soon after with compatibility hacks. I'm currently laser-focused on substantial ABI changes for String (essential for better byte-string-like support in the future, but way outside the scope of this thread). But I will try to help as much as I can.

I haven't followed all of the earlier talk on this thread, or paged in all of the protocols recently, but I could see something like this making sense:

                    ExpressibleByCharacterLiteral   ExpressibleByStringLiteral
FixedWidthInteger   ✅                              ❌
Unicode.Scalar      ✅                              ✅
Character           ✅                              ✅
String              ❌                              ✅

Just as with integer literals, the compiler does overflow checking when it can. FixedWidthInteger character literals only support single-scalar graphemes and need enough bit-width to encode the scalar value. The compiler continues to do a minimal-effort job of ensuring that a Character formed through either ExpressibleBy* conformance is probably a single grapheme. Examples:

        UInt8   UInt16   UInt32   Unicode.Scalar   Character
'x'     120     120      120      U+0078           x
'花'    🚫      33457    33457    U+82B1           花
'𓀎'    🚫      🚫       77838    U+1300E          𓀎
'👩‍👩‍👦‍👦'  🚫      🚫       🚫       🚫               👩‍👩‍👦‍👦
'ab'    🚫      🚫       🚫       🚫               🚫

I haven't thought about what to do about signed integers when the value ends up setting the sign bit. We'd probably want to just produce the negative number, but perhaps there's an argument for erroring instead.
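For instance, a hypothetical case under the pitched syntax: 'é' is U+00E9, i.e. 233, which does not fit in Int8's positive range.

let a: UInt8 = 'é'   // 233
let b: Int8  = 'é'   // -23 if we reinterpret the bit pattern, or a compile-time error?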


(Alexander Momchilov) #181

Ok, I'm on board with this. I was not a fan of the idea of having " continue to exist for Character and Unicode.Scalar literals. I think that would have made our quotes a conceptual mess. I like the idea of deprecating " for those uses and reserving it for String.

Yes I'm aware, and isn't it redundant to have implicit conversion to integers, when they can be added as computed properties on Unicode.Scalar? Is it the compile-time overflow checking that you're looking for?

It's not a personal attack, just frustration over a reiterated point that wasn't taken. Good to know it wasn't out of malice.

I feel like a similar argument can be made to motivate compile-time checking of URL, rather than first "going through" String.

I do acknowledge what's lacking there, and I think it's easily remedied by introducing similar initializers for the other integer types. Or, preferably (imo), computed properties on Unicode.Scalar.

Fair enough. I think that's a pretty good motivator. However, if this reasoning is taken seriously, I would expect to see an immediate push to have it extended to other similar problems with much stronger motivating cases, e.g. compile-time syntactically validated URLs, URIs, bundle file references, asset names, regular expressions, etc.

Thanks for taking the time to address my points/questions. If you don't mind, could you answer one last question, which to me is the most important:


(Michael Ilseman) #182

@taylorswift, @johnno1962, what are your thoughts on things like '\u{200D}', interpolations, multi-line and raw, for Character literals? It's conceivable that Character's conformance could construct from raw scalar values or interpolations that way. Off the cuff, I'd say this is a "rejected" direction, because users can always use the double quotes to access all of that.

For the content of the pitch itself, I feel like it can be distilled down into its essence:

This can just mention that we have awesome String literals (which are continuing to get increasingly awesome), but that it's also common in programming to want to use things that appear as characters to users in numeric contexts, via their Unicode scalar value (example: C chars).

The bytestrings concept is totally something we'll be exploring more in the future, but seems very out of place in this pitch because we're not pitching bytestrings. We can just drop it.

We can just drop this entire section. The discussion of encodings is unrelated, terminology used can just be standard terminology, canonical equivalence is unrelated, bytestrings is an unrelated concept except as a literary device for the motivation section, same for machine strings, etc.

Again, can drop bytestrings concept and encoding validity discussion, which is unrelated to this pitch except as a motivator. The motivation is simply that it's common to want to use the visual representation of a numeric value in code when that corresponds to a character.

One of the future directions for String (a more recent link escapes me, but an old one is here) is to provide performance-sensitive or low-level users with direct access to code units. In that world, it would be much nicer to have numeric-character literals for use in conjunction with this hypothetical future API:

extension String {
  func withCodeUnits<T>(_ f: (UnsafeBufferPointer<UInt8>) throws -> T) rethrows -> T { ... }
}

Character literals that can convert to UInt8 would be hugely valuable inside the body of f, compared to raw numbers in code.
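For example, a hypothetical use combining the sketched withCodeUnits API above with the pitched literal syntax:

let source = "# a comment line"
let isComment = source.withCodeUnits { units in
    units.first == '#'   // today this would have to be units.first == 35
}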

If we want to go with the tables in Prepitch: Character integer literals, then the proposed solution is fairly straightforward. The tables are pretty self-explanatory: we use single quotes for character literals, and we can list the protocol declarations under "Detailed Design". The deck-chair rearrangement necessary for source compatibility can go under the "Source Compatibility" section to keep it out of the spotlight.

Actually, this has a nice ABI impact of purging all the unnecessary intermediary protocols. We can keep the entry points, if necessary, should this not make the deadline.

It seems like several have already been debated on this thread. You can also mention that we're not going to extend anything fancy like interpolations or scalar values into the character literal syntax.

If you'd like, I can also help drive this proposal because I think it is a compelling future direction for String, if you're willing to wait a few weeks ;-)


(Michael Ilseman) #183

Pitch please!


(Tony Allevato) #184

This should absolutely be built on top of compile-time constant expressions and could be done entirely in the library once the feature is implemented and powerful enough. I'd love to see a @compilerEvaluable init(_ url: StaticString) that does exactly what you describe.
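A rough sketch of what that could look like, assuming the hypothetical @compilerEvaluable attribute and helper initializer described above actually land:

import Foundation

extension URL {
    // Hypothetical: with constant evaluation, a malformed literal URL could be
    // rejected at compile time rather than trapping at run time.
    @compilerEvaluable
    init(_ url: StaticString) {
        self.init(string: "\(url)")!
    }
}

let good = URL("https://swift.org")     // would compile
let bad  = URL("http://exa mple.org")   // would be a compile-time error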


(^) #185

Yes, no, no, and no. Most codepoints are very hard to type in source, so \u{200D} would definitely be useful. I don't see interpolations and raw being worth it for a character literal. I don't see a use-case for multi-line character literals at all.

The idea is that character literals combined with Array’s Collection conformance are a good enough bytestring API that we wouldn’t need a separate bytestring type anymore.

A lot of confusion and arguing took place earlier in the thread because a lot of people misunderstood or were speaking with different definitions, or were confused about Unicode, so I figured it’d be worth it to define them.

That was the original motivation paragraph, but that just caused everyone to argue about “C strings vs Swift strings”

I’m confused by that table.

I thought Unicode.Scalar and Character currently aren't ExpressibleByStringLiteral, and we weren't planning on making them so. String, however, is ExpressibleBy(ExtendedGraphemeCluster)Literal and ExpressibleByUnicodeScalarLiteral as well. We were planning on removing these, though.


(Michael Ilseman) #186

Currently, they are not, but they are ExpressibleByExtendedGraphemeClusterLiteral / ExpressibleByUnicodeScalarLiteral respectively. Both of those opt you into the double-quote syntax, with some compiler checking of single-grapheme / single-scalar.
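For instance, this is how a user type opts into today's double-quoted single-character syntax (a minimal example using the current protocols; the type name is made up):

struct Key: ExpressibleByExtendedGraphemeClusterLiteral {
    let value: String
    init(extendedGraphemeClusterLiteral value: String) {
        self.value = value
    }
}

let k: Key = "é"    // double quotes; the compiler checks this is a single grapheme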

This seems overcomplicated for user type conformances. I guess what I was getting at is that it would be nice if there were one protocol per syntactic form you wanted to be expressible by: one for single quotes and one for double quotes. I don't know how often a user type would want the compiler "overflow" checking (i.e. single-grapheme/single-scalar), or how best to signal this to the compiler.

As @allevato mentioned, this might be doable via @compilerEvaluable and some kind of compile-time assertion in the future, with stdlib types still checked much like integer literals. Integer literals do not have this split with one protocol per bit-width, and in hindsight, guaranteeing single-grapheme in the compiler ended up being unworkable.


#187

The way it’s done for integers is with an associated type on the Expressible protocol. To implement a conforming type, you give it an initializer that takes an instance of the associated type.

The associated type, of course, must conform to a built-in, compiler-known, standard-library-only protocol. Thus the integer literal is converted into a built-in integer type at compile-time, which is then passed into the init for the user-defined type.
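Concretely, the existing integer-literal protocol looks roughly like this in the standard library:

public protocol ExpressibleByIntegerLiteral {
    // Compiler-known protocol that the built-in integer types conform to.
    associatedtype IntegerLiteralType: _ExpressibleByBuiltinIntegerLiteral
    init(integerLiteral value: IntegerLiteralType)
}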

Essentially the same approach should work here as well: the Expressible protocol gets an associated type constrained to a compiler-known protocol which only Character, Unicode.Scalar, and certain integer types conform to.

That way, when the associated type is an integer, the compiler can verify that the literal comprises just one codepoint and does not overflow.
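A sketch of the analogous character-literal protocol (these exact declarations are illustrative, not the pitched text):

public protocol _ExpressibleByBuiltinCharacterLiteral { /* compiler-known */ }

public protocol ExpressibleByCharacterLiteral {
    // Only Character, Unicode.Scalar, and certain integer types would conform
    // to the builtin protocol, letting the compiler check single-codepoint-ness
    // and overflow when the associated type is an integer.
    associatedtype CharacterLiteralType: _ExpressibleByBuiltinCharacterLiteral
    init(characterLiteral value: CharacterLiteralType)
}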


(^) #188

Yes, we know all of this lol, it’s how the implementation works:

There are more details in the latest draft: https://gist.github.com/kelvin13/98cf9cf7e119a4684a245ecf1c982257


(Douglas Gregor) #189

I feel like you're taking a very strict interpretation of backward compatibility and "additive" that's painting the design into a corner. I'd like to address your paragraph above point-to-point to give a different perspective:

  • "[these protocols] are already spoken for so it would be a source-breaking change": yes, these protocols have an existing tie-in to double-quoted string literals. On the other hand, their use is extremely rare, and it may be that a change in behavior---while technically source-breaking---will have little practical impact. The Core Team has accepted a number of such changes (including the Swift 5 time-frame) where the benefits of moving the language forward outweigh the costs of minor source breakage.
  • "[...] would no longer be additive": yes, it's technically true, but there are two issues here. First, if the result of forcing this to be additive is that we have 5 ExpressibleByStringishLiteral protocols in a complicated hierarchy. Second, it puts enormous pressure on the process because this proposal has ABI impact in a way that is currently hard to back-deploy should it miss the "Swift 5 ABI stability" window.
  • "problems like Ints being expressible by double quoted string literals": again, this assumes strict backward compatibility. If we were to say that single-quoted literals correspond to the ExpressibleByUnicodeScalarLiteral and ExpressibleByExtendedGraphemeClusterLiteral protocols, whereas double-quoted literals correspond to 'ExpressibleByStringLiteral' (only), we don't get this problem. Perhaps there is a fallback for Swift <= 4 mode where double-quoted literals can correspond to ExpressibleByUnicodeScalarLiteral and ExpressibleByExtendedGraphemeClusterLiteral with a suitable warning.
  • "single-grapheme double quoted literals having a different default type from multi grapheme strings": The default type for double-quoted literals would remain String. For single-quoted literals it would become 'Character'.

I recommend that this proposal re-use the existing ExpressibleByUnicodeScalarLiteral and ExpressibleByExtendedGraphemeClusterLiteral protocols. Doing so has a number of advantages:

  • The proposal has less (or no) ABI impact, so we don't need to rush the process quite as thoroughly. Adding conformances for the integer types to ExpressibleByUnicodeScalarLiteral is ABI-impacting, but on a smaller scale
  • The end result is simpler: 3 protocols that are well-motivated vs. 5 protocols
  • The proposal itself is a smaller change, making it easier to review
  • Single vs. double-quoted literals are used consistently to distinguish character/code-point literals vs. string literals (still)
  • Source compatibility is mostly a matter of making the compiler cope with double-quoted literals in Swift <= 4 mode, rather than an enduring part of the standard library design (and ABI)

I'd also encourage you to remove the + and * operators from the proposal. They are not central to the proposal itself, can be independently added later, and are likely to cause a significant distraction during the review.

Doug


(John Holdsworth) #190

Hi Doug,

I’m not going to disagree with anything you said at all. In fact, further down the thread from the message you quote, I contradict myself to go with the flow and put forward the following model:

ExpressibleByUnicodeScalarLiteral        _LegacyExpressibleByUnicodeScalarLiteral
             ↓                                               ↓ 
ExpressibleByCharacterLiteral            _LegacyExpressibleByExtendedGraphemeClusterLiteral
                                                             ↓ 
                                                ExpressibleByStringLiteral

typealias StringLiteralType                         = String 
typealias _LegacyExtendedGraphemeClusterLiteralType = String
typealias _LegacyUnicodeScalarLiteralType           = String 
typealias CharacterLiteralType     = Character 
typealias UnicodeScalarLiteralType = Character

Is this something we could agree on? ExpressibleByExtendedGraphemeClusterLiteral has been renamed ExpressibleByCharacterLiteral, even though it has the same signature, because it is a much better name, freed of Unicode jargon. ExpressibleByUnicodeScalarLiteral differs only in that it takes an IntegerLiteral rather than a Builtin.UInt32, in order for the compile-time overflow detection to work. The _Legacy protocols are kept around for the stdlib to code the Swift 4 conformances of double-quoted strings against and to select the default type, but are clearly signalled as being for deprecation. In this model the way forward is clear, though anybody writing their own custom conformances to the legacy protocols (which has to be incredibly rare) will have to make minor changes to their code, so this is scarcely source-breaking.

The real problem with this process, however, is not the specifics (though the proposal should be updated) but that we need to get to review quite soon to tie these things down in time for adoption in Swift 5. An implementation of almost anything is possible in the time we have available, even if it is getting short - we only need to decide what it is.

John


(^) #191

No, the two protocols we seek to add are unconnected to the existing ExpressibleByExtendedGraphemeClusterLiteral and ExpressibleByUnicodeScalarLiteral protocols, other than that many types such as Character and Unicode.Scalar will conform to both the old protocols and the new protocols. The new protocols have no relationship to ExpressibleByStringLiteral, by design.

The old protocols are intended to be deprecated by Swift 6, so in the end we would only have 3 textual literal protocols, as before. The question is just whether we remove them in Swift 5 or in Swift 6.


(John Holdsworth) #192

I’ve dusted off and rebased the prototype implementation, squashed it into a single commit, and made the minor changes to have it conform to the model I put forward above. I’m pretty sure this is our best option and that we’ll not be able to reuse the existing protocols - certainly not to the point of being non-API-breaking - since, as @taylorswift points out, we need to break the inheritance relationship to ExpressibleByStringLiteral. The result is 3 public, well-defined protocols related to character/string literals: ExpressibleByUnicodeScalarLiteral and ExpressibleByStringLiteral, with the renamed ExpressibleByCharacterLiteral in between, plus two legacy protocols to bring Swift 4 behaviour forward in the short term.

Were it me, I’d update the proposal to adopt this model and hope we can get this in before the freeze. Introducing character literals after Swift 5 looks like it could be much more difficult, if I’m reading the tea leaves correctly.


(Xiaodi Wu) #193

The final branching for Swift 5 is tomorrow.


(John Holdsworth) #194

Commits are merged after “final” branching. My maths is based on 2 weeks for review, 1 week of adjudication, and two months of bake-in time before February. The implementation is ready. Given there seems little prospect of ABI-breaking changes after that, I don’t see an alternative other than to push.