Prepitch: Character integer literals

michelf · October 21, 2018, 4:40pm

The Swift type Character is indeed a bit funny because its underlying data structure needs to be similar to a string (since graphemes can be made of many code points). But that's an implementation detail. Surely nobody would oppose a BigInt type initializable from an integer literal, even though its underlying implementation is likely using an array of smaller integers. I think the same principle should hold for a Character made of multiple code points.
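
To make the BigInt analogy concrete, here is a minimal sketch, assuming a made-up type `TinyBigInt` (not a real library type) that stores base-256 digits in an array yet is still initializable from an integer literal:

```swift
// Hypothetical type for illustration only; not part of the standard library.
struct TinyBigInt: ExpressibleByIntegerLiteral, Equatable {
    // Little-endian base-256 digits: the storage is an array,
    // but that stays an implementation detail of the type.
    var digits: [UInt8]

    init(integerLiteral value: UInt64) {
        var remaining = value
        var result: [UInt8] = []
        repeat {
            result.append(UInt8(remaining & 0xFF))
            remaining >>= 8
        } while remaining != 0
        digits = result
    }
}

let n: TinyBigInt = 65_536
print(n.digits) // [0, 0, 1]
```

The literal syntax gives no hint of the array underneath, which is the same property being argued for with a multi-code-point Character.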

I do like that character literals are meant to represent "characters" in the very general sense of a "unit of a string". If you're working at the grapheme level, a "character" is a grapheme. If you're operating at the code point level (like in an XML parser), a "character" is a code point for the purpose of this algorithm. If you're working with ASCII or some binary format, a "character" is a one-byte code point. Regardless of which level I'm working with, it's likely I'll do things like firstIndex(of: ':'), and it'd be a bit strange for the character literal to have a different syntax depending on whether or not I'm working at the grapheme level.
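
As a rough illustration of the three levels with today's double-quoted syntax, the sketch below searches the same string for `:` as a Character, as a Unicode.Scalar, and as a byte. The variable names are made up for the example:

```swift
let line = "key: value"

// Grapheme level: elements are Character.
let graphemeIndex = line.firstIndex(of: ":")

// Code point level: elements are Unicode.Scalar.
let scalarIndex = line.unicodeScalars.firstIndex(of: ":")

// Byte level: elements are UInt8, so the literal needs an explicit conversion.
let byteIndex = line.utf8.firstIndex(of: UInt8(ascii: ":"))

print(line.distance(from: line.startIndex, to: graphemeIndex!))           // 3
print(line.utf8.distance(from: line.utf8.startIndex, to: byteIndex!))     // 3
```

Only the byte-level search needs a different spelling for the same character, which is the inconsistency described above.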

That said, I agree there's some logic in reserving 'x' for only code points because of the semantic discontinuity between a code point and Character: the former behaves like an integer while the latter does not. If it is important to express this difference in the literal syntax, then Character could keep the double quote and single quotes could be reserved for code points. But I'm not sure the integer behavior of code points is important enough to justify that.

taylorswift · October 21, 2018, 5:06pm

The integer literal checking is currently completely magical in the compiler. is that a pattern we actually want to copy? i was actually thinking of another proposal to split up the integer literal protocol.

Also, ExpressibleByCodepointLiteral cannot replace both ExpressibleByUnicodeScalarLiteral and ExpressibleByExtendedGraphemeClusterLiteral, if you still want let flag:Character = '🇦🇶' to work: we need a version of the single-quoted literal protocol that provides a grapheme cluster as the Character initializer argument.

The first drafts of this proposal originally did not allow multiple codepoints in the literal, but many people wanted to be able to write Character literals with single quotes, and if codepoint literals should be able to express some Characters, then they should be able to express all of them. Otherwise we don’t gain the ability to deprecate the old double-quoted literal syntax, and Character will be in the awkward position of having two spellings depending on its internal structure, which is exactly what the type tries to abstract away.
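
For reference, a short sketch with current (double-quoted) Swift syntax showing that the flag above is a single Character built from two scalars, which is why a one-codepoint protocol cannot express it:

```swift
// The Antarctica flag, written with escapes so the two scalars are visible.
let flag: Character = "\u{1F1E6}\u{1F1F6}"

let scalars = Array(flag.unicodeScalars)
print(scalars.count)                                // 2
print(scalars.map { String($0.value, radix: 16) })  // ["1f1e6", "1f1f6"]
print(String(flag).count)                           // 1
```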

johnno1962 · October 21, 2018, 5:43pm

Yes, it does exactly what we want in terms of checking that codepoint values fit into target types. Honest.

At present ExpressibleByCodepointLiteral is only for codepoint literals. The prototype uses the existing String protocols to generate Character instances. Character literals have a dual nature and follow completely different paths through the compiler after an initial determination whether or not it is a single codepoint. If this sounds complicated and hacky, it’s the smallest change you can make to fit in with the existing compiler code. The PR is running at only 170 changed lines. It may not be the final implementation but it satisfies the various requirements.

taylorswift · October 21, 2018, 5:53pm

I don’t understand your standard library implementation, can you explain:

extension Character : ExpressibleByCodepointLiteral {
  public init(codepointLiteral value: IntegerLiteralType) {
    self.init(Unicode.Scalar(_value: UInt32(value)))
  }
}

how do we write multi-codepoint literals?

extension Unicode.Scalar : ExpressibleByIntegerLiteral {
  public init(integerLiteral value: Int) {
    self.init(_value: UInt32(value))
  }
}

how do we check for invalid Int values? Is this guaranteed elsewhere? Why not truncatingIfNeeded:? Also, is let u:Unicode.Scalar = 65 an intentional side effect?

extension String : ExpressibleByCodepointLiteral {
  public init(codepointLiteral value: UInt32) {
    self.init(Unicode.Scalar(_value: value))
  }
}

I don’t think we should have this. There’s no reason for let s:String = 'a' to work if we already have let s:String = "a".

cukr · October 21, 2018, 6:36pm

In my mind Character is fundamentally something more than just a code point, not as implementation detail, but that isn't relevant to the discussion.

Oh! So I shouldn't think about 'f' as a code point literal, or as a Character literal, but as a string unit that could be interpreted as any of these? That clears a lot for me.

:(

Yeah. That would be very bad.

johnno1962 · October 21, 2018, 8:29pm

I’m happy you’re looking at the prototype… The key code is:

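  case '\'': {
    const char *TokStart = CurPtr - 1;
    // Single codepoint: hand it to the integer literal path.
    if (lexCodepointLiteral(CurPtr, TokStart) != ~0U)
      return formToken(tok::integer_literal, TokStart);
    // Otherwise rewind and lex the quoted text via the string literal path.
    CurPtr = TokStart + 1;
    lexStringLiteral();
    // Strip the quotes and require a single extended grapheme cluster.
    StringRef Text = NextToken.getText().drop_front().drop_back();
    if (!unicode::isSingleExtendedGraphemeCluster(Text))
      diagnose(TokStart, diag::lex_character_not_cluster);
    return;
  }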

taylorswift:

extension Character : ExpressibleByCodepointLiteral {
  public init(codepointLiteral value: IntegerLiteralType) {
    self.init(Unicode.Scalar(_value: UInt32(value)))
  }
}

how do we write multi-codepoint literals?

They go down a different path and use the existing String protocols.

taylorswift:

extension Unicode.Scalar : ExpressibleByIntegerLiteral {
  public init(integerLiteral value: Int) {
    self.init(_value: UInt32(value))
  }
}
how do we check for invalid Int values? Is this guaranteed elsewhere? Why not truncatingIfNeeded:? Also, is let u:Unicode.Scalar = 65 an intentional side effect?

The integer literal code path looks after validating int values. You should never use truncatingIfNeeded:. Looks like you’ve got an old version. That Unicode.Scalar conformance to Integer literal shouldn’t be there. The current prototype is here.
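
To illustrate what is at stake in the validation question, here is a small sketch using only the public, failable `Unicode.Scalar(_: UInt32)` initializer (not the unchecked `_value:` initializer the prototype uses internally). It is an illustration of the value ranges involved, not a description of what the prototype's literal path does:

```swift
let valid: UInt32 = 0x41          // "A"
let surrogate: UInt32 = 0xD800    // a surrogate code point, not a scalar value
let tooLarge: UInt32 = 0x11_0000  // past the end of the Unicode range

print(Unicode.Scalar(valid) == "A")      // true
print(Unicode.Scalar(surrogate) == nil)  // true
print(Unicode.Scalar(tooLarge) == nil)   // true

// truncatingIfNeeded: keeps only the low bits and never fails,
// so an out-of-range value silently becomes a different one.
print(UInt8(truncatingIfNeeded: 0x1F600 as UInt32)) // 0
```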

taylorswift:

extension String : ExpressibleByCodepointLiteral {
  public init(codepointLiteral value: UInt32) {
    self.init(Unicode.Scalar(_value: value))
  }
}
I don’t think we should have this. There’s no reason for let s:String = 'a' to work if we already have let s:String = "a".

I agree but this is the only reason ‘1’ + ‘1’ is able to give a String result. There is actually no Character + Character operator in Swift at the moment.
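
A quick sketch of the situation in current Swift, using double-quoted literals: two Characters cannot be added directly, so the result depends entirely on which type the operands are converted to first.

```swift
let one: Character = "1"

// let sum = one + one   // does not compile: there is no + for (Character, Character)

// Going through String concatenates.
let text = String(one) + String(one)

// Going through the ASCII values adds 49 + 49.
let bytes = UInt8(ascii: "1") + UInt8(ascii: "1")

print(text)   // 11
print(bytes)  // 98
```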

taylorswift · October 21, 2018, 9:02pm

I think the protocol graph in the standard library needs more work. In particular, i noticed you inserted a new protocol ExpressibleByCharacterLiteral in between ExpressibleByUnicodeScalarLiteral and ExpressibleByExtendedGraphemeClusterLiteral. And your ExpressibleByCodepointLiteral protocol isn’t in the graph at all.

I’m fine with dropping ExpressibleByUnicode{8, 16}Literal if you think sending it through IntegerLiteralType and the existing integer literal compiler code is better from an implementation standpoint.

Tell me if i’m wrong, but you seem to be adding single-quoted literals as a new case in the lexer and assigning them to new literal types CodepointLiteral and CharacterLiteral. I think this just adds confusion, especially since we’re foreseeing the transition period to last until Swift 6. A better idea i think is:

currently double-quoted literals get lexed to {UnicodeScalarLiteral, ExtendedGraphemeClusterLiteral, StringLiteral}. We should map them instead to {_LegacyUnicodeScalarLiteral, _LegacyExtendedGraphemeClusterLiteral, StringLiteral}, where _LegacyUnicodeScalarLiteral and _LegacyExtendedGraphemeClusterLiteral are under-the-hood types we introduce to tide-over source compatibility.
we should introduce single-quoted literals to the lexer and map them to {CodepointLiteral, UnicodeScalarLiteral, ExtendedGraphemeClusterLiteral}. Note that CodepointLiteral is a terrible name since codepoints are a superset of unicode scalars, not the other way around, but that’s a problem for later.
make ExpressibleByUnicodeScalarLiteral inherit from CodepointLiteral & _ExpressibleByLegacyUnicodeScalarLiteral, and make ExpressibleByExtendedGraphemeClusterLiteral inherit from ExpressibleByUnicodeScalarLiteral & _ExpressibleByLegacyExtendedGraphemeClusterLiteral. This provides backwards compatibility with the double-quoted literals.

I don’t think having two protocols in the standard library one called “ExpressibleByExtendedGraphemeClusterLiteral” and the other called “ExpressibleByCharacterLiteral” is a good idea.

hmm. I really wonder if it is worth muddying up the character literal system by making String conform to it just to avoid '1' + '1' == 98. I think there is a strong case to prevent people from encountering "1" + "1" == 98. The case against '1' + '1' == 98 is less clear.

We could also consider adding that operator as part of the proposal. It’s not the weirdest idea in the world. We could also define * on Character × Int while we’re at it so that 'a' * 5 == "aaaaa", which is fairly precedented in other languages. Of course that opens the door to 'a' * 'a' = "aaaaaaaaaaaaaaaa..." but at least it returns a String and not an Int, and i doubt anyone writes 'a' * 'a' on purpose anyway.
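
A minimal sketch of what such a repetition operator could look like. The operator below is hypothetical, is not part of the standard library or of the prototype, and is written with today's double-quoted Character literal:

```swift
// Hypothetical operator for illustration only.
// Note: String(repeating:count:) traps on a negative count.
func * (lhs: Character, rhs: Int) -> String {
    String(repeating: lhs, count: rhs)
}

let a: Character = "a"
print(a * 5) // aaaaa
```

The result is always a String, so the operator cannot be confused with integer multiplication at the type level.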

michelf · October 21, 2018, 9:13pm

Perhaps '1' + '1' should:

be an error in a context that expects a string — fix with: "1" + '1' or similar
be a warning in a context that expects an integer — silence with: ('1' as Int) + '1'

That way it's always clear what the code does. Silencing the warning can be a bit ugly in the Int case because it is pretty much always nonsense to add the value of two code points anyway.

taylorswift · October 21, 2018, 9:16pm

Is this necessary? this will already give a type checker error “expected expression of type String” and i would guess from that most people could figure out what went wrong on their own.

sure. users are highly unlikely to encounter this anyway, since no one puts literal constants on both sides of the operator. I would expect '1' + value or '1' + 5 to be way more common.

michelf · October 21, 2018, 9:23pm

If it's already an error, it's already as it should be. I wrote that line to put the two sides against each other for comparison purposes, not as a todo list.

(Improving the error and/or adding a warning can be done later if people get confused by this.)

johnno1962 · October 21, 2018, 9:29pm

The standard library side of things isn’t the strongest. The ExpressibleByCharacterLiteral was added at the last minute to be able to get the default type for non-codepoint character literals right.
ExpressibleByCodepointLiteral is outside the graph intentionally to prevent affecting the behaviour of double quoted strings.

The two cases codepoint/non codepoint character literals probably need to have separate protocols, their signatures are so different. I would be fine with a larger re-org to phase out the existing implementation of UnicodeScalarLiteral, ExtendedGraphemeClusterLiteral and replace them with character literal specific protocols. That’s the sort of thing I’d look at further down the road as I’m sure the standard library’s keepers would have plenty to say. There is more work to do on the Swift side, it’s just a question of when the prototype is sufficiently detailed for a review.

taylorswift · October 21, 2018, 9:51pm

how does this look?

ExpressibleBy???Literal ('a')       _ExpressibleByLegacyUnicodeScalarLiteral ("a")
                   ↓                      ↓
           ExpressibleByUnicodeScalarLiteral ('a', "a")    _ExpressibleByLegacyExtendedGraphemeClusterLiteral ("a")
                               ↓                                ↓                                         ↓
                      ExpressibleByExtendedGraphemeClusterLiteral ('a', "a")          ExpressibleByStringLiteral ("aa", "a")


typealias StringLiteralType                         = String 
typealias _LegacyExtendedGraphemeClusterLiteralType = String
typealias _LegacyExtendedUnicodeScalarLiteralType   = String 

typealias ExtendedGraphemeClusterLiteralType        = Character
typealias UnicodeScalarLiteralType                  = Character 
typealias ???LiteralType                            = Character

Then we deprecate the legacy protocols, leaving us with

ExpressibleBy???Literal ('a')  
          ↓  
ExpressibleByUnicodeScalarLiteral ('a')    
          ↓                           
ExpressibleByExtendedGraphemeClusterLiteral ('a')    ExpressibleByStringLiteral ("aa")


typealias StringLiteralType                         = String 

typealias ExtendedGraphemeClusterLiteralType        = Character
typealias UnicodeScalarLiteralType                  = Character 
typealias ???LiteralType                            = Character

johnno1962 · October 22, 2018, 9:34am

Looks sort-of OK to me. I’d simplify it all a bit though. Before, we have

ExpressibleByUnicodeScalarLiteral
                   ↓ 
ExpressibleByExtendedGraphemeClusterLiteral
                   ↓ 
ExpressibleByStringLiteral

typealias StringLiteralType                  = String 
typealias ExtendedGraphemeClusterLiteralType = String
typealias UnicodeScalarLiteralType           = String
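
For readers who have not conformed to these protocols before, here is a small sketch of the existing top of the hierarchy in use. `Codepoint` is a made-up wrapper type; the literal is double-quoted because that is what current Swift accepts:

```swift
// Hypothetical wrapper type for illustration only.
struct Codepoint: ExpressibleByUnicodeScalarLiteral {
    let value: UInt32

    // UnicodeScalarLiteralType is inferred as Unicode.Scalar from this initializer.
    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        value = scalar.value
    }
}

let c: Codepoint = "a"
print(c.value) // 97
```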

I’d do a straightforward rename of ExpressibleByUnicodeScalarLiteral to something like _LegacyExtendedUnicodeScalarLiteralType and ExpressibleByExtendedGraphemeClusterLiteral to _LegacyExtendedGraphemeClusterLiteral along with their default type definitions. Though these protocols are public it is unlikely people are conforming to them.

Then, create the new version of ExpressibleByUnicodeScalarLiteral which will be based on the ExpressibleByCodepointLiteral protocol of the prototype and I’d suggest the last protocol be renamed ExpressibleByCharacterLiteral with the signature of the existing ExpressibleByExtendedGraphemeClusterLiteral but no longer an ancestor of ExpressibleByStringLiteral so there would be:

ExpressibleByUnicodeScalarLiteral        _LegacyExpressibleByUnicodeScalarLiteral
             ↓                                               ↓ 
ExpressibleByCharacterLiteral            _LegacyExpressibleByExtendedGraphemeClusterLiteral
                                                             ↓ 
                                                ExpressibleByStringLiteral

typealias StringLiteralType                         = String 
typealias _LegacyExtendedGraphemeClusterLiteralType = String
typealias _LegacyUnicodeScalarLiteralType           = String 
typealias CharacterLiteralType     = Character 
typealias UnicodeScalarLiteralType = Character

This should mean everything works as before with a view to possible eventual deprecation of the Legacy types if they no longer serve any purpose. This is a bigger change than I’d originally intended and no longer strictly additive though in practice the ExpressibleBy protocols are not widely used. It’s not clear I’ll be updating the prototype to take all this in without some help as it currently works sufficiently well for review. How do these changes sound oh keeper of all things String, @Michael_Ilseman?

AlexanderM · October 22, 2018, 4:32pm

I'm familiar with unicode, EGCs, various encodings, and have used both high level string implementations for unicode-correct text, and low level raw ascii for machine parsing. I've been around the block.

I'm trying my best to understand this proposal, and frankly, I have no idea what's going on. Here's what I understand so far, please do correct me if I'm wrong.

Single quotes become the preferred way of expressing X, while double quotes remain solely for strings.
- What's "X" in this case? Character, Unicode.Scalar, (U?)Int(8|16|32|64)?? All of the above?
What's Unicode.Scalar for? If this single quote syntax can be used for both integer types and Unicode.Scalar, aren't they redundant to each other?

taylorswift:

// This is the best we can get right now, while showing the textual
// letter form.
let hexcodes = [
    UInt8(ascii: "0"), UInt8(ascii: "1"), UInt8(ascii: "2"), UInt8(ascii: "3"),
    UInt8(ascii: "4"), UInt8(ascii: "5"), UInt8(ascii: "6"), UInt8(ascii: "7"),
    UInt8(ascii: "8"), UInt8(ascii: "9"), UInt8(ascii: "a"), UInt8(ascii: "b"),
    UInt8(ascii: "c"), UInt8(ascii: "d"), UInt8(ascii: "e"), UInt8(ascii: "f")
]

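This is just dishonest. Multiple times now, I've very clearly shown that you can very simply do this:

let hexcodes = [
	"0", "1", "2", "3", "4", "5", "6", "7",
	"8", "9", "a", "b", "c", "d", "e", "f"
].map(UInt8.init(ascii:))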

What I'm most unclear about is: what problem does this solve?

Presumably, the answer is something like "to more easily initialize integers from explicitly spelt out and visibly identifiable characters". Sure, but then

Why are such integers needed? What's being done with them? Understanding the motivations of why people need to convert characters to integers would better inform a solution.
What does this solution offer that isn't already covered by UInt8.init(ascii:)? Could those improvements be made as refinements to UInt8.init(ascii:), to avoid the need for language changes?

taylorswift · October 22, 2018, 5:11pm

{Character, Unicode.Scalar}. Asking this of the integer types is an impossible question, as it depends on the context. Numeric literals are appropriate in a numeric context, and textual literals are appropriate in a textual context.

It is effectively a wrapper around a constrained UInt32 which provides unicode-specific functionality that would be nonsensical to put on UInt32 itself. This was the role of Unicode.Scalar before this proposal, and this proposal does not change that.

AlexanderM:

This is just dishonest. Multiple times now, I've very clearly shown that you can very simply do this:
let hexcodes = [
	"0", "1", "2", "3", "4" ,"5", "6", "7",
	"8", "9", "a", "b", "c", "d", "e", "f"
].map(UInt8.init(ascii:))

If you wish to make personal attacks, you should know that Chris wrote that section, not me, so you can bring it up with him.

At any rate, you are introducing an additional layer of complexity, as now the mental model of this goes

Construct 16 Character objects corresponding to ['0'-'9'] + ['a'-'f']
Use a function UInt8.init(ascii:) to convert the Characters into UInt8 values
Apply this function to each Character object in the array using a map transformation.

Whereas it should have been as simple as

Hardcode the 16 integer values ['0'-'9'] + ['a'-'f'] as an array literal.

You have also continually ignored other aspects of this problem, which, really, could be considered a form of dishonesty too. UInt8.init(ascii:) doesn’t work on codepoints like é. It won’t help you convert codepoints with higher scalar values, or give you any different integer type than UInt8. And you have still not answered the question of how we can get compile-time overflow checking by writing a run-time map transformation.

See above, as I have answered this question about 65,535 times.

Compile time overflow checking can only be done on literal values. To allow diagnostics to “see through” multiple levels of function calls would be a major break in the language precedent. Things like foo(nil) would become errors. ABI would not work as the compiler would have to see the implementation of everything in order to diagnose this.

taylorswift · October 22, 2018, 5:21pm

Why can’t we save source-compatibility and just add derivation arrows between ExpressibleByUnicodeScalarLiteral ← _LegacyExpressibleByUnicodeScalarLiteral and ExpressibleByCharacterLiteral ← _LegacyExpressibleByExtendedGraphemeClusterLiteral? We should probably remove the relationship between _LegacyExpressibleByUnicodeScalarLiteral → _LegacyExpressibleByExtendedGraphemeClusterLiteral to keep the graph neat, but this should not affect any code since currently Unicode.Scalar literals and Character literals look exactly the same.

Also, why have you renamed ExtendedGraphemeClusterLiteral to CharacterLiteral?

johnno1962 · October 22, 2018, 5:35pm

Because this could mean a double quoted String literal could express an integer once we add conformances to ExpressibleByUnicodeScalarLiteral and to keep the old and new implementations separate.

It’s a better name, connected to surfaced Swift types not Unicode terminology.

taylorswift · October 22, 2018, 6:04pm

this was the benefit of the extra ExpressibleBy???Literal protocol at the top of the tree

Michael_Ilseman · October 22, 2018, 7:15pm

The protocols are a bit of a mess right now, especially the _Builtin ones, and we'd like to fix this all up prior to ABI stability to avoid compatibility hacks, or very soon after with compatibility hacks. I'm currently laser-focused on substantial ABI changes for String at the moment (essential for better byte-string-like support in the future, but way outside the scope of this thread). But, I will try to help as much as I can.

I haven't followed all of the earlier talk on this thread, or paged in all of the protocols recently, but I could see something like this making sense:

	ExpressibleByCharacterLiteral	ExpressibleByStringLiteral
FixedWidthInteger
Unicode.Scalar
Character
String

Just as with integer literals, the compiler does overflow checking when it can. FixedWidthInteger character literals only support single-scalar graphemes and need enough bit-width to encode the scalar value. The compiler continues to do a ~~best~~minimal-effort job of ensuring Character conformance to either ExpressibleBy* is probably a single grapheme. Examples:

	UInt8	UInt16	UInt32	Unicode.Scalar	Character
'x'	120	120	120	U+0078	x
'花'		33457	33457	U+82B1	花
'𓀎'			77838	U+1300E	𓀎
''
'ab'
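
The numeric cells in the table can be reproduced in current Swift with run-time checked conversions. This is only a sketch of the bit-width rule, since the compile-time checking described above does not exist today; the scalars are written with escapes and the variable names are made up:

```swift
let x: Unicode.Scalar = "x"
let hua: Unicode.Scalar = "\u{82B1}"     // 花
let glyph: Unicode.Scalar = "\u{1300E}"  // 𓀎

// init(exactly:) returns nil when the scalar value does not fit the target width.
print(UInt8(exactly: x.value) == 120)         // true
print(UInt8(exactly: hua.value) == nil)       // true
print(UInt16(exactly: hua.value) == 33457)    // true
print(UInt16(exactly: glyph.value) == nil)    // true
print(UInt32(exactly: glyph.value) == 77838)  // true
```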

I haven't thought about what to do about signed integers when the value ends up setting the sign bit. We'd probably want to just produce the negative number, but perhaps there's an argument to error.
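
The two options for signed integers can be sketched with existing initializers. Here é (scalar value 233) is used as an example of a value that sets the sign bit of an Int8; which behavior a character literal should get is exactly the open question:

```swift
let eAcute: Unicode.Scalar = "\u{E9}"  // é, scalar value 233

// Option 1: treat it as an error, as a checked conversion does.
print(Int8(exactly: eAcute.value) == nil)  // true

// Option 2: reinterpret the bits and produce the negative number.
let byte = UInt8(exactly: eAcute.value)!   // 233 fits in UInt8
print(Int8(bitPattern: byte))              // -23
```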

AlexanderM · October 22, 2018, 10:56pm

Ok, I'm on board with this. I was not a fan of the idea of having " continue to exist for Character and Unicode.Scalar literals. I think that would have made our quotes a conceptual mess. I like the idea of deprecating " for those, and reserving it for String.

Yes I'm aware, and isn't it redundant to have implicit conversion to integers, when they can be added as computed properties on Unicode.Scalar? Is it the compile-time overflow checking that you're looking for?

It's not a personal attack, just a frustration over a reiterated point that wasn't taken. Good to know it wasn't out of malice.

I feel like a similar argument can be made to motivate compile-time checking of URL, rather than first "going through" String.

I do acknowledge the gap there, and I think it's easily remedied by introducing similar initializers for the other integer types. Or more preferably (imo), computed properties on Unicode.Scalar.
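
A minimal sketch of what such computed properties might look like. The property names `uint8Value` and `uint16Value` are invented for this example and do not exist in the standard library; note that the check happens at run time, not at compile time:

```swift
// Hypothetical property names, for illustration only.
extension Unicode.Scalar {
    /// The scalar value as a UInt8, or nil if it does not fit.
    var uint8Value: UInt8? { UInt8(exactly: value) }

    /// The scalar value as a UInt16, or nil if it does not fit.
    var uint16Value: UInt16? { UInt16(exactly: value) }
}

let colon: Unicode.Scalar = ":"
let hua: Unicode.Scalar = "\u{82B1}"  // 花

print(colon.uint8Value == 58)      // true
print(hua.uint8Value == nil)       // true
print(hua.uint16Value == 33457)    // true
```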

Fair enough. I think that's a pretty good motivator. However, if this reasoning is taken seriously, I would expect to see an immediate push to have it be extended to other similar problems, with much stronger motivating cases. E.g. compile-time syntactically validated URLs, URIs, bundle file references, asset names, regular expressions, etc.

Thanks for taking the time to address my points/questions. If you don't mind, could you answer one last question, which to me is the most important: