Unicode scalar literals

You keep stating that this is the primary use case; it may be yours, but it is emphatically not the primary use case for this pitch. As the core team decided, the pitch should not touch the topic of ASCII APIs. @michelf is illustrating one major use case improved by making Unicode scalars more ergonomic to use. Again, at the core team's direction, it is explicitly a non-goal of this pitch to make any changes to ASCII facilities available in Swift.

It's not my primary use case; it's the primary, or only, use case presented in all the threads leading to this point. If you read my replies here, I would prefer to make Character more ergonomic, if we're choosing one. And if ASCII processing isn't important, then perhaps you should mention that to whoever wrote the pitch:

I'm really only stating that low-level text processing, the motivation mentioned constantly throughout the pitch, is primarily done at the ASCII level, and I don't see how an ergonomic way of expressing Unicode.Scalar leads to an ergonomic and efficient way of doing such processing.

I'm not sure what major use case you're referring to. JSON parsing would not be done at the Unicode.Scalar level.


ASCII byte processing is very important, but it is not the focus of this pitch. It can be done at the Unicode scalar level, however, and cannot be done at the level of extended grapheme clusters.

Unicode text cannot be processed “at the ASCII level.” When it comes to JSON, it is properly done at the Unicode scalar level.

If you mean processing as a sequence of UTF-8 bytes, you certainly can do that, but you lose access to any Unicode properties for inspecting and manipulating any contents unless you go back to the Unicode scalar level.
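A small sketch of that trade-off, using only standard library API:

```swift
// At the UTF-8 level, text is just a sequence of code units (bytes).
let text = "π"
assert(Array(text.utf8) == [0xCF, 0x80])  // two raw UTF-8 bytes

// Dropping back to the Unicode scalar level regains Unicode properties.
let scalar = text.unicodeScalars.first!   // U+03C0 GREEK SMALL LETTER PI
assert(scalar.value == 0x3C0)
assert(scalar.properties.isAlphabetic)    // property queries need a scalar, not a byte
```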

I misinterpreted your question to be about looking for ASCII delimiters using Character. Sorry about that.

But you can use the Unicode scalar view to parse text formats. It might not be as perfectly optimized as dealing directly with the UTF-8 code units, but it'll give you correct results and will be easier to write code for because we have a Unicode scalar literal.

Yes, the various delimiters, etc. are in the ASCII-compatible range so you would scan the UTF-8 bytes, nicely matching the new encoding in Swift 5. You are only interested in comparisons to these ASCII characters, so these would be the only ones that would conceivably be expressed as literals. And, as far as I'm aware, you wouldn't be inspecting any of the Unicode.Scalar properties when parsing JSON, because they're not relevant. And even if you were for some reason, you would presumably be inspecting them on an element from the Unicode.Scalar view, not on a literal.
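As a concrete sketch of that style of scanning (standard library API only; the JSON string is just an example):

```swift
// Scan the UTF-8 view for ASCII structural characters. Non-ASCII content
// can never collide with them: all bytes of a multi-byte UTF-8 sequence
// are >= 0x80, so a byte below 0x80 is always a genuine ASCII character.
let json = #"{"numbers": [1, 2, 3]}"#
var braceDepth = 0
for byte in json.utf8 {
    if byte == UInt8(ascii: "{") { braceDepth += 1 }
    if byte == UInt8(ascii: "}") { braceDepth -= 1 }
}
assert(braceDepth == 0)  // braces are balanced
```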


And now for something completely different

Stepping back into the mêlée after regrouping for a few days… I became involved in this pitch for two reasons. It seemed to me a cool idea that Swift could have a modern analogue of how single-quoted literals are used in, say, C and Java, i.e. an element of a String, and that, using the integer conversions, we could round off a few rough corners in the ergonomics of working with buffers of integers representing text.

If we believe that it is impossible to anticipate the run-time segmentation of an extended grapheme cluster at compile time, due to the fluid nature of Unicode, and that integer conversions have been roundly rejected by the community, then the first goal is not achievable, and I no longer support the idea of creating a single-quoted Character literal, let alone Unicode.Scalar literals.

Turning to the second of my goals, the ergonomics: let's focus on another element of the Core Team's decision -- that the proposal could be broken into two separate proposals:

I'd like to propose a separate, targeted "pure swift" solution to the ergonomics problems, adding code to the standard library rather than the far more involved adding of a new literal type.

Focusing on the original motivations, for me one problem was the fixed type of the utility initialiser UInt8(ascii: "a"). What if you wanted the more common CChar or Int8 type for your ASCII value? UInt8(ascii: "a") isn't so bad, but the more typical CChar(UInt8(ascii: "a")) or UInt16(UInt8(ascii: "a")) is getting inconvenient. This can be solved by writing an extension on FixedWidthInteger and including it in the standard library:

extension FixedWidthInteger {
  /// Construct a fixed-width integer from the code point value of `v`.
  ///
  /// - Precondition: `v` is in the ASCII range (`v.value < 128`).
  @inlinable
  @available(swift 5.1)
  public init(ascii v: Unicode.Scalar) {
    _precondition(v.value < 128,
                  "Code point value does not fit into ASCII")
    self = Self(v.value)
  }
}

This would allow Int8(ascii: "a") or UInt16(ascii: "a") or any of the integer types directly. QED.

Looking at the second use case, initialising arrays, which is particularly inconvenient at the moment, I'd suggest an analogous initialiser for arrays extracting ASCII values from the characters in a String. This would allow:

let hexcodes = [Int8](ascii: "0123456789abcdef")
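A sketch of what such an initialiser could look like (hypothetical; not part of the standard library, and the error message mirrors the extension above):

```swift
// Hypothetical Array initialiser extracting ASCII values from a String's
// Unicode scalars, trapping on any non-ASCII content.
extension Array where Element: FixedWidthInteger {
    init(ascii string: String) {
        self = string.unicodeScalars.map { scalar -> Element in
            precondition(scalar.isASCII, "Code point value does not fit into ASCII")
            return Element(scalar.value)
        }
    }
}

let hexcodes = [Int8](ascii: "0123456789abcdef")
assert(hexcodes.count == 16)
assert(hexcodes[0] == 48)   // ASCII "0"
assert(hexcodes[10] == 97)  // ASCII "a"
```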

The third use case was typically scanning through a buffer of integer values for particular ASCII values, currently:

if cString.advanced(by: 2).pointee == UInt8(ascii: "i") {

The ergonomics of this can be improved by adding an operator to the standard lib for comparison of an integer to a Unicode.Scalar:

public func == <T: FixedWidthInteger>(lhs: T?, rhs: Unicode.Scalar) -> Bool {
    _precondition(rhs.isASCII, "Only ASCII Unicode.Scalar accepted in this context")
    return lhs == T(rhs.value)
}

The following would then work:

if cString.advanced(by: 2).pointee == "i" {

I don't feel the implicit assumption that the value used in the comparison is the ASCII value of the character is unreasonable. Number literals are taken to be in base 10 by convention without having to suffix them with .decimalValue or suchlike. We have to draw the line somewhere.

This, along with a few other operators to get switch statements working, results in a PR that runs to all of 101 lines (including tests) and solves the problem, rather than the delicate, staged changes that would be involved in introducing a single-quoted literal into the compiler and deprecating the old syntax. If someone would like to help turn this into a proposal for review, please get in touch.
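For illustration, the switch support mentioned here might be sketched with a pattern-matching operator along these lines (hypothetical; the actual PR may differ):

```swift
// Hypothetical ~= overload so integer code units can be matched against
// (ASCII-only) Unicode.Scalar literals in switch statements.
func ~= <T: FixedWidthInteger>(pattern: Unicode.Scalar, value: T) -> Bool {
    precondition(pattern.isASCII, "Only ASCII Unicode.Scalar accepted in this context")
    return value == T(pattern.value)
}

let byte: UInt8 = 0x69  // the code unit for "i"
var matchedI = false
switch byte {
case "i": matchedI = true
default:  break
}
assert(matchedI)
```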


The core team’s guidance is that the portion of the previous proposal about APIs for processing ASCII should be pitched second, after this topic about single-quoted literals:

Once single-quoted literals have been added to the language, this part of the proposal (or an alternative, such as the addition of a trapping or nil-returning ascii property) can be re-pitched separately.

It’s an important discussion, but since we’ve been asked to defer that discussion until later, let’s focus on the first topic first.

Yes, if you have no interest in the content of the JSON you’re parsing, then you can parse by only inspecting the ASCII bytes. But this is again begging the question: if by definition you are interested only in ASCII parsing, then this pitch does not specifically address your use case.

I’ve given up on single quoted literals altogether and am not at all keen on Unicode.Scalar literals — let's save single quotes for something else — 100 lines in the standard lib can achieve a lot of what I was after. It seems very strange to me to design these two things independently.


Yet that’s the core team’s guidance, so let’s see how this one plays out.

C’est la vie. If you’re looking for a mostly complete implementation it’s here

I do feel that this implicit assumption is unreasonable. And this is where the lack of a dedicated Unicode scalar literal syntax gets you into ambiguity:

Unicode scalars have corresponding Unicode scalar values, so it's perfectly reasonable (in my view) to compare any fixed-width integer to any Unicode scalar: there is no question of encoding as you're already working at a specific level: the level of Unicode scalars.

public func == <T: FixedWidthInteger>(lhs: T, rhs: Unicode.Scalar) -> Bool {
    return lhs == rhs.value
}

public func == <T: FixedWidthInteger>(lhs: Unicode.Scalar, rhs: T) -> Bool {
    return lhs.value == rhs
}
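A self-contained sketch of how such comparisons read in use (restating one of the operators above):

```swift
// Heterogeneous comparison between an integer and a Unicode scalar:
// both sides are code point values, so no question of encoding arises.
func == <T: FixedWidthInteger>(lhs: T, rhs: Unicode.Scalar) -> Bool {
    return lhs == rhs.value
}

let source = "x = 1"
var equalsSigns = 0
for scalar in source.unicodeScalars where 0x3D == scalar {  // "=" is U+003D
    equalsSigns += 1
}
assert(equalsSigns == 1)
```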

So far we agree.

But this reads quite unacceptably in today's Swift when literal values are involved, because double-quotation marks denote string literals first, then possibly extended grapheme clusters or Unicode scalars. Strings can be encoded in any of several ways, as can extended grapheme clusters; extended grapheme clusters don't just have a value--I guess that's why they're called "clusters." With no visible indication you then have to ask, in what sense are we equating this literal with a number? That does not seem reasonable to leave unanswered at the point of use.


I agree that line would read better if the Unicode.Scalar were expressed by a single-quoted literal.

if cString.advanced(by: 2).pointee == 'i' {

Part of my point is that, using operators, we don’t need to box ourselves into Unicode.Scalar literals instead of Character literals, as we do with a trapping .ascii property, which requires a particular default type. There are many reasons to have Unicode.Scalar literals, including better error reporting, but it’s a very awkward fit into the Swift String model.

We don't need to, but it's not just about the single-quoted literal version reading better; because of the lack of any indication as to encoding, I at least would be opposed to any operator being added which causes the double-quoted literal version to be accepted. (This pitch would address that by causing a deprecation warning with that usage.)

OK, how about the best of both worlds? Single-quoted literals that default to Unicode.Scalar if they contain a single Unicode scalar, and to Character otherwise.

(swift) 'a'.ascii
// r0 : UInt8 = 97
(swift) '🇨🇦'.ascii
<REPL Input>:1:12: error: value of type 'Character' has no member 'ascii'
'🇨🇦'.ascii
(swift) '🙂'.ascii
Fatal error: Code point value does not fit into ASCII

I would think that's the worst of both worlds.

We would lose static checking that the literal is a single Unicode scalar, which is a major feature of this pitch. We would make extended grapheme clusters expressible by single-quoted literals, again raising the question of unspecified encodings. And we would lose any ability for the user to know by inspection that we are working at the level of non-normalized code points.

Meanwhile, most characters, when surrounded by single quotes, would default to creating a Unicode scalar, and users wouldn't know why; and normalizing the source code would change the inferred type of some literals.


You’re a tough audience.

:slight_smile: It's definitely worth talking through all options, though.

My honest opinion is that any modern file format should be designed to either not pretend to be text in any way, or else actually behave as text. If a file format is under the guise of text, it should be designed to safely undergo normal text handling operations. Swapping line endings preserves text intent; the file format’s intent should also be preserved. Converting a text file from one encoding to another preserves text intent; the file format’s intent should also be preserved. Performing normalization on a text file also preserves text intent; the file format’s intent should also be preserved.

I understand that file formats predating Unicode often don’t meet that expectation. I also understand that some more recent file formats fall short of that expectation, usually due to oversight. I even understand that .swift is a file format that doesn’t succeed in this respect. Finally, I understand that for backwards‐compatibility reasons, many such issues can never be fixed. I still firmly believe that it should be an unwavering law of any future design that, “If it says it is text, it really is text.”

That means for Swift, in my very strong opinion, the only correct way to write code at the level of non‐normalized code points is to reference them using their code point identifiers: \u{E9}. Such references are intent‐preserving when processed as text. Beyond that, their non‐normalized intent is plain to any reader.
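A small illustration of why the \u{…} spelling preserves that intent (standard library behaviour only):

```swift
// The code point reference pins down exactly which scalars are meant,
// even though the two spellings are canonically equivalent as Strings.
let precomposed = "\u{E9}"         // é as a single scalar, U+00E9
let decomposed  = "\u{65}\u{301}"  // e followed by U+0301 COMBINING ACUTE ACCENT
assert(precomposed == decomposed)              // String comparison is canonical equivalence
assert(precomposed.unicodeScalars.count == 1)  // but the scalar contents differ
assert(decomposed.unicodeScalars.count == 2)
```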

Holding that view, it follows logically that I find the very existence of ExpressibleByUnicodeScalarLiteral unfortunate. Unicode.Scalar should have been ExpressibleByIntegerLiteral instead. We are stuck with ExpressibleByUnicodeScalarLiteral because of backwards‐compatibility, but it would be best to sweep it under the rug and let it remain a quirky corner case that requires extra effort to pull out from under ExpressibleByExtendedGraphemeClusterLiteral. I am categorically opposed to elevating Unicode scalar literals.

I make no judgement on how useful either of the following may or may not be, but they are the only reasonable ways forward that I can see:

  1. If the goal is to find some sort of general text element deserving of a separate literal from strings, it is Character. Character literals preserve text intent, and they are always safe to use. I suspect this is closest to what the core team wants, since it resembles what they originally envisioned way back at the beginning of all of this.

    • Whether or not compile time validation is possible is irrelevant; runtime breakage from a different version of ICU only occurs when operating systems are updated (which can cause other, much more widespread runtime issues :wink:). ICU‐related breakage is limited to recent aspects of Unicode, and it evaporates as devices catch up. Only developers trying to use cutting‐edge Unicode features are likely to notice, and they are probably aware of both Unicode and the issues to expect in the wake of its updates.
    • On the other hand, Unicode.Scalar is not a candidate for a separate literal from strings unless it is number‐based. Yes, scalars are the correct level to be working at for a lot of text processing, but spelling them with text literals is the wrong way to do it. The breakage such spelling encounters occurs across all of Unicode, including its oldest parts. This kind of breakage will never go away, and as Unicode adoption and understanding spreads into ever more tools, the resulting problems are likely to only become more frequent over time.
  2. If the desire is to make scalar‐based file format parsing easier, then a new ExpressibleByASCIILiteral should be considered, either for a separate ASCII.Scalar type or where Unicode.Scalar would simply conform to it.

    • ASCII code points are the most significant when it comes to parsing existing formats (XML, JSON, YML, CSV, C). When parsing Swift, Java, or some format that can use non‐ASCII scalars for semantic purposes, individual code points are irrelevant; only large categorized sets of code points matter, so even then there is no real need to use individual literals. That means ASCII already covers the vast majority of real use cases.
    • ASCII code points are inert across almost all text processing, so they are safe to use. The only instance where they are not inert is if line endings are switched, and that is completely irrelevant because directly spelled line endings are not valid in text literals anyway§.

provided the new encoding is a superset of the file’s contents.
except conversions to EBCDIC or other small encodings, which are lossy conversions unlikely to be performed on Swift source anyway.
§except in multiline string literals, which are not candidates for any form of character literal anyway.


This topic has demonstrated a tendency to go in circles, so I do not intend to continue discussion. I have taken extra effort this time to express my final thoughts clearly and thoroughly. If you genuinely want clarification about something I said, you are welcome to ask, but if your post seems more like an argument, I am unlikely to reply. You are free to disagree with my opinions and advice, but please respect my wish to leave it at that and move on.


Very interesting write-up. It advances an argument very clearly, but I think you start with a mistaken premise:

"Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes" (their definition, not ours). Therefore, if you substitute a UTF-8 encoded text with a UTF-16 encoded text, that can be the same Unicode text. But if you substitute a Unicode text with one in which the line endings have been swapped, or in which the character codes have been normalized, that is a different Unicode text. Two strings may be equivalent for the purposes of Swift's == operator, but that only guarantees their substitutability for the purposes of modeling a sequence of extended grapheme clusters.

In other words, Unicode does not match the expectation you set out as a premise. What it means to "behave as [Unicode] text" is multilayered and subject to more constraints than you have outlined. In Swift, Unicode text is modeled by String and exposes views at many levels (UTF-8, UTF-16, Unicode scalars, extended grapheme clusters) to support standards-based text manipulation. This is similar in some ways to how an Int models both an integer and a sequence of bits (hence the bitwise operators it exposes).

If we are to have first-class support for Unicode's facilities, we need to make it more ergonomic to work with text as a sequence of Unicode scalars, and (therefore) to work with Unicode scalars generally, just as we are trying to make it more ergonomic to work with String in other ways.

I agree with some of your other premises: for instance, that it is important to make "non-normalized intent" of Unicode scalar literals plain to the reader. This is why I am proposing a dedicated notation for such literals. These literals do not need to be inert to normalization just as no literals need to be inert to normalization, but they do achieve rather than frustrate the goal you articulate of better delineating when normalization is or is not at play.
