Unicode scalar literals

This is not an "implicit promotion relationship" but one protocol refining the other. I'm not sure why you would conclude that this particular relationship is problematic; I have demonstrated where there would be a benefit, and I have seen no examples where it would be unfortunate in the way that 'x' + 'y' == "xy" might be.

Throughout the several threads leading up to this one (previous pitch threads, the rejected proposal, this thread), I remain unshaken in my core belief that if there's going to be a single-quoted literal for character-like things in Swift then it should naturally default to Character. I don't want to rehash all of my posts, but I'll summarise some of them as responses to this pitch.

Sure, but so is Unicode.Scalar. If the primary use case is low-level processing of ASCII strings, then perhaps single-quoted literals should be devoted to ASCII, as you mention in your alternatives. Processing at the level of Unicode.Scalar is a small niche of the already-niche use case of low-level string processing. It seems to me to be in no man's land, satisfying neither people processing strings in the natural way for Swift (i.e. by Character) nor people processing ASCII strings.

The same thing is true of Character, and it would be similarly great if people learning Swift could explore properties of Character, the type they are more likely to use, in a playground. The set of properties isn't currently as rich as it is on Unicode.Scalar, but that won't necessarily be true forever.
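
For example, Character already has enough to explore in a playground today (a rough sketch using the SE-0221 property additions; the results are what I'd expect from the documentation):

let seven: Character = "7"
print(seven.isNumber)               // true
print(seven.wholeNumberValue)       // Optional(7)
print(Character("ß").uppercased())  // "SS"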

Your examples here really demonstrate to me how confusing mixing together UTF-8 bytes, Unicode scalars, Unicode code points, etc. can be in a language. But this suggests to me that adding a shorthand syntax for Unicode.Scalar may simply bring some of this confusion into Swift as well. Are there any languages with a similar focus on Unicode correctness to Swift that we could look at instead? I'm having trouble seeing "some other modern languages have seriously confusing string implementations" as a great argument for this pitch.

I don't believe this is the reason that there is currently not a dedicated literal syntax for Character (and would be an argument against even the current status of using double-quoted string literals with Character). If I recall correctly, core Swift developers have previously said that single-quoted literals were being reserved mostly in case they were going to be used for raw strings, which are now implemented with a different syntax. And I don't see this as a great argument against a literal syntax that defaults to Character, as the literals could be checked at compile-time in a best-effort way that will handle a lot of cases, with the rest verified at run-time (i.e. how double-quoted literals and Character currently work). A lot of other Swift features work in the same way.
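
To illustrate what I mean (a sketch of today's double-quoted behaviour; the rejected and trapping lines are commented out):

let ok: Character = "e\u{301}"   // "é" as one cluster: accepted at compile time
// let bad: Character = "ab"     // two clusters: rejected at compile time
let s = "a" + "b"
// Character(s)                  // would trap at run time: not a single cluster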

I would personally replace Character with Unicode.Scalar in these sentences. If Character isn't important enough to be the default type for a literal (and it may not be), then I don't think Unicode.Scalar is.


This is explicitly the use case that this pitch seeks to make more ergonomic. There are not many users today who do this kind of processing with Swift because the language makes it very cumbersome. If you have concluded already that this use case is not worth addressing, then you have presupposed that this pitch should be rejected.

If you read the motivations in the document you link, you will see that one of the reasons why these properties were added to Character is that use of Unicode.Scalar properties is not ergonomic. This pitch addresses that problem directly.

It will necessarily be true forever. You will note, as written in the document you linked, that Unicode does not define these properties for extended grapheme clusters, only Unicode scalars. That proposal makes a best effort at adding a small number of them for Character, and for reasons outlined here, at least one of these is a footgun for ASCII byte processing.

These languages are cited because they have a focus on Unicode correctness. In fact, the term “rune” adopted in Go was first used by Rob Pike and colleagues when they created UTF-8, and Rob Pike now works on Go.

Swift is ambitious in its Unicode support, but do not suppose that it has already achieved its ambitions. Unicode defines a Unicode string as a sequence of code units, which is modeled more explicitly in other languages than Swift.

When contributors to .NET considered whether to adopt a Go-like rune, they also surveyed Swift’s design choices and Miguel de Icaza wrote: “but also Swift is not a great model for handling strings.”

Of course, whether a model is great or not depends on use case, but I would put it to you that Go and .NET are not “a small niche.”

Because Unicode grapheme breaking changes from version to version, compile-time checking produces false positives and false negatives: by definition, it can “handle” zero cases.

So what happens if you backward-deploy a skin-tone-modified emoji on an older macOS:

let x = "🧒🏽" as Character

Just did the test by back-deploying to macOS 10.9 and 10.11. Nothing special happens: you just have a Character variable that looks like it contains two characters.

Maybe that's not relevant, but you can sort-of do the same thing with unicode scalars:

let x = "fl" as Unicode.Scalar

This "fl" ligature is meant to display as two characters, although it is (I believe) still a single grapheme.


I'd choose Unicode.Scalar for single quote literals because it is a different level of abstraction than String and Character. Using a different syntax would clarify we're working at the level where '\u{37e}' != ';'. But that thinking somewhat breaks if Character and String can be initialized using the same literals (the separation becomes fuzzy again), so I'm not too sure what to think.
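
To make the '\u{37e}' point concrete (a quick sketch; U+037E GREEK QUESTION MARK canonically decomposes to U+003B, so the two compare equal as Strings even though the scalars differ):

let greekQuestionMark: Unicode.Scalar = "\u{37e}"
let semicolon: Unicode.Scalar = ";"
print(greekQuestionMark == semicolon)  // false: distinct scalars
print("\u{37e}" == ";")                // true: canonically equivalent Strings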

Swift uses the term Character to refer to an extended grapheme cluster. This concept is distinct from what “looks like a character” although it deliberately approximates that.

This is, as I’m sure you see, problematic. You should not be able to instantiate a Character with “🧒🏽” on older systems where that is not a single extended grapheme cluster. If you instantiate a String with “🧒🏽” and deploy it, you will see that its count is 1 on newer systems and 2 on older systems.

This is a good point and would argue for compile-time warnings against mixing the notations.

This partial quote removes the context from my post here: most use cases from the previous threads and in this pitch are for processing ASCII, not Unicode scalars. This pitch does essentially nothing to make the ASCII case more ergonomic, as @jrose points out. Hence what I said about being in no man's land, in my opinion.

It does not follow from “Unicode [currently] does not define these properties” that this will necessarily be true forever. And, as you note, Swift already defines properties on Character despite the lack of such definitions.

I'm having trouble seeing the relevance here. Swift has already made this choice for handling strings, and I presume you're not proposing to change it, as you support it in the Motivation section. And this decision was made in the context of a mature language with a 16-bit Char and very different backwards compatibility issues. And he also writes: “So the Swift character does not have a fixed 32-bit size, it is variable length (and we should also have that construct, but that belongs in a different data type)”.

I have no idea what you are responding to here. I said that low-level text processing on Unicode scalars was a small niche of a niche (again, the larger part being processing ASCII), but you've somehow interpreted that as me saying that Go and .NET are a small niche?

Clearly an exaggeration, and applies to the current use of double-quoted literals anyway. And arbitrary changes to the Unicode specification could invalidate any part of Swift's Unicode implementation. I'll defer to the experts here, but the recent changes to grapheme breaking that I'm aware of have been to broaden what counts as a single grapheme, which seems benign in this context. And the best-effort checking can be fairly broad, as I understand it currently is, while still catching most mistakes in practice.


For my part, I have a hard time seeing how unicode scalars are "a niche within a niche" when I can't even see when I would want to use Character as my abstraction level.

I think we need to come up with a list of string processing use cases and the levels of string representation adequate for each, otherwise it's just us arguing in a void.

If you can't see when you would ever want to process a string at the Character level then the Swift string design is a failure and we have bigger problems than single-quoted literals. If you're looking for low-level string processing use cases then see the previous threads but, as I said, I mostly (only?) recall seeing ASCII examples, and this proposal doesn't seem to make that case more ergonomic.

I was hopeful I would be given one or two interesting use cases for the Character abstraction level when I wrote that. I agree the lower levels were pretty well covered in the last pitch and review threads.

My view is that searching for user-entered text is best done at the Character level, while reading textual file formats is better done at the Unicode.Scalar level in most cases.
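
For example (a sketch of why user-entered text wants Character-level comparison; "é" can be spelled as one scalar or two):

let precomposed = "caf\u{E9}"    // é as U+00E9
let decomposed = "cafe\u{301}"   // e + combining acute accent
print(precomposed == decomposed) // true: equal Character by Character
print(precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)) // false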

I think everybody should read this exchange again and again:

Character-ness is (in general) a runtime decision. It can only be done at compile-time for ASCII. So anything related to Unicode-aware (i.e. potentially non-ASCII) text processing heavily depends on runtime features and IMO should not be done at compile time (or it might introduce mismatches which lead to catastrophic bugs in other parts of the system).

Perhaps our existing .asciiValue property makes assumptions which are not appropriate for byte comparisons, but I don't think unicode scalars are any better than adding a fixed/"raw" version of .asciiValue.

Having .asciiValue collapse CRLF extended grapheme clusters to a single UInt8 (LF) was an intentional decision designed to support returning a single UInt8 value, as CRLF is the only multi-scalar extended grapheme cluster that exists in ASCII. The discussion there also predicts this question and considers returning a 2-element tuple, which I would consider a more acceptable solution than baking more stuff into the compiler.
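
A quick illustration of that behaviour, as documented for SE-0221 (I'd expect this in a playground):

let crlf: Character = "\r\n"       // CR-LF is a single grapheme cluster
print(crlf.asciiValue)             // Optional(10): collapsed to LF
print(Character("\r").asciiValue)  // Optional(13)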


(off-topic: yay for quoting code blocks :confused:)

I think this is the correct way to do it. Other languages, like C++, are moving away from compiler magic and towards 'ordinary code' which is evaluated at compile-time.

The canonical byte form of your String is provided by whatever bytes are stored in your source file. The standard library's .asciiValue (or whatever "raw" version we add) does some numeric checks which, IIUC, could be trivially implemented with @compilerEvaluable. I see no need for special syntax or additional compiler features.
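
Something like this, sketched as ordinary code (rawASCIIValue is a hypothetical name, not an existing API):

extension Unicode.Scalar {
  // Hypothetical "raw" variant: a plain range check, no CR-LF collapsing.
  var rawASCIIValue: UInt8? {
    return value < 0x80 ? UInt8(value) : nil
  }
}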

Err... in general you can't make assumptions about ASCII grapheme segmentation. Future versions of Unicode can change grapheme segmentation even for ASCII. For example, it was proposed at one point (not sure what came of it) that all contiguous horizontal whitespace be segmented as a single grapheme cluster. All grapheme-breaking in the standard library is behind a resilient function call for this reason, even the fast-paths.

But when it comes to declaring a Character, we're not segmenting a whole string, we're just making sure there is no boundary inside of the content. So, all single-scalar Characters are accepted, CR-LF, etc.

I'm not sure why this is relevant to this pitch regarding literals. You can already say in Swift:

let c: Character = "🧟‍♀️"

It's true that what constitutes a single grapheme cluster can and does change between Unicode versions. However, this isn't true for every part of Unicode and it definitely does not imply that "anything related to Unicode-aware text processing" must be a runtime-only feature -- that is an absurd statement.

For example, the definition of a Unicode scalar is not going to arbitrarily change whenever a new set of emojis is introduced. We can safely let the compiler decide whether a single-quoted literal contains a single Unicode scalar, like this pitch proposes.
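
That is the same compile-time validation that double-quoted Unicode.Scalar literals already get today (sketch; the commented line is rejected by the compiler):

let ok: Unicode.Scalar = ";"             // exactly one scalar: compiles
// let bad: Unicode.Scalar = "e\u{301}"  // two scalars: compile-time error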

Yes, they are.

A Unicode scalar is a well-understood concept that doesn't depend on the version of Unicode.

There are stable mappings between legacy encodings (such as ASCII and Latin-15) and Unicode scalars. (In fact, such mappings are arguably the reason Unicode exists.) The definition of which Unicode scalars encode the 128 ASCII characters is never, ever going to change.

We can trust that the 137,928 characters that Unicode 12 encodes will remain encoded as such.

Unicode categorizes the characters it defines in interesting ways that have highly practical applications; we expose these categorizations through properties on Unicode.Scalar.
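
For example (a playground sketch; I'd expect these results from the documented Unicode.Scalar.Properties API):

let euro: Unicode.Scalar = "€"
print(euro.properties.name ?? "?")      // "EURO SIGN"
print(euro.properties.generalCategory)  // currencySymbol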

Sadly, intentional decisions can be just as wrong as accidental mistakes. (If only that wasn't the case!)

Character is an inappropriate host for such encoding-related properties, and it should've never grown an asciiValue property. Unicode.Scalar is the proper place for such things.

Ignoring that there is no way to guarantee a Character won't ever contain more than two ASCII characters at the same time, how do you propose that would look in actual use?

There are many correct ways of extracting the ASCII portions of a string as an array of UInt8s. My function above is definitely not one of them, for the obvious reason that it doesn't return all the ASCII parts. This was my entire point.

Here are a few implementations that do work correctly: (at least in theory -- I haven't tried any of these)

func asciiBytes1(of input: String) -> [UInt8] {
  return input.utf8.filter { $0 < 0x80 }
}
func asciiBytes2a(of input: String) -> [UInt8] {
  return input.unicodeScalars.compactMap { $0.value < 0x80 ? UInt8($0.value) : nil }
}
func asciiBytes2b(of input: String) -> [UInt8] {
  // If we have Unicode.Scalar.asciiValue:
  return input.unicodeScalars.compactMap { $0.asciiValue }
}
func asciiBytes2c(of input: String) -> [UInt8] {
  // If we have a trapping Unicode.Scalar.ascii:
  return input.unicodeScalars.compactMap { $0.isASCII ? $0.ascii : nil }
}
func asciiBytes3(of input: String) -> [UInt8] {
  // Character has no direct utf8 view, so round-trip through String.
  return input.flatMap { String($0).utf8.filter { $0 < 0x80 } }
}

Note how tricky the last case is, compared to the others.


Most textual formats should be processed at the ASCII level, which nicely matches the UTF-8 backing of strings in Swift 5. Which textual file formats do you want to process at the Unicode.Scalar level?


If you're looking only for ASCII delimiters, it'd be equivalent to scan for them at the UTF-8, UTF-16, or unicode scalar levels. But some formats like XML have unicode scalar range requirements beyond ASCII, for instance what's allowed in element and attribute names.
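
For example, a NameStartChar check needs scalar ranges well beyond ASCII (a sketch covering only the first few ranges of the XML 1.0 production):

func isXMLNameStartChar(_ s: Unicode.Scalar) -> Bool {
  switch s.value {
  case 0x41...0x5A, 0x61...0x7A,   // A-Z, a-z
       0x3A, 0x5F,                 // ':' and '_'
       0xC0...0xD6, 0xD8...0xF6,
       0xF8...0x2FF, 0x370...0x37D, 0x37F...0x1FFF:
    return true
  default:
    return false  // remaining ranges of the production omitted here
  }
}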

What I really meant when I said that is that you can't parse them as Character because of combining code points. As silly as it might look, this is valid JSON:

["⃝"]

(And valid Swift too!)
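
You can see the fusion directly (a sketch, writing the combining scalar with an escape):

let json = "[\"\u{20DD}\"]"       // the JSON above
print(json.unicodeScalars.count)  // 5 scalars: [ " U+20DD " ]
print(json.count)                 // 4 Characters: the quote fuses with U+20DD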

I think we're talking past each other somewhat. It's equivalent semantically, but not ergonomically. The only benefit of this proposal is to make it more ergonomic to express Unicode.Scalars, but this does nothing to make it more ergonomic to efficiently scan a string for ASCII delimiters, which is the primary use case that has been presented. Nobody is suggesting that you should parse JSON at the Character level.

You keep stating that this is the primary use case; it may be yours, but it is emphatically not the primary use case for this pitch. As the core team decided, the pitch should not touch the topic of ASCII APIs. @michelf is illustrating one major use case improved by making Unicode scalars more ergonomic to use. Again, at the core team's direction, it is explicitly a non-goal of this pitch to make any changes to ASCII facilities available in Swift.

It's not my primary use case, it's the primary, or only, use case presented in all the threads leading to this point. If you read my replies here, you'll see I would prefer to make Character more ergonomic, if we're choosing one. And if ASCII processing isn't important, then perhaps you should mention that to whoever wrote the pitch:

I'm really only stating that low-level text processing, the motivation mentioned constantly throughout the pitch, is primarily done at the ASCII level, and I don't see how an ergonomic way of expressing Unicode.Scalar leads to an ergonomic and efficient way of doing such processing.

I'm not sure what major use case you're referring to. JSON parsing would not be done at the Unicode.Scalar level.


ASCII byte processing is very important, but it is not the focus of this pitch. It can be done at the Unicode scalar level, however, and cannot be done at the level of extended grapheme clusters.

Unicode text cannot be processed “at the ASCII level.” When it comes to JSON, it is properly done at the Unicode scalar level.

If you mean processing as a sequence of UTF-8 bytes, you certainly can do that, but you lose access to any Unicode properties for inspecting and manipulating any contents unless you go back to the Unicode scalar level.

I misinterpreted your question to be about looking for ASCII delimiters using Character. Sorry about that.

But you can use the unicode scalar view to parse text formats. It might not be as perfectly optimized as dealing directly with the UTF-8 code units, but it'll give you correct results and will be easier to write code for because we have a unicode scalar literal.

Yes, the various delimiters, etc. are in the ASCII-compatible range so you would scan the UTF-8 bytes, nicely matching the new encoding in Swift 5. You are only interested in comparisons to these ASCII characters, so these would be the only ones that would conceivably be expressed as literals. And, as far as I'm aware, you wouldn't be inspecting any of the Unicode.Scalar properties when parsing JSON, because they're not relevant. And even if you were for some reason, you would presumably be inspecting them on an element from the Unicode.Scalar view, not on a literal.
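
For example, counting fields in a CSV-ish line (a sketch; bytes of multi-byte UTF-8 sequences are always >= 0x80, so comparing against an ASCII delimiter byte can't misfire):

func fieldCount(of line: String) -> Int {
  let comma = UInt8(ascii: ",")  // 0x2C
  var fields = 1
  for byte in line.utf8 where byte == comma {
    fields += 1
  }
  return fields
}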
