"\r\n" normalizes to "\n" but != "\n"?

In the docs for asciiValue it mentions:
/// A character with the value "\r\n" (CR-LF) is normalized to "\n" (LF) and
/// has an asciiValue property equal to 10.

My understanding of Unicode normalization was that this implies equality, but these don't compare equal at the moment?
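
For concreteness, here's a minimal example of the behavior in question:

```swift
let crlf: Character = "\r\n"  // one Character: CR-LF is a single grapheme cluster
let lf: Character = "\n"

crlf.asciiValue  // Optional(10), as the documentation describes
lf.asciiValue    // Optional(10)
crlf == lf       // false
```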

That (CR + LF → LF) is not Unicode normalization, and the two are not canonically equivalent by Unicode’s standards. Hence they do not satisfy ==.
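
For contrast, a pair that *is* canonically equivalent does compare equal, because Swift's `==` on strings and characters is defined in terms of canonical equivalence:

```swift
// Precomposed "é" vs. "e" + U+0301 COMBINING ACUTE ACCENT
let precomposed: Character = "\u{E9}"
let decomposed: Character = "e\u{301}"
precomposed == decomposed  // true: canonically equivalent

// CR-LF vs. LF: not canonically equivalent, so not equal
("\r\n" as Character) == ("\n" as Character)  // false
```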

The “normalization” we are talking about here (in the basic dictionary sense, not the Unicode technical term) happens simply because CR + LF is one Character (an extended grapheme cluster, in Unicode parlance) but two ASCII values. For the Character instance to produce a single UInt8, it has to somehow treat the two as one. To do that, it was decided to convert the pair to the equivalent UNIX line ending when it needs to be expressed as a single ASCII byte. The alternative design choice would have been to return nil, as is done for any Unicode‐only character, but that design seems even less intuitive.
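
A rough sketch of that logic in code (just an illustration of the documented behavior, not the standard library's actual implementation; `asciiValueSketch` is a made-up name):

```swift
extension Character {
    /// Illustrative reimplementation: CR-LF is special-cased to LF;
    /// anything else must be a single ASCII scalar to yield a byte.
    var asciiValueSketch: UInt8? {
        if self == "\r\n" { return 0x0A }  // express the CR-LF pair as a lone LF
        guard unicodeScalars.count == 1,
              let scalar = unicodeScalars.first,
              scalar.isASCII
        else { return nil }                // multi-scalar or non-ASCII: no single byte
        return UInt8(scalar.value)
    }
}

("\r\n" as Character).asciiValueSketch  // Optional(10)
("é" as Character).asciiValueSketch     // nil
("A" as Character).asciiValueSketch     // Optional(65)
```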

That makes sense then, thank you. Yeah, with the alternative approach, a string containing only ASCII Unicode scalars could have characters for which isASCII returns false.
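
In other words (a small check of the current behavior):

```swift
let crlf: Character = "\r\n"
crlf.unicodeScalars.allSatisfy { $0.isASCII }  // true: both CR and LF are ASCII scalars
crlf.isASCII  // true today; under the nil-returning design it would have been false
```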

Is this the right forum category for this discussion? Just checking. :thinking:

The thread began as essentially, “The standard library seems to be doing this wrong. Should we fix it or not?” For that sentiment, I think it was posted in the right place.

(JIRA is another reasonable place for bringing this sort of thing up.)
