"\r\n" normalizes to "\n" but != "\n"?

In the docs for asciiValue it mentions:
/// A character with the value "\r\n" (CR-LF) is normalized to "\n" (LF) and
/// has an asciiValue property equal to 10.

My understanding of Unicode normalization was that this implies equality, but these don't compare equal at the moment?
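
For concreteness, here's a minimal example of the behavior in question:

```swift
let crlf: Character = "\r\n"  // one Character: CR-LF is a single grapheme cluster
let lf: Character = "\n"

crlf.asciiValue  // Optional(10), as the documentation describes
lf.asciiValue    // Optional(10)
crlf == lf       // false
```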

That (CR + LF → LF) is not Unicode normalization, and the two are not canonically equivalent by Unicode’s standards. Hence they do not satisfy ==.
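
For contrast, a pair that *is* canonically equivalent does compare equal, because Swift's `==` on strings and characters is defined in terms of canonical equivalence:

```swift
// Precomposed "é" vs. "e" + U+0301 COMBINING ACUTE ACCENT
let precomposed: Character = "\u{E9}"
let decomposed: Character = "e\u{301}"
precomposed == decomposed  // true: canonically equivalent

// CR-LF vs. LF: not canonically equivalent, so not equal
("\r\n" as Character) == ("\n" as Character)  // false
```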

The “normalization” we are talking about here (in the basic dictionary sense, not the Unicode technical term) happens simply because CR + LF is one Character (an extended grapheme cluster, in Unicode parlance) but two ASCII values. For the Character instance to produce a single UInt8, it has to somehow treat the two as one. To do that, it was decided to convert the pair to the equivalent UNIX line ending when it needs to be expressed as a single ASCII byte. The alternative design choice would have been to return nil, as is done for any Unicode‐only character, but that design seems even less intuitive.
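
A rough sketch of that logic in code (just an illustration of the documented behavior, not the standard library's actual implementation; `asciiValueSketch` is a made-up name):

```swift
extension Character {
    /// Illustrative reimplementation: CR-LF is special-cased to LF;
    /// anything else must be a single ASCII scalar to yield a byte.
    var asciiValueSketch: UInt8? {
        if self == "\r\n" { return 0x0A }  // express the CR-LF pair as a lone LF
        guard unicodeScalars.count == 1,
              let scalar = unicodeScalars.first,
              scalar.isASCII
        else { return nil }                // multi-scalar or non-ASCII: no single byte
        return UInt8(scalar.value)
    }
}

("\r\n" as Character).asciiValueSketch  // Optional(10)
("é" as Character).asciiValueSketch     // nil
("A" as Character).asciiValueSketch     // Optional(65)
```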

That makes sense then, thank you. Yeah, with the alternative approach, a string containing only ASCII Unicode scalars could have characters for which isASCII returns false.
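
In other words (a small check of the current behavior):

```swift
let crlf: Character = "\r\n"
crlf.unicodeScalars.allSatisfy { $0.isASCII }  // true: both CR and LF are ASCII scalars
crlf.isASCII  // true today; under the nil-returning design it would have been false
```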

Is this the right forum category for this discussion? Just checking. :thinking:

The thread began as essentially, “The standard library seems to be doing this wrong. Should we fix it or not?” For that sentiment, I think it was posted in the right place.

(JIRA is another reasonable place for bringing this sort of thing up.)
