Unexpected length of \r\n

is this a bug or a feature?

"\n\r".count == 2
"\r\n".count == 1 // ?

It looks like \r is ignored when it comes before \n, but not entirely:

"\r\n" != "\n"

I checked other languages (Kotlin, Python, Go); in those languages the length of both strings is 2.

It's expected behavior. Unicode treats CR + LF as a single "character" (aka extended grapheme cluster). This is documented in Unicode Standard Annex #29, Text Segmentation (see grapheme cluster boundary rule GB3, "Do not break between a CR and LF").

Swift follows Unicode, so "\r\n" is a single Character.
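
As a minimal sketch of the point above (the variable names are just for illustration), the two strings from the question can be compared at the Character level using only the standard library:

```swift
let crlf = "\r\n"   // CR followed by LF
let lfcr = "\n\r"   // LF followed by CR

// CR + LF forms one extended grapheme cluster, so it is one Character.
print(crlf.count)        // 1

// LF + CR is not covered by that rule, so it stays two Characters.
print(lfcr.count)        // 2

// The single CR+LF Character is still distinct from a bare "\n".
print(crlf == "\n")      // false
```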


Afaik, there is no consensus about the answer ;-)
I guess Swift has that feature because with Unicode, such odd behavior is common, so you can't rely on length(a + b) == length(a) + length(b) anyway.
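
A small sketch of that non-additivity, assuming nothing beyond the standard library (the names a, b, e, and acute are illustrative only):

```swift
let a = "\r"
let b = "\n"
print(a.count + b.count)       // 2
print((a + b).count)           // 1, the two scalars merge into one Character

// The same happens with a base letter and a combining mark.
let e = "e"
let acute = "\u{0301}"         // COMBINING ACUTE ACCENT
print(e.count + acute.count)   // 2
print((e + acute).count)       // 1
```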


I checked three more languages: Java, C#, and Rust. The length of "\r\n" is 2 there as well, just like in the three above.

good to know swift is correct and everything else isn't :)

Note that if you want the “incorrect” but common result, you can use "\r\n".unicodeScalars.count.


In general, there is no common result:

utf16.count should be consistent with Kotlin, Obj-C, and Java
utf8.count should be consistent with Rust's len() and Go's len()
I think not even C is consistent with C
unicodeScalars.count should be consistent with Rust's chars().count() and Python 3's len()


Also "\r\n".utf8.count. It's only at the Character level that you get 1.
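
To make the comparison concrete, here is a short sketch that prints the length of "\r\n" in each of String's views (standard library only; s is just an illustrative name):

```swift
let s = "\r\n"

print(s.count)                  // 1, Characters (extended grapheme clusters)
print(s.unicodeScalars.count)   // 2, Unicode scalars
print(s.utf16.count)            // 2, UTF-16 code units
print(s.utf8.count)             // 2, UTF-8 code units (bytes)
```

For this particular string the three lower-level views agree because both scalars are ASCII; for non-ASCII text they generally differ from each other as well.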

To give the cliff notes version:

  • A Unicode scalar is a 21-bit number (usually stored in 32 bits because we don't have 21-bit integer types). Scalars represent Unicode code points (excluding surrogates), including combining characters (scalars which might not even be printed characters by themselves, but combine with preceding scalars to produce a different printed character).

  • UTF-8 is a way to encode those scalars as a sequence of bytes (code units) that is compatible with ASCII. It's just another way to represent a scalar, so a code unit is also not a character.

  • A Character (as a human would understand it) can be composed of multiple unicode scalars. There isn't really any pattern to it - Unicode makes a huge table that incorporates all of that knowledge.

  • As part of that, Unicode decided that "\r\n" is a single Character. They are 2 independent scalars or code-points (and 2 UTF8/ASCII code-units, because of their particular numerical values) which combine to produce 1 character. Puh :cold_sweat:.

  • String's default Collection view is of Characters, because that gives you the most "natural" idea of the contents. For instance "👨‍👧‍👧".count == 1 ; if it used unicode scalars, the answer would be 5, and if it used UTF8 code-units, the answer would be 18!
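
The numbers in the last bullet can be checked with a sketch like the following. The emoji is spelled with scalar escapes so the zero-width joiners are visible; the Character count of 1 assumes a reasonably recent Swift toolchain whose grapheme-breaking data knows about emoji ZWJ sequences.

```swift
// MAN, ZERO WIDTH JOINER, GIRL, ZERO WIDTH JOINER, GIRL
let family = "\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}"

print(family.count)                  // 1 Character
print(family.unicodeScalars.count)   // 5 scalars (3 emoji + 2 joiners)
print(family.utf16.count)            // 8 code units (3 * 2 + 2 * 1)
print(family.utf8.count)             // 18 bytes (3 * 4 + 2 * 3)
```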


Related (invalid) bug reports: [SR-11936] String.split(separatedBy:) splits incorrectly for "\r" or "\n" if string contains "\r\n" (Issue #54355, apple/swift on GitHub) and [SR-8716] String.split no splitting CR+LF (Issue #51228, apple/swift on GitHub). The behavior you are seeing is intentional.
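
For anyone landing here from those reports, a hedged sketch of why splitting on "\n" skips CR+LF, and one way to split on any line break instead. The sample text and variable names are made up for illustration; only standard library calls are used.

```swift
let text = "one\r\ntwo\nthree"

// Splitting on the Character "\n" does not split at "\r\n",
// because "\r\n" is a different Character.
let lf: Character = "\n"
let byLF = text.split(separator: lf).map(String.init)
print(byLF.count)      // 2
print(byLF[0].count)   // 7, "one" + CR+LF (one Character) + "two"

// Character.isNewline is true for "\n", "\r", and "\r\n" alike.
let byLines = text.split(whereSeparator: { $0.isNewline }).map(String.init)
print(byLines.count)   // 3
```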

Interesting. According to a comment on the second report, this example still behaves incorrectly:

let str = "Hello\r\nplayground"
str.contains("\r") // true, but should be false
str.contains("\n") // true, but should be false

They're both (correctly) false for me in Apple Swift 5.2.4.


i see. my Xcode is 3 months old, time to update

The result changes when you import Foundation.

("\r\n").contains("\r") // false
("\r\n").contains("\n") // false


But:

import Foundation

("\r\n").contains("\r") // true
("\r\n").contains("\n") // true


(Tested in Xcode 11.5.1, Swift 5.2.4)

This is because String implicitly imports NSString.contains(_:) when you import Foundation, and the compiler prefers this over the generic Sequence.contains(_:) method. And NSString counts UTF-16 code units, not Characters.
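
If you want the Character-level answer regardless of whether Foundation is imported, one option is to make the argument type explicit so that only the Sequence methods can match. This is a sketch under that assumption, not an official workaround; the variable names are illustrative.

```swift
import Foundation

let s = "Hello\r\nplayground"

// A Character argument cannot match a string-based overload,
// so this resolves to Sequence.contains(_:) on Characters.
let cr: Character = "\r"
print(s.contains(cr))                                  // false

// contains(where:) is likewise unambiguous.
print(s.contains { $0 == "\n" })                       // false

// The CR+LF Character itself is present.
print(s.contains("\r\n" as Character))                 // true

// Dropping down a level explicitly: the CR scalar is there.
print(s.unicodeScalars.contains("\r" as Unicode.Scalar))   // true
```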


Ouch ...


Wow that’s... bad. :frowning:


Thanks @ole, we're tracking this with rdar://64449322.


similar disparity with letters like À:

//import Foundation
("\u{0041}\u{0300}").contains("\u{0041}") // false
("\u{0041}\u{0300}").contains("\u{0300}") // false

import Foundation
("\u{0041}\u{0300}").contains("\u{0041}") // false
("\u{0041}\u{0300}").contains("\u{0300}") // true

interestingly here it is "false,true" rather than "true,true" in the foundation case.
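
For reference, a sketch of what the standard library alone reports for this string, without Foundation (decomposed and precomposed are illustrative names). Character comparison in Swift is based on canonical equivalence, so the decomposed and precomposed forms of À compare equal:

```swift
let decomposed = "\u{0041}\u{0300}"   // A + COMBINING GRAVE ACCENT
let precomposed = "\u{00C0}"          // À as a single scalar

print(decomposed.count)                      // 1
print(decomposed.unicodeScalars.count)       // 2
print(precomposed.unicodeScalars.count)      // 1
print(decomposed == precomposed)             // true

// At the Character level the string contains À, but not a bare A.
print(decomposed.contains("A" as Character))            // false
print(decomposed.contains(Character(precomposed)))      // true
```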

Nice find! The following is a guess as to why this is, I could be wrong:

NSString.contains(_:) in Swift is really the Objective-C method -[NSString containsString:]. The documentation for this method says:

Calling this method is equivalent to calling rangeOfString:options: with no options.

The documentation for rangeOfString:options: says:

NSString objects are compared by checking the Unicode canonical equivalence of their code point sequences.

So this method presumably treats combining sequences as single units by default. Here's a little Obj-C snippet to confirm the behavior you've seen from Swift:

NSString *s = @"A\u0300";
NSRange range = [s rangeOfString:@"A" options:0];
// returns { location: NSNotFound, length: 0 }, aka "not found"

This should be good news from the perspective of Swift because it means that NSString.contains(_:)'s behavior is closer to what you'd expect from String in many cases.


I guess this is highly debatable.

To be clear, I'd prefer a solution where we don't bridge NSString.contains(_:) to String at all.


To be clear, it seems that Foundation's StringProtocol extension methods are working.