tera
1
is this a bug or a feature?
"\n\r".count == 2
"\r\n".count == 1 // ?
looks like \r is ignored when before \n but not entirely:
"\r\n" != "\n"
i checked other languages (kotlin, python, go), in those languages length of both strings is 2.
ole
(Ole Begemann)
2
It's expected behavior. Unicode treats CR + LF as a single "character" (aka extended grapheme cluster). This is documented in Unicode TR #29 (Text Segmentation) (search for "Do not break within CRLF").
Swift follows Unicode, so "\r\n" is a single Character.
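A minimal sketch of the behavior described above, using only the standard library (the constant names are just for illustration):

```swift
// CR followed by LF forms one extended grapheme cluster; LF followed by CR does not.
let crlf = "\r\n"
let lfcr = "\n\r"

print(crlf.count)  // 1
print(lfcr.count)  // 2

// Character(_:) traps unless the string is exactly one grapheme cluster,
// so this succeeding is another way to see that "\r\n" is one Character.
let single = Character("\r\n")
print(crlf.first == single)  // true

// The CR is not dropped: the two strings still compare as different.
print(crlf == "\n")  // false
```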
13 Likes
Tino
(Tino)
3
Afaik, there is no consensus about the answer ;-)
I guess Swift has that feature because with Unicode such odd behavior is common, so you can't rely on length(a + b) == length(a) + length(b) anyway.
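To make the length(a + b) point concrete, here is a small sketch with the two strings from this thread (the names a, b, separate and joined are made up for the example):

```swift
let a = "\r"
let b = "\n"

// Counted separately, each string is one Character.
let separate = a.count + b.count
print(separate)  // 2

// Concatenated, CR and LF merge into a single grapheme cluster.
let joined = (a + b).count
print(joined)  // 1
```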
1 Like
tera
4
i checked three more languages: java, C#, rust. length of "\r\n" is 2 in there as well as in the above three.
good to know swift is correct and everything else isn't :)
mayoff
(Rob Mayoff)
5
Note that if you want the “incorrect” but common result, you can use "\r\n".unicodeScalars.count.
3 Likes
cukr
6
In general there is no common result
utf16.count should be consistent with kotlin, obj-c, and java
utf8.count should be consistent with rust's len() (python3's len() counts code points, so it matches unicodeScalars.count instead)
I think not even C is consistent with C
unicodeScalars should be consistent with rust's chars()
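A quick sketch of how the views line up, assuming you just want to compare the counts side by side (the emoji is an extra example, not from the posts above):

```swift
let crlf = "\r\n"
print(crlf.count)                 // 1  (Characters)
print(crlf.unicodeScalars.count)  // 2  (Unicode scalars)
print(crlf.utf16.count)           // 2  (UTF-16 code units)
print(crlf.utf8.count)            // 2  (UTF-8 code units)

// For CR and LF the last three agree because both scalars are ASCII.
// A scalar outside the Basic Multilingual Plane pulls them apart.
let emoji = "\u{1F600}"
print(emoji.count)                 // 1
print(emoji.unicodeScalars.count)  // 1
print(emoji.utf16.count)           // 2
print(emoji.utf8.count)            // 4
```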
6 Likes
Karl
(👑🦆)
7
Also "\r\n".utf8.count. It's only at the Character level that you get 1.
To give the cliff notes version:
- A unicode scalar is a 21-bit number (usually represented as 32-bits because we don't have 21-bit numbers). They represent Unicode codepoints, including combining characters (scalars which might not even be printed characters by themselves, but combine with later scalars to produce a different printed character).
- UTF8 is a way to encode those scalars in a sequence of bytes (code units) that is compatible with ASCII. It's just another way to represent a scalar, so it is also not a character.
- A Character (as a human would understand it) can be composed of multiple unicode scalars. There isn't really any pattern to it - Unicode makes a huge table that incorporates all of that knowledge.
- As part of that, Unicode decided that "\r\n" is a single Character. They are 2 independent scalars or code-points (and 2 UTF8/ASCII code-units, because of their particular numerical values) which combine to produce 1 character. Phew.
- String's default Collection view is of Characters, because that gives you the most "natural" idea of the contents. For instance "👨‍👧‍👧".count == 1; if it used unicode scalars, the answer would be 5, and if it used UTF8 code-units, the answer would be 18!
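The family emoji numbers can be checked with a sketch like this one. It spells the emoji out with scalar escapes (man, zero-width joiner, girl, zero-width joiner, girl), on the assumption that this is the sequence meant above, so that the invisible joiners can't get lost in copy and paste:

```swift
// U+1F468 MAN, U+200D ZERO WIDTH JOINER, U+1F467 GIRL, U+200D, U+1F467 GIRL
let family = "\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}"

print(family.count)                 // 1   Character
print(family.unicodeScalars.count)  // 5   scalars: 3 emoji + 2 joiners
print(family.utf16.count)           // 8   each emoji is a surrogate pair, each joiner 1 unit
print(family.utf8.count)            // 18  4 bytes per emoji, 3 bytes per joiner
```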
10 Likes
tera
9
interesting. the second report brings up an example that still works wrongly according to the comment:
var str = "Hello\r\nplayground"
str.contains("\r") // true, shall be false
str.contains("\n") // true, shall be false
jrose
(Jordan Rose)
10
They're both (correctly) false for me in Apple Swift 5.2.4.
1 Like
tera
11
i see. my Xcode is 3 months old, time to update
ole
(Ole Begemann)
12
The result changes when you import Foundation.
("\r\n").contains("\r") // false
("\r\n").contains("\n") // false

But:
import Foundation
("\r\n").contains("\r") // true
("\r\n").contains("\n") // true

(Tested in Xcode 11.5.1, Swift 5.2.4)
This is because String implicitly imports NSString.contains(_:) when you import Foundation, and the compiler prefers this over the generic Sequence.contains(_:) method. And NSString counts UTF-16 code units, not Characters.
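One way to sidestep the overload question is to say explicitly which level you want to search at. This is only a sketch of that idea, not a fix for the bridging itself; it uses typed constants so that the standard library's element-based contains(_:) is the one being called, and the names are invented for the example:

```swift
let text = "Hello\r\nplayground"

// Character level: the string holds a single "\r\n" Character, not a lone "\r".
let cr: Character = "\r"
let crlf = Character("\r\n")
let hasCRCharacter = text.contains(cr)
let hasCRLFCharacter = text.contains(crlf)
print(hasCRCharacter)    // false
print(hasCRLFCharacter)  // true

// Scalar level: the CR scalar is there.
let crScalar: Unicode.Scalar = "\r"
let hasCRScalar = text.unicodeScalars.contains(crScalar)
print(hasCRScalar)  // true

// UTF-16 level: so is the code unit 0x0D.
let crUnit: UInt16 = 0x0D
let hasCRUnit = text.utf16.contains(crUnit)
print(hasCRUnit)  // true
```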
21 Likes
Thanks @ole, we're tracking this with rdar://64449322.
8 Likes
tera
16
similar disparity with letters like À:
//import Foundation
("\u{0041}\u{0300}").contains("\u{0041}") // false
("\u{0041}\u{0300}").contains("\u{0300}") // false
import Foundation
("\u{0041}\u{0300}").contains("\u{0041}") // false
("\u{0041}\u{0300}").contains("\u{0300}") // true
interestingly here it is "false,true" rather than "true,true" in the foundation case.
ole
(Ole Begemann)
17
Nice find! The following is a guess as to why this is, I could be wrong:
NSString.contains(_:) in Swift is really the Objective-C method -[NSString containsString:]. The documentation for this method says:
Calling this method is equivalent to calling rangeOfString:options: with no options.
The documentation for rangeOfString:options: says:
NSString objects are compared by checking the Unicode canonical equivalence of their code point sequences.
So this method presumably treats combining sequences as single units by default. Here's a little Obj-C snippet to confirm the behavior you've seen from Swift:
NSString *s = @"A\u0300";
NSRange range = [s rangeOfString:@"A" options:0];
// returns { location: NSNotFound, length: 0 }, aka "not found"
This should be good news from the perspective of Swift because it means that NSString.contains(_:)'s behavior is closer to what you'd expect from String in many cases.
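For comparison, here is a sketch of what plain Swift (no Foundation) does with the same combining sequence. String comparison is also based on canonical equivalence, while the scalar view still shows the difference. The constant names are made up for the example:

```swift
let decomposed = "\u{0041}\u{0300}"  // "A" followed by COMBINING GRAVE ACCENT
let precomposed = "\u{00C0}"         // LATIN CAPITAL LETTER A WITH GRAVE

// Canonically equivalent strings compare equal.
print(decomposed == precomposed)  // true

// Both are one Character, but they are made of different scalars.
print(decomposed.count)                  // 1
print(decomposed.unicodeScalars.count)   // 2
print(precomposed.unicodeScalars.count)  // 1
let sameScalars = decomposed.unicodeScalars.elementsEqual(precomposed.unicodeScalars)
print(sameScalars)  // false

// At the Character level there is no standalone "A" to find.
let plainA: Character = "A"
let hasPlainA = decomposed.contains(plainA)
print(hasPlainA)  // false
```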
1 Like
sveinhal
(Svein Halvor Halvorsen)
18
I guess this is highly debatable.
ole
(Ole Begemann)
19
To be clear, I'd prefer a solution where we don't bridge NSString.contains(_:) to String at all.
9 Likes
To be clear, it seems that Foundation's StringProtocol extension methods are working.