Is this a bug or a feature?
"\n\r".count == 2
"\r\n".count == 1 // ?
It looks like \r is ignored when it comes before \n, but not entirely:
"\r\n" != "\n"
I checked other languages (Kotlin, Python, Go); in those languages the length of both strings is 2.
It's expected behavior. Unicode treats CR + LF as a single "character" (aka extended grapheme cluster). This is documented in Unicode TR #29 (Text Segmentation) (search for "Do not break within CRLF").
Swift follows Unicode, so "\r\n" is a single Character.
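A quick playground sketch of what that means in practice (the crlf name is mine, the behavior is as described above):

let crlf = "\r\n"
crlf.count                 // 1 – CR + LF form one extended grapheme cluster
crlf == "\n"               // false – not the same string, just one "character"
let c: Character = "\r\n"  // compiles: CRLF fits in a single Character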
Afaik, there is no consensus about the answer ;-)
I guess Swift has that feature because, with Unicode, such odd behavior is common anyway, so you can't rely on length(a + b) == length(a) + length(b).
I checked three more languages: Java, C#, Rust. The length of "\r\n" is 2 in those as well, as in the three above.
Good to know Swift is correct and everything else isn't :)
Note that if you want the "incorrect" but common result, you can use "\r\n".unicodeScalars.count.
In general there is no common result:
- utf16.count should be consistent with Kotlin, Obj-C, and Java
- utf8.count should be consistent with Python 3 and Rust
- unicodeScalars should be consistent with Rust's chars()
(I think not even C is consistent with C.)

Also "\r\n".utf8.count is 2. It's only at the Character level that you get 1.
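To make those comparisons concrete, here is a small sketch of what each of Swift's views reports for "\r\n":

let crlf = "\r\n"
crlf.count                  // 1 – Characters (grapheme clusters)
crlf.unicodeScalars.count   // 2 – code points U+000D and U+000A
crlf.utf16.count            // 2 – UTF-16 code units
crlf.utf8.count             // 2 – UTF-8 code units (bytes)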
To give the cliff-notes version:

A Unicode scalar is a 21-bit number (usually represented as 32 bits, because we don't have 21-bit numbers). Scalars represent Unicode code points, including combining characters (scalars which might not even be printed characters by themselves, but combine with later scalars to produce a different printed character).

UTF-8 is a way to encode those scalars in a sequence of bytes (code units) that is compatible with ASCII. It's just another way to represent a scalar, so a code unit is also not a character.

A Character (as a human would understand it) can be composed of multiple Unicode scalars. There isn't really any pattern to it; Unicode maintains a huge table that incorporates all of that knowledge.

As part of that, Unicode decided that "\r\n" is a single Character. CR and LF are 2 independent scalars or code points (and 2 UTF-8/ASCII code units, because of their particular numerical values) which combine to produce 1 character. Phew.

String's default Collection view is of Characters, because that gives you the most "natural" idea of the contents. For instance "👨‍👧‍👧".count == 1 (a family emoji built from man + ZWJ + girl + ZWJ + girl); if it used Unicode scalars, the answer would be 5, and if it used UTF-8 code units, the answer would be 18!
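Spelling the family emoji out as explicit scalars makes those numbers visible (a sketch; the zero-width joiners are what glue the emoji together):

let family = "\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}" // 👨‍👧‍👧
family.count                // 1 – one Character
family.unicodeScalars.count // 5 – man, ZWJ, girl, ZWJ, girl
family.utf16.count          // 8 – each emoji needs a surrogate pair
family.utf8.count           // 18 – 4 + 3 + 4 + 3 + 4 bytes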
Related (invalid) bug reports: [SR-11936] String.split(separatedBy:) splits incorrectly for "\r" or "\n" if string contains "\r\n" (apple/swift issue #54355) and [SR-8716] String.split no splitting CR+LF (apple/swift issue #51228). The behavior you are seeing is intentional.
Interesting. The second report brings up an example that still behaves wrongly according to the comment:
var str = "Hello\r\nplayground"
str.contains("\r") // true, shall be false
str.contains("\n") // true, shall be false
They're both (correctly) false for me in Apple Swift 5.2.4.
I see. My Xcode is 3 months old; time to update.
The result changes when you import Foundation.
("\r\n").contains("\r") // false
("\r\n").contains("\n") // false
But:
import Foundation
("\r\n").contains("\r") // true
("\r\n").contains("\n") // true
(Tested in Xcode 11.5.1, Swift 5.2.4)
This is because String implicitly imports NSString.contains(_:) when you import Foundation, and the compiler prefers this over the generic Sequence.contains(_:) method. And NSString counts UTF-16 code units, not Characters.
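If you need Character-level semantics while Foundation is imported, one workaround (a sketch, not the only option) is the closure-based Sequence.contains(where:), which has no NSString counterpart to shadow it:

import Foundation

let str = "Hello\r\nplayground"
str.contains("\r")          // true – resolves to the NSString overload (UTF-16 semantics)
str.contains { $0 == "\r" } // false – Sequence.contains(where:), Character semantics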
Ouch ...
Wow that’s... bad.
Thanks @ole, we're tracking this with rdar://64449322.
A similar disparity occurs with letters like À:
//import Foundation
("\u{0041}\u{0300}").contains("\u{0041}") // false
("\u{0041}\u{0300}").contains("\u{0300}") // false
import Foundation
("\u{0041}\u{0300}").contains("\u{0041}") // false
("\u{0041}\u{0300}").contains("\u{0300}") // true
Interestingly, here it is "false, true" rather than "true, true" in the Foundation case.
Nice find! The following is a guess as to why this is; I could be wrong:
NSString.contains(_:) in Swift is really the Objective-C method -[NSString containsString:]. The documentation for this method says:

"Calling this method is equivalent to calling rangeOfString:options: with no options."

The documentation for rangeOfString:options: says:

"NSString objects are compared by checking the Unicode canonical equivalence of their code point sequences."

So this method presumably treats combining sequences as single units by default. Here's a little Obj-C snippet to confirm the behavior you've seen from Swift:
NSString *s = @"A\u0300";
NSRange range = [s rangeOfString:@"A" options:0];
// returns { location: NSNotFound, length: 0 }, aka "not found"
This should be good news from the perspective of Swift, because it means that NSString.contains(_:)'s behavior is closer to what you'd expect from String in many cases.
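For reference, Swift's own String equality uses canonical equivalence as well, which is a quick way to see why the decomposed pair behaves as one unit (a sketch):

"\u{0041}\u{0300}" == "\u{00C0}" // true – decomposed "À" equals precomposed "À"
"\u{0041}\u{0300}".count         // 1 – A + combining grave form a single Character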
I guess this is highly debatable.
To be clear, I'd prefer a solution where we don't bridge NSString.contains(_:) to String at all.
To be clear, it seems that Foundation's StringProtocol extension methods are working.