Unexpected length of \r\n

tera · June 16, 2020, 8:37pm

is this a bug or a feature?

"\n\r".count == 2
"\r\n".count == 1 // ?

looks like \r is ignored when before \n but not entirely:

"\r\n" != "\n"

i checked other languages (kotlin, python, go), in those languages length of both strings is 2.

ole · June 16, 2020, 8:51pm

It's expected behavior. Unicode treats CR + LF as a single "character" (aka extended grapheme cluster). This is documented in Unicode TR #29 (Text Segmentation) (search for "Do not break within CRLF").

Swift follows Unicode, so "\r\n" is a single Character.
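
A minimal sketch of what this looks like in practice, using only the standard library (the variable names are illustrative, not from the thread). Iterating the string yields a single Character that still carries both scalars:

```swift
// Illustrative sketch: how Swift segments CR+LF versus LF+CR.
let crlf = "\r\n"
let lfcr = "\n\r"

// Iterating a String yields Characters (extended grapheme clusters).
let crlfCharacters = Array(crlf)
let lfcrCharacters = Array(lfcr)

print(crlfCharacters.count)  // 1: CR+LF is kept together
print(lfcrCharacters.count)  // 2: LF+CR is not a special pair

// The single Character still contains both scalars, U+000D and U+000A.
let scalarValues = crlfCharacters[0].unicodeScalars.map { $0.value }
print(scalarValues)          // [13, 10]

// It counts as a newline, but it is not equal to a lone "\n".
print(crlfCharacters[0].isNewline)  // true
print(crlf == "\n")                 // false
```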

Tino · June 16, 2020, 8:55pm

Afaik, there is no consensus about the answer ;-)
I guess Swift has that feature because with Unicode, such odd behavior is common, so you can't rely on length(a + b) == length(a) + length(b) anyway.

tera · June 16, 2020, 9:32pm

i checked three more languages: java, C#, rust. length of "\r\n" is 2 in there as well as in the above three.

good to know swift is correct and everything else isn't :)

mayoff · June 16, 2020, 9:35pm

Note that if you want the “incorrect” but common result, you can use "\r\n".unicodeScalars.count.

cukr · June 16, 2020, 9:55pm

In general there is no common result

utf16.count should be consistent with kotlin, obj-c, and java
utf8.count should be consistent with rust's len()
I think not even C is consistent with C
unicodeScalars.count should be consistent with python3's len() and rust's chars().count()
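
To see how the views diverge, here is a small sketch. The helper `lengths(of:)` is hypothetical and only exists for this illustration; it reports the same string's length in each of Swift's standard views:

```swift
// Hypothetical helper: the length of a string in each of Swift's views.
func lengths(of s: String) -> (characters: Int, scalars: Int, utf16: Int, utf8: Int) {
    return (s.count, s.unicodeScalars.count, s.utf16.count, s.utf8.count)
}

// "\r\n": 1 Character, 2 scalars, 2 UTF-16 code units, 2 UTF-8 code units.
let crlfLengths = lengths(of: "\r\n")
print(crlfLengths)

// U+00E9 (é) followed by U+1F600 (😀):
// 2 Characters, 2 scalars, 3 UTF-16 code units, 6 UTF-8 code units.
let mixedLengths = lengths(of: "\u{00E9}\u{1F600}")
print(mixedLengths)
```

For "\r\n" every view except the Character view agrees on 2, which is why the other languages in this thread all report 2. For non-ASCII text the views differ from each other as well.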

Karl · June 16, 2020, 10:08pm

Also "\r\n".utf8.count. It's only at the Character level that you get 1.

To give the cliff notes version:

A unicode scalar is a 21-bit number (usually represented as 32-bits because we don't have 21-bit numbers). They represent Unicode codepoints, including combining characters (scalars which might not even be printed characters by themselves, but combine with later scalars to produce a different printed character).
UTF8 is a way to encode those scalars in a sequence of bytes (code units) that is compatible with ASCII. It's just another way to represent a scalar, so it is also not a character.
A Character (as a human would understand it) can be composed of multiple unicode scalars. There isn't really any pattern to it - Unicode makes a huge table that incorporates all of that knowledge.
As part of that, Unicode decided that "\r\n" is a single Character. They are 2 independent scalars or code-points (and 2 UTF8/ASCII code-units, because of their particular numerical values) which combine to produce 1 character. Phew.
String's default Collection view is of Characters, because that gives you the most "natural" idea of the contents. For instance "👨‍👧‍👧".count == 1 ; if it used unicode scalars, the answer would be 5, and if it used UTF8 code-units, the answer would be 18!
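
The emoji numbers above can be checked with a short sketch. It spells the family emoji with scalar escapes (man, zero-width joiner, girl, zero-width joiner, girl) so the pieces are visible; the variable names are illustrative:

```swift
// Man + ZWJ + girl + ZWJ + girl, written as escapes so each scalar is visible.
let family = "\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}"

// The code point of each scalar, in hexadecimal.
let scalarHex = family.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) }
print(scalarHex)

// How many UTF-8 code units each scalar needs on its own.
let utf8Widths = family.unicodeScalars.map { String($0).utf8.count }
print(utf8Widths)  // [4, 3, 4, 3, 4], which sums to 18

print(family.count)                 // Characters
print(family.unicodeScalars.count)  // scalars: 5
print(family.utf8.count)            // UTF-8 code units: 18
```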

typesanitizer · June 16, 2020, 10:34pm

Related (invalid) bug reports: [SR-11936] String.split(separatedBy:) splits incorrectly for "\r" or "\n" if string contains "\r\n" (apple/swift issue #54355) and [SR-8716] String.split no splitting CR+LF (apple/swift issue #51228). The behavior you are seeing is intentional.

tera · June 16, 2020, 10:54pm

interesting. the second report brings up an example that still works wrongly according to the comment:

var str = "Hello\r\nplayground"
str.contains("\r") // true, shall be false
str.contains("\n") // true, shall be false

jrose · June 16, 2020, 11:03pm

They're both (correctly) false for me in Apple Swift 5.2.4.

tera · June 16, 2020, 11:11pm

i see. my Xcode is 3 months old, time to update

ole · June 17, 2020, 7:38am

The result changes when you import Foundation.

("\r\n").contains("\r") // false
("\r\n").contains("\n") // false

But:

import Foundation

("\r\n").contains("\r") // true
("\r\n").contains("\n") // true

(Tested in Xcode 11.5.1, Swift 5.2.4)

This is because String implicitly imports NSString.contains(_:) when you import Foundation, and the compiler prefers this over the generic Sequence.contains(_:) method. And NSString counts UTF-16 code units, not Characters.
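
One way to avoid depending on which overload the compiler picks is to name the view you mean explicitly. The following is a sketch under that assumption; it deliberately does not import Foundation and uses only standard-library calls, and the variable names are illustrative:

```swift
let text = "Hello\r\nplayground"
let cr: Unicode.Scalar = "\r"

// Character view: "\r\n" is one Character, so no Character equals a lone "\r".
let hasCRCharacter = text.contains(where: { $0 == "\r" })
print(hasCRCharacter)  // false

// Unicode scalar view: U+000D is present as a scalar.
let hasCRScalar = text.unicodeScalars.contains(cr)
print(hasCRScalar)     // true

// UTF-16 view: the granularity the post above attributes to NSString.
let hasCRUnit = text.utf16.contains(0x000D)
print(hasCRUnit)       // true

// Asking "is there any line break?" works at the Character level.
let hasNewline = text.contains(where: { $0.isNewline })
print(hasNewline)      // true
```

Because each call names its view (or passes a closure over Characters), the result should not change depending on whether Foundation is imported elsewhere in the file.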

Jens · June 17, 2020, 8:15am

Ouch ...

hisekaldma · June 17, 2020, 1:00pm

Wow that’s... bad.

Tony_Parker · June 17, 2020, 3:35pm

Thanks @ole, we're tracking this with rdar://64449322.

tera · June 17, 2020, 5:40pm

similar disparity with letters like À:

//import Foundation
("\u{0041}\u{0300}").contains("\u{0041}") // false
("\u{0041}\u{0300}").contains("\u{0300}") // false

import Foundation
("\u{0041}\u{0300}").contains("\u{0041}") // false
("\u{0041}\u{0300}").contains("\u{0300}") // true

interestingly here it is "false,true" rather than "true,true" in the foundation case.

ole · June 17, 2020, 7:20pm

tera:

import Foundation
("\u{0041}\u{0300}").contains("\u{0041}") // false
("\u{0041}\u{0300}").contains("\u{0300}") // true
interestingly here it is "false,true" rather than "true,true" in the foundation case

Nice find! The following is a guess as to why this is, I could be wrong:

NSString.contains(_:) in Swift is really the Objective-C method -[NSString containsString:]. The documentation for this method says:

Calling this method is equivalent to calling rangeOfString:options: with no options.

The documentation for rangeOfString:options: says:

NSString objects are compared by checking the Unicode canonical equivalence of their code point sequences.

So this method presumably treats combining sequences as single units by default. Here's a little Obj-C snippet to confirm the behavior you've seen from Swift:

NSString *s = @"A\u0300";
NSRange range = [s rangeOfString:@"A" options:0];
// returns { location: NSNotFound, length: 0 }, aka "not found"

This should be good news from the perspective of Swift because it means that NSString.contains(_:)'s behavior is closer to what you'd expect from String in many cases.
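
For comparison, the standard library's own String comparison is also based on canonical equivalence. A minimal sketch without Foundation (the variable names are illustrative):

```swift
let decomposed = "\u{0041}\u{0300}"  // "A" followed by a combining grave accent
let precomposed = "\u{00C0}"         // "À" as a single scalar

// Canonically equivalent strings compare equal, even though the scalars differ.
print(decomposed == precomposed)                 // true
print(decomposed.unicodeScalars.count)           // 2
print(precomposed.unicodeScalars.count)          // 1

// Both are exactly one Character.
print(decomposed.count, precomposed.count)       // 1 1

// At the Character level there is no standalone "A" in the decomposed string...
let hasACharacter = decomposed.contains(where: { $0 == "A" })
print(hasACharacter)                             // false

// ...but at the scalar level both pieces can be found.
let hasAScalar = decomposed.unicodeScalars.contains(where: { $0 == "A" })
let hasAccentScalar = decomposed.unicodeScalars.contains(where: { $0 == "\u{0300}" })
print(hasAScalar, hasAccentScalar)               // true true
```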

sveinhal · June 17, 2020, 8:16pm

I guess this is highly debatable.

ole · June 17, 2020, 8:20pm

To be clear, I'd prefer a solution where we don't bridge NSString.contains(_:) to String at all.

nukka123 · June 18, 2020, 1:38am

To be clear, it seems that Foundation's StringProtocol extension methods are the ones at work here.

https://github.com/apple/swift/blob/59add196849228148dd76fce5c20ad20719b9d1d/stdlib/public/Darwin/Foundation/NSStringAPI.swift#L1675

#endif

  //===--- From the 10.10 release notes; not in public documentation ------===//
  // No need to make these unavailable on earlier OSes, since they can
  // forward trivially to rangeOfString.

  /// Returns `true` if `other` is non-empty and contained within `self` by
  /// case-sensitive, non-literal search. Otherwise, returns `false`.
  ///
  /// Equivalent to `self.range(of: other) != nil`
  public func contains<T : StringProtocol>(_ other: T) -> Bool {
    let r = self.range(of: other) != nil
    if #available(macOS 10.10, iOS 8.0, *) {
      assert(r == _ns.contains(other._ephemeralString))
    }
    return r
  }

  /// Returns a Boolean value indicating whether the given string is non-empty
  /// and contained within this string by case-insensitive, non-literal
  /// search, taking into account the current locale.