Escaped line endings in Strings

I find it convenient sometimes to treat a string as a collection of characters e.g.

let v = Character("a")
let c = Character("x")
let b1 = "aeiou".contains(v) // true
let b2 =  "aeiou".contains(c) // false

However, I have a piece of code that does this looking for a character that might be the start of a line ending. It's looking for \r and \n on the grounds that most reasonable line endings start with one of those two characters. The test I had didn't work because for some reason "\r\n" behaves strangely in a collection context. The following code illustrates the weirdness.

let s1 = "\r\n"
let s2 = "\r \n"
let lf = Character("\n")

print(s1.contains(lf))
print(s2.contains(lf))

print(Array(s1))
print(Array(s2))

This is the output

false
true
["\r\n"]
["\r", " ", "\n"]

The string "\r\n" does not contain a line feed but the string "\r \n" does. Not only that, but when converted to an array, the former contains only a single element consisting of a some mashup of the two characters, but the latter contains the three characters (correctly IMO).

This looks like a bug but it has occurred to me that there might be some Unicode standard that says CR LF can be considered a single character in decomposed form (or whatever the terminology).

Any comments?

1 Like

I should say, I found an easy workaround:

let s1 = "\n\r"

Gives me the behaviour I wanted.

You may find this thread interesting.

1 Like

You're probably running into trouble because "\r\n" is a single Unicode character composed of two code points. Since String operates over characters, not code points, it composes those two code points into a single character. Surprisingly, you'll find that "\r\n".count == 1!

For the specific problem you're encountering, I'd recommend using an array of characters rather than a string:

let s: [Character] = ["\r", "\n"]
let lf = Character("\n")
print(s.contains(lf))

@David_Smith, I vaguely recall you and I discussed this exact scenario way back in the day, but I can't find a record of the discussion. I know you know more than I doβ€”can you elaborate at all?

Edit: @tera beat me to it. :slight_smile:

2 Likes

To be annoyingly precise, an "extended grapheme cluster."

3 Likes

Yes, but I'm lazy and it's a drag having to type all those commas and quotes and have to put the type annotation on so you don't get an array of strings.

I was annoyed by this, but it actually makes my problem a bit easier because I can test for all types of line ending assuming the line ending is a single character instead of sometimes being two.

Sigh. Yes. :face_with_spiral_eyes:

2 Likes

If you want to test if a character is a newline, you can use Character.isNewline. For example:

let s = "Hello\rWorld!\nHow's\r\nThings?"
print(s.contains(where: \.isNewline))
3 Likes

Indeed, as others have commented, you have stumbled on this edict of the Unicode standard that CRLF is a single Unicode extended grapheme cluster, which is known in Swift as a character. Hurray!

If you're interested (and how could anybody not be?) there is a whole specification on Unicode line-breaking algorithms - UAX#14.

Most of it relates to wrapping (finding opportunities to break based on how the text is presented), but it also defines mandatory breaks, which I think is what your code is looking for - elements of the text such as \n, which force a break.

Long story short, you can handle all of those mandatory breaks using Character.isNewline as @grynspan suggested.

Otherwise, just to reiterate what others have said, if you care about this specific kind of break only, CRLF is a single character and you can indeed create a Swift Character with that value:

Character("\r\n")  // Works.
2 Likes

Yes, I am aware of this, but I'm porting a C library to Swift. I thought I'd get a straightish translation working before I started refactoring it and making it more Swifty.

If you want to process text like C, it may be better to work with the UTF-8 view and forgo high-level Unicode processing for now. You can replace operations with Unicode-aware variants later, where it makes sense to do so (you may even need to ignore some Unicode behaviour for compatibility reasons).

So maybe something like:

func hasCROrLF(_ str: String) -> Bool {
  str.utf8.contains { $0 == UInt8(ascii: "\n") || $0 == UInt8(ascii: "\r") }
}
9 Likes

That's how I started, but it turned out to be extremely painful.The library reads the text into a raw buffer undecoded and then converts it from whatever character set (only UTF-8 and UTF-16 supported) into effectively Unicode code points and then converts those into UTF-8 (or back into UTF-8). Then it has more code to treat multibyte UTF-8 as single characters.

I've saved several hundred lines by just converting the raw input into a sequence of Character instead of UTF-8. But, in writing this comment, I have realised my previous comment does not hold water as an excuse not to do it properly. I'm in a sort of half way house in which I'm duplicating some of the Unicode work that Apple has already done except with added bugs.

2 Likes