Read text file line by line

Would this be a valid "Swifty" solution for testing different line
endings?

No. I haven’t looked at the code in detail but it definitely includes a subtle mistake that folks commonly make when trying to integrate C APIs into a Unicode environment. Specifically, fgets knows nothing about UTF-8, so it will happily split a UTF-8 sequence. When parsing lines like this, you have to accumulate the C string into a buffer until you get to the line break and then convert that entire line to a Swift String. If you convert each chunk to a Swift String, you will run into problems if the chunk splits a UTF-8 sequence.

Consider this code snippet:

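import Foundation

// strdup copies the literal's NUL-terminated UTF-8 bytes into a mutable C buffer.
let s = strdup("Hello Naïve\nWorld!")!
let f = fmemopen(s, strlen(s), "r")
// A deliberately tiny buffer, so fgets has to return the first line in chunks.
var buf = [CChar](repeating: 0, count: 10)
while let chunk = fgets(&buf, Int32(buf.count), f) {
    // The mistake: each chunk is converted to a String on its own.
    print(String(cString: chunk))
}
fclose(f)
free(s)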

It prints:

Hello Na•
•ve

World!

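where • stands for the Unicode replacement character (U+FFFD). This is because the ï in Naïve is stored as a two-byte UTF-8 sequence, C3 AF, and the buffer size I chose just happens to split those two bytes: fgets reads at most 9 bytes into the 10-byte buffer, and C3 is the ninth byte of the line.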
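A minimal sketch of the accumulate-then-convert approach described above, assuming LF-terminated lines and input with no embedded NUL bytes. The function name readLines and its chunkSize parameter are made up for this illustration:

```swift
import Foundation

/// Reads lines with fgets, but only decodes a line once all of its bytes have arrived.
/// `readLines` and `chunkSize` are hypothetical names used for this sketch.
func readLines(from file: UnsafeMutablePointer<FILE>, chunkSize: Int = 10) -> [String] {
    var lines: [String] = []
    var pending: [UInt8] = []    // raw bytes of the line accumulated so far
    var buf = [CChar](repeating: 0, count: chunkSize)
    while fgets(&buf, Int32(buf.count), file) != nil {
        // Append this chunk's bytes, stopping at the NUL terminator fgets wrote.
        pending += buf.prefix(while: { $0 != 0 }).map { UInt8(truncatingIfNeeded: $0) }
        if pending.last == UInt8(ascii: "\n") {
            pending.removeLast()    // drop the LF itself
            // Decode the whole line at once, so no UTF-8 sequence is split.
            lines.append(String(decoding: pending, as: UTF8.self))
            pending.removeAll(keepingCapacity: true)
        }
    }
    if !pending.isEmpty {
        // Final line with no trailing LF.
        lines.append(String(decoding: pending, as: UTF8.self))
    }
    return lines
}

// Same input and same tiny buffer size as the failing snippet.
let text = strdup("Hello Naïve\nWorld!")!
let file = fmemopen(text, strlen(text), "r")!
let lines = readLines(from: file)
fclose(file)
free(text)
print(lines)
```

With the same 10-byte buffer, the C3 and AF bytes still arrive in separate chunks, but they sit next to each other in pending by the time the line is decoded. This only handles LF as a line break; anything beyond that needs the policy decision discussed below.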

The data originates from MacWinLinuxEmbedded platforms in every
variant of file format and protocol ever invented.

Parsing line breaks in the general case is really tricky. Most folks assume that line breaks are indicated by LF (Unix-y), CR (Traditional Mac OS), or CR LF (Windows). However, that’s not the case. There are plenty of files out there with non-standard line breaks, and others with mixed line breaks (source code control systems are particularly good at creating those). To sort out that mess you have to rely on heuristics.

For example:

  • I’ve seen files that use CR CR LF to indicate a single line break. This makes sense when you think about teletypes, where the second of two CRs in a row is a no-op.

  • I’ve seen files that use CR LF LF to indicate two line breaks. Again, this makes sense from a teletype perspective, where the second CR in the sequence CR LF CR LF is redundant.

  • But both of these could also represent either two or three line breaks in a mixed file.

So, before you tackle this problem you need to lock down exactly what you mean by a line break.
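To make that concrete, here is a sketch of one possible policy, not a recommendation: LF, CR, and CR LF each count as exactly one line break. The name splitLines is made up for this illustration, and it works on raw bytes so that each line is decoded as a whole:

```swift
/// Splits raw bytes into lines under one fixed policy:
/// LF, CR, and CR LF each count as exactly one line break.
/// `splitLines` is a hypothetical name used for this sketch.
func splitLines(_ bytes: [UInt8]) -> [String] {
    let cr = UInt8(ascii: "\r"), lf = UInt8(ascii: "\n")
    var lines: [String] = []
    var current: [UInt8] = []
    var i = 0
    while i < bytes.count {
        let b = bytes[i]
        if b == cr || b == lf {
            // End of line: decode everything accumulated so far in one go.
            lines.append(String(decoding: current, as: UTF8.self))
            current.removeAll(keepingCapacity: true)
            // Treat CR LF as a single break by skipping the LF.
            if b == cr, i + 1 < bytes.count, bytes[i + 1] == lf { i += 1 }
        } else {
            current.append(b)
        }
        i += 1
    }
    if !current.isEmpty {
        // Final line with no trailing line break.
        lines.append(String(decoding: current, as: UTF8.self))
    }
    return lines
}

// The two awkward sequences from the list above.
let teletype = splitLines(Array("a\r\r\nb".utf8))   // CR CR LF
let doubled = splitLines(Array("a\r\n\nb".utf8))    // CR LF LF
print(teletype, doubled)
```

Under this policy both inputs come out as "a", an empty line, and "b". That matches the two-line-break reading of CR LF LF, but it is wrong for a file that uses CR CR LF as a single break, which is exactly why the policy has to be chosen up front.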

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple
