Read text file line by line

It's regrettable that 5 years on, this is still so hard in Swift. I'm dealing with really large files. Ridiculously large tab-delimited text files and XML files that range from 500MB to 5GB per file. I handle them with Python every day. Would love a Swift solution. :slight_smile:


Are those data files generated on a different platform? Maybe you can try converting the data files so they use the correct newline character:

$ tr ^M '\n' < Textfile.txt > Newfile.txt

or

$ tr '\r' '\n' < Textfile.txt > Newfile.txt

The ^M is a single character, entered by typing Ctrl+V followed by Ctrl+M.

Keep in mind that my recommendations are limited to what’s built in to the Swift standard library and Apple systems. There’s a whole world of third-party stuff out there, and it’s much easier to access now that Xcode 11 supports Swift Package Manager.

In terms of built-in stuff:

  • You can parse arbitrarily-long XML using Foundation’s XMLParser API.
  • For text, I kinda like @linuxjh’s approach of ‘fixing’ the file. If that doesn’t work for you, you’ll need to write or acquire your own code for this.
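
For the XML case, a minimal sketch of streaming with XMLParser might look like the following. The `ElementCounter` class and `countElements(from:)` function are hypothetical names used only for illustration. The parser is handed an InputStream, so the document should not need to be loaded as a single in-memory buffer.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML   // XMLParser lives in this module on Linux
#endif

// Hypothetical delegate that only counts how often each element name starts.
final class ElementCounter: NSObject, XMLParserDelegate {
    private(set) var counts: [String: Int] = [:]

    func parser(_ parser: XMLParser,
                didStartElement elementName: String,
                namespaceURI: String?,
                qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        counts[elementName, default: 0] += 1
    }
}

// Parses XML from a stream and returns the element counts, or nil on a parse error.
func countElements(from stream: InputStream) -> [String: Int]? {
    let counter = ElementCounter()
    let parser = XMLParser(stream: stream)
    parser.delegate = counter
    guard parser.parse() else { return nil }
    return counter.counts
}
```

For a file on disk, the stream would come from something like `InputStream(url: fileURL)`, which returns an optional.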

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

I wrote this version, which reads a whole line at a time.

import Foundation

// read text file with ascii characters line by line
class File {
    init? (_ path: String) {
        errno = 0
        file = fopen(path, "r")
        if file == nil {
            perror(path)
            return nil
        }
    }
    
    deinit {
        fclose(file)
    }
    
    // Read an entire line, including its trailing '\n'.
    // A final line without a terminator is returned as-is.
    func getLine() -> String? {
        var line = ""
        repeat {
            var buf = [CChar](repeating: 0, count: 1024)
            errno = 0
            if fgets(&buf, Int32(buf.count), file) == nil {
                if ferror(file) != 0 {
                    perror(nil)
                    return nil
                }
                // End of file: hand back any partial last line
                return line.isEmpty ? nil : line
            }
            line += String(cString: buf)
        } while line.lastIndex(of: "\n") == nil
        return line
    }
    
    private var file: UnsafeMutablePointer <FILE>?
}

/*
let path = "/Users/jianhuali/temp/Newfile.txt"
if let file = File(path){
    while let line = file.getLine() {
        print(line, terminator: "")
    }
}
*/

@linuxjh Thank you for that example.

Would this be a valid "Swifty" solution for testing different line endings?

import Foundation

//var test = "12345\r12345\r"
//print(test.lastIndex(of: "\r") )

class File {
    init? (_ path: String) {
        errno = 0
        file = fopen(path, "r")
        if file == nil {
            perror(nil)
            return nil
        }
    }
    
    deinit {
        fclose(file)
    }

    func testIndex(line: String) -> Bool {
        guard line.lastIndex(of: "\r") == nil else {
            return false
        }
        guard line.lastIndex(of: "\n") == nil else {
            return false
        }
        guard line.lastIndex(of: "\r\n") == nil else {
            return false
        }
        return true
    }

    func getLine() -> String? {
        var line = ""
        repeat {
            var buf = [CChar](repeating: 0, count: 1024)
            errno = 0
            if fgets(&buf, Int32(buf.count), file) == nil {
                if feof(file) != 0 {
                    return nil
                } else {
                    perror(nil)
                    return nil
                }
            }
            line += String(cString: buf)
//        } while (line.lastIndex(of: "\r") == nil)
        } while testIndex(line: line)
        return line
    }
    private var file: UnsafeMutablePointer<FILE>? = nil
}

let path = "/Users/user/Desktop/file.txt"
if let file = File(path){
    while let line = file.getLine() {
        print(line, terminator: "")
    }
}

Are those data files generated on a different platform? Maybe you can try converting the data files so they use the correct newline character:

The data originates from MacWinLinuxEmbedded platforms in every variant of file format and protocol ever invented. :face_vomiting: Sometimes the files are totally malformed from not-so-helpful upstream pre-processing, contain multiple variants of line endings, multiple delimiters in one file, etc, you name it.

I guess the above example actually doesn't work in my test case where the line endings are "\r" because this just reads the whole test file (which is <1024 chars):

        var buf = [CChar](repeating: 0, count: 1024)

but if I change it to

        var buf = [CChar](repeating: 0, count: 2)

then it works...?
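
That behaviour is consistent with how fgets works: it stops only at '\n' (or when the buffer is full), so a CR-only file shorter than the buffer comes back in one call. With a count of 2, fgets returns one byte per call, so the loop happens to stop as soon as the accumulated text contains "\r". A rough sketch of doing the split explicitly on raw bytes, treating LF, CR, and CR LF each as one line break, could look like this. The name `splitLines` is hypothetical, and the input is assumed to be UTF-8.

```swift
// Splits raw bytes into lines, treating LF, CR, and CR LF each as one line break.
// Line terminators are not included in the results; empty lines are preserved.
func splitLines(_ bytes: [UInt8]) -> [String] {
    var lines: [String] = []
    var start = 0
    var i = 0
    while i < bytes.count {
        let byte = bytes[i]
        if byte == 0x0A || byte == 0x0D {
            lines.append(String(decoding: bytes[start..<i], as: UTF8.self))
            // Treat CR immediately followed by LF as a single break.
            if byte == 0x0D, i + 1 < bytes.count, bytes[i + 1] == 0x0A {
                i += 1
            }
            start = i + 1
        }
        i += 1
    }
    // Keep a final line that has no terminator.
    if start < bytes.count {
        lines.append(String(decoding: bytes[start...], as: UTF8.self))
    }
    return lines
}
```

Because the split happens on bytes before any decoding, a multi-byte UTF-8 sequence is never cut in half.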

Maybe instead of a custom interface for C's FILE, you could provide a sequence or iterator interface. When I was thinking of writing the previous sentence, it reminded me that I wrote code for the next step: you could use something like my InternetLines library to extract the lines after getting the raw bytes. The separation of concerns is probably better than doing the reading, buffering, and line parsing all at once.

Hmm, I think the iterator to extract from a FILE would need to be a class, since FILE-reading is more like a single-pass resource, and you need to be able to open and close properly.
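
As a rough illustration of that idea, here is a sketch of a class that owns the FILE and exposes its lines as a single-pass sequence. The `FileLines` name is hypothetical, and the sketch assumes UTF-8 text with LF line endings (a CR directly before the LF is stripped). It is not the InternetLines library mentioned above.

```swift
import Foundation

// Hypothetical single-pass line reader wrapping a C FILE.
// Being a class lets it close the file in deinit.
final class FileLines: Sequence, IteratorProtocol {
    private let file: UnsafeMutablePointer<FILE>

    init?(path: String) {
        guard let file = fopen(path, "r") else { return nil }
        self.file = file
    }

    deinit { fclose(file) }

    // Returns the next line without its terminator, or nil at end of file.
    func next() -> String? {
        var bytes: [UInt8] = []
        var readSomething = false
        while true {
            let c = fgetc(file)          // stdio buffers the underlying reads
            if c == EOF {
                if !readSomething { return nil }
                break
            }
            readSomething = true
            if c == 0x0A { break }
            bytes.append(UInt8(c))
        }
        if bytes.last == 0x0D { bytes.removeLast() }
        // Decode only once the whole line has been collected.
        return String(decoding: bytes, as: UTF8.self)
    }
}
```

With this in place, `for line in FileLines(path: somePath)!` walks the file one line at a time, and the file is closed when the object is released.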

Hi, you can back up your data files first and then convert the other line delimiters, e.g. '\r', to '\n'. That will make things easier.

Hi, do you mean without fgets? How am I supposed to read the file then?

I had an idea, which I'll admit I haven't actually looked into doing:

Given that modern OSes can map files into the address space, and given that we're exclusively in a 64-bit model now, why isn't the straightforward, modular, and Swifty approach to implement memory-mapped files followed by using the existing whole-file-at-once reader code?

If the file is huge huge huge I would just use:

let data = try? Data(contentsOf: fileURL, options: [.mappedIfSafe, .uncached])

And then step through that reading UTF-8 up to each newline. You don’t need to worry about the size of the file, it gets demand-paged into memory as needed by Mach, and paged-out when memory is low. Mach memory-mapped files have always been way faster than all the alternatives.

(Reading a stream of UTF-8 bytes is a solved problem, although it does require you to write some code. But you can look that up anywhere.)
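
A minimal sketch of that stepping loop, assuming LF-terminated UTF-8 text, might look like this. The `forEachLine(in:_:)` helper is a hypothetical name; it works on any Data value, mapped or not.

```swift
import Foundation

// Calls `body` once per LF-terminated line in `data`, decoding each line as UTF-8.
// The terminator is not passed to `body`.
func forEachLine(in data: Data, _ body: (String) -> Void) {
    var lineStart = data.startIndex
    while lineStart < data.endIndex {
        // Find the next LF at or after lineStart, or stop at the end of the data.
        let lineEnd = data[lineStart...].firstIndex(of: 0x0A) ?? data.endIndex
        body(String(decoding: data[lineStart..<lineEnd], as: UTF8.self))
        lineStart = lineEnd < data.endIndex ? data.index(after: lineEnd) : lineEnd
    }
}
```

It would be called with the result of the `Data(contentsOf:options:)` line above. Only one line is decoded into a String at a time, so the array-of-all-lines cost is avoided.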

BUT, if the file isn’t huge huge huge you should absolutely just use code like the (kinda bizarre) sample you found on StackOverflow, like:

import Foundation

let url = URL(fileURLWithPath: "/Users/yourname/file.txt")

try? String(contentsOf: url, encoding: .utf8)
    .split(separator: "\n")
    .forEach { line in
        print("line: \(line)")
}

Why make work for yourself if it runs fast enough in two lines of code?

-Wil


why [not] implement memory-mapped files followed by using the existing
whole-file-at-once reader code?

There are three problems with this:

  • Memory mapping isn’t always safe. If there is an error reading a memory-mapped file, the VM system translates that to a memory access exception, and you really don’t want to get into the business of trying to handle those! This means that you can only rely on memory mapping from the root volume [1], the logic there being that, if the root volume starts returning disk errors, your app isn’t going to live long anyway.

    Keep in mind that, on macOS, the user’s home directory might not be on the root volume, and thus the fact that the file is in the user’s home directory doesn’t buy you anything. Indeed, macOS supports network home directories, where disk errors are common.

    Note that Data has a .mappedIfSafe option but using that on huge files is problematic because, if Data decides that mapping isn’t safe, it reverts to allocate-then-read, which will likely cause your process to be jetsam’ed as you scream past your memory budget.

  • While all current iOS devices run app code in a 64-bit address space, you don’t get to use all of that space. If you memory map a huge file on iOS, you’ll run out of address space.

  • If, for example, you want to break up the file into an array of strings that represent the lines, you have to be absolutely sure that the strings share storage with the memory mapping. I expect that’s possible to do, but it’s tricky code.

    Even if you achieve this, for a truly huge file you have to worry about the memory consumption of the array itself.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

[1] On platforms with a read-only system partition (iOS and macOS 10.15 beta), this includes both the system and data volumes.


Thanks William

Would this be a valid "Swifty" solution for testing different line
endings?

No. I haven’t looked at the code in detail but it definitely includes a subtle mistake that folks commonly make when trying to integrate C APIs into a Unicode environment. Specifically, fgets knows nothing about UTF-8, so it will happily split a UTF-8 sequence. When parsing lines like this, you have to accumulate the C string into a buffer until you get to the line break and then convert that entire line to a Swift String. If you convert each chunk to a Swift String, you will run into problems if the chunk splits a UTF-8 sequence.

Consider this code snippet:

let a = [CChar]("Hello Naïve\nWorld!".utf8.map { CChar(bitPattern: $0) })
let s = strdup(a)!
let f = fmemopen(s, strlen(s), "r")
var buf = [CChar](repeating: 0, count: 10)
while let l = fgets(&buf, Int32(buf.count), f) {
    let s = String(cString: l)
    print(s)
}

It prints:

Hello Na•
•ve

World!

where is the Unicode replacement character. This is because the ï in Naïve is stored as a two-byte UTF-8 sequence, C3 AF, and the buffer size I chose just happens to split those two bytes.
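
One way to apply the accumulate-then-convert advice is sketched below. The `readWholeLine(from:chunkSize:)` function is a hypothetical helper, and the small default chunk size is chosen only to reproduce the split from the snippet above. It assumes LF-terminated UTF-8 input with no embedded NUL bytes.

```swift
import Foundation

// Reads one LF-terminated line from `file`, accumulating raw bytes across fgets
// calls and decoding to a String only once the line is complete.
// Returns nil at end of file. The trailing "\n", if any, is kept.
func readWholeLine(from file: UnsafeMutablePointer<FILE>, chunkSize: Int = 10) -> String? {
    var buf = [CChar](repeating: 0, count: chunkSize)
    var bytes: [UInt8] = []
    while fgets(&buf, Int32(buf.count), file) != nil {
        let length = strlen(buf)
        // Copy the chunk as raw bytes; no UTF-8 decoding happens here.
        bytes.append(contentsOf: buf[0..<length].map { UInt8(truncatingIfNeeded: $0) })
        if bytes.last == 0x0A { break }
    }
    return bytes.isEmpty ? nil : String(decoding: bytes, as: UTF8.self)
}
```

Run against the same "Hello Naïve\nWorld!" input, the C3 AF pair still lands in two different fgets chunks, but the bytes are rejoined before decoding, so the ï survives.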

The data originates from MacWinLinuxEmbedded platforms in every
variant of file format and protocol ever invented.

Parsing line breaks in the general case is really tricky. Most folks assume that line breaks are indicated by LF (Unix-y), CR (Traditional Mac OS), or CR LF (Windows). However, that’s not the case. There are plenty of files out there with non-standard line breaks, and others with mixed line breaks (source code control systems are particularly good at creating those). To sort out that mess you have to rely on heuristics.

For example:

  • I’ve seen files that use CR CR LF to indicate a single line break. This makes sense when you think about teletypes, where two CRs in a row are no-ops.

  • I’ve seen files that use CR LF LF to indicate two line breaks. Again, this makes sense from a teletype perspective, where the second CR in the sequence CR LF CR LF is redundant.

  • But both of these could also represent either two or three line breaks in a mixed file.

So, before you tackle this problem you need to lock down exactly what you mean by a line break.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple


And then you have the systems that use (the occasional) FF (form feed), because, hey, why not?

I understand your broader point, but I think the closing statement is bikeshedding to some degree.

I see a lot of files in my work. Probably 99.999% of the line breaks I encounter are any one or combination of the following three: \r \n \r\n (ascii 13 & 10). I don't recollect any files in roughly the last 10 years that used something other than some combination of \r or \n to indicate a line break.

So, strictly for the sake of parsing a file one line at a time (without regard for preserving multiple contiguous line breaks in any subsequent output, since the title of this thread is "read text file line by line", not "re-write a text file line by line"), I would assert (welcoming debate here) that the vast majority of use cases could be handled by adding some syntax to Swift which interprets "any run of one or more contiguous occurrences of \r or \n as a single line break".
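
For text that is already in memory, that rule can be sketched with the existing standard library. The `nonEmptyLines(in:)` name is hypothetical. The split is done on the UTF-8 view because, at the Character level, "\r\n" is a single Character that does not compare equal to "\r" or "\n".

```swift
// Splits text on runs of CR and/or LF, so "\r\n", "\n\n", and "\r" each act as
// one line break. Empty lines are dropped as a consequence of that rule.
func nonEmptyLines(in text: String) -> [String] {
    return text.utf8
        .split(whereSeparator: { $0 == 0x0A || $0 == 0x0D })
        .map { String(decoding: $0, as: UTF8.self) }
}
```

Splitting on these two byte values is safe for UTF-8 because neither can occur inside a multi-byte sequence.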

Can you name a modern example? I haven't seen that in a long time.

No. I was just reminiscing.

That’s a wonderfully complete answer. Thank you so much!

Barry

I occasionally capture the output of some legacy applications by making them print to a text-only printer driver that captures a text file. In those files, I do regularly encounter FF and the other weird combinations mentioned by @eskimo. There are even cases where they try to print boldface by backspacing and overtyping words, or underline by backspacing and overtyping underscores...