Async iteration over lines of a file is surprising slow

Just for my own edification I took to optimising some code from another poor-performance-related thread, Use RegexBuilder als grep replacement?.

Ultimately I discovered that using the async machinery for reading lines from a file - which seems to be formally the only scalable way to read lines from an arbitrary file, within the Swift standard libraries - is surprisingly inefficient and slow. There's a huge amount of time wasted in Swift Concurrency overhead, context switching between OS threads, etc. Simply doing the same thing synchronously, using readLine and a stdin redirection hack, is about twice as fast (and much more comparable in performance to a simple, naive approach using a shell pipeline of greps).

Can anyone explain why doing this sort of work asynchronously is so inefficient? Am I "using it wrong" re. Swift Concurrency? Is there something peculiar about how this example case is written that's causing the poor performance?

Tangentially, is there another way to read lines scalably from a file in Swift, without reimplementing readLine or using a 3rd party dependency? There's myriad unscalable ways - i.e. requiring reading the entire contents of the file into memory first - but those aren't interesting to me.

Here's the code you can throw into a Swift executable:

import Foundation
import RegexBuilder


let path = "access.log"

#if true

print("Creating synthetic log file at \(path)…")

guard let file = FileHandle(forWritingAtPath: path) else { exit(EXIT_FAILURE) }

for day in 20 ... 30 {
    for _ in 1 ... 1_000_000 {
        let method = if 0 == Int.random(in: 0...20) { "GET" } else { "POST" }
        let address = if Bool.random() { "192.168.0.\(Int.random(in: 1...254))" } else { "null" }
        let host = switch Int.random(in: 0...50) {
        case 0:
            "www.apple.com/go/applebot"
        case 1:
            "www.bing.com/bingbot.htm"
        case 2:
            "www.googlebot.com/bot.html"
        case 3:
            "Xing Bot"
        default:
            "Not a bot!"
        }

        try! file.write(contentsOf: "\(address) \(day)/Apr/2023 \"\(method) \(host)\"\n".data(using: .utf8)!)
    }
}

try! file.close()

#else

let botListRegEx = Regex {
    ChoiceOf {
        "www.apple.com/go/applebot"
        "www.bing.com/bingbot.htm"
        "www.googlebot.com/bot.html"
        "Xing Bot"
    }
}

let ipAtStartRegEx = Regex {
    Anchor.startOfLine

    Capture {
        Repeat(1...3) { .digit }
        Repeat(3...3) {
            "."
            Repeat(1...3) { .digit }
        }
    }
}.asciiOnlyDigits()

var startDate = Date()

#if false

guard let file = freopen(path, "r", stdin) else {
    exit(EXIT_FAILURE)
}

let matchedLines = sequence(state: 0) { _ in readLine() }
    .filter { $0.contains("26/Apr") }
    .filter { $0.contains("\"GET ") }
    .filter { !$0.contains(botListRegEx) }
    .compactMap { $0.firstMatch(of: ipAtStartRegEx)?.output.0 }

#else

let matchedLines = try await withThrowingTaskGroup(of: [String].self, body: { group in
    group.addTask {
        let baseLines = URL(fileURLWithPath: path).lines
        let lines = baseLines
                    .filter { $0.contains("26/Apr") }
                    .filter { $0.contains("\"GET ") }
                    .filter { !$0.contains(botListRegEx) }

        var matched : [String] = []

        for try await line in lines {
            if let m = line.firstMatch(of: ipAtStartRegEx) {
                let (s, _) = m.output
                matched.append(String(s))
            }
        }

        return matched
    }

    var matchedLines : [String] = []

    for try await partialMatchedLines in group {
        matchedLines.append(contentsOf: partialMatchedLines)
    }

    return matchedLines
})

#endif

print("Match time: \(abs(startDate.timeIntervalSinceNow)) s")
print("Found \(matchedLines.count) lines.")
print("Found \(Set(matchedLines).count) IP adresses.")

#endif

There's a pair of nested # if's: the outermost can be used to switch between generating a synthetic log file for testing and running the actual log grepping test, while the innermost switches between async and sync implementations. (I prefer to switch at compile time rather than runtime to ensure the Swift compiler's optimiser can do its best work)

It's also not favourable to the async version that it's much more verbose (although some of that in this example is due to a workaround for AsyncSequence not working correctly at the top level).

3 Likes

Thanks for the test case. I did a bunch of investigation here back when I wrote this and it was in pretty reasonable shape, but perhaps the situation has changed, or perhaps this case is different in some way. I've got a few things going on but I'll try to take a look at this soonish.

7 Likes

I am using this:

// Read a file line by line

// ---------------------------------------------------
//
class LineByLineFileReader {
    private let fp         : UnsafeMutablePointer <FILE>!
    private var buffer     : UnsafeMutablePointer <Int8>?
    private var bufferSize : Int

    init? (filePath: String) {
        self.fp = fopen (filePath, "r")
        if fp == nil {
            return nil
        }
        
        // getline function will set these as required
        self.buffer = nil
        self.bufferSize = 0
    }
    
    deinit {
        fclose (fp)
        if buffer != nil {
            free (buffer)
            buffer = nil
        }
    }
}

// ---------------------------------------------------
//
extension LineByLineFileReader {
   func getNextLine () -> String? {
        let r = getline (&buffer, &bufferSize, fp)
    
        if r == -1 {
            return nil
        }
        
        // convert the buffer contents to a string
        let line = String (cString: buffer!)

        return line
    }
}

Right, but that's reimplementing readline. It's not hard to do per se, it's just really surprising to me that Swift doesn't have a good (fast) version of this built into the core or standard libraries.

Also, while UTF-8 covers most real-world needs, it's still worth noting that you're hard-coding an implicit assumption that the file is UTF-8 formatted. I think, from a quick read of Wikipedia, that UTF-8, UTF-16 & UTF-32 can be safely read this way, since a newline (0x0A) seems like it cannot appear in any UTF-n multi-byte character encoding… but who knows if that applies to all character encodings.

1 Like

Note that you might need to manually create the "access.log" file before the sample code will work. Bafflingly, FileHandle(forWritingAtPath:) just silently returns nil if there's not already a file at the given path. I didn't notice earlier because I happened to always have the file already there when I ran this.

To put this in perspective, the following checksum calculation completes in just under 0.1 sec for this 450 MB log file:

extension URL {
    var checkSum: UInt8 {
        let fd = Darwin.open(path, O_RDONLY)
        if fd == -1 { return 0 }
        var s = stat()
        let ret = fstat(fd, &s)
        if ret < 0 { return 0 }
        let size = Int(s.st_size)
        if size < 0 { return 0 }
        guard let mem = mmap(nil, size, PROT_READ, MAP_PRIVATE, fd, 0) else {
            return 0
        }
        if mem == .init(bitPattern: -1)! { return 0 }
        let p = mem.assumingMemoryBound(to: UInt8.self)
        var sum: UInt8 = 0
        for i in 0 ..< size {
            sum &+= p[i]
        }
        munmap(mem, size)
        Darwin.close(fd)
        return sum
    }
}

I recon a manual parser that's doing what your original code is doing could complete in under 0.2 sec (the most complex part would be proper unicode support, would be simpler with ASCII).

I've seen this before many times when custom parser outperformed "normally written" code by a factor of 10x or 100x, but that still amazes me every time I see it.


My point is: if you care about performance – you'd probably prefer a solution that completes in 0.2 sec to a solution that completes in 8 seconds – but that means a custom tailored parser. And if you don't care about performance so much then maybe 30 seconds is not so bad compared to 8 seconds – in which case normal of the shelf high level code will do.

2 Likes

My concern with this mentality is that it puts Swift in the same bucket as Python. i.e. "don't expect good performance unless you (basically) write your core code in C".

I don't think I'm off-base in expecting Swift to generate performant programs. Swift has from the outset been advertised as a "systems" language, meaning in the same class as C/C++ and nominally suitable for writing space- and performance-sensitive code. It's okay if it has some weak spots - every language & stdlib does - but this is far from the first time I've written some simple code in Swift, or rewritten it from Objective-C, and found that Swift is surprisingly unperformant.

That said, I'm still holding onto the hope that I'm just doing something daft in this particular example, or overlooking some simple but effective optimisation.

5 Likes

Just to be clear, I was not suggesting rewriting in C/C++ instead of Swift. Language change could be a win but a marginal one - 20% or so, not 100x. The other example I had in mind – a 3D parser written in high level C++, then rewritten from scratch in low level C++ to get 50x performance gain – same language, the difference was in the parser, the first being very high level the second tailored and quite low level. So it is more like:

Don't expect good performance unless you write your parser from scratch.

:crossed_fingers:

I know. But your example is basically C masquerading as Swift. You're using only C library calls, doing manual memory management, etc. With only superficial syntax changes your example would be perfectly valid C.

I do like that Swift allows this. Actually having to write C modules and link them in etc is more annoying. I'm just not enthusiastic about having to write this sort of Swift.

1 Like

Using assumingMemoryBound on the pointer returned from mmap and then dereferencing it is undefined behavior. In this situation you must use bindMemory or withMemoryRebound to create a typed pointer. Alternatively, stick with the raw pointer and use load(as: UInt8.self).

Yep. You could masquerade it a bit further without sacrificing much of performance:

extension URL {
    func checkSum() throws -> UInt8 {
        let file = try self.openFile(.readOnly) // closes when variable goes out of scope
        return try file.mmap(protection: .read, flags: .private) { bytes, size in
            // INSIDE
        }
    }
}

This'd look "less C", will have ARC bits here and there; but what's inside the block marked "INSIDE" would still be very different compared to normal "every day application code" – a custom tailored parser. I guess that's the bit you don't find fascinating to write – me neither – just the reality is the current high level abstractions are too costly for the end result being ~20 seconds which is 100x compared to ~0.2 seconds achievable with a custom parser.

As mentioned in the other thread (Why is Regex slower than using a bunch of String.contains? - #16 by David_Smith), I found a few things that look suboptimal here in async line iteration and filed a bug to investigate improving it.

5 Likes