Just for my own edification, I took to optimising some code from another poor-performance-related thread, "Use RegexBuilder als grep replacement?".
Ultimately I discovered that using the async machinery for reading lines from a file (which seems to be formally the only scalable way to read lines from an arbitrary file within the Swift standard libraries) is surprisingly inefficient and slow. A huge amount of time is wasted in Swift Concurrency overhead, context switching between OS threads, etc. Simply doing the same thing synchronously, using readLine and a stdin redirection hack, is about twice as fast (and much more comparable in performance to a simple, naive approach using a shell pipeline of greps).
Can anyone explain why doing this sort of work asynchronously is so inefficient? Am I "using it wrong" re. Swift Concurrency? Is there something peculiar about how this example case is written that's causing the poor performance?
Tangentially, is there another way to read lines scalably from a file in Swift, without reimplementing readLine or using a 3rd party dependency? There are myriad unscalable ways, i.e. ones that require reading the entire contents of the file into memory first, but those aren't interesting to me.
Here's the code you can throw into a Swift executable:
import Foundation
import RegexBuilder
let path = "access.log"
#if true
print("Creating synthetic log file at \(path)…")
// FileHandle(forWritingAtPath:) returns nil if the file does not exist yet,
// so create (or truncate) it first.
guard FileManager.default.createFile(atPath: path, contents: nil),
      let file = FileHandle(forWritingAtPath: path) else { exit(EXIT_FAILURE) }
for day in 20 ... 30 {
    for _ in 1 ... 1_000_000 {
        let method = if 0 == Int.random(in: 0...20) { "GET" } else { "POST" }
        let address = if Bool.random() { "192.168.0.\(Int.random(in: 1...254))" } else { "null" }
        let host = switch Int.random(in: 0...50) {
        case 0:
            "www.apple.com/go/applebot"
        case 1:
            "www.bing.com/bingbot.htm"
        case 2:
            "www.googlebot.com/bot.html"
        case 3:
            "Xing Bot"
        default:
            "Not a bot!"
        }
        try! file.write(contentsOf: "\(address) \(day)/Apr/2023 \"\(method) \(host)\"\n".data(using: .utf8)!)
    }
}
try! file.close()
#else
let botListRegEx = Regex {
    ChoiceOf {
        "www.apple.com/go/applebot"
        "www.bing.com/bingbot.htm"
        "www.googlebot.com/bot.html"
        "Xing Bot"
    }
}
let ipAtStartRegEx = Regex {
    Anchor.startOfLine
    Capture {
        Repeat(1...3) { .digit }
        Repeat(3...3) {
            "."
            Repeat(1...3) { .digit }
        }
    }
}.asciiOnlyDigits()
var startDate = Date()
#if false
// Redirect stdin to the log file so that readLine() reads from it.
guard freopen(path, "r", stdin) != nil else {
    exit(EXIT_FAILURE)
}
let matchedLines = sequence(state: 0) { _ in readLine() }
    .filter { $0.contains("26/Apr") }
    .filter { $0.contains("\"GET ") }
    .filter { !$0.contains(botListRegEx) }
    .compactMap { $0.firstMatch(of: ipAtStartRegEx)?.output.0 }
#else
let matchedLines = try await withThrowingTaskGroup(of: [String].self, body: { group in
    group.addTask {
        let baseLines = URL(fileURLWithPath: path).lines
        let lines = baseLines
            .filter { $0.contains("26/Apr") }
            .filter { $0.contains("\"GET ") }
            .filter { !$0.contains(botListRegEx) }
        var matched : [String] = []
        for try await line in lines {
            if let m = line.firstMatch(of: ipAtStartRegEx) {
                let (s, _) = m.output
                matched.append(String(s))
            }
        }
        return matched
    }
    var matchedLines : [String] = []
    for try await partialMatchedLines in group {
        matchedLines.append(contentsOf: partialMatchedLines)
    }
    return matchedLines
})
#endif
print("Match time: \(abs(startDate.timeIntervalSinceNow)) s")
print("Found \(matchedLines.count) lines.")
print("Found \(Set(matchedLines).count) IP addresses.")
#endif
There's a pair of nested #if directives: the outermost switches between generating a synthetic log file for testing and running the actual log grepping test, while the innermost switches between the sync and async implementations. (I prefer to switch at compile time rather than runtime to ensure the Swift compiler's optimiser can do its best work.)
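As an aside, the same compile-time switching could be driven by named compilation conditions instead of editing the literals by hand. This is only a sketch: GENERATE_LOG and USE_SYNC are hypothetical flag names that do not appear in the code above, and the strings stand in for the real branches.

```swift
// GENERATE_LOG and USE_SYNC are hypothetical flag names (assumptions for
// illustration); neither is defined unless passed to the compiler.
#if GENERATE_LOG
let task = "generate"   // stands in for the synthetic log generator
#else
let task = "grep"       // stands in for the log grepping test
#endif

#if USE_SYNC
let implementation = "sync (readLine)"
#else
let implementation = "async (URL.lines)"
#endif

print("Task: \(task), implementation: \(implementation)")
```

With swiftc, a condition is set using the -D option, e.g. `swiftc -O -D USE_SYNC main.swift`; with no flags this sketch selects the grep task and the async implementation, so the choice is still made at compile time.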
It's also not favourable to the async version that it's much more verbose (although some of that in this example is due to a workaround for AsyncSequence not working correctly at the top level).
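To make the tangential question concrete, this is roughly the kind of readLine reimplementation I'd rather not have to write. It's a minimal sketch, not a tuned or benchmarked implementation: ChunkedLineReader is a hypothetical name, the 64 KiB chunk size is an arbitrary assumption, it only splits on "\n" (no "\r\n" handling), and it decodes as UTF-8 with replacement of invalid bytes.

```swift
import Foundation

/// Hypothetical helper (not part of the code above): yields lines from a file
/// while holding only one chunk plus any partial line in memory.
struct ChunkedLineReader: Sequence, IteratorProtocol {
    private let handle: FileHandle
    private let chunkSize: Int
    private var buffer = Data()
    private var offset = 0        // index of the first unconsumed byte in buffer
    private var atEOF = false

    init?(path: String, chunkSize: Int = 64 * 1024) {
        guard let handle = FileHandle(forReadingAtPath: path) else { return nil }
        self.handle = handle
        self.chunkSize = chunkSize
    }

    mutating func next() -> String? {
        while true {
            // Return the next complete line already in the buffer, if any.
            if let newline = buffer[offset...].firstIndex(of: 0x0A) {
                let line = String(decoding: buffer[offset..<newline], as: UTF8.self)
                offset = newline + 1
                return line
            }
            if atEOF {
                // Flush a final line that has no trailing newline.
                guard offset < buffer.count else { return nil }
                defer { offset = buffer.count }
                return String(decoding: buffer[offset...], as: UTF8.self)
            }
            // Drop the consumed bytes, then read another chunk.
            buffer.removeSubrange(0..<offset)
            offset = 0
            if let chunk = try? handle.read(upToCount: chunkSize), !chunk.isEmpty {
                buffer.append(chunk)
            } else {
                atEOF = true   // nil, empty data, or a read error ends the sequence
            }
        }
    }
}
```

Because it conforms to Sequence, it could be used as `for line in ChunkedLineReader(path: path)! { ... }` or with `.lazy.filter { ... }`. Memory use is bounded by the chunk size plus the longest line, but I have not measured how it compares with readLine or URL.lines.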