This thread intrigued me, even though it's a little old.
Nowhere did @GreatOm actually provide an example log file - not even an example line from one - so I synthesised one that seemed to match the format implied by the example code and shell pipelines:
guard let file = FileHandle(forWritingAtPath: path) else { exit(EXIT_FAILURE) }
for day in 20 ... 30 {
for _ in 1 ... 1_000_000 {
let method = if 0 == Int.random(in: 0...20) { "GET" } else { "POST" }
let address = if Bool.random() { "192.168.0.\(Int.random(in: 1...254))" } else { "null" }
let host = switch Int.random(in: 0...50) {
case 0:
"www.apple.com/go/applebot"
case 1:
"www.bing.com/bingbot.htm"
case 2:
"www.googlebot.com/bot.html"
case 3:
"Xing Bot"
default:
"Not a bot!"
}
try! file.write(contentsOf: "\(address) \(day)/Apr/2023 \"\(method) \(host)\"\n".data(using: .utf8)!)
}
}
try! file.close()
That generates a 623 MB example log file.
I also had to guess what some of the regexes were, since @GreatOm never defined them.
let botListRegEx = Regex {
ChoiceOf {
"www.apple.com/go/applebot"
"www.bing.com/bingbot.htm"
"www.googlebot.com/bot.html"
"Xing Bot"
}
}
let ipAtStartRegEx = Regex {
Anchor.startOfLine
Capture {
Repeat(1...3) { .digit }
Repeat(3...3) {
"."
Repeat(1...3) { .digit }
}
}
}.asciiOnlyDigits()
On my M2 MacBook Air, the shell pipeline takes about 6 seconds:
cat access.log | grep '26\/Apr' | grep '"GET ' | egrep -v '(www.apple.com/go/applebot | www.bing.com/bingbot.htm | www.googlebot.com/bot.html | Xing Bot)' | egrep -o '^\d{1,3}(\.\d{1,3}){3}\b' | sort -u | wc -l
(I upgraded it from the original example to better match what @GreatOm gave as their 'final' Swift code)
@GreatOm's latest code here runs over the 623 MB synthesised log file in about 17.1 seconds (quite consistently), matching 21,959 lines containing 254 unique IP addresses.
I limited my testing to reading a single file at a time, for simplicity. I don't expect there'd be any inherent advantage or disadvantage to parallelism in Swift vs parallelism in the shell, since reading files independently is embarassingly parallelisable, up to the simple I/O limits of the SSD.
It was very easy - via Time Profiling in Instruments - to see some glaring problems re. the performance of the Swift code. While about 5.5 seconds were in the String.contains
implementation (mostly CFStringFindWithOptionsAndLocale
), way more than that was spent in Swift Concurrency overhead, e.g. swift_isUniquelyReferenced_nonNull_native
, (anonymous namespace)::ExecutorTrackingInfo::current()
, etc. There were also a few things which were ambiguous but I think are also essentially Swift Concurrency overhead, e.g. specialized AsyncLineSequence.AsyncIterator.next()
.
The actual file I/O seemed to add about 0.3s, depending on where you looked in Instruments. Seemingly all in latency waiting for data from the SSD. Which seems reasonable for thousands of small reads from a 623 MB file.
FileHandle
(and the similar URL(fileURLWithPath:).lines
) read in 16.47 KiB chunks, which is both a weirdly non-round size and also suspiciously small. Unfortunately there's seemingly no way to make it read a more sensible size at a time, to reduce read syscall overhead & disk I/O latency. I don't know if & to what extent the kernel was doing any read-ahead caching, though.
Figuring out how to optimise this turned out to be way harder than I anticipated. Most things I tried made no difference, even seemingly obvious things like:
-
Combining all the
filter
steps into one (to reduceAsyncIterator
overheads) - actually made it a few hundred milliseconds slower! -
Combining all the string matching into a single regex - that made it take 6.6 times longer at nearly two minutes!!! Seemingly Swift Regexes are really slow.
let combinedRegex = Regex { ipAtStartRegEx ZeroOrMore(.whitespace) "26/Apr" ZeroOrMore(.anyNonNewline) "\"GET " NegativeLookahead { botListRegEx } }
-
Fiddling with the regexes re. matching behaviour, e.g. explicitly restricting them to ASCII, trying Unicode scalar matching instead of graphemes, etc. None of that made any difference.
-
Removing the unnecessary capture in the
botListRegEx
- shaved only about 0.1 second off.
My naive attempt at removing the TaskGroup
& Task
wrappings resulted in an app which just hangs forever - apparently you still can't enumerate an AsyncSequence
at the top level. Even after mucking around to fix that, it turns out it made no difference - it's not the TaskGroup
that's the problem, it's the AsyncIterator
stuff.
I tried something I know used to be reasonably efficient in the Objective-C era - creating an NSData
memory-mapping the file with the intent to somewhat manually tease it out into individual lines, but before I even got that far I noticed that it takes NSData
nearly a minute just to 'memory-map' the file, and it used 1 GiB of RAM to do it (so evidently not memory-mapping it, but reading it in v e r y s l o w l y …). So obviously something is very broken there. Sigh.
So I gave up on any of the "proper" ways of doing this in Cocoa / Swift, threw out all the async stuff, and used the Swift standard library's readLine
as a quick hack. That cut the time down to about nine seconds (and made the code so much briefer and simpler).
guard let file = freopen(path, "r", stdin) else {
exit(EXIT_FAILURE)
}
let matchedLines = sequence(state: 0) { _ in readLine() }
.filter { $0.contains("26/Apr") }
.filter { $0.contains("\"GET ") }
.filter { !$0.contains(botListRegEx) }
.compactMap { $0.firstMatch(of: ipAtStartRegEx)?.output.0 }
Now the overheads were things like Objective-C message sends (a few hundred milliseconds), memmove (several hundred milliseconds), etc. Two seconds was being spent in readLine
itself - nearly half of that is looking for newlines (getdelim
), and the rest is converting to String
s.
Then I realised from further time profiling that some of the overhead was bridging Swift String
s to CFString
s inside contains
. Doing that bridge once and calling CFStringFind
directly shaves another 2.5 seconds off, bringing the time down to 6.5 seconds. Pretty close to the grep pipeline.
let targetDate = "26/Apr" as CFString
let GET = "\"GET " as CFString
let matchedLines = sequence(state: 0) { _ in readLine() }
.filter {
let str = $0 as CFString
return (kCFNotFound != CFStringFind(str, targetDate, []).location
&& kCFNotFound != CFStringFind(str, GET, []).location)
}
.filter { !$0.contains(botListRegEx) }
.compactMap { $0.firstMatch(of: ipAtStartRegEx)?.output.0 }
There's still a few hundred milliseconds in bridging String
to CFString
, among other Objective-C & Swift overheards. But I think at this point to make it meaningfully faster you'd have to drop down to straight C (or wait for Apple to optimise Regex
).