As part of the "Use RegexBuilder als grep replacement?" thread, some folks before me figured out that Regex seems a little slow. I tested this a bit myself and confirmed that Regex alone is not merely slower than "pre-screening" candidate lines with String.contains(), it's massively slower. As in, an order of magnitude slower.
The "Async iteration over lines of a file is surprising slow" topic has the base example code for context, but the essential parts are:
let botListRegEx = Regex {
    ChoiceOf {
        "www.apple.com/go/applebot"
        "www.bing.com/bingbot.htm"
        "www.googlebot.com/bot.html"
        "Xing Bot"
    }
}

let ipAtStartRegEx = Regex {
    Anchor.startOfLine
    Capture {
        Repeat(1...3) { .digit }
        Repeat(3...3) {
            "."
            Repeat(1...3) { .digit }
        }
    }
}.asciiOnlyDigits()
…and their application:
let matchedLines = sequence(state: 0) { _ in readLine() }
    .filter { $0.contains("26/Apr") }
    .filter { $0.contains("\"GET ") }
    .filter { !$0.contains(botListRegEx) }
    .compactMap { $0.firstMatch(of: ipAtStartRegEx)?.output.0 }
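To make it concrete what each stage of that pipeline keeps and drops, here is a minimal sketch that runs the same filters over a few in-memory lines instead of stdin. The sample log lines are made up for illustration (they are not from the real test data), and `sampleLines` / `matchedIPs` are names I've introduced here.

```swift
import Foundation
import RegexBuilder

let botListRegEx = Regex {
    ChoiceOf {
        "www.apple.com/go/applebot"
        "www.bing.com/bingbot.htm"
        "www.googlebot.com/bot.html"
        "Xing Bot"
    }
}

let ipAtStartRegEx = Regex {
    Anchor.startOfLine
    Capture {
        Repeat(1...3) { .digit }
        Repeat(3...3) {
            "."
            Repeat(1...3) { .digit }
        }
    }
}.asciiOnlyDigits()

// Hypothetical access-log lines, one per case the filters distinguish.
let sampleLines = [
    // Right date, GET, not a bot: kept.
    #"203.0.113.7 - - [26/Apr/2023:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0""#,
    // Right date, GET, but a known bot: dropped by the botListRegEx filter.
    #"198.51.100.2 - - [26/Apr/2023:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "bingbot/2.0; +http://www.bing.com/bingbot.htm""#,
    // Wrong date: dropped by the first contains filter.
    #"192.0.2.9 - - [25/Apr/2023:23:59:59 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0""#,
    // Not a GET: dropped by the second contains filter.
    #"192.0.2.10 - - [26/Apr/2023:10:00:02 +0000] "POST /form HTTP/1.1" 200 512 "-" "Mozilla/5.0""#,
]

let matchedIPs = sampleLines
    .filter { $0.contains("26/Apr") }
    .filter { $0.contains("\"GET ") }
    .filter { !$0.contains(botListRegEx) }
    .compactMap { $0.firstMatch(of: ipAtStartRegEx)?.output.0 }
    .map { String($0) }

print(matchedIPs)  // prints ["203.0.113.7"]
```

Only the first line survives all three filters, so the regex for the IP address is only ever run on lines that have already passed the cheap substring checks.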
In principle this can be done using a single regex, and one might think that ideally it should be for simplicity. Yet if you do that, like this:
let combinedRegex = Regex {
    ipAtStartRegEx
    ZeroOrMore(.whitespace)
    "26/Apr"
    ZeroOrMore(.anyNonNewline)
    "\"GET "
    NegativeLookahead {
        botListRegEx
    }
}

let matchedLines = sequence(state: 0) { _ in readLine() }
    .compactMap { $0.firstMatch(of: combinedRegex)?.output.0 }
…then the performance sucks. In a ~420 MB test case (test case synthesiser included in the base code) the contains-using version takes about seven seconds on my M2 MacBook Air, but the unified regex approach takes 81 seconds!
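For anyone who wants to poke at the gap on a smaller scale, here is a rough sketch of a timing harness. It is not the benchmark I ran: the synthetic lines, the `timed` helper and the single-literal `dateRegex` are all made up for illustration, and it only compares a plain substring check against the equivalent Regex on in-memory strings. Absolute numbers will vary a lot with build configuration (debug vs. release) and hardware.

```swift
import Foundation
import RegexBuilder

// Hypothetical synthetic input: every tenth line carries the target date.
let sampleLines: [String] = (0..<50_000).map { i in
    let day = i % 10 == 0 ? 26 : 25
    return #"203.0.113.\#(i % 256) - - [\#(day)/Apr/2023:10:00:00 +0000] "GET /page/\#(i) HTTP/1.1" 200 512"#
}

// A Regex that matches exactly the same literal as the contains() check.
let dateRegex = Regex { "26/Apr" }

// Hypothetical helper: runs `work`, prints the elapsed time, returns the result.
func timed<T>(_ label: String, _ work: () -> T) -> T {
    let clock = ContinuousClock()
    var result: T?
    let elapsed = clock.measure { result = work() }
    print("\(label): \(elapsed)")
    return result!
}

let viaContains = timed("String.contains") {
    sampleLines.filter { $0.contains("26/Apr") }.count
}
let viaRegex = timed("Regex") {
    sampleLines.filter { $0.contains(dateRegex) }.count
}

// Both approaches should agree on how many lines match.
print(viaContains, viaRegex)
```

Both counts should be identical; only the printed durations are of interest.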
Plus, the contains version can be further optimised to reduce Objective-C bridging overheads, like so:
let targetDate = "26/Apr" as CFString
let GET = "\"GET " as CFString

let matchedLines = sequence(state: 0) { _ in readLine() }
    .filter {
        let str = $0 as CFString
        return (kCFNotFound != CFStringFind(str, targetDate, []).location
                && kCFNotFound != CFStringFind(str, GET, []).location)
    }
    .filter { !$0.contains(botListRegEx) }
    .compactMap { $0.firstMatch(of: ipAtStartRegEx)?.output.0 }
…and that shaves it down further to five seconds. No such optimisation is possible when using Regex, as far as I can tell (Regex under the covers uses the same CoreFoundation functions as contains, among others, and seems to suffer a lot from Objective-C bridging overheads).
So sixteen times faster than Regex alone. Is that… expected? Am I misusing Regex? Are there optimisations that can be made to improve its performance?
It's particularly surprising since in the 'decomposed' case each of those contains / CFStringFind / firstMatch calls is ignorant of the others, re-scanning prefixes of the line that cannot possibly match. The combined Regex has the advantage of only having to search suffixes of the line for a possible match for each component (after the first).
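To illustrate what I mean by "only having to search suffixes", here is a hand-rolled sketch of the same idea for the two literal checks. It assumes the date always appears before the request in a log line (true for the usual access-log layout, but an assumption nonetheless), and `hasDateThenGET` is a hypothetical helper, not something from the base code. I have not measured whether this is actually faster.

```swift
import Foundation

// Hypothetical helper: finds the date first, then looks for the GET marker
// only in the part of the line after the date, rather than rescanning from
// the start of the line.
func hasDateThenGET(_ line: String) -> Bool {
    guard let dateRange = line.range(of: "26/Apr") else { return false }
    let suffix = line[dateRange.upperBound...]
    return suffix.contains("\"GET ")
}

let matching = #"203.0.113.7 - - [26/Apr/2023:10:00:00 +0000] "GET / HTTP/1.1" 200 512"#
let wrongDate = #"203.0.113.7 - - [25/Apr/2023:10:00:00 +0000] "GET / HTTP/1.1" 200 512"#
let getBeforeDate = #""GET /26/Apr HTTP/1.1""#

print(hasDateThenGET(matching))       // prints true
print(hasDateThenGET(wrongDate))      // prints false
print(hasDateThenGET(getBeforeDate))  // prints false
```

Note that this is stricter than two independent contains calls: the last example has both substrings but in the wrong order, so it is rejected here while the decomposed version would accept it.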