Use RegexBuilder as grep replacement?

That makes only about a 10% difference in this sample, but it is still a good idea to keep in mind.

Thanks for your help. It is now faster, but the result is wrong: instead of 1813 lines I get just 94.

I will investigate this and report later.

Edit: Got it
I need to replace

.filter { $0.contains(botListRegEx) }

with

.filter { $0.firstMatch(of: botListRegEx) == nil }

to get again 1813 hits.
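For anyone hitting the same inversion: `contains(_:)` with a Regex keeps the lines that match, so an exclusion filter needs the negation. A tiny self-contained sketch (the pattern and sample lines here are made up, not the thread's real botListRegEx):

```swift
import Foundation

// Stand-in for the thread's botListRegEx (hypothetical pattern).
let botListRegEx = try! Regex("bingbot|googlebot")

let lines = [
    "1.2.3.4 \"GET /\" Mozilla",
    "5.6.7.8 \"GET /\" bingbot/2.0",
]

// contains(_:) with a Regex KEEPS the matching lines ...
let bots = lines.filter { $0.contains(botListRegEx) }

// ... so to EXCLUDE bot lines, negate the test (equivalently,
// check that firstMatch(of:) returns nil).
let nonBots = lines.filter { $0.firstMatch(of: botListRegEx) == nil }

print(bots.count, nonBots.count)   // → 1 1
```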

Now down to 3.3 s

Now I have added the rest of the shell tasks: extract just the IP address, sort the resulting array, and get the unique IP addresses from it.

Many thanks for your help. It was great to see how the code could be improved to run in about 3 seconds, down from more than 80 seconds.

Here is my final sample code:

let paths = ["/regTest/logs/access.log", "/regTest/logs/access.log.1"]
let startDate = Date()

let matchedLines = try await withThrowingTaskGroup(of: [String].self) { group in
    for aPath in paths {
        group.addTask {
            let lines = FileHandle(forReadingAtPath: aPath)!.bytes.lines
                        .filter { $0.contains("26/Apr") }
                        .filter { $0.contains("\"GET ") }
                        .filter { !$0.contains(botListRegEx) }
            var matched: [String] = []
            for try await line in lines {
                if let m = line.firstMatch(of: ipAtStartRegEx) {
                    let (s, _) = m.output
                    matched.append(String(s))
                }
            }
            return matched
        }
    }
    var matchedLines: [String] = []
    for try await partialMatchedLines in group {
        matchedLines.append(contentsOf: partialMatchedLines)
    }
    return matchedLines
}

let uniqArray = Array(Set(matchedLines))

print("Match time: \(abs(startDate.timeIntervalSinceNow)) s")
print("Found \(matchedLines.count) lines.")
print("Found \(uniqArray.count) IP addresses.")

Match time: 3.4042569398880005 s
Found 1665 lines.
Found 144 IP addresses.

Edit: Updated code

Oh that’s a mistake. You can use !$0.contains(botListRegEx) for better readability.


Ok. Got it.

This thread intrigued me, even though it's a little old.

Nowhere did @GreatOm actually provide an example log file - not even an example line from one - so I synthesised one that seemed to match the format implied by the example code and shell pipelines:

guard let file = FileHandle(forWritingAtPath: path) else { exit(EXIT_FAILURE) }

for day in 20 ... 30 {
    for _ in 1 ... 1_000_000 {
        let method = if 0 == Int.random(in: 0...20) { "GET" } else { "POST" }
        let address = if Bool.random() { "192.168.0.\(Int.random(in: 1...254))" } else { "null" }
        let host = switch Int.random(in: 0...50) {
        case 0:
            "www.apple.com/go/applebot"
        case 1:
            "www.bing.com/bingbot.htm"
        case 2:
            "www.googlebot.com/bot.html"
        case 3:
            "Xing Bot"
        default:
            "Not a bot!"
        }

        try! file.write(contentsOf: "\(address) \(day)/Apr/2023 \"\(method) \(host)\"\n".data(using: .utf8)!)
    }
}

try! file.close()

That generates a 623 MB example log file.
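One gotcha when reproducing the generator above: FileHandle(forWritingAtPath:) returns nil if the file does not already exist, so it has to be created first. A minimal sketch (the temp-directory path is my own placeholder, not from the thread):

```swift
import Foundation

// FileHandle(forWritingAtPath:) returns nil for a nonexistent file,
// so create an empty file first. The path here is a placeholder.
let path = NSTemporaryDirectory() + "access.log"
FileManager.default.createFile(atPath: path, contents: nil)

guard let file = FileHandle(forWritingAtPath: path) else { exit(EXIT_FAILURE) }
try! file.write(contentsOf: "192.168.0.1 26/Apr/2023 \"GET /\"\n".data(using: .utf8)!)
try! file.close()
```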

I also had to guess what some of the regexes were, since @GreatOm never defined them.

let botListRegEx = Regex {
    ChoiceOf {
        "www.apple.com/go/applebot"
        "www.bing.com/bingbot.htm"
        "www.googlebot.com/bot.html"
        "Xing Bot"
    }
}

let ipAtStartRegEx = Regex {
    Anchor.startOfLine

    Capture {
        Repeat(1...3) { .digit }
        Repeat(3...3) {
            "."
            Repeat(1...3) { .digit }
        }
    }
}.asciiOnlyDigits()
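As a sanity check, here is that IP pattern (reproduced inline so the snippet is self-contained) run against a made-up line of the implied log format:

```swift
import RegexBuilder

// Same structure as the ipAtStartRegEx above.
let ipAtStartRegEx = Regex {
    Anchor.startOfLine

    Capture {
        Repeat(1...3) { .digit }
        Repeat(3...3) {
            "."
            Repeat(1...3) { .digit }
        }
    }
}.asciiOnlyDigits()

// A synthesised line in the implied format.
let line = #"192.168.0.7 26/Apr/2023 "GET /index.html""#

if let m = line.firstMatch(of: ipAtStartRegEx) {
    // output.0 is the whole match, output.1 the captured IP.
    print(m.output.1)   // prints "192.168.0.7"
}
```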

On my M2 MacBook Air, the shell pipeline takes about 6 seconds:

cat access.log | grep '26\/Apr' | grep '"GET ' | egrep -v '(www.apple.com/go/applebot|www.bing.com/bingbot.htm|www.googlebot.com/bot.html|Xing Bot)' | egrep -o '^\d{1,3}(\.\d{1,3}){3}\b' | sort -u | wc -l

(I upgraded it from the original example to better match what @GreatOm gave as their 'final' Swift code)

@GreatOm's latest code here runs over the 623 MB synthesised log file in about 17.1 seconds (quite consistently), matching 21,959 lines containing 254 unique IP addresses.

I limited my testing to reading a single file at a time, for simplicity. I don't expect there'd be any inherent advantage or disadvantage to parallelism in Swift vs parallelism in the shell, since reading files independently is embarrassingly parallelisable, up to the simple I/O limits of the SSD.

It was very easy - via Time Profiling in Instruments - to see some glaring problems re. the performance of the Swift code. While about 5.5 seconds were in the String.contains implementation (mostly CFStringFindWithOptionsAndLocale), way more than that was spent in Swift Concurrency overhead, e.g. swift_isUniquelyReferenced_nonNull_native, (anonymous namespace)::ExecutorTrackingInfo::current(), etc. There were also a few things which were ambiguous but I think are also essentially Swift Concurrency overhead, e.g. specialized AsyncLineSequence.AsyncIterator.next().

:exploding_head:

The actual file I/O seemed to add about 0.3s, depending on where you looked in Instruments. Seemingly all in latency waiting for data from the SSD. Which seems reasonable for thousands of small reads from a 623 MB file.

FileHandle (and the similar URL(fileURLWithPath:).lines) read in 16.47 KiB chunks, which is both a weirdly non-round size and also suspiciously small. Unfortunately there's seemingly no way to make it read a more sensible size at a time, to reduce read syscall overhead & disk I/O latency. I don't know if & to what extent the kernel was doing any read-ahead caching, though.
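If the small read size matters, one workaround is to skip AsyncBytes entirely: read large chunks with FileHandle.read(upToCount:) and split lines manually. A rough sketch, not benchmarked against the thread's numbers; the 1 MiB chunk size is an arbitrary choice of mine:

```swift
import Foundation

// Sketch: read in 1 MiB chunks via FileHandle and split into lines
// manually, instead of relying on .bytes.lines' small internal reads.
// Error handling is kept minimal.
func lines(ofFileAt path: String, chunkSize: Int = 1 << 20) throws -> [String] {
    guard let file = FileHandle(forReadingAtPath: path) else { return [] }
    defer { try? file.close() }

    var result: [String] = []
    var leftover = Data()
    while let chunk = try file.read(upToCount: chunkSize), !chunk.isEmpty {
        leftover.append(chunk)
        // Split on '\n' (0x0A), keeping any trailing partial line
        // around for the next chunk.
        while let nl = leftover.firstIndex(of: 0x0A) {
            result.append(String(decoding: leftover[leftover.startIndex..<nl], as: UTF8.self))
            leftover = leftover[leftover.index(after: nl)...]
        }
    }
    if !leftover.isEmpty {
        result.append(String(decoding: leftover, as: UTF8.self))
    }
    return result
}
```

Whether the fewer, larger syscalls actually beat the 16 KiB reads would need profiling.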

:man_shrugging:

Figuring out how to optimise this turned out to be way harder than I anticipated. Most things I tried made no difference, even seemingly obvious things like:

  • Combining all the filter steps into one (to reduce AsyncIterator overheads) - actually made it a few hundred milliseconds slower!

  • Combining all the string matching into a single regex - that made it take 6.6 times longer at nearly two minutes!!! Seemingly Swift Regexes are really slow.

    let combinedRegex = Regex {
        ipAtStartRegEx
    
        ZeroOrMore(.whitespace)
    
        "26/Apr"
    
        ZeroOrMore(.anyNonNewline)
    
        "\"GET "
    
        NegativeLookahead {
            botListRegEx
        }
    }
    
  • Fiddling with the regexes re. matching behaviour, e.g. explicitly restricting them to ASCII, trying Unicode scalar matching instead of graphemes, etc. None of that made any difference.

  • Removing the unnecessary capture in the botListRegEx - shaved only about 0.1 second off.

:thinking:

My naive attempt at removing the TaskGroup & Task wrappings resulted in an app which just hangs forever - apparently you still can't enumerate an AsyncSequence at the top level. Even after mucking around to fix that, it turns out it made no difference - it's not the TaskGroup that's the problem, it's the AsyncIterator stuff.

:triumph:

I tried something I know used to be reasonably efficient in the Objective-C era - creating an NSData memory-mapping the file with the intent to somewhat manually tease it out into individual lines, but before I even got that far I noticed that it takes NSData nearly a minute just to 'memory-map' the file, and it used 1 GiB of RAM to do it (so evidently not memory-mapping it, but reading it in v e r y s l o w l y …). So obviously something is very broken there. Sigh.
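For what it's worth, Foundation's Data can request a real mmap via the .alwaysMapped reading option; I haven't verified whether it avoids the slow NSData path described above. A sketch that just counts newlines in the mapped bytes:

```swift
import Foundation

// Sketch: map a file with Data's .alwaysMapped option (which asks
// Foundation to mmap rather than read eagerly) and scan the raw
// bytes for newlines.
func countLines(mapping path: String) throws -> Int {
    let data = try Data(contentsOf: URL(fileURLWithPath: path), options: .alwaysMapped)
    var count = 0
    data.withUnsafeBytes { (buf: UnsafeRawBufferPointer) in
        for byte in buf where byte == 0x0A { count += 1 }
    }
    return count
}
```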

:man_facepalming:

So I gave up on any of the "proper" ways of doing this in Cocoa / Swift, threw out all the async stuff, and used the Swift standard library's readLine as a quick hack. That cut the time down to about nine seconds (and made the code so much briefer and simpler).

guard let file = freopen(path, "r", stdin) else {
    exit(EXIT_FAILURE)
}

let matchedLines = sequence(state: 0) { _ in readLine() }
    .filter { $0.contains("26/Apr") }
    .filter { $0.contains("\"GET ") }
    .filter { !$0.contains(botListRegEx) }
    .compactMap { $0.firstMatch(of: ipAtStartRegEx)?.output.0 }

Now the overheads were things like Objective-C message sends (a few hundred milliseconds), memmove (several hundred milliseconds), etc. Two seconds was being spent in readLine itself - nearly half of that is looking for newlines (getdelim), and the rest is converting to Strings.

Then I realised from further time profiling that some of the overhead was bridging Swift Strings to CFStrings inside contains. Doing that bridge once and calling CFStringFind directly shaves another 2.5 seconds off, bringing the time down to 6.5 seconds. Pretty close to the grep pipeline.

let targetDate = "26/Apr" as CFString
let GET = "\"GET " as CFString

let matchedLines = sequence(state: 0) { _ in readLine() }
    .filter {
        let str = $0 as CFString
        return (kCFNotFound != CFStringFind(str, targetDate, []).location
                && kCFNotFound != CFStringFind(str, GET, []).location)
    }
    .filter { !$0.contains(botListRegEx) }
    .compactMap { $0.firstMatch(of: ipAtStartRegEx)?.output.0 }

There's still a few hundred milliseconds in bridging String to CFString, among other Objective-C & Swift overheads. But I think at this point, to make it meaningfully faster you'd have to drop down to straight C (or wait for Apple to optimise Regex).

Thanks for your dive into this problem. Your results are interesting and similar to my findings.

I wrote before

I analyze Apache logs. Poor man statistics of a Web server. :wink:

I'm happy with my current code which is now down to 3 s where the shell takes just 1 s. That's good enough for me.

The important parts are the use of a FileHandle to avoid slow String handling, and filter rules that use plain strings instead of a Regex for the first two filters:

        let lines = FileHandle(forReadingAtPath: aPath)!.bytes.lines
           .filter { $0.contains("26/Apr/2023") }
           .filter { $0.contains("GET") }
           .filter { !$0.contains(botListRegEx) }

A sample line would be:

52.167.144.125 - - [26/Apr/2023:17:57:25 +0200] "GET / HTTP/1.1" 301 732 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36"