Use RegexBuilder as a grep replacement?

Hi!
I'd like to improve an app which currently uses grep for log file parsing. A simple example in the shell is

% LOG="aaa bbb
aaa ccc
aaa ddd
aaa fff
aaa ggg
bbb hhh
aaa iii
"
% echo $LOG | egrep '(bbb|ccc)' 

shows the output

aaa bbb
aaa ccc
bbb hhh

Now I tried RegexBuilder with the following code

import RegexBuilder
let log = """
aaa bbb
aaa ccc
aaa ddd
aaa fff
aaa ggg
bbb hhh
aaa iii
"""

let logRegEx = Regex {
    Capture {
        ChoiceOf {
            "bbb"
            "ccc"
        }
    }
}

let m = log.matches(of: logRegEx)
for aMatch in m {
    print(aMatch.output)
}

which gives

("bbb", "bbb")
("ccc", "ccc")
("bbb", "bbb")

I looked at the documentation and searched the web but I have no clue how to get the lines from the original text which match.
Any clue how to proceed?

As a bonus, I need a reverse match so that I get all lines which do not match.
My goal is to filter large files by dropping unneeded lines.

The code you have written finds all matches of your regex in the whole string; it is not line-based. aMatch.output is a tuple of two strings in your example. The first element is the whole matched string, while the second element is the part matched by your capture group. Because your capture group wraps the entire regex, the two are equal. That's why you see that output.
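To make the difference between the two tuple elements visible, here is a minimal sketch in which the capture group covers only part of the regex. The names sampleLine and sampleRegex are made up for this illustration.

```swift
import RegexBuilder

// Hypothetical sample input, not taken from the real log.
let sampleLine = "aaa bbb"

// The capture group wraps only the ChoiceOf, not the leading "aaa ".
let sampleRegex = Regex {
    "aaa "
    Capture {
        ChoiceOf {
            "bbb"
            "ccc"
        }
    }
}

if let match = sampleLine.firstMatch(of: sampleRegex) {
    print(match.output.0)  // the whole match: aaa bbb
    print(match.output.1)  // only the captured part: bbb
}
```

If you only need to know whether a line matches, you can drop the Capture entirely; the output is then a single Substring.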

The code

let unmatchedLines = log.split(separator: "\n")
    .filter { $0.firstMatch(of: logRegEx) == nil }
print(unmatchedLines)

would print all lines which do not contain a match.
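Both directions (like grep and grep -v) can be wrapped in one small function. This is only a sketch: grepLines is a hypothetical helper name, not an API of RegexBuilder, and the sample text is shortened.

```swift
import RegexBuilder

// Hypothetical helper: returns the lines of `text` that contain a match of `regex`.
// With `invert: true` it returns the lines that do NOT contain a match (like `grep -v`).
func grepLines(_ text: String, matching regex: some RegexComponent, invert: Bool = false) -> [Substring] {
    text.split(whereSeparator: \.isNewline)
        .filter { $0.contains(regex) != invert }
}

let sample = """
aaa bbb
aaa ccc
aaa ddd
bbb hhh
"""

// No Capture needed: we only ask whether a line matches at all.
let pattern = Regex {
    ChoiceOf {
        "bbb"
        "ccc"
    }
}

print(grepLines(sample, matching: pattern))                // lines containing bbb or ccc
print(grepLines(sample, matching: pattern, invert: true))  // all other lines
```

The result is an array of Substring values that share storage with the original string, so no line is copied until you convert it to a String.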

Many thanks for your code sample. Works great.

Now I changed my code to compare the performance of Swift Regex with shell calls. I'm disappointed by the speed: 80 s vs. 1 s!
Any clue how to speed this up?

Here is an example with about 100 MB of real log data.
In the shell I get the following result:

% time (cd /regTest/logs;cat access.log access.log.1 | grep '26\/Apr' | egrep -v '(www.apple.com/go/applebot | www.bing.com/bingbot.htm | www.googlebot.com/bot.html | Xing Bot)' | awk '/GET/ {print $1}' | sort -n | uniq 1>/dev/null)
1,09s user 0,05s system 105% cpu 1,081 total

My Swift test takes much more time:

% time ./regTest
Found 1813 lines.
./regTest  82,54s user 0,26s system 99% cpu 1:22,83 total

Here is the sample code I used:

import Foundation
import RegexBuilder

guard let fullText = try? String(contentsOf: URL(filePath: "/regTest/logs/access.log")) + String(contentsOf: URL(filePath: "/regTest/logs/access.log.1")) else {
    print("Cannot read files!")
    exit(1)
}

let yesterdayRegEx = Regex {
    Capture {
        "26/Apr/2023"
    }
}
let botListRegEx = Regex {
    Capture {
        ChoiceOf {
            "www.apple.com/go/applebot"
            "www.bing.com/bingbot.htm"
            "www.googlebot.com/bot.html"
            "Xing Bot"
        }
    }
}

let dateMatch = fullText.split(separator: "\n")
    .filter{ $0.firstMatch(of: yesterdayRegEx) != nil }
    .filter{ $0.firstMatch(of: botListRegEx) == nil }

print("Found \(dateMatch.count) lines.")

I made another test. The split takes about 50 s and the RegEx 30 s.
So I have two different parts to improve the code.
Any clue how to make it faster?
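For reference, per-stage timings like these can be taken with ContinuousClock, which is available on the same OS versions as Regex. This is a sketch with synthetic data; the variable names and the generated log text are assumptions for illustration, not the real access.log.

```swift
import RegexBuilder

// Synthetic stand-in for the log: 2000 lines, half of them from 26/Apr/2023.
let text = String(
    repeating: "1.2.3.4 [26/Apr/2023] GET /\n5.6.7.8 [25/Apr/2023] GET /\n",
    count: 1000
)
let dateRegex = Regex { "26/Apr/2023" }
let clock = ContinuousClock()

// Stage 1: split the text into lines.
var lines: [Substring] = []
let splitTime = clock.measure {
    lines = text.split(whereSeparator: \.isNewline)
}

// Stage 2: keep only the lines that contain the date.
var matched: [Substring] = []
let matchTime = clock.measure {
    matched = lines.filter { $0.contains(dateRegex) }
}

print("Split: \(splitTime), match: \(matchTime), kept \(matched.count) of \(lines.count) lines")
```

The absolute numbers depend heavily on the build configuration and the machine, so they are only meaningful relative to each other.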

Making the split async could be faster. Using .split(whereSeparator: \.isNewline) should also be faster if you don't need exactly the same behavior.
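The behavioral difference shows up with Windows-style line endings, for example: Swift treats "\r\n" as a single Character that is not equal to "\n", while isNewline is true for it. A small sketch with a made-up input string:

```swift
// Hypothetical input mixing a CRLF and an LF line ending.
let mixed = "first\r\nsecond\nthird"

// Splits only where the Character is exactly "\n"; the "\r\n" Character does not match.
let byLineFeed = mixed.split(separator: "\n" as Character)

// Splits on every Character that counts as a newline, including "\r\n".
let byNewline = mixed.split(whereSeparator: \.isNewline)

print(byLineFeed.count)  // 2
print(byNewline.count)   // 3
```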

BTW, I don't think Swift, as a Unicode-compliant implementation, can be quite as fast as non-Unicode ones. One major goal of string processing in Swift is to make it easier and safer, while being fast enough.

Wow. That is a big improvement!

let matchedLines = fullText.split(separator: "\n")
Split time: 50.495766043663025 s

let matchedLines = fullText.split(whereSeparator: \.isNewline)
Split time: 21.207733035087585 s

I really did not expect this simple change to make such a big difference.
Still, it is much slower than the shell. I wonder if there is a way to get rid of the whole Unicode-String overhead…

Could you share somehow the real data you're trying? I'm a little bit curious.

What machine do you use? So I have a point of reference.

IIRC byte- and scalar-level Regex matching is planned as the next stage of string processing feature set? They're considered generally low-level and less safe though.
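In the meantime, a plain substring filter can be done on raw bytes with Foundation's Data, without creating a String for lines that are dropped. This is a rough sketch, not a Regex replacement: linesContainingBytes is a hypothetical helper, the sample data is made up, and whether it is actually faster would have to be measured on the real files.

```swift
import Foundation

// Hypothetical helper: keeps the lines of `data` that contain the UTF-8 bytes of `needle`.
// Only the surviving lines are decoded into String values.
func linesContainingBytes(of needle: String, in data: Data) -> [String] {
    let needleBytes = Data(needle.utf8)
    return data.split(separator: UInt8(ascii: "\n"))      // split on the LF byte
        .filter { $0.range(of: needleBytes) != nil }       // byte-wise search, no Unicode handling
        .map { String(decoding: $0, as: UTF8.self) }
}

// Made-up sample; in a real program this could come from Data(contentsOf:).
let sampleLog = Data("1.2.3.4 [26/Apr/2023] GET /\n5.6.7.8 [25/Apr/2023] GET /\n".utf8)
print(linesContainingBytes(of: "26/Apr", in: sampleLog))
```

Note that this compares bytes exactly, so it does not handle alternatives or Unicode-equivalent spellings the way a Regex on String would.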

I just did an interesting experiment on speeding up your case, and the result is as interesting as I'd expected:

I used AsyncLineSequence<FileHandle.AsyncBytes> to read in a 2GB file asynchronously line by line, instead of reading the whole string and splitting it later. It performs slightly better than String.init(contentsOfFile:), but...

Time     No Filter    Filtered
Sync     1m9.199s     3m6.004s
Async    0m53.216s    0m38.347s

With a regex filter, the async version runs even faster!

Experiment code:
import Foundation

let path = "/path/to/file.log"
let regex = #/some_word/#

do {
    let lines = FileHandle(forReadingAtPath: path)!.bytes.lines
        .filter { $0.contains(regex) }

    for try await line in lines {
        _ = line
    }
}

do {
    let lines = try String(contentsOfFile: path)
        .split(whereSeparator: \.isNewline)
        .filter { $0.contains(regex) }

    for line in lines {
        _ = line
    }
}

One more optimization idea (haven't tried it though):

I assume that by splitting by newline first, and then filtering by date, there are probably quite a few lines being split apart that never make it through (depends on the log contents i guess), so maybe it would be faster to find occurrences of the date first, and then find the surrounding newlines in some way, to avoid searching for all those newlines that don't really matter in the end.
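A rough sketch of that idea, untested for speed: search for the date string directly and ask Foundation for the line surrounding each hit via lineRange(for:). The function name linesAroundOccurrences and the sample text are made up for illustration.

```swift
import Foundation

// Hypothetical helper: finds each occurrence of `needle` and returns the line around it,
// without splitting the lines that never contain `needle`.
func linesAroundOccurrences(of needle: String, in text: String) -> [Substring] {
    var result: [Substring] = []
    var searchStart = text.startIndex
    while let hit = text.range(of: needle, range: searchStart..<text.endIndex) {
        // lineRange(for:) covers the full line, including its line terminator.
        let lineRange = text.lineRange(for: hit)
        let line = text[lineRange]
        result.append(line.last?.isNewline == true ? line.dropLast() : line)
        // Continue after this line so a line with several hits is reported once.
        searchStart = lineRange.upperBound
    }
    return result
}

let sampleText = "a 26/Apr x\nb 25/Apr y\nc 26/Apr z 26/Apr"
print(linesAroundOccurrences(of: "26/Apr", in: sampleText))
```

Whether this beats splitting first depends on how many lines actually contain the date.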

Standard question, just to double check: are you building your test programs with optimization enabled?

I analyze Apache logs. Poor man's statistics of a Web server. 😉
My test machine is a MacBook Pro with M1 Pro.

Nope. Just Xcode run.

YES! I modified my code and now I'm down to 7 s.
Here is my latest code

let paths = ["/Users/tom/Desktop/regTest/logs/access.log", "/Users/tom/Desktop/regTest/logs/access.log.1"]
var startDate = Date()
var matchedLines : [String] = []

for aPath in paths {
    let lines = FileHandle(forReadingAtPath: aPath)!.bytes.lines
        .filter { $0.contains("26/Apr") }
        .filter{ $0.firstMatch(of: botListRegEx) == nil }
    for try await line in lines {
        matchedLines.append(String(describing: line))
    }
}

print("Match time: \(abs(startDate.timeIntervalSinceNow)) s")
print("Found \(matchedLines.count) lines.")

Match time: 7.472939968109131 s
Found 1813 lines.

If you have multiple files to read, you can make it even faster with a TaskGroup.

Now 4.8 s.

let paths = ["/regTest/logs/access.log", "/regTest/logs/access.log.1"]
var startDate = Date()
var matchedLines : [String] = []

try await withThrowingTaskGroup(of: Void.self, body: { group in
    for aPath in paths {
        let lines = FileHandle(forReadingAtPath: aPath)!.bytes.lines
            .filter { $0.contains("26/Apr") }
            .filter{ $0.firstMatch(of: botListRegEx) == nil }
        for try await line in lines {
            matchedLines.append(String(describing: line))
        }
    }
})

print("Match time: \(abs(startDate.timeIntervalSinceNow)) s")
print("Found \(matchedLines.count) lines.")

Match time: 4.881006956100464 s
Found 1813 lines.

This is certainly not how TaskGroup is meant to be used 🤔 The correct way is to call group.addTask for each input file, so that the files are processed in parallel.

let paths = ["/regTest/logs/access.log", "/regTest/logs/access.log.1"]
var startDate = Date()

let matchedLines = try await withThrowingTaskGroup(of: [String].self, body: { group in
    for aPath in paths {
        group.addTask {
            let lines = FileHandle(forReadingAtPath: aPath)!.bytes.lines
                .filter { $0.contains("26/Apr") }
                // Drop bot traffic: keep only lines that do NOT match the bot list.
                .filter { !$0.contains(botListRegEx) }
            var matched : [String] = []
            for try await line in lines {
                matched.append(line)
            }
            return matched
        }
    }
    var matchedLines : [String] = []
    for try await partialMatchedLines in group {
        matchedLines.append(contentsOf: partialMatchedLines)
    }
    return matchedLines
})

print("Match time: \(abs(startDate.timeIntervalSinceNow)) s")
print("Found \(matchedLines.count) lines.")

Swift relies very heavily on the compiler's optimizer for good performance. Would definitely recommend switching to release mode.

… and that’s absolutely compatible with Xcode’s Product > Run:

  1. Choose Product > Scheme > Edit Scheme.

  2. Select the Run action on the left.

  3. Switch to the Info tab.

  4. Select Release in the Build Configuration popup.

Just remember to switch it back before you try to use the debugger (-:

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple
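
To double-check from inside the test program that a timing run really uses an optimized build, one option is a small sketch like the following. It relies on the fact that assert does not evaluate its condition in optimized (-O) builds; isDebugBuild is a hypothetical helper name.

```swift
// Hypothetical helper: returns true when assertions are active, i.e. in an unoptimized build.
func isDebugBuild() -> Bool {
    var debug = false
    // In optimized builds the condition is never evaluated, so `debug` stays false.
    assert({ debug = true; return true }())
    return debug
}

if isDebugBuild() {
    print("Warning: unoptimized build, timings will not be representative.")
} else {
    print("Optimized build.")
}
```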