Use RegexBuilder as grep replacement?

GreatOm · June 23, 2023, 1:32pm

Hi!
I'd like to improve an app that currently uses grep for log file parsing. A simple example in the shell is

% LOG="aaa bbb
aaa ccc
aaa ddd
aaa fff
aaa ggg
bbb hhh
aaa iii
"
% echo $LOG | egrep '(bbb|ccc)'

shows the output

aaa bbb
aaa ccc
bbb hhh

Now I tried RegexBuilder with the following code

import RegexBuilder
let log = """
aaa bbb
aaa ccc
aaa ddd
aaa fff
aaa ggg
bbb hhh
aaa iii
"""

let logRegEx = Regex {
    Capture {
        ChoiceOf {
            "bbb"
            "ccc"
        }
    }
}

let m = log.matches(of: logRegEx)
for aMatch in m {
    print(aMatch.output)
}

which gives

("bbb", "bbb")
("ccc", "ccc")
("bbb", "bbb")

I looked at the documentation and searched the web but I have no clue how to get the lines from the original text which match.
Any clue how to proceed?

As a bonus, I need a reverse match so that I get all lines which do not match.
My goal is to filter large files by dropping unneeded lines.

SimplyDanny · June 23, 2023, 4:16pm

The code you have written finds all matches of your regex in the whole string; it does not work line by line. aMatch.output is a tuple of two strings in your example. The first element is the whole matched string while the second element is the part matched by your capture group. Given your regex, they are both equal. That's why you see this output.

The code

let unmatchedLines = log.split(separator: "\n")
    .filter { $0.firstMatch(of: logRegEx) == nil }
print(unmatchedLines)

would print all lines which do not contain a match.
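For the original grep-style question (keeping the lines that do match), the same idea can be flipped. The following is a minimal sketch using the sample log from the first post; the names matchedLines and unmatchedLines are only illustrative, and the Capture is dropped on the assumption that only whole lines are needed.

```swift
import RegexBuilder

let log = """
aaa bbb
aaa ccc
aaa ddd
aaa fff
aaa ggg
bbb hhh
aaa iii
"""

// No Capture needed when only the presence of a match matters.
let logRegEx = Regex {
    ChoiceOf {
        "bbb"
        "ccc"
    }
}

let lines = log.split(whereSeparator: \.isNewline)

// Like `egrep '(bbb|ccc)'`: keep lines that contain a match.
let matchedLines = lines.filter { $0.contains(logRegEx) }

// Like `egrep -v '(bbb|ccc)'`: keep lines without any match.
let unmatchedLines = lines.filter { !$0.contains(logRegEx) }

print(matchedLines)
print(unmatchedLines)
```

Both results are arrays of Substring that still share storage with log, so no line is copied until it is converted to a String.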

GreatOm · June 26, 2023, 6:34am

Many thanks for your code sample. Works great.

GreatOm · June 26, 2023, 7:35am

Now I changed my code to test the performance of Swift RegEx vs. shell calls. I'm disappointed by the speed: 80 s vs. 1 s!
Any clue how to speed this up?

Here is an example with about 100 MB real log data.
In the shell I get the following result:

% time (cd /regTest/logs;cat access.log access.log.1 | grep '26\/Apr' | egrep -v '(www.apple.com/go/applebot | www.bing.com/bingbot.htm | www.googlebot.com/bot.html | Xing Bot)' | awk '/GET/ {print $1}' | sort -n | uniq 1>/dev/null)
1,09s user 0,05s system 105% cpu 1,081 total

My Swift test takes much more time:

% time ./regTest
Found 1813 lines.
./regTest  82,54s user 0,26s system 99% cpu 1:22,83 total

Here is the sample code I used:

import Foundation
import RegexBuilder

guard let fullText = try? String(contentsOf: URL(filePath: "/regTest/logs/access.log")) + String(contentsOf: URL(filePath: "/regTest/logs/access.log.1")) else {
    print("Cannot read files!")
    exit(1)
}

let yesterdayRegEx = Regex {
    Capture {
        "26/Apr/2023"
    }
}
let botListRegEx = Regex {
    Capture {
        ChoiceOf {
            "www.apple.com/go/applebot"
            "www.bing.com/bingbot.htm"
            "www.googlebot.com/bot.html"
            "Xing Bot"
        }
    }
}

let dateMatch = fullText.split(separator: "\n")
    .filter{ $0.firstMatch(of: yesterdayRegEx) != nil }
    .filter{ $0.firstMatch(of: botListRegEx) == nil }

print("Found \(dateMatch.count) lines.")

GreatOm · June 27, 2023, 12:57pm

I made another test. The split takes about 50 s and the RegEx 30 s.
So I have two different parts to improve the code.
Any clue how to make it faster?

stevapple · June 27, 2023, 1:09pm

Making the split async could be faster. Using .split(whereSeparator: \.isNewline) should also be faster if you don't need exactly the same behavior.

BTW I don't think Swift as a Unicode-compliant implementation could be literally as performant as non-Unicode ones. One major goal of string processing in Swift is to make it easier and safer, while being fast enough.
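To illustrate the "not exactly the same behavior" caveat, here is a small sketch with made-up sample text. It assumes input that mixes CRLF and LF line endings, which may or may not apply to the logs discussed here.

```swift
// Assumed sample text mixing Windows (CRLF) and Unix (LF) line endings.
let text = "a\r\nb\nc"

// Typed explicitly so the Character-based overload of split is used.
let lineFeed: Character = "\n"

// "\r\n" is a single Character that is not equal to "\n",
// so this split leaves "a\r\nb" in one piece.
let bySeparator = text.split(separator: lineFeed)

// Character.isNewline is true for "\n", "\r\n" and other Unicode line breaks.
let byNewline = text.split(whereSeparator: \.isNewline)

print(bySeparator.count) // 2
print(byNewline.count)   // 3
```

For files that only use "\n", both variants should produce the same lines.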

GreatOm · June 27, 2023, 1:16pm

Wow. That is a big improvement!

let matchedLines = fullText.split(separator: "\n")
Split time: 50.495766043663025 s

let matchedLines = fullText.split(whereSeparator: \.isNewline)
Split time: 21.207733035087585 s

I really didn't expect this simple change to make such a big difference.
Still it is much slower than shell. I wonder if there is a way to get rid of the whole Unicode-String overhead…

stuchlej · June 27, 2023, 1:57pm

Could you somehow share the real data you're testing with? I'm a little bit curious.

What machine do you use? So I have a point of reference.

stevapple · June 27, 2023, 2:00pm

IIRC byte- and scalar-level Regex matching is planned as the next stage of the string processing feature set? They're considered generally low-level and less safe though.
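In the meantime, plain substring checks (not regex matching) can be done on raw bytes with Foundation's Data. This is only a sketch under assumptions: the sample bytes stand in for a file read with Data(contentsOf:), the input is assumed to be UTF-8 with "\n" line endings, and it has not been timed against the logs in this thread.

```swift
import Foundation

// Hypothetical sample data standing in for the contents of a log file.
let bytes = Data("aaa bbb\naaa ccc\naaa ddd\nbbb hhh\n".utf8)
let needle = Data("bbb".utf8)

// Split on the raw newline byte (0x0A); no grapheme breaking is involved.
let byteLines = bytes.split(separator: UInt8(ascii: "\n"))

// Keep lines containing the needle bytes, then decode only the survivors.
let hits = byteLines
    .filter { $0.range(of: needle) != nil }
    .map { String(decoding: $0, as: UTF8.self) }

print(hits)
```

Decoding only the lines that survive the filter keeps String creation to a minimum, at the cost of losing Unicode-aware comparison.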

stevapple · June 27, 2023, 2:27pm

I just did an interesting experiment on speeding up your case, and the result is as interesting as I'd expected:

I used AsyncLineSequence<FileHandle.AsyncBytes> to read in a 2GB file asynchronously by line, instead of reading the string entirely and splitting it later. It performs slightly better than String.init(contentsOfFile:), but...

Time	No Filter	Filtered
Sync	1m9.199s	3m6.004s
Async	0m53.216s	0m38.347s

With a regex filter, the async version runs even faster!

Experiment Code

import Foundation

let path = "/path/to/file.log"
let regex = #/some_word/#

do {
    let lines = FileHandle(forReadingAtPath: path)!.bytes.lines
        .filter { $0.contains(regex) }

    for try await line in lines {
        _ = line
    }
}

do {
    let lines = try String(contentsOfFile: path)
        .split(whereSeparator: \.isNewline)
        .filter { $0.contains(regex) }

    for line in lines {
        _ = line
    }
}

ahti · June 27, 2023, 2:42pm

One more optimization idea (haven't tried it though):

I assume that when you split by newline first and then filter by date, many lines get split apart that never make it through the filter (depending on the log contents, I guess). So it may be faster to find occurrences of the date first and then find the surrounding newlines in some way, which avoids searching for all the newlines that don't matter in the end.
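A rough sketch of that idea, equally untried for performance: it uses Foundation's range(of:range:) and lineRange(for:) on made-up sample lines, and the variable names are only illustrative.

```swift
import Foundation

// Assumed sample lines; the real input would be the full log text.
let logText = """
1.2.3.4 [25/Apr/2023] GET /a
5.6.7.8 [26/Apr/2023] GET /b
9.9.9.9 [26/Apr/2023] GET /c
"""

var dateLines: [Substring] = []
var searchStart = logText.startIndex

// Jump from one occurrence of the date to the next.
while let hit = logText.range(of: "26/Apr", range: searchStart..<logText.endIndex) {
    // Expand the hit to its enclosing line (includes the line terminator).
    let lineRange = logText.lineRange(for: hit)
    let line = logText[lineRange]

    // Strip the trailing newline, if there is one.
    dateLines.append(line.last?.isNewline == true ? line.dropLast() : line)

    // Continue after this line so each line is reported only once.
    searchStart = lineRange.upperBound
}

print(dateLines)
```

Newlines are only located around actual hits, so lines without the date are never split off individually.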

David_Smith · June 27, 2023, 2:49pm

Standard question, just to double check: are you building your test programs with optimization enabled?

GreatOm · June 28, 2023, 5:33am

I analyze Apache logs. Poor man's statistics for a web server.
My test machine is a MacBook Pro with M1 Pro.

GreatOm · June 28, 2023, 5:36am

Nope. Just Xcode run.

GreatOm · June 28, 2023, 6:42am

YES! I modified my code and now I'm down to 7 s.
Here is my latest code

let paths = ["/Users/tom/Desktop/regTest/logs/access.log", "/Users/tom/Desktop/regTest/logs/access.log.1"]
var startDate = Date()
var matchedLines : [String] = []

for aPath in paths {
    let lines = FileHandle(forReadingAtPath: aPath)!.bytes.lines
        .filter { $0.contains("26/Apr") }
        .filter{ $0.firstMatch(of: botListRegEx) == nil }
    for try await line in lines {
        matchedLines.append(String(describing: line))
    }
}

print("Match time: \(abs(startDate.timeIntervalSinceNow)) s")
print("Found \(matchedLines.count) lines.")

Match time: 7.472939968109131 s
Found 1813 lines.

stevapple · June 28, 2023, 7:13am

If you have multiple files to read, you can make it even faster with a TaskGroup.

GreatOm · June 28, 2023, 7:22am

Now 4.8 s.

let paths = ["/regTest/logs/access.log", "/regTest/logs/access.log.1"]
var startDate = Date()
var matchedLines : [String] = []

try await withThrowingTaskGroup(of: Void.self, body: { group in
    for aPath in paths {
        let lines = FileHandle(forReadingAtPath: aPath)!.bytes.lines
            .filter { $0.contains("26/Apr") }
            .filter{ $0.firstMatch(of: botListRegEx) == nil }
        for try await line in lines {
            matchedLines.append(String(describing: line))
        }
    }
})

print("Match time: \(abs(startDate.timeIntervalSinceNow)) s")
print("Found \(matchedLines.count) lines.")

Match time: 4.881006956100464 s
Found 1813 lines.

stevapple · June 28, 2023, 8:01am

This is certainly not how TaskGroup is meant to be used🤔️ The correct way is to use group.addTask on each input file to process them in parallel.

let paths = ["/regTest/logs/access.log", "/regTest/logs/access.log.1"]
var startDate = Date()

let matchedLines = try await withThrowingTaskGroup(of: [String].self, body: { group in
    for aPath in paths {
        group.addTask {
            let lines = FileHandle(forReadingAtPath: aPath)!.bytes.lines
                        .filter { $0.contains("26/Apr") }
                        .filter { !$0.contains(botListRegEx) } // drop bot lines, as in the earlier code
            var matched : [String] = []
            for try await line in lines {
                matched.append(line)
            }
            return matched
        }
    }
    var matchedLines : [String] = []
    for try await partialMatchedLines in group {
        matchedLines.append(contentsOf: partialMatchedLines)
    }
    return matchedLines
})

print("Match time: \(abs(startDate.timeIntervalSinceNow)) s")
print("Found \(matchedLines.count) lines.")

David_Smith · June 28, 2023, 8:06am

Swift relies very heavily on the compiler's optimizer for good performance. Would definitely recommend switching to release mode.

eskimo · June 28, 2023, 8:29am

… and that’s absolutely compatible with Xcode’s Product > Run:

Choose Product > Scheme > Edit Scheme.
Select the Run action on the left.
Switch to the Info tab.
Select Release in the Build Configuration popup.

Just remember to switch it back before you try to use the debugger (-:

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple