import RegexBuilder
let log = """
aaa bbb
aaa ccc
aaa ddd
aaa fff
aaa ggg
bbb hhh
aaa iii
"""
let logRegEx = Regex {
    Capture {
        ChoiceOf {
            "bbb"
            "ccc"
        }
    }
}
let m = log.matches(of: logRegEx)
for aMatch in m {
    print(aMatch.output)
}
which gives
("bbb", "bbb")
("ccc", "ccc")
("bbb", "bbb")
I looked at the documentation and searched the web, but I have no clue how to get the lines of the original text that match.
Any idea how to proceed?
As a bonus, I need a reverse match so that I get all lines that do not match.
My goal is to filter large files by dropping unneeded lines.
The code you have written finds all matches of your regex in the whole string; it is not line-based. aMatch.output is a tuple of two substrings in your example: the first element is the whole match, and the second is the part matched by your capture group. Given your regex, the two are always equal, which is why each tuple in your output repeats the same value.
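To get whole lines rather than individual matches, one option is to split the text into lines first and test each line with contains(_:); negating the test gives the reverse match. This is a minimal sketch, assuming Swift 5.7 or later. The names wanted, matching and nonMatching are made up for illustration, and the Capture is dropped because only a yes/no answer per line is needed.

```swift
import RegexBuilder

let log = """
aaa bbb
aaa ccc
aaa ddd
aaa fff
aaa ggg
bbb hhh
aaa iii
"""

// Same alternatives as in the question, without the capture group.
let wanted = Regex {
    ChoiceOf {
        "bbb"
        "ccc"
    }
}

// Split once, then keep or drop each line depending on whether it contains a match.
let lines = log.split(whereSeparator: \.isNewline)
let matching = lines.filter { $0.contains(wanted) }
let nonMatching = lines.filter { !$0.contains(wanted) }

// matching holds:    aaa bbb, aaa ccc, bbb hhh
// nonMatching holds: aaa ddd, aaa fff, aaa ggg, aaa iii
for line in matching {
    print(line)
}
```

Both results are arrays of Substring that share storage with log, so no line is copied until you convert it to a String.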
Now I changed my code to compare the performance of Swift Regex against shell calls. I'm disappointed by the speed: 80 s vs. 1 s!
Any clue how to speed this up?
Here is an example with about 100 MB of real log data.
In the shell I get the following result:
% time (cd /regTest/logs;cat access.log access.log.1 | grep '26\/Apr' | egrep -v '(www.apple.com/go/applebot | www.bing.com/bingbot.htm | www.googlebot.com/bot.html | Xing Bot)' | awk '/GET/ {print $1}' | sort -n | uniq 1>/dev/null)
1,09s user 0,05s system 105% cpu 1,081 total
My Swift test takes much more time:
% time ./regTest
Found 1813 lines.
./regTest 82,54s user 0,26s system 99% cpu 1:22,83 total
Making the split async could be faster. Using .split(whereSeparator: \.isNewline) should also be faster if you don't need exactly the same behavior.
BTW, I don't think Swift, as a Unicode-compliant implementation, can be quite as fast as non-Unicode ones. One major goal of string processing in Swift is to make it easier and safer while being fast enough.
let matchedLines = fullText.split(separator: "\n")
Split time: 50.495766043663025 s
let matchedLines = fullText.split(whereSeparator: \.isNewline)
Split time: 21.207733035087585 s
I did not expect this simple change to make such a big difference.
Still, it is much slower than the shell. I wonder if there is a way to get rid of the whole Unicode String overhead…
IIRC, byte- and scalar-level Regex matching is planned as the next stage of the string processing feature set. Those modes are generally considered low-level and less safe, though.
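Until something like that exists, one way to sidestep grapheme-level processing for a fixed search string is to work on the UTF-8 view directly. The following is only a sketch of that idea, not a Regex replacement: the helper containsBytes and the sample text are hypothetical, it uses a naive byte search, and no timing was taken, so whether it actually helps on real logs is an open assumption.

```swift
// Hypothetical helper: naive search for a byte sequence inside a byte collection.
func containsBytes<C: Collection>(_ haystack: C, _ needle: [UInt8]) -> Bool
where C.Element == UInt8 {
    guard !needle.isEmpty else { return true }
    var start = haystack.startIndex
    while start != haystack.endIndex {
        if haystack[start...].starts(with: needle) { return true }
        haystack.formIndex(after: &start)
    }
    return false
}

// Made-up sample data standing in for the real log file.
let text = "1.2.3.4 [26/Apr] GET /\n5.6.7.8 [27/Apr] GET /\n9.9.9.9 [26/Apr] POST /x"
let needle = Array("26/Apr".utf8)

// Split on the raw newline byte (0x0A) and compare bytes, not Characters.
let byteLines = text.utf8.split(separator: UInt8(ascii: "\n"))
let hits = byteLines
    .filter { containsBytes($0, needle) }
    .map { String(Substring($0)) }   // back to String only for the lines that are kept

for line in hits {
    print(line)
}
```

This only treats "\n" as a line break and compares bytes exactly, so it gives up the Unicode-aware behavior that \.isNewline and String comparison provide.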
I just did an interesting experiment on speeding up your case, and the result is as interesting as I'd expected:
I used AsyncLineSequence<FileHandle.AsyncBytes> to read a 2 GB file asynchronously, line by line, instead of reading the whole string and splitting it afterwards. Without a filter it performs only slightly better than String.init(contentsOfFile:), but...
Time     No Filter    Filtered
Sync     1m9.199s     3m6.004s
Async    0m53.216s    0m38.347s
With a regex filter, the async version runs even faster!
Experiment code:
import Foundation
let path = "/path/to/file.log"
let regex = #/some_word/#
// Async: stream the file line by line and filter while reading.
do {
    let lines = FileHandle(forReadingAtPath: path)!.bytes.lines
        .filter { $0.contains(regex) }
    for try await line in lines {
        _ = line
    }
}
// Sync: read the whole file into a String, then split and filter.
do {
    let lines = try String(contentsOfFile: path)
        .split(whereSeparator: \.isNewline)
        .filter { $0.contains(regex) }
    for line in lines {
        _ = line
    }
}
One more optimization idea (haven't tried it though):
I assume that when you split by newline first and then filter by date, many of the lines that get split off never make it through the filter (depending on the log contents, I guess). So it might be faster to find the occurrences of the date first and then locate the surrounding newlines, which avoids searching for all the newlines that don't matter in the end.
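A rough sketch of that idea, equally untested for performance: search for the needle first, then expand each hit to its enclosing line. The function name lines(in:containing:) and the sample text are made up for illustration, and the code assumes lines are separated by a plain "\n" and that firstRange(of:) from Swift 5.7 is available.

```swift
// Hypothetical helper: find each occurrence of `needle`, then widen it to the whole line.
func lines(in text: String, containing needle: String) -> [Substring] {
    var result: [Substring] = []
    var searchStart = text.startIndex
    while let hit = text[searchStart...].firstRange(of: needle) {
        // Walk back to the previous newline (or the start of the text).
        let lineStart = text[..<hit.lowerBound].lastIndex(of: "\n")
            .map { text.index(after: $0) } ?? text.startIndex
        // Walk forward to the next newline (or the end of the text).
        let lineEnd = text[hit.upperBound...].firstIndex(of: "\n") ?? text.endIndex
        result.append(text[lineStart..<lineEnd])
        // Continue after this line so a line with several hits is reported once.
        guard lineEnd < text.endIndex else { break }
        searchStart = text.index(after: lineEnd)
    }
    return result
}

// Made-up sample data.
let sample = "a 25/Apr\nb 26/Apr x\nc\n26/Apr and 26/Apr\nend"
let found = lines(in: sample, containing: "26/Apr")
for line in found {
    print(line)
}
```

Newlines are only searched for around actual hits, so the benefit should grow as the share of matching lines shrinks; with many matches it may be no better than splitting everything.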