Why is Regex slower than using a bunch of String.contains?

wadetregaskis · July 19, 2023, 5:53am

As part of Use RegexBuilder als grep replacement? some folks before me figured out that Regex seems a little slow. I tested this a bit myself and confirmed that not only is Regex alone slower than "pre-screening" candidate lines with String.contains(), but it's massively slower. As in, an order of magnitude slower.

The Async iteration over lines of a file is surprising slow topic has the base example code for context, but the essential parts are:

let botListRegEx = Regex {
    ChoiceOf {
        "www.apple.com/go/applebot"
        "www.bing.com/bingbot.htm"
        "www.googlebot.com/bot.html"
        "Xing Bot"
    }
}

let ipAtStartRegEx = Regex {
    Anchor.startOfLine

    Capture {
        Repeat(1...3) { .digit }
        Repeat(3...3) {
            "."
            Repeat(1...3) { .digit }
        }
    }
}.asciiOnlyDigits()

…and their application:

let matchedLines = sequence(state: 0) { _ in readLine() }
    .filter { $0.contains("26/Apr") }
    .filter { $0.contains("\"GET ") }
    .filter { !$0.contains(botListRegEx) }
    .compactMap { $0.firstMatch(of: ipAtStartRegEx)?.output.0 }

In principle this can be done using a single regex, and one might think that ideally it should be for simplicity. Yet if you do that, like this:

let combinedRegex = Regex {
    ipAtStartRegEx

    ZeroOrMore(.whitespace)

    "26/Apr"

    ZeroOrMore(.anyNonNewline)

    "\"GET "

    NegativeLookahead {
        botListRegEx
    }
}

let matchedLines = sequence(state: 0) { _ in readLine() }
    .compactMap { $0.firstMatch(of: combinedRegex)?.output.0 }

…then the performance sucks. In a ~420 MB test case (test case synthesier included in the base code) the contains-using version takes about seven seconds on my M2 MacBook Air but the unified regex approach takes 81 seconds!

Plus, the contains version can be further optimised to reduce Objective-C bridging overheads, like so:

let targetDate = "26/Apr" as CFString
let GET = "\"GET " as CFString

let matchedLines = sequence(state: 0) { _ in readLine() }
    .filter {
        let str = $0 as CFString
        return (kCFNotFound != CFStringFind(str, targetDate, []).location
                && kCFNotFound != CFStringFind(str, GET, []).location)
    }
    .filter { !$0.contains(botListRegEx) }
    .compactMap { $0.firstMatch(of: ipAtStartRegEx)?.output.0 }

…and that shaves it down further to five seconds. No such optimisation is possible when using Regex, as far as I can tell (Regex under the covers uses the same CoreFoundation functions as contains, among others, and seems to suffer a lot from Objective-C bridging overheads).

So sixteen times faster than Regex alone. Is that… expected? Am I misusing Regex? Are there optimisation that can be made to improve its performance?

It's particularly surprising since in the 'decomposed' case each of those contains / CFStringFind / firstMatch calls is ignorant of the other, re-scanning prefixes of the line that cannot possibly match. The combined Regex has the advantage of only having to search suffixes of the line for a possible match for each component (after the first).

hassila · July 19, 2023, 8:10am

I think this ties into my previous comment on performance of these new super nice abstractions:

The swift-collections package and standard library are careful to define time complexity and also often provide benchmarks - as suggested in the above comment, maybe it'd be beneficial to surface some of the performance expectations and design envelope as part of the evolution process.

Maybe should be broken out to a separate discussion though.

Jon_Shier · July 19, 2023, 3:59pm

Every time I've asked about such performance goals has been met negatively. In general, we've been told implementation concerns beyond initial viability aren't relevant to the evolution process. So I don't anticipate much progress here.

Karl · July 19, 2023, 4:40pm

A lot of this stuff just hasn't been optimised yet.

For example, String.contains(String) is implemented using String.firstMatch(of: some RegexComponent). This turns the searched-for String in to a Regex, containing a Program, which contains a DSLTree, etc. The engine which executes this regex also has a bunch of heap allocations to represent its matching state, etc -- it doesn't know the regex you gave it is a simple string; it could be any regex.

For a general Unicode-Unicode match, that kind of complexity might be necessary.

But when both Strings are ASCII (as in your example), there is an obvious fast path we could use. We could do it; it just hasn't been done yet.

I'm sure PRs are welcome.

scanon · July 19, 2023, 4:45pm

For the purposes of Evolution, we care that an API can be optimized. We don't want to introduce API that cannot ever be made fast. However, it's perfectly OK to introduce new API and then chip away at optimizing the implementation over following language releases.

Jumhyn · July 19, 2023, 4:47pm

Moreover, this means that specific implementation questions are broadly off-topic for the Evolution category but are perfectly germane here in Using Swift or Compiler.

scanon · July 19, 2023, 4:52pm

As a side note: memmem, in general, is not actually fast. On most platforms, it binds a naive O(n²) worst-case search (suitable for small searches). With a needle that is known-ASCII (and with more work, even when it isn't), there are a variety of faster search options available that I would hope we would use eventually, at least when the haystack is large enough.

wadetregaskis · July 19, 2023, 9:05pm

And I'm considering diving into that. I appreciate the "put your time where your complaints are" contract that comes with open-source and community-driven software, and I love that Swift is largely open, from language to libraries. I just don't know if or when I'll actually have the time (or how well I'll do learning all the relevant libraries' code).

That said, String.contains(String) using Regex under the hood seems…

I don't know if that means it should be super easy to optimise String or if it's an omen that the String implementation is terrifying.

hassila · July 19, 2023, 9:06pm

I think that is fine, but what I’m missing is the discussion on expected “final” performance characteristics after such work has been done - I’d definitely find that useful. In my experience such discourse often can expose thoughts on the expected implementation and expected usage - which may help provide better feedback.

wadetregaskis · July 19, 2023, 9:08pm

Is this true? That seems counter to the results I'm seeing here, where String.contains(String) is much faster than explicitly using Regex. In the time profiles I didn't notice any regex stuff under String.contains(String), only Objective-C bridging gunk and CoreFoundation string functions.

Forgive the noob question, but can you point me at the String.contains(String) implementation (in GitHub or wherever)? Oddly I can't seem to find it (I'm totally unfamiliar with the layout of Swift's open source projects).

hassila · July 19, 2023, 9:10pm

Ok, but doesn’t that conflict a bit with (the reasonable stance):

Doesn’t it make it much harder to judge the merits of an api in terms of being able to be optimized if we basically avoid discussing the “how”?

wadetregaskis · July 19, 2023, 9:12pm

Which seems sensible. However, how do you gauge your confidence in that (that a given implementation can be made fast) and what's the required confidence threshold?

All too often we think "oh, this architecture will be fine" and then get a rude awakening when we actually try to optimise it.

scanon · July 19, 2023, 9:16pm

We do have a bunch of people who have quite a lot of experience with these things, so our collective intuition is pretty solid. We're not perfect, but it hasn't led us badly astray yet.

Also, in the case of Regex, we have very deliberately kept the initial ABI surface of the regex engine small by not exposing most of its internal workings. This costs some performance for simple cases in the short term, but also allows us much more freedom in what choices we can make about optimizing long-term, because we're mostly not constrained by ABI compatibility, and is a very deliberate implementation choice.

David_Smith · July 19, 2023, 9:29pm

I'm putting together a project to try this out myself, but one thing that I immediately notice: top level code/global variables can have some overhead in Swift, moving everything into a little struct might actually improve perf.

Jumhyn · July 19, 2023, 9:29pm

I didn’t mean to suggest that any whiff of ‘implementation’ is off-topic in evolution discussions (as Steve notes, understanding whether a proposed API admits a hypothetical efficient implementation is totally in-bounds), I only meant that questions about the current implementation are (likely) off topic for evolution. They’re probably either a PR or a question/suggestion for another category.

David_Smith · July 19, 2023, 9:50pm

Ok, definitely found some things that look suspicious and weren't there last time I wrote a benchmark like this. I'm gonna file a few bug reports based on this, thanks for the test case!

Karl · July 19, 2023, 9:54pm

When I check in Godbolt, String.contains(String) compiles down to a call to sSy17_StringProcessingE8containsySbSSF.

Running that through swift-demangle:

$sSy17_StringProcessingE8containsySbSSF ---> (extension in _StringProcessing):Swift.StringProtocol.contains(Swift.String) -> Swift.Bool

That matches the signature I find here. The implementation simply returns weather firstRange(of: ...) != nil.

Since String conforms to RegexComponent, I expect that the most specific overload of firstRange will be this one. It constructs a Regex using the .regex property, which might be more efficient than whatever Regex initialiser you were comparing with (it uses .init(verbatim:); is that what you were using?), but since it is ultimately a Regex<Substring> like any other, it contains a DSL tree that will be executed by the matching engine.

None of these algorithms are inlinable, which might also be responsible for performance differences against in-module code, even if you call the same functions with the same inputs.

I'm not sure why you're seeing any bridging stuff, though. That's certainly strange.

Ah, of course - that's what I get for hoping things could be easy for once! Oh, well - this seems like a useful operation for the standard library to expose on the UnsafeBuffer (and eventually, safe BufferView) types.

wadetregaskis · July 19, 2023, 10:04pm

I didn't try a struct specifically, but putting everything inside a function made no difference.

wadetregaskis · July 19, 2023, 10:15pm

That could be misleading. Godbolt compiles on Linux, not macOS, and only for x86-64, not AArch64. I woudn't be at all surprised if the implementations are quite different.

To be clear, I'm only looking at this on macOS 13.4.1, with Swift 5.9 (beta).

Instruments says that String.contains(String) actually uses the generic implementation from StringProtocol which calls -[NSString rangeOfString:options:range:locale:] and that calls CFStringFindWithOptionsAndLocale. There's a whole bunch of other gunk in there too, mainly around bridging and ARC, but those are a relatively small amount of the execution time.

I wondered if somehow I was dealing with NSStrings instead of native Swift strings, but I can see from the time profile that readLine (my source of these strings) is using String._fromUTF8Repairing(_:), and I can follow that all the way down to swift_slowAlloc. So I'm pretty sure these are native Swift strings. Which leads to the question of why they're immediately bridging to NSString just to do a 'contains' check…

All this is a bit off topic since the question is why is Regex so slow, not String.contains(String). Although both questions are valid.

scanon · July 19, 2023, 10:20pm

Without spending any time on this at all, I believe you're calling String.contains(_:String) from Foundation, rather than one of the _StringProcessing methods, which would explain all the bridging and NSWhatnot that you're seeing.