Why is Regex slower than using a bunch of String.contains?

It looks like the new rewritten Foundation should* avoid bridging in that method :slight_smile:

*one possible caveat on this I'm checking on

(edit: ok, checked on it, and indeed I can do "Hello World".contains("Hello") with zero objc_msgSend calls)

2 Likes

It seems like it's the default implementation for StringProtocol, which is pure Swift, isn't it?

In any case, how do I get it to use _StringProcessing methods of which you speak?

Nope, most of those are Foundation

Oh, I see, if you merely import Foundation then it changes the implementation of strings.

Unfortunately, the "pure" Swift version is an order of magnitude slower than the Foundation version, for even a trivial example like "foobar!".contains("26/Apr"). The 'pure' version calls a bunch of stuff, e.g.:

StringProtocol.contains(_:)
  BidirectionalCollection<>.firstRange<A>(of:)
    TwoWaySearcher.init(pattern:)
      BidirectionalCollection<>._ends<A>(with:)
        BidirectionalCollectionConsumer.consumingBack(_:)
          FixedPatternConsumer<>.consumingBack(_:in:)
            etc

Seems a bit overkill for such a trivial operation as scanning a string for an exact subset match.

Plus, how realistic is to not have Foundation imported anyway? I imagine there's few non-trivial Swift programs which don't use Foundation.

2 Likes

Yep :sweat_smile:

I don't see this, not in your test at least... the async/await version takes 23 seconds regardless of whether I'm using String.contains or String.contains2:~

Wrong code hidden, see below
// in a file without `import Foundation`

extension String {
    func contains2(_ v: String) -> Bool {
        contains(v)
    }
    func contains2(_ v: Regex<Substring>) -> Bool {
        contains(v)
    }
}

It is possible even for macOS/iOS apps – by bringing small key fragments to swift files that don't have import Foundation – but not many people would bother doing that.

Oops. That's factually wrong. I assumed that a separate file with no import Foundation in it would choose non-foundation implementation for, say, String.contains. That's not the case if import Foundation being used in other files of the same project.

Correct answer then – no, it's not realistic for iOS apps and non server macOS apps to not use Foundation, the situation for server apps is different.

1 Like

While implementing my own string contains method I stumbled on a surprising inconsistency in the hasPrefix and hasSuffix methods:

// no imports

"\r\n".hasPrefix("\r")      // true
"\r\n".hasPrefix("\r"[...]) // false

"\r\n".hasSuffix("\n")      // true
"\r\n".hasSuffix("\n"[...]) // false

2 Likes

This is a bug—I had thought it’d been filed and addressed, but I’m not seeing it on GitHub. All of these should return false because \r\n is a single extended grapheme cluster.

1 Like

Same result with import Foundation.

Good find, this is indeed a bug! I've opened a GitHub issue documenting the problem here: [StdLib] Incorrect results from `String.hasPrefix` and `hasSuffix` · Issue #67427 · apple/swift · GitHub

2 Likes

Regex performance is very much a work in progress and there's on-going work to improve it. There's a lot of constant costs that need to be trimmed, architectures to refactor (e.g. captures), as well as optimizations.

I think your example demonstrates the needs for some optimizations that could signficantly improve performance. If anyone's interested, the repo contains benchmarks and this would make a great addition. That being said, optimizations can still be drowned out by constant costs or architectural costs mentioned above, which are being actively worked on.

However, there are limits to what can be accomplished through optimizations. Regex seeks to remain faithful to the programmer's intent when they wrote the algorithm, and you are comparing two different search algorithms.

In the regex-only algorithm, you are parsing and capturing IP addresses prior to checking whether the line contains "26/Apr". In the algorithm with filter, you are first excluding lines that do not contain "26/Apr" in them before parsing IP addresses. These will behave differently depending on the nature of the input. Inputs that rarely start with IP addresses and very often contain "26/Apr" will respond differently to the two different algorithms than inputs that often start with IP addresses and rarely contain "26/Apr".

You can represent the filtering algorithm as a regex using lookahead assertions, but I haven't tried this out for your example. Caveats about regex performance being a work-in-progress still apply.

As for contains which uses the TwoWaySearcher, this is calling into very generic code that knows nothing about strings. As such, it is loading and comparing Characters (i.e. extended grapheme clusters), which is very inefficient. I opened a bug for adding specializations for this generic algorithm.

9 Likes

Yep. So it's not strictly 'fair' in that sense, but I'm also not certain which algorithm it actually biases against. Immediately rejecting literally half the lines (in the synthetic test logs) because the first byte isn't an ASCII digit seems to me like it should be super fast (in principle) and actually give the pure-Regex approach a big leg up.

Perhaps it'd be wise to bias towards real-world code - not necessarily optimal code - for prioritising optimisations. I think the pure-Regex approach as written is exactly how most people would do it, no?

In the spirit of that, note though that the synthetic log files aren't actually all that faithful to true Apache logs, that @GreatOm was actually working with. The Apache logs do start with the IP address, but the rest of the string is a bit different (and generally much longer, in real log files). It might make sense to improve the synthetic log file generator to better mimic real logs.

Also, real-world logs presumably have IPv6 addresses in them too. I omitted that from my testing purely for my own convenience.

1 Like