Slow regex performance

sspringer · November 2, 2024, 1:35pm

The application of regex expressions in Swift seems to be quite slow e.g. in comparison to Java, what is the issue here?

With a small test for Swift and the according small test for Java the numbers are:

Java:

... (1200000 matches).
0.508753042 seconds

Swift:

With characters:
... (1200000 matches).
17.111058916 seconds

With codepoints:
... (1200000 matches).
14.411686875000001 seconds

UPDATE: The last test was with Swift 6.0.0 on macOS 14.6.1 and OpenJDK Zulu 21, a test with Swift 6.0.2 on Windows 11 (on a slower machine) and OpenJDK Temurin 21 gave:

Java:

... (1200000 matches).
0.676467799 seconds

Swift:

With characters:
... (1200000 matches).
8.6005614 seconds

With codepoints:
... (1200000 matches).
9.218244 seconds

So it seems a little bit better with Swift 6.0.2 (?) but still far away from the Java performance.

Michael_Ilseman · November 2, 2024, 5:11pm

Thank you so much for writing this benchmark! I've opened a PR to add it to our benchmark suite and we'll be using it to motivate more of our performance work. Let me know how you'd like attribution.

In this benchmark's case, the slow perf comes down to some areas we're currently looking at improving:

More time is spent on retain/release than actually matching content
- Upcoming adoption of ~Copyable and ~Escapable
More time is spent in copying our COW arrays than in actual matching
- Upcoming rework using custom data structures
- Situational: dropping captures that are never read from
Lots of time drilling through String's higher-level representation
- Upcoming adoption of UTF8Span for direct UTF-8 processing

Also, we'd like to add more optimizations, such as more quickly finding runs of literal content. Unlike the improvements listed above, these might only benefit specific regexes (or the way they're written) over specific kinds of input. Thus, there is a risk that a benchmark might see dramatic improvements that don't translate to the workload the benchmark was derived from.

For example, most of your input data are lines that are successfully matched in their entirety. Other input data could include long lines with content that starts off looking like a URL, but ultimately isn't matched. If you're able to share more regexes or more kinds of input, we can make sure that optimizations are applicable to real workloads.

Finally, we've made some significant performance improvements that are (unfortunately) not yet in a toolchain but will be soon. On your benchmark, I'm seeing a 1.56x improvement from those changes for grapheme-cluster semantics and a 1.53x improvement for scalar semantics.

sspringer · November 2, 2024, 6:50pm

Thank you for your answer and your efforts!

The regex in my little benchmark was only one regex example I picked up somewhere quickly (and changed it a little bit) to more drastically show the issue, as in our applications we mostly use many simpler regex expressions. There might be better examples, and maybe more would be needed. (No attribution necessary.)

I think the comparison with e.g. Java versions is important for knowing where one might want to aim (although the Java regex engine has some flaws, some expressions are not interpreted as expected). So I added cross-links to the other language example in my repos.

P.S.: The simple replacement of Strings / Characters seems too have become much better in Swift 6.0.2 than it was in Swift 6.0.0 see these tests (although I compared the older Swift on macOS with the newer Swift on Windows, but I think both use the new Foundation implementation so this comparison should be valid).