Slow regex performance

Thank you so much for writing this benchmark! I've opened a PR to add it to our benchmark suite and we'll be using it to motivate more of our performance work. Let me know how you'd like attribution.

In this benchmark's case, the slow perf comes down to some areas we're currently looking at improving:

  • More time is spent on retain/release than on actually matching content
    • Upcoming adoption of ~Copyable and ~Escapable
  • More time is spent copying our COW arrays than on actual matching
    • Upcoming rework using custom data structures
    • Situational: dropping captures that are never read from (see the sketch after this list)
  • Lots of time is spent drilling through String's higher-level representation
    • Upcoming adoption of UTF8Span for direct UTF-8 processing
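To illustrate the captures point concretely (with a made-up pattern, and as a manual workaround rather than the planned engine optimization): a capture that's never read still costs per-attempt bookkeeping, which non-capturing `(?:...)` groups avoid today.

```swift
// Hypothetical pattern and input, just to illustrate the idea: both
// regexes accept the same strings, but the first must record two capture
// ranges on every attempt, while the second uses non-capturing (?:...)
// groups and records none.
let capturing = try! Regex("([a-z]+)://([^ ]+)")
let nonCapturing = try! Regex("(?:[a-z]+)://(?:[^ ]+)")

let line = "docs at https://example.com/regex for details"
print(line.contains(capturing))     // true
print(line.contains(nonCapturing))  // true, with less bookkeeping
```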

Beyond that, we'd like to add more targeted optimizations, such as quickly finding runs of literal content (sketched below). Unlike the improvements listed above, these might only benefit specific regexes (or particular ways of writing them) over specific kinds of input. There's a risk, then, that a benchmark sees dramatic improvements that don't translate to the workload it was derived from.
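To make the literal-run idea concrete, here's a rough sketch of the general technique, using a placeholder pattern and a plain substring scan (this isn't our actual implementation):

```swift
import Foundation  // for String.contains(_:) substring search

let urlRegex = #/[a-z]+://[^ ]+/#   // placeholder pattern for illustration

// Prefilter sketch: any match of this pattern must contain the literal
// "://", so a fast substring scan can reject most lines before the full
// engine is ever invoked.
func firstURL(in line: String) -> Substring? {
    guard line.contains("://") else { return nil }   // cheap literal scan
    return line.firstMatch(of: urlRegex)?.output     // full engine only on surviving lines
}
```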

For example, most of your input data are lines that are successfully matched in their entirety. Other input data could include long lines with content that starts off looking like a URL, but ultimately isn't matched. If you're able to share more regexes or more kinds of input, we can make sure that optimizations are applicable to real workloads.
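For instance, assuming a URL-ish placeholder pattern like the one above, one cheap way to construct such a near-miss line:

```swift
// A long alphabetic run satisfies [a-z]+ at every starting position, but
// "://" never appears, so the engine scans and rejects the entire line.
let nearMiss = String(repeating: "http", count: 25_000)   // ~100 KB, no match

// The benchmark's happy path, by contrast, matches lines in full:
let fullMatch = "https://example.com/some/long/path"
```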

Finally, we've made some significant performance improvements that are (unfortunately) not yet in a toolchain but will be soon. On your benchmark, I'm seeing a 1.56x improvement from those changes for grapheme-cluster semantics and a 1.53x improvement for scalar semantics.
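(For anyone who wants to compare the two semantic modes on their own benchmark: grapheme-cluster semantics is the default, and `matchingSemantics(_:)` opts a regex into scalar semantics. The pattern below is a placeholder.)

```swift
let pattern = #/[a-z]+://[^ ]+/#    // placeholder; grapheme-cluster semantics by default
let scalarPattern = pattern.matchingSemantics(.unicodeScalar)

let line = "see https://example.com/path for details"
print(line.firstMatch(of: pattern)?.output ?? "")        // https://example.com/path
print(line.firstMatch(of: scalarPattern)?.output ?? "")  // same match under scalar semantics
```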
