With characters:
... (1200000 matches).
17.111058916 seconds
With codepoints:
... (1200000 matches).
14.411686875000001 seconds
UPDATE: The last test was with Swift 6.0.0 on macOS 14.6.1 and OpenJDK Zulu 21, a test with Swift 6.0.2 on Windows 11 (on a slower machine) and OpenJDK Temurin 21 gave:
Java:
... (1200000 matches).
0.676467799 seconds
Swift:
With characters:
... (1200000 matches).
8.6005614 seconds
With codepoints:
... (1200000 matches).
9.218244 seconds
So it seems a little bit better with Swift 6.0.2 (?) but still far away from the Java performance.
Thank you so much for writing this benchmark! I've opened a PR to add it to our benchmark suite and we'll be using it to motivate more of our performance work. Let me know how you'd like attribution.
In this benchmark's case, the slow perf comes down to some areas we're currently looking at improving:
More time is spent on retain/release than actually matching content
Upcoming adoption of ~Copyable and ~Escapable
More time is spent in copying our COW arrays than in actual matching
Upcoming rework using custom data structures
Situational: dropping captures that are never read from
Lots of time drilling through String's higher-level representation
Upcoming adoption of UTF8Span for direct UTF-8 processing
Also, we'd like to add more optimizations, such as more quickly finding runs of literal content. Unlike the improvements listed above, these might only benefit specific regexes (or the way they're written) over specific kinds of input. Thus, there is a risk that a benchmark might see dramatic improvements that don't translate to the workload the benchmark was derived from.
For example, most of your input data are lines that are successfully matched in their entirety. Other input data could include long lines with content that starts off looking like a URL, but ultimately isn't matched. If you're able to share more regexes or more kinds of input, we can make sure that optimizations are applicable to real workloads.
Finally, we've made some significant performance improvements that are (unfortunately) not yet in a toolchain but will be soon. On your benchmark, I'm seeing a 1.56x improvement from those changes for grapheme-cluster semantics and a 1.53x improvement for scalar semantics.
The regex in my little benchmark was only one regex example I picked up somewhere quickly (and changed it a little bit) to more drastically show the issue, as in our applications we mostly use many simpler regex expressions. There might be better examples, and maybe more would be needed. (No attribution necessary.)
I think the comparison with e.g. Java versions is important for knowing where one might want to aim (although the Java regex engine has some flaws, some expressions are not interpreted as expected). So I added cross-links to the other language example in my repos.
P.S.: The simple replacement of Strings / Characters seems too have become much better in Swift 6.0.2 than it was in Swift 6.0.0 see these tests (although I compared the older Swift on macOS with the newer Swift on Windows, but I think both use the new Foundation implementation so this comparison should be valid).