Benchmarking RegEx Builder

I have been working on my own String Pattern Matching Library (SPML) since the Swift 2 / 3 days (for a book project I've been working on for as long, as well). When Swift 5.7 introduced RegEx Builder, I was happy to abandon that effort to use the more idiomatic RegEx Builder DSL for the book.

However, there were a few things about the SPML that I really liked, such as the ability to handle context free languages (as well as regular languages) and the easy ability to insert semantic actions (code) within the grammars. So recently I dusted off the SPML and started playing around with it again.

One of my first questions concerning the performance of RegEx Builder versus SPML. So I created a simple little benchmark to compare the two. Here's the RegEx Builder expression I used:

         let logLine = Regex
         {
            Repeat(copies...copies)
            {
                 Anchor.startOfLine

                 "["
                    Repeat(.digit, count: 4)
                    "-"
                    Repeat(.digit, count: 2)
                    "-"
                    Repeat(.digit, count: 2)
                    One(.whitespace)
                    Repeat(.digit, count: 2)
                    ":"
                    Repeat(.digit, count: 2)
                    ":"
                    Repeat(.digit, count: 2)
                    "."
                    Repeat(.digit, count: 3)
                 "] ["
                  ChoiceOf { "TRACE"; "DEBUG"; "INFO"; "WARN"; "ERROR" }
                 "] ["
                    OneOrMore(.word)
                    ZeroOrMore { "."; OneOrMore(.word) }
                 "] ("
                     OneOrMore(.digit)
                 ")"
                 OneOrMore(.whitespace)

                 ZeroOrMore(.any, .reluctant)
                 "\n"
                 Anchor.endOfLine
            }
         }

Executing this 150,000 times took around 1,200 seconds. That is, to put it nicely, a lot of time. (note that this did not include the time to build the RegEx value; I excluded that from the benchmark because the RegEx initializer is notoriously slow).

I fed this benchmark eight lines of text for each iteration (that is, "copies" had the value 8) and supplied eight lines of text that this expression would successfully match. Here is a typical line from the corpus ("body of text") string:

[2025-09-01 12:34:56.789] [TRACE] [net.http] (1442) anna.smith+test@example.com | https://example.com | {"id": 42, "name": "Widget", "price": 12.50, "tags": ["swift","regex","benchmark"]} | let foo = 123; let bar = foo + 456 // sum | This is is a test with a repeated word word for backreference. | 0.0.0.0

There are eight of these lines in the corpus string, based on the parameters I used when running the benchmark code.

I'm wondering if I haven't done something stupid in this regular expression that is causing backtracking to kick in and create a massive slowdown. To put things in perspective, the following SPML code executes in 6.5 seconds:

            let result =
               p.match( corpus )
               {
                  p.nPat_fast(
                     n:copies,
                     block:{ tail in
                  
                     // Date/Time:
                     
                     p.c( "[" ){
                     p.nADigit( n:4 ){
                     p.c( "-" ){
                     p.nADigit( n:2 ){
                     p.c( "-" ){
                     p.nADigit( n:2 ){
                     p.ws(){
                     p.nADigit( n:2 ){
                     p.c( ":" ){
                     p.nADigit( n:2 ){
                     p.c( ":" ){
                     p.nADigit( n:2 ){
                     p.c( "." ){
                     p.nADigit( n:3 ){
                     p.s( "] [" ){
                     
                     // Level:
                     
                     p.choiceOf(
                       [
                          {tail in p.s("TRACE", tail)},
                          {tail in p.s("DEBUG", tail)},
                          {tail in p.s("INFO", tail)},
                          {tail in p.s("WARN", tail)},
                          {tail in p.s("ERROR", tail)}
                       ]
                     )
                     {
                        p.s( "] [" ){

                        // Module:
                        
                        p.oneOrMoreWordChar(){
                        p.zeroOrMorePat(
                           block:
                              { tail in
                                 p.s(".")
                                 {
                                    p.oneOrMoreWordChar( tail )
                                 }
                              }
                        ){
                        p.s("] (" ){
                        p.nADigit( n:4 ){
                        p.c( ")" ){
                        p.oneOrMoreWS(){
                        p.thruChar( "\n" ){
                        tail()
                        }}}}}}}}}}}}}}}}}}}}}}}}
                     } // block
                  ) // nPat
               } // Match

While my ego would love to believe that my astounding coding skills produced something that ran over two orders of magnitude faster than RegEx Builder, common sense suggests that there is something else wrong here.

FWIW, here's the full RegEx Builder benchmark program (the SPML variant is roughly the same except for the actual regular expression code):

// BenchRegex.swift

import Foundation
import RegexBuilder

// MARK: - Utilities

struct BenchResult
{
    let name      : String
    let iterations: Int
    let duration  : Duration
    let matches   : Int
}

@inline(__always)
func time<T>(label name:String, iterations: Int = 25, _ body: () -> T) -> (result: T, elapsed: Duration)
{
    let clock = ContinuousClock()
    let start = clock.now
    let result = body()
    let end   = clock.now
    return (result, end - start)
}

@inline(__always)
func runTimed(label name:String, iterations: Int = 25, _ body: () -> Int) -> BenchResult
{
    let (_, elapsed) = time(label: name, iterations: iterations)
    {
        var total = 0
        for _ in 0..<iterations
        {
            total &+= body()
        }
        return total
    }
    // One extra pass to report “per-run” match count:
    let totalMatches = body()
    return BenchResult(name: name, iterations: iterations, duration: elapsed, matches: totalMatches)
}

@inline(__always)
func fmt(_ d: Duration) -> String
{
    let c = d.components
    let seconds = Double(c.seconds) + Double(c.attoseconds) / 1e18
    return String(format: "%.2f ms", seconds * 1000.0)
}

// MARK: - Corpus Generator

func makeCorpus( _ copies:Int ) -> String
{
   let levels  = ["TRACE","DEBUG","INFO","WARN","ERROR"]
   let modules = ["net.http","auth.core","db.pg","ui.kit","regex.engine","scheduler"]

   let emails  = [
     "anna.smith+test@example.com",
     "john-doe@example.co.uk",
     "δοκιμή+αλφα@παράδειγμα.gr",
     "bad..dots@example.com",
     "simple@example.org",
     "x@y.z"
   ]

   let urls = [
     "https://example.com",
     "http://localhost:8080/path/to/resource?x=1&y=2#frag",
     "ftp://192.168.0.1/pub/file.tar.gz",
     "https://sub.domain.example.io:443/a/b/c?q=swift%20regex",
     "ws://[::1]/socket"
   ]

   let jsons = [
     #"{"id": 42, "name": "Widget", "price": 12.50, "tags": ["swift","regex","benchmark"]}"#,
     #"{"ok": true, "msg": "hello \"world\"", "pi": 3.14159}"#,
     #"{"bad": [1,2,,3], "x": }"#
   ]

   let code = [
     "let foo = 123; let bar = foo + 456 // sum",
     "func greet(_ name: String) -> String { return \"Hello, \\(name)!\" }",
     "struct Point { var x: Double; var y: Double }",
     "if a == b && c != d || e <= f { print(\"ok\") }"
   ]

   let text = [
     "This is is a test with a repeated word word for backreference.",
     "Sphinx of black quartz, judge my vow. 12345 abcde Áéîøü",
     "P@ssw0rdPolicyMustHaveMix and length >= 12 with symbols!"
   ]

   let dates = [
     "2025-09-01 12:34:56.789",
     "1970-01-01 00:00:00.000",
     "2000-02-29 23:59:59.123"
   ]

   let ipv4s = [
     "0.0.0.0","10.0.0.1","127.0.0.1","192.168.100.200","256.12.12.12","12.34.56.078"
   ]

   var parts: [String] = []

   var lvlNdx    = 0
   var modNdx    = 0
   var dateNdx   = 0
   var emailNdx  = 0
   var urlNdx    = 0
   var jsonNdx   = 0
   var codeNdx   = 0
   var textNdx   = 0
   var ipv4sNdx   = 0

   for _ in 0..<copies
   {
      let lvl = levels[lvlNdx]
      lvlNdx = (lvlNdx + 1) % levels.count

      let mod = modules[modNdx]
      modNdx  = (modNdx + 1) % modules.count

      let dt  = dates[dateNdx]
      dateNdx  = (dateNdx + 1) % dates.count

      let msg = [
         emails[emailNdx],
         urls[urlNdx],
         jsons[jsonNdx],
         code[codeNdx],
         text[textNdx],
         ipv4s[ipv4sNdx]
      ].joined(separator: " | ")
      emailNdx  = (emailNdx + 1) % emails.count
      urlNdx    = (urlNdx + 1) % urls.count
      jsonNdx   = (jsonNdx + 1) % jsons.count
      codeNdx   = (codeNdx + 1) % code.count
      textNdx   = (textNdx + 1) % text.count
      ipv4sNdx  = (ipv4sNdx + 1) % ipv4s.count

      parts.append("[\(dt)] [\(lvl)] [\(mod)] (\(Int.random(in: 1000...9999))) \(msg)")
   }

   return parts.joined(separator: "\n") + "\n"
}



// MARK: - Benchmark Runner

    func main()
    {
         let copies     = 8
         let iterations = 150000       // bump for longer timing windows

         let corpus  = makeCorpus( copies )

         print("Corpus size: \(corpus.utf8.count) bytes")
         print("Iterations per regex: \(iterations)\n")

         var results: [BenchResult] = []
        
         let logLine = Regex
         {
            Repeat(copies...copies)
            {


                 Anchor.startOfLine

                 "["
                    Repeat(.digit, count: 4)
                    "-"
                    Repeat(.digit, count: 2)
                    "-"
                    Repeat(.digit, count: 2)
                    One(.whitespace)
                    Repeat(.digit, count: 2)
                    ":"
                    Repeat(.digit, count: 2)
                    ":"
                    Repeat(.digit, count: 2)
                    "."
                    Repeat(.digit, count: 3)
                 "] ["
                  ChoiceOf { "TRACE"; "DEBUG"; "INFO"; "WARN"; "ERROR" }
                 "] ["
                    OneOrMore(.word)
                    ZeroOrMore { "."; OneOrMore(.word) }
                 "] ("
                     OneOrMore(.digit)
                 ")"
                 OneOrMore(.whitespace)

                 ZeroOrMore(.any, .reluctant)
                 "\n"
                 Anchor.endOfLine
            }
         }

         let r = runTimed(label: "log", iterations: iterations)
         {
             var count = 0
             for _ in corpus.matches(of: logLine)
             {
                 count &+= 1
             }
             return count
         }
         results.append®
         print("log",  "matches: \(r.matches)",   "time: \(r.duration)" )

         let totalSeconds = results.reduce(0.0)
         { acc, r in
            let c = r.duration.components
            return acc + Double(c.seconds) + Double(c.attoseconds) / 1e18
         }
         //let sTemp = String(format: "%.3f s", totalSeconds)
         print("\nTotal wall time (sum of individual regex runs): \(totalSeconds)")
    }
main()

2 Likes

Did you test a debug or release build? It’s typical that debug builds can be substantially slower due to lack of optimizations. If you’re using a release build already, can you try capturing a trace with Time Profiler in Instruments to understand where the time is being spent? (You don’t have to record the full 2 hours, just a representative several seconds)

Of course I did a release build.
FWIW, the debug build of SPML runs in 54 seconds, so even if the RegEx Builder library was not compiled in optimized form, there is still considerable difference between their run times.

Just hoping there was something wrong with the regular expression I passed to the RegEx Build DSL.

I'm finishing up testing the SPML and throwing together minimal documentation for it; I should have it github'd before too much longer.

FWIW when i ran the shared snippet it did not appear to produce a match against the benchmark code. the following updated regex implementation did match successfully and took ~6.3s for the 150,000 iteration benchmark:

         let logLine = Regex
         {
            Repeat(copies...copies)
            {
                 Anchor.startOfLine

                 "["
                    Repeat(.digit, count: 4)
                    "-"
                    Repeat(.digit, count: 2)
                    "-"
                    Repeat(.digit, count: 2)
                    One(.whitespace)
                    Repeat(.digit, count: 2)
                    ":"
                    Repeat(.digit, count: 2)
                    ":"
                    Repeat(.digit, count: 2)
                    "."
                    Repeat(.digit, count: 3)
                 "] ["
                  ChoiceOf { "TRACE"; "DEBUG"; "INFO"; "WARN"; "ERROR" }
                 "] ["
                    OneOrMore(.word)
                    ZeroOrMore { "."; OneOrMore(.word) }
                 "] ("
                     OneOrMore(.digit)
                 ")"
                 OneOrMore(.whitespace)

                //  ZeroOrMore(.any, .reluctant)
                ZeroOrMore(CharacterClass.anyOf("\n").inverted)
                 "\n"
                //  Anchor.endOfLine
            }
         }

i'm not familiar enough with the regex DSL implementation to have a sense of why the performance was so much worse under the other configuration, nor why the regex as originally written did not produce a match. i thought that you maybe needed to set anchorsMatchLineEndings() on the regex, but even with that enabled i couldn't figure out how to get it to match without using explicit newline characters & character classes rather than the Anchor.endOfLine values.

Thanks for figuring this part out. As I mentioned, it was hard to believe 1200 seconds was correct. There must be some crazy backtracking going on here (O(2^n) time for degenerate cases with backtracking, though I have a hard time believing this was a degenerate case).

FWIW, I got 9.847350108 seconds on my (latest model) x86 MBP. The SPML ran in 7.123332388 seconds on the same machine. This is more in line with what I would expect.

Thanks for this report, @hryde! I've opened an issue to track this as we work on performance. Failing faster is definitely one component of speeding up the overall performance of the engine.

1 Like

I don't think that regex can produce a positive match after backtracking. In which case, it's probably better to turn it off: possessive | Apple Developer Documentation

1 Like

When I ran the "benchmark" with one copy (instead of 8), it ran (at least) an order of magnitude faster (and probably closer to 256 times faster, 2^8, I don't remember the timings). This suggests to me that, for whatever reason, I've tapped into a degenerate pattern case for the regular expression parser. With backtracking, recognition is O(2^n), worst case. However, this worst case only happens in rare, degenerate cases. Like I said, I must have somehow tapped into a degenerate case without realizing it.

It would be nice to figure out what I did wrong with my pattern and why backtracking went O(2^n) on me, but if what I believe has happened, I'm not sure there is anything that can be done about it. Trying to recognize that a given case is O(2^n) probably maps to the halting problem, so there's not much that could be done about it.

Of course, it is possible that this is an actual bug, so someone should look into it; but I suspect that's just the way things are. Though I would really like to know what is wrong with

ZeroOrMore(.any, .reluctant)
"\n"
Anchor.endOfLine

And why that fails to match (and becomes a degenerate case).