'Standard' vapor website drops 1.5% of requests, even at concurrency of 100!

wadetregaskis · May 5, 2024, 12:32am

I made it about 30x faster.

Using current.description instead of string interpolation ("\(current)") surprisingly made no measurable difference.
The swap optimisation I suggested earlier did nothing at first, because it turns out BigInt implements += as just + followed by an assignment, which surprisingly the compiler takes at face value and doesn't optimise.
But switching to BigUInt - while it didn't make a difference by itself - did combine with the swap optimisation to improve performance by 30%.
Turning the logging down a notch (to error instead of info) improved performance by ~5% (about half of that can be gained by just redirecting it to /dev/null in the shell - probably those few percent are what Terminal was consuming to update the display).

Of course, turning INFO logging off might not be desirable in production; I was just curious. Not routing to a live tty is definitely valid, though - normally it'd go to a file, at comparatively little cost.
Switching to UIntXL from Numberick improved performance by a further 180%.

So that's nearly 5x faster in sum so far, just from these tweaks.
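
A loop of roughly this shape is one way to express the swap idea. It's a minimal sketch, not the benchmark's actual code: the function name is made up, and UInt64 stands in for the big-integer type so that it runs without any packages (which also means it traps on overflow for n above 92).

```swift
// Hypothetical sketch of an "add in place, then swap" Fibonacci loop.
// UInt64 is a stand-in for a big-integer type; it overflows for n > 92.
func fibonacciSwap(_ n: Int) -> UInt64 {
    var previous: UInt64 = 0   // F(0)
    var current: UInt64 = 1    // F(1)
    for _ in 0..<n {
        previous += current          // previous now holds the next term
        swap(&previous, &current)    // current = next term, previous = old current
    }
    return previous                  // F(n)
}
```

The loop body never names a third value: it adds in place and then exchanges the two variables. That only pays off if the number type's += genuinely mutates in place rather than building a new value and assigning it.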
Using NBKFibonacciXL from Numberick improved performance by a further 525% again.

Of course, it achieves this by relying on a clever algorithmic trick that, metaphorically, turns a linear search into a binary search, so this particular step is "unfair" against the other platforms if you don't also apply the same algorithm to them.

Though one could argue that unless they also have a convenient package pre-made which implements this, it is "fair" to let Swift have the advantage. Benchmark the tools you have, not the tools you want, kind of rationale.
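
One well-known way to get that kind of logarithmic behaviour is "fast doubling", sketched below. I'm assuming, not asserting, that this is the sort of trick involved; I haven't checked what Numberick does internally. The function name is hypothetical, and UInt64 stands in for a big-integer type, so it only works for n up to 92.

```swift
// Hypothetical sketch of fast-doubling Fibonacci, using the identities
//   F(2k)   = F(k) * (2 * F(k+1) - F(k))
//   F(2k+1) = F(k)^2 + F(k+1)^2
// Returns the pair (F(n), F(n+1)). UInt64 overflows for n > 92.
func fibonacciPair(_ n: Int) -> (UInt64, UInt64) {
    if n == 0 { return (0, 1) }
    let (a, b) = fibonacciPair(n / 2)   // a = F(k), b = F(k+1), with k = n / 2
    let even = a * (2 * b - a)          // F(2k)
    let odd = a * a + b * b             // F(2k+1)
    return n % 2 == 0 ? (even, odd) : (odd, even + odd)
}
```

Each call halves n, so it needs about log2(n) steps instead of n additions, at the price of doing multiplications, which for very large numbers are themselves expensive.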

Note: at this point logging does make a difference - a big difference, with up to a third of the performance lost if you emit logs to the Terminal rather than a file.

Next steps

Given these optimisations, it'd be worth re-running the full analysis and re-generating the charts, to see how all this has affected error rates. I did some spot checks and of course still see timeouts if I crank things up enough, and maybe it's showing more logical behaviour now with a seeming steady degradation of success rate as concurrency increases, and no errors at reasonable request rates (i.e. where the server doesn't run out of CPU for the Fibonacci calculations)… but it's hard to be sure.

I'm not even sure I saw the same behaviour as @axello with the original code - as far as I observed, there were no mysterious errors until the request rate genuinely exceeded the CPU capacity of the server.

Other observations

Vapor's probably not the bottleneck

/bench/1 - i.e. returning the literal string constant "0" - is about 70x faster than /bench/10000 (with the original code), making me pretty confident that Vapor is not remotely the bottleneck in the benchmark as originally written.

But, given the optimisations I made above, the performance of /bench/10000 is now nearly half that of /bench/1 - i.e. getting close enough that we can no longer ignore Vapor's component - so if one were to pursue further optimisations it might pay to look into Vapor or NIO themselves (obviously starting with a time profile to get a feel for the hotspots, which I stopped short of doing).

`wrk`'s scaling behaviour

I based my measurements & optimisations on:

wrk --threads 20 --duration 30s --connections 20 --latency --timeout 2s http://127.0.0.1:8080/bench/10000

Logically this seems like about the sweet spot for efficiency on my 20-logical-cores iMac Pro, and indeed some experimentation in thread & connection counts either side of that seemed to support that hypothesis.

Changing the number of threads didn't really have much effect, though - the default is just two, which seemed to achieve similar throughput numbers. 120 threads was noticeably worse-performing (although not by a huge amount - a few percent).

I assume that these requests & responses are so small that even two threads can easily do thousands of them per second. Presumably if e.g. the response body were much larger, it would take wrk more time to handle them and therefore require more threads for wrk itself to not become the bottleneck.

Upping the "connections" to e.g. 1,000 saw it reporting errors. I played with the "connections" number a bit and saw that it vaguely correlated with the number of errors (and with degrading successful req/s), which makes sense - presumably as concurrency goes up, an increasing portion of server time is effectively wasted on requests that don't get a response within two seconds and so don't count.
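
A back-of-the-envelope way to see why the connection count matters is Little's law: when every connection always has one request in flight, mean latency is roughly connections divided by throughput. The sketch below just does that arithmetic. The function names are made up and the 400 req/s figure is purely hypothetical, not something I measured.

```swift
// Little's law for a closed-loop load generator (each connection waits for
// its response before sending the next request):
//   mean latency ≈ connections in flight / throughput
func meanLatencySeconds(connections: Int, throughputPerSecond: Double) -> Double {
    Double(connections) / throughputPerSecond
}

// True when the average request would already be past the client's timeout.
func exceedsTimeout(connections: Int, throughputPerSecond: Double,
                    timeoutSeconds: Double) -> Bool {
    meanLatencySeconds(connections: connections,
                       throughputPerSecond: throughputPerSecond) > timeoutSeconds
}

// Hypothetical server that tops out at 400 req/s, with a 2 s timeout:
let hypotheticalThroughput = 400.0
let fewConnections = meanLatencySeconds(connections: 20,
                                        throughputPerSecond: hypotheticalThroughput)
let manyConnections = meanLatencySeconds(connections: 1_000,
                                         throughputPerSecond: hypotheticalThroughput)
print(fewConnections, manyConnections)
```

With those made-up numbers, 20 connections average well under the timeout while 1,000 connections average 2.5 s, i.e. past the two-second cutoff even though the server is doing exactly the same amount of work per second.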

Plain HTTP (without TLS) is not representative of real-world situations (and therefore these benchmarks might not be either)

I hadn't realised until I played with it myself that this benchmark is not using TLS. It's therefore not really benchmarking a real-world scenario, irrespective of what you think about using Fibonacci calculations as the load.

Including TLS would of course increase the complexity of the benchmark by adding another non-trivial layer of software into the stack, but it's only fair since every platform has to do it (and I'd be very surprised if any of the other languages tested don't have TLS very heavily optimised already).

For low request rates I wouldn't expect TLS to change much, but once you start getting to O(100k/s) that could change significantly.