'Standard' Vapor website drops 1.5% of requests, even at concurrency of 100!

Benchmarks have many challenges, and the one you're touching on here is particularly difficult to square: what's "normal" code? Often people benchmark the "simple" or even naive code, having put no real effort into tailoring it to each language & framework, but is that representative of real-world use? If you were actually running a web server to calculate Fibonacci numbers, and you had enough load that you cared about performance at all, would you really not try to optimise it a bit?
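To make that concrete, here's roughly what the two ends of that spectrum look like for the Fibonacci example (purely illustrative Swift; the function names are mine, not from any of the benchmarks being discussed):

```swift
// Naive end of the continuum: exponential-time recursion, the sort of code a
// benchmark gets when nobody has tried to optimise anything.
func fibNaive(_ n: Int) -> Int {
    n < 2 ? n : fibNaive(n - 1) + fibNaive(n - 2)
}

// "A little effort": linear-time iteration. Still perfectly readable, vastly faster.
func fibIterative(_ n: Int) -> Int {
    var (a, b) = (0, 1)
    for _ in 0..<n {
        (a, b) = (b, a + b)
    }
    return a
}
```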

Benchmarking is best done as an exploration of a continuum between "the most naive implementation anyone can come up with" and "the most optimised implementation anyone can come up with". Somewhere in between those two extremes is probably what most people actually use in the real world, because that's roughly the sweet spot on the curve of programming effort (and skill) vs performance.

This is why I like benchmarks which encourage audience participation, and/or provide many example solutions for a given problem, e.g. the Computer Language Benchmarks Game. You can peruse the submissions and evaluate trade-offs like readability & maintainability vs runtime performance (or trade-offs in runtime performance, like CPU vs memory usage, that arise from different implementation details).

The increased data lets you answer multiple questions, like "which language is most forgiving of naive code" vs "which language is fastest given a little effort" vs "which language is most sensitive to implementation details" etc. And review all the nuances of that, like: just how much uglier is heavily-optimised C code vs Rust vs Swift vs Java?

Indeed. That might approximate a very I/O-heavy workload (like a micro-service architecture where most "work" is just bouncing bytes between machines), so it's not necessarily a useless benchmark, but it doesn't seem all that promising. I think @axello's goal is wise, of including a workload that's at least vaguely realistic.

It's easy enough to busy-wait on a timeout, though. Here's a bunch of examples (and an overview of their relative performance, which might matter here as inefficient clock APIs add more skid to your timing).
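A minimal sketch of that approach, assuming modern Swift (ContinuousClock requires Swift 5.7+; the 5 ms figure is arbitrary):

```swift
// Spin on a monotonic clock until the deadline passes, simulating CPU-bound
// "work" of a fixed duration.
func busyWait(for duration: Duration) {
    let clock = ContinuousClock()
    let deadline = clock.now.advanced(by: duration)
    while clock.now < deadline {
        // Spin. A real benchmark might do some token arithmetic here so the
        // work can't be optimised away, but the clock read alone keeps the
        // loop alive.
    }
}

busyWait(for: .milliseconds(5))   // simulate ~5 ms of CPU-bound work per request
```

The cost of the clock read itself matters, since it's paid on every iteration of the loop.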

Still, that's a dubious normalisation. If Swift really is substantially worse at the actual server workload, then you probably want to see that in this sort of benchmark. It's irrelevant how fast Vapor is if Vapor's not the actual bottleneck in the real world.

And while this is important to note because it suggests the benchmarks are biased, it's also worth considering that comparatively few people are experienced in Vapor (or Swift on the server in general). So it's not necessarily unfair - from a certain perspective.

Again, though, this is why it's very important to explain your benchmark setup and assumptions. "What you can expect to get from the current market" is very different from "what is possible".

I like this goal. Abysmally few people (and fewer companies) care about this. Even when it also costs them money (electricity is expensive, but even that pales in comparison to the capital & operational costs of over-provisioning).

Along those lines, you might benefit from emphasising the peak memory discrepancy more. From my experience at LinkedIn (a very typical large, micro-services, Java server stack), peak memory usage is by far the dominating factor in both capital costs and energy usage (DRAM is a lot more power-hungry than people think). Languages like Java require huge amounts of RAM in order to be performant [in non-trivial workloads], which is out of whack with the actual shape of hardware, so you end up maxing out the physical RAM capacity of every server and still using only a tiny fraction of its CPU capacity.
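If it helps, a minimal sketch of one way to capture that number on Linux (reading VmHWM from /proc/self/status; the helper name is mine):

```swift
import Foundation

// Minimal sketch (Linux-only): the kernel tracks the process's peak resident
// set size as "VmHWM" in /proc/self/status, in kibibytes.
func peakRSSKiB() -> Int? {
    guard let status = try? String(contentsOfFile: "/proc/self/status", encoding: .utf8) else {
        return nil
    }
    for line in status.split(separator: "\n") where line.hasPrefix("VmHWM:") {
        // Line format: "VmHWM:     123456 kB"
        let fields = line.split { $0 == " " || $0 == "\t" }
        if fields.count >= 2, let kib = Int(fields[1]) {
            return kib
        }
    }
    return nil
}

if let peak = peakRSSKiB() {
    print("peak RSS: \(peak / 1024) MiB")
}
```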

Contrast that with e.g. Google where core infrastructure is mostly in C++, so you need an order of magnitude fewer machines to serve the same workload. Given that most servers use a huge amount of power just to sit there idle¹, and thanks to Turbo Boost and its ilk even the slightest workload rockets CPU power usage up to TDP limits anyway, just think about how much is being wasted by having ten times as many servers as necessary.

It may well be one of the greatest failures of Computer Science that it's taught generations of people to obsess over CPU usage and basically ignore RAM usage.


¹ Not because they have to, but because of a combination of ignorance and dubious aspirational beliefs. Hardware designers like to think that their products are used super heavily, running some incredibly optimised, massively parallel nuclear physics calculations or whatever. They try very hard to ignore the sad reality that most of their products spend most of their lives basically idle. So they choose various trade-offs and designs that prioritise largely-hypothetical high loads, at the expense of real loads. Even when their biggest and most important customers go to great lengths to reverse those decisions and publicly show how bad they were.
