'Standard' vapor website drops 1.5% of requests, even at concurrency of 100!

lukasa · May 3, 2024, 6:17am

I doubt that it does. I think Java is just handily beating Swift in calculating the Fibonacci numbers!

lukasa · May 3, 2024, 6:21am

It's worth pushing on this just a touch: whether Java runs the GC ends up being irrelevant. What matters is that allocating new objects in Java is approximately free, and that it delays reclaiming them. Because BigInt objects in almost all languages are dynamically sized, they require a lot of heap allocations, particularly if you create a lot and throw them away, as your benchmark does. Java and JavaScript both use a bump allocator for these short-lived objects, and none of them are going to survive out of young-gen.

This is a really strong advantage in this benchmark. If you could get the BigInts allocated out of a bump allocator in the Swift code you'd likely see an immediate improvement in Swift's performance.
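
For readers unfamiliar with the term, here is a minimal sketch of what bump allocation means, written in Swift. The `BumpArena` type and its method names are hypothetical, invented for illustration. This is not how the JVM, V8 or any Swift BigInt package actually allocates; it only shows the general idea that an allocation is an offset increment and that everything is released in one go.

```swift
/// A toy bump allocator: hands out slices of one preallocated buffer
/// by advancing an offset. Hypothetical type, for illustration only.
final class BumpArena {
    private let base: UnsafeMutableRawPointer
    private let capacity: Int
    private var offset = 0

    init(capacity: Int) {
        self.capacity = capacity
        self.base = UnsafeMutableRawPointer.allocate(byteCount: capacity, alignment: 16)
    }

    deinit { base.deallocate() }

    /// Number of bytes handed out since the last reset.
    var bytesUsed: Int { offset }

    /// Returns `count` uninitialised 64-bit words, or nil if the arena is full.
    /// The "allocation" is just a bounds check and an addition.
    func allocateWords(_ count: Int) -> UnsafeMutableBufferPointer<UInt64>? {
        let bytes = count * MemoryLayout<UInt64>.stride
        guard offset + bytes <= capacity else { return nil }
        let start = (base + offset).bindMemory(to: UInt64.self, capacity: count)
        offset += bytes
        return UnsafeMutableBufferPointer(start: start, count: count)
    }

    /// Releases every allocation at once by rewinding the offset.
    func reset() { offset = 0 }
}

let arena = BumpArena(capacity: 1024)
if let limbs = arena.allocateWords(4) {
    limbs.initialize(repeating: 0)
    limbs[0] = 42
    print("first limb: \(limbs[0]), bytes used: \(arena.bytesUsed)")
}
arena.reset()  // all short-lived values are gone in one step
```

The trade-off is that individual values cannot be freed, which is fine for temporaries that all die together, as the intermediate BigInts in this benchmark do.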

hassila · May 3, 2024, 8:05am

Cross-environment benchmarks are always difficult, but if you really want to highlight the differences between the networking request handling under load, it is clearly desirable to have the equivalent load for all environments.

I would suggest that you create a load function that takes its own running time into account, to make this parameter constant across environment measurements: just take timestamps in a busy loop and wait for a configurable amount of time. If you want to check how things work under load, this effectively also blocks the thread/task/CPU, but it does so for the same amount of time across environments and lets you focus on what you seem to be trying to benchmark.
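
A minimal sketch of such a load function in Swift, assuming a monotonic clock read via `DispatchTime` is acceptable. The function name `busyWait(milliseconds:)` is made up for this example, and the equivalent would need to be written for each of the other environments.

```swift
import Dispatch

/// Spins on the monotonic clock until `milliseconds` have elapsed,
/// keeping the current thread busy for the whole duration.
/// Returns how many times the loop spun (hypothetical helper, for illustration).
@discardableResult
func busyWait(milliseconds: UInt64) -> UInt64 {
    let deadline = DispatchTime.now().uptimeNanoseconds + milliseconds * 1_000_000
    var spins: UInt64 = 0
    while DispatchTime.now().uptimeNanoseconds < deadline {
        spins &+= 1
    }
    return spins
}

// Usage: simulate 20 ms of CPU-bound "work" per request.
let start = DispatchTime.now().uptimeNanoseconds
busyWait(milliseconds: 20)
let elapsedMs = (DispatchTime.now().uptimeNanoseconds - start) / 1_000_000
print("busy-waited for \(elapsedMs)ms")
```

Because the loop exits on elapsed time rather than on completed work, every environment is loaded for the same duration regardless of how fast its arithmetic is.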

johannesweiss · May 3, 2024, 9:42am

Right, that's kinda what I assumed. Your benchmark is running three big integer libraries that have been developed and tuned over decades against a Swift library that saw its last contribution years ago and presumably has seen a lot less tuning.

Now I'm not looking for excuses and Swift code can and should absolutely be able to compete on a BigInt benchmark. But I don't think we can expect a stale community project to compete against v8's BigInt, GMP or Java's BigInteger. These have literally been tuned for decades and might not even be written in the languages themselves (they might be hand-tuned assembly).

The thing that pains me the most here is that I'm not aware of a Swift big integer library that's actively being worked on and that I would assume can compete with the above. And that's absolutely a problem.

Sure, that's not a bad thing but instead of adding many hundreds of thousands of lines of HTTP servers, HTTP client benchmarking, network and what not, if we want to compare big int performance then I'd suggest we instead use a program

import Dispatch

let start = DispatchTime.now()
for _ in 0..<1_000 {
    _ = fib(10_000)  // discard the result; only the elapsed time matters
}
let end = DispatchTime.now()
print("took \((end.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000)ms")

as pretty much the same benchmark, but much shorter and with far fewer lines of code to discuss. I'm sure if you compared such a program with the equivalent Java/JS/PHP/... programs you'd see an even bigger difference.

Your intuition is not wrong. A compiled language, especially one like Swift, can achieve better performance. But it's still possible to write slow programs, and especially when it comes to comparing built-in high-performance primitives like BigInts in v8 (for JavaScript)/Java/... the performance of the calling (scripting) language won't matter at all. If you spend 95% of your time running potentially hand-tuned C/assembly routines for the big int, then it won't matter that you're calling them from PHP.

A good example of this is the massive use of "slow" Python in ML despite ML being very very computationally intensive. Most of the code is C++/CUDA/... running on CPU/GPU but it's often driven from "slow" Python programs. Optimising the Python often doesn't do much because 99.99% of the time is already spent in C++/CUDA/...

Let's take a step back: I assume your goal is to get the Swift code to perform in line with the expectations. One path that I'm convinced can lead to success is if you start by focussing exclusively on the fibonacci code and optimise that. Your problem is a performance engineer's dream! You spend 95% of the time in probably less than 5% of the code. One strategy that usually helps is to increase iteration speed by making things simpler: remove as much auxiliary code as possible. In your case you literally need to keep only 5 lines of code: a loop around fib(10000).

If you had such a simple loop in all languages then you'd be able to optimise the Swift code to match or even surpass the others. And you could draw more of the Swift community to help you. Many Swift engineers won't look into your benchmark because they might not know anything about Vapor, HTTP or networking. So it can help to show that what's likely the core of the issue is outside of Vapor, HTTP or networking.
Maybe the Swift BigInt implementation has some low-hanging performance fruit? Maybe there's a different big integer implementation that's faster? Or maybe you want to call GMP from Swift like you do from PHP.

Once the fibonacci slowness has been overcome, you can reintegrate into your server and I'm sure you'd see much much better results. That may yield new things that can be optimised but things will look brighter.

axello · May 3, 2024, 11:19am

That explains the rapid memory growth of the java and javascript applications.

My --moral-- problem is: if I tweak the swift code to better handle memory allocations, the comparison seems less 'honest'. I could for example use the swap suggestion that @wadetregaskis proposes. But I should change the other code as well to even the odds, so to say.

OR, or: find a completely different benchmark.
Or only calculate the floating point number and thus remove the need for the BigInt third-party library.

Does anyone have an idea for a standalone benchmark technique, which does not involve databases etc. (Like TechEmpower does). Ideally where we can differentiate between the workload and the framework overhead?

Max_Desiatov · May 3, 2024, 11:23am

This seems the most appropriate to me, otherwise you're mostly benchmarking BigInt libraries here, which greatly obscures networking and concurrency aspects. You'd still be benchmarking floating point implementation in that case, but that's more equalized across different languages and usually has no dependency on a third-party numerical library.
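
As a rough sketch of what the floating-point variant could look like in Swift: the function name `fibDouble` is hypothetical, and this is only one reading of "calculate the floating point number". Note the assumption it rests on: a `Double` holds integers exactly only below 2^53, and the value overflows to infinity long before n = 10,000, so the result is a workload of equal shape in every language rather than a correct Fibonacci number.

```swift
/// Iterative Fibonacci in Double precision.
/// Exact only while values stay below 2^53; for large n the result
/// is rounded and eventually becomes +infinity.
func fibDouble(_ n: Int) -> Double {
    precondition(n >= 0, "n must be non-negative")
    var previous = 0.0  // F(0)
    var current = 1.0   // F(1)
    if n == 0 { return previous }
    for _ in 1..<n {
        (previous, current) = (current, previous + current)
    }
    return current
}

print(fibDouble(10))      // small inputs are exact
print(fibDouble(10_000))  // far beyond Double's range
```

Since every language in the comparison has a built-in 64-bit float, the same few lines port directly and no third-party numerical library is involved.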

vns · May 3, 2024, 12:18pm

I would look at this problem from a different angle.

As was pointed out, BigInt for Swift has been stale and is not as optimized as on other platforms. So the comparison isn’t fair from the start: the other players are a few decades of improvements ahead. That means fine-tuning the Swift code doesn’t make the comparison less honest; it would be odd if Swift performed better at such a task just by virtue of existing.
All the languages you are comparing against have a GC, and Swift is the “outsider” here. Handling operations in the same way in languages that have completely different memory workflows, and expecting the same result, isn’t fair: you need to use the approach appropriate to each language and technology.

With those considerations, it makes a lot of sense to adjust the Swift version. It might be interesting as well to add one more language without GC, e.g. Rust, to the comparison.

You can try doing some large list processing, like 1brc in Java did a few months back with an extremely simple task over a large dataset. A list that is not as large, but still big enough, with a similar task to perform might be a good substitute.

ratranqu · May 3, 2024, 1:03pm

One way to have a very similar workload for every language is to just have a fixed time delay (maybe even with the time delay as a param to your request so you can skew one way or another) as the compute part of your request (as @hassila hinted at). Then you know that you are comparing only the frameworks.

jimc · May 3, 2024, 1:16pm

Are you trying to benchmark the framework as a whole or just the network/request handling portion?

If just the network/request handling, then do as already mentioned and build a small dynamic response and wait a configurable amount of time. This will help get you closer to testing the thing you think you’re testing instead of the current test which tests something else.

If the whole framework, how big of a result do you want to build and in what formats (html, json)? Any sensible test should generate a reasonable size response similar to a normal web app. Also how many routes does the web app have configured? I don’t have any production rails apps with less than 100 routes. Testing a 3-route web app that returns 20 bytes of text not even built with the normal result builders isn’t really “testing the web app framework”.

Also, most frameworks have a default ORM and that is used in most web apps so by eliminating that from the test you’re eliminating half the framework.

You do have a good topic now for a follow-up talk about the difficulty of benchmarking and the risk of benchmarking something different than you intended.

axello · May 3, 2024, 1:34pm

Thanks for the suggestion. Like a large list sorting task?
I will have to look up what 1brc did, and it is hopefully not too complicated, as I need to rewrite it in all these different languages.

Thanks for your other valuable different angles

I'll rewrite the swift one to reuse the same variable instances, but before testing I need to figure out why vapor+swift do not seem to timeout in the same way as the other languages. Otherwise it's an unfair competition again.
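
To illustrate what "reusing the same instances" can mean, here is a self-contained sketch that does not depend on any BigInt package. It represents numbers as little-endian arrays of 32-bit limbs, which is an assumption made purely for this example; `addInPlace` and this `fib` are hypothetical and are not the code from the benchmark. The point is only that the loop keeps two buffers alive and swaps them, instead of creating a new big integer on every iteration.

```swift
/// Adds `b` into `a` in place (a += b). Both are little-endian arrays of
/// 32-bit limbs. `a` only reallocates when the result needs more limbs.
func addInPlace(_ a: inout [UInt32], _ b: [UInt32]) {
    var carry: UInt64 = 0
    var i = 0
    while i < b.count || carry != 0 {
        if i == a.count { a.append(0) }
        let sum = UInt64(a[i]) + (i < b.count ? UInt64(b[i]) : 0) + carry
        a[i] = UInt32(truncatingIfNeeded: sum)  // low 32 bits
        carry = sum >> 32                       // high bits carry over
        i += 1
    }
}

/// Iterative Fibonacci that reuses two limb arrays and swaps them each step.
func fib(_ n: Int) -> [UInt32] {
    precondition(n >= 0, "n must be non-negative")
    var previous: [UInt32] = [0]  // F(0)
    var current: [UInt32] = [1]   // F(1)
    if n == 0 { return previous }
    for _ in 1..<n {
        addInPlace(&previous, current)  // previous now holds the next term
        swap(&previous, &current)       // swaps buffers, no element copies
    }
    return current
}

print(fib(10))               // [55]
print(fib(10_000).count)     // number of 32-bit limbs in F(10000)
```

Whether the same trick is available in the BigInt library used by the benchmark depends on whether it offers in-place operators such as `+=` that avoid a fresh allocation; that would need to be checked against the library itself.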

axello · May 3, 2024, 1:38pm

Thank you for your comment.

However, I do not want to measure only framework overhead. I want to measure application differences. So how many resources would a 'typical' javascript/php/java/swift application need for a specific task? Can I run 1000 swift processes vs 100 javascript processes? The framework is an extra factor indeed. I could for example add Hummingbird to the mix, as well as a gazillion other frameworks, each with its pros and cons.

If you feed me a gazillion euros, I will test with a gazillion frameworks!

axello · May 3, 2024, 1:58pm

Good points Jim, thanks!

I am actually not trying to test the frameworks, but more a 'typical application'. Indeed there are hundreds of 'typical applications', and solving old mathematical constructs is probably not one of them.

But the more variables you add to the mix, the more difficult it becomes to pinpoint cause and effect. Remember Mark Twain's old adage: "Lies, Damned Lies and Statistics!"?

I do not want to test the network, hence I do not want my tests to shove megabytes of javascript down the pipes. I thought below 10 kB of data would be nice, which fits nicely with the algorithm.
I also do not want to mess with databases, as then you quickly get into: which database? But I agree with your ORM sentiment.

It all boils down to “what am I benchmarking?” indeed.

Hence my request to my audience: how would you benchmark this without resorting to 100s of different code pieces and creating templates for all the different templating engines and database servers out there?
The people at TechEmpower.com did that, and it's a bewilderment of choices. And the Vapor versions are apparently obsolete.
Mind you: I am looking more for a general qualitative comparison than a "my code is a millisecond faster than your code" competition.

Haha, indeed! I'm trying to get funded for some follow-up research.

jaleel · May 3, 2024, 2:10pm

Actually I like the initiative!
Maybe a good way to test more fairly is to take some C/C++ implementation of some workload?

ksluder · May 3, 2024, 3:06pm

Eh, I’d be concerned that the specific implementation of “wait” could drastically impact the results. Waiting by sleeping would not load the CPU, which drastically impacts scheduler behavior. This is definitely relevant to how the concurrency runtime is going to behave when handling incoming requests or resuming outgoing responses.

jimc · May 3, 2024, 3:50pm

The short answer is that it just doesn’t matter, they are all fast enough. If you have a scenario where it does matter, your problem is likely not addressed by a simple benchmark.

The right answer 99.9% of the time to the question of “which web app framework should I use?” is “which web app framework do I know?” or “which web app framework do I want to learn?” or “which web app framework fits best with the rest of the infrastructure?”. And for that 0.1% where it does matter, a simple benchmark probably doesn’t test the specifics of your problem so it’s pointless to even consider.

In your own example: a realistic simple benchmark would likely show Vapor in the same range as other web app servers. But if your specific need was computing Fibonacci sequences, a benchmark that did not use BigInt wouldn’t show the performance problems you were about to encounter.

axello · May 3, 2024, 4:06pm

Hi Jim,

In the article I wrote that I do not care about speed as is. Although I am worried why Vapor seems to drop so many requests immediately. What I would like to measure is the environmental impact. How much energy does this RoR application use to handle 100000 requests, compared to a C++ implementation of the same code?
To name two examples that are NOT in this report.

Thus far, no-one is interested in environmental impact: "Just add more hardware" for our badly-written non-optimised perl implementation of our business logic.

However, with doubled prices for energy at most colocations, this environmental impact becomes real business impact.

Yes, our four 16-core servers are fast enough to handle our e-commerce website. But what if we had chosen another framework and technology stack? Could we have done it on two 8-core servers as well?

I think several members of this forum know the answers to that, but I want to measure and publish this general consensus.

Regards,

Axel

axello · May 3, 2024, 4:09pm

Thanks Jaleel!
I already got an offer for a C++ implementation.
However, first we need to make clear that this simple Fibonacci workload is indeed a fair test for actual 'server load'.

When we take into account database interactions and templating engines, writing it in C++, which is not commonly used for web development, becomes a pain and a huge effort!

axello · May 3, 2024, 4:21pm

That is interesting and also what I gathered from all the posts above.

However, in my defense: vapor does not work optimally out of the box.
As a budding web developer I wrote simple applications in Node.js, php and java.
For Node.js and php I needed to put a load balancer in front of it, as they are single threaded. Vapor and java apparently do that 'out of the box'.
However, java's out of the box experience seems far superior to Vapor's. At least from my inexperienced view.

If I build a web-app using current practices (vapor website and all), I expect it to work 'decently', not drop requests under a light load. I don't know yet about server optimisations, as that is a whole different field to become experienced in.

Or perhaps we need more 'optimise your Vapor website' talks over 'beginning with Vapor' talks (sorry @0xTim , Looking forward to SSS in London!)

Disclaimer: the java benchmark was written by a java devops engineer. The Vapor app by a webdev noob who only started web development with Vapor 1, 2 and 4. (me)

wadetregaskis · May 3, 2024, 4:58pm

Benchmarks have many challenges, and this one that you're touching on is particularly difficult to square: what's "normal" code? Often people benchmark the "simple" or even naive code, having put no real effort into tailoring it to each language & framework, but is that representative of real-world use? If you were actually running a web server to calculate Fibonacci numbers, and you had enough load that you cared about performance at all, would you really not try to optimise it a bit?

Benchmarking is best done as an exploration of a continuum between "most naive implementation anyone can come up with" and "most optimised implementation anyone can come up with", where somewhere in-between those two extremes is probably what most people actually use in the real world, because that represents roughly the sweet spot on the curve of programming effort (and skill) vs performance.

This is why I like benchmarks which encourage audience participation, and/or provide many example solutions for a given problem. e.g. the Computer Language Benchmarks Game. You can peruse the submissions and evaluate trade-offs like readability & maintainability vs runtime performance (or trade-offs in runtime performance, like CPU vs memory usage, that arise from different implementation details).

The increased data lets you answer multiple questions, like "which language is most forgiving of naive code" vs "which language is fastest given a little effort" vs "which language is most sensitive to implementation details" etc. And review all the nuances of that, like: just how much uglier is heavily-optimised C code vs Rust vs Swift vs Java?

Indeed. That might approximate a very I/O-heavy workload (like a micro-service architecture where most "work" is just bouncing bytes between machines), so it's not necessarily a useless benchmark, but it doesn't seem all that promising. I think @axello's goal is wise, of including a workload that's at least vaguely realistic.

It's easy enough to busy-wait on a timeout, though. Here's a bunch of examples (and an overview of their relative performance, which might matter here as inefficient clock APIs add more skid to your timing).

Still, that's a dubious normalisation. If Swift really is substantially worse at the actual server workload, then you probably want to see that in this sort of benchmark. It's irrelevant how fast Vapor is if Vapor's not the actual bottleneck in the real world.

And while this is important to note because it suggests the benchmarks are biased, it's also worth considering that comparatively few people are experienced in Vapor (or Swift on the server in general). So it's not necessarily unfair - from a certain perspective.

Again, though, this is why it's very important to explain your benchmark setup and assumptions. "What you can expect to get from the current market" is very different from "what is possible".

I like this goal. Abysmally few people (and fewer companies) care about this. Even when it also costs them money (electricity is expensive, but even that pales in comparison to the capital & operational costs of over-provisioning).

Along those lines, you might benefit from emphasising the peak memory discrepancy more. From my experience at LinkedIn (a very typical large, micro-services, Java server stack) peak memory usage is by far the dominant factor in both capital costs and energy usage (DRAM is a lot more power-hungry than people think). Languages like Java require huge amounts of RAM in order to be performant [in non-trivial workloads], which is out of whack with the actual shape of hardware, so you end up maxing out the physical RAM capacity of every server and still using only a tiny fraction of its CPU capacity.

Contrast that with e.g. Google where core infrastructure is mostly in C++, so you need an order of magnitude fewer machines to serve the same workload. Given that most servers use a huge amount of power just to sit there idle¹, and thanks to Turbo Boost and its ilk even the slightest workload rockets CPU power usage up to TDP limits anyway, just think about how much is being wasted by having ten times as many servers as necessary.

It may well be one of the greatest failures of Computer Science that it's taught generations of people to obsess over CPU usage and basically ignore RAM usage.

¹ Not because they have to, but because of a combination of ignorance and dubious aspirational beliefs. Hardware designers like to think that their products are used super heavily, running some incredibly optimised, massively parallel nuclear physics calculations or whatever. They try very hard to ignore the sad reality that most of their products spend most of their lives basically idle. So they choose various trade-offs and designs that prioritise largely-hypothetical high loads, at the expense of real loads. Even when their biggest and most important customers go to great lengths to reverse those decisions and publicly show how bad they were.

wadetregaskis · May 3, 2024, 5:22pm

Seems a bit… harsh? What makes you think it's stale? It's been in development for years, with hundreds of patches by dozens of people, including fairly recently (September). When I surveyed Swift 'big int' packages a few months back it was the most mature and performant that I could find.

I haven't examined its performance super closely, but what I have profiled and decompiled has looked pretty good. At least by Swift standards. Sure, there's room for improvement in principle, but mostly it's at the mercy of the Swift language & compiler (e.g. copy-paste the code into your module, rather than importing it from another module, and you'll get a nice performance boost, but that's just how Swift works, for better and worse).
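
One plausible reading of the module-boundary remark, offered here as an assumption rather than as a diagnosis of that particular package: by default the Swift compiler cannot see the bodies of functions in another module, so generic code is not specialised or inlined at the call site. A library author can opt in per function with the `@inlinable` attribute. The function below is a made-up example of what that looks like.

```swift
/// In a library module, `@inlinable` makes the function body visible to
/// client modules, so the optimizer may specialise it for the concrete
/// type used at the call site and inline it.
/// Hypothetical example function, not taken from any BigInt package.
@inlinable
public func sumOfSquares<T: BinaryInteger>(_ values: [T]) -> T {
    var total: T = 0
    for value in values {
        total += value * value
    }
    return total
}

print(sumOfSquares([1, 2, 3]))  // 14
```

In a single file like this the attribute changes nothing, because the body is already visible. Its effect only shows up when the function lives in a separately compiled package, which is the situation described above.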