'Standard' Vapor website drops 1.5% of requests, even at a concurrency of 100!

It is using all the cores at max! As long as you have as many connections as cores, the benchmark will fully utilise all your CPU cores, forever.

The client side of the benchmark is (in parallel, for each connection):

var start = Time.now()
let connection = openConnection()
while true {
    let response = connection.sendRequestAndWaitForResponse()
    let end = Time.now()
    if end - start > 2s { throw Timeout() }
    start = Time.now()
}

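The same loop as a runnable Swift sketch, with the network call stubbed out as a closure. The names (`runConnection`, `send`) are illustrative, taken from the pseudocode above, not a real API; the actual client here is wrk, which is written in C.

```swift
struct Timeout: Error {}

// Per-connection client loop: time each request round-trip and fail the
// whole run if any single request exceeds the 2-second cutoff.
// `send` stands in for sendRequestAndWaitForResponse().
func runConnection(iterations: Int, send: () -> Void) throws -> Int {
    let clock = ContinuousClock()
    var completed = 0
    for _ in 0..<iterations {            // the real loop is `while true`
        let start = clock.now
        send()
        if start.duration(to: clock.now) > .seconds(2) { throw Timeout() }
        completed += 1
    }
    return completed
}
```

With an instant stub for `send` every iteration completes; a stub that blocks for more than two seconds would make the loop throw `Timeout`, which is exactly the failure wrk records.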
So let's assume you have 4 cores and only 4 connections. They will all constantly be busy calculating Fibonacci numbers forever. As soon as the server has calculated one fib(10k), the client will immediately ask it to calculate the next.

See here: the client is configured to use 4 connections, I have 4 CPU cores, and we're using 385% CPU at the instant I took this screenshot. This fluctuates between 380% and 400% CPU because there's a tiny little period where the kernel is actually switching between client & server (and the client also needs a tiny bit of time to prepare the next request).


Visualising your numbers:


So am I right in saying that a web server that uses the new NIOAsyncChannel, which in effect offloads a large amount of work to Swift Concurrency, might be more susceptible to DoS attacks because the EventLoops are less busy and will accept connections at a faster rate?


Ah yes, its implementation is short & relatively readable but not efficient¹. Lots of temporaries, which means copying, which seems to be the biggest performance-killer in this sort of thing.

@oscbyspro wrote a pretty good version for decimal conversion, that's part of the new Foundation. If you're curious about some of the optimisation details check out the pull request (and if you're tangentially curious about the functional-correctness aspects, check out my initial version - it's harder than you might think to implement this sort of thing even before you start to worry about performance).

If you look at the benchmark results in Oscar's pull request, in short "v2x" is Oscar's final version (that's now in Foundation) and "v0" is roughly equivalent to attaswift/BigInt's version. So you can see that using this version should render the cost of converting to a string basically irrelevant, in @axello's benchmark.

While this new version in Foundation doesn't do anything but decimal, it is fully localisation-aware (when used via IntegerFormatStyle) and it works with all BinaryIntegers.

I'm not sure if it's actually shipped in any Apple OS release yet, though.

This reminds me, @axello - Numberick is Oscar's own package containing (among other things) big-int implementations, and is probably more performant (it definitely is for string conversion, at least). Perhaps worth trying instead - it should be pretty trivial to switch to.

¹ Hypothetically it could be, but it'd require the optimiser to do much more dramatic transformations than it does currently.


I believe NIOAsyncChannel upholds backpressure correctly (@FranzBusch will have the details). So it's not pushing new connections onto the Swift Concurrency pools, it's pulling channels from Concurrency land. This pulling will only happen if the Concurrency threads aren't already overloaded. Of course there might be scenarios where some details change but by and large it should be fine.
For peak performance you'll want to get rid of the extra latency of the thread switches between Concurrency and I/O, and that can be done by installing NIO as your Concurrency executor, which means async functions run on NIO's EventLoops.

Truthfully though, I'd be shocked if there are many higher-level server applications that don't have an unbounded queue hidden somewhere. A lot of file-system/database-calling frameworks have unbounded queues where work can just pile up without limit... But these can be fixed if necessary.
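One way to fix such a hidden unbounded queue is to gate admission on a counting semaphore, so producers block (i.e. feel backpressure) instead of letting work pile up without limit. A minimal sketch with illustrative names, not any framework's actual API:

```swift
import Dispatch
import Foundation

// A bounded work queue: `push` blocks once `capacity` items are queued,
// so upstream producers slow down instead of accumulating work forever.
final class BoundedQueue<T> {
    private let slots: DispatchSemaphore   // counts free queue slots
    private let lock = NSLock()
    private var items: [T] = []

    init(capacity: Int) { slots = DispatchSemaphore(value: capacity) }

    func push(_ item: T) {                 // blocks when the queue is full
        slots.wait()
        lock.lock(); items.append(item); lock.unlock()
    }

    func pop() -> T? {
        lock.lock()
        let item = items.isEmpty ? nil : items.removeFirst()
        lock.unlock()
        if item != nil { slots.signal() }  // free a slot for producers
        return item
    }
}
```

The semaphore is the whole trick: work is only accepted while there is capacity to process it, which is the same "pull, don't push" shape as NIOAsyncChannel's connection handling described above.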

And this is why IMHO a low-level thing like SwiftNIO must be held to a somewhat higher standard here: it is important that it doesn't impose any unbounded queues a user can't fix. The 'no unbounded queues' guarantee is upheld even with NIOAsyncChannel; you are still in control of if and when connections/requests are accepted. And of course NIO's core sits 'in front' of NIOAsyncChannel, so if you already have some custom circuit-breaking/backpressure logic you can keep adding it to the serverChannelInitializer to slow connection acceptance down to the rate/count you want. And by adding handlers to the childChannelInitializer you can control the number of in-flight requests. Both allow you to slow things down by slowing down (or outright stopping) outbound read events.

But in most cases I'd think just using NIOAsyncChannel should be fine because it pulls new connections/requests. So if you're overloaded it'll naturally pull more slowly :slight_smile:. Internally it will uphold the backpressure also by slowing the read events just like anything else would, it doesn't have any privileged access.


Thanks, I will test what happens if I add async to my function. However, currently Vapor already accepts almost all the requests (98%) I throw at it, unlike the other languages, which time out increasingly until breakdown. @weissi has written below why Vapor drops requests so quickly.

I wonder if wrk handles Vapor's responses differently; otherwise I do not understand, especially with synchronous code, how it can 'absorb' so many requests.

Very insightful post with lots of good things to think about. Thanks!

That's a good reference website!

I'm called wise :slightly_smiling_face: !

Currently I log the very basic memory output of htop while the benchmarks are running. Manually.
I would like to have a better and more fine-grained solution, preferably one which can be automated to log while the tests run. But some servers run multiple processes, some with different names (e.g. nginx with Node.js and PHP). Do you know of any programs other than the venerable ps which can accurately capture the memory usage of a process?

Well said!

Of course it is click-baity; apparently it works. Same as click-baity username taylorswift :grin:

But the drops are also in the referenced article, and no one until @weissi could explain why only Swift exhibits this behaviour under heavy load.

I did not think I was using anything else but the default Vapor NIO complexity?
What would you suggest: simply returning "Hello World" instead of a calculation?

We then might not see the timeouts/drops, but it defeats the whole purpose of the test. (IMHO)

Thanks for creating the library so that I could write my benchmark!
And thanks for benchmarking it against the Node.js implementation. Amazing that JavaScript is faster, but it probably uses way more memory (which is what I like to benchmark as well).

It reminds me of the Java vs C++ benchmarks of a bygone age; I just couldn't believe a JIT language was faster than compiled C++. Yet here we are.

Thanks @johannesweiss, I will have my work cut out for me this Monday!
I will create various Docker images with your and @wadetregaskis's proposed changes.

Thanks, I'll create a version with this package and see how it compares.

I made it about 30x faster.

  1. Using current.description instead of string interpolation ("\(current)") surprisingly made no measurable difference.

  2. The swap optimisation I suggested earlier did nothing to start with, because it turns out BigInt implements += as just + and then an assignment, which surprisingly the compiler takes at face value and doesn't optimise.

  3. But switching to BigUInt - while it didn't make a difference by itself - did combine with the swap optimisation to improve performance by 30%.

  4. Turning the logging down a notch (to error instead of info) improved performance by ~5% (about half of that can be gained by just redirecting it to /dev/null in the shell - probably those few percent are what Terminal was consuming to update the display).

    Of course, turning INFO logging off might not be desirable in production; I was just curious. Not routing to a live tty is definitely valid, though - normally it'd go to a file, at comparatively no cost.

  5. Switching to UIntXL from Numberick improved performance by a further 180%.

    So that's nearly 5x faster in sum so far, just from these tweaks.

  6. Using NBKFibonacciXL from Numberick improved performance by a further 525% again.

    Of course, the reason it does this is because it relies on a clever algorithmic trick to metaphorically turn it from a linear search to a binary search, so this particular step is "unfair" against the other platforms if you don't also apply the same algorithm to them.

    Though one could argue that unless they also have a convenient package pre-made which implements this, it is "fair" to let Swift have the advantage. Benchmark the tools you have, not the tools you want, kind of rationale. :man_shrugging:

    Note: at this point logging does make a difference - a big difference, with up to a third of the performance lost if you emit logs to the Terminal rather than a file.
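To make the two big wins above concrete, here are both ideas sketched with plain `Int` for brevity (which overflows past fib(92); the benchmark's big-int versions apply the same ideas, where they matter far more). The fast-doubling identities are standard; whether NBKFibonacciXL implements them exactly this way is an assumption, it uses an O(log n) trick of this kind.

```swift
// 1. Iterative with in-place mutation (the "swap" optimisation): still
//    O(n) additions, but no fresh temporaries each step - which matters
//    once each value heap-allocates, as with BigUInt.
func fibLinear(_ n: Int) -> Int {
    var a = 0, b = 1              // invariant: a = F(k), b = F(k+1)
    for _ in 0..<n {
        a += b                    // a = F(k+2), reusing a's storage
        swap(&a, &b)              // restore invariant: a = F(k+1), b = F(k+2)
    }
    return a                      // F(n)
}

// 2. Fast doubling: F(2k) = F(k)*(2*F(k+1) - F(k)) and
//    F(2k+1) = F(k)^2 + F(k+1)^2, so only O(log n) steps are needed -
//    the kind of algorithmic trick behind the 525% jump above.
func fibFast(_ n: Int) -> Int {
    func pair(_ n: Int) -> (Int, Int) {   // returns (F(n), F(n+1))
        if n == 0 { return (0, 1) }
        let (a, b) = pair(n / 2)
        let c = a * (2 * b - a)           // F(2k)
        let d = a * a + b * b             // F(2k+1)
        return n % 2 == 0 ? (c, d) : (d, c + d)
    }
    return pair(n).0
}
```

This is also why step 6 is "unfair" to the other platforms: it changes the algorithm, not just the implementation, and any language could adopt the same doubling identities.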

Next steps

Given these optimisations, it'd be worth re-running the full analysis and re-generating the charts, to see how all this has affected error rates. I did some spot checks and of course still see timeouts if I crank things up enough, and maybe it's showing more logical behaviour now, with a seemingly steady degradation of the success rate as concurrency increases and no errors at reasonable request rates (i.e. where the server doesn't run out of CPU for the Fibonacci calculations)… but it's hard to be sure.

I'm not even sure I saw the same behaviour as @axello with the original code - as far as I observed, there were no mysterious errors until the request rate genuinely exceeded the CPU capacity of the server.

Other observations

Vapor's probably not the bottleneck

/bench/1 - i.e. returning the literal string constant "0" - is about 70x faster than /bench/10000 (with the original code), making me pretty confident that Vapor is not remotely the bottleneck in the benchmark as originally written.

But, given the optimisations I made above, the performance of /bench/10000 is now nearly half that of /bench/1 - i.e. getting close enough that we can no longer ignore Vapor's component - so if one were to pursue further optimisations it might pay to look into Vapor or NIO themselves (obviously starting with a time profile to get a feel for the hotspots, which I stopped short of doing).

wrk's scaling behaviour

I based my measurements & optimisations on:

wrk --threads 20 --duration 30s --connections 20 --latency --timeout 2s

Logically this seems like about the sweet spot for efficiency on my 20-logical-cores iMac Pro, and indeed some experimentation in thread & connection counts either side of that seemed to support that hypothesis.

Changing the number of threads didn't really have much effect, though - the default is just two, which seemed to achieve similar throughput numbers. 120 threads was noticeably worse-performing (although not by a huge amount - a few percent).

I assume that these requests & responses are so small that even two threads can easily do thousands of them per second. Presumably if e.g. the response body were much larger, it would take wrk more time to handle them and therefore require more threads for wrk itself to not become the bottleneck.

Upping the "connections" to e.g. 1,000 saw it reporting errors. I played with the "connections" number a bit and saw that it vaguely correlated with the number of errors (and degrading successful req/s), which makes sense - presumably as concurrency goes up, increasing portions of server time are effectively wasted because it doesn't respond within two seconds and so doesn't count.

HTTP is not representative of real-world situations (and therefore these benchmarks might not be either)

I hadn't realised until I played with it myself that this benchmark is not using TLS. It's therefore not really benchmarking a real-world scenario, irrespective of what you think about using Fibonacci calculations as the load.

Including TLS would of course increase the complexity of the benchmark by adding another non-trivial layer of software into the stack, but it's only fair since every platform has to do it (and I'd be very surprised if any of the other languages tested don't have TLS very heavily optimised already).

For low request rates I wouldn't expect TLS to change much, but once you start getting to O(100k/s) that could change significantly.


I meant to note, I'm still looking at the scaling & failure rate behaviour. But I'm getting such confusing results that it's hard to even know where to start.

One thing that's increasingly clear is that the wrk results are wildly variable. e.g. if I run:

wrk --threads 20 --duration 30s --connections 4000 --latency --timeout 2s

…sometimes I get nearly 40k req/s and basically no timeouts, and other times I get mere tens of requests per sec, and basically everything times out.

I don't know if the problem is wrk or Vapor or what, but with such absurdly large variations in performance and general behaviour, it's hard to quantify anything.

But this only happens with high numbers of connections - if I lower it to about 3,000 or less, the variability goes away.

(thankfully I was using a mere 20 or 100 connections for my earlier optimisation work - and I did test pretty thoroughly that those results were consistent across many runs)

Are you using the updated version of Vapor I released this morning?


Yes, this makes total sense. We are pulling connections off a NIOAsyncChannel's inbound stream (which is backpressured). If Swift Concurrency is overloaded we stop pulling connections, and the backpressure on the stream means no more connections can be pushed onto it, so we stop accepting connections.


Right, using something less CPU-heavy and micro-optimisation-dependent than fib(10k) would probably make more sense when investigating web server performance.

Regarding the 'NIO complexity' I think there's something important that got lost in translation. Regardless of the number of connections/concurrency you choose, wrk will send requests as quickly as possible. So even with just 4 connections you'll max out a 4 core machine, in any web framework, in any language. The faster you produce those responses, the faster wrk will be sending new requests.

So with these benchmarks, which always fully load the server machine, when a new connection comes in the server needs to make a decision. It can

  • Either accept the new connection immediately, slowing the existing connections down a little (because now there are more connections to service with the same resources as before)
  • Or it can prioritise the existing connections and slow the connection acceptance (increasing the latency of the first request in the new connection which now has to wait).

This is true for any framework in any language. This choice can be either explicit or implicit or a mixture of both.

The only reason we have discussed how SwiftNIO's default settings work is that this particular benchmark immediately records a failure if even a single request hits a >2s latency.

For example:

  • 10,000 requests at 0.1s latency & one request at 2.1s latency -> 1 error [avg latency 0.1002s]
  • 10,001 requests at 1.99s latency each -> 0 errors [avg latency 1.99s]

I'm not saying having a cut off at 2 seconds is bad or wrong but it's a peculiarity.
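The arithmetic behind that example, as a quick check (numbers taken from the two bullet points above):

```swift
// One 2.1s outlier barely moves the average of 10,000 fast requests,
// yet wrk counts it as an error - while uniformly slow responses that
// stay under the 2s cutoff count as flawless.
let scenarioA = (10_000.0 * 0.1 + 1.0 * 2.1) / 10_001.0
let scenarioB = 1.99
print(scenarioA)   // ≈ 0.1002s average latency, but 1 "error"
print(scenarioB)   // 1.99s average latency, but 0 "errors"
```

So a server with a vastly better average latency can look strictly worse in the error column, which is the peculiarity being described.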

The main reason I recommended that the Vapor devs raise the default setting of maxMessagesPerRead is that benchmarking tools like wrk like to open a lot of connections to start with and immediately load every connection to the max. It's important to not look bad in wrk just to avoid having a 100+-message-long discussion over it :slight_smile:.


Thanks, I will do that. I'd rather do the naive approach, as I'm pretty sure there will be an optimised Fibonacci algorithm for the other languages as well.
➔ Shall I retrace your steps or will you create a PR for it?

Keep in mind: I was running both the system under test and wrk on 2013 Core i5 Linux machines, and you are running on a 20-core Xeon :grin:

Nomenclature was a great problem, as always in programming. To maintain readability of the post I started using Node.js+JavaScript and Vapor+Swift, but then simplified to Node.js and Vapor. Both are mostly used only with their companion languages.
I meant the whole package: Vapor + Swift, not the framework alone.
In this case, as you have demonstrated, the problems seem to come from:
a. a bad default configuration for a production site (i.e. INFO logging)
b. a wrong Docker setup, where it writes to the console (the other benchmarks do not do this by default)
c. a BigInt library that hasn't been maintained recently
d. some assumptions in SwiftNIO which might not cater to benchmarks.

I say all in all: good results!

What kind of server were you running the benchmarks on? Because if you're running it in Docker on a 20-core iMac Pro, I am a bit shocked that it can only handle 1,000 concurrent requests. :flushed:
My Java benchmarks were run on much more limited hardware and went higher.
But at least I guess the memory situation is much better with Swift!

If you have 4 equations and you add a constant (TLS calculations), don't you simply shift each equation up? And if I add a hardware TLS card, it shifts up less?
What I mean is: these benchmarks were ALWAYS meant to be relative to each other. Unless I am forgetting something, but I assume TLS is optimised in all frameworks and not naively calculated in e.g. PHP? There it is probably handled by nginx?

It could indeed be that TLS handling by Swift, or Java, or nginx differs completely. So yeah, adding TLS would be good, but then the results could quickly become too complex to present clearly.



I'm sorry I messed up everybody's weekend.

Can I conclude that these weird (98%) responses from Vapor+Swift compared to other languages come from a slow start, then?
I will create some benchmarks with e.g. a 4-second timeout and see if that improves our swifty results.


Two reasons:

  1. Key reason: the fib(10k) is 3x the amount of work, because you're comparing a community project with the super-optimised, presumably C++-written big-int library of V8 (which is what Node.js uses).
  2. An artefact of wrk's reporting that doesn't play well with the default settings (which have since been changed to accommodate that).

And yes, the high latencies (fixed now) only affected the first request on every new connection, because we artificially deprioritised accepting new connections, and the connection-establishment delay counts towards the first request.

Honestly, I don't think you need to do that with Vapor's new default settings.

But if you want to validate that we're not talking rubbish, you can run wrk -c 100 --timeout 100s -t 4 -d 300s http://127.1:8080/bench/10000 against the old, unchanged server and see that the average request latency should be about the same as with the new, changed server. The max will still be high, but the average and the percentiles will come down; we're just not erroring on the initial >2s requests anymore. And FWIW, the max you'll see will depend on how many connections you're attempting to spin up (because we only accept 4 at a time, so they trickle in under such high load).
