'Standard' vapor website drops 1.5% of requests, even at concurrency of 100!

wadetregaskis · May 5, 2024, 3:19pm

I assume so - I checked it out yesterday afternoon (PDT).

{
  "identity" : "vapor",
  "kind" : "remoteSourceControl",
  "location" : "https://github.com/vapor/vapor.git",
  "state" : {
    "revision" : "4942d74e8493fc918ed6144c835c8a0e6affd4f4",
    "version" : "4.92.1"
  }
}

tera · May 5, 2024, 3:23pm

lorentey:

Count node.js Swift ratio

10,000 1.558ms 4.345ms 2.78x

30,000 8.794ms 33.127ms 3.76x

100,000 69.763ms 350.025ms 5.02x

300,000 552.623ms 3056.003ms 5.52x

1,000,000 5.944s 34.368s 5.78x

I quickly cooked a minimal BigInt implementation from scratch (addition only), here's what I got:

Count	Time
10,000	1.834ms
30,000	10.102ms
100,000	94.939ms
300,000	806.729ms
1,000,000	8.699s

It's not so bad but not quite as good as node.js implementation timing indicated above. In my implementation I've manually unrolled the loop, effectively working with "8x64" bit ints in the outermost loop. Possible reasons for this implementation being slower compared to node.js implementation:

they might be using asm.
they could be using 128-bit ints, while my implementation is using 64-bit ints as the base type.
they could be using SIMD tricks.
there's some ARC traffic in there in my implementation.
the obvious difference could be just due to the differences in the computer specs (I'm on a 2021 MacBook Pro, which is M1 Pro @ 3GHz if I am not mistaken).

BTW, are there GPU based BigInt implementations in the wild and how do they compare to CPU based implementations?

finestructure · May 5, 2024, 3:38pm

You'll want 4.96.0 instead:

RussBaz · May 5, 2024, 4:27pm

Is this option off by default in Vapor? Can you give some pros and cons of this option?

wadetregaskis · May 5, 2024, 5:24pm

I think there is a bug in Vapor (or NIO) here. Or possibly wrk. Or conceivably even the macOS kernel. Whichever it is, something is very broken.

Scaling with Vapor 4.92.1

wrk --threads N' --duration 30s --connections N --latency --timeout 2s http://127.0.0.1:8080/bench/10000

…where N' = min(20, N). No point running more threads than my iMac Pro has logical cores as that just slows wrk down (from experiments and measurements done previously).

Before hitting timeouts

The table below shows some exemplar runs (roughly speaking) for various values of N.

I did see some variance, of course, between runs - perhaps 30 seconds is still too little to produce stable numbers, at least on my iMac Pro with Spotlight and other such vampires sporadically sucking the life out of the CPU.

Connections	Requests/s	P99 latency (seconds)	Max latency (seconds)	Errors	System CPU usage
1	3k	0.0004	0.0061	None	7%
10	24k	0.0005	0.0068	None	60%
18	37k	0.0007	0.0358	None	97%
19	37k	0.0007	0.0315	None	100%
20	37k	0.0008	0.0119	None	100%
21	38k	0.0018	0.0960	None	100%
40	41k	0.0027	0.0174	None	100%
80	41k	0.0048	0.1264	None	100%
160	41k	0.0085	0.1277	read 18	100%
320	41k	0.0161	0.1984	read 247, write 3	100%
640	41k	0.0311	0.4495	read 1502	100%
1280	40k	0.0485	0.7863	read 1740, write 1265	100%
2000	40k	0.0930	1.67	read 9327, write 27	100%

Scales like you'd expect:

Until you hit 19 concurrent request streams, the CPU isn't fully utilised.
You don't actually hit peak throughput until some number beyond merely the core count, presumably because you need a few extra concurrent threads to hide some of the latencies.
There's a few wild outliers in max latency in pretty much every run - just look at that divide between P99 and max - but looking at the more stable P99 latency you can see a clean curve.

Note that while errors do start happening, there's a lack of connection & timeout errors. And how the max latency is still below the 2s threshold. I'll come back to that aspect later.

Upon hitting timeouts

…now everything goes to hell. Basically every run can be classified into one of three behaviour modes:

Mode	Connections	Requests/s	P99 latency (seconds)	“Max” latency (seconds)	Errors	System CPU usage
Fast ramp	2560	32k	0.1714	2.00	connect 23, read 75364, write 8, timeout 91	100%
Slow ramp	2560	22k	0.1938	2.00	connect 23, read 156761, write 2, timeout 78	17% → 100%
No ramp	2560	0.1k	0.9352	1.60	connect 23, read 232170, write 43	13%

You can see these clear as day in the CPU usage charts (sorry for the low quality - snipped from iStat Menus for convenience):

"Fast ramp" mode

Throughput averaged over 30 seconds is almost unaffected, and there are far fewer socket errors (especially noticeable at a bit lower connections count, not shown here, where I noticed in "fast ramp" mode I can get no errors but in the other modes huge numbers of errors). The parenthetical seems to confirm what the chart suggests visually: that it's not immediate ramp, there's still a non-zero delay, and so it makes things behave better but not perfect (w.r.t. wasting CPU capacity).

"Slow ramp" mode

I can watch it over the course of many seconds, as it very slowly increases CPU utilisation until it finally hits the hardware's capacity. The benchmark results are unsurprisingly severely hampered (although valid - there is a real problem here).

"No ramp" mode

This one is weird, and clearly a bug. It never [within 30 seconds] starts ramping at all. It just sticks at only three or four CPU cores of usage indefinitely.

Unsurprisingly the performance goes completely to hell in this mode. Worth noting, though, that the throughput isn't merely 10-20% of normal but a fraction of a percent. So even though it's using a few CPU cores, it's accomplishing practically nothing in terms of successful requests. And yet, there are no timeouts reported by wrk when this happens, which I find baffling. Perhaps the chart gives a telling clue - it's basically all kernel execution, not userspace. I assume it's highly unlikely, but could this maybe point to a kernel bug?!

Tipping point?

I think this behaviour of modal operation still occurs even before timeouts start happening, i.e. with lower concurrent connections, because I see the slow ramp behaviour sometimes in the CPU charts, even for e.g. a mere 300 connections. But I am yet to see the "no ramp" behaviour with anything fewer than ~2,250 connections.

And with Vapor 4.96.0…

…basically slow ramps become fast ramps but no ramps still happen. And by still happen, I mean like 90% of the time, for 2,560 connections. In fact it took so long to get the first "fast ramp" run that I was worried that 4.96 had seriously broken Vapor.

Alas, I don't have enough data (enough sample runs) to determine if Vapor 4.96 actually changed the probability of hitting "no ramp" mode. All I can say with confidence is that it was & remains very common - perhaps the most common mode, for high concurrent connections.

About those read & write errors

They are because the read and write syscalls fail (I added some printf debugging to wrk):

read failed: 54 (Connection reset by peer)
write failed: 32 (Broken pipe)

The Internet tells me that reset by peer means the server sent a RST; i.e. closed the connection. After having accepted it. I wonder if this is actually / technically an error on Vapor's part or just it proactively closing connections when it feels there's too many open already?

…presumably between requests, not in the middle of one, to technically satisfy the requirement that the connection be "idle" - although maybe there's a race here between wrk sending the next request and the server seeing it, which explains why wrk thinks the connection is valid, having sent the request, but dies when trying to retrieve the response…? I vaguely recall this exact behaviour being discussed previously in these forums.

But, what's really interesting is that they only occur when CPU usage is below 100% (whether because it's stuck in "no ramp" mode or still ramping in the other modes). As soon as CPU usage hits 100%, they stop. Completely. No matter how many concurrent connections I use.

For that, I have no good hypothesis. That smells like something Vapor / NIO is doing "deliberately", that's somehow contingent on there being unused CPU available. Maybe a broken heuristic on what counts as an "idle" connection?

vns · May 5, 2024, 5:45pm

Maybe a bit off main topic and with incorrect understanding how tools like wrk is implemented, but I have always been curious how much such tests from the single machine does align with the real world? I was recently working on a case where we need to test upload throughput on Django server and I’ve been using a simple test script that did that uploads in parallel for configurable amount of users, and the question we’ve been discussing within the team is that all that how accurate our measurements since all uploads has came from the single machine? Couldn’t be that in general this benchmarks are too off from the behavior of the app in the “wildness”?

odmir · May 5, 2024, 5:47pm

I noticed this ”three mode” behavior before as well during some experiments. I was using hummingbird for the server and the nio http client for making as many requests as I could. The server wasn’t doing much work, probably just returning a string if I recall correctly. This might indicate the issue is not with vapor. I thought I was crazy and was just doing something wrong or pushing it too far and thought that having the client and the server in the same machine fighting for the same resources might have had something to do with it, so I dismissed it

odmir · May 5, 2024, 5:54pm

I also tried wrk again the other day and there were configurations that managed to outright crash the server with the backtrace only visible in the console logs

Found the backtrace if someone is interested:

Crashing Thread

Thread 1 Crashed:: NIO-SGLTN-0-#0
0   hb2App                        	       0x1011d5210 NIOAsyncWriter.InternalClass.deinit + 64
1   hb2App                        	       0x1011d4cd9 NIOAsyncWriter.InternalClass.__deallocating_deinit + 9
2   libswiftCore.dylib            	    0x7ff814dce230 _swift_release_dealloc + 16
3   libswiftCore.dylib            	    0x7ff814dcebdb bool swift::RefCounts<swift::RefCountBitsT<(swift::RefCountInlinedness)1>>::doDecrementSlow<(swift::PerformDeinit)1>(swift::RefCountBitsT<(swift::RefCountInlinedness)1>, unsigned int) + 123
4   hb2App                        	       0x1011c69de 0x100ea6000 + 3279326
5   libswiftCore.dylib            	    0x7ff814dce230 _swift_release_dealloc + 16
6   libswiftCore.dylib            	    0x7ff814dcebdb bool swift::RefCounts<swift::RefCountBitsT<(swift::RefCountInlinedness)1>>::doDecrementSlow<(swift::PerformDeinit)1>(swift::RefCountBitsT<(swift::RefCountInlinedness)1>, unsigned int) + 123
7   libswiftCore.dylib            	    0x7ff814dbbb5a swift_arrayDestroy + 138
8   hb2App                        	       0x1012e2b9c specialized _ArrayBufferProtocol.replaceSubrange<A>(_:with:elementsOf:) + 13 [inlined]
9   hb2App                        	       0x1012e2b9c specialized ContiguousArray.replaceSubrange<A>(_:with:) + 13 [inlined]
10  hb2App                        	       0x1012e2b9c specialized ContiguousArray.removeAll(keepingCapacity:) + 13 [inlined]
11  hb2App                        	       0x1012e2b9c SelectableEventLoop.runLoop() + 140 (SelectableEventLoop.swift:590)
12  hb2App                        	       0x1012e0c77 SelectableEventLoop.run() + 487 (SelectableEventLoop.swift:676)
13  hb2App                        	       0x1012be60a specialized static MultiThreadedEventLoopGroup.runTheLoop(thread:parentGroup:canEventLoopBeShutdownIndividually:selectorFactory:initializer:_:) + 266 (MultiThreadedEventLoopGroup.swift:93)
14  hb2App                        	       0x1012bbc11 specialized static MultiThreadedEventLoopGroup.runTheLoop(thread:parentGroup:canEventLoopBeShutdownIndividually:selectorFactory:initializer:_:) + 35 [inlined]
15  hb2App                        	       0x1012bbc11 closure #1 in static MultiThreadedEventLoopGroup.setupThreadAndEventLoop(name:parentGroup:selectorFactory:initializer:) + 97 (MultiThreadedEventLoopGroup.swift:111)
16  hb2App                        	       0x1012bfb7a partial apply for closure #1 in static MultiThreadedEventLoopGroup.setupThreadAndEventLoop(name:parentGroup:selectorFactory:initializer:) + 42
17  hb2App                        	       0x1012c24ef thunk for @escaping @callee_guaranteed (@guaranteed NIOThread) -> () + 15
18  hb2App                        	       0x1012ffee6 closure #1 in static ThreadOpsPosix.run(handle:args:detachThread:) + 358 (ThreadPosix.swift:116)
19  libsystem_pthread.dylib       	    0x7ff80441b18b _pthread_start + 99
20  libsystem_pthread.dylib       	    0x7ff804416ae3 thread_start + 15

wadetregaskis · May 5, 2024, 6:09pm

That's maybe not always true.

It depends on the request rate (and probably request & response sizes, and whether you're running any Lua scripts within wrk, etc), but basically wrk can hit the limits of a single CPU core, for enough activity.

But even before that, a lone wrk thread can't turn around a connection infinitely fast, so it can end up delaying requests that are otherwise ready to go even if it's not fully utilising its CPU core.

Just using --threads $(sysctl hw.ncpu) seems like the best and a robust default. I've not seen any negative side-effects of that, in my testing [so far].

Sent via PM (not that the patch is secret, just not sure folks want it here on the main thread - and it's quite trivial and exactly as described previously).

I also am just running everything on the one host (wrk doesn't use that much CPU for this benchmark, and not having to deal with actual network hardware eliminates some variability).

Keep in mind, also, that the iMac Pros contain slow Xeons, so those 20 cores aren't what you'd think. Those Xeons are about eight years old now and are thermally limited in the iMac Pro below their design limits - I think they're actually under-clocked by a turbo level or two, as well - so they're pretty crappy "server" CPUs, even by Xeon standards.

But nonetheless, I wouldn't expect modally different behaviour from your system - just some variation in throughput numbers.

It would of course be best to do the official benchmark runs on a modern server (and IMO that means something like a Graviton 4 or an Ampere Altra). 2013 and 2016 Xeons are ancient in every important respect - architecture, design, process node, etc.

Oh, and also: I didn't use Docker. I'm just doing swift run -c release, because it's way simpler and I don't really want to install Docker and its entourage.

I don't think having INFO level enabled is wrong - every web server I've ever encountered in production has had essentially that enabled; i.e. at least one log line for every HTTP request received. In fact Vapor logs relatively little vs e.g. Apache - just a UUID!

But directing that logging anywhere but a file is definitely unrealistic.

Yes - it's important to remember that irrespective of what optimisations or configuration changes might be proposed, the original benchmark was still useful. "Fairness" in benchmarks is overrated - what matters is whether they reveal interesting or useful things.

Could be that the iMac Pro's CPU isn't as powerful as you're thinking (I think that's at least part of it, for reasons mentioned above). Could be Swift on Linux compiles differently. Could very well be differences in Linux kernel vs macOS kernel performance. Etc.

Oh, and to be clear: 20 logical cores. Given HyperThreading. Only 10 physical cores.

I didn't record it for every benchmark run, but I did occasionally check memory usage and I never saw it being more than 9 MiB. By server standards that's literally a rounding error (I've worked with job management systems that literally don't support resolutions finer than gigabytes).

And I wouldn't expect any more - it's basically small-N threads (number of CPU cores) doing an iterative calculation on a pair of relatively small memory allocations. There's probably a lot of memory traffic, not just actual reads & writes but meta-traffic like alloc/free & retain/releases etc. But the high water mark should be very low and very close to the average water mark.

In Swift. Obviously in GC'd languages all bets are off.

But I would like to see you run the numbers limiting Java & JavaScript to the same amount of memory Swift needs. I'm pretty sure they won't even launch with that "little".

(more seriously - varying Java heap size is an important and insightful test, as there tends to be a quadratic relationship between heap size and performance in Java, and GC'd languages generally - and servers don't in fact have infinite RAM, so sometimes you don't get to pick your ideal spot on that curve)

Will Swift use hardware offloading for TLS? None of Apple's platforms have ever had or supported that (to my knowledge) - not counting some ISA specialisations to accelerate certain operations on the CPU cores - so I wouldn't be surprised if Swift doesn't. (unless Vapor w/ TLS is ultimately using BoringSSL or somesuch, which might still)

In any case, I wouldn't assume that the cost of TLS is equivalent across all these languages and server stacks. Maybe they all just e.g. use OpenSSL underneath, but maybe not.

hassila · May 5, 2024, 6:14pm

No hard facts to back it up (yet), but I wonder if it can be an issue with swift concurrency ramp up of cpus.

We have seen a very similar behavior where we usually will scale up to ncpus (using task groups), but just sometimes, we just get a single core that runs. I haven’t had a reproducible case for it that is easily shareable, but it’s on the list of things to investigate root cause for…

Just throwing it out there in case anyone else have made similar observations.

This is on a 20-core M1 ultra.

wadetregaskis · May 5, 2024, 6:16pm

It's certainly the subject of some debate. Personally I don't think it's typically an issue, as long as your benchmark driver (wrk in this case) doesn't use a lot of hardware resources or otherwise interfere too much with the workload under test.

But it's also a question of what exactly do you want to test? If you want to see your actual, production capabilities then of course you should test on your real production hardware and networks. Because your network probably is the bottleneck, with non-pathological test cases and production server code that's not grossly inefficient.

(and then you get into the question of where your test driver should run - is it "okay" if it still runs within your production network or do you need to stick it outside, on the public internet? And how far away? Is it okay if it runs from the same metro area, or does it need to be from a different continent? Maybe run it over cellular since that's how most of your real users access it? etc - very quickly you end up accidentally testing e.g. Qualcomm 5G modem upload latencies, rather than anything you actually control)

I think for @axello's purposes running it all on one machine is fine, although it might be marginally better to use separate machines connected by a sufficiently fast link - you don't want to be network-bottlenecked when investigating server behaviours, but you do want to test the actual network stack… although if it's all on the same machine & kernel anyway, probably there won't be any difference since from userspace a localhost TCP connection looks the same as one from another host… unless your benchmark is that latency-sensitive, of course.

taylorswift · May 5, 2024, 7:43pm

to expound on this a little more, real human users are not going to be sending you 8,000 requests per second, if they are, that’s a good-to-have sort of problem.

the more realistic scenario is you are dealing with a lot of unauthorized scraping, or even an outright DoS attack. in these scenarios, you really are dealing with a lot of requests from a small number of machines.

0xTim · May 5, 2024, 7:55pm

It's currently off in Vapor because of the preconditions required to make it work. There's an upcoming PR that solves this Mark all functions that use `wait` as `noasync` by 0xTim · Pull Request #3168 · vapor/vapor · GitHub

johannesweiss · May 6, 2024, 8:09am

What is the problem you suggest is a bug in either of those?

Also what is your ulimit -a? The default macOS limit for open file descriptors is 256, so without changing settings no client/server can have more than 256 file descriptors.

Also what is your setup? You're benchmarking on macOS, server & client on the same machine and both on macOS, right?

What's the work you're running in each request? 0.0004 seconds is 400 µs so this can't possibly be fib(10k) unless your CPU goes 10x faster than an M1 Max which is unlikely.

But looking at your table it seems that you handle a peak 41k requests per second. Given that you have 20 cores that'd be 2,050 requests per second per core. That suggets that whatever you're doing is taking about 488 µs per request which is consistent with the 1 connection case where you have 0.0004s.

So your server seems to be scaling linearly which is a good thing.

That's expected because otherwise some cores have nothing to do.

Also expected because the client wrk and the kernel need some time to generate the requests and shuttle over the bytes. So going above core count will make sure there's always a request to answer. You should definitely be able to reach peak load with 2*core count connections.

johannesweiss · May 6, 2024, 8:13am

It's always true with wrk. It doesn't have a "request rate" it is merely a while true { sendRequestAndWaitForResponse() }. So once you have the same number of connections as you have cores, it should nearly max you out. Once you have about 2*core count connections it should always max you out.

wrk does suffer from the Coordinate Omission problem so its latency measurements are to be taken with a grain of salt, so it should have a request rate parameter but it doesn't (Gil Tene's wrk2 fixes that but I haven't tried it yet).

johannesweiss · May 6, 2024, 8:16am

Absolutely, yes. wrk uses asynchronous I/O so there's no real point of going above the number of cores you have. You could argue that it doesn't need as many because in many cases producing a requests is much less work than responding to one. But: That depends on what the request is doing, so going --threads $(sysctl hw.ncpu) is the right default -- for wrk, for NIO, for envoy, for pretty much everything asynchronous.

[sorry, Discourse forces me to reply to multiple messages in one message as I'm only allowed 3 consecutive replies]

Curious why you're not installing NIO as your concurrency executor? We have seen dramatically better performance for anything I/O latency sensitive (particularly proxies) and on Linux even had wins for non-I/O workloads which came as a surprise.

hassila · May 6, 2024, 8:28am

The specific issue we see with not ramping up cpu cores is not directly ties to NIO in this case, it's something we've observed in 'normal' application code which simply calls the standard task group API:s - usually we ramp up to the expected number of cores (we add 12 tasks specifically in our case for reasons), but just sometimes we only get a single core running (seemingly processing the tasks we added to the group one-by-one). I just mentioned it in this context as it looked similar to what Wade had seen in case anyone else had been running into it too (in case it's a concurrency ramp up issue that is known).

That said, unrelated, we do have an internal case to try out setting NIO as our concurrency executor to see what the impact would be, just haven't gotten around to it yet - thanks for the pointer (again... I caught it in some other comment you made on NIO performance , appreciated.).

hassila · May 6, 2024, 9:56am

Ok, I played a bit with it now, but it seems using NIO as the concurrency executor will limit us to a single thread? At least when I installed NIO (succesfully) out task groups would not always just run on a single thread instead of scaling up.

I tried setting the thread count with e.g.:

        NIOSingletons.groupLoopCountSuggestion = System.coreCount
        NIOSingletons.blockingPoolThreadCountSuggestion = System.coreCount
        if NIOSingletons.unsafeTryInstallSingletonPosixEventLoopGroupAsConcurrencyGlobalExecutor() {
            print("Installed NIO as Global Executor")
        } else {
            print("Failed to install NIO as Global Executor")
        }

But that made no difference.

Is this expected or am I just holding it wrong?

adam-fowler · May 6, 2024, 11:08am

Having used wrk quite a bit to profile Hummingbird, here is a list of things I have found over time when using a Mac to profile.

Don't allow your server to use all your cores. Limit the number of threads the EventLoopGroup triggers. If your server is using every resource on your Mac, then it is more susceptible to other processes affecting its performance. Xcode might start indexing, your mail app gets more mail, one of the many electron apps you are running does something. I've generally found limiting to 4 cores on an M1 gives you fairly stable performance.
Don't test with more than 128 connections. You will get read errors. This is due to the file descriptor limit applied to each process on macOS. As @johannesweiss mentioned earlier the default for this is 256. You can change this but it involves disabling the System Integrity Protection.
There is some upper limit of requests it doesn't seem possible to pass (about 145k per second on my laptop). I have not worked out what this is yet. I don't know if this is some limitation from Hummingbird, NIO or something else completely.
Do your profiling on a Linux box. It is the target platform. It doesn't have the connection limit macOS has. Other process won't affect performance so much. These days it so easy to get one up and running on a cloud provider, install swiftly and then swift and connect to it over s SSH connection in VSCode.

tera · May 6, 2024, 11:53am

I didn't know that. Gave it a test: I can see this file descriptor limit in a UI macOS app (regardless of whether it is "app sandboxed" or not) although, strangely, not in a console app. Is that expected?

Count	node.js	Swift	ratio
10,000	1.558ms	4.345ms	2.78x
30,000	8.794ms	33.127ms	3.76x
100,000	69.763ms	350.025ms	5.02x
300,000	552.623ms	3056.003ms	5.52x
1,000,000	5.944s	34.368s	5.78x