'Standard' vapor website drops 1.5% of requests, even at concurrency of 100!

wadetregaskis · May 5, 2024, 5:24pm

I think there is a bug in Vapor (or NIO) here. Or possibly wrk. Or conceivably even the macOS kernel. Whichever it is, something is very broken.

Scaling with Vapor 4.92.1

wrk --threads N' --duration 30s --connections N --latency --timeout 2s http://127.0.0.1:8080/bench/10000

…where N' = min(20, N). No point running more threads than my iMac Pro has logical cores as that just slows wrk down (from experiments and measurements done previously).

Before hitting timeouts

The table below shows some exemplar runs (roughly speaking) for various values of N.

I did see some variance, of course, between runs - perhaps 30 seconds is still too little to produce stable numbers, at least on my iMac Pro with Spotlight and other such vampires sporadically sucking the life out of the CPU.

Connections	Requests/s	P99 latency (seconds)	Max latency (seconds)	Errors	System CPU usage
1	3k	0.0004	0.0061	None	7%
10	24k	0.0005	0.0068	None	60%
18	37k	0.0007	0.0358	None	97%
19	37k	0.0007	0.0315	None	100%
20	37k	0.0008	0.0119	None	100%
21	38k	0.0018	0.0960	None	100%
40	41k	0.0027	0.0174	None	100%
80	41k	0.0048	0.1264	None	100%
160	41k	0.0085	0.1277	read 18	100%
320	41k	0.0161	0.1984	read 247, write 3	100%
640	41k	0.0311	0.4495	read 1502	100%
1280	40k	0.0485	0.7863	read 1740, write 1265	100%
2000	40k	0.0930	1.67	read 9327, write 27	100%

Scales like you'd expect:

Until you hit 19 concurrent request streams, the CPU isn't fully utilised.
You don't actually hit peak throughput until some number beyond merely the core count, presumably because you need a few extra concurrent threads to hide some of the latencies.
There's a few wild outliers in max latency in pretty much every run - just look at that divide between P99 and max - but looking at the more stable P99 latency you can see a clean curve.

Note that while errors do start happening, there's a lack of connection & timeout errors. And how the max latency is still below the 2s threshold. I'll come back to that aspect later.

Upon hitting timeouts

…now everything goes to hell. Basically every run can be classified into one of three behaviour modes:

Mode	Connections	Requests/s	P99 latency (seconds)	“Max” latency (seconds)	Errors	System CPU usage
Fast ramp	2560	32k	0.1714	2.00	connect 23, read 75364, write 8, timeout 91	100%
Slow ramp	2560	22k	0.1938	2.00	connect 23, read 156761, write 2, timeout 78	17% → 100%
No ramp	2560	0.1k	0.9352	1.60	connect 23, read 232170, write 43	13%

You can see these clear as day in the CPU usage charts (sorry for the low quality - snipped from iStat Menus for convenience):

"Fast ramp" mode

Throughput averaged over 30 seconds is almost unaffected, and there are far fewer socket errors (especially noticeable at a bit lower connections count, not shown here, where I noticed in "fast ramp" mode I can get no errors but in the other modes huge numbers of errors). The parenthetical seems to confirm what the chart suggests visually: that it's not immediate ramp, there's still a non-zero delay, and so it makes things behave better but not perfect (w.r.t. wasting CPU capacity).

"Slow ramp" mode

I can watch it over the course of many seconds, as it very slowly increases CPU utilisation until it finally hits the hardware's capacity. The benchmark results are unsurprisingly severely hampered (although valid - there is a real problem here).

"No ramp" mode

This one is weird, and clearly a bug. It never [within 30 seconds] starts ramping at all. It just sticks at only three or four CPU cores of usage indefinitely.

Unsurprisingly the performance goes completely to hell in this mode. Worth noting, though, that the throughput isn't merely 10-20% of normal but a fraction of a percent. So even though it's using a few CPU cores, it's accomplishing practically nothing in terms of successful requests. And yet, there are no timeouts reported by wrk when this happens, which I find baffling. Perhaps the chart gives a telling clue - it's basically all kernel execution, not userspace. I assume it's highly unlikely, but could this maybe point to a kernel bug?!

Tipping point?

I think this behaviour of modal operation still occurs even before timeouts start happening, i.e. with lower concurrent connections, because I see the slow ramp behaviour sometimes in the CPU charts, even for e.g. a mere 300 connections. But I am yet to see the "no ramp" behaviour with anything fewer than ~2,250 connections.

And with Vapor 4.96.0…

…basically slow ramps become fast ramps but no ramps still happen. And by still happen, I mean like 90% of the time, for 2,560 connections. In fact it took so long to get the first "fast ramp" run that I was worried that 4.96 had seriously broken Vapor.

Alas, I don't have enough data (enough sample runs) to determine if Vapor 4.96 actually changed the probability of hitting "no ramp" mode. All I can say with confidence is that it was & remains very common - perhaps the most common mode, for high concurrent connections.

About those read & write errors

They are because the read and write syscalls fail (I added some printf debugging to wrk):

read failed: 54 (Connection reset by peer)
write failed: 32 (Broken pipe)

The Internet tells me that reset by peer means the server sent a RST; i.e. closed the connection. After having accepted it. I wonder if this is actually / technically an error on Vapor's part or just it proactively closing connections when it feels there's too many open already?

…presumably between requests, not in the middle of one, to technically satisfy the requirement that the connection be "idle" - although maybe there's a race here between wrk sending the next request and the server seeing it, which explains why wrk thinks the connection is valid, having sent the request, but dies when trying to retrieve the response…? I vaguely recall this exact behaviour being discussed previously in these forums.

But, what's really interesting is that they only occur when CPU usage is below 100% (whether because it's stuck in "no ramp" mode or still ramping in the other modes). As soon as CPU usage hits 100%, they stop. Completely. No matter how many concurrent connections I use.

For that, I have no good hypothesis. That smells like something Vapor / NIO is doing "deliberately", that's somehow contingent on there being unused CPU available. Maybe a broken heuristic on what counts as an "idle" connection?