That requests are coming in (or at least should be, unless wrk is the problem) and aren't being addressed promptly, even though there are plenty of hardware resources available to do so.
It's 2560 by default. It's just coincidence that I was using 2,560 connections in some of my tests - just the result of power-of-two scaling. It's curious that it didn't cause any apparent failures in either wrk or the server, even though surely at least a few of those file descriptors were already in use (e.g. stdin / stdout / stderr).
Increasing the limit to 10k makes no difference in the results (latency & throughput, as well as the bad, tri-modal behaviour), for 2,560 connections. Presumably it would for higher connection counts (I didn't test).
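For what it's worth, the limit can also be checked and raised programmatically rather than via the shell's ulimit - a minimal sketch (the 10,240 figure is just an example, and the soft limit can only go as high as the hard limit allows):

import Darwin

var limit = rlimit()
getrlimit(RLIMIT_NOFILE, &limit)
print("file descriptors - soft: \(limit.rlim_cur), hard: \(limit.rlim_max)")

limit.rlim_cur = min(10_240, limit.rlim_max)   // e.g. bump the soft limit towards 10k
if setrlimit(RLIMIT_NOFILE, &limit) != 0 {
    print("setrlimit failed: errno \(errno)")
}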
Correct. macOS 14.4.1 (23E224) on 10-core iMac Pro (Xeon W-2150B, an Apple-specific SKU that's basically an under-clocked W-2155 with more cache, neither of which help this benchmark at all).
It can't be. It takes non-zero time for wrk to issue every request, so for a sufficient number of requests it will exhaust any given finite compute capacity.
And it can only issue one request at a time per thread, which can artificially increase latency and therefore reduce throughput, because more connections than threads might be ready for their next request and so get unintentionally delayed by the serialisation.
Stabler, but at the expense of throughput.
It can lead to counter-intuitive effects when disruption to the server's execution (from pre-emption by unrelated threads) causes request(s) to exceed the time limit and therefore be counted as failures, despite having consumed server resources and thereby impaired other pending requests. But I'd be surprised if that ultimately had a significant effect.
And on macOS, there simply aren't the controls available to reliably prevent that anyway (e.g. you can't dedicate cores to specific threads).
I don't have access to a suitable Linux machine.
(and admittedly, I also don't want to deal with the Swift toolchain on Linux - I know enough from dabbling that it's not as smooth an experience as on macOS, and the perf tools on Linux aren't as good either)
Note that the trend in server hardware is [finally] away from so-called "high performance cores", because power is ultimately the bottleneck and that inherently favours parallelism. So be careful about benchmarking on a consumer CPU's "performance cores" if your actual production environment uses slower but more efficient server cores (similarly, beware developing on Apple Silicon if you're running on x86, because the per-core performance at full load, i.e. without turbo-boost benefits, is comparatively terrible on x86).
Fortunately my iMac Pro's Xeon is crappy even by Xeon standards, so it's very representative.
That's good, and smart, but is there a way to enable some sort of error logging when this happens? Even if the server's not falling down, it's still a potential problem that it's connection-limited.
I didn't see any error messages, at any time, from the serverbench app.
That might be expected. At least some of the timeouts still cost the server the Fibonacci computation time, which means wasted CPU and therefore less CPU time for successful requests. The more CPU time each request takes, the worse the impact of that timeout.
But it's not clear to me whether you can just multiply the timeout count (as reported by wrk) by the per-request server CPU time to compute that loss delta. It might be that wrk times out some of those connections before the server's even started handling them, and maybe the server detects that the socket closed while it was in the accept queue and therefore never even does the Fibonacci calculation.
Which reminds me - is wrk opening N persistent connections and using HTTP/1.1 pipelining, or is it actually opening a new TCP connection for every request? Because that will dramatically change the behaviour in some aspects.
I do not understand that metric either. That's why I always report 'requests' vs 'timeouts'. The Socket Errors only appear under high load, alongside the timeouts.
Yes, the reason I used wrk is that it uses pipelining. That's why ab (apachebench) had such terrible performance: it opened a new socket for each request. And then it overloaded the system by throwing
socket: Too many open files
apr_socket_recv: Connection reset by peer
errors.
I raised the ulimit -n to 10240, but still apr_socket_recv: Connection reset by peer (104) occurred occasionally.
Update: I'm not sure now if wrk is actually the problem.
The problem is wrk; it is very buggy.
It naively assigns connections / threads connections to each thread - integer division - so you're not necessarily getting the number of connections you requested (e.g. 100 connections across 26 threads rounds down to 3 per thread, i.e. 26 × 3 = 78 in total).
It doesn't even check for socket errors, so it doesn't notice when it runs out of file descriptors, and just silently fails (running with fewer connections than instructed).
Once I fixed that and added some logging, I finally saw the EMFILE errors that are expected when the file descriptor limit is hit (not that this was a big issue in any of my posted results, as it at worst means my 2,560-connection runs were actually more like 2,540-connection runs).
It doesn't actually wait for the sockets to open before trying to use them, resulting in spurious errors and surprisingly varied, essentially undefined behaviour. This, specifically, is what was causing the slow- and no-ramp behaviour.
…else blindly assume the socket is ready and try to use it. Note also the complete lack of error checking, for the fcntl calls (a common theme throughout the wrk code).
Clearly the author encountered EINPROGRESS, but for reasons unfathomable decided to just hard-code an ignore of it.
In retrospect this was hinted at already in what I'd observed & written about - that the read & write errors only happen during the ramp period. Turns out that's not completely correct - there are still some after the ramp, but several orders of magnitude fewer (and I don't yet know what's causing those lingering read/write errors).
The proper fix is probably to remember that the socket isn't connected yet and react appropriately when it does connect, while moving on to other sockets in the meantime. But even as a quick hack, moving the O_NONBLOCK enablement to after connect essentially fixes it for the purposes of this benchmark. Technically it does now require all connections within a given wrk thread to open before it'll send any requests, but that doesn't seem to cause a significant start-up delay even with many thousands of connections (and Vapor's listen queue left at the default 256).
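For illustration, the hack boils down to this pattern (a sketch, not wrk's actual code - the address and port are placeholders):

import Darwin

var addr = sockaddr_in()
addr.sin_family = sa_family_t(AF_INET)
addr.sin_port = in_port_t(8080).bigEndian
addr.sin_addr.s_addr = inet_addr("127.0.0.1")

let fd = socket(AF_INET, SOCK_STREAM, 0)
let rc = withUnsafePointer(to: &addr) {
    $0.withMemoryRebound(to: sockaddr.self, capacity: 1) {
        connect(fd, $0, socklen_t(MemoryLayout<sockaddr_in>.size))  // blocking: returns once established (or failed)
    }
}
guard rc == 0 else { fatalError("connect failed: errno \(errno)") }

// Only now switch to non-blocking, before any requests are written.
let flags = fcntl(fd, F_GETFL)
_ = fcntl(fd, F_SETFL, flags | O_NONBLOCK)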
With a modified version of wrk, I now see reliable performance and no timeouts even with 10,000 connections (I didn't test beyond that, but it's clearly now following the expected curve of increasing latency and slightly decreasing throughput).
Though those lingering few read/write errors still bother me. I might look into them later.
Orthogonally, a few other miscellaneous observations:
Removing all async code from the server itself (which was really just its main & configure methods) makes no difference. Presumably this is expected; I assume how Vapor actually runs internally is unaffected by whether you use run or execute.
Many (all?) of Vapor's objects have no meaningful descriptions, which is annoying for debugging as I had to basically create them manually, in order to figure out even the most rudimentary things like the default settings for various configuration options.
At the time of this writing, that's true - but Tim and I have been working on closely related code for most of today, so it won't continue to be the case. The async entrypoint is, if nothing else, cleaner (since it doesn't rely on use of .wait()).
The lack of documentation for a lot of the bits and pieces is definitely a failure mode, no question there. I didn't go to the trouble of getting to 100% documentation coverage on SQLKit recently because it was a ton of fun, believe me. FWIW, we're planning to do better across the board for docs in Vapor 5 and Fluent 5.
To clarify, I meant the output of the description property. But it's true that I was looking at that in large part because there was no (or insufficient) documentation on the types, too.
It wasn't a big deal once I got a basic handle on Vapor's App layout (recursing through its subproperties), but I had to do it through trial and error and Xcode auto-complete.
TBH, I've often found description (and debugDescription) to be more trouble than they're worth - either it's painfully involved to include all the relevant information, or you end up sending potentially sensitive data in responses by accident, or the result ends up being too verbose when things aren't going wrong... Or, most often of all, any combination of all three - not to mention the confusion with "Wait, is description supposed to be human-readable? But what about LosslessStringConvertible types?" (nevermind that LosslessStringConvertible is not the same as RawRepresentable where RawValue == String). And then you start getting Hyrum's Law locking you into treating the exact contents of description as if they were public API even for non-lossless types...
So yeah, while you're right that it can make life more difficult at times (especially where docs are lacking), you can probably see why I tend to fail at being diligent about CustomStringConvertible...
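For a concrete (made-up) example of that tension - a type where the safe-to-log description and the verbose debugDescription pull in different directions:

struct APICredential: CustomStringConvertible, CustomDebugStringConvertible {
    let name: String
    let secret: String

    // Human-readable and safe to drop into a response or log line.
    var description: String {
        "APICredential(name: \(name), secret: <redacted>)"
    }

    // Verbose and potentially sensitive; only for interactive debugging.
    var debugDescription: String {
        "APICredential(name: \(name), secret: \(secret))"
    }
}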
Digging into this deeper, I'm now not sure wrk is the root problem.
I investigated wrk's implementation further, and it does in fact properly wait¹ (asynchronously via kevent) for the connecting socket to establish (specifically, to become readable or writable). It usually successfully writes to it and then receives a kevent telling it the socket has closed (EV_EOF is set). It tries to read the socket anyway - which is correct; it could have data still in it from before the close - but always comes up empty in this case.
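For reference, the general shape of that non-blocking connect + kevent dance looks roughly like this (a heavily simplified sketch, not wrk's actual code - wrk registers both read and write filters and does all of this asynchronously, whereas here I wait synchronously for brevity):

import Darwin

func connectAndWait(to address: sockaddr_in) -> Int32? {
    var addr = address
    let fd = socket(AF_INET, SOCK_STREAM, 0)
    guard fd >= 0 else { return nil }
    _ = fcntl(fd, F_SETFL, O_NONBLOCK)

    let rc = withUnsafePointer(to: &addr) {
        $0.withMemoryRebound(to: sockaddr.self, capacity: 1) {
            connect(fd, $0, socklen_t(MemoryLayout<sockaddr_in>.size))
        }
    }
    if rc != 0 && errno != EINPROGRESS {   // genuine immediate failure
        close(fd)
        return nil
    }

    // Wait for the socket to become writable, i.e. for the connect to resolve.
    let kq = kqueue()
    var change = kevent(ident: UInt(fd), filter: Int16(EVFILT_WRITE),
                        flags: UInt16(EV_ADD | EV_ONESHOT), fflags: 0, data: 0, udata: nil)
    var event = kevent()
    _ = kevent(kq, &change, 1, &event, 1, nil)
    close(kq)

    // If EV_EOF is set on the readiness event, the connection didn't really
    // come up; SO_ERROR (when the kernel bothers to set it) has the underlying errno.
    if event.flags & UInt16(EV_EOF) != 0 {
        var soError: Int32 = 0
        var len = socklen_t(MemoryLayout<Int32>.size)
        getsockopt(fd, SOL_SOCKET, SO_ERROR, &soError, &len)
        print("connect failed, SO_ERROR = \(soError)")
        close(fd)
        return nil
    }
    return fd   // connected; safe to write the HTTP request
}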
Looking at the syscalls using System Trace in Instruments, it genuinely looks like wrk is doing everything correctly.
And yet on the server side, it seems like basically the same story - there's a kevent signalling the socket is readable, but when NIO reads from it the read returns zero, which NIO takes to mean EOF.
It appears that the connection is being closed by a third party, which I guess means the macOS kernel here? Though I have no idea what its motivation would be.
It's not completely clear if the socket is actually closed, before either the client or the server explicitly close it anyway as part of their clean-up. Perhaps the return value of zero from read on the server side does not imply the socket is closed; perhaps NIO is being presumptuous?
The slow- & no-ramp starts are a direct consequence of this - wrk spends all its time trying to open its connections and keeps getting ECONNRESET on them (mostly on reads, sometimes on writes / connects).
I don't know why opening all the connections first, before sending anything on them, avoids this problem. For a long while I suspected a logic error in the kevent code in either wrk or the server, but if it's there I can't see it. I also cast about for race conditions or erroneous sharing of kevent queues between threads, but couldn't find anything suspect on the wrk side, and couldn't really penetrate the very complicated NIO code.
I'd also very much like to know if this same behaviour occurs on other platforms, e.g. Linux, as that might narrow it down to at least userspace vs kernelspace.
¹ In my defence, it does it a bit obtusely and the code is generally hard to follow, utterly undocumented, and missing most error checking. So I'm a little embarrassed that I perhaps mistakenly or prematurely accused wrk, but not that embarrassed.
Rent one from AWS. You can get a 4-core ARM64 instance for less than $0.15 an hour. I'm sure there are other cloud providers that are cheaper as well. Use VSCode with the Remote-SSH extension to connect to it. And then use swiftly to install Swift. Unfortunately the Swift extension doesn't work over SSH, so you'd have to use the CLI to build and run.
EDIT: You can use the Swift extension if you have it installed locally; you can then install it in your SSH environment.
Of course, as I pointed out previously, wrk will need some (small) amount of time to produce the request, and the kernel/network add some latency too. That's why I was recommending adding 2 × (server cores) connections. That way (unless your network is very high latency), you'll max out the server as there is always another request to process.
This isn't relevant here. wrk uses asynchronous I/O so it will absolutely be able to send out 100 requests at more or less the same time even with just 1 thread (as long as it does 100 connections, it doesn't pipeline unless you add a Lua script which does pipelining). The latency wrk needs between issuing 2 requests over 2 separate connections on 1 thread should be on the order of 10 to 50µs. The time it takes the server to reply to one fib(10k) will be at least 4ms, so 2 to 3 orders of magnitude more. So as long as your network is under 1ms of latency you should be absolutely fine with wrk -t <cpu count> -c 100.
Can you precisely describe the smallest & easiest benchmark setup where you see this? I have never seen this situation, so would be very interested.
This is via localhost and no Docker et al involved, right?
That would be down to the library using NIO, i.e. Vapor. SwiftNIO will send the error through the normal error event (fireErrorCaught -> errorCaught) through the server pipeline. Then you/the library can use the AcceptBackoffHandler or a custom handler to slow down accepts when we're out of file descriptors.
This isn't "on by default", the library would add this handler to the server pipeline if desired.
From what I can tell, wrk's default is:
persistent connections that never close (unless error)
no pipelining
request N on a connection gets scheduled immediately after request (N-1) has finished
it cannot do request rates (like '10 requests per second on each connection'), it's just a dumb loop (which is why it suffers from Coordinated Omission)
In my testing, wrk's timeout reporting was incredibly flaky. A timeout may be reported as a read error/write error/timeout. I believe this happens because even if wrk closes a connection (which may well lead to an error) because of a timeout, it only reports a timeout if the close(2) syscall returned 0. In my opinion that's not useful.
My recommendation when using wrk is: set the timeout high enough that you will never hit it, like --timeout 1000. You can use the max latency statistic; this gives you the worst case visible in this test.
Right, in my testing, most wrk-reported socket errors happen because wrk closes the socket. For example if wrk connects to a server, then especially on localhost it's likely that the connect succeeds immediately, without the server having accepted it (yet); the kernel just auto SYN+ACKs to a certain degree on localhost. But if wrk now closes this socket before the server accepts it, we'll see a socket error (likely 'connection reset by peer'). IMHO wrk should report that as a timeout because it decided to time out, but it doesn't.
My favourite "proof" of the silliness of wrk's reporting is this:
Run a single nc -l 12345. This will open port 12345 and accept one connection, then exit immediately
Run wrk -c 2 http://localhost:12345 in another window
eehm, yeah. 0 connect errors and 0 timeouts. Of course. The server never accepted more than 1 connection and then it exited. wrk presents this as 226k read errors and 113k write errors.
Hence my recommendation: Set the timeout incredibly high. Ignore any result that has any number of errors/timeouts. If you need to dig into what happened with the errors/timeouts, use another tool.
Without a custom Lua script, wrk does not pipeline as far as I understand and have seen.
I think what @axello and I have meant is HTTP connection reuse. My apologies for the confusion from my use of the wrong terminology.
Any situation which tries to open more than 128 connections at roughly the same time…
Correct.
Vapor doesn't appear to have any such controls…? Even at TRACE logging levels there's nothing useful and no errors reported, even when I can see errors happening clear as day in syscall traces.
Spelunking in the NIO code, I found plenty of places where it silently discards error information - e.g. read & write errors are often translated to just an "eof" error constant, discarding the actual errno.
I had some partial luck using breakpoints to catch the offending code paths, but "live" debugging like that is problematic for cases like this, where thousands of things are happening essentially concurrently and at very high speed, as the consequence of pausing the program for a breakpoint is to completely invalidate any further execution. And it gets old real fast having to iteratively refine the breakpoints, restarting the entire benchmark between every one.
Unfortunately that's a substantially less useful benchmark, because you lose almost all insight into success rates (given a latency requirement). And real-world SLOs are often set in terms of a small but non-zero acceptable failure rate (in my experience - I'm sure it varies).
I don't think @axello's test methodology is in any way wrong. Unlimited latency testing is not usually representative of real-world behaviour, and constant request rate benchmarks test subtly different things (e.g. they'll behave differently if the response latency is variable vs consistent).
The problem is the tools (wrk) and the kernel…
Yeah, that's a great little demonstration of this.
It's not hard to fix this bug in wrk - I patched my local copy - but it's only part of the issues we've been seeing…
wrk bugs, kernel bugs, and bad kernel configuration defaults
I'll talk simply here for the sake of clarity, but of course there's room for error here. I'm fairly confident in these assertions, but not certain.
It's worth noting that - aside from some missing documentation, perhaps - I'm not seeing any major issues with Vapor or NIO. I'm still not entirely sure what's going on server-side with the false connection closures, but I assume it's basically the same kernel bug as what wrk is suffering from on the client side.
wrk bugs
I've already detailed some of the ways wrk is very buggy, in prior posts. @johannesweiss highlighted some others, like its misattribution of connect failures to read/write errors (only partly its fault - more on that in the next section).
Another one is that it doesn't count connection establishment in its latency numbers - it starts its counter from when it starts writing the HTTP request. With HTTP connection reuse that tends to get lost in the noise, in the happy cases, because the steady-state behaviour is that no TCP connections are being opened or closed, but without reuse, or in unhappy cases like a heavily-loaded server (kinda the point of a benchmark, usually), it can cause wildly inaccurate latency numbers.
Kernel bugs
I don't know if it's specific to the use of kevents, or a broader problem with sockets in the macOS kernel, but there are several clear bugs:
If the listen backlog is exhausted the client side is supposed to receive a connection failure event (either connect fails directly, in blocking mode, or EV_EOF is set on any kevents for the socket, in non-blocking mode).
What actually happens is that the connection is falsely claimed to have been made (a kevent is posted saying the socket is writable, without EV_EOF set). Even worse, you can actually write to it! The write call succeeds! Then, another kevent is posted saying the socket is readable but closed (EV_EOF is finally set).
The reason I believe this is a bug (or a design flaw, at least) is that it means there's no way to actually tell if the connection was actually opened or refused, in this situation. I realise the kernel might think it's smart by letting you write to the socket before it's actually even open - I assume that's its intent - but it makes for very confusing and misleading behaviour sometimes.
I also think it's particularly annoying for localhost connections since there's no real reason the kernel can't immediately know the connection attempt will fail, and very clearly fail it. It could do that right at the connect call, even in non-blocking mode.
When a connection is refused because the listen backlog is full, it's specifically supposed to fail with ECONNREFUSED on the client side, according to the listen man page.
Whether you pull errno from that failed read, or from connect itself if in blocking mode, the error code you actually get is ECONNRESET or EBADF, depending on whether you hit somaxconn first (more on that shortly) or the listen backlog limit, respectively.
That's not supposed to happen - ECONNRESET is supposed to mean the previously open connection was closed by the other side (specifically, an RST packet was received from the other side). (technically there's nuances regarding half-closed states, but not relevant here)
And EBADF is obviously not supposed to happen either, and even worse really, because it indicates the kernel has already torn down the entire file descriptor even though the userspace code has not actually called close on it yet (I'm not certain that's wrong per whatever design spec governs these things, but it's certainly unexpected and problematic).
This all explains why making wrk open all the connections synchronously works - when using an otherwise sensible number of wrk threads, e.g. 20 on my machine, you can only ever have that many concurrent connection attempts at once, which is well below the relevant kernel & server limits.
Note also that if the server is outright not listening (e.g. you kill the server while wrk is running, so that the listening socket closes completely) then you do start getting correctly-reported connection failures. So this bug is specific to the case where there is a server at the destination but the kernel refuses to connect you to it.
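If anyone wants to poke at this themselves, here's a bare-bones sketch of the scenario (not wrk; the port and counts are arbitrary, and exactly which errno comes back - or whether some connects simply hang - is precisely the behaviour in question):

import Darwin

// One listener with a deliberately tiny backlog that never calls accept().
var addr = sockaddr_in()
addr.sin_family = sa_family_t(AF_INET)
addr.sin_port = in_port_t(12345).bigEndian
addr.sin_addr.s_addr = inet_addr("127.0.0.1")

let listener = socket(AF_INET, SOCK_STREAM, 0)
var yes: Int32 = 1
setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &yes, socklen_t(MemoryLayout<Int32>.size))
_ = withUnsafePointer(to: &addr) {
    $0.withMemoryRebound(to: sockaddr.self, capacity: 1) {
        bind(listener, $0, socklen_t(MemoryLayout<sockaddr_in>.size))
    }
}
listen(listener, 1)

// A burst of clients, comfortably past the backlog and somaxconn limits.
for i in 0..<300 {
    let fd = socket(AF_INET, SOCK_STREAM, 0)
    let rc = withUnsafePointer(to: &addr) {
        $0.withMemoryRebound(to: sockaddr.self, capacity: 1) {
            connect(fd, $0, socklen_t(MemoryLayout<sockaddr_in>.size))   // blocking connect
        }
    }
    var byte: UInt8 = 0x20
    if rc != 0 || send(fd, &byte, 1, 0) < 0 {
        // Per the listen man page this "should" be ECONNREFUSED; compare with
        // what actually comes back.
        print("client #\(i): errno \(errno)")
    }
    close(fd)
}
close(listener)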
Bad kernel configuration defaults
We already covered the file descriptor limit. 2560 doesn't seem an unreasonable default to me, for macOS, although the discrepancy @tera noted between CLI & GUI apps (2560 vs 256) is concerning, and 256 is unreasonably few, especially for GUI apps which tend to have a lot more spurious file activity than simple CLI apps.
Arguably it's reasonable to expect a server to specifically tune this higher (e.g. into the hundreds of thousands, potentially) on a per-process basis. As a system-wide default, 2560 makes sense.
Beyond that, it turns out the magic knob is the kern.ipc.somaxconn sysctl. That basically determines how many in-flight, incoming connection attempts you can have at any one time.
It appears to be a per-process limit (or per-socket, maybe), but I couldn't find any documentation of that (in fact, it's barely documented at all - notwithstanding the various cargo-culting online about making its value bigger). I just noted that changes to it don't affect already-running processes.
On macOS it defaults to 128. That's way too low for a web server. Maybe it's fine for "client" or "consumer" applications, which I guess is macOS's target audience, but even so it's not clear to me why it has to be quite that low.
I believe its intent is to guard against SYN attacks and the like. Still, it can do that just fine at higher values, while not crippling normal use.
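For reference, checking the current value from inside a process is easy enough (raising it needs sudo sysctl -w kern.ipc.somaxconn=<n>, and as noted above that only affects processes started afterwards):

import Darwin

var value: Int32 = 0
var size = MemoryLayout<Int32>.size
if sysctlbyname("kern.ipc.somaxconn", &value, &size, nil, 0) == 0 {
    print("kern.ipc.somaxconn = \(value)")   // 128 by default on macOS
}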
When that limit is hit, irrespective of your listen backlog size, connection attempts fail. This is what was causing all of the read/write errors reported by wrk (which were actually all connection failures but are being misreported by the macOS kernel, as detailed earlier).
The listen man page actually maybe hints at this, where it lists a very important known bug:
BUGS
The backlog is currently limited (silently) to 128.
I couldn't quite confirm it - I can't find the relevant kernel source code - but I think this is referring to the somaxconn default, not a separate bug in listen itself. i.e. after raising somaxconn to something sane, and setting the listen backlog to 3,000, I basically couldn't get any connection failures irrespective of how many concurrent connections I used (it seems like wrk is simply incapable of launching 3,000 effectively simultaneous connection attempts, at least while the server is running on the same machine and competing for CPU time).
FWIW the default listen backlog with Vapor is 256, which is unfortunate as it misleads you into thinking that the limit is, well, 256. Maybe it's not technically Vapor's concern what other limits might apply to the listen backlog, but it sure would be nice if the Vapor documentation mentioned the somaxconn sysctl.
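For anyone else looking for the knob, this is where the Vapor-side backlog lives, if I'm remembering the Vapor 4 API correctly (and it's still silently capped by kern.ipc.somaxconn unless you raise that sysctl too):

import Vapor

public func configure(_ app: Application) throws {
    app.http.server.configuration.backlog = 3_000   // Vapor's default is 256
}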
In conclusion
Plenty of bugs to go round!
With all these things fixed, I now see zero connect/read/write errors irrespective of load. Timing information is reported correctly.
@johannesweiss previously asserted that wrk erroneously miscounts its own timeouts as read/write errors, which I can totally believe given the state of the wrk code, but I have not yet seen that problem myself. I couldn't even find the code which actually executes the timeout and closes the socket, within wrk, with a brief search.
Could you file bugs for those please? NIO shouldn't discard the actual errno.
Sorry, I wasn't meaning to imply that tail latencies aren't important. And of course @axello's methodology is absolutely not wrong! But we don't need to use wrk's timeout functionality to see them. You can pass --timeout 1000 --latency and you get more information. Granted, wrk doesn't give nearly enough of the higher percentiles (p999, p9999, ...).
That also is fixed by wrk2.
My advice purely applies to wrk, as its error reporting on timeout is just unusable.
Just to be extra clear: I don't mean unlimited latency is okay. I mean: use wrk's max latency report instead of its timeout functionality. If max > what you're willing to tolerate: count it as failed. But yeah, if you're a little more lenient and want p9999 < 2 seconds for example, then wrk just won't cut it.
Interesting, I don't think I've seen this but I'll try again.
kern.ipc.somaxconn is the maximum number of unaccepted connections allowable. So well done on benchmarking your way to the default limit of 128.
Feel free to set it to something higher of course if you need even more connections.
(The code is @axello's unmodified code, except for Vapor updated to 4.97.0. My computer is an M1 MacBook Pro, the OG Apple Silicon; server started with LOG_LEVEL=error .build/release/App)
Yeah, the best way of reproducing this I found is the following:
Run this silly NIO program
import NIOCore
import NIOPosix

let g = MultiThreadedEventLoopGroup.singleton
let chan = try ServerBootstrap(group: g)
    .childChannelInitializer { channel in
        print("accepted", channel)
        return channel.eventLoop.makeSucceededFuture(())
    }
    .serverChannelOption(ChannelOptions.maxMessagesPerRead, value: 1)   // accept at most one connection per read()
    .serverChannelOption(ChannelOptions.socketOption(.so_reuseaddr), value: 1)
    .serverChannelOption(ChannelOptions.backlog, value: 2560)
    .serverChannelOption(ChannelOptions.autoRead, value: false)         // only accept when we explicitly call read()
    .bind(host: "127.0.0.1", port: 8888)
    .wait()
print(chan)
while true {
    sleep(1)
    chan.read()   // accept (at most) one pending connection per second
}
which accepts one connection every second and then just does nothing with it, no HTTP, no nothing. So everything should time out, either the request (if your connection got lucky and got accepted) or the connect (if not).
I see that this pull request was merged. Is it now possible to use NIO's EventLoop for the async functions? Also, are there any drawbacks from doing so?
Should be possible now indeed! Possible drawbacks: it's a pretty different scheduler which tries to keep the same thread. That's great for I/O latency and usually works really well if you're a server replying to requests (as they come in balanced from many connections, each pinned to one of the available threads).
But if you're using Swift Concurrency as a way to distribute very CPU-heavy, synchronous computations across many cores then you won't see them as distributed across cores as you do with the standard one. You'd need to be explicit with the offloading of CPU-heavy computations.
Equally, if you use blocking I/O functions, which is a very bad idea under any Swift Concurrency executor, you may see the effects earlier with NIO's scheduler. I usually frame this as "already buggy programs get to hide less".
Tl;dr: if you're a server I'd give it a try and it should help quite a bit! Or if you're a web crawler client or something else that predominantly does network I/O.
If you're using Concurrency as a mechanism to parallelise CPU-heavy computations then I'd stick to the default. Kinda unlikely given that this thread is about Vapor web servers but you never know.
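To illustrate the "be explicit about offloading" point, something along these lines (a sketch only; the fibonacci function just stands in for the CPU-heavy work, and NIOThreadPool is only one of several reasonable homes for it):

import NIOCore
import NIOPosix

// Stand-in for the CPU-heavy, synchronous computation.
func fibonacci(_ n: Int) -> Int {
    var (a, b) = (0, 1)
    for _ in 0..<n { (a, b) = (b, a &+ b) }
    return a
}

func handleRequest() async throws -> Int {
    // Don't run long synchronous work on the event loops (or the concurrency
    // executor); hop onto NIO's blocking-work thread pool and await the result.
    let eventLoop = MultiThreadedEventLoopGroup.singleton.any()
    return try await NIOThreadPool.singleton
        .runIfActive(eventLoop: eventLoop) { fibonacci(10_000) }
        .get()
}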
It's not currently possible; the default command calls through to a sync function which uses .wait(), so there's a new PR (Async Entry and Custom Executor by 0xTim · Pull Request #131 · vapor/template · GitHub) to solve that. I have this working in a sample project so you can use that branch if you want to try it out. (There's a breaking test which I need to fix before we can merge.)
The reason why this extra PR is required is because the compiler swallows transitive noasync messages - I've put a pitch up to resolve that here