'Standard' vapor website drops 1.5% of requests, even at concurrency of 100!

If you type ulimit -a it should give you details on what limits are applied to your current shell and any process it spawns. @johannesweiss pointed this out to me a few years back.

swift-nio-ssl (which is used by Vapor) has no support for hardware offload for TLS, and frankly I'm not aware of any hardware offloading of TLS that's done for performance. TLS is very fast. As an example, here's the crypto involved in a TLS 1.3 handshake:

  • Digest calculation over the handshake messages, usually 4kB or less.
  • One key exchange, usually ECDHE which involves generating a random scalar and performing two point multiplications.
  • One signature (either signing or verifying depending on role), again usually an EC signature. Implies another digest calculation and point multiplication.
  • HKDF generation of a number of derived secrets. All digest calculations.

That's it. The in-band cryptography is then all bulk symmetric through an AEAD, usually AES-GCM but sometimes ChaCha20-Poly1305.
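To make concrete how little work that is, here's a rough sketch of those operations using swift-crypto's primitives. This is purely illustrative: the data, labels and key choices below are made up for the example, and it is not the code path BoringSSL actually takes inside a TLS stack.

import Crypto
import Foundation

// Illustration only: the classes of crypto a TLS 1.3 handshake needs,
// expressed with swift-crypto primitives.
let transcript = Data(repeating: 0x42, count: 4096)

// 1. Digest over the handshake transcript (~4 kB or less).
let transcriptHash = SHA256.hash(data: transcript)

// 2. One ECDHE key exchange: generate a random scalar, two point multiplications.
let ourKey = P256.KeyAgreement.PrivateKey()
let peerKey = P256.KeyAgreement.PrivateKey() // stands in for the peer's key share
let sharedSecret = try ourKey.sharedSecretFromKeyAgreement(with: peerKey.publicKey)

// 3. One EC signature over (roughly) the transcript hash.
let signingKey = P256.Signing.PrivateKey()
let signature = try signingKey.signature(for: transcriptHash)
precondition(signingKey.publicKey.isValidSignature(signature, for: transcriptHash))

// 4. HKDF derivation of traffic secrets from the shared secret.
let trafficSecret = sharedSecret.hkdfDerivedSymmetricKey(
    using: SHA256.self,
    salt: Data(),
    sharedInfo: Data("illustrative key schedule label".utf8),
    outputByteCount: 32
)

// After that, everything in-band is bulk AEAD encryption.
let record = try AES.GCM.seal(Data("hello".utf8), using: trafficSecret)
_ = record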

The important heuristic there is that these are all super fast on even slightly modern application processors. Offloading them to a hardware card would almost never improve performance, because the cost of shuffling the data to and from the card would utterly dominate the cost of the computation being done. Back in 2010 Adam Langley offered a rough estimate of 1500 handshakes/sec/core. That number not only assumes 2010-era server hardware, but also involves RSA 1024 (far slower than EC) and RC4 without hardware acceleration (compare AES with hardware acceleration in all modern CPUs). Nowadays we are well past that point: the crypto for TLS handshakes is lost in the noise of the rest of the protocol stack.

The reason you might do crypto offload isn't performance; you might choose to do it for security. Keeping a private key inaccessible from main memory is valuable, so you may choose to do that. But this will never make your handshake faster, it'll only make it slower. (Sidebar: this also only affects the signing operation, as it's only the server's private key that lives in the hardware security component. So only one of these operations gets slower, but it does get a lot slower.)

(Sidebar to my sidebar: how much slower? Using YubiHSM2 as a good example of a publicly documented HSM, Yubico quotes ~73ms for ECDSA-P256-SHA256. Whereas, openssl speed ecdsap256 on my M1 Max with OpenSSL 3.2.1 says 59353.7 signing operations per second, or ~16µs. That makes YubiHSM2 about 4,500x slower than doing the signing operation on the CPU. Worse still, typically HSMs are single-threaded, so you have locked your handshake rate down hard.)

It is: swift-nio-ssl is a BoringSSL wrapper. However, we have not exposed much in the way of API for using hardware secure elements, other than hopping through Swift code first via NIOSSLCustomPrivateKey.

Many stacks use BoringSSL under the covers. However, for these 4 that's likely not true. I'd expect PHP to use OpenSSL via httpd. Node.js uses OpenSSL as well I believe. As for Java, it depends: Java has a builtin implementation, but Netty uses BoringSSL.

5 Likes

ulimit -n gives 256 on my computer, and I do indeed hit that limit in the macOS UI app (it correctly fails with "Fatal error: failed to open 252th file", see below)... Although this doesn't happen in the console app! Why?!

The test code:

import Foundation // URL, UUID
import Darwin     // open(2), O_CREAT

func test() {
    let files = (0 ..< 1000).map { i in
        let path = URL.temporaryDirectory.appendingPathComponent(UUID().uuidString)
        // O_CREAT needs an explicit mode argument, otherwise the permission
        // bits are whatever happens to be on the stack.
        let file = open(path.path, O_CREAT, 0o644)
        if file < 0 {
            fatalError("failed to open \(i)th file")
        }
        return file
    }
    print("opened \(files.count) files")
}
test() // or call it from, say, ContentView's init of the UI app.
1 Like

At the moment it's always sticky to the NIO thread. The idea is that if you're triggering work from the I/O that's servicing the connection/requests you'll end up on the 'correct' thread always so you can do I/O without ever having to switch threads. Given that the connections are round-robin'd across your EventLoops you should get a good spread.

If, however, you're not triggering your work from connections/requests that are already spread across threads, you'd indeed need to do that manually by calling MultiThreadedEventLoopGroup.singleton.next().execute { ... }.
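For example, a minimal sketch of what I mean by that (not a Vapor-specific API, just plain NIO):

import NIOCore
import NIOPosix

// Some background job that isn't triggered by any connection/request:
// hop onto one of the shared event loops explicitly so the work lands on
// the same threads that service your I/O.
let eventLoop = MultiThreadedEventLoopGroup.singleton.next()
eventLoop.execute {
    // Now running on one of the shared NIO event loop threads.
    print("running on \(eventLoop)")
}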

But to be honest, the idea today is that this is for I/O-triggered workloads.

There should be no limit apart from what your hardware can handle.

Hmm, I'd suggest using all high-performance cores. So on most machines just the number of cores; for Apple Silicon Macs, I'd suggest selecting only the number of high-performance cores.
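Something along these lines could work for picking that number (a sketch only, assuming macOS 12+ where the hw.perflevel0 sysctl is exposed; the function name and the fallback are mine):

import Darwin
import Foundation
import NIOPosix

// Sketch: count only the performance ("P") cores on Apple Silicon,
// falling back to the total active processor count elsewhere.
func highPerformanceCoreCount() -> Int {
    var count: Int32 = 0
    var size = MemoryLayout<Int32>.size
    if sysctlbyname("hw.perflevel0.logicalcpu", &count, &size, nil, 0) == 0, count > 0 {
        return Int(count)
    }
    return ProcessInfo.processInfo.activeProcessorCount
}

// Size the server's EventLoopGroup to the high-performance core count.
let group = MultiThreadedEventLoopGroup(numberOfThreads: highPerformanceCoreCount())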

This shouldn't require disabling SIP. I think you need to run

sudo launchctl limit maxfiles 100000 100000

And then (in the shells you're running wrk & the server from): ulimit -n 100000. But these settings won't persist across reboots / new shells.

If I remember correctly you can also echo limit maxfiles 100000 100000 | sudo tee -a /etc/launchd.conf to make it permanent.

Yeah, always benchmark on the target system indeed.

It depends on whether your software can handle the 'out of file descriptors' condition. NIO should handle it fine but of course it cannot accept more connections than the limit, so you will see read/write/connect errors in wrk.

2 Likes

Thanks for the clarification!

If only there were a way to convince a desktop system, with unlimited power available and high performance needs, to use those first… I’ve filed FB13223271 on that though :slight_smile:

I don’t mind if background system tasks go on the efficiency cores, but I want our software to only use those as a last resort…

Hmm, assuming the right QoS settings etc. the threads should migrate to the high-perf cores when necessary. The main reason I'm suggesting spawning the server with number-of-high-perf-cores threads is that dumb round-robin scheduling onto threads isn't ideal if the cores are asymmetric. By only creating as many threads as you have high-perf cores, you should (under high load) get each high-perf core running one of your threads.

Hmmm… that seems at odds with my experience at Google. For decades now it's been standard practice (if you can afford it) to have dedicated machines just for TLS termination. Admittedly that's mostly for geographical reasons, with the goal of minimising round-trip latency to end users given TLS's multi-round-trip handshake (one of the problems QUIC was invented to address), but also because TLS is expensive, and for that reason it originally wasn't done on traffic past the POPs (until the NSA made everyone's lives worse).

Last I (vaguely) recall, TLS is done in network hardware now because it can be done there with lower latency and more energy-efficiently than on CPUs. I'm pretty sure there's a published paper on this, from Google, but I can't immediately find it.

Granted "expensive" here is relative to load, and not many servers operate at Google's load scales. I sure as heck wouldn't ever worry about TLS costs on my production servers. But I'd still want to cover it in my benchmarks, if I'm comparing different server stacks. You never know when you might hit an implementation-specific bottleneck.

And as you also pointed out, there is a divide between OpenSSL and BoringSSL within the industry. I don't know if they're currently expected or known to have performance differences… I know that originally BoringSSL was slower than OpenSSL in some ways, particularly on platforms that at the time the Google BoringSSL team felt were "irrelevant" (like Power / PowerPC) since they specifically removed optimisations for those platforms, and it was originally forked to improve security specifically, not performance.

Not necessarily. A lot of commercial network offload hardware is just badly architected, and produces the artificial bottlenecks that I think you're alluding to. There are ways to do it that have practically no latency overhead (beyond what fundamental physics dictates, and you have to pay those costs irrespectively because the data has to physically get to the CPU sooner or later).

1 Like

I'm simply not getting the "out of file descriptors" condition around the 256th opened file... Tried to run the above test (with an increased N) in the console app just now, and got this:

failed to open 7164th file

which is way above the 256 that ulimit -n returns. On the same machine the UI app (whether sandboxed or not) correctly hits this limit (failing after the 252nd file or so). Weird.

1 Like

Yeah, I'm pretty sure file descriptor limits aren't the issue here, although I haven't had a chance to get back on my iMac today to double-check that.

I didn't see any errors indicating file descriptor exhaustion (I'd expect connect / accept to fail with EMFILE).

As you allude to here, these machines are almost never now used only for TLS offload. And in modern data centers it is usually the case that instead of these frontend machines being used for TLS offload, they have the effect of adding more TLS, as there's now TLS between the machines in the DC too.

Saving the cost of handshakes is desirable, for sure, but the cost there isn't so much server CPU as it is round-trip latency.

Yes, you can offload to NICs, which is the only kind of hardware offload that makes sense here. But the reason this offload is useful is nothing to do with whether TLS is expensive on the CPU, and everything to do with the cost of passing data between the CPU and the NIC.

The more processing you can do on the NIC, the faster things go, because getting a packet from a NIC to kernelspace is expensive, and from kernelspace to userspace is expensive. The actual TLS math is fairly cheap: the packet shuffling is expensive.

But I didn't read the original comment this way. I read it as proposing offload to a "TLS card", and that doesn't sound like a NIC to me, that sounds like a totally separate processor. This will never be a win. Offload to the NIC is a win because the NIC is already involved in this flow: it already has to touch the packets, so it may as well touch them a bit more. Offloading to an otherwise off-path device always costs extra latency.

And regardless, this is all moot. Outside of hyperscaler networks, you aren't going to do any of this, and by the time NIC offload is tempting you've done a number of other things first that Vapor also doesn't support, such as kernel-space TLS, or kernel-bypass networking. For use-cases like that we absolutely can start having a conversation about TLS becoming expensive, but for a "calculates fibonacci numbers" server TLS is approximately free.

3 Likes

I retested the original benchmarks with the vapor updates from Gwynne and Wade, and put the results in a new blog post:

Juice Sucking Servers, part deux

Now I see a nice 100% request-response rate, up until about 600 concurrent requests. Then it drops to 98%.
When I only apply the Vapor upgrade to 3.96, I see the response rate go down to 25%. But when I add the benchmark optimisations from Wade, Vapor's request acceptance rate again stays at 98%. I still don't know how that can happen. Perhaps I'm measuring it all wrong?

For example, this is wrk's output:

Running 30s test @ http://192.168.8.185/bench/10000
  4 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   966.51ms  152.91ms   2.00s    90.78%
    Req/Sec   254.96     78.11   545.00     70.09%
  29506 requests in 30.09s, 62.83MB read
  Socket errors: connect 8983, read 0, write 0, timeout 527
Requests/sec:    980.45
Transfer/sec:      2.09MB

I am plotting this as 29506 requests, of which 527 have a timeout.
However, I can also read it like this: “of the 29506 requests, only 8983 ‘connect’ (whatever that may be), and of those 8983, 527 time out.”
I simply do not know enough about wrk or tcp sockets.

¯\_(ツ)_/¯

Indeed, there are no differences between a 2 and 4 second timeout.

To preempt that discussion: I am using two separate Linux machines, connected over Ethernet.
As these are old machines, it is plain gigabit Ethernet, so no 10-gig fiber optics. For more rigorous tests, the network latency needs to be taken into account as well. But here the network latency is probably the same for all the tests.

Thanks y'all for making server side swift a better place!

3 Likes

Good. Could you tell us:

  • What is the average latency with just one connection? That should give us roughly the time one fib(10k) takes.
  • How many real CPU cores do you have (so ignore hyper-threaded 'logical' cores)?
  • What's the output of ulimit -a?
  • What operating systems / setup are you using for client / server?
  • Is this over a real network?

With that information we should be able to give some more insight into what's going on and if that's expected.

1 Like

Can't you also just call setrlimit(RLIMIT_NOFILE, ...) in process to lift that limit, or does that no longer work on modern macOS?
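Something like this is what I mean (untested sketch; presumably the hard limit, and on macOS kern.maxfilesperproc / OPEN_MAX, still cap what you can ask for):

import Darwin

// Raise the soft RLIMIT_NOFILE limit from inside the process,
// up to whatever the hard limit allows.
var limits = rlimit()
if getrlimit(RLIMIT_NOFILE, &limits) == 0 {
    limits.rlim_cur = min(limits.rlim_max, 10_240)
    if setrlimit(RLIMIT_NOFILE, &limits) != 0 {
        perror("setrlimit")
    }
}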

That's 8,983 connect errors, not connections made.

It's a pity that wrk doesn't report the actual error codes - as I mentioned previously, I had to manually add some printf-debugging for them and knowing the actual codes was insightful. It'd actually be a fairly easy but useful addition to wrk for anyone wanting to help out an open source project.

There are no issues with bandwidth here - a million requests only adds up to about 80 MiB of data transferred, according to wrk. Even if that's not counting the request & response headers, and given that that many requests take many tens of seconds to get through, it's a tiny fraction of even just a gigabit link.

However, latency will be worse on a gigabit link than a 10 Gb or 40 Gb link, the latter two being much more common in server environments.

Now, whether that materially impacts the benchmark, I dunno… I suspect not. But, just want to note it.

P.S. You can easily do 40 Gb over copper, and some datacentres still do (from server to top-of-rack switch only, though, usually). Fibre's actually cheaper at any serious scale, but lots of colo and edge facilities are comparatively tiny by modern standards, and also don't necessarily benefit from deploying their contents all at once where scale aids purchasing power.

1 Like

Interestingly, this sheds some light on the unexpected test results I was getting earlier (here and here), as the limits correlate with the test results:

1. macOS UI App (sandboxed or non sandboxed):
    getrlimit result: 256
    maximum number of files opened empirically: 252

2. console app:
    getrlimit result: 7168
    maximum number of files opened empirically: 7164

The unaccounted-for file descriptors are taken by things like stdin/stdout, etc.

By changing the limit via setrlimit I was indeed able to create more than 256 files in the UI app.

Why the macOS UI app and the console app disagree on the default limit, I have no idea.
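(For reference, the getrlimit numbers above presumably come from a query along these lines; just a sketch of the call, the printing is mine.)

import Darwin

// Print the current soft and hard open-file limits for this process.
var limits = rlimit()
if getrlimit(RLIMIT_NOFILE, &limits) == 0 {
    print("soft RLIMIT_NOFILE: \(limits.rlim_cur), hard: \(limits.rlim_max)")
}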

2 Likes

@axello, in your new post you say:

  1. I was logging to the console with info level instead of debug.

I assume you mean error, not debug? It's not normal to use debug log levels in production (and depending on your server, that might produce a lot more logging which will slow things down).

Also, it appears with the new Vapor and using Numberick that the Swift server now out-performs all the others in throughput, at ~30k/s on your hardware vs 26k/s for Java (the previous winner). Is that correct, or did the setup change between your two posts?

Might be worth noting [in the second post] that the memory behaviour between these languages, specifically its impact on performance, is speculative. It's probably on the right track, but nobody's actually examined the Java or JavaScript servers to see if they really are getting a notable benefit from bump allocation.

Note also that Swift can get these same benefits if you take on some of the memory management explicitly. You might also hear it referred to as "zone" or "region" allocation. It's most-often used to avoid retain-release traffic. Though I presume it's unrealistic to do that in a typical Swift web server (I'm not sure if it'd even be practical, with a general-purpose framework like Vapor involved).

2 Likes

➔ Average latency with concurrency 1

  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.57ms  661.76us  21.11ms   98.61%
    Req/Sec   282.27     17.58   300.00     97.00%
  8443 requests in 30.02s, 17.98MB read
Requests/sec:    281.27
Transfer/sec:    613.35KB

with concurrency 2:

  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.62ms    1.12ms  42.81ms   98.30%
    Req/Sec   281.17     28.60   313.00     85.67%
  16813 requests in 30.02s, 35.80MB read
Requests/sec:    560.01
Transfer/sec:      1.19MB

with concurrency 4:

  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.34ms    2.28ms  64.88ms   98.25%
    Req/Sec   477.56     41.27   510.00     93.00%
  28546 requests in 30.02s, 60.79MB read
Requests/sec:    950.92
Transfer/sec:      2.03MB

ulimit -a (server)

real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 63467
max locked memory           (kbytes, -l) 2045452
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 63467
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

➔ The client has about the same values.

server

My Core i3 reports 4 processors, so that is probably 2 real cores:

output of /proc/cpuinfo and /etc/os-release:

Intel(R) Core(TM) i3 CPU         550  @ 3.20GHz
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"

client:

Intel(R) Core(TM) i5-3470S CPU @ 2.90GHz

PRETTY_NAME="Ubuntu 22.04.4 LTS"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"

network

This is over a real 1 Gbps network

Apologies, yes: I meant the default value for Vapor when you do not set
app.logger.logLevel yourself. As it runs in 'production', that is 'notice'.

Okay, so 1 request takes 3.57 ms on average. This does of course include some network latency as well as the time it takes to create the request, so the real fib(10k) CPU latency is slightly lower. I hadn't realised you're benchmarking over a real network, which adds latency. Thanks for adding the -c 2 and -c 4 numbers.

As expected, your numbers also show that adding a second connection makes it scale linearly (because you have another core available). The latency is pretty much unaffected (3.57 ms vs. 3.62 ms), great. That also leads to a doubling of the requests per second: From 280 to 560. Perfect, exactly what we want to see!

And fortunately you also added the -c 4 numbers, which neatly show that your CPU is now at its limit: the latencies get higher (because now we create more work than the CPUs can handle) and that's also why we don't see another doubling of the requests per second (we go from 560 (2 conns) to 950 (4 conns)). This is also expected: the logical hyper-threaded cores aren't as good as having 4 real cores, they merely use the available CPU resources a little better.

So here's what I would expect and you can check with wrk if that's actually the case:

  • The requests per second should stay around 950 (because that's an average), even if you double your connections again (to 16, 32, 64, 128, ...)
  • The max latency will go up as you increase connections
  • Once you go above about 1000 connections, you'll start to see errors (because your settings will only allow up to 1024 file descriptors and a bunch of them are in use for things that aren't network connections)
  • You should not see any errors below 1000 connections (you may want to raise your --timeout just so we don't struggle to accept these 1000 connections, because your machine will be super overloaded at this point)

All in all, this all looks totally expected. One point that's important to make is that it seems that your CPU cores together can do about 950 requests per second. So regardless of how many connections, you won't be able to fulfill more than 950 requests per second. That's the maximum.

So if you have 950 connections, we would expect an average latency of about 1 s (because each of them will constantly enqueue a request, so it'll take us a whole second to compute all of the fibs we need for just one request from each connection). If you had 10,000 connections (*) we'd expect a 10.5 s average request latency, at best!
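The general relationship at play here (essentially Little's law, under the assumption that every connection always has exactly one request in flight):

\[
\text{expected latency} \approx \frac{\text{concurrent connections}}{\text{max throughput}}
\qquad\text{e.g.}\qquad
\frac{10\,000\ \text{connections}}{950\ \text{req/s}} \approx 10.5\ \text{s}
\]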


(*): To actually make 10,000 connections and have wrk work, you'd need to raise the open file limit (ulimit -n) well above 10,000 on the client machine too.

3 Likes