'Standard' vapor website drops 1.5% of requests, even at concurrency of 100!

hassila · May 3, 2024, 5:52pm

As mentioned, you busy loop taking timestamps. Not sleep.

lorentey · May 3, 2024, 7:52pm

BigInt.+ isn't doing anything particularly expensive. It has cost that's linear in the size of its input, and for values beyond 2^128 it includes an allocation to hold its result. I'd expect it to be generally in the ballpark of other big integer implementations.

(It doesn't do anything special, though -- it includes no clever SIMD tricks: it boils down to a boring loop of addingReportingOverflows.)

Given that this benchmark is quite deliberately using big integer additions as a way to generate load, I'm not surprised that it would dominate the performance analysis.

How confident are we that BigInt.+ is actually meaningfully slower than, say, Java's BigInteger.+? There may be an unexpected stdlib regression lurking there.

swift-numerics is our official home for numerics solutions outside the stdlib. It has a feature branch for BigInt development; it has been in the works for over four years now.

scanon · May 3, 2024, 8:15pm

BigInt.+ isn't particularly bad, but it isn't particularly clever either. GMP and similar implementations have hand-unrolled assembly loops for these core operations, which can easily be a two-to-three times faster than compiler-generated code (in any language--carry propagation introduces a second loop-carried dependency, and compilers don't generally do well with converting CF into a physical register to test the loop criterion and then back into a flag). So I would expect a modest load difference from that, and also from allocation pressure (but it would also be pretty easy to fix this if it's important for some real workload).

taylorswift · May 3, 2024, 8:18pm

i think that the title of this post is a little click-baity because we’ve been using SwiftNIO (which is what Vapor is built on top of) for about two years now and have not observed anything resembling a failure rate of 1.5%. that is a lot of requests, and if we were losing that many we wouldn’t be showing up in Google search results period.

in particular, we use Swift to serve Swiftinit, which is a large site that handles a lot of requests (but nowhere near 8,000 per second) and have not had major issues with dropped requests.

in the server industry you basically have two types of costs: humans and “electricity”.

i think that Swift is really good if you want to save on electricity. it’s truly remarkable how little memory Swift applications use (if written by experienced engineers) and you can run big services on some really tiny nodes. we are actually overprovisioned right now (due to investing in AWS savings plans) and could be paying a fraction of what we’re paying right now (which is already not much) if things like cross-compilation had been supported when we made the original investment.

on the other hand, you might have a hard time if you’re trying to save on human costs, and at many companies the cost of paying developers dwarfs the cost of paying for electricity. you really do have to reinvent everything yourself, which includes things such as modifying, compiling, and distributing your own custom Swift toolchains because Swift does not officially support the cloud platforms you are deploying to.

i think the focus on BigInt in this thread is illustrative because in the past at a different company i recall experiencing similar difficulties with underdeveloped libraries and being blocked by things like IEEE decimal support or ION support. back then, interest rates were lower, funding was easier, and people really did do things like take monthslong detours to actually write these libraries from scratch. but that didn’t pan out in the end, and i can’t imagine making the same decision myself in our current economic environment.

tera · May 3, 2024, 9:08pm

If we were serious about environmental impact we wouldn't have bitcoins!

There were a few issues raised here: request drops and bigInt performance. I'd check those separately. When checking request drops I'd use something simpler than big ints to rule its variables out, and vice versa when checking bigInts I'd use a simpler benchmark without involving NIO complexity.

lorentey · May 3, 2024, 10:05pm

Node.js's BigInteger.+ beats attaswift.BigInt.+ by about 3x for the specific case that's being exercised here (n = 10,000), which is precisely within expectations for a naive implementation.

Swift code

import BigInt

func fibonacci(_ n: Int) -> BigInt {
  precondition(n > 0)
  if n == 1 {
    return 0
  }
  var previous: BigInt = 0
  var current: BigInt = 1
  for _ in 2 ... n {
    let next = previous + current
    previous = current
    current = next
  }
  return current
}

let counts = [10000, 30000, 100000, 300000, 1000000]
for c in counts {
  print(c)
  var f: BigInt = 0
  let clock = ContinuousClock()
  var duration = clock.measure {
    f = fibonacci(c)
  }
  print("fibonacci: \(duration)")

  duration = clock.measure {
    print(f)
  }
  print("print: \(duration)")
}

Node.js code

function fib_loop(n) {
  if (n <= 1) {
    return BigInt(n);
  } else {
    let a = BigInt(0);
    let b = BigInt(1);
    let temp;
    for (let i = 2; i <= n; i++) {
      temp = a + b;
      a = b;
      b = temp;
    }
    return b.toString();
  }
}

let counts = [10000, 30000, 100000, 300000, 1000000];
for (let i = 0; i < counts.length; i++) {
    let c = counts[i];
    console.log(c);
    console.time("fib_loop");
    let f = fib_loop(c);
    console.timeEnd("fib_loop");

    console.time("print");
    console.log(f);
    console.timeEnd("print");
}

Interesting, but ultimately irrelevant benchmark data

Interestingly, the gap widens to ~6x on larger values:

Count	node.js	Swift	ratio
10,000	1.558ms	4.345ms	2.78x
30,000	8.794ms	33.127ms	3.76x
100,000	69.763ms	350.025ms	5.02x
300,000	552.623ms	3056.003ms	5.52x
1,000,000	5.944s	34.368s	5.78x

(Switching to using BigUInt and += is expected to eliminate the allocation overhead. Doing that shaves off a few percentages, but it doesn't meaningfully close the gap.)

Radix conversions are particularly slow in my old library. For n = 10,000, attaswift does its decimal conversion about 7x slower than Node.js. Other values demonstrate that the latter may have an algorithmic advantage:

Count	node.js	Swift	ratio
10,000	0.024ms	0.164ms	6.8x
30,000	0.047ms	1.199ms	25.5x
100,000	0.076ms	12.006ms	158x
300,000	0.18ms	109.942ms	611x
1,000,000	0.536ms	1273.589ms	2376x

It would be useful to follow up on these to rule out Standard Library issues. However, these are most likely due to a multitude of package-level issues. (I wrote this stuff almost a decade ago, in the Swift 2 era. Swift 4 was in beta the last time I was able to update this code.)

John_McCall · May 3, 2024, 10:08pm

Web servers dropping requests when the server is too overloaded to process them is the intended and correct behavior. You don't want a server to just keep accepting requests it can't handle until it falls over from resource exhaustion.

The benchmark here is demonstrating that a server that naively computes Fibonacci using that specific BigInt Swift package gets overloaded faster than a server doing the same thing with BitInt packages in other languages. That seems like constructive feedback for that package — it's slower than similar packages at this specific operation. It's not a general statement about either Swift or Vapor, though. And if server frameworks written in other languages are accepting work they can't handle and then only clearing it because the benchmark eventually stops crushing it with requests, they are poorly written frameworks.

RussBaz · May 4, 2024, 7:00am

To be honest, I have experienced Vapor dropping request for no apparent reason to me on my test machine with no meaningful load at all (although very rarely). Without showing any actual error, too!

However, I could never reproduce or reason why it happened. After all, the next request sent (without a server restart) would be properly processed without any issues. It feels like it was more common before 5.10. I don't think I have experienced it since upgrading to 5.10, but I started experiencing possible performance issues.

johannesweiss · May 4, 2024, 9:17am

Thanks @lorentey for benchmarking vs. JS. So @axello as others have pointed out, Swift is doing about 3x the work that the JavaScript/Java implementations are doing. That means the server gets more busy and it cannot be competitive with languages that use a 3x faster BigInt implementation because in this benchmark 92% of the time is spent in BitInt.+.

But you are right to point out the "dropped" requests. They're technically not dropped but just responded to with high latency and the load testing software then times out and counts them as dropped.

I had a look into that, the first curious observation is that it's always the first request in each connection that is much slower than the others. That can amount to over 2 seconds and then your client gives up and counts it as dropped -- fair enough I guess.

So why is the first request slow? SwiftNIO's default setting is to only accept 4 connections in a burst (even if there are 100 new connections, it'll just accept 4 each EventLoop tick). So in a way it prioritises existing connections over new connections under high load. That doesn't play well with benchmarks that are just burning through CPU and open a load of connections at the same time. On my machine, each fib(10k) is around 5ms each. So we're accepting 4 connections, calculating their first fibs (20ms at least), then accepting another 4 connections, calculating 8 fibs, accepting another 4 connections, calculating 12 fibs (60ms), accepting another 4 connections, calculating 16 fibs, .... So the 100th connection will takes ages (over 2 seconds) to accept. To make it worse, if we see over 2 seconds, we'll get more new connections .

Now is this a good default in SwiftNIO? Debatable. Clearly the other frameworks you have tested accept more connections in one go, maybe SwiftNIO should raise that number or maybe Vapor should. The good news is that the fix is easy in Vapor. If you do

swift package edit vapor

and then apply this patch (which will accept up to 256 connections in a lot)

diff --git a/Sources/Vapor/HTTP/Server/HTTPServer.swift b/Sources/Vapor/HTTP/Server/HTTPServer.swift
index 135fa752f..fac66c413 100644
--- a/Sources/Vapor/HTTP/Server/HTTPServer.swift
+++ b/Sources/Vapor/HTTP/Server/HTTPServer.swift
@@ -348,6 +348,7 @@ private final class HTTPServerConnection: Sendable {
         let quiesce = ServerQuiescingHelper(group: eventLoopGroup)
         let bootstrap = ServerBootstrap(group: eventLoopGroup)
             // Specify backlog and enable SO_REUSEADDR for the server itself
+            .serverChannelOption(ChannelOptions.maxMessagesPerRead, value: 256)
             .serverChannelOption(ChannelOptions.backlog, value: Int32(configuration.backlog))
             .serverChannelOption(ChannelOptions.socket(SocketOptionLevel(SOL_SOCKET), SO_REUSEADDR), value: configuration.reuseAddress ? SocketOptionValue(1) : SocketOptionValue(0))

you shouldn't see the dropped requests nearly as early.

Note:

This will of course just delay rejecting requests. It just doesn't prioritise servicing existing connections over new ones
This is mostly working around a benchmarking artefact where 100 connections are created at the same time

@0xTim / @graskind it might be worth to setting .serverChannelOption(ChannelOptions.maxMessagesPerRead, value: 256) or even a higher value as the default setting. Also this should be user configurable.

graskind · May 4, 2024, 11:26am

@johannesweiss Hmm... I always wondered what that particular option really meant . It wouldn't be at all hard to make this configurable, and bump the default value in the process. I'll probably work up a PR for it over the weekend, if @0xTim doesn't beat me to it.

@scanon BTW, speaking of that many years in development BigInt branch in swift-numerics, any hope of a hint of forward motion there?

wadetregaskis · May 4, 2024, 3:15pm

Thank you for rooting-causing this, @johannesweiss. That's really informative.

So if the handler were async this wouldn't be an issue?

It seems odd to me that the limit is hard-coded at four… I would have expected at least something a little more hardware-aware, like the number of CPU cores.

graskind · May 4, 2024, 4:19pm

I ended up writing said PR and merging it about 10 minutes after I posted this . I would be very interested to know whether retrying the benchmark with the updated default in place (released as Vapor 4.96.0) yields different results.

johannesweiss · May 4, 2024, 4:41pm

First of all: Apologies, this is a lot of text because the answer to your question is 'No. And also Yes', basically it's a difficult question. I will also admit that I should have probably spent more time on editing the below wall of text, the truth is I want to enjoy some of my weekend so a braindump it is .

Everything in SwiftNIO is asynchronous (written either in synchronous, manually-evented functions, or with futures or with async functions) but not necessarily async functions, although it's compatible with async functions of course. This isn't really related to async vs. non-async, this is related to where the CPU-burning code runs and when it gets triggered, what is prioritised and if it's preemtible.

Both SwiftNIO and Swift Concurrency's default global executor are cooperatively scheduled systems. Like all schedulers, you may agree or disagree with what exactly they happen to schedule at a given point in time. And because they're cooperatively scheduled that means that if you block (I/O or CPU-burning) the threads for a while then they can't interrupt that and do something else (like accepting new connections) instead.

If you're prepared to add some latency you can of course offload computations (or blocking I/O) into other thread pools. For example, onto a custom Swift Concurrency executor, your own threads, onto some DispatchQueue, onto a NIOThreadPool, another EventLoop or onto some Swift Concurrency executor. But you need to be very careful here: If you just blindly offload computations onto other threads without implementing some limits your server will be vulnerable to attacks: It will continue to accept work that it then just queues up, burning extra CPU & extra memory for that.

Exactly like John says:

Now, as it happens to be, if you just made the Vapor route async that would have (in standard configuration) the effect of offloading the fib(10k) calculations onto different threads (Swift Concurrency's global executor). Note that this adds latency (unless you make SwiftNIO the Swift Concurrency executor) and it may lead to queuing up requests in memory if you can't handle them as fast as they arrive! Just to be clear: That's can be a very bad thing.

Over time, Vapor will need to grow controls over how much work it will allow to be offloaded from the I/O threads. Today, marking a route handler async has the side-effect of also offloading the non-I/O work it performs (again, unless you make SwiftNIO the Concurrency executor). In the pre-concurrency world you could of course also offload in Vapor but you needed to more explicitly write ioPool.runIfActive { workToOffload() } or otherEventLoop.submit { workToOffload() }, same same but a little more explicit than just async. Just to be clear: I'm not implying that offloading being more explicit lead to more developers actually implementing limits . A benefit of having Vapor do the async offload for you instead of you explicitly asking for it is that Vapor could implement the offload limiting for you and just offer configuration of how many things you're prepared to offload at a given point in time.

Long story short: Careless offloading may make you more vulnerable to OOMs or DOS attacks and even without actually OOM'ing you may start to spend most CPU cycles on maintaining huge queues instead of doing productive work.

Said all that, in this synthetic benchmark it would have appeared to help! Why is that? Well, the benchmark happens to trigger at most 100 things at a time, that's too little to OOM or DOS the server. Plus, the benchmark imposed a strict 2 seconds latency target. Anything below 2s: okay, anything above: error logged. The latencies were actually totally fine but the first request of the later connections got latencies over 2s... It was always just one, but regardless the benchmark recorded that as a failure and dropped the connection.

But by separating the connection acceptance and the expensive part of the request handling (the fib(10k)) into different kernel threads, the kernel (which has preemptive scheduling) would take care of somewhat fairly balancing these threads. So the latency of connection acceptance would come down, the latency for each request would go slightly up (but not over 2s). This means that for concurrency 100 this would appear to help, for concurrency 100,000 or more this would be bad or disastrous. Why? Because if fib(10k) takes 5ms of CPU time, this means you can only make 200 of these calculations per second per core. So if you ever get more than 200*<number of cores> per second you cannot handle that, regardless of what you do.

You have two options: Either your server will either crumble and not respond to anything or it'll prioritise some of these requests and the remaining ones will be rejected (load shedding) or get high latencies (because they're queued). SwiftNIO's default is a combination of both: It accepts a few connections at a time, the ones that are pending are in the kernel's listen backlog (configured with this) and when the listen backlog is full, the kernel will reject further connections. IMHO exactly the right implementation and both limits (connection acceptance and listen backlog size) are configurable.

Lastly, if connection acceptance is really something you always want to prioritise, SwiftNIO offers the ability to donate one or more separate threads to exclusively accept connections. And as discussed before you can also configure how many connections it accepts per 'tick'.

As shown by my change to Vapor, it is not hard-coded by SwiftNIO, it's just a default value. This default is just fine for the real world but in this particular benchmark it doesn't look great.

If you want SwiftNIO to always prioritise accepting new connections first, then go ahead and set it to Int.max or make SwiftNIO use a separate EventLoop for connection acceptance (that's what this constructor is for). But then you might open yourself up for denial of service attacks because somebody could prevent you from processing existing connections by spawning new ones very fast (essentially sending SYN packets).

One rule of resilient systems is to never accept unbounded amounts of work, everything needs a limit. The idea here is that spending one resource (usually memory) on something that you can't process because you're out of another resource (usually CPU, network or disk I/O) doesn't make sense.

What the default limit is, is less important. The important bit is that there is a limit (and it should be configurable). SwiftNIO sticks to this rule, there are no (known to me) unbounded buffers fillable from the network. So even if connections are arriving faster than they can be processed, SwiftNIO will not just stupidly keep accepting connections over doing other work. If it did that then an attacker could keep sending SYN packets at a rate slightly faster than the SYN+ACKs come back. Is the default value of 4 any good? Again, debatable. I think it is okay but maybe a little low. The same applies to data. Even if the kernel buffers for a connection are constantly full, SwiftNIO will not stupidly keep pumping bytes, it will do maxMessagesPerRead read() calls and then switch to the next task (servicing the other connection or running other tasks).

So everything works as designed to be honest but if all you care about is

keeping the maximum latency < 2s
support 100 connections arriving at the same point in time, then no more connections
continuously with all your CPU power calculating for ~5ms on all threads

then yes, 4 is a bad configuration. I understand that these kinda benchmarks are common and that's why I recommended that Vapor tune this setting to 256 or so. Again, it's not important what the limit is, it's important that there is a limit .

wadetregaskis · May 4, 2024, 4:58pm

Right, but that's also why defaulting to the core count seems like a good idea because any CPU-bound workloads (which includes a lot of benchmarks, like this one in question) will basically perform optimally. Accepting more than the CPU core count will just add overhead through contention.

I'm curious why NIO / Vapor use this architecture, as opposed to an idle-acceptor? (dunno if that's the canonical term, but in short: run one thread per core and whenever that thread idles, accept another connection) I realise that's not easy to actually implement (for non-trivial handlers, i.e. anything which blocks or yields), but it's what I've seen work well in production.

Does Vapor support deadlines, to any degree? e.g. immediately terminate requests which are unlikely to succeed in time anyway, and/or cancel requests in flight if they exceed a deadline? I haven't explored it first-hand, but I've heard that this can be a substantial boon for overall success rates in overload situations.

(or, the simpler approximation via LIFO request handling?)

johannesweiss · May 4, 2024, 5:12pm

The number of threads NIO uses is defaulted to core count. The number of connections to accept in one burst is not. And it wouldn't make a massively better default either. Systems that aren't overloaded can't even tell the difference between 1, 10 and 100 in this setting. This is a priority control for when you're overloaded.

That is exactly the architecture. More precisely 'whenever a thread is idle or has serviced all connections once, accept up to 4 connections' . The problem was that it took >2s for the threads to become idle/us having serviced the other connections. Therefore accepting a connection took just over 2s which means the benchmark classified it as a failure.

With the setting set to 256 it's now 'whenever a thread is idle or has serviced all connections once, accept up to 256 connections'.

You'd need to implement that yourself, my understanding is that Vapor doesn't offer anything in that space at the moment.

FWIW, if you are building systems that are frequently overloaded and you want them to behave okay, then usually you'd do:

One explicit queue of a finite size right at the edge of your system (where requests are coming in)
Circuit breaker which rejects requests immediately if queue is exhausted (load shedding)
Often: Pop from the queue in LIFO (last in, first out) order

Why LIFO, not FIFO? This has some explanations & simulations but bottom line is: The longer something has been in a queue, the more likely it is that the client will hit its timeout before the server manages to process. Worse: The client hitting a timeout will usually mean that the client re-requests the work. The pathological situation is a FIFO queue where you constantly hit the client timeout juuuust before you finish processing, forever. 0 progress, lots of CO2 emitted .

These things can be implemented in your server itself or for example by putting an envoy or some other load balancing proxy in front. IIRC, Envoy's default config is 0 queue size (so LIFO/FIFO doesn't matter) and circuit breaker on 1,000 ongoing transactions. It's a bit of a weird limit because it's large enough that you won't see it in normal testing but too low for most production scenarios. Unless you're calculating fib(10k) for every request, then 1,000 concurrent requests might be good ;). But sticking to my point above: The important bit is that there is a limit by default and that it's configurable which is exactly what Envoy does.

The Swift on Server ecosystem currently doesn't have any circuit breaking libraries that I'm aware of but SwiftNIO has all the flexibility you'd need to implement one.

tera · May 4, 2024, 5:19pm

Is this obvious? What bad would happen if instead of an artificial limit (which we must carefully select and maintain) we'd just accept more work and fail to accept more when resources are exhausted? And once resources are freed — we'd accept more work again. Why is that bad? Whether you drop requests yourself or are they dropped by means of resource exhaustion – the end result is the same "service is temporarily unavailable, try later".

wadetregaskis · May 4, 2024, 5:24pm

Wouldn't it make a positive difference for transient busy periods, though? If you happen to get 100 connections basically simultaneously - but on a longer time scale you're not actually overloaded - then you should accept them all because it'll work out fine. If you don't, you end up with this somewhat pathological behaviour that @axello's benchmark triggers, of failing to serve requests that could have been served successfully.

I confess I still don't get it, but that's okay - I haven't played seriously with Vapor so I probably just need to sit down and really explore it a bit.

johannesweiss · May 4, 2024, 5:26pm

That's exactly what is implemented. If resources are plentiful the event loop ticks will be short so we'll basically constantly accept up to 4 connections. But when resources are getting tight (ticks are getting longer) we're accepting up to 4 connections with longer and longer delays.

In pseudo-code what's going on is:

while true {
    let newConnections = server.acceptNewConnectionsIfAnyAreWaiting(upTo: 4)
    serviceExistingConnections()
}

So if serviceExistingConnections() is very quick, it keeps accepting new connections at the maximum rate. It's just if serviceExistingConnections() uses all your CPU for 2 seconds that there might be 2 seconds where no new connections are being accepted.

wadetregaskis · May 4, 2024, 5:27pm

johannesweiss:

In pseudo-code what's going on is:
while true {
    let newConnections = server.acceptNewConnectionsIfAnyAreWaiting(upTo: 4)
    serviceExistingConnections()
}
So if serviceExistingConnections() is very quick, it keeps accepting new connections at the maximum rate. It's just if serviceExistingConnections() uses all your CPU for 2 seconds that there might be 2 seconds where no new connections are being accepted.

It's not clear to me why this would ever fail to saturate all available cores, basically immediately, then? Either you have a copy of this loop running concurrently on every core, or you have just one instance of this loop but serviceExistingConnections should be recognising that it's not using all the cores…?

johannesweiss · May 4, 2024, 5:31pm

Any setting will accept all the connections. It's literally just how many connections are accepted in one acceptance burst.

It's like

for numbers in allTheNumbers.chunks(ofCount: 4) {
    for number in numbers {
        print(number)
    }
    doOtherWorkIfNecessary()
}

This will print all the numbers in allTheNumbers but if doOtherWorkIfNecessary() takes 5ms, then you'll quickly print 1, 2, 3, 4 then wait for 5ms, then 5, 6, 7, 8 then wait for 5ms, then 9, 10, 11, 12, etc.