'Standard' Vapor website drops 1.5% of requests, even at a concurrency of 100!

I have been researching the environmental impact of four different backend technologies. This was for a talk I gave in Amsterdam. I've been benchmarking Node.js, PHP, Java, and Vapor.

During my research I stumbled upon a weird habit of this particular Vapor/Docker/Swift setup: it always drops 1-2% of all requests. By ‘drop’ I mean a timeout of more than 2000 ms.
Also, Node.js and Java initially handle responses a lot faster than Swift, which is also… incredible.
But Node.js, PHP, and Java only start dropping requests when they can no longer handle the load, and then they quickly become useless. Swift happily chugs along, but with a persistent drop of about 1.5%, right from the start, at 100 concurrent requests. It keeps going at this 'drop level' until I stopped testing, at a concurrency level of 100,000.

I asked in the Vapor Discord, and they pointed me here, as it might be SwiftNIO related?

Anyhoo, here is my report about it:

Juice Sucking Servers

9 Likes

Your results seem illogical, unless I'm misunderstanding your test setup. If you're using 100,000 concurrent workers, with a two second timeout, then you must have a throughput of at least 50,000 requests per second in order to avoid timeouts. Yet you're reporting only ~8,000 requests/sec with ≤3% timeouts.
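(To spell the arithmetic out: all 100,000 in-flight requests must each complete within their 2-second timeout, so the server has to finish 100,000 requests every 2 seconds, i.e. at least 50,000 requests/second.)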

2 Likes

Could you post the Vapor server code here as well as how you built the application?


Also: CC @0xTim regarding the PR marking the .wait()-using methods noasync, which is a precursor to being able to use NIOSingletons.unsafeTryInstallSingletonPosixEventLoopGroupAsConcurrencyGlobalExecutor(), which is the only way to get full performance. Otherwise, Swift Concurrency forces us into at least two thread hops per request, which means a request will likely have a floor latency of 20 µs (10 µs per thread hop); reality will be worse. At least if the OP uses the async APIs, which I assume is the case.

3 Likes

Server code is here: vapor/swiftbench/Sources/App/Controllers/bench.swift · main · Axel Roest / serverbench · GitLab.

1 Like

Johannes' suggestion here seems right: Vapor is building up an unbounded queue of work. The reason things are so stable is that NIO is humming along very nicely, serving the HTTP traffic. Then we kick the request onto the async/await pool and wait for CPU resources to become available to do the Fibonacci calculation before returning the response.

You can see this if you capture a Wireshark trace of the execution: early HTTP requests terminate fairly quickly (~70 ms on my M1 Max running a Linux VM), but later ones take longer (110 ms). The durations here are very stable and only go up, so there's a clear trend: it's taking us longer and longer to get to do any work.

I haven't analysed the business logic at all, so it's entirely possible that the app itself can be sped up, but the cost of jumping between concurrency domains is definitely hurting today.

As for the low-lying error rate, I think this is also caused by the high latency profile of hopping into async code. If I drastically widen the number of threads I use in wrk (from 4 to 120) my error rate drops precipitously. This is likely because wrk doesn't hit its timeouts quite as aggressively as it does when it uses fewer threads, and so the more marginal requests complete.

TL;DR: I don't think this is directly NIO related, it's more about the awkward transition from Futures to Concurrency. Task executors will help markedly. I also recommend running the business logic under Instruments or something similar to see if the perf can be improved, as the calculation time for Fibonacci is pretty heavy.

1 Like

Or call NIOSingletons.unsafeTryInstallSingletonPosixEventLoopGroupAsConcurrencyGlobalExecutor() as the very, very first thing in your program today (and log a warning if it returns false) 🙂.
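For concreteness, a minimal sketch of what that could look like, assuming a synchronous main.swift-style entry point (the structure and warning text are mine, not a Vapor template):

    import NIOPosix

    // Must run before anything else touches Swift Concurrency, hence a
    // synchronous entry point rather than an async main.
    let installed = NIOSingletons.unsafeTryInstallSingletonPosixEventLoopGroupAsConcurrencyGlobalExecutor()
    if !installed {
        // Sketch: log however your application logs.
        print("warning: could not install NIO's event loop group as the Concurrency global executor")
    }
    // ...boot the Vapor application as usual from here.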

More info on the PR: allow setting MTELG.singleton as Swift Concurrency executor by weissi · Pull Request #2564 · apple/swift-nio · GitHub

1 Like

For the Fibonacci calculation, a few likely optimisations:

  • Using BigUInt might be faster, as it removes any sign-bit and two's-complement handling. It's impossible to get a negative number out of a [canonical] Fibonacci sequence.

  • For large numbers (once you start getting into hundreds of bits for their storage) you're probably going to pay some significant costs just for copying them (e.g. in the core loop generating the next value in the sequence, in generateFibonacci). You might (untested) get better performance by doing more in-place:

    for _ in 2 ..< count {
        previous += current        // `previous` now holds the next Fibonacci number
        swap(&previous, &current)  // keep `current` as the most recent value
    }
    

    …assuming the optimiser does a good job on the above, and wasn't already doing essentially this with the original code. In principle the 'swap' should optimise away to merely a hidden pointer that switches between the two numbers (and likewise for the original version, though that asks more of the optimiser to recognise that fact).

    At the very least it should ensure at most merely memory movement; it will avoid any retain-release traffic and the like (BigInt & BigUInt use an Array<UInt64> internally for numbers that require more than 128 bits, and Array's internal storage is a reference-counted class instance).

    Of course, whether this is "fair" to the benchmark is a fair question (no pun intended). I think it is, because while it's a bit of a "trick" to know to use swap instead of the temporary variable approach, it's just as readable and something you do use in real-world code once you know the "trick".

  • Just return current.description. Don't use string interpolation unnecessarily - it's at best merely as fast, and usually slower.

  • generateAllFibonacci is very inefficient as currently written. It might get saved by the optimiser, but I wouldn't bet on it. You're storing all the results simultaneously even though you don't need more than two in memory at a time (the most recent two values, in order to compute the next). It'll very likely be much faster if rewritten like e.g.:

    func generateAllFibonacci(req: Request) throws -> String {
        var result = "0, 1"
    
        var previous = BigUInt(0)
        var current = BigUInt(1)
    
        for _ in 2..<10000 {
            previous += current
            swap(&previous, &current)
            result += ", "
            result += current.description
        }
    
        return result
    }
    

    String basically grows its size [when necessary] by doubling its capacity, so while this will be pretty fast it still isn't technically optimal in terms of peak memory usage and malloc traffic.

    It'd be better if you streamed the results piecemeal back to Vapor, where it can put them out on the wire as they're generated; much more efficient than collecting a big temporary String of the whole response. I'd be surprised if Vapor doesn't support this, though I looked through the docs and couldn't find any indication of how.
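    For illustration, a hedged sketch of that streaming approach, assuming Vapor 4's Response.Body(stream:) API (untested; the function name, chunk size, and flush threshold are mine):

    import Vapor
    import BigInt

    // Sketch: stream the sequence in chunks rather than accumulating one big String.
    func streamAllFibonacci(req: Request) -> Response {
        Response(status: .ok, body: .init(stream: { writer in
            var previous = BigUInt(0)
            var current = BigUInt(1)
            var chunk = "0, 1"
            for _ in 2 ..< 10_000 {
                previous += current
                swap(&previous, &current)
                chunk += ", "
                chunk += current.description
                if chunk.utf8.count >= 4096 {
                    // Flush roughly every 4 KiB so peak memory stays small.
                    _ = writer.write(.buffer(ByteBuffer(string: chunk)))
                    chunk.removeAll(keepingCapacity: true)
                }
            }
            _ = writer.write(.buffer(ByteBuffer(string: chunk)))
            _ = writer.write(.end)
        }))
    }

    (Note this doesn't change where the CPU work happens; it only bounds the response's memory footprint.)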

1 Like

I don’t think serving Fibonacci numbers via HTTP is actually the goal of this exercise? It seems like a synthetic workload chosen to stress the system.

4 Likes

It might not be the goal, but it's the principal cost, and it's quite likely a major contributor to the cross-language differences. In particular, BigInt biases hard towards GC'd languages, where repeatedly allocating storage is very cheap, while Swift eats the perf cost of calling malloc.

5 Likes

It's also a very notable cost: it's a heavy on-CPU load. I don't see a request take less than 70 ms, so we're looking at probably 60 ms of CPU time per request. Optimistically, that gives us a best-case throughput of less than 17 RPS per core. It sort of doesn't matter what the network stack gets up to if that's the cost. (And that's 60 ms on my M1 Max; the OP tested on a 2013-era Intel server chip, where I expect the cost to be far worse.)

Hey @axello, I had an actual look at your binary: 92% of the time is spent in BigInt.+. So really we're benchmarking the BigInt implementation here, not the Vapor server. We're basically burning all the CPU on + operations on the BigInts and nothing else, so sync/async, NIO, Vapor: none of those matter for these numbers.

What BigInt implementations are you using for the other languages?

CC @lorentey, any suggestions regarding BigInt performance? Seems to be using GitHub - attaswift/BigInt: Arbitrary-precision arithmetic in pure Swift

11 Likes

Indeed, it could very well be other causes leading to this problem; see below. However, on the same shitty server (2013), Java and JavaScript perform much, much better at lower concurrencies. ¯\_(ツ)_/¯

Also, these results are from wrk, so maybe I don't understand how it does its accounting?

Hi Cory,

Good observations. Of course the Fibonacci code can be optimised, but I wanted the same type of algorithm across all four platforms. For example, at first I was afraid the Java implementation was so performant because it did aggressive caching, but the developer assured me it wasn't so. Still, the default Java implementation on this machine is twice as fast as the default Swift implementation!
I was using low-performance machines (2013) to do the benchmarking, with the wrk instance connecting over gigabit Ethernet. So it could very well be a network bottleneck as well. But then I think: why not for the other frameworks?

Hi Wadetregaskis,
Thank you for your elaborate analysis. While the implementation is slow and there are many other, faster algorithms, that was not the point of this exercise. A matrix-type algorithm in PHP might outrun this naive algorithm in Swift, and I'm pretty sure similar enhancements can be made to the Node.js/JavaScript version. In fact, a Java developer and I believe Java is so fast because it can optimise on the fly for the most efficient code, which might fit in a level 1 cache and thus execute optimally on the processor.

Benchmarks are never "fair", but I can try to make them as fair as possible by using 'default' implementations. Joannis suggested using Hummingbird, as it is faster than Vapor. But then I could perhaps also use the 'compiled PHP' from Facebook, so there are always trade-offs.

I do something similar in JavaScript! Why is JavaScript performing faster at first?

    function fibonacci(n) {
      // BigInt gives arbitrary precision; same iterative scheme as the Swift version.
      let a = BigInt(0);
      let b = BigInt(1);
      let temp;
      for (let i = 2; i <= n; i++) {
        temp = a + b;
        a = b;
        b = temp;
      }
      return b.toString();
    }

Good point about the string interpolation, but again: why is Vapor giving timeouts due to this when even PHP can keep up?

I do not use generateAllFibonacci at all; it was leftover code to generate all the values. There are similar leftovers, e.g. the recursive algorithm in the other languages.

1 Like

Indeed, Kyle.

In the blog post I mention I was looking for an algorithm that runs locally, is well known, uses neither a database nor excessive bandwidth, and can be scaled up if the processor proves too fast. Fibonacci is easily scaled by computing e.g. Fibonacci(1000000).

Aha!
That might be a very good reason why Swift behaves differently from the other interpreted/JIT languages: they can optimise on the fly, while Swift has to be optimised up front. About the garbage collection: you can see it very well in the Java charts, in the kinks in the memory graph. The audience suspected that with more time between the consecutive concurrency workloads, Java might have some breathing room to run the GC.

Also: Swift's memory usage stays very constant. I did not use fine-grained malloc logging, as I have no idea how to start doing that under Linux (no Instruments!). But I presume memory is immediately deallocated when no longer used, so indeed thousands of malloc/free occurrences, which might not happen in JavaScript/Java/PHP.

2 Likes

Aah!
I only used the '--concurrency' flag, as I knew that one from ApacheBench. Only just now did I notice wrk's 'threads' flag (I was in a crunch while performing these tests).
One would assume that wrk increases the number of threads according to the concurrency level requested, but we all know how bad assumptions are!

For all the tests I used wrk's default number of threads. As this was a single Core i5 machine, I could have increased that to two or four, but I wonder whether Swift's responses would then be handled differently. I will try that in a follow-up test.

Thanks for the profiling, Johannes!
I was looking for 'arbitrary-length integer' libraries for all four languages:

  • JavaScript: BigInt
  • PHP: I used bcmath at first, but that was so ridiculously slow that I went with gmp, a frontend for the GNU MP library.
  • Java: java.math.BigInteger

So, basically all the standard 'big integer' extensions.
I actually think it's a good thing that the algorithm spends so much time in BigInt, because that means the benchmark is processor bound, not network or memory bound! It would be nice to see, though, where the other big contender, Java, gets its superior speed.

Would you suggest dropping the Fibonacci level to e.g. 100, so that more requests can be made and the influence of the chosen framework/technology is bigger?
But my biased point was also that a compiled language would be superior to a scripting language.

As I see it now, scripting languages trade memory for CPU speed.

Everyone:
(I had a dinner party, hence the tardy responses from me.)

Thank you SO much for all your wonderful input. I really wanted Swift to come out on top, and was very surprised that Java and JavaScript, using bog-standard default implementations, were faster at first. Swift does come out on top memory-wise.

I am a benchmarking noob. The last time I benchmarked a server was indeed with ApacheBench, against Apache, some 20 years ago.

So a couple of things are still unclear to me.

  1. When I make a request and the server answers 'got it!' (ACK), does that count as a successful request for wrk?
  2. And then, when the server does not answer within the timeout period, I presume those requests count as timeouts?
    ➔ I see that in all the other languages: JS, PHP, Java.
  3. For Vapor+Swift I immediately see some packets being dropped, but all the other requests are somehow handled differently and/or correctly? Is that why I can easily go to a concurrency of 100,000, even though the network might be completely saturated? Wouldn't wrk then give apr_socket_recv: Connection reset by peer (104) errors? I got those a lot with ab, which does not support HTTP/1.1.

Questions, questions, so many questions!

But still the nagging doubt remains: Java can easily handle thousands of requests within the 2-second timeout, while Vapor drops several requests immediately.

Perhaps the Java implementation does indeed do URL caching; it was created by a Java devops engineer. I was looking to have wrk use random parameters per request to prevent this, but I could not get it to do that.

Still, all in all, it makes for interesting statistics. But now I'm seriously beginning to doubt what wrk considers an answer.
Perhaps Vapor simply returns empty 503 errors instead of calculating and returning a proper result?

Yeah, this was bothering me as well. My initial analysis was that the hops to async code were the issue but @weissi pointed out to me that the RouteBuilder actually defaults to sync code. Handily, this is an even better explanation of our results than the prior one.

Vapor has always run synchronous code directly on NIO's event loops. That means your Fibonacci calculations are blocking our I/O, which unfortunately ties our latencies directly to how much CPU work you're doing; and as we discussed, that's quite a lot.

Adding the word async to your function doesn't improve throughput, but it does substantially improve tail latencies. That's just because we can keep serving new requests without the event loops getting blocked.
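To make that concrete, a rough sketch (the route names and the fibonacci(_:) helper are hypothetical, standing in for the OP's handler; `app` is the Vapor Application):

    // Synchronous handler: Vapor runs this directly on a NIO event loop,
    // so the loop can't serve other connections while the calculation runs.
    app.get("fib") { req -> String in
        fibonacci(10_000)            // hypothetical CPU-heavy BigUInt loop
    }

    // Async handler: Vapor hops onto the Swift Concurrency pool, leaving
    // the event loop free. Same throughput, much better tail latencies.
    app.get("fib-async") { req async -> String in
        fibonacci(10_000)
    }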

That takes us to problem two: the failed requests at low load. This seems like a weird Vapor issue to me, because if you look at a packet capture, these failed requests principally happen at the start of the load. That suggests an error in channel setup, but I'm not seeing any logs from there. @0xTim, any idea what Vapor is up to?

4 Likes