'Standard' vapor website drops 1.5% of requests, even at concurrency of 100!

I think what @axello and I have meant is HTTP connection reuse. My apologies for the confusion from my use of the wrong terminology.

Any situation which tries to connect more than 128 connections at closely the same time…


Vapor doesn't appear to have any such controls…? Even at TRACE logging levels there's nothing useful and no errors reported, even when I can see errors happening clear as day in syscall traces.

Spelunking in the NIO code, I found plenty of places where it silently discards error information - e.g. read & write errors are often translated to just an "eof" error constant, discarding the actual errno.

I had some partial luck using breakpoints to catch the offending code paths, but "live" debugging like that is problematic for cases like this, where thousands of things are happening essentially concurrently and at very high speed, as the consequence of pausing the program for a breakpoint is to completely invalidate any further execution. And it gets old real fast having to iteratively refine the breakpoints, restarting the entire benchmark between every one. :confused:

Unfortunately that's a substantially less useful benchmark, because you lose almost all insight into success rates (given a latency requirement). And real-world SLOs are often set in terms of a small but non-zero acceptable failure rate (in my experience - I'm sure it varies).

I don't think @axello's test methodology is in any way wrong. Unlimited latency testing is not usually representative of real-world behaviour, and constant request rate benchmarks test subtly different things (e.g. they'll behave differently if the response latency is variable vs consistent).

The problem is the tools (wrk) and the kernel…

Yeah, that's a great little demonstration of this.

It's not hard to fix this bug in wrk - I patched my local copy - but it's only part of the issues we've been seeing…

wrk bugs, kernel bugs, and bad kernel configuration defaults

I'll talk simply here for the sake of clarity, but of course there's room for error here. I'm fairly confident in these assertions, but not certain.

It's worth noting that - aside from some missing documentation, perhaps - I'm not seeing any major issues with Vapor or NIO. I'm still not entirely sure what's going on server-side with the false connection closures, but I assume it's basically the same kernel bug as what wrk is suffering from on the client side.

wrk bugs

I've already detailed some of the ways wrk is very buggy, in prior posts. @johannesweiss highlighted some others, like its misattribution of connect failures to read/write errors (only partly its fault - more on that in the next section).

Another one is that it doesn't count connection establishment in its latency numbers - it starts its counter from when it starts writing the HTTP request. With HTTP connection reuse that tends to get lost in the noise, in the happy cases, because the steady-state behaviour is that no TCP connections are being opened or closed, but without reuse, or in unhappy cases like a heavily-loaded server (kinda the point of a benchmark, usually), it can cause wildly inaccurate latency numbers.

Kernel bugs

I don't know if it's specific to the use of kevents, or a broader problem with sockets in the macOS kernel, but there's several clear bugs:

  • If the listen backlog is exhausted the client side is supposed to receive a connection failure event (either connect fails directly, in blocking mode, or EV_EOF is set on any kevents for the socket, in non-blocking mode).

    What actually happens is that the connection is falsely claimed to have been made (kevent posted saying the socket is writable without EV_EOF set). Even worse, you can actually write to it! The write call succeeds! Then, another kevent is posted saying the socket is readable but closed (EV_EOF is finally set).

    The reason I believe this is a bug (or a design flaw, at least) is that it means there's no way to actually tell if the connection was actually opened or refused, in this situation. I realise the kernel might think it's smart by letting you write to the socket before it's actually even open - I assume that's its intent - but it makes for very confusing and misleading behaviour sometimes.

    I also think it's particularly annoying for localhost connections since there's no real reason the kernel can't immediately know the connection attempt will fail, and very clearly fail it. It could do that right at the connect call, even in non-blocking mode.

  • When a connection is refused because the listen backlog is full, it's specifically supposed to fail with ECONNREFUSED on the client side, according to the listen man page.

    Whether you pull errno from that failed read, or from connect itself if in blocking mode, the error code you actually get is ECONNRESET or EBADF, depending on whether you hit somaxconn first (more on that shortly) or the listen backlog limit, respectively.

    That's not supposed to happen - ECONNRESET is supposed to mean the previously open connection was closed by the other side (specifically, an RST packet was received from the other side). (technically there's nuances regarding half-closed states, but not relevant here)

    And EBADF is obviously not supposed to happen either, and even worse really, because it indicates the kernel has already torn down the entire file descriptor even though the userspace code has not actually called close on it yet (I'm not certain that's wrong per whatever design spec governs these things, but it's certainly unexpected and problematic).

This all explains why making wrk open all the connections synchronously works - when using an otherwise sensible number of wrk threads, e.g. 20 on my machine, you can only ever have that many concurrent connection attempts at once, which is well below the relevant kernel & server limits.

Note also that if the server is outright not listening (e.g. you kill the server while wrk is running, so that the listening socket closes completely) then you do start getting correctly-reported connection failures. So this bug is specific to the case where there is a server at the destination but the kernel refuses to connect to.

Bad kernel configuration defaults

We already covered the file descriptor limit. 2560 doesn't seem an unreasonable default to me, for macOS, although the discrepency @tera noted between CLI & GUI apps (2560 vs 256) is concerning, and 256 is unreasonably few especially for GUI apps which tend to have a lot more spurious file activity than simple CLI apps.

Arguably it's reasonable to expect a server to specifically tune this higher (e.g. into the hundreds of thousands, potentially) on a per-process basis. As a system-wide default, 2560 makes sense.

Beyond that, it turns out the magic knob is the kern.ipc.somaxconn sysctl. That basically determines how many in-flight, incoming connecting attempts you can have at any one time.

It appears to be a per-process limit (or per-socket, maybe), but I couldn't find any documentation of that (in fact, it's barely documented at all - notwithstanding the various cargo-culting online about making its value bigger). I just noted that changes to it don't affect already-running processes.

On macOS it defaults to 128. That's way too low for a web server. Maybe it's fine for "client" or "consumer" applications, which I guess is macOS's target audience, but even so it's not clear to me why it has to be quite that low.

I believe its intent is to guard against SYN attacks and the like. Still, it can do that just fine at higher values, while not crippling normal use.

When that limit is hit, irrespective of your listen backlog size, connection attempts fail. This is what was causing all of the read/write errors reported by wrk (which were actually all connection failures but are being misreported by the macOS kernel, as detailed earlier).

The listen man page actually maybe hints at this, where it lists a very important known bug:

The backlog is currently limited (silently) to 128.

I couldn't quite confirm it - I can't find the relevant kernel source code - but I think this is referring to the somaxconn default, not a separate bug in listen itself. i.e. after raising somaxconn to something sane, and setting the listen backlog to 3,000, I basically couldn't get any connection failures irrespective of how many concurrent connections I used (it seems like wrk is simply incapable of launching 3,000 effectively simultaneous connection attempts, at least while the server is running on the same machine and competing for CPU time).

FWIW the default listen backlog with Vapor is 256, which is unfortunate as it misleads you into thinking that the limit is, well, 256. Maybe it's not technically Vapor's concern what other limits might apply to the listen backlog, but it sure would be nice if the Vapor documentation mentioned the somaxconn sysctl.

In conclusion

Plenty of bugs to go round! :crazy_face:

With all these things fixed, I now see zero connect/read/write errors irrespective of load. Timing information is reported correctly.

@johannesweiss previously asserted that wrk erroneously miscounts its own timeouts as read/write errors, which I can totally believe given the state of the wrk code, but I have not yet seen that problem myself. I couldn't even find the code which actually executes the timeout and closes the socket, within wrk, with a brief search.

wrk patch
diff --git a/src/ae.c b/src/ae.c
index e66808a..99fa732 100644
--- a/src/ae.c
+++ b/src/ae.c
@@ -398,6 +398,7 @@ int aeProcessEvents(aeEventLoop *eventLoop, int flags)
         numevents = aeApiPoll(eventLoop, tvp);
         for (j = 0; j < numevents; j++) {
             aeFileEvent *fe = &eventLoop->events[eventLoop->fired[j].fd];
             int mask = eventLoop->fired[j].mask;
diff --git a/src/ae.h b/src/ae.h
index 827c4c9..6d0d63a 100644
--- a/src/ae.h
+++ b/src/ae.h
@@ -38,9 +38,10 @@
 #define AE_OK 0
 #define AE_ERR -1
-#define AE_NONE 0
-#define AE_READABLE 1
-#define AE_WRITABLE 2
+#define AE_NONE     0
+#define AE_READABLE (1 << 0)
+#define AE_WRITABLE (1 << 1)
+#define AE_EOF      (1 << 2)
 #define AE_FILE_EVENTS 1
 #define AE_TIME_EVENTS 2
diff --git a/src/ae_kqueue.c b/src/ae_kqueue.c
index 6796f4c..d4faa79 100644
--- a/src/ae_kqueue.c
+++ b/src/ae_kqueue.c
@@ -124,6 +124,13 @@ static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
             int mask = 0;
             struct kevent *e = state->events+j;
+            if (e->flags & EV_EOF) {
+                mask |= AE_EOF;
+                printf("EOF in kevent { ident: %lu, filter: %hx, flags: %hx, fflags: %hx, data: %ld, udata: %p }\n", e->ident, e->filter, e->flags, e->fflags, e->data, e->udata);
+            } else {
+//                printf("No EOF.\n");
+            }
             if (e->filter == EVFILT_READ) mask |= AE_READABLE;
             if (e->filter == EVFILT_WRITE) mask |= AE_WRITABLE;
             eventLoop->fired[j].fd = e->ident;
diff --git a/src/net.c b/src/net.c
index 75916f7..10764fe 100644
--- a/src/net.c
+++ b/src/net.c
@@ -2,6 +2,7 @@
 #include <errno.h>
 #include <unistd.h>
+#include <string.h>
 #include <sys/ioctl.h>
 #include "net.h"
@@ -17,6 +18,7 @@ status sock_close(connection *c) {
 status sock_read(connection *c, size_t *n) {
     ssize_t r = read(c->fd, c->buf, sizeof(c->buf));
     *n = (size_t) r;
+    if (0 > r) { printf("read failed: %d (%s)\n", errno, strerror(errno)); }
     return r >= 0 ? OK : ERROR;
@@ -25,7 +27,7 @@ status sock_write(connection *c, char *buf, size_t len, size_t *n) {
     if ((r = write(c->fd, buf, len)) == -1) {
         switch (errno) {
             case EAGAIN: return RETRY;
-            default:     return ERROR;
+            default:     printf("write failed: %d (%s)\n", errno, strerror(errno)); return ERROR;
     *n = (size_t) r;
@@ -35,5 +37,6 @@ status sock_write(connection *c, char *buf, size_t len, size_t *n) {
 size_t sock_readable(connection *c) {
     int n, rc;
     rc = ioctl(c->fd, FIONREAD, &n);
+    if (0 > rc) { printf("ioctl(…, FIONREAD, …) failed: %d (%s)\n", errno, strerror(errno)); }
     return rc == -1 ? 0 : n;
diff --git a/src/wrk.c b/src/wrk.c
index 51f46f7..6fc5249 100644
--- a/src/wrk.c
+++ b/src/wrk.c
@@ -101,11 +101,18 @@ int main(int argc, char **argv) {
     cfg.host = host;
+    uint64_t remainingUnevenConnections = cfg.connections % cfg.threads;
     for (uint64_t i = 0; i < cfg.threads; i++) {
         thread *t      = &threads[i];
         t->loop        = aeCreateEventLoop(10 + cfg.connections * 3);
         t->connections = cfg.connections / cfg.threads;
+        if (0 < remainingUnevenConnections) {
+            t->connections += 1;
+            --remainingUnevenConnections;
+        }
         t->L = script_create(cfg.script, url, headers);
         script_init(L, t, argc - optind, &argv[optind]);
@@ -240,21 +247,51 @@ static int connect_socket(thread *thread, connection *c) {
     fd = socket(addr->ai_family, addr->ai_socktype, addr->ai_protocol);
+    if (0 > fd) {
+        printf("socket failed: %d (%s)\n", errno, strerror(errno));
+        goto error;
+    }
     flags = fcntl(fd, F_GETFL, 0);
-    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+    //printf("Socket flags: %x\n", flags);
+    if (0 > fcntl(fd, F_SETFL, flags | O_NONBLOCK)) {
+        printf("fcntl(…, F_SETFL, flags | O_NONBLOCK) failed: %d (%s)\n", errno, strerror(errno));
+        goto error;
+    }
+    if (0 != c->start) {
+        printf("Connecting socket which already has a start time, %llu.\n", c->start);
+    }
+    c->start = time_us();
     if (connect(fd, addr->ai_addr, addr->ai_addrlen) == -1) {
-        if (errno != EINPROGRESS) goto error;
+        if (errno != EINPROGRESS) {
+            printf("connect failed: %d (%s)\n", errno, strerror(errno));
+            goto error;
+        }
     flags = 1;
-    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flags, sizeof(flags));
+    if (0 != setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flags, sizeof(flags))) {
+        printf("setsockopt(%d, IPPROTO_TCP, TCP_NODELAY, %p (%#x), %lu) failed: %d (%s)\n", fd, &flags, flags, sizeof(flags), errno, strerror(errno));
+        goto error;
+    }
+//    if (0 != setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &flags, sizeof(flags))) {
+//        printf("setsockopt(%d, SOL_SOCKET, SO_REUSEADDR, %p (%#x), %lu) failed: %d (%s)\n", fd, &flags, flags, sizeof(flags), errno, strerror(errno));
+//        goto error;
+//    }
+    flags = /*AE_READABLE |*/ AE_WRITABLE;
     if (aeCreateFileEvent(loop, fd, flags, socket_connected, c) == AE_OK) {
         c->parser.data = c;
         c->fd = fd;
         return fd;
+    } else {
+        printf("aeCreateFileEvent(…) failed: %d (%s) [maybe]\n", errno, strerror(errno));
@@ -267,6 +304,8 @@ static int reconnect_socket(thread *thread, connection *c) {
     aeDeleteFileEvent(thread->loop, c->fd, AE_WRITABLE | AE_READABLE);
+    c->start = 0;
+    printf("Reconnecting…\n");
     return connect_socket(thread, c);
@@ -345,10 +384,12 @@ static int response_complete(http_parser *parser) {
         c->delayed = cfg.delay;
+        c->start = time_us();
         aeCreateFileEvent(thread->loop, c->fd, AE_WRITABLE, socket_writeable, c);
     if (!http_should_keep_alive(parser)) {
+        printf("Should not keep alive.\n");
         reconnect_socket(thread, c);
         goto done;
@@ -362,17 +403,26 @@ static int response_complete(http_parser *parser) {
 static void socket_connected(aeEventLoop *loop, int fd, void *data, int mask) {
     connection *c = data;
+    if (mask & AE_EOF) {
+        printf("Socket %d failed to connect.\n", fd);
+        goto error;
+    }
     switch (sock.connect(c, cfg.host)) {
         case OK:    break;
-        case ERROR: goto error;
+        case ERROR: printf("sock.connect failed: %d (%s)\n", errno, strerror(errno)); goto error;
         case RETRY: return;
+    printf("Socket %d connected.\n", fd);
     http_parser_init(&c->parser, HTTP_RESPONSE);
     c->written = 0;
-    aeCreateFileEvent(c->thread->loop, fd, AE_READABLE, socket_readable, c);
     aeCreateFileEvent(c->thread->loop, fd, AE_WRITABLE, socket_writeable, c);
+//    aeCreateFileEvent(c->thread->loop, fd, AE_READABLE, socket_readable, c); // Arguably shouldn't do this until the request has been fully written, because a well-behaving HTTP server cannot respond until then anyway.
+    socket_writeable(loop, fd, data, mask);
@@ -396,7 +446,7 @@ static void socket_writeable(aeEventLoop *loop, int fd, void *data, int mask) {
         if (cfg.dynamic) {
             script_request(thread->L, &c->request, &c->length);
-        c->start   = time_us();
+        //c->start   = time_us();
         c->pending = cfg.pipeline;
@@ -406,7 +456,7 @@ static void socket_writeable(aeEventLoop *loop, int fd, void *data, int mask) {
     switch (sock.write(c, buf, len, &n)) {
         case OK:    break;
-        case ERROR: goto error;
+        case ERROR: /*printf("sock.write failed\n");*/ goto error;
         case RETRY: return;
@@ -414,6 +464,7 @@ static void socket_writeable(aeEventLoop *loop, int fd, void *data, int mask) {
     if (c->written == c->length) {
         c->written = 0;
         aeDeleteFileEvent(loop, fd, AE_WRITABLE);
+        aeCreateFileEvent(c->thread->loop, fd, AE_READABLE, socket_readable, c);
@@ -430,12 +481,19 @@ static void socket_readable(aeEventLoop *loop, int fd, void *data, int mask) {
     do {
         switch (sock.read(c, &n)) {
             case OK:    break;
-            case ERROR: goto error;
+            case ERROR: /*printf("sock.read failed\n");*/ goto error;
             case RETRY: return;
-        if (http_parser_execute(&c->parser, &parser_settings, c->buf, n) != n) goto error;
-        if (n == 0 && !http_body_is_final(&c->parser)) goto error;
+        if (http_parser_execute(&c->parser, &parser_settings, c->buf, n) != n) {
+            printf("http_parser_execute failed\n");
+            goto error;
+        }
+        if (n == 0 && !http_body_is_final(&c->parser)) {
+            printf("http_body_is_final failed\n");
+            goto error;
+        }
         c->thread->bytes += n;
     } while (n == RECVBUF && sock.readable(c) > 0);