SwiftNIO WebSocket client outbound write performance

I'm developing a performance measurement tool using WebSockets with Vapor's WebSocket Kit. I've observed that once the outstanding load reaches around 4,000,000 bytes, the underlying socket returns EWOULDBLOCK, which leads to significant delays before the socket resumes writing outbound data. To investigate further, I built a custom fork of Vapor's WebSocket Kit (GitHub - Local-Connectivity-Lab/websocket-kit: WebSocket client library built on SwiftNIO) for testing purposes.

Setup:

  • A WebSocket server based on Vapor's WebSocket Kit, with maxFrameSize configured to 67,108,864 bytes and a single-threaded event loop group.
  • A WebSocket client based on Vapor's WebSocket Kit, with an event loop group of 4 threads.
  • MacBook Pro 16'' with M3 Pro
  • Swift Version: 5.10
  • Test repo: GitHub - johnnzhou/ws-perf-demo

Hypothesis #1:
The issue might be that SwiftNIO (or the system) prefers smaller writes via writev. To test this, I added a custom ChannelHandler named BufferWritableMonitorHandler to my fork of WebSocket Kit. The handler writes min(buffer.readableBytes, 65536) bytes repeatedly until no bytes are left in the buffer. Both setups (with and without small writes) produced similar results within acceptable tolerance, which suggests that write size is not the primary issue.

Without Small Writes (ms)   With Small Writes (ms)
39898                       41710
39978                       37150
40214                       40629
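For context, the chunking rule that handler applies can be sketched in isolation. This is a minimal, plain-Swift sketch of the split (the real BufferWritableMonitorHandler in my fork operates on NIO ByteBuffers inside a ChannelHandler; the function name here is illustrative):

```swift
// Standalone sketch of the chunking rule used by BufferWritableMonitorHandler:
// emit min(remaining, 65536) bytes at a time until the buffer is drained.
func chunkSizes(forBufferOf readableBytes: Int, limit: Int = 65_536) -> [Int] {
    var remaining = readableBytes
    var chunks: [Int] = []
    while remaining > 0 {
        let n = min(remaining, limit)   // never write more than the limit at once
        chunks.append(n)
        remaining -= n
    }
    return chunks
}
```

So a 150,000-byte buffer would go out as two full 65,536-byte writes plus an 18,928-byte tail.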

Hypothesis #2:
Another possibility is that the SO_SNDBUF socket option needs to be increased to improve write performance. I tested buffer sizes of 1 MB and 4 MB but still observed noticeable delays between the socket first returning EWOULDBLOCK and it resuming writing.
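For reference, this is roughly how the socket buffers can be raised on a SwiftNIO client channel (a sketch, assuming the NIOCore ChannelOptions API; the bootstrap variable names are illustrative):

```swift
// Sketch: raising SO_SNDBUF/SO_RCVBUF on the client channel via SwiftNIO's
// ClientBootstrap, using option accessors from NIOCore's ChannelOptions.
let bootstrap = ClientBootstrap(group: group)
    .channelOption(ChannelOptions.socketOption(.so_sndbuf), value: 4 * 1024 * 1024)
    .channelOption(ChannelOptions.socketOption(.so_rcvbuf), value: 4 * 1024 * 1024)
```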

Hypothesis #3:
The channel's write buffer water mark may be too small. I added code to my custom WebSocket Kit build to configure writeBufferWaterMark, but it made no significant difference.
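The water mark configuration looks roughly like this (a sketch, assuming NIOCore's ChannelOptions types; the low/high values shown are illustrative, not the ones from my test):

```swift
// Sketch: raising the write buffer water marks so the channel stays
// writable with more bytes pending before backpressure kicks in.
let bootstrap = ClientBootstrap(group: group)
    .channelOption(
        ChannelOptions.writeBufferWaterMark,
        value: ChannelOptions.Types.WriteBufferWaterMark(low: 1 << 20, high: 1 << 23)
    )
```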

I reproduced the same test using OkHttp in Java, and some runs show it doesn't exhibit this issue: loads are written outbound without any noticeable delays.

Upon further inspection, I found that epoll/kqueue is used to monitor and notify SwiftNIO when the socket becomes writable. Could these mechanisms be contributing to the issue?

Any insights or suggestions on how to address this performance problem would be greatly appreciated! If there are any issues with my setup or if you have any recommendations, please let me know!

If you see EWOULDBLOCK then indeed the SNDBUF is exceeded and the kernel won't accept more bytes. You can increase that buffer, as you did, but there will always be a limit.
After that, those bytes need to be sent over the network (or loopback) and actually be read by the other side. If the other side reads the bytes too slowly, there will always be times when you can't write more. TCP's flow control and congestion windows make it impossible for your kernel to send significantly more than the other side is willing to accept. That's a feature, of course, or else you could take up arbitrary resources on a remote machine.

What is the other side? As in what server are you testing against, and what's the latency to that server?


Hi Johannes,

Thanks for the explanation. I tried to isolate the problem with a mock server and mock client (GitHub - johnnzhou/ws-perf-demo). I increased the RCVBUF to 4 MB on the mock server (4 threads in the EventLoopGroup, bound to loopback) and increased the SNDBUF to 4 MB on the client. The result is slightly improved compared to previous runs: the client took about 38842 ms to complete and the server took about 38829 ms to receive all loads. When I increased SNDBUF and RCVBUF to 6 MB, the client and server took about 27962 ms and 27941 ms respectively.

But I expected loopback performance to be much better than this. Did I totally misjudge the expected performance?

In the actual testbed, we are building a test client that follows M-Lab's NDT7 protocol. Their server implementation is located here. I'm seeing a noticeable performance bottleneck with the test client implemented using SwiftNIO (4 threads, 4 MB SNDBUF) compared to other test clients (the official JS and Go implementations) under the same network environment on the same machine. Although I have to admit that network performance is very hard to measure, I observed a consistent slowdown in outbound speed, sampled every 250 ms:

progress: 39.455589053574705 mbps
progress: 31.653835293670998 mbps
progress: 26.428077478795817 mbps
progress: 22.683289463839973 mbps
progress: 19.86803122634133 mbps
progress: 17.674438077572788 mbps
progress: 15.917064023147367 mbps
progress: 14.477556727229976 mbps
progress: 13.276827150357045 mbps
progress: 12.260014551247195 mbps
progress: 11.387868794149659 mbps
progress: 10.63156096694044 mbps
progress: 9.96946485399164 mbps
progress: 9.384991347989965 mbps
progress: 8.865261583957734 mbps
progress: 8.40006825727957 mbps
progress: 7.981264883602526 mbps
progress: 7.602241882862475 mbps
progress: 7.257583113073676 mbps
progress: 6.942820191436118 mbps
progress: 6.654222824931202 mbps
progress: 6.3886626608093735 mbps
progress: 6.14348527484113 mbps
progress: 5.916432460001552 mbps
progress: 5.705562830350746 mbps
progress: 5.509207256007956 mbps
progress: 5.325917098342166 mbps
progress: 5.154428967249834 mbps
progress: 4.993641013150573 mbps
progress: 4.842582062055885 mbps
progress: 4.700391704476466 mbps
progress: 4.566314333091406 mbps
progress: 4.439673373323003 mbps
progress: 4.319869175574044 mbps
progress: 4.206359568335116 mbps
progress: 4.098662426742249 mbps

As a reference, the throughput reported by the official JS client is around 20 Mbps.

I was wondering if I missed something while implementing the test client using SwiftNIO.

Thanks for your help in advance!

The consistent slowdown is odd. Unfortunately, I don't really know much about Vapor's WebsocketKit.

Did you check what's actually happening during your tests? Are the CPUs at 100%, how's the memory behaving?

And just to be sure: you're always running everything in release mode, right?

Also what's the write pattern you're doing? Big messages or small messages? Many at once or one after the other?

Hi Johannes,

Yes, in production, all the binaries are in release mode.

The write pattern I'm currently using starts with a small message (1 << 13 bytes). It then tries to increase the message size to accommodate fast clients, up to the maximum message size limit. Messages are sent one after another. The pseudocode looks as follows:

var currentLoad = 1 << 13          // start with 8 KiB messages
let MAX_MESSAGE_SIZE = 1 << 24     // cap at 16 MiB

while stillWithin(MEASUREMENT_INTERVAL) {
    // Double the message size once 16x the current size has been flushed.
    let nextSize = (currentLoad > MAX_MESSAGE_SIZE) ? Int.max : 16 * currentLoad
    if (totalBytes - bufferedSize) >= nextSize {
        currentLoad *= 2
    }

    // Only enqueue another message if fewer than 7 messages' worth are buffered.
    if bufferedSize < 7 * currentLoad {
        ws.send(message(ofSize: currentLoad))
        totalBytes += currentLoad
    }
}

I followed the trace in SwiftNIO and noticed that PendingWritesManager buffers and handles all writes. I don't think it is the issue, because most of the time it uses writev to reduce the number of syscalls, and it waits for a signal from kqueue/epoll when resources are constrained.

I did some basic profiling and found that memory usage spiked to 2 GB and CPU spiked to 100%. I guess that indicates some kind of bottleneck, either in my code or insufficient resources to handle the load effectively. I'm not sure what caused the excessive memory usage (an unnecessary copy of a ByteBuffer? But ByteBuffer uses CoW and most operations are reads. Did I miss something here?).

This suggests that you're producing the writes faster than your network / server on the other side can consume them.

What in your code monitors how fast the bytes are actually being sent?

Unnecessary copies would show up in CPU, not memory. An unnecessary CoW is super annoying (allocate + copy + deallocate the old one) but it doesn't make memory spike.

Hi Johannes,

Thanks for the explanation.

The client samples every 250 ms. The test client tracks the total number of bytes sent and asks the channel for the number of writable bytes currently buffered:

let numBytesSent = total - bufferedAmount
let elapsedTime = now - start

let speed = numBytesSent / elapsedTime

By doing this calculation every 250 ms, the test client reports the throughput of the network.
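With the bytes-to-megabits conversion made explicit, the sample calculation works out as follows (a sketch; the function and parameter names are stand-ins for the client's actual counters):

```swift
// One throughput sample: bytes actually handed off (total minus what is
// still buffered) divided by elapsed time, converted to megabits/second.
func throughputMbps(totalBytes: Int, bufferedBytes: Int, elapsedSeconds: Double) -> Double {
    let numBytesSent = totalBytes - bufferedBytes   // exclude bytes still queued
    return Double(numBytesSent) * 8.0 / elapsedSeconds / 1_000_000.0
}
```

For example, 2,500,000 bytes sent over one second comes out to 20 Mbps, matching the scale of the JS client's numbers above.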

I'm investigating whether I missed anything in the test client implementation (synchronization issues, missing locks, data races, ...).