I'm developing a performance measurement tool using WebSockets with Vapor's WebSocket Kit. I've observed that when the load reaches around 4,000,000 bytes, the underlying socket returns EWOULDBLOCK. This leads to significant delays before the socket resumes writing outbound data. To investigate further, I built a custom version of Vapor's WebSocket Kit (GitHub - Local-Connectivity-Lab/websocket-kit: WebSocket client library built on SwiftNIO) for testing purposes.
Setup:
A WebSocket server based on Vapor's WebSocket Kit, with maxFrameSize configured to 67,108,864 bytes and a single-threaded event loop group.
A WebSocket client based on Vapor's WebSocket Kit, with an event loop group of 4 threads.
Hypothesis #1:
The issue might be related to how SwiftNIO gathers large buffers into writev calls, and whether the system would handle smaller writes better. To test this, I added a custom ChannelHandler named BufferWritableMonitorHandler to my custom WebSocket Kit. This handler repeatedly writes min(buffer.readableBytes, 65536) bytes until no bytes are left in the buffer. Both setups (with and without small writes) produced similar performance within an acceptable tolerance, which indicates that small writes are probably not the primary issue. (A sketch of such a handler is shown after the results table.)
| Without Small Writes (ms) | With Small Writes (ms) |
| --- | --- |
| 39898 | 41710 |
| 39978 | 37150 |
| 40214 | 40629 |
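For concreteness, here is a minimal sketch of the kind of chunking outbound handler described in Hypothesis #1, assuming it sits near the bottom of the pipeline where outbound data is still a raw ByteBuffer; the name and placement are illustrative, not the actual BufferWritableMonitorHandler:

```swift
import NIOCore

/// Sketch only: splits each outbound ByteBuffer into chunks of at most 64 KiB
/// before forwarding them down the pipeline.
final class ChunkingWriteHandler: ChannelOutboundHandler {
    typealias OutboundIn = ByteBuffer
    typealias OutboundOut = ByteBuffer

    private let maxChunkSize = 65_536

    func write(context: ChannelHandlerContext, data: NIOAny, promise: EventLoopPromise<Void>?) {
        var buffer = self.unwrapOutboundIn(data)
        // Forward full-size slices without a promise; attach the caller's
        // promise to the final (possibly short) slice so completion is
        // reported once everything has been written.
        while buffer.readableBytes > maxChunkSize {
            let chunk = buffer.readSlice(length: maxChunkSize)!
            context.write(self.wrapOutboundOut(chunk), promise: nil)
        }
        context.write(self.wrapOutboundOut(buffer), promise: promise)
    }
}
```

Either way, the chunks will likely still be coalesced back into writev calls by SwiftNIO's pending-writes machinery, which would be consistent with chunk size making little difference here.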
Hypothesis #2:
Another possibility is that the SO_SNDBUF socket option needs adjustment to improve write performance. I tested different buffer sizes (1 MB and 4 MB) but still observed noticeable delays between when the socket first returns EWOULDBLOCK and when it resumes writing.
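For reference, this is roughly the shape of that configuration; a minimal sketch assuming a plain ClientBootstrap rather than the actual WebSocket Kit connect path:

```swift
import NIOCore
import NIOPosix
#if canImport(Glibc)
import Glibc
#else
import Darwin
#endif

// Sketch only: raise SO_SNDBUF to 4 MB on a SwiftNIO client channel before connecting.
let group = MultiThreadedEventLoopGroup(numberOfThreads: 4)
let bootstrap = ClientBootstrap(group: group)
    .channelOption(ChannelOptions.socket(SocketOptionLevel(SOL_SOCKET), SO_SNDBUF),
                   value: 4 * 1024 * 1024)
```

Note that SO_SNDBUF only bounds how much the kernel will buffer locally; the data still has to drain at whatever rate the network and the receiver allow.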
Hypothesis #3:
The channel's write buffer watermarks might be a bit small. I added code to the custom WebSocket Kit build to configure the writeBufferWaterMark, but there was no significant improvement.
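And this is roughly how the watermarks can be raised; again a sketch rather than the exact change in the custom build:

```swift
import NIOCore
import NIOPosix

// Sketch only: raise the write buffer watermarks so the channel stays writable
// with more outbound bytes queued in SwiftNIO's pending-writes buffer.
let waterMark = ChannelOptions.Types.WriteBufferWaterMark(low: 1 * 1024 * 1024,
                                                          high: 8 * 1024 * 1024)
let group = MultiThreadedEventLoopGroup(numberOfThreads: 4)
let bootstrap = ClientBootstrap(group: group)
    .channelOption(ChannelOptions.writeBufferWaterMark, value: waterMark)
```

These watermarks only control when Channel.isWritable flips; they don't change how many bytes the kernel accepts before returning EWOULDBLOCK, which may be why they didn't help here.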
I reproduced the same test using OkHttp in Java, and some runs show that it doesn't have this issue: the payload is sent outbound without any noticeable delay.
Upon further inspection, I found that epoll/kqueue is used to monitor and notify SwiftNIO when the socket becomes writable. Could these mechanisms be contributing to the issue?
Any insights or suggestions on how to address this performance problem would be greatly appreciated! If there are any issues with my setup or if you have any recommendations, please let me know!
If you see EWOULDBLOCK then indeed, the SNDBUF is exceeded and the kernel won't accept more bytes. Now, as you did, you can increase that buffer, but there'll always be a limit.
After that, these bytes need to be sent over the network (or loopback) and actually be read by the other side. If the other side reads the bytes too slowly, then you'll always have times when you won't be able to write more. TCP's flow control and congestion windows make it impossible for your kernel to send significantly more than the other side is willing to accept. That's a feature, of course, or else you could be taking up arbitrary resources of a remote machine.
What is the other side? As in what server are you testing against, and what's the latency to that server?
Thanks for the explanation. I tried to isolate the problem with a mock server and mock client (GitHub - johnnzhou/ws-perf-demo). I increased the RCVBUF to 4 MB on the mock server, with 4 threads in the EventLoopGroup, bound to loopback, and I also increased the SNDBUF to 4 MB. The result is slightly improved compared to previous runs: the client took about 38842 ms to complete and the server took about 38829 ms to receive the full load. When I increased the SNDBUF and RCVBUF to 6 MB, the client and server took about 27962 ms and 27941 ms respectively.
But I expected loopback performance to be a lot better than this. Did I completely misjudge what to expect?
In the actual testbed, we are trying to create a test client following M-Lab's NDT7 protocol. Their server implementation is located here. I'm facing a noticeable performance bottleneck when running the test client implemented with SwiftNIO (4 threads, 4 MB SNDBUF) compared to other test clients (the official JS and Go implementations) in the same network environment on the same machine. Although I have to admit it is very hard to measure network performance, I observed a consistent slowdown in outbound speed, which is sampled every 250 ms.
Yes, in production, all the binaries are in release mode.
The write pattern I'm currently using starts with a small message (1 << 13 bytes). It then tries to increase the message size to accommodate fast clients, up to the maximum message size limit. Messages are sent one after another. A pseudo-algorithm looks as follows:
```
currentLoad = 1 << 13          // initial message size: 8 KiB
MAX_MESSAGE_SIZE = 1 << 24     // message size cap: 16 MiB

while we are still within MEASUREMENT_INTERVAL {
    // Bytes that must have actually left the buffer before we scale up;
    // once the cap is exceeded, stop scaling entirely.
    nextSize = (currentLoad > MAX_MESSAGE_SIZE) ? Int.MAX : 16 * currentLoad
    if (totalBytes - bufferedSize) >= nextSize {
        currentLoad *= 2
    }
    // Only enqueue another message while fewer than 7 messages' worth of
    // bytes are still buffered (simple backpressure).
    if (bufferedSize < 7 * currentLoad) {
        ws.send(currentLoad)   // send a message of currentLoad bytes
        totalBytes += currentLoad
    }
}
```
I followed the trace through SwiftNIO and noticed that PendingWritesManager buffers and handles all writes. I don't think it is the issue, because most of the time it uses writev to reduce the number of syscalls, and it waits for a signal from kqueue/epoll when the socket is not writable.
I did some basic profiling and found that memory usage spiked to 2 GB and CPU also spiked to 100%. I guess that points to some kind of bottleneck, either in my code or simply not enough resources to handle the load effectively? I'm not sure what caused the excessive memory usage (an unnecessary copy of a ByteBuffer? But ByteBuffer uses CoW and most operations are reading data. Did I miss something here?).
This suggests that you're producing the writes faster than your network / server on the other side can consume them.
What in your code monitors how fast the bytes are actually being sent?
Unnecessary copies would show up in CPU, not memory. If you have an unnecessary CoW, that's a super annoying allocate+copy+deallocate of the old buffer, but it doesn't make memory usage spike.
The client samples every 250 ms. The test client tracks the total number of bytes sent and queries the number of outbound bytes currently buffered.
```
let numBytesSent = total - bufferedAmount   // bytes that have actually left the buffer
let elapsedTime = now - start
let speed = numBytesSent / elapsedTime
```
By doing this calculation every 250 ms, the test client reports the throughput of our network.
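As a concrete illustration, a sketch of how that sampling could be scheduled on the event loop, assuming hypothetical totalBytesSent and bufferedBytes counters maintained by the test client (not the actual test client code):

```swift
import NIOCore

// Sketch only: sample throughput every 250 ms on the channel's event loop.
// `totalBytesSent` and `bufferedBytes` are hypothetical counters maintained
// by the test client.
func startSampling(on eventLoop: EventLoop,
                   start: NIODeadline,
                   totalBytesSent: @escaping () -> Int,
                   bufferedBytes: @escaping () -> Int) -> RepeatedTask {
    return eventLoop.scheduleRepeatedTask(initialDelay: .milliseconds(250),
                                          delay: .milliseconds(250)) { _ in
        // Bytes that have actually left the client's buffer so far.
        let numBytesSent = totalBytesSent() - bufferedBytes()
        let elapsedSeconds = Double((NIODeadline.now() - start).nanoseconds) / 1_000_000_000
        let throughputMbps = Double(numBytesSent) * 8.0 / elapsedSeconds / 1_000_000.0
        print("outbound throughput: \(throughputMbps) Mbit/s")
    }
}
```

Because the sampler runs on the event loop, the counters it reads need to be updated on that same event loop or protected by a lock, which ties into the data-race question below.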
I'm investigating whether I missed anything in the test client implementation (synchronization issues, missing locks, data races, ...).