I surely can! I'll cover each option in turn.
ChannelOptions.recvAllocator
This allocator is used to allocate the memory that a Channel will use to write received network data into. This option is only meaningful on Channels that receive data directly from the network. The name is derived from the most basic system call used to receive network data in POSIX systems: recv.
The signature of recv is:
ssize_t recv(int socket, void *buffer, size_t length, int flags);
Notice that this system call requires that the userspace program reading data (in this case, SwiftNIO) pass a buffer to the kernel for the kernel to write data into. NIO needs to allocate that buffer, and it allows users to influence how it does so by way of the recvAllocator channel option.
A recvAllocator is any type that conforms to the RecvByteBufferAllocator protocol:
/// Allocates `ByteBuffer`s to be used to read bytes from a `Channel` and records the number of the actual bytes that were used.
public protocol RecvByteBufferAllocator {
    /// Allocates a new `ByteBuffer` that will be used to read bytes from a `Channel`.
    func buffer(allocator: ByteBufferAllocator) -> ByteBuffer

    /// Records the actual number of bytes that were read by the last socket call.
    ///
    /// - parameters:
    ///     - actualReadBytes: The number of bytes that were used by the previously allocated `ByteBuffer`.
    /// - returns: `true` if the next call to `buffer` may return a bigger buffer than the last call to `buffer`.
    mutating func record(actualReadBytes: Int) -> Bool
}
This protocol is fairly straightforward, containing two functions. The first function, buffer(allocator:), instructs the RecvByteBufferAllocator to use the provided allocator to allocate a ByteBuffer, and returns it to the caller. The second function is used by the Channel to tell the RecvByteBufferAllocator how many bytes were actually read from the network.
What's notable here is that the first call, buffer(allocator:), does not take a size argument. This is the biggest clue as to what types that conform to RecvByteBufferAllocator are supposed to do: they are supposed to decide how big a buffer to allocate.
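To make that concrete, here's a minimal sketch of a conforming type. The name ConstantSizeRecvAllocator is made up for illustration (it behaves much like the real FixedSizeRecvByteBufferAllocator), and it assumes the protocol and buffer types come from NIOCore:

import NIOCore

/// A fixed-strategy allocator: it always hands the Channel a buffer of the
/// same capacity and never signals that a bigger buffer would help.
struct ConstantSizeRecvAllocator: RecvByteBufferAllocator {
    let capacity: Int

    func buffer(allocator: ByteBufferAllocator) -> ByteBuffer {
        // Note: no size is passed in. The allocator alone decides the capacity.
        return allocator.buffer(capacity: self.capacity)
    }

    mutating func record(actualReadBytes: Int) -> Bool {
        // A fixed strategy never grows, so the next buffer will never be bigger.
        return false
    }
}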
This turns out to be a very important job, because the size of the buffer passed to recv has a fairly profound effect on system behaviour. Consider a system where the remote peer sends small messages over TCP. If NIO constantly allocated 8kB buffers even though the actual receives were always 5 or 6 bytes, NIO programs would scale very badly, as each Channel would consume drastically more memory than it needs to.
On the other hand, consider a system using TCP where the remote peer is sending data very quickly. In this system, if we allocate too many small buffers (say, 256 bytes) then we will make far too many system calls to copy data from the kernel. This will slow the program down, making it consume more CPU cycles, and will even slow down data transfer. This is also bad.
The above two examples were with TCP, but the buffer size matters even more for UDP. In UDP, if the buffer that NIO allocates is not large enough to write the entire packet into, the packet will be truncated: the kernel will just throw the rest of the packet data on the floor and return an error (EMSGSIZE).
We can divide the world into two strategies. In TCP, we can use the result of the recv system call itself to determine how big the buffer needs to be. If, every time we call recv, we completely fill the buffer, we know it's too small: the remote peer is sending data at least as fast as we're consuming it. We can use that as a signal to make the buffer larger, which will improve throughput. Slower connections will never use larger buffers, and so we can conserve memory. This strategy is implemented by AdaptiveRecvByteBufferAllocator.
In UDP, this is a terrible idea! If the buffer is too small we'll see data loss, and data loss is not great, Bob! More broadly, UDP doesn't meaningfully benefit from tuning the size: each datagram has a fixed maximum size determined by the MTU of the link. For this reason it's faster and safer to just allocate the same size every time, namely the MTU. This strategy is (sort of) implemented by FixedSizeRecvByteBufferAllocator.
TCP channels in NIO default to using AdaptiveRecvByteBufferAllocator with a minimum size of 64 bytes, an initial size of 2048 bytes, and a maximum size of 65kB. UDP channels in NIO default to using FixedSizeRecvByteBufferAllocator allocating 2kB chunks (safely larger than the usual 1500-byte MTU of Ethernet).
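If those defaults don't suit a workload, the allocator can be swapped out via a channel option. A rough sketch, assuming NIOPosix's ServerBootstrap (the thread count and the bounds chosen here are placeholders, not recommendations):

import NIOCore
import NIOPosix

let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)

// Sketch: a TCP server whose accepted channels grow their receive buffers
// between 1kB and 128kB instead of using the defaults.
let bootstrap = ServerBootstrap(group: group)
    .childChannelOption(
        ChannelOptions.recvAllocator,
        value: AdaptiveRecvByteBufferAllocator(minimum: 1024, initial: 8192, maximum: 131_072)
    )
// ... add .childChannelInitializer { ... } and .bind(host:port:) as usual.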
ChannelOptions.datagramVectorReadMessageCount
This channel option is used to control how UDP channels perform vector reads. To explain this, we need to step back a moment.
In high performance network programming one of the biggest overheads is the cost of system calls to read and write data. System calls are fairly expensive, entailing a fairly hefty context switch that needs to flush TLBs and generally take steps to protect kernel memory from userspace. To this end, NIO goes to great lengths to minimise the number of system calls it needs to make.
On the outbound side, NIO takes advantage of the writev system call. This is a "vector write" system call. Essentially, it is similar to write, but instead of passing only one buffer of data to send we can pass an array of buffers to send in a single system call. The kernel will copy all of them, in order, into the send buffer, and then return from the system call.
This greatly reduces the cost of making smaller writes. In NIO we frequently encourage users to make their writes "semantic", to make them correspond to protocol atoms. This means that we will often see many quite small writes. Consider HTTP/1.1 chunked encoding, which looks like this:
1\r\n
a\r\n
6\r\n
series\r\n
2\r\n
of\r\n
5\r\n
small\r\n
6\r\n
writes\r\n
Assuming the user has sent each of these small words in its own byte buffer, NIO needs to add buffers for the lengths and the CRLFs. If we didn't have the writev system call, we'd either have to make 3 system calls for each chunk (one for the size + first CRLF, one for the body content, and one for the trailing CRLF) or we'd have to do a lot of memory copying to flatten the messages down. writev reduces the amount of copying and the number of system calls we have to make. It's great! NIO uses this by default, and does not require any input from the user.
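To illustrate the pattern this enables, here's a made-up handler (not NIO's real HTTP encoder) that frames an outbound body the way chunked encoding does: it queues several small semantic writes and leaves the flush to happen later, at which point NIO hands all the pending buffers to the kernel in one writev.

import NIOCore

final class ChunkFramingHandler: ChannelOutboundHandler {
    typealias OutboundIn = ByteBuffer
    typealias OutboundOut = ByteBuffer

    func write(context: ChannelHandlerContext, data: NIOAny, promise: EventLoopPromise<Void>?) {
        let body = self.unwrapOutboundIn(data)

        // Hex length + CRLF, as in HTTP/1.1 chunked encoding.
        var header = context.channel.allocator.buffer(capacity: 16)
        header.writeString("\(String(body.readableBytes, radix: 16))\r\n")

        var trailer = context.channel.allocator.buffer(capacity: 2)
        trailer.writeString("\r\n")

        // Three separate writes: no copying everything into one big buffer.
        context.write(self.wrapOutboundOut(header), promise: nil)
        context.write(self.wrapOutboundOut(body), promise: nil)
        context.write(self.wrapOutboundOut(trailer), promise: promise)
        // No flush here: when the pipeline is eventually flushed, all of the
        // pending buffers go to the kernel in a single writev call.
    }
}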
So, if vector writes are so great, what about vector reads?
Well, in TCP vector reads aren't very useful. This is because TCP is "stream oriented": there are no inherent message boundaries in TCP. You just read the body like it's a stream of bytes. For this reason, it's just as easy to pass a single giant buffer into the kernel on recv and get the kernel to fill it as it is to pass several smaller ones. So when we use TCP in NIO, we do not perform vector reads: we perform simple, scalar read calls.
But with UDP, this doesn't work so well. In UDP, each call to recv will give you one (and only one!) datagram. For protocols with lots of small messages, this leads to death by system call overhead: we will end up spending more of our time doing the context switching associated with making system calls than actually processing network traffic. Worse, because we'll be so inefficient, we'll likely suffer packet loss as we can't process packets fast enough.
Fortunately, some platforms provide a vector datagram read system call, called recvmmsg. This system call lets us allocate multiple buffers and ask the kernel to give us multiple received datagrams at once. On systems that support it, this lets us trade increased memory use for better performance: while we have to allocate more memory at once (to support the maximum possible number of datagrams we could have received), we can vastly reduce the number of system calls we have to make to process the data.
However, unlike with TCP, we don't feel we can just opt users into this behaviour. As you've seen, if we support reading 30 datagrams at once (not unreasonable), we may need to allocate 30 * 2048 == 61440 bytes of memory per channel. This is quite a lot, especially if you're opening many channels. And unlike TCP we don't do this adaptively (though arguably we could if it proved to be useful).
This is where an unfortunate design comes into play. You see, these two channel options interact with each other.
datagramVectorReadMessageCount does not change the amount of memory a recvAllocator will give us. Indeed, it cannot! The recvAllocator is in charge of how much memory it will allocate. This presents a problem for vector reads. We could call buffer(allocator:) more than once, but there is an implied contract for RecvByteBufferAllocator implementations that we will not do that. Additionally, that would lead to lots of calls to malloc, an unnecessary slowdown.
So instead, we document that if the user sets datagramVectorReadMessageCount to a value other than 1, they should increase the size of the memory allocated by the FixedSizeRecvByteBufferAllocator correspondingly. This is because, under the hood, we ask the allocator to allocate the memory and then slice it up into datagramVectorReadMessageCount equal-sized slices.
In your example above, you set datagramVectorReadMessageCount to 30 and the capacity of the FixedSizeRecvByteBufferAllocator to 30 * 2048. This has the effect of allocating a single slab of 61440 bytes. That slab is then partitioned into 30 2048-byte slices, which are passed to the system for a vector read. This allows you to read up to 30 datagrams in a single system call and a single call to malloc, vastly increasing throughput at the cost of higher baseline memory consumption.
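For reference, that configuration would look roughly like the sketch below (assuming NIOPosix's DatagramBootstrap; the thread count, initializer, and bind step are placeholders):

import NIOCore
import NIOPosix

let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)

// Sketch: read up to 30 datagrams per system call, backed by one
// 30 * 2048 == 61440-byte slab that NIO slices into 2048-byte chunks.
let bootstrap = DatagramBootstrap(group: group)
    .channelOption(ChannelOptions.datagramVectorReadMessageCount, value: 30)
    .channelOption(
        ChannelOptions.recvAllocator,
        value: FixedSizeRecvByteBufferAllocator(capacity: 30 * 2048)
    )
// ... add .channelInitializer { ... } and .bind(host:port:) as usual.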
So, let's talk about this:
"The problem is that my channelRead seems to be only triggered one time and is only a partial message."
through the lens of what we've just learned.
The changes you made should not have avoided truncation: we should still have the same maximum size for each datagram, 2kB. We can just read more datagrams faster. So I'm curious as to how this fixed your truncation issue. Can you reproduce the original behaviour? I'd like to see what's going on there.