It's fire and forget.
So, after a quick glance at the code, I'm going to make a few suggestions.
Firstly, avoid using Data at your send entry point. It forces an extra copy on every write, e.g. of a PING message: you transform from String to Data to ByteBuffer, and each transformation copies the bytes. Instead, it's better to handle ByteBuffer directly here if you can.
Relatedly, consider avoiding this generic entry point and instead taking a ByteBuffer here as well. The further up the stack you can start using ByteBuffer, the better your performance will be, as you avoid incurring extra copies. In this case the performance impact won't be huge, as you're ultimately going to hit a generic specialization, but it's still nicer to avoid the risk.
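As a rough sketch of what I mean, assuming your connection owns a NIO Channel (Messenger and send are illustrative names here, not your actual API):

```swift
import NIOCore

struct Messenger {
    let channel: Channel

    // Core entry point takes ByteBuffer, so no extra copies are incurred.
    func send(_ buffer: ByteBuffer) async throws {
        try await self.channel.writeAndFlush(NIOAny(buffer)).get()
    }

    // Convenience overload: the String-to-bytes copy happens exactly once,
    // rather than String -> Data -> ByteBuffer.
    func send(_ message: String) async throws {
        try await self.send(self.channel.allocator.buffer(string: message))
    }
}
```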
Next, consider whether having your write method be async is actually serving you. In many cases it will return immediately, having written into the buffer, and will only occasionally actually suspend for I/O. That makes it unclear what it means for the write to suspend. In general I'd say you should either always suspend or never suspend: it's rarely wise to suspend conditionally based on internal state like this.
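To illustrate the two consistent designs (names are assumed, not from your code):

```swift
import NIOCore

struct Writer {
    let channel: Channel

    // Design 1: never suspend. write(_:) enqueues and returns immediately;
    // the caller (or a handler) decides when to flush.
    func write(_ buffer: ByteBuffer) {
        self.channel.write(NIOAny(buffer), promise: nil)
    }

    func flush() {
        self.channel.flush()
    }

    // Design 2: always suspend. The call only returns once the bytes have
    // been written and flushed, so suspension always means the same thing.
    func writeAndWait(_ buffer: ByteBuffer) async throws {
        try await self.channel.writeAndFlush(NIOAny(buffer)).get()
    }
}
```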
With that we can work through the issue with the buffer itself. My main concern right now is that this buffer is quite complex, so quite aside from everything else it's challenging to understand exactly how it works. My initial recommended refactor would be to move the buffering logic entirely into a NIO ChannelHandler. This avoids the need for the extra lock, instead relying on the Channel's own mutual exclusion, and greatly simplifies the logic.
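Here's a minimal sketch of what such a handler could look like, assuming your messages are already ByteBuffers by this point (CoalescingWriteHandler is an illustrative name):

```swift
import NIOCore

final class CoalescingWriteHandler: ChannelOutboundHandler {
    typealias OutboundIn = ByteBuffer
    typealias OutboundOut = ByteBuffer

    // No lock needed: handler methods are only ever invoked on the
    // Channel's event loop.
    private var bufferedWrites: [(ByteBuffer, EventLoopPromise<Void>?)] = []

    func write(context: ChannelHandlerContext, data: NIOAny, promise: EventLoopPromise<Void>?) {
        // Hold the write instead of passing it down the pipeline immediately.
        self.bufferedWrites.append((self.unwrapOutboundIn(data), promise))
    }

    func flush(context: ChannelHandlerContext) {
        // Forward everything we buffered, then flush once.
        for (buffer, promise) in self.bufferedWrites {
            context.write(self.wrapOutboundOut(buffer), promise: promise)
        }
        self.bufferedWrites.removeAll(keepingCapacity: true)
        context.flush()
    }

    // A real implementation also needs to fail any buffered promises if the
    // channel errors or closes; that's elided here for brevity.
}
```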
My next suggestion would be to avoid waiting for the write promises to complete, and instead start using NIO's channelWritabilityChanged notifications. These rely on NIO's internal backpressure management and give you a strong signal that you should back off. This covers most of what the buffer does, which in any case only meaningfully reduces flush calls in a limited number of cases. It also minimises the number of promises in flight, which is very helpful.
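For illustration, a handler can observe that signal like this (BackpressureHandler and the reactions are assumptions about your setup; the thresholds are configured via ChannelOptions.writeBufferWaterMark):

```swift
import NIOCore

final class BackpressureHandler: ChannelInboundHandler {
    typealias InboundIn = ByteBuffer

    func channelWritabilityChanged(context: ChannelHandlerContext) {
        if context.channel.isWritable {
            // The send buffer has drained below the low water mark:
            // resume producing writes here.
        } else {
            // Too much data is queued: stop producing until the channel
            // becomes writable again.
        }
        context.fireChannelWritabilityChanged()
    }
}
```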
Third, and perhaps most importantly, consider whether you can get a hint from the user as to whether you can coalesce. This will be the most effective option. In this case, something like a scoped writer would be useful:
```swift
client.withBatch { batch in
    batch.write()
    batch.write()
    batch.write()
}
```
Clients that emit a large number of messages can use this interface to give you a "hint" that more are coming. This avoids the need for an explicit flush call while still letting performance-sensitive clients tune their I/O.
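A minimal sketch of how such a scoped writer could be implemented on top of NIO (Client, BatchWriter, and channel are assumed names; error handling is elided):

```swift
import NIOCore

struct BatchWriter {
    let channel: Channel

    func write(_ buffer: ByteBuffer) {
        // Enqueue without flushing; the enclosing withBatch flushes once.
        self.channel.write(NIOAny(buffer), promise: nil)
    }
}

struct Client {
    let channel: Channel

    func withBatch(_ body: (BatchWriter) async throws -> Void) async throws {
        let batch = BatchWriter(channel: self.channel)
        try await body(batch)
        // One flush for the whole batch instead of one per write.
        self.channel.flush()
    }
}
```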
A final question: do you happen to be at KubeCon 24? If you are, I'm there this week and would be happy to work through this with you in some more detail.