Should you ever call `print` inside a channel handler?

We are not supposed to block the NIO event loop, ever. Which makes me wonder whether you should ever call print, and especially fflush, from inside a channel handler, except when debugging, etc.

So, should you ever be calling print inside a channel handler?

No! It could block for an arbitrary amount of time, depending on where stdout points (and with it the whole event loop, which might be processing thousands of other connections that will all stall).

Strictly speaking, so can malloc, or a mutex lock, or ... even a plain memory access...

I do wonder, though: why call print other than for debugging? :thinking:

For server logging purposes.

As I understand it, blocking on print (for a long time) is a bit more realistic than blocking on malloc, because it depends on where stdout is being directed.

A typical answer would be to use the same approach as with real-time audio: use a memory ring buffer of a big enough size and flush it from a secondary thread. Don't block on ring buffer overflow; record "overflow occurred" in the log and drop the entry. With a big enough ring buffer and frequent enough readouts from the secondary thread, overflows won't happen in practice.
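A minimal sketch of that shape, assuming a dedicated flusher thread and using a locked, bounded array in place of a real byte ring buffer to keep it short (all names here are made up for illustration):

```swift
import Foundation

// Illustrative sketch, not an existing API: log entries are queued without
// blocking on I/O, and a secondary thread does the actual printing.
final class NonBlockingLog {
    private let lock = NSLock()
    private var entries: [String] = []
    private var droppedCount = 0
    private let capacity: Int

    init(capacity: Int = 4096) {
        self.capacity = capacity
        // Dedicated flusher thread: the event loop never touches stdout.
        Thread.detachNewThread { [weak self] in
            while true {
                Thread.sleep(forTimeInterval: 0.1)
                self?.flush()
            }
        }
    }

    /// Called from the event loop: never blocks on I/O, drops on overflow.
    func log(_ message: String) {
        lock.lock()
        defer { lock.unlock() }
        if entries.count < capacity {
            entries.append(message)
        } else {
            droppedCount += 1   // record the overflow instead of blocking
        }
    }

    private func flush() {
        lock.lock()
        let batch = entries
        let dropped = droppedCount
        entries.removeAll(keepingCapacity: true)
        droppedCount = 0
        lock.unlock()
        // The potentially slow writes happen off the event loop.
        if dropped > 0 { print("overflow occurred: \(dropped) entries dropped") }
        batch.forEach { print($0) }
    }
}
```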

This is entirely correct. stdout is typically connected to a pipe, which is bad enough, but it can be connected to many more horrifying things. Redirected to a file mounted on NFS, anyone?

I take it this would require some new concurrency primitives, to allow mutating a buffer from multiple threads?

Alternatively, I could imagine buffering log output byte by byte in an AsyncStream<UInt8>, although I doubt that would be very performant…

There are just two variables that require synchronisation: readerPos and writerPos. I'd allow the variables to grow "indefinitely" starting from 0, and use modulo arithmetic by the ring buffer size (or mask them with (1 << n) - 1 if the ring buffer size is chosen to be a power of 2) to convert them to buffer indices. I'd either make those two variables atomic or protect their accesses with a lock. The buffer contents itself doesn't need to be synchronised explicitly; it is synchronised implicitly via the explicit synchronisation of those two control variables.
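For illustration, the index arithmetic could look like this (capacity and the position variables are hypothetical names):

```swift
let capacity = 1 << 16    // ring size chosen as a power of two
let mask = capacity - 1   // note the mask is (1 << n) - 1

var readerPos = 0         // grows monotonically from 0, never wraps itself
var writerPos = 0

// Map a monotonically growing position to an index inside the ring.
func index(of position: Int) -> Int {
    position & mask       // same as position % capacity for power-of-two sizes
}

let readable = writerPos - readerPos   // bytes available to read
let writable = capacity - readable     // free space available to write
```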

This BTW always puzzles me: conceptually, the buffer contents are no different from the two variables holding the reading and writing positions. Let's say the buffer is tiny – only two bytes – and let's represent those two bytes as two extra variables, A and B. We'd have four variables in total:

R // read position, Int
W // write position, Int
A // first byte of buffer, Byte
B // second byte of buffer, Byte

The typical approach is to access R & W atomically with acquire/release memory ordering (or some combination of those orderings). However, when reading or writing the buffer contents, no atomic operations or locks are used, just a plain memcpy (which could be implemented as a manual loop), and in this particular extreme example, plain accesses to the A & B variables.

Oversimplifying, assuming a single reader and a single writer [1], and assuming reads / writes are of size 1:

// writer:
1. grab W non-atomically (there's just one writer – us)
2. grab R atomically
3. based on the two obtained values, if there's space available to be written †:
    a) write to A or B (depending upon the obtained W position)
    b) increment W atomically

† - note that there is no time-of-check vs time-of-use issue here: there are no other writers, only the code we are in, so the available space cannot shrink between the check and the write.

// reader:
4. grab R non-atomically (there's just one reader – us)
5. grab W atomically
6. based on the two obtained values, if there's anything to read ††:
    a) read A or B (depending upon the obtained R position)
    b) increment R atomically

†† - note that there is no time-of-check vs time-of-use issue here: there are no other readers, only the code we are in, so the contents available to read cannot shrink between the check and the read.

The puzzling moment is that in the reader/writer pseudocode above, #3a and #6a are done non-atomically. This can only work reliably if #3b synchronises not just its operand – the W variable itself – but everything currently out of sync. Is this guaranteed to happen, or are there weakly ordered architectures on which it isn't?
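Here is roughly how steps 1–6 could look in Swift, as a sketch assuming the swift-atomics package (whose orderings follow the C/C++ memory model, in which a release store that is observed by an acquire load also publishes the plain writes preceding it, including on weakly ordered architectures):

```swift
import Atomics  // https://github.com/apple/swift-atomics

// Single-producer / single-consumer byte ring following steps 1-6 above.
// SPSCRingBuffer is an illustrative name, not an existing type.
final class SPSCRingBuffer {
    private let capacity: Int
    private let mask: Int
    private let storage: UnsafeMutablePointer<UInt8>
    private let readerPos = ManagedAtomic<Int>(0)   // R
    private let writerPos = ManagedAtomic<Int>(0)   // W

    init(sizePowerOfTwo n: Int) {
        capacity = 1 << n
        mask = capacity - 1
        storage = .allocate(capacity: capacity)
    }

    deinit { storage.deallocate() }

    /// Writer side (exactly one writer). Returns false instead of blocking when full.
    func write(_ byte: UInt8) -> Bool {
        let w = writerPos.load(ordering: .relaxed)    // 1: only we mutate W
        let r = readerPos.load(ordering: .acquiring)  // 2
        guard w - r < capacity else { return false }  // 3: full, so drop (†)
        storage[w & mask] = byte                      // 3a: plain, non-atomic store
        writerPos.store(w + 1, ordering: .releasing)  // 3b: publishes 3a to the reader
        return true
    }

    /// Reader side (exactly one reader). Returns nil when empty.
    func read() -> UInt8? {
        let r = readerPos.load(ordering: .relaxed)    // 4: only we mutate R
        let w = writerPos.load(ordering: .acquiring)  // 5: pairs with 3b
        guard r < w else { return nil }               // 6: empty (††)
        let byte = storage[r & mask]                  // 6a: plain, non-atomic load
        readerPos.store(r + 1, ordering: .releasing)  // 6b: frees the slot for the writer
        return byte
    }
}
```

The `.relaxed` loads are of the position each side owns itself; only the variable crossing threads needs acquire on load and release on store.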


[1] - @taylorswift's use case would require support for more than one writer (and a single reader), so the pseudocode outlined above is not sufficient. Note that it is crucial not to get yourself into this situation:

// WRONG, DON'T DO THAT:
let oldPos = atomicAdd(&writePos, size)
write to the buffer based on oldPos

as the reader might get control after writePos has been updated but before the contents have been fully written, or written at all. In the case of multiple writers, I'd probably protect writerPos with a lock:

// not ideal (there's an O(n) operation under the lock), but correct:
lock.withLock {
    write to the buffer based on writerPos with e.g. `memcpy`
    (note that you'd need one or two memcpy's to handle buffer wrapping)
    writerPos += size
}

The reader code (in the case of a single reader) could still use atomics, but I'd probably not bother and would make the code more consistent by using the lock on the reader side as well. Keep the code under the lock as small as possible: e.g. on the reader side, memcpy the data into a temporary buffer †††; it would obviously be a very bad idea to perform the actual file write while holding the lock.

††† - a bonus point of using a temporary (scratch) buffer on the reader side: you'd be able to do a single "write to file" operation even when the data you are writing sits at the end of the ring buffer and wraps around to its beginning. Without a temporary buffer you'd need to perform two I/O operations. A rough sketch of this multi-writer variant follows below.
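Putting the multi-writer variant together (illustrative names; error handling elided):

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif
import Foundation

// Multi-writer / single-reader sketch: writers serialise on a lock, and the
// reader drains into a scratch buffer so each flush is one I/O operation,
// even when the data wraps around the end of the ring.
final class MultiWriterRing {
    private let lock = NSLock()
    private let capacity = 1 << 16
    private let mask = (1 << 16) - 1
    private var storage = [UInt8](repeating: 0, count: 1 << 16)
    private var scratch = [UInt8](repeating: 0, count: 1 << 16)
    private var readerPos = 0
    private var writerPos = 0

    /// Any number of writers. O(n) copying under the lock, but correct;
    /// returns false (drop) instead of blocking when the ring is full.
    func append(_ bytes: [UInt8]) -> Bool {
        lock.lock()
        defer { lock.unlock() }
        guard writerPos - readerPos + bytes.count <= capacity else { return false }
        for byte in bytes {                      // wrap handled per byte here;
            storage[writerPos & mask] = byte     // real code would use 1-2 memcpys
            writerPos += 1
        }
        return true
    }

    /// The single reader: copy out under the lock, do the slow write(2) outside it.
    func flush(to fd: Int32) {
        lock.lock()
        var count = 0
        for pos in readerPos..<writerPos {       // unwrap into the scratch buffer
            scratch[count] = storage[pos & mask]
            count += 1
        }
        readerPos = writerPos
        lock.unlock()
        guard count > 0 else { return }
        scratch.withUnsafeBytes { raw in         // a single I/O call (†††)
            _ = write(fd, raw.baseAddress, count)
        }
    }
}
```

A background flusher thread would call flush(to: 1) (stdout's file descriptor) periodically, mirroring the secondary-thread readout described at the top of the thread.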

PS. The actual code that implements the RingBuffer takes about 50 lines... much shorter than this post :rofl: