Performance engineering fun times with NIO NFS3, Instruments, and ChatGPT 4o

Today, in a completely personal endeavour, I nerd-sniped myself wanting to see if a state-of-the-art LLM (I used ChatGPT 4o because it's new) can be used as a performance engineer's assistant. And I thought that some of the below might be useful to others too. Even if you're totally not into LLMs, most of this post is just some good old performance engineering exploration with Instruments.

The experiment and actual data from my computer

Anyway, I wanted to see if an LLM can help me understand the latencies that a client program requesting a few bytes from a localhost NFS server would encounter. So the experiment is:

  • Reader program: A terrible program (don't ask me why I used C, makes no sense, no error checking either) which is essentially a for loop over: open a file, read up to 1024 bytes, print the bytes.
  • Server program: NIOExtrasNFS3Demo: A dummy NFS server which exposes a fixed directory /tmp/mount/dir with 6 files file, file1 ... file5 each containing H then a megabyte of Es and an O. It serves them over NFS and also automatically mounts that to /tmp/mount so you can start reading. All the dummy files are just a buffer in memory, so there's no actual on disk file. It's just a demo that NIO's NFS support works.

I chose this experiment because this kind of setup, two programs communicating with each other over the (localhost) network, XPC, gRPC or similar, is quite common. But because the kernel('s NFS client) is the client, I didn't need both a client and a server program: just the server and a program that simply reads a file from the NFS mount. Easy.

Given that the performance of these kinds of arrangements is often dictated by adding up the individual latencies of each step, I wanted to see if ChatGPT can be helpful in predicting these latencies (through many layers of the stack).

IMHO, a few things are crucial for a(n aspiring) performance engineer:

  1. Ability (or willingness to learn) to write a real-world benchmark to get real-world numbers and to have something to profile/trace with the tools
  2. Rough understanding (or willingness to learn) what operations will need to happen
  3. Rough mental model (or willingness to learn) of the latency of each of these operations
  4. Knowledge or willingness to learn tools (perf, strace, Instruments, dtrace, ...) that let you measure each of these steps

For this particular problem I happened to know enough, which was useful for assessing what the LLM can do. But it's very important to highlight that nobody knows all the steps in everything, nobody knows all the latencies and nobody knows all the tools (and depending on what you're doing, maybe there is no tool).

So please don't be dissuaded from getting into performance if that's interesting to you. Everybody starts from a place where they feel like they don't know enough, it's a journey (and I think a fun one) that never ends!
Also, there are very useful resources on the internet such as approximate latency numbers and lists of tools to measure things, some more approachable than others.

Anyway, for today's experiment I exclusively used Instruments' 'System Trace' (somewhat old WWDC video) because it looks cool and is actually pretty powerful for today's job.

A very rough list of things that need to happen:

  1. Reader program calls the read system call to tell the kernel that we want to read
  2. Switch into kernel mode
  3. Kernel's VFS (virtual file system layer) sees that this file is not in the buffer cache and also that it's on NFS
  4. Kernel invokes its NFS client to send bytes to the NFS server (this will wake up the NFS server process)
  5. Context switch to NIOExtrasNFSDemo NFS server process
  6. NIOExtrasNFSDemo's kevent system call returns, telling NIO that there are bytes to read
  7. NIO reads the bytes and feeds them through its ChannelPipeline invoking the NFS protocol decoders etc.
  8. Some processing of the request in the NFS server and the dummy file system that the demo server implements (find the right bytes in the in-memory store that holds the file data)
  9. Send it through NIO's ChannelPipeline which invokes the NFS protocol encoders
  10. NIO's channel calls the write or writev system call to send the NFS response
  11. Kernel's NFS client invoked again, decoding the NFS response
  12. Kernel now notices that it can finally fulfill the original read system call
  13. Context switch back to the reader program
  14. Reader program's read system call returns, having written the bytes into the buffer provided originally

Or in less detail:

reader program (read system call starts)
-> kernel (NFS client, serialises NFS request and sends it)
-> NIOExtrasNFSDemo (reads NFS network request, parses it, prepares the response, serialises NFS response, sends it) 
-> kernel (NFS client)
-> reader program (read system call ends)

Recorded with Instruments' System Trace and zoomed in, we get the following timeline (I added some annotations).

So as we can see, in this particular instance the total time of the read call in the reader program was 126 µs. Each system call is a red blob (with I (input) for reads, O (output) for writes, S (signalling) for signalling operations, etc). The length of the system call is shown in shaded red. So as expected, in the reader program (the row that says a.out -- Main Thread), we see one long blocking read system call (annotation: [1]).

And then we see a bunch of different phases that I annotated [2] to [10] which all added up together make up [1]. We roughly get this time series:

  • [2]: 35 µs: we are in the read system call but still Running (i.e. using CPU, not just waiting) in the reader program. The majority of this is already kernel processing because the kernel's doing work on behalf of the user program. This will include entering kernel mode, going through the kernel's VFS, serialising the NFS network request in the kernel, putting the network packets into the loopback (localhost) "network" etc. What we don't see in this view is that at the end of [2] another kernel thread briefly spins up, likely doing something with the network. But we're basically done here
  • [3]: 11 µs: This is the switching period. Both the reader program and the NFS server are Blocked, which means they're both waiting for something. The reader program is waiting for the overall read to complete and the NFS server hasn't been woken up to read the bytes of the incoming NFS request. This period has some (not shown) kernel thread running, presumably doing some networking work which then eventually triggers our NFS server to wake up. This period also contains the context switching overhead. At the end of [3] we see a long system call (kevent) ending; that's what SwiftNIO (and really most asynchronous software) uses for waiting when it has no other work. In this case kevent returns to signal that there are bytes to read.
  • [4]: 8µs: kevent has returned, now it's time for SwiftNIO to work. It needs to understand kevent's return value (which tells NIO that there are bytes readable on a certain fd), NIO will allocate a buffer for the bytes to be read soon etc. And as soon as it can it will issue a read system call to get the bytes (forming the NFS request) out of the kernel.
  • [5]: 6µs: a quick read to get these NFS bytes
  • [6]: 20µs: This is the actual work. NIO's got the bytes and now needs to
    • decode & understand the NFS request
    • call into the dummy file system which processes this "read". It is a dummy file system and has the bytes that are in this file readily available in memory
    • encode the NFS response encompassing the bytes
    • send the bytes
  • [7]: 12µs: writev system call: This is sending the bytes of the NFS response back to the kernel and should immediately unblock another kernel thread to process this NFS response
  • [8]: 12µs: Some more processing in the tail of the writev we just did, not really relevant because we already unblocked the kernel's processing of the response
  • [9]: 14µs: Another period where both the reader program as well as the NFS server are blocked with nothing to do. Some kernel thread is doing the work of shuttling over the bytes, preparing to decode the NFS response. This period again contains some context switching overhead
  • [10]: 10µs: We're back in the reader program but still in the read system call. So this is still kernel processing, likely the VFS understanding that this NFS response actually has the bytes for the file, dealing with the UBC (unified buffer cache, ...)

And that's it, the reader program (in user space) finally has all the bytes.

So, as a summary of this particular instance (these times are a little variable, lots of layers involved):

  • We need 4 system calls (overhead seems to be between 6 and 12µs each): Let's take the average (9µs) and say that's 36µs in total
  • 20µs of "processing" (parse, handle, serialise) in the NFS server
  • 11µs+14µs of context switching & localhost "networking" overhead
  • 35µs (beginning of read) and 10µs (end of read) overhead in the kernel's buffer cache, VFS, NFS client etc that is fully accounted to the reader program

This makes up the 126 µs, perfect.

ChatGPT 4o's prediction (with some help and guidance)

Okay, now finally over to ChatGPT, what does it make of this? You can read my full transcript here. (Unfortunately I failed to create a new chat for this, so you'll see me asking other silly stuff before settling on this question, so it's a bit messy. You'll need to scroll almost half the way down.) My initial prompt was:

[Prompt (as in: that's what I wrote)] What is the latency of small read of a program accessing an NFS mount which is served by an NFS server also on localhost, running on macOS. The data isn't in the kernel's buffer cache but it is in the NFS server's in-memory cache. Please split the expected latency into the different steps that are necessary, all the way from the requesting program running to the requesting program receiving back the data. Please point out all the necessary system calls both on the requesting program as well as on the NFS server program.

To which ChatGPT responds with honestly a really pretty good response including the step by step that I asked for:

[ChatGPT] [...] Thus, the total expected latency for a small read operation from an NFS mount, with the NFS server on localhost and data in the server's in-memory cache, on macOS is approximately 62-130 microseconds. [...]

Which is bang on!!! We saw 126 µs and it predicts 62µs to 130µs, nicely done.

I wasn't quite happy with everything so I asked it to refine its answers a bit and correct some mistakes in a bunch of follow up prompts:

[Prompt refinement] Can you also include the thread switching & scheduling overhead in the kernel?

[Prompt refinement] What if the server's asynchronous and is using kevent to wait until it can recvmsg. Can you include that as well please?

[Prompt refinement] the in-kernel NFS client doesn't use system calls though, it is the system.

[Prompt refinement] Why is there thread scheduling and context switching overhead in the NFS server before the sendmsg system call? That seems weird.

Okay, so I'll reproduce here in full, without any editing, what ChatGPT 4o had to say about this:

Steps and System Calls Involved with `kevent` and Thread Switching & Scheduling Overhead
  1. User Space to Kernel Space Transition (syscall to read):
  • Requesting Program: Issues a read system call.
  • System Call: read
  • Latency: ~1-5 µs
  2. Kernel to NFS Client Processing:
  • The kernel’s NFS client processes the read request and prepares an NFS request.
  • Latency: ~10-25 µs (no system call, purely internal processing)
  3. Network Stack Processing (Client to Server):
  • The NFS request is sent through the network stack over the loopback interface.
  • Latency: ~15-30 µs
  4. Thread Scheduling and Context Switch (Server):
  • Scheduling overhead for the server thread handling the request.
  • Latency: ~3-10 µs
  5. NFS Server Receives Request via kevent:
  • The server uses kevent to wait for the readiness to receive the message.
  • System Call: kevent (wait for events)
  • Latency: ~5-10 µs
  6. NFS Server Processes recvmsg:
  • The NFS server’s network stack receives the request and passes it to the NFS server process.
  • System Call: recvmsg
  • Latency: ~5-10 µs
  7. NFS Server Processing:
  • The NFS server processes the read request, finds the data in its in-memory cache, and prepares the response.
  • Latency: ~10-15 µs (in-memory fetch is internal)
  8. NFS Server Sends Response via sendmsg:
  • The NFS server sends the data back through the network stack over the loopback interface.
  • System Call: sendmsg (sending data through sockets)
  • Latency: ~15-30 µs
  9. Thread Scheduling and Context Switch (Client):
  • Scheduling overhead for the client thread receiving the response.
  • Latency: ~3-10 µs
  10. Kernel Receives Data and Processes It:
  • The client's kernel network stack receives the data and processes it.
  • Latency: ~5-10 µs
  11. Kernel to User Space Transition:
  • The kernel delivers the data to the requesting program, completing the read syscall.
  • System Call: read (completion)
  • Latency: ~1-5 µs

Total Expected Latency

  • User Space to Kernel Space Transition: ~1-5 µs
  • Kernel to NFS Client Processing: ~10-25 µs
  • Network Stack Processing (Client to Server): ~15-30 µs
  • Thread Scheduling and Context Switch (Server): ~3-10 µs
  • NFS Server Receives Request via kevent: ~5-10 µs
  • NFS Server Processes recvmsg: ~5-10 µs
  • NFS Server Processing: ~10-15 µs
  • NFS Server Sends Response via sendmsg: ~15-30 µs
  • Thread Scheduling and Context Switch (Client): ~3-10 µs
  • Kernel Receives Data and Processes It: ~5-10 µs
  • Kernel to User Space Transition: ~1-5 µs

Summing Up the Latencies

  • Total Expected Latency:
    • Minimum: 1+10+15+3+5+5+10+15+3+5+1 = 73 µs
    • Maximum: 5+25+30+10+10+10+15+30+10+10+5 = 180 µs

Thus, the total expected latency for a small read operation from an NFS mount, with the NFS server on localhost and data in the server's in-memory cache, on macOS including the use of kevent for asynchronous I/O and thread switching and scheduling overhead is approximately 73-180 microseconds. This detailed breakdown correctly includes the necessary system calls and internal processing steps.

Coming to an even more refined 73-180 microseconds, and the actual 126 µs is (I swear I didn't make this up!!) almost precisely the midpoint of this prediction ((180+73)/2 = 126.5). Wow, colour me impressed.

FWIW, ChatGPT says the server would use recvmsg and sendmsg for receiving/sending bytes. And that's totally legit; SwiftNIO just happens to use read/write(v) for TCP sockets, but that's really a detail and doesn't matter at all. A tiny change could make SwiftNIO use recvmsg and send(m)msg for TCP sockets too, and nothing good or bad would happen, it's the ~same.


Verdict & Conclusion

Was ChatGPT's answer entirely correct? No, it wasn't. I did tweak its answer by refining the prompt more and more until I was really quite happy with it. Not everything is perfectly correct but I think it's definitely ballpark correct.
Of course, I could only do these refinements easily because I have quite a bit of pre-existing knowledge in this area (wrote large chunks of SwiftNIO and all of SwiftNIO's NFS protocol support as well as the demo server).

So I don't want to claim here that you can just trust ChatGPT to become your performance engineering teacher and give you perfect explanations of everything. But I do think it does add value. At this point in time, my preferred LLM use cases are problems where I can easily validate the answer.
In this particular case this is sort of true: It's somewhat straightforward to create a benchmark (see in the details at the very bottom) to measure this in the real world. With some (arguably kinda advanced) UNIX systems knowledge it's also possible to mostly predict what work is happening and I think ChatGPT does look like it can add valuable information here.

What I'm unsure about is how, without a lot of experience, one would validate details like where exactly the context switching overhead comes into play. If somebody already knows the tools, they can just measure it; they don't really need to ask ChatGPT. A different way of saying this: ChatGPT didn't really save me any time here (which of course wasn't the goal either), but nevertheless I was very impressed by how well it did.

I'm pretty sure that if going step by step, ChatGPT (or similar LLMs) can be used as a first approximation providing at least some vocabulary to research, measure and profile everything properly.

So yes, I don't really know what my conclusion is haha. Maybe: it does provide value already but purely relying on it is definitely not a good idea. It was fun though!

Further random bits of information

What was the actual benchmark?

I ran while true; do sudo purge; ./a.out 1048575 /tmp/mount/dir/file*; done in a loop whilst recording with Instruments' System Trace.

  • My computer is an M1 MacBook Pro, the OG Apple Silicon MacBook Pro

  • I compiled this terrible program using clang readstuff.c which creates an executable ./a.out (yes, I was even too lazy to pass -o readstuff to give the program a sensible name :see_no_evil:)

  • Why the sudo purge? That flushes the disk cache, which makes the kernel actually issue NFS requests. If you don't flush, it'll mostly just serve the reads from its buffer cache in no time.

  • Why the large offset (1048575)? If you read bytes from a file on NFS, the kernel will usually issue 16 concurrent NFS reads to prefetch the next bits of data, just in case. That just makes looking at things in Instruments harder so I selected the last byte. The kernel already knows the size of the file so it knows that there's no point issuing the other 15 reads. Should've read like 1000 bytes just to show that it doesn't make a difference. Next time...

  • Why reading all the files (/tmp/mount/dir/file*) and not just one? This one's a bit funny: By clearing the disk cache, we also need to re-read the reader program and the NFS server program from the disk (using the SSD and the APFS file system). But we don't want to measure this quite significant overhead as it would normally always be in the buffer cache; we just want to see the 'read from NFS' part. So the first file will kinda suffer all the read-the-programs-from-disk overhead and the remaining files will then just get the NFS reads because by now the programs are already back in memory. So on each iteration I ignored the first one, which was dominated by reading in tons of stuff from the SSD.

  • I checked with nfsstat -c3 -w 1 that the number of NFS reads is correct (and we can also see it in the timeline)

How to actually do the recording

Recording a System Trace can be done like so:

Then select 'All Processes' and hit record. I set the following filter to only see the reader program and the NFS server.

What about other LLMs?

Just for the sake of it, I asked Claude Opus the same initial prompt:

[Prompt] What is the latency of small read of a program accessing an NFS mount which is served by an NFS server also on localhost, running on macOS. The data isn't in the kernel's buffer cache but it is in the NFS server's in-memory cache. Please split the expected latency into the different steps that are necessary, all the way from the requesting program running to the requesting program receiving back the data. Please point out all the necessary system calls both on the requesting program as well as on the NFS server program.

And it responded:

[Claude Opus] Here's a breakdown of the expected latency for a small read operation from a program accessing an NFS mount served by an NFS server on the same macOS machine, assuming the data is in the NFS server's in-memory cache but not in the kernel's buffer cache:
  1. Requesting program:
  • The program issues a read system call (e.g., read()) to read data from the NFS-mounted file.
  • The read system call triggers a context switch from user mode to kernel mode.
  • Latency: ~0.5-1 microseconds for the system call overhead.
  2. Kernel NFS client:
  • The kernel's NFS client module receives the read request.
  • It checks the kernel's buffer cache but doesn't find the requested data.
  • The NFS client module prepares an NFS READ request.
  • Latency: ~1-2 microseconds for NFS client processing.
  3. Kernel network stack:
  • The NFS READ request is passed to the kernel's network stack.
  • The network stack encapsulates the request into TCP/IP packets.
  • Latency: ~1-2 microseconds for network stack processing.
  4. Loopback interface:
  • Since the NFS server is on the same machine, the network packets are routed through the loopback interface.
  • The packets are copied from the kernel's memory to the loopback interface's memory.
  • Latency: ~1-2 microseconds for loopback interface processing.
  5. NFS server program:
  • The NFS server program receives the incoming network packets.
  • It processes the NFS READ request and extracts the necessary information.
  • The NFS server program checks its in-memory cache and finds the requested data.
  • Latency: ~1-2 microseconds for NFS server processing.
  6. NFS server response:
  • The NFS server program prepares an NFS READ response containing the requested data.
  • The response is passed back to the kernel's network stack.
  • Latency: ~1-2 microseconds for NFS server response processing.
  7. Kernel network stack (server side):
  • The network stack encapsulates the NFS READ response into TCP/IP packets.
  • Latency: ~1-2 microseconds for network stack processing.
  8. Loopback interface (server side):
  • The response packets are routed through the loopback interface.
  • The packets are copied from the loopback interface's memory to the kernel's memory.
  • Latency: ~1-2 microseconds for loopback interface processing.
  9. Kernel NFS client (receiving response):
  • The kernel's NFS client module receives the NFS READ response.
  • It extracts the requested data from the response.
  • The data is copied from the kernel's memory to the requesting program's memory.
  • Latency: ~1-2 microseconds for NFS client response processing.
  10. Requesting program (receiving data):
  • The read system call returns with the requested data.
  • The program resumes execution in user mode.
  • Latency: ~0.5-1 microseconds for the system call return.

Total estimated latency: ~10-20 microseconds.

Note that this is a rough estimate, and the actual latency may vary depending on factors such as system load, hardware specifications, and network conditions. Additionally, the latency breakdown assumes that the requested data is small enough to fit within a single NFS READ request and response.

I think I like Claude's way of presenting the information even better; it's actually quite nice! The latency figures were much less close to the real world though. Nevertheless, not a bad answer either!
