Interpreting a Perf Flamegraph Against NIO Threads and DispatchQueues

At the start of my long-running server on Linux I create these threading mechanisms:

// Event loops that own the WebSocket connections (coreCount / 4 threads).
self.eventLoopGroup = MultiThreadedEventLoopGroup(numberOfThreads: System.coreCount / 4)
// Pool for blocking, order-independent work (coreCount / 2 threads).
self.blockingThreadPool = NIOThreadPool(numberOfThreads: System.coreCount / 2)
// Serial queues for order-dependent, per-connection work.
for i in 0..<(System.coreCount / 2) {
    let queue = DispatchQueue(label: "serial-queue-\(i)")
    self.serialQueues.append(queue)
}

System.coreCount happens to be 16, so that's 4 EventLoopGroup threads, 8 NIOThreadPool threads, and 8 serial DispatchQueues.

The EventLoopGroup connects and manages ~500 WebSocket connections. Blocking tasks that can be performed concurrently, such as JSON encoding and storing in a database, are sent to the NIOThreadPool. Each of the ~500 connections is randomly assigned one of the 8 serial DispatchQueues, and blocking tasks that must be performed serially, such as JSON decoding and checksum validation, run on that connection's assigned queue via serialDispatchQueue.async {}. The ordering matters because the checksum validation of the data received at timestep t+1 relies on the data received at timestep t, so frames must be processed in the order received.
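For concreteness, here is a minimal sketch of that arrangement (the ConnectionHandler and Frame types and the closure bodies are simplified placeholders, not my actual implementation; only the NIOThreadPool and DispatchQueue calls reflect the real setup):

import Dispatch
import NIO

struct Frame {
    var payload: [UInt8]
}

final class ConnectionHandler {
    let eventLoop: EventLoop
    let blockingThreadPool: NIOThreadPool
    let serialQueue: DispatchQueue   // one of the 8 serial queues, assigned at connect time

    init(eventLoop: EventLoop, blockingThreadPool: NIOThreadPool, serialQueue: DispatchQueue) {
        self.eventLoop = eventLoop
        self.blockingThreadPool = blockingThreadPool
        self.serialQueue = serialQueue
    }

    // Order-independent blocking work (JSON encoding, database writes) is
    // submitted to the shared NIOThreadPool and completes back on the event loop.
    func store(_ frame: Frame) -> EventLoopFuture<Void> {
        return self.blockingThreadPool.runIfActive(eventLoop: self.eventLoop) {
            // ... encode `frame` and write it to the database ...
        }
    }

    // Order-dependent work (the checksum at timestep t+1 depends on the data
    // at timestep t) runs on the connection's assigned serial queue, so frames
    // are processed strictly in the order they were received.
    func receive(_ frame: Frame) {
        self.serialQueue.async {
            // ... decode `frame` and validate its checksum against the previous frame ...
        }
    }
}

Because each serial DispatchQueue executes its blocks in FIFO order, assigning a connection its queue once at connect time is enough to guarantee the per-connection ordering without any explicit locking.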

Here is a perf flame graph captured over 8 hours of the above server running, built with swift build -c release -Xswiftc -cross-module-optimization:

In the bottom left-hand corner you can see the 4 EventLoopGroup threads: NIO-ELT-0-#<0, 1, 2, 3>. On the bottom right are the 8 NIOThreadPool threads: TP-#<0, 1, 2, 3, 4, 5, 6, 7>. A couple of questions arise:

  • The 8 NIOThreadPool threads are all nearly identical, while the 4 EventLoopGroup threads are very uneven. You can see that NIO-ELT-0-#1 is barely visible, appearing in only 2% of samples, while NIO-ELT-0-#2 appears in 43%; #0 is in 22% of samples and #3 is in 15%. Why might this be? #0, #2, and #3 all appear to be calling the same functions, while #1 doesn't contain any of the calls into my code that the other three threads do. Is it perhaps expected that one of the threads is reserved for certain activity?
  • Everything above _dispatch_call_block_and_release is what's occurring via the serialDispatchQueue.async {} calls; you can see the PureSwiftJSON decoding and the CryptoSwift checksum calculations happening there. These calls are initiated by the EventLoopGroup, but I believe it's safe to say the work is not actually consuming time on the EventLoopGroup's threads? It would be a serious issue if it were. The work on the DispatchQueues is confusing because it doesn't show up under specific thread names the way the EventLoopGroup and NIOThreadPool tasks do (see the sketch of the hand-off below).
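To be clear about what I mean by the hand-off, here is a minimal sketch (the function name and types are placeholders, not my actual code): the closure handed to async should be running on a libdispatch worker thread, not on the NIO-ELT thread that enqueued it, with the result marshalled back to the event loop via a promise.

import Dispatch
import NIO

// Simplified illustration: the event loop enqueues work onto a serial queue
// and gets the result back as a future; the decode/checksum itself runs on a
// dispatch worker thread, leaving the NIO-ELT thread free to service sockets.
func decodeOffEventLoop(_ bytes: [UInt8],
                        on queue: DispatchQueue,
                        returningTo eventLoop: EventLoop) -> EventLoopFuture<[UInt8]> {
    let promise = eventLoop.makePromise(of: [UInt8].self)
    queue.async {
        // ... PureSwiftJSON decode and CryptoSwift checksum would happen here ...
        promise.succeed(bytes)
    }
    return promise.futureResult
}

If that's what is happening, I'd expect those frames to be attributed to libdispatch's worker threads rather than to NIO-ELT-0-#<n>.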

The instructions for creating this graph on Linux came from here.

@lukasa replied:

In general, nope. NIO does not schedule any of its own work on your event loop threads, and it has no background processing. In a standard NIO TCP application there is only one event loop that is special: the one that owns the listening TCP socket (i.e. the socket on which we call accept to produce the others). That alone is not sufficient to explain the pattern observed in your flame graph. However, this might be:

This appears to be an optimised build. I am very suspicious of the call stacks you have there, jumping from the thread directly to _dispatch_call_block_and_release. As you're using DispatchQueue.async, I think those call stacks are not actually executing on NIO's threads but have been incorrectly attributed to them by perf. If they were executing on NIO's threads, we'd expect to see several stack frames of NIO's event loop itself before this call.

I've seen this behaviour before, usually when debugging information is missing from a library. In this case it seems that dispatch is missing some debug symbols, which prevents perf from getting an accurate handle on how we get into _dispatch_call_block_and_release.

Can I suggest you file a bug report at bugs.swift.org?

Thanks @lukasa. I filed SR-13841.