WebSocket Server Restart Blocking all EventLoops?

I have a long-running server that is a combination of a WebSocket client and BigQuery requests. It sometimes gets itself into a bad state, and I wonder if my current hypothesis is plausible:

Ubuntu 18.04. 16 cores. The app has a single MultiThreadedEventLoopGroup(numberOfThreads: System.coreCount * 2).

At start, it connects to 400 different WebSocket endpoints on the same server. As it receives WebSocket messages, it processes the data and then inserts it into BigQuery. There are times when the WebSocket server restarts or goes down for maintenance, causing all 400 of my connections to die. Furthermore, during this maintenance time, my app's attempts to reconnect will all fail with WebSocket connection error connectTimeout(NIO.TimeAmount(nanoseconds: 10000000000)). There is a 1 second wait between each reconnect attempt. If I wait long enough, my app will successfully reconnect to all 400 endpoints when the server is back up. The thing I am surprised to witness during this time is that most or all of my BigQuery requests start failing too:

RESTRequest.swift:handleHTTPRequest(request:requestAttribution:httpClient:logger:logSuccess:retryCount:):79 : connectTimeout(NIO.TimeAmount(nanoseconds: 10000000000)) with request: Request(method: NIOHTTP1.HTTPMethod.POST, url: https://www.googleapis.com/bigquery/v2/projects/myproject/datasets/myDataset/tables/myTable/insertAll, scheme: "https", host: "www.googleapis.com", headers: [("Authorization", "Bearer <token>"), ("Content-Encoding", "gzip")], body: Optional(AsyncHTTPClient.HTTPClient.Body(length: Optional(28197), stream: (Function))), redirectState: nil, kind: AsyncHTTPClient.HTTPClient.Request.Kind.host)

Is it possible this is because all of the EventLoops are blocking on the WebSocket connection timeouts? Perhaps I should break up the BigQuery requests and the WebSockets connections onto two separate EventLoopGroups?

Are you actually blocking anywhere on the event loops? In general "just timeouts" etc are not blocking per se, so that would not be the case.

Sidenote: for the connection retry you'd usually want to apply some exponential backoff, not just "exactly 1 second" because that way you'll hammer the server with a lot of reconnection attempts at the same time. But that's likely a separate issue to your question at hand.
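For illustration, here is a minimal sketch of a capped exponential backoff with random jitter, so that 400 connections do not all retry at the same instant. The function name `reconnectDelay` and the `base` and `cap` defaults are assumptions made for this example, not values from the thread.

```swift
import Foundation

/// Hypothetical helper: delay in seconds before reconnect attempt `attempt` (0-based).
/// Uses "full jitter": a uniform random value between 0 and a capped exponential ceiling.
func reconnectDelay(attempt: Int, base: Double = 1.0, cap: Double = 60.0) -> Double {
    // Clamp the exponent so pow() cannot grow without bound for large attempt counts.
    let exponent = Double(min(max(attempt, 0), 30))
    let ceiling = min(cap, base * pow(2.0, exponent))
    return Double.random(in: 0...ceiling)
}

for attempt in 0..<6 {
    let delay = reconnectDelay(attempt: attempt)
    print("attempt \(attempt): wait \(delay) s")
}
```

With NIO, the computed delay could be handed to `eventLoop.scheduleTask(in: .milliseconds(Int64(delay * 1000)))` to schedule the next connection attempt without blocking the loop.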

Thanks very much, this is the key I was looking for. That dispels my hypothesis.

The closest I come to blocking is the encoding of the BigQuery inserts. A BigQuery insertion will have a maximum of 10,000 objects, and these are encoded into JSON and gzipped to send the request. This can max out a single CPU for a second or two, but that's about it. Further, the encoding and compression are not repeated when the request fails: the retry loop catches the error and tries again after the encoding and compression have already occurred.

Yeah, this is on my to-do list.

This encoding should definitely move off the event loop. If it’s “maxing out the CPU for a second or two” and being run on the event loop, all other work on that loop cannot progress. That includes connection attempts, but everything else as well, and it may well be the source of some of your issues here.

I recommend kicking this to the background in a thread pool.

You may also want to consider reducing the number of threads you allocate to event loops in this case, as you probably don’t need 32 event loops!

Thanks @lukasa!

Do you have in mind dispatch queues? Because I prefer using futures and promises, would there be any trouble with creating an EventLoopGroup that is specifically for these blocking computations? If there are no network operations on that EventLoopGroup, it should be well-suited for it? Combine is not an option because it's macOS only.

Thanks, I don't know how to determine what number is reasonable. I decided to go with the maximum number of threads the machine can be expected to execute simultaneously because this is the only thing the machine is doing right now. Based on discussions here and here.

There is this: https://apple.github.io/swift-nio/docs/current/NIO/Classes/NIOThreadPool.html

I suppose a second EventLoopGroup would also work (via bgLoop.next().submit { /* work */ }), though it isn't really intended for this.

I do. :slight_smile: Or the thread pool linked by @Helge_Hess1, which is backed by Dispatch queues.

In general, sure, but EventLoops are a bit heavyweight if they don't do I/O. Each EventLoop consumes a number of file descriptors, which is unnecessary if you don't want to do I/O on them. Generally this is not the best usage model.

Sure, that's not an unreasonable decision. I was mostly noting that if we're kicking the serialization off the event loops doing I/O, you probably want to reduce the number of threads dedicated to those loops and give some to the CPU-bound work.
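As a rough sketch of that split, the setup might look like the following. The half-and-half ratio and the names `ioGroup` and `cpuPool` are assumptions for illustration only; the right numbers depend on how much of the load is I/O versus serialization.

```swift
import NIO

// Assumed split: about half the cores for I/O event loops, the rest for CPU-bound work.
let cores = System.coreCount
let ioThreads = max(1, cores / 2)
let cpuThreads = max(1, cores - ioThreads)

// Event loops handle sockets, timers and future callbacks only.
let ioGroup = MultiThreadedEventLoopGroup(numberOfThreads: ioThreads)

// The thread pool runs blocking or CPU-heavy closures; it must be started before use.
let cpuPool = NIOThreadPool(numberOfThreads: cpuThreads)
cpuPool.start()

print("I/O threads: \(ioThreads), CPU threads: \(cpuThreads)")

// ... run the application, then shut both down ...
try cpuPool.syncShutdownGracefully()
try ioGroup.syncShutdownGracefully()
```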

Thanks. The frustrating thing about this is that it's an entirely different concurrency abstraction and I lose all of the advantages of the futures/promises API that is NIO's foundation. NIOThreadPool's WorkItems don't appear to be composable like EventLoopFuture's flatMap, flatMapThrowing, scheduleTask, flatScheduleTask, etc. The unit tests for NIOThreadPool all use Lock and DispatchSemaphore to synchronize the results of the asynchronous tasks. Now I'm back at the low level abstractions that compelled me toward NIO initially. But I understand NIO is focused on networking tasks and I'm now trying to do non-networking tasks, so perhaps I should look into other concurrency libraries for this portion of the work.

I wonder how hard it would be to write an EventLoopGroup (e.g. backed by Dispatch) which doesn't do any I/O but just async (though you want a timer as well). Almost looks like NIOTSEventLoop might be exactly that?

This is not accurate. NIOThreadPool.runIfActive is the appropriate API to use in most cases, which vends a future containing its result. The WorkItem based APIs are for lower-level use-cases.

I think he wants to structure his long running tasks w/ promises as well. But you are right, that might not actually be necessary. Just use the regular I/O event-loop for the flow and just put the expensive subtask (zip, json encode) into the thread pool.

I see, I missed this call. I haven't found example usage, so thanks for your help. It takes as argument an EventLoop. Do I understand the following correctly: The closure provided to runIfActive doesn't run on the given EventLoop, it simply uses that EventLoop to create an EventLoopFuture with the result of the closure. This means, to chain blocking tasks, it would look like this:

self.threadPool.runIfActive(eventLoop: self.eventLoop) {
    // Do some blocking work
}.flatMap { result in
    // Code run here shouldn't block
    self.threadPool.runIfActive(eventLoop: self.eventLoop) {
        // Do some more blocking work
    }
}.flatMap { nextResult in
    self.threadPool.runIfActive(eventLoop: self.eventLoop) {
        // Do some final blocking work
    }
}

The NIO NonBlockingFileIO code is a useful example, though I don't see any chaining of blocking computation.

Yes, that’s correct: the NIOThreadPool isn’t an event loop. In general it does not expect you to bother chaining work on the thread pool. Once you’ve established a blocking-safe context, you may as well just write synchronous code and use the regular, well-established synchronous primitives for composition.
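To make that concrete, here is a minimal sketch that does both blocking steps inside a single `runIfActive` closure instead of chaining futures between them. The `Row` type and the `compress` function are hypothetical stand-ins (the placeholder just returns its input; a real version would call a gzip library), and the sketch assumes Swift 5 language mode.

```swift
import Foundation
import NIO

// Hypothetical row type standing in for the real BigQuery payload.
struct Row: Codable, Sendable {
    let id: Int
    let value: String
}

// Hypothetical placeholder for gzip compression; it returns the input unchanged.
func compress(_ data: Data) throws -> Data { data }

let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)
let pool = NIOThreadPool(numberOfThreads: 2)
pool.start()
let loop = group.next()

let rows = (0..<10_000).map { Row(id: $0, value: "v\($0)") }

let payload: EventLoopFuture<Data> = pool.runIfActive(eventLoop: loop) {
    // This closure runs on a pool thread, so ordinary sequential,
    // throwing code is fine here; no futures are needed between steps.
    let json = try JSONEncoder().encode(rows)
    return try compress(json)
}

// The future is completed on `loop`; callbacks attached here must not block.
let size = try payload.map { $0.count }.wait()
print("payload bytes: \(size)")

try pool.syncShutdownGracefully()
try group.syncShutdownGracefully()
```

The event loop only sees the finished `Data`, so the HTTP request can be chained from `payload` with `flatMap` as usual.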

Ok, excellent, I've moved all of my potentially blocking computation (JSON encoding/decoding, compression/decompression, hash calculation, etc) off of the EventLoop. Initially I moved everything onto NIOThreadPool, but then discovered it was breaking FIFO expectations in some portions of my server. Processing some kinds of data out-of-order leads to inconsistencies, so in that situation I had to switch from NIOThreadPool to serial DispatchQueues.

As of right now this appears to be working very well, thanks for the direction @lukasa. I have written this out in an attempt to codify the 2x2 matrix of possible concurrency/threading situations and which tool to use to solve it for my own understanding:

  • Concurrent non-blocking computation: Use SwiftNIO's EventLoopGroup, whose next() hands out EventLoops in round-robin order. There is no FIFO guarantee across different EventLoops.
  • Serial non-blocking computation: Use a single SwiftNIO EventLoop, typically designated as safeEventLoop. Within a single EventLoop there is a FIFO guarantee.
  • Concurrent blocking computation: Use SwiftNIO's NIOThreadPool. This does blocking computation off of the EventLoop so as not to bottleneck network tasks. Does not guarantee FIFO.
  • Serial blocking computation: This is when the order in which things happen is important. Use serial DispatchQueues, which guarantee FIFO execution.
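For the serial blocking case, one way to keep the futures-based style is to bridge a serial DispatchQueue to an EventLoopFuture through a promise. This is only a sketch: `runSerially` is a hypothetical helper written for this example, not a NIO API, and it assumes Swift 5 language mode.

```swift
import Dispatch
import NIO

/// Hypothetical helper: runs `body` on `queue` and reports the result
/// through a future that completes on `eventLoop`.
func runSerially<T>(on queue: DispatchQueue,
                    eventLoop: EventLoop,
                    _ body: @escaping () throws -> T) -> EventLoopFuture<T> {
    let promise = eventLoop.makePromise(of: T.self)
    queue.async {
        do {
            promise.succeed(try body())
        } catch {
            promise.fail(error)
        }
    }
    return promise.futureResult
}

let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)
let loop = group.next()
// A DispatchQueue is serial unless created with the .concurrent attribute.
let serialQueue = DispatchQueue(label: "example.serial-blocking")

// Records execution order; only touched from `serialQueue` until all work is done.
final class OrderLog { var order: [Int] = [] }
let log = OrderLog()

let futures: [EventLoopFuture<Int>] = (0..<5).map { i in
    runSerially(on: serialQueue, eventLoop: loop) { () -> Int in
        log.order.append(i)
        return i
    }
}

_ = try EventLoopFuture<Int>.whenAllSucceed(futures, on: loop).wait()
print(log.order)

try group.syncShutdownGracefully()
```

Because every item is submitted from the same thread to the same serial queue, the closures run in submission order, which is the FIFO property the thread pool does not provide.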

Any additional thoughts are welcome.