Crash recovery for server-side applications

Proper servers like Apache handle that by shutting things down gracefully: if a child process gets a termination request (for whatever reason; using too much memory is an important one, as is a SIGTERM sent by the admin), it finishes processing all in-flight requests, then shuts down. In the meantime, the frontend process redirects new connections/requests to a different or new child.
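A minimal sketch of that drain-then-exit behavior, written in Python only for illustration. The `Worker` class and its method names are hypothetical and are not Apache's or NIO's API; the point is just that the SIGTERM handler flips a flag, new requests are refused, and the process exits only once nothing is in flight.

```python
import signal
from collections import deque


class Worker:
    """Hypothetical child process state; not a real server API."""

    def __init__(self):
        self.in_flight = deque()  # accepted but not yet finished requests
        self.completed = []
        self.draining = False

    def request_shutdown(self, signum=None, frame=None):
        # Signal handler: only flip a flag, never abort work from here.
        self.draining = True

    def accept(self, request):
        # Called by the frontend; False means "route this to another child".
        if self.draining:
            return False
        self.in_flight.append(request)
        return True

    def run_once(self):
        # Finish one in-flight request (stand-in for real protocol handling).
        if self.in_flight:
            self.completed.append(self.in_flight.popleft())

    def should_exit(self):
        # Exit only after shutdown was requested and nothing is in flight.
        return self.draining and not self.in_flight


def install(worker):
    # Must be called from the main thread of the child process.
    signal.signal(signal.SIGTERM, worker.request_shutdown)
```

The frontend side (noticing the `False` from `accept` and picking another child) is where the protocol knowledge comes in, and it is left out here.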

Of course this kind of thing requires protocol knowledge, which is why in my opinion a general-purpose watchdog like supervise is an awful solution. It is only the very last resort, for when you have nothing else available.

To be honest, I have no good idea how to even do this w/ Swift. Which is bad, because those kinds of bugs are so common in the Swift reality. Let's do Objective-C on the server again! ;-)


The combination of Swift's "unsafe" behaviors and the "one high-capacity process" design of NIO seems to all but demand a "recovery" model. One potential solution in the short term (absent deeper language/runtime support for dealing with these concerns) might be to adopt the "terminate a crashing thread" model, let the failed thread's resources leak (individual threads should be fairly well isolated in NIO's universe), and transition all remaining threads to an "orderly shutdown" state while also triggering spin-up of a new process. This allows for existing connections to finish servicing while minimizing downtime by leveraging the annoying truth that the majority of "crashing" crashes, while they do conceptually compromise the entire process' state, don't in practice corrupt all of the state, or at least not all at once.

The issue then becomes the security implications of the approach - not immediately terminating the entire process when a fault occurs potentially opens exploitation opportunities, even though the individual faulted thread is stopped.


Instead of adding explicit support for this (and this doesn't really address the issue in significant ways, an eventloop thread is still going to fail thousands of connections), I'd rather run NIO with just a single thread and use multiple processes (aka "the Node way"). The OS has all this builtin already ;-)
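To make the "multiple processes" idea concrete, here is a rough sketch of a parent that forks single-threaded workers and replaces any that get killed by a signal. It is Python rather than Swift, POSIX-only, and `spawn`, `supervise`, and `max_restarts` are made-up names for this example; a real setup would more likely lean on an existing process manager.

```python
import os


def spawn(worker_main):
    """Fork one worker; the child runs worker_main and never returns."""
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            worker_main()
            code = 0
        finally:
            os._exit(code)  # skip the parent's cleanup handlers in the child
    return pid


def supervise(worker_main, workers=2, max_restarts=3):
    """Keep workers running; replace any child killed by a signal.

    Returns the number of restarts performed. max_restarts is an
    assumption added here so a crash loop cannot respawn forever.
    """
    children = {spawn(worker_main) for _ in range(workers)}
    restarts = 0
    while children:
        pid, status = os.wait()
        if pid not in children:
            continue  # not one of ours
        children.discard(pid)
        if os.WIFSIGNALED(status) and restarts < max_restarts:
            restarts += 1
            children.add(spawn(worker_main))
    return restarts
```

A crash takes down only the connections of that one worker, and the parent needs no knowledge of what the worker was doing.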

This solution might come at the cost of an increased memory footprint when scaling up. I just heard from a JS guy that one of the major drawbacks of Node is its single-threaded nature coupled with a browser-sized memory need (~1 GB). Scaling up process-wise is not a cheap option.

The memory footprint doesn't usually increase much with processes and compiled languages: forked Linux processes share their pages copy-on-write (Redis uses that in interesting ways to avoid locking while persisting data).
You do lose the ability to "easily" share resources between multiple threads (e.g. db connection pools and caches). But those need to be properly locked, which also slows them down (see the Redi/S perf notes if you are interested).
And in the given context, which is "crash recovery", such locks make the thing even more complicated (consider a "crash" while a procedure holds a lock; this can easily deadlock all threads).
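The lock problem can be shown with a tiny sketch. This is Python, so it only imitates the situation: the "crashed" thread simply stops without releasing, which is roughly what terminating a faulted thread would leave behind. `faulty_handler` and `other_handler` are hypothetical names.

```python
import threading


def faulty_handler(lock):
    lock.acquire()
    # Simulated crash: the thread ends here and release() never runs.
    return


def other_handler(lock, timeout=0.1):
    # A healthy thread trying to use the same shared resource.
    got = lock.acquire(timeout=timeout)
    if got:
        lock.release()
    return got


pool_lock = threading.Lock()
crashed = threading.Thread(target=faulty_handler, args=(pool_lock,))
crashed.start()
crashed.join()

# The lock is still held although its holder is gone.
still_usable = other_handler(pool_lock)
```

Without the timeout, every other thread touching the shared pool would block forever, which is the deadlock described above.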

Things look a little different for your "JS guy" (and similarly in Java): he has other issues, like the big base runtime overhead of JS/JVM, plus the JITs modifying executable pages, plus the GC.
(BTW: that "single-threaded nature" he is talking about equally applies to Swift NIO. It is essentially the same cooperative multitasking which makes stuff scale but harder to use.)

Did those best practices ever get documented or published anywhere (either by the SSWG or Vapor on its own)?

Great question! Creating a deployment guide is indeed part of the SSWG's focus areas for 2020.