Crash recovery for server-side applications

yxckjhasdkjh · September 5, 2018, 12:19pm

Since the server side group is now entering a new phase, I was wondering if the following was a problem that has been thought about before or that other server side developers see a need for.

Namely, it's about the fact that Swift apps can crash in many ways. Some of them are preventable, e.g. by avoiding force unwraps, but others aren't always: for example, it might be a bug in a library, it might be an integer overflow (where it might be difficult to strictly guarantee it won't ever happen), an invalid array access, some concurrency bug, ... Basically, programming errors will happen and invalid assumptions will be made, and so crashes will occur.
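
Some of these traps can be turned into ordinary thrown errors at the point where untrusted input enters the program. The following is a minimal sketch, not a general fix: `InputError`, `checkedTotal`, and `element(of:at:)` are hypothetical names made up for illustration, while `multipliedReportingOverflow(by:)` and `indices.contains` are standard library calls. It does nothing for bugs inside third-party libraries or for concurrency bugs.

```swift
// Hypothetical error type for rejected input.
enum InputError: Error, Equatable {
    case overflow
    case indexOutOfRange
}

// Multiplies without trapping: the standard library reports overflow
// as a flag, which is converted into a thrown error here.
func checkedTotal(price: Int, quantity: Int) throws -> Int {
    let (value, overflow) = price.multipliedReportingOverflow(by: quantity)
    if overflow { throw InputError.overflow }
    return value
}

// Bounds-checked array access: throws instead of trapping on a bad index.
func element<T>(of items: [T], at index: Int) throws -> T {
    guard items.indices.contains(index) else {
        throw InputError.indexOutOfRange
    }
    return items[index]
}
```

A request handler could catch `InputError` and answer with a 4xx response, so a malformed request fails on its own rather than taking the process down.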

Swift doesn't have a way of recovering from those crashes by default. This is probably OK for app development. After all, if an app really does crash, the user can just restart it (and there are probably some Apple tools that can report crash logs to the developers). However, for server-side applications that won't work, as a bad request could bring down the whole server.

The solution we're using now is to just have multiple instances running and have the load balancer take control in the case of a crash (e.g. by restarting); but that's additional work and also somewhat inflexible.

I was wondering if some tooling could be built that can handle these kinds of situations more seamlessly, so not everyone has to reinvent the wheel. I'm not sure exactly what the best solution would be (maybe supervisor trees like Erlang has?), but maybe some other people have ideas.

Helge_Hess1 · September 5, 2018, 12:29pm

That's a reason why I actually like hosting Swift endpoints in Apache. It has all kinds of reliability features built in. Another thing that is particularly problematic in Swift, and that doesn't even produce hard crashes (quickly), is memory leaks.

zoul · September 5, 2018, 12:45pm

There’s an interesting part on actors in the Concurrency Manifesto by Chris Lattner:

https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#part-3-reliability-through-fault-isolation

yxckjhasdkjh · September 5, 2018, 1:22pm

That's interesting, can you share some more details about those kinds of features and how they can be used for Swift?

@zoul: Yes, I've seen the manifesto before (even though I've never managed to fully read it, I probably should...). It paints a really bright future, and I hope that Swift will have this one day (the earlier the better), but of course I can't hope that this is something that is going to happen quickly. My question is whether there could be any community solutions in the meantime. It could be as simple as a tutorial that is linked from swift.org, for that matter, if there are solutions that can already be used (e.g. Apache, or maybe some supervisor tools, or some docker solution...). Even just a mention of the issue might be enough, as I suspect that quite a lot of people wouldn't think of this problem initially.

jrose · September 5, 2018, 4:33pm

I'm going to write out my doubts here just so that they're written up; please don't take it as complete dismissal of the idea or anything.

- If a thread goes down in-process (as in, insta-exit, not an exception), you've irretrievably lost the memory and resources that thread was holding unless you've registered some way to recover them from outside. For memory that might mean a leak; for file handles that might mean a broken connection or a truncated file on disk; for synchronization mechanisms across threads/actors, you've got a pending deadlock. So you'd need some exception-handling-like way to register cleanups, only with an outside thread running them.
  (Swift is very unlikely to get true exception-handling; it mucks with the entire rest of the language model when every call can fail.)
- If there's any memory-sharing, the memory might have been corrupted, so if we want to keep safety in the Swift sense, we need something like Go or Rust has to prevent sharing across actors.
- Swift talks to C a fair amount. We'd have to figure out what uses of C in the runtime and in corelibs are not compatible with this kind of recovery, and either avoid them or document the Swift-side API as "recovery-unsafe" or whatever. (Libraries that people use on their own, including Libc, are the responsibility of the developer using them…)

These are the big things coming to mind right now.

yxckjhasdkjh · September 5, 2018, 4:48pm

Thank you for the detail. I'm not involved enough with the internals of the language to comment on them, so I'm not sure what the best solution would be.

Regardless, this is a problem that all server-side applications will run into, so this will need to be handled at some point. Currently, everyone is probably doing it their own way (Apache, load balancer, docker, systemd, ...), so it would be nice to have some kind of standard solution. This might be (long-term) an extension of the language of some sort (as outlined in the manifesto above), but it could also just be additional, officially maintained tooling. Or even just a recommendation to use something that already exists.

tanner0101 · September 5, 2018, 7:34pm

As long as it is not possible to "catch exceptions" in Swift, it would seem our only option is to standardize some best practices for minimizing the impact of crashes.

Vapor currently recommends using a combination of nginx and supervisor when deploying your app to production. Nginx should help to deliver a proper 5xx if the reverse proxied server goes down unexpectedly (much better than just getting a TCP / socket error on the client side). Supervisor should instantly restart the app minimizing downtime after a crash.

I'm very interested to hear what others think is the best practice here.

I'm interested to learn more about what Apache offers. Are there things it can do that Nginx can't?

jrose · September 5, 2018, 8:07pm

Sorry, I'll clarify the exception-handling comment a bit. What I mean by "Swift is unlikely to get true exception-handling" is:

- No way to stop a fault from taking out the current thread. (How this interacts with higher-level parallelism abstractions like Dispatch is a different and complicated problem.)
- No way to perform cleanups that run on the same thread when a fault occurs. In general this probably scales up to "within the same actor", but I won't make that claim right now.
- No way for an arbitrary function call to end without
  - Returning
  - Throwing (in the Swift.Error sense)
  - Exiting the thread or process
  - Trapping / faulting

"Catching exceptions" usually encompasses all three of these, but note that even if you didn't want to catch the exception and had absolutely no custom cleanup to perform (finally blocks), you'd still have pieces of the problems I mentioned above.

Helge_Hess1 · September 5, 2018, 8:08pm

Helge_Hess1: "That's a reason why I actually like hosting Swift endpoints in Apache. It has all kinds of reliability features builtin."

tanner0101: "I'm interested to learn more about what Apache offers. Are there things it can do that Nginx can't?"

Well, you need to read more exactly what I'm saying ;-) :

"hosting Swift endpoints in Apache"

e.g. using mod_swift.

You are using Nginx as a frontend server, but that has little to do with watchdog'ing the actual server.
What Apache modules essentially allow you to do is to say "stop this instance gracefully if some limit has been hit" (MaxRequestsPerChild is the most basic, but actually very effective, variant). I.e. the "master" process wouldn't direct new requests to an instance which is in shutdown mode, but continue to handle ongoing threads until a clean shutdown is possible.
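
The same recycling idea can be approximated inside a Swift process. The sketch below is only an illustration of the bookkeeping, with a hypothetical type `RecycleBudget` that is not part of Apache, mod_swift, or any Swift framework: after a fixed number of accepted requests the instance refuses new work, lets in-flight requests finish, and then reports that it is safe to exit so a supervisor can start a fresh process. It is not thread-safe as written and assumes it is only touched from one event loop or behind a lock.

```swift
// Hypothetical request budget, loosely modelled on the idea behind
// Apache's MaxRequestsPerChild. Not thread-safe: confine it to a single
// event loop or protect it with a lock.
struct RecycleBudget {
    let maxRequests: Int
    private(set) var accepted: Int
    private(set) var inFlight: Int

    init(maxRequests: Int) {
        self.maxRequests = maxRequests
        self.accepted = 0
        self.inFlight = 0
    }

    // True once the budget is used up: no new requests are accepted.
    var isDraining: Bool { accepted >= maxRequests }

    // True when draining and every accepted request has completed.
    var canExit: Bool { isDraining && inFlight == 0 }

    // Call when a request arrives; returns false if it must be refused
    // (the front end should route it to another instance).
    mutating func tryAccept() -> Bool {
        guard !isDraining else { return false }
        accepted += 1
        inFlight += 1
        return true
    }

    // Call when a previously accepted request has completed.
    mutating func finish() {
        if inFlight > 0 { inFlight -= 1 }
    }
}
```

A server loop would check `canExit` after each `finish()` and shut down cleanly when it becomes true, relying on the process supervisor or cluster manager to bring up a replacement.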

This kind of thing is really good for Swift services which are guaranteed to hit a leak issue sooner or later due to the memory management mechanisms. (I'm not saying that those are wrong, just that you have to deal with the consequences. It is rather trivial for a framework user to introduce retain cycles, happens all the time)
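
For readers who have not run into this: a typical way a retain cycle sneaks in is a stored closure that captures `self` strongly. The sketch below uses hypothetical names (`RequestHandler`, `leaksAfterScope`) and is only meant to show the pattern and the usual `[weak self]` fix, not any particular framework's API.

```swift
final class RequestHandler {
    // Stored callback, as a framework might keep for later invocation.
    var onComplete: (() -> Void)?

    func respond() { /* write the response */ }

    // Leaks: the handler owns the closure and the closure owns the handler.
    func installLeakyCallback() {
        onComplete = { self.respond() }
    }

    // No cycle: the closure only holds a weak reference to the handler.
    func installSafeCallback() {
        onComplete = { [weak self] in self?.respond() }
    }
}

// Returns true if the handler is still alive after the caller's last
// strong reference has gone out of scope, i.e. if it leaked.
func leaksAfterScope(_ install: (RequestHandler) -> Void) -> Bool {
    weak var probe: RequestHandler?
    do {
        let handler = RequestHandler()
        install(handler)
        probe = handler
    }
    return probe != nil
}
```

Calling `leaksAfterScope { $0.installLeakyCallback() }` should report a leak, while the `installSafeCallback` variant should not. In a long-running server, one leaked handler per request adds up, which is why periodic recycling of instances helps.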

FWIW those kinds of things are also protocol specific. It's really not that easy to do in a cool way, but it would be really great if Server Side Swift would come with a good solution for it.

P.S.: I'm also aware that in large installations the swarm manager would deal with that. But even then, it is good to have a way to gracefully shut down instances considered "dirty".

P.S.2: That you have to recommend Nginx and supervisor for production is OK and working, but if you think about it, it is a little ridiculous as well. We can't build an edge server in Swift? ;-)

Joe_Groff · September 5, 2018, 8:30pm

Even in these situations, there's usually an opportunity for improvement over sudden death that doesn't require perfect recovery or isolation. A server wants a trap handling one request to not immediately take down other requests; although it's formally possible for badly behaved code to leave any shared resources in inconsistent states, concurrent requests hopefully already have relatively isolated state by necessity of their execution environment, so in practice it's probably OK for the supervising event loop to stop taking new requests and try to let the surviving requests complete before killing itself. There are other places where it'd be nice to have "soft landing" crash handling without perfect recovery; for example, test runners would like to test behavior that may trap, and tests should be simple and short-lived enough that leaked resources aren't an issue. StdlibUnittest forks to test crashes, but that's pretty heavyweight, and not all of our target platforms even support forking.

To that end, it wouldn't be unreasonable to have a setjmp-ish mechanism for introducing a last resort crash handler that at least was guaranteed to run in response to the safety check traps Swift itself emits, if not arbitrary signals raised by any C code. This wouldn't have to be a beautiful or composable API, since we don't want user code using it for traps as mundane control flow as happens too often with runtime exceptions in Java. Maybe we could require that the handler still end the process instead of being allowed to return. Maybe "actors will solve this" in the fullness of time, but that's just a beautiful idea, and it wouldn't hurt to have a less beautiful practical alternative in the short term.

gps · September 5, 2018, 11:43pm

Nginx returning a 500 is good, as is being able to bring back a process through supervisor. However, that still results in all other in-flight requests dying (unless you also retry them somehow), which IMO is still a huge problem.

yxckjhasdkjh · September 6, 2018, 10:37am

Thanks everyone for all the helpful replies. Clearly you are all much more knowledgeable than I am. If I understand correctly, I can see that the following solutions have already been tried out and found to work:

- Swift server code as an Apache module with mod_swift
- nginx + supervisord
- for our own service, we use a docker container that runs supervisord internally to restart the service if it died; additionally a load balancer sits in front of the docker cluster and will detect the health of the containers periodically (unfortunately, this means that if supervisord can't restart the service for some reason, we have a time window in which we might lose requests); I'm not sure if that's the best solution.

In addition, the following possible future solutions have been brought up:

- a more resilient, maybe actor-based way of writing Swift code (if I understand @jrose correctly, that will mean that any recovery mechanism would have to run at least in a different thread); this is a long-term vision, of course
- a more short-term solution where there is a (purposefully limited) last-resort crash handler that could be used in Swift code, as proposed by @Joe_Groff
- maybe a standard community solution based on external tooling

Is there interest in setting up a page somewhere that outlines some of these ideas in more detail as a reference for all server-side developers?

Some additional questions:

@tanner0101: Do you have a resource for the nginx + supervisor recommendation? I looked at the Vapor docs, but haven't found anything about this. It would be very useful.

Do you know how Nimble is able to test for fatalError / preconditionFailure / etc.? Do these functions behave differently than "regular" crashes (e.g. illegal array access)?

Joe_Groff · September 6, 2018, 3:51pm

It looks like it installs a Mach exception handler to respond to the Mach exceptions that happen to get raised by the code current Swift compilers and LLVM backends generate for traps. These aren't stable; there is currently no guarantee as to how functions like precondition or fatalError really end the process. An official crash handler mechanism would have to be coordinated with Swift's code generation patterns for traps.

tanner0101 · September 6, 2018, 8:58pm

@Joe_Groff @jrose thanks! That’s very enlightening.

If I understand correctly, we have two somewhat separate issues then (might be worth splitting one of these off into a different thread):

1: Long-term.

We should continue to discuss the concept of “soft landing” crashes. If / when the mechanism for doing this becomes available in Swift, the @server-workgroup can adopt them and recommend best practices for using them.

2: Short-term.

The @server-workgroup should consider recommending best practices for minimizing the impact of exceptions. We could agree upon what we think those best practices are (be it Nginx, Apache, whatever) and make that information available. Even if the recommendation is simply to consider this potential pitfall, that is better than nothing.

Joe_Groff · September 6, 2018, 9:58pm

Establishing and documenting current best practices while discussing both near- and far-term improvements sounds like a good approach. Even with soft landing crash handlers or in-process actors, I suspect many people would still want to rely on process isolation with existing mitigation practices for the added security and reliability.

Helge_Hess1 · September 6, 2018, 10:15pm

NPEs / fatalErrors are really a language-level issue. That they bring down a whole app is a pretty big flaw / issue. Though ! is heavily discouraged, I guess it would be fair to blame it on the users.
Nevertheless it would be really good if we could solve that part in some way. Doesn't sound easy, IMO.

As mentioned I think the other big issue (actually much bigger issue in the real world) in Swift is memory leaks. I think that would have to be managed at a "container" level and needs to be protocol specific (because how and when protocols can shutdown is different, this could start with getting something for HTTP).
But this is like a layer above SwiftNIO. More like a framework which provides the watchdog'ing capabilities on top of it.

P.S.: I don't think there is a need to document the weak/lame best practices deployed now. It is not language specific at all and well documented elsewhere (supervise, LBs, k8s etc).

yxckjhasdkjh · September 7, 2018, 9:56am

Probably 99% of web services out there run in interpreted languages of some sort (including JVM languages and of course also PHP, Ruby, Python, Perl, JavaScript, ...). With these kinds of languages hard crashes are very rare; you'll still have to protect yourself against them in the long run, but it's definitely not something you need to deal with right away. So yes, I think this is "language specific" in the sense that people used to doing web services in these languages wouldn't think of this problem in the same way.

Plus, Swift is supposed to be easy to learn, so I'm not sold on the idea of saying "they should know this already and it's documented elsewhere".

Maxim_Veksler · September 16, 2018, 1:11pm

I'd like to suggest that fast fail is in fact a feature rather than a bug.

In today's container-based world, where bootstrapping a new "docker instance" is almost immediate, you might want the service to crash when it reaches a state of fault. You want the cluster manager to bring a new instance into existence, connect it to the load balancer, and continue running without requiring manual intervention.

You might also want the old instance to remain dead so that you can investigate what happened and debug.

Helge_Hess1 · September 16, 2018, 1:38pm

A typical NIO server is going to serve tens of thousands of connections within a single process; that is the sole point of the framework. Bringing all of them down because a single connection triggers an issue is suboptimal, to put it nicely.

Maxim_Veksler · September 17, 2018, 10:07am

I understand your point. Yes, that is true.

How would these connections be handled in case of a rollout? How would this work in environments where CI is pushing new server versions into production multiple times a day?

I imagine some retry logic, either at the load balancer or on the client side, would be implemented to make this work.

I think you raise a good architectural point; I would be interested to learn about solutions.
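
As an illustration of the client-side retry idea mentioned above, here is a minimal sketch with exponential backoff. `RetryPolicy` and `withRetry` are hypothetical names, the `sleep` parameter exists only so the delay can be replaced in tests, and the whole thing assumes the retried operation is idempotent (blindly retrying a non-idempotent request after a server crash can apply it twice).

```swift
import Foundation

// Hypothetical retry settings: total attempts and the first delay,
// which doubles after every failed attempt.
struct RetryPolicy {
    var maxAttempts: Int
    var baseDelay: TimeInterval
}

// Runs `operation` until it succeeds or the attempts are used up.
// The last error is rethrown to the caller.
func withRetry<T>(
    _ policy: RetryPolicy,
    sleep: (TimeInterval) -> Void = { Thread.sleep(forTimeInterval: $0) },
    _ operation: () throws -> T
) throws -> T {
    var attempt = 1
    var delay = policy.baseDelay
    while true {
        do {
            return try operation()
        } catch {
            // Give up once the configured number of attempts is reached.
            if attempt >= policy.maxAttempts { throw error }
            sleep(delay)
            delay *= 2
            attempt += 1
        }
    }
}
```

A client would wrap an idempotent call such as a GET in `withRetry`, so that a request hitting an instance that is crashing or being rolled over is retried against whichever instance the load balancer picks next.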