`fatalError` without a way to intercept it is harmful on the server

RussBaz · September 5, 2024, 5:39am

Hi. I have recently come to the conclusion that there must be a way to gracefully recover from a fatalError (at least on the server, perhaps as a compiler flag).

Otherwise, it becomes a land mine when it is used at runtime in third-party packages. For example, my recent issue was with Vapor's Fluent. If I forget to include a nested model and then try to read it, this triggers a crash of the whole server instead of just a 500 error.

Am I doing something wrong, or is this working as intended?

Also, just for information, I got the 'inspiration' for this post after stumbling today upon a certain response by Linus Torvalds, regarding runtime 'panics', to Rust support in the Linux kernel. (LKML: Linus Torvalds: Re: [PATCH 00/13] [RFC] Rust support)

Anyway, I am writing this only to share my recent experiences and to (hopefully) start a discussion. (And to rant a bit, so, please excuse me somewhat)

Please share your own opinions on this topic. Thank you!

sspringer · September 5, 2024, 6:30am

Cf. Swift Concurrency Manifesto / Part 3: Reliability through fault isolation. Someone else has written that there are some ideas for recovering from fatal errors, but I cannot find the corresponding topic.

finestructure · September 5, 2024, 6:47am

I've run into Fluent fatalErrors a few times and to be honest I think that's a library problem. In all the cases I've seen it should throw instead of trapping. I'm hoping the next major version of Fluent addresses that.

mman · September 5, 2024, 7:02am

Thanks @RussBaz for sending the email from Linus, very insightful.

I have been running Swift in production on the server since Swift 3, and have also run multiple Node.js based containers for ages.

What works for me is to isolate the functionality into Docker containers so that anything that fails is restarted automatically, and to integrate some crash reporting (sentry.io) so that you know about it fast, plus observability of at least CPU and memory usage.

As idealistic as clean code may be, I have seen containers crashing no matter the language and no matter the code quality, for a multitude of reasons: invalid code from early Swift compiler versions, fatal errors for missing library calls, memory leaks caused by me, and memory leaks caused by dependencies.

The problem is worse but the same in nature in JavaScript land, where deploying to production is like smoking weed while riding a bike blindfolded (never tried that, just imagine it like this).

So just my 2 cents: even if I were able to catch fatal errors and recover, it would not simplify the bigger picture…

Martin

P.S. Sentry does not support Swift on Linux, and that is the last big missing piece from a DevOps perspective.

MPLewis · September 5, 2024, 7:06am

Having never used Fluent/FluentKit but just searching for fatalError in its source code, I'm very surprised at just how many instances of it there are - this is a project built by the Vapor folks and it's way more crash-happy than I expected. I would have thought they of all people would understand that's not a great approach when a single fatalError can take down many hundreds of unique connections at once.

FranzBusch · September 5, 2024, 7:50am

After having written a lot of code in the server ecosystem that uses precondition/fatalError, and being bitten by it too many times, my latest stance is that we should only use precondition/fatalError when continuing would leave the program in a bad state, e.g. if you write a state machine where you are sure a certain state can never happen, and reaching it would mean a bug in how you have written the state machine.

Most of the time, throwing an error and gracefully handling it is the right approach, especially in server applications, where any precondition/fatalError that can be triggered by user input can lead to denial-of-service attacks.
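
As a minimal sketch of that distinction (all names here, such as `PaginationError`, `parsePageSize`, and `Connection`, are hypothetical and not from any library): input that a client controls is validated with a thrown error, while `fatalError` is kept for a transition that no request data should be able to cause.

```swift
// Hypothetical error type for this sketch.
enum PaginationError: Error, Equatable {
    case notANumber(String)
    case outOfRange(Int)
}

// User-controlled input: throw, so the caller can turn the failure into an
// error response instead of taking the process down.
func parsePageSize(_ raw: String) throws -> Int {
    guard let value = Int(raw) else {
        throw PaginationError.notANumber(raw)
    }
    guard (1...100).contains(value) else {
        throw PaginationError.outOfRange(value)
    }
    return value
}

// Internal invariant: reaching the trapping branch means the code driving
// the state machine has a bug; no request data can cause it.
struct Connection {
    enum State { case idle, open, closed }
    private(set) var state: State = .idle

    mutating func open() {
        switch state {
        case .idle:
            state = .open
        case .open, .closed:
            fatalError("open() called in state \(state)")
        }
    }
}
```

Whether a given `fatalError` really guards an unreachable state is a judgment call; the point of the sketch is only that the two kinds of failure are handled differently.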

Finagolfin · September 5, 2024, 9:19am

There are two rival schools of thought on how to deal with such serious errors, and realistically at least four different domains where they might be applied. One is the Erlang approach of "fail fast" and spawn a new lightweight process to try the failing code again, which only works because it assumes that most code runs in isolated lightweight processes, not the core runtime that oversees all these executing processes. While not so common, this Erlang approach ran telecom systems that were highly reliable and is spreading outward from there.

Another approach is much more well known, where you have a monolithic process with little isolated concurrency and you assume most exceptions and errors do not corrupt memory or "leave the program in a bad state", so you catch or log the problem and keep the process moving along.

Both can work well when their two core assumptions are correct, but not otherwise. They are:

1. Most errors are within the expected program model and won't corrupt state.
2. Most code runs in isolated processes that are overseen by a central executor, which can keep track of crashes and simply respawn when necessary.

The way fatalError() is designed assumes the first isn't true for its errors, and that the second is. How realistic that is for Swift on the server, particularly when using the new structured concurrency, I can't say, as I don't use Swift on the server.

Even Linus admits in your link that there are different domains, ie "kernel code is different from random user-space system tools," so it all really depends how Swift tries to address the needs of these different domains.

finestructure · September 5, 2024, 9:27am

Yes, the problem is that fatalError takes your whole service down including all other requests that were in flight. But in particular some of the Fluent fatalErrors are recoverable.

For instance, it fatal errors when you try to access a relationship that hasn't been loaded yet. That doesn't feel like something that should crash your entire process; it should throw instead.
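
A rough sketch of the difference, using a hypothetical `LazyRelation` type rather than Fluent's real property wrappers (none of these names are Fluent API): the trapping accessor ends the process, while the throwing accessor lets the request handler answer with a 500.

```swift
// Hypothetical error and wrapper types for this sketch; not Fluent API.
struct RelationNotLoaded: Error {
    let name: String
}

struct LazyRelation<Value> {
    let name: String
    var loaded: Value?

    // Trapping style: accessing an unloaded relation kills the whole process.
    var trappingValue: Value {
        guard let loaded = loaded else {
            fatalError("Relation \(name) was not loaded")
        }
        return loaded
    }

    // Throwing style (throwing getters need Swift 5.5 or later).
    var value: Value {
        get throws {
            guard let loaded = loaded else {
                throw RelationNotLoaded(name: name)
            }
            return loaded
        }
    }
}

// A handler can now map the failure to an HTTP status instead of crashing.
func statusCode(for relation: LazyRelation<String>) -> Int {
    do {
        _ = try relation.value
        return 200
    } catch {
        return 500
    }
}
```

With the throwing accessor, only the one request that touched the unloaded relation fails; other in-flight requests are unaffected.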

0xTim · September 5, 2024, 12:02pm

When Fluent was written we were pretty constrained by the features of the language at the time and decisions were made (rightly or wrongly) to adopt things like property wrappers and properties for a better ergonomic approach for various APIs.

Thankfully the language has evolved a lot and we now have things like throwing properties (which we can't retroactively apply because that would be a breaking change). Fluent 5 (and Vapor 5 etc) will provide far fewer places that will crash out the app, short of misconfigurations at start up we can't recover from etc.

RussBaz · September 5, 2024, 1:11pm

It would still be nice to have some kind of compiler guaranteed isolation mechanism for fatal errors when the service availability is crucial.

ktoso · September 5, 2024, 1:17pm

We've talked about this a lot during early actor days, and currently the answer really might be to use distributed actor + process isolation.

There's a recent thread about this: Runtime extensibility via distributed actors, as well as a demo here: GitHub - martiall/swift-subprocess-distributedactors, which uses just pipes and shows how one could even run the "other side" in wasm etc.

Using this, you can crash the whole "other side" because it's properly process isolated. It comes at a cost though, the process.
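
Leaving distributed actors aside, the underlying process-isolation idea can be sketched with Foundation's `Process` alone: a parent runs the risky work in a child process and simply starts a fresh one if it crashes. The function name `runSupervised`, its parameters, and the restart limit are assumptions for illustration; a real setup would also need a transport (pipes, sockets) to talk to the child.

```swift
import Foundation

// Hypothetical helper: runs an executable in a child process and restarts it
// if it exits abnormally, up to `maxRestarts` times.
// Returns the number of launches it took to get a clean exit, or nil if
// every attempt failed.
func runSupervised(executable: String, arguments: [String], maxRestarts: Int) throws -> Int? {
    for attempt in 1...(max(0, maxRestarts) + 1) {
        let child = Process()
        child.executableURL = URL(fileURLWithPath: executable)
        child.arguments = arguments
        try child.run()
        child.waitUntilExit()

        // A trap such as fatalError ends the child abnormally; the parent
        // process is unaffected and can decide what to do next.
        if child.terminationReason == .exit && child.terminationStatus == 0 {
            return attempt
        }
    }
    return nil
}
```

For example, `try runSupervised(executable: "/bin/sh", arguments: ["-c", "exit 0"], maxRestarts: 2)` launches the child once. The cost mentioned above is visible here too: every isolated unit of work pays for a process launch.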

Either way, it's a worthwhile idea to build out and I'd love to form a workgroup of some kind to pursue these ideas.

RussBaz · September 6, 2024, 4:24am

Out of curiosity, how would you personally architect a web service using distributed actors?

ktoso · September 6, 2024, 4:28am

That's too generic of a question to be honest.

Distributed actors enable various things, from load balancing across nodes, to isolating risky work into processes in the same nodes, or just using them for websocket or whatever else you might want. This all implies different transports and semantics... It depends what you're looking for.

Classic talks I recommend people watch on the topic are Tesla's virtual power plant (https://www.youtube.com/watch?v=EZ9NJyfH9Gg) (using Akka cluster and streams), or .NET Orleans used in HALO's lobby system (https://www.youtube.com/watch?v=I91ZU8tEJkU). There are other IoT systems using actor clusters, such as Eero; those usually fall into the "virtual twin" pattern.

And ofc any system ever done with Akka or Erlang clusters -- plenty of talks about them around.

In the interest of keeping this thread on topic though: yeah, another way to use them is for process isolating "risky code".

tera · September 14, 2024, 12:27am

Not so long ago I played with the idea of dynamic throw/catch – a concept similar to unchecked exceptions in other languages.

tl;dr version: compared to normal throwing functions, these functions are not "coloured" as throwing vs. non-throwing, and there is no viral effect on the callers (which, with normal Swift throwing functions, must "try" and potentially be marked throwing themselves, infecting their callers in turn). Unchecked exceptions are adoptable on an opt-in basis: if you don't catch them, they behave as they do today (terminating the process), and if you opt in to catching them, you can catch them and proceed accordingly, be it a fatalError†, a precondition† failure, an index out of bounds on array subscripts†, etc.†

† - the dynamic throwing equivalents of those.
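
For contrast, this is the "colouring" that ordinary Swift errors impose today, shown with hypothetical function names: once the innermost function throws, every caller must either mark the call with `try` and become `throws` itself, or handle the error on the spot. It does not implement the dynamic throw/catch idea, which has no equivalent in current Swift.

```swift
struct IndexError: Error {}

// The innermost function is "coloured" as throwing.
func element(at index: Int, in values: [Int]) throws -> Int {
    guard values.indices.contains(index) else { throw IndexError() }
    return values[index]
}

// The caller must use `try` and, unless it handles the error, become
// `throws` itself...
func firstPlusLast(_ values: [Int]) throws -> Int {
    let first = try element(at: 0, in: values)
    let last = try element(at: values.count - 1, in: values)
    return first + last
}

// ...until some caller finally stops the propagation with do/catch.
func describe(_ values: [Int]) -> String {
    do {
        let sum = try firstPlusLast(values)
        return "sum: \(sum)"
    } catch {
        return "failed"
    }
}
```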