Hi. I have recently come to the conclusion that there must be a way to gracefully recover from a fatalError (at least on the server, perhaps as a compiler flag).
Otherwise, it becomes a land mine when it is used at runtime in third-party packages. For example, my recent issue was with Vapor's Fluent. If I forget to include a nested model and then try to read it, this crashes the whole server instead of just returning a 500 error.
Am I doing something wrong, or does this work as intended?
Also, just for information, I got the 'inspiration' for this post after happening to read today a certain response by Linus Torvalds about Rust support in the Linux kernel, regarding runtime 'panics'. (LKML: Linus Torvalds: Re: [PATCH 00/13] [RFC] Rust support)
Anyway, I am writing this only to share my recent experiences and to (hopefully) start a discussion. (And to rant a bit, so, please excuse me somewhat)
Please share your own opinions on this topic. Thank you!
I've run into Fluent fatalErrors a few times and to be honest I think that's a library problem. In all the cases I've seen it should throw instead of trapping. I'm hoping the next major version of Fluent addresses that.
Thanks @RussBaz for sending the email from Linus, very insightful.
I have been running Swift in production on the server since Swift 3, and have also run multiple Node.js-based containers for ages.
What works for me is isolating functionality in Docker containers so that anything that fails is restarted automatically, and integrating crash reporting (sentry.io) so that you know about it fast, plus observability: CPU and memory usage at least.
As idealistic as clean code may be, I have seen containers crash no matter the language or the code quality, for a multitude of reasons: from invalid code from early Swift compiler versions, to fatal errors for missing library calls, to memory leaks caused by me, to memory leaks caused by dependencies.
The problem is the same in nature, but worse, in JavaScript land, where deploying to production is like smoking weed while riding a bike blindfolded (never tried that, just imagine it).
So just my 2 cents: even if I were able to catch fatal errors and recover, it would not simplify the bigger picture...
Martin
P.S. Sentry does not support Swift on Linux, and that is the last big missing piece from a DevOps perspective.
Having never used Fluent/FluentKit but just searching for fatalError in its source code, I'm very surprised at just how many instances of it there are - this is a project built by the Vapor folks and it's way more crash-happy than I expected. I would have thought they of all people would understand that's not a great approach when a single fatalError can take down many hundreds of unique connections at once.
After having written a lot of code in the server ecosystem that uses precondition/fatalError, and having been bitten by it too many times, my latest stance is that we should only use precondition/fatalError when continuing would leave the program in a bad state. For example, if you write a state machine and are sure a certain state can never happen, reaching it means there is a bug in how you have written the state machine.
Most of the time, throwing an error and handling it gracefully is the right approach, especially in server applications, where any precondition/fatalError that can be triggered by user input opens the door to denial-of-service attacks.
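To make that distinction concrete, here is a minimal sketch. All names in it (RequestError, parsePort, Connection, handle) are made up for illustration and do not come from Vapor, Fluent, or any other library. User-supplied input is validated with a thrown error, while precondition is kept for an internal invariant that can only be violated by a bug.

```swift
// Hypothetical example; none of these types come from Vapor or Fluent.
enum RequestError: Error, Equatable {
    case invalidPort(String)
}

// User input: bad values are expected to show up, so throw and let the
// caller turn the failure into an error response.
func parsePort(_ raw: String) throws -> UInt16 {
    guard let port = UInt16(raw), port != 0 else {
        throw RequestError.invalidPort(raw)
    }
    return port
}

// Internal state machine: an illegal transition can only be a bug in this
// code, so trapping is defensible here.
enum ConnectionState { case idle, open, closed }

struct Connection {
    private(set) var state: ConnectionState = .idle

    init() {}

    mutating func open() {
        precondition(state == .idle, "open() called on a non-idle connection: state machine bug")
        state = .open
    }
}

// A request handler (hypothetical) maps the thrown error to a status code
// instead of taking the whole process down.
func handle(rawPort: String) -> Int {
    do {
        _ = try parsePort(rawPort)
        return 200
    } catch {
        return 400
    }
}
```

With this shape, a malformed port from a client costs one 400 response, and only a genuine programming error in Connection can trap.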
There are two rival schools of thought on how to deal with such serious errors, and realistically at least four different domains where they might be applied. One is the Erlang approach of "fail fast" and spawn a new lightweight process to try the failing code again, which only works because it assumes that most code runs in isolated lightweight processes, not the core runtime that oversees all these executing processes. While not so common, this Erlang approach ran telecom systems that were highly reliable and is spreading outward from there.
Another approach is much more well known, where you have a monolithic process with little isolated concurrency and you assume most exceptions and errors do not corrupt memory or "leave the program in a bad state", so you catch or log the problem and keep the process moving along.
Both can work well when their core assumptions hold, but not otherwise. The two assumptions are:
1. Most errors are within the expected program model and won't corrupt state.
2. Most code runs in isolated processes that are overseen by a central executor, which can keep track of crashes and simply respawn when necessary.
The way fatalError() is designed assumes the first isn't true for its errors, and that the second is. How realistic that is for Swift on the server, particularly when using the new structured concurrency, I can't say, as I don't use Swift on the server.
Even Linus admits in your link that there are different domains, i.e. "kernel code is different from random user-space system tools," so it all really depends on how Swift tries to address the needs of these different domains.
Yes, the problem is that fatalError takes your whole service down including all other requests that were in flight. But in particular some of the Fluent fatalErrors are recoverable.
For instance, it hits a fatalError when you try to access a relationship that hasn't been loaded yet. That feels like something that should throw rather than crash your entire process.
When Fluent was written we were pretty constrained by the features of the language at the time and decisions were made (rightly or wrongly) to adopt things like property wrappers and properties for a better ergonomic approach for various APIs.
Thankfully the language has evolved a lot and we now have things like throwing properties (which we can't retroactively apply because that would be a breaking change). Fluent 5 (and Vapor 5, etc.) will have far fewer places that can crash the app, apart from things like misconfigurations at startup that we can't recover from.
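For readers who haven't used throwing properties yet (effectful read-only properties, SE-0310, available since Swift 5.5), here is a rough sketch of how an unloaded relation could surface as an error rather than a trap. This is not Fluent's API or its planned design; Star, Planet, RelationNotLoaded and describePlanets are hypothetical names used only for illustration.

```swift
// Hypothetical types for illustration only; this is not Fluent's API.
struct RelationNotLoaded: Error {
    let name: String
}

struct Planet {
    let name: String
}

struct Star {
    let name: String
    // nil means the relation was never loaded from the database.
    private var loadedPlanets: [Planet]?

    init(name: String, planets: [Planet]? = nil) {
        self.name = name
        self.loadedPlanets = planets
    }

    // Throwing computed property: reading an unloaded relation becomes an
    // error the caller can handle, not a process-wide crash.
    var planets: [Planet] {
        get throws {
            guard let loaded = loadedPlanets else {
                throw RelationNotLoaded(name: "planets")
            }
            return loaded
        }
    }
}

// Hypothetical call site: the failure turns into an error message
// (think "500 response") for this one request only.
func describePlanets(of star: Star) -> String {
    do {
        let count = try star.planets.count
        return "\(star.name) has \(count) planet(s)"
    } catch {
        return "error: relation not loaded"
    }
}
```

The trade-off is that every read of the property needs try, which is exactly why it cannot be retrofitted without breaking existing call sites.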
Distributed actors enable various things, from load balancing across nodes, to isolating risky work into processes in the same nodes, or just using them for websocket or whatever else you might want. This all implies different transports and semantics... It depends what you're looking for.
Classic talks I recommend people watch on the topic are Tesla's virtual power plant (https://www.youtube.com/watch?v=EZ9NJyfH9Gg) (using Akka cluster and streams), or .NET Orleans used in Halo's lobby system (https://www.youtube.com/watch?v=I91ZU8tEJkU). There are other IoT systems using actor clusters, such as Eero; those usually fall into the "virtual twin" pattern.
And ofc any system ever done with Akka or Erlang clusters -- plenty of talks about them around.
In the interest of keeping this thread on topic though: yeah, another way to use them is for process-isolating "risky code".
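As a very rough illustration of the process-isolation idea, here is a sketch that does not use distributed actors at all, only Foundation's Process, to supervise a child process and relaunch it when it exits abnormally. The supervise function and its parameters are hypothetical, and a real setup would need backoff, logging and a way to pass work to the child.

```swift
import Foundation

// Hypothetical supervisor: runs a command in a child process and relaunches
// it after an abnormal exit, up to `maxRestarts` times.
// Returns the number of launches that were needed, or nil if it gave up.
func supervise(_ executable: String, _ arguments: [String], maxRestarts: Int) throws -> Int? {
    let maxLaunches = max(0, maxRestarts) + 1
    for launch in 1...maxLaunches {
        let child = Process()
        child.executableURL = URL(fileURLWithPath: executable)
        child.arguments = arguments
        try child.run()
        child.waitUntilExit()

        // A clean exit ends supervision. A crash (uncaught signal) or a
        // non-zero status only costs this one child, not the supervisor.
        if child.terminationReason == .exit && child.terminationStatus == 0 {
            return launch
        }
    }
    return nil
}
```

For example, supervise("/bin/sh", ["-c", "exit 0"], maxRestarts: 2) finishes on the first launch, while a command that always fails is tried three times before the function gives up and returns nil.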