What are the pain points of deploying Swift on Server applications?

doctorj · December 6, 2024, 6:09pm

What have been some of the pain points in using Swift on Server? What does the community think it needs to be improved to be more widely adopted in the industry? Any specific issues with deploying on Mac or Linux would be nice to know as well.

dima_kozhinov · December 6, 2024, 8:17pm

Hi Justin,

You need to deploy all runtime dependencies. Ideally you should use the same Linux distribution and version for development that you will have on a production server.
Several years ago the company I worked for at the time started deploying microservices on its servers, and some of them were written in Swift. We used Perfect at the time (this framework looks abandoned now). One particular pain point was that when the server app crashed, there was no way to detect that and automatically restart it. Maybe Vapor has something for this scenario; I'm not very familiar with Vapor.

Both pain points were mitigated by using Docker at the time, but I'm not a Docker fan and this is not advice to use Docker.

Why our app crashed: its business logic contained some horrible legacy code translated from another programming language, and that code was prone to violating array ranges. Swift apps crash on range violations. This can be mitigated just by using an adequate programming style (it's not rocket science to never violate ranges), but we had to use really bad legacy code back then.
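One common form of such a style is a bounds-checked subscript that returns nil instead of trapping. This is only a minimal sketch: the `safe:` label is a hypothetical helper, not part of the standard library, and `prices` is made-up data.

```swift
extension Collection {
    /// Hypothetical helper (not in the standard library): returns the element
    /// at `index`, or nil when `index` is out of bounds, instead of trapping.
    subscript(safe index: Index) -> Element? {
        indices.contains(index) ? self[index] : nil
    }
}

let prices = [10, 20, 30]
let third = prices[safe: 2]    // Optional(30)
let missing = prices[safe: 7]  // nil, whereas prices[7] would crash the process
```

The caller is then forced to decide what a missing element means, which is usually where legacy code translated from another language needs attention anyway.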

There may be other reasons for unexpected termination of a server app, and this scenario surely should be foreseen and handled somehow.
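To illustrate what "detect and restart" can look like, here is a rough sketch of a supervisor loop built on Foundation's `Process`. The function names `runOnce` and `supervise` and the `maxRestarts` parameter are assumptions made for this example. In practice the job is usually delegated to a process manager or container runtime rather than hand-written.

```swift
import Foundation

/// Runs the executable once, waits for it, and returns its termination status.
func runOnce(_ path: String, _ arguments: [String]) throws -> Int32 {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: path)
    process.arguments = arguments
    try process.run()
    process.waitUntilExit()
    return process.terminationStatus
}

/// Relaunches the child until it exits with status 0 or the restart budget
/// is used up. Returns the number of launches performed.
func supervise(_ path: String, _ arguments: [String], maxRestarts: Int) throws -> Int {
    var launches = 0
    while launches <= maxRestarts {
        launches += 1
        let status = try runOnce(path, arguments)
        if status == 0 { break }
        // A real supervisor would log `status` and back off before restarting.
    }
    return launches
}
```

For example, `try supervise("/bin/sh", ["-c", "exit 3"], maxRestarts: 2)` launches the failing command three times before giving up.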

taylorswift · December 6, 2024, 9:48pm

yes, in general, i have found Swift’s “crash early and crash violently” philosophy to be problematic for server-side use cases, where availability is paramount and downtime is fatal.

to this day, Swift doesn’t have a good way to recover from precondition failures, so a single logic error in a codebase of many hundreds of thousands of lines, triggered by a single unlucky visitor (or robot), has a nuclear-sized blast radius that takes down the service for everyone.

i think the language currently also ships with some unfortunate default settings, which are straightforward for an experienced engineer to correct for, but can bite novice developers badly. i’m not sure what the status of changing these defaults is, but i hope it comes soon because it is the type of incident that can lead business-oriented folks within an organization to conclude that Swift is an unwise choice on the server.

of course you can respond “just hire people who know what they are doing”, but at the business level, having higher skill requirements to accomplish the same high level task is seen as a hindrance, not a virtue.

what we need to emphasize in order to make Swift a truly great server-side language is robustness: the ability for individual parts of an application to fail without causing an entire service to collapse for many end users at once. and it needs to become something “junior” engineers can be trusted with, instead of something you need to recruit an expert server-side sherpa to navigate.

George · December 6, 2024, 9:52pm

This isn't only an engineering issue. If you add numbers that are user-controllable, you open yourself up to malicious users DoS-ing you by just constructing the right request to crash the server. It would be great if we could have some concept of an isolation domain that fatal errors can kill without killing the full process (but that doesn't incur the overhead of a separate process).

doctorj · December 9, 2024, 8:08pm

What do you mean by this: "If you add numbers that are user-controllable, you open yourself up to malicious users DoS-ing you by just constructing the right request to crash the server"? I'm assuming this means that users with bad intent can DoS you, but maybe I'm missing something here.

George · December 9, 2024, 11:36pm

Let’s say you allow users to specify a number you represent as an Int, and somewhere (anywhere) in your request handling code you add one to that number. A malicious user could simply provide Int.max and that will crash your entire server (because of trap on overflow), whereas in many other frameworks that would just fail the single request.
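A minimal sketch of how a handler can fail only the single request instead: use the standard library's `addingReportingOverflow` and turn the overflow flag into a thrown error. `RequestError`, `nextValue(after:)`, and `handle(_:)` are hypothetical names for illustration, and the returned integers merely stand in for HTTP status codes.

```swift
/// Hypothetical error type for a request that cannot be processed.
enum RequestError: Error, Equatable {
    case valueOutOfRange
}

/// Adds one to a user-supplied number without trapping on overflow.
func nextValue(after userValue: Int) throws -> Int {
    let (result, overflowed) = userValue.addingReportingOverflow(1)
    guard !overflowed else { throw RequestError.valueOutOfRange }
    return result
}

/// Hypothetical handler: a bad input fails this request only.
func handle(_ userValue: Int) -> Int {
    do {
        _ = try nextValue(after: userValue)
        return 200
    } catch {
        return 400
    }
}
```

With this shape, `handle(Int.max)` yields 400 while the process keeps serving other requests. The catch is that every arithmetic site touching user input has to be written this way.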

Helge_Hess1 · December 10, 2024, 12:18am

I agree that this is a big issue, in particular in a SwiftNIO context where a fatalError will bring down tens or hundreds of thousands of connections.
It may be correct but it is not very resilient.

johannesweiss · December 10, 2024, 1:57am

Yes, it will crash and yes, having such a problem in a production system is an issue. And also yes: These bugs are not hypothetical, occasionally I come across one, here's one example I filed. The way to protect against that is to unit test places where we're accepting un-sanitised user input.

The reasoning here is that when designing a programming language, you have a bunch of imperfect options (non-exhaustive):

1. Define overflow/underflow as undefined behaviour or wrap around (C/Java/C++/Rust in release)
2. Treat overflow/underflow as programmer error and trap (Swift, Rust in debug mode)
3. In theory, make every arithmetic operation throws (unworkable)
4. Make every arithmetic operation throw an unchecked error that bubbles up and can be caught

(1) Means that integer under/overflows frequently become undetected security vulnerabilities because certain properties don't hold anymore. They often go undetected because the software doesn't fail.

(2) Makes the programmer responsible for making sure that integer operations don't under/overflow and when the programmer gets it wrong, you crash. That way they don't go undetected because the programmer gets told about the issue and fixes it.

(3) Is unworkable: if every arithmetic operation were throws, pretty much every function would be throws, which makes it pointless as a tool (and emits a lot of code for error recovery).

(4) Is a concept some languages (like Java) have: unchecked exceptions/errors, where by default the error would bubble up and crash the process, but if you want, you could catch it. For example, an HTTP server route handler could just catch all of these "unchecked" exceptions/errors. This still requires emitting a lot of code because potentially everything is now throwing, even though the programmer doesn't see the trys. Another problem here is that it's really hard for a programmer to reason about. For example, lock.lock(); let a = x + y; lock.unlock() looks like you always lock and then unlock, but with unchecked exceptions/errors that wouldn't be true.
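For comparison, this is roughly how the lock example looks in Swift when the addition is made explicitly throwing. It is a sketch under assumptions: `checkedAdd` and `guardedSum` are hypothetical helpers, not standard-library functions. The throwing call is marked with `try`, and `defer` keeps the unlock on every exit path.

```swift
import Foundation

/// Hypothetical error thrown when an addition would overflow.
struct Overflow: Error {}

/// Hypothetical throwing replacement for `x + y`.
func checkedAdd(_ x: Int, _ y: Int) throws -> Int {
    let (sum, overflowed) = x.addingReportingOverflow(y)
    if overflowed { throw Overflow() }
    return sum
}

func guardedSum(_ x: Int, _ y: Int, lock: NSLock) throws -> Int {
    lock.lock()
    defer { lock.unlock() }      // runs on both the normal and the throwing path
    return try checkedAdd(x, y)  // the visible `try` marks where control can leave
}
```

The difference from unchecked exceptions is that the possible early exit is visible at the call site, at the price of writing `try` at every such site.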

So clearly there are different tradeoffs here. Swift errs on the side of caution and makes sure that nothing overflowed accidentally which eliminates one class of security vulnerabilities. But of course this means that if you're not carefully validating your inputs you may crash because of undetected under/overflows.
Personally, I think that's the right default because I really dislike if programs become so-called "weird machines" and they reach states that shouldn't be possible like it often is with under/overflows. Of course not everybody will agree here.
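Swift does let you opt in to wrap-around for a single expression through the overflow operators (`&+`, `&-`, `&*`), which makes the choice explicit at the call site. A short illustration using `Int32`:

```swift
let big = Int32.max            // 2147483647

// Default arithmetic (`big + 1`) traps at runtime and ends the process.

// Overflow operators opt in to two's-complement wrap-around for one expression.
let wrapped = big &+ 1         // -2147483648
let product = big &* 2         // -2
```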

C/C++ made an IMHO really bizarre choice and defined under/overflow of signed integers as undefined behaviour. That means that the compiler can assume none of your (signed) integers over/underflow but doesn't have to check it either.

So a program like

int a = user_input;
unsigned short b = 123;

int c = a + b;
if (c < a) { // don't use this as an overflow check
   printf("this is bad\n");
   exit(1);
}

...

appears to (badly, don't use this) check for a potential overflow. But the compiler is totally free to just remove the if because it "cannot be reached" -- by legal means. This can lead to pretty bad security vulnerabilities and if you naively try to catch some cases, the compiler might just decide to remove them.

And in fact, it does:

See how it removes the if (c < a) branch, despite the fact that c == -2147483648 which is clearly less than a == 2147483647...

johannesweiss · December 10, 2024, 2:04am

Right, SwiftNIO is designed for, and very good at, handling a lot of connections in a single process with really low overhead. If you choose to run such a program, then yes, a crash will bring down more connections.
For a lot of software, especially distributed, cloud-native, scalable software this is probably a good tradeoff as the overall system is designed in a resilient way anyway. So one node crashing is seen as normal because the scale means that you'll hit hardware/network/... issues anyway which you can't protect against in software on a single box.

But of course, it depends. If you are dealing with very important information that is lost on crash, you may need to run a system which runs fewer (or even just 1) concurrent requests per process. Or design the system differently.

I'd say everything in this space is a tradeoff and Swift made an IMHO very reasonable choice but it can't be the right choice for absolutely everybody in every situation...

Ben_Cohen · December 10, 2024, 2:35am

Not just security vulnerabilities. One of the most "exciting" times in my career was observing a relatively benign calculation bug on a trading system (a fill on an order racing with a correction to that order resulted in an accidental attempt to order -100 shares) get escalated into one of the nastiest bugs you can imagine (underflow because number of shares was held in a uint meant the system attempted to buy 4 billion shares).

One of the many safety checks later along the way caught it before it did any damage but still...
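The same conversion can be reproduced in Swift to show where the "4 billion" comes from and how the checked initializers behave. A small sketch, where `requestedShares` is a made-up name for the bad intermediate value:

```swift
let requestedShares = -100   // result of the benign calculation bug

// Bit-pattern conversion, comparable to storing the value in a C `unsigned int`.
let reinterpreted = UInt32(truncatingIfNeeded: requestedShares)   // 4294967196

// Failable conversion: nil signals that the value is not representable.
let validated = UInt32(exactly: requestedShares)                  // nil

// The plain initializer, `UInt32(requestedShares)`, would trap instead.
```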

taylorswift · December 10, 2024, 2:38am

in many service architectures, the HTTP/HTML layer contributes very little to the overall load, and the lion’s share of the actual “work” is running database queries. so in these setups, you would have a handful (or even one) apex node that talks to the outside world, and distributes load among many database nodes which return data to the apex node where it gets rendered and returned to the client.

jaleel · December 10, 2024, 11:13am

Can you elaborate a bit more and give some examples? Just curious, want to see more real world examples to learn.

A bit off topic, but while I do get what issues "crash early" could bring, I do believe going distributed is the correct way to solve those issues. Maybe counterintuitive, but I think fault tolerance should be the main force to start thinking about how to distribute nodes, so that systems continue working. And Swift has instruments for that, maybe not in the best shape yet. Though of course a classic setup with microservices and k8s will also work.

sspringer · December 10, 2024, 1:52pm

The crashes you would want to "isolate" (so that the whole program doesn't crash) are the ones where incorrect or inconsistent data would be produced, and you certainly don't want to continue any operation with incorrect data. But it certainly would be a good feature to be able to “isolate” such a crash to the corresponding server request, and to do so within a single server application.

I wonder if the safety guarantees to prevent data conflicts in parallel code that came with Swift 6 could help isolate at least some (or all?) of these fatal errors in concurrent contexts. Just an idea, perhaps others can comment on this...

jaleel · December 10, 2024, 5:02pm

Btw, what are real world examples of such systems, except for Erlang? Because the problem I see is that in every other language memory is shared, so they are all vulnerable to some kind of fatalError-like problem.

Potentially one can write a distributed actors runtime using system processes, where memory is isolated, so killing it won't do any harm to other parts of an app.

taylorswift · December 10, 2024, 5:39pm

i’ve written about this countless times but off the top of my head i remember implementing a pageable HTML calendar that you could click “>” to jump to the next year. the server started crashing a few weeks later and i realized somebody had been entering Int.max for the year in the URL and clicking “>”. it had never even occurred to me that someone would go looking at years beyond 2025.
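a minimal sketch of the kind of input validation that avoids this crash: parse the year, reject anything outside a range the calendar is willing to render, and only then add one. `nextYear(fromPathComponent:)` and the `1...9999` range are assumptions made up for this example, not the code from the actual service.

```swift
/// Parses the year from a URL path component and returns the year for the ">"
/// link, or nil when the input or the following year is outside the range.
func nextYear(fromPathComponent text: String) -> Int? {
    // Hypothetical range of years the calendar is willing to render.
    let supportedYears = 1...9999
    guard let year = Int(text), supportedYears.contains(year) else { return nil }
    let next = year + 1   // cannot overflow: year is at most 9999 here
    return supportedYears.contains(next) ? next : nil
}
```

a nil result can then be mapped to a 404 or a redirect for that one visitor.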

there’s a really simple counterargument to this, which is cost.

jaleel · December 10, 2024, 8:27pm

Not sure it’s a counterargument; getting more nines in uptime (99.99…%), or basically considering stronger availability and reliability, is always about cost.

sliemeobn · December 10, 2024, 8:39pm

While I understand that "the pain is real" when having to deal with crashes that take down the entire process, I personally think it is a bit too prominently placed here on the list of pain points for Swift on the server.

I'll take a crashing service any day of the week, if the alternative is a runtime with "let's soldier on, it'll be fine YOLO" vibes. I can imagine the post-mortem forum post would then be "I accidentally updated half a million user data records with unrecoverable garbage" instead of "things were patchy for a while".

And in my experience (as mentioned upthread as well)
a) you need stuff like retrying/queueing/transactions/compensation anyway for robust systems (as things will go wrong anyway), and
b) it is a lot easier to recover from a crash than a "yolo" service doing whatever.

I am not saying things can't or shouldn't be improved (something like in-process isolates would be cool), but it's not exactly a showstopper either way imho.

To me, way bigger hurdles to having Swift shine on the server are still tooling and ecosystem. I am thinking about things like

- swiftly still being in its infancy
- SwiftPM just barely holding it together at times
- unresolved SwiftSyntax situation
- Foundation/FoundationEssentials and their platform story (despite truly heroic efforts)
- URLSession vs. AsyncHTTPClient, Data vs. ByteBuffer, "how to files?"
- rather underwhelming guidance on "how to deploy" or even "which build settings do I need"
- confusion around Xcode... (with its own toolchain, and its own build system, and its own everything...)
- ecosystem struggling to keep up with rapid evolution (concurrency, ownership, macros, ...)

I don't want to come across as too negative, you CAN get it all working - and "Swift - The Language" is well worth it for me. I just wish the tooling (and core package ecosystem) could catch up and keep up with the fast-moving train that is Swift.

George · December 10, 2024, 8:41pm

I very much understand and appreciate these tradeoffs.

That said, I think “this is a pain point when deploying server-side Swift applications” is also true. I’m also not advocating we change the fundamental nature of Swift.

The reason I even brought it up (or more accurately, +1ed @taylorswift bringing a related point up) is that I don’t think we’ve fully explored the design space yet. Particularly with things like region based isolation and distributed actors, there may still be ways in which we can “kill” a part of the process without actually killing the whole process. I certainly don’t have a concrete proposal here, but maybe there is something we can do where a single isolation region can fail in a way that doesn’t pose consistency issues for the rest of the program. Maybe something with distributed actors which aren't actually distributed (but are constrained in how they can interact with the rest of the program).

Helge_Hess1 · December 10, 2024, 8:50pm

That's one reason why I think mod_swift is a way better choice for most server side Swift ;-)
Or Node.js and Java environments. Because those catch errors and then can cleanly shut down the stack/process, w/o having to throw away everything. Especially in memory safe languages the likelihood that one transaction breaks others is very low.

I understand that this is very different at extremely high scale, but that's not something people usually do.

Helge_Hess1 · December 10, 2024, 8:53pm

This is not how it works. In production servers you usually detect a fault, tear down that transaction, mark the process for disposal but let the remaining transactions finish.
It is very rare (especially in non-C w/o memory corruption being likely) that one transaction affects others.
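The drain-on-fault pattern described above can be sketched as a small piece of bookkeeping. This is an illustration only: `Worker` and its members are hypothetical names, and in Swift it would only apply to faults that surface as thrown errors, since traps cannot be caught.

```swift
/// Hypothetical worker-process state: after a fault it stops accepting new
/// transactions but lets the ones already in flight finish.
struct Worker {
    private(set) var inFlight = 0
    private(set) var markedForDisposal = false

    /// Returns false when the worker is draining and the caller should route elsewhere.
    mutating func begin() -> Bool {
        guard !markedForDisposal else { return false }
        inFlight += 1
        return true
    }

    /// Ends one transaction; a faulted transaction marks the worker for disposal.
    mutating func end(faulted: Bool) {
        inFlight -= 1
        if faulted { markedForDisposal = true }
    }

    /// True once the worker is draining and nothing is left in flight.
    var canExit: Bool { markedForDisposal && inFlight == 0 }
}
```

A front end or process manager would stop routing to a worker whose `begin()` returns false and replace it once `canExit` becomes true.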