Memory leaking in Vapor app

tanner0101 · April 2, 2019, 2:19pm

Huge thanks to everyone for helping track this down. MySQL 3.2.4 has been tagged with a fix that reduces peak memory usage by 10-15x for the sample project. See vapor/mysql#232 if you're interested in what was fixed.

@iljaiwas please run swift package update on your real app and let me know if you see similar results.

Post mortem

Here's a small recap of what went wrong, and also why it took so long to find. The main blocker was that we initially thought this was a Linux-only memory leak, and we had some good reasons to think so:

System Monitor showed memory usage climbing into the gigabytes and never being released.
Valgrind reported bytes as "definitely lost".

If this really was a Linux-only memory leak, it was highly unlikely to be in Vapor's source code, since almost everything is the same between macOS and Linux. It was only slightly more likely to be in Swift NIO, which has more #if os checks, but still unlikely. The main suspect was Foundation, which has an entirely different implementation on Linux.
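For readers unfamiliar with it, Swift's #if os(...) is compile-time conditional compilation: each platform's build contains only its own branch, so code that looks shared can still behave differently on Linux. A minimal sketch (the function platformName is a hypothetical illustration, not code from Vapor or Swift NIO):

```swift
// Pick the platform's C library at compile time; this is the typical
// place where macOS and Linux builds of "the same" package diverge.
#if os(Linux)
import Glibc
#else
import Darwin
#endif

/// Hypothetical helper: returns the name of the platform this binary was built for.
/// Only one branch is compiled into any given build.
func platformName() -> String {
    #if os(Linux)
    return "Linux"
    #elseif os(macOS)
    return "macOS"
    #else
    return "other"
    #endif
}

print(platformName())
```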

Luckily, @johannesweiss made a couple of key discoveries:

With these insights, the issue was solved in a matter of minutes. The problem was not leaked memory but overly high peak memory usage, which led to fragmentation and which, importantly, we were also seeing on macOS. That meant the problem was most likely in Vapor. Furthermore, @Joannis_Orlandos had noted a bug in the MySQL package the previous day that could explain the high peak memory usage. That is the bug fixed in MySQL 3.2.4, and I now believe it is the solution to this problem.

Key takeaways

For anyone else who runs into this issue with Swift on Linux in the future, here are some insights that might help.

Valgrind may think Swift.String is leaking

Due to optimizations in Swift's String, Valgrind may report bytes as "definitely lost" even though they haven't been. This seems to happen only when you stop Valgrind while it is still busy doing work.

Valgrind may be very slow

It's doing a lot of black magic, so give it time. It will make your application much slower while it runs, and it also needs a significant amount of time to crunch through all the data it collects. Even after the application being debugged has finished its work, make sure Valgrind has dropped to 0% CPU before you quit it and analyze the results.

Going forward

To help prevent issues like this in the future, we need to improve Vapor's DB driver performance benchmarking. Currently, the benchmarks use small, unrealistic models and measure only run time. Benchmarks that could have caught this issue early would use realistically sized models (with diverse properties and types) and measure memory usage alongside run time. Implementing these benchmarks is unfortunately much easier said than done, but it has been on our list since long before this issue. I will try to prioritize it for Vapor 4's release cycle.
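A rough sketch of the "realistically sized model" half of such a benchmark, assuming a hypothetical BenchmarkRow model and using Foundation's JSONEncoder as a stand-in for a driver's real serialization path (neither comes from Vapor or the MySQL package). It times encoding of many rows with mixed property types. It does not measure memory: peak memory needs platform-specific APIs, for example getrusage's ru_maxrss, which is reported in kilobytes on Linux but in bytes on macOS, so a real harness would have to normalize that.

```swift
import Foundation
import Dispatch

/// Hypothetical "realistic" row: many properties of mixed types,
/// unlike a tiny single-field benchmark model.
struct BenchmarkRow: Codable {
    var id: Int
    var name: String
    var email: String
    var score: Double
    var isActive: Bool
    var createdAt: Date
    var tags: [String]
    var payload: Data
}

/// Builds `count` rows with varied, non-trivial contents (1 KiB payload each).
func makeRows(count: Int) -> [BenchmarkRow] {
    return (0..<count).map { i in
        BenchmarkRow(
            id: i,
            name: "User \(i)",
            email: "user\(i)@example.com",
            score: Double(i) * 1.5,
            isActive: i % 2 == 0,
            createdAt: Date(timeIntervalSince1970: Double(i)),
            tags: ["tag-\(i % 7)", "group-\(i % 3)"],
            payload: Data(repeating: UInt8(i % 256), count: 1_024)
        )
    }
}

/// Result of one run: wall-clock time and size of the encoded output.
struct BenchmarkResult {
    var elapsedNanoseconds: UInt64
    var bytesProduced: Int
}

/// Times JSON encoding of `rows`, a stand-in (assumption) for the driver's
/// serialization work. Memory is NOT measured here.
func benchmarkEncoding(_ rows: [BenchmarkRow]) throws -> BenchmarkResult {
    let encoder = JSONEncoder()
    let start = DispatchTime.now().uptimeNanoseconds
    let data = try encoder.encode(rows)
    let end = DispatchTime.now().uptimeNanoseconds
    return BenchmarkResult(elapsedNanoseconds: end - start, bytesProduced: data.count)
}

let rows = makeRows(count: 1_000)
let result = try benchmarkEncoding(rows)
print("encoded \(rows.count) rows (\(result.bytesProduced) bytes) in \(result.elapsedNanoseconds / 1_000_000) ms")
```

Running the same model at several row counts, and replacing the encoder call with real database round-trips plus a memory probe, would make growth in run time and peak usage visible between releases.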