Memory leaking in Vapor app


(Helge Heß) #34

I don't think Vapor has an ORM (Object Relational Mapper ...), doesn't it just use Codable to map incoming data to arbitrary value types? So in the worst case the data should be gone after mapping? (it likely doesn't have an ORM like uniquing context, aka NSManagedObjectContext/EOEditingContext, which tends to gobble up memory :-) ).

But it sounds a little like the user may be keeping a handle to some Data column (which may then be referring to the full ByteBuffer). That might explain it. (though I guess not really, because all those Data columns in a single result set would share the same BB, which is IMO much better than copying the buffer into 10k standalone Data storage objects, maybe the root case is really the use of Data here with its dual-alloc, instead of using a BB slice).

P.S.: "But I'm off for the weekend (without computer)" - I hope you actually leave it behind ;-)


(Simon Pilkington) #35

Incidentally, my week was spent looking at memory utilisation graphs which were gradually increasing in a number of our SmokeFramework + SmokeAWS + SwiftNIO services. My investigations tally closely with the analysis by @johannesweiss that it is related to peak memory usage. I gained more insight into what I had been seeing from Johannes' comments so that has been very useful as well.

I don't want to highjack a thread about Vapor but I did wanted to mention that we are seeing similar issues and there are likely optimisations the frameworks can do to at least minimise the impact of this.


(Amir Abbas Mousavian) #36

I checked the code of reading ByteBuffer as Data here:

We are not going to copy here when data length is inlinable. I also checked Foundation source and it does not copy either (for obvious reasons),

That was only a suggestion to copy data when they are inlinable. You may oppose for other reasons as you know NIO much better than me. But I think it worth considering as inlinable data are held in the stack and there is no malloc overhead.

The code would become something like this:

    public func getData(at index0: Int, length: Int) -> Data? {
        let index = index0 - self.readerIndex
        guard index >= 0 && length >= 0 && index <= self.readableBytes - length else {
            return nil
        }
        
        let inlinableCount: Int
#if arch(x86_64) || arch(arm64) || arch(s390x) || arch(powerpc64) || arch(powerpc64le)
        inlinableCount = 14
#elseif arch(i386) || arch(arm)
        inlinableCount = 6
#endif
        
        if length <= inlinableCount {
            return  Data(bytes: UnsafeMutableRawPointer(mutating: ptr.baseAddress!.advanced(by: index)),
                        count: Int(length))
        }
        
        return self.withUnsafeReadableBytesWithStorageManagement { ptr, storageRef in
            _ = storageRef.retain()
            return Data(bytesNoCopy: UnsafeMutableRawPointer(mutating: ptr.baseAddress!.advanced(by: index)),
                        count: Int(length),
                        deallocator: .custom { _, _ in storageRef.release() })
        }
    }

(Cory Benfield) #37

I think @johannesweiss and I are entirely in favour of copying into Data in some cases: don’t mistake his comments for disagreement. I think what he’s saying is that NIO should have more aggressive heuristics than you’re proposing. For example, copying up to 1024 bytes may still be cheaper overall than this behaviour. And if you have a ByteBuffer holding 100MB, copying out 1MB may be worth it if that gives the opportunity to free the underlying buffer.

In practice, the case you suggested and a general heuristic about when copying is cheap can probably be implemented by NIO. The other case (accidentally keeping the larger buffer alive) is something that users will probably have to work around, but we can definitely work to add diagnostics for when this is happening.


(Johannes Weiss) #38

100% what @lukasa says. One option we also have is to implement heuristics with an explicit override for the user if they know what they're doing. Something like

extension ByteBuffer {
    public enum ByteTransferStrategy {
        case heuristic
        case copy // maybe `case forceCopy`
        case noCopy // maybe `case forceNoCopy`
    }

    public func getData(at index: Int, length: Int, transferStrategy: ByteTransferStrategy = .heuristic) {
        [...]
    }
}

does that sound like a plan?


#39

Done with SR-10252


(Johannes Weiss) #40

Thanks @Tof. Just checked out that issue. Would you mind retrying with the Swift 5 compiler?

Why would Swift 5 help? I believe the reason that you're seeing this issue is that the Swift compiler and the clang on your system have a different address sanitizer version an they're incompatible. That was a longstanding bug in Swift that it relied on the system's clang. So stuff (like ASan) relying on a particular clang version would only work if your system happened to have the right one. Swift 5 now finally fixed that by shipping a matching clang compiler :slight_smile:.

EDIT: I just tried the example project from the bug report and it does compile fine with the Swift 5 compiler & ASan.


(Ilja Iwas) #41

Re-reading you original analysis, I'm wondering what bytes could remain allocated after the request has been handled that prevent all those memory pages from being freed?

Could these be cache entries created by Vapor or the mysql client library, which unfortunately get placed on the same memory page as data allocated to fulfil the HTTP request? Maybe @tanner0101has an idea?


(Johannes Weiss) #42

There are quite a few ways that this can happen:

  • caches (eg. Swift's MetadataCache, maybe Vapor has some caches, ...)
  • things that are moved/copied around, for example writing to an array/dictionary/string/... might trigger a copy (due to copy-on-write) and the new backing storage might end up at a random place on the heap. So even though no more bytes might be allocated at the end, the backing storage might have moved around and is now occupying space in a totally different region which could prevent some memory from being released back to the kernel.

You'll see that before serving any requests, we had 914544 bytes allocated and after some requests had been handled that went up to 1027664, so in total, so it increased by roughly 113kB. The good thing is that it totally stabilised after going up by 113kB which isn't very much at all.

Also, the allocator itself is a non-trivial data-structure itself, so for its own bookkeeping it will also require some space. What exact memory regions can be returned to the kernel highly depends on the implementation of the allocator itself. Also, the allocator doesn't guarantee that it will try to immediately return as much as possible to the kernel.

But all of the issues will fortunately be solved if you can solve the peak memory usage issue.


(Tanner) #43

Huge thanks to everyone's help here tracking this down. MySQL 3.2.4 has been tagged with a fix that reduces peak memory usage by 10-15x for the sample project. See vapor/mysql#232 if you're interested to see what was fixed.

@iljaiwas please run swift package update on your real app and let me know if you see similar results.

Post mortem

Here's a small recap of what went wrong, and also why it took so long to find it. The main blocker was that we initially thought this was a Linux-only memory leak--and we had some good reasons to think that:

  • System Monitor showed memory usage going up into the GiB and never being released
  • Valgrind showed bytes "definitely" being lost.

If this indeed was a memory leak on Linux only, then that meant it was highly unlikely to be in Vapor's source code since most everything is the same between macOS and Linux. It was only slightly more likely to be in Swift NIO where there are more #if os, but still unlikely. Really the main and only suspect would be Foundation which has an entirely different implementation on Linux.

Luckily, @johannesweiss made a couple of key discoveries:

With these insights, the issue was solved in a matter of minutes. The problem was not leaking memory, but overly high peak memory usage which lead to fragmentation. Which, importantly, we were also seeing on macOS. This meant it was highly likely the problem was in Vapor. Furthermore, @Joannis_Orlandos had already made note of a bug in the MySQL package the previous day that could be the culprit for high peak memory usage. That is the bug fixed in MySQL 3.2.4 and what I now believe is the solution to this problem.

Key take aways

For anyone else that might experience this issue with Swift on Linux in the future, here are some insights that might help.

  • Valgrind may think Swift.String is leaking

Due to optimizations in Swift's String, Valgrind may think bytes have been "definitely lost" even though they haven't. This seems to only happen when you stop Valgrind while it is still busy doing work.

  • Valgrind may be very slow

It's doing a lot of black magic, so give it time. It will make your application much slower while it's running, but it will also take a significant amount of time to crunch through all the data it collects. Even if the application being debugged has finished doing its work, make sure that Valgrind goes to 0% CPU before quitting to analyze results.

Going forward

To help prevent issues like this from happening in the future, we need to improve Vapor's DB driver performance benchmarking. Currently, benchmarks are for small, unrealistic models and test run time only. Benchmarks that could have caught this issue early would test realistically sized models (with diverse properties and types) and also memory usage alongside run time. Implementing these benchmarks is unfortunately much easier said than done, but it's something that has been on our list long before this issue. I will try to prioritize for Vapor 4's release cycle.


(Johannes Weiss) #44

Awesome work @tanner0101! Just as note: Valgrind and most other tools might mark pretty much anything as 'definitely lost' despite the fact they might not be. It's very likely for String but any class instance that's packed into an enum, even if it's an enum containing a struct containing a class can hit this problem. Long story short: 'definitely lost' means 'maybe lost' in most cases :slight_smile:


(Jon Shier) #45

How did you track the issue down to that line of code? It didn't seem like the valgrind output was specific enough to narrow it down like that.


#46

@tanner0101 really nice for MySQL :+1: :grin:
Can you do the same for Postgres too? ( please :pray: )
Like @Jon_Shier I'm surprised you could identify these lines just with valgrind
Ah! If with Xcode we could do remote debug on Linux for our vapor apps, as we do for iOS apps, it would really help :heart_eyes: (I dream of that since Vapor 1 :stuck_out_tongue: )


(Ilja Iwas) #47

Thanks a lot @tanner0101!

We ported our demo and our real production app to Postgres in the meantime, and your MySQL fix now provides almost identical memory behaviour for both databases.

@Tof We have a Postgres branch of our demo app, and it didn't suffer from this memory fragmentation issue as the MySQL version did. Feel free to check it out here: https://github.com/lluuaapp/vapor-memory-leak/tree/postgres


(Ian Partridge) #48

Great work everyone. For me, the takeaway from this is how these types of problems are going to be much more visible in Swift workloads than they are in, say, Java.

In Java, you've probably already set a JVM heap size of multiple gigabytes. If some component is spiking its memory usage and generating large amounts of garbage, it's just going to drive the collector harder. Unless you're keeping an eye on your GC logs and notice the increased rate of GC you may not even notice the problem, as the app may well still fit inside the heap so there will be no OutOfMemoryError.

In Swift, simply watching the resident set size through your system metrics is enough to spot that there might be a problem.


(Tom Lane) #49

Hi, newbie question here. Are we saying it is normal for Swift programs to potentially hold onto whatever peak memory they spike to?

@IanPartridge, you used the word “problem” to characterize spiking memory, which it certainly was in this case. But it seems to me one ought to be able to write for example an image or in-memory data processing program in Swift that temporarily loads a couple GB and returns it to the kernel without issue. At least that’s what I naively expected until I read through this issue (which by the way was a fascinating journey!)

@johannesweiss helpfully pointed to some lower level Swift caches and backing stores that could take up residence in new pages, I think? It wasn’t clear to me if these were sticking around because the Vapor driver was holding on to them or if they’re just the sort of things that can get sprinkled about when memory grows, making a Swift program’s RSS like a balloon that never deflates.

Searching for clarity, I found some comments from the amazing Chris Lattner :pray: where he said “Given that GC’s use a lot more memory than ARC systems do, it isn’t clear what you mean by GC’s winning on heap fragmentation.” He essentially made the point that ARC optimizes for important things but not necessarily max RSS. Which makes sense, but it’s not clear to me just how far this goes. Is it literally a conscious design trade off that Swift potentially holds onto its peak memory?

Apologies in advance if I’m (probably definitely) misunderstanding something here.. just wanted to take advantage of the incredible talent on this thread. :slight_smile: Thanks!


(Helge Heß) #50

Are we saying it is normal for Swift programs to potentially hold onto whatever peak memory they spike to?

It is highly implementation and access pattern dependent, but this is generally the case for malloc backed memory allocation (that is for most languages but such coming with an own "VM"), particularily on Linux.
Malloc doesn't usually return memory to the system, but puts freed blocks into a free list for reuse. So the RSS may indeed just appear to grow and never really shrinks (except for large allocations). In the worst case, unused memory pages will get swapped out.

(BTW: The same is true for Java/JS: it is implementation specific and depends on the GC in use. If you use a copying/compacting GC, which JVM allows, it can actually resolve the fragmentation.).

But it seems to me one ought to be able to write for example an image or in-memory data processing program in Swift that temporarily loads a couple GB.

Again, highly implementation dependent, but if you allocate memory over a certain size (presumably over the page size [4KB..16KB]), malloc will place you in a separate code path and dealloc the memory on free (no point in returning a 30GB memory block into the freelist).
For image processing (or other large files) in particular, the file may not be even loaded into "regular" malloc memory, the file might be mapped into memory (though careful with that as well).
But yes, it is a common pattern on Unix to do expensive stuff out of a concurrent server and in an external process / in a job queue.

But you mention one thing which is quite likely another problem here: in-memory data processing.
The original program sounded like it is fetching a whole set of data from the DB into the memory just to deliver it a second later and then drop it from the memory. That is almost universally not a good approach. (Obviously I can't say about this particular implementation) But I've seen this often and generally you should be directly streaming the data from the database to the http connection and do your transformations on the fly. This way you get "O(1)" memory use, instead of "O(n)".
Even though NIO mysteriously has no piping function builtin :-), it is setup to support this really well (I eventually need to rebuild Noze.io on top of it).

The problem with "just load everything into RAM" is that in highly concurrent framework like NIO many concurrent connections may be doing that at the same time, turning the issue into a "O(n^2)" ...

So: Stream if you can! ;-)

As with every algorithm, you need to choose the right one for your task and be aware of how your computer works. Resources, neither CPU nor RAM, are free :-)

backing stores that could take up residence in new pages, I think? It wasn’t clear to me if these were sticking around because XYZ was holding on to them or if they’re just the sort of things that can get sprinkled about when memory grows, making a Swift program’s RSS like a balloon that never deflates

No, that won't easily happen. NIO and its ByteBuffer is already doing the right thing here and NIO itself doesn't hold onto buffer objects. My understanding about what happened is this:

  • NIO is receiving network packets, say 1KB ByteBuffer blocks
  • it is vending those blocks to the MySQL driver
  • even for just reading tiny amounts of data (integers, 4 bytes?), that driver would use the ByteBuffer.getData to create a Foundation.Data object for those 4/8 bytes
  • NIO tries to be smart and not copy the data, but vend a Data object which is backed by the ByteBuffer itself (instead of copying unnecessarily)
  • Unfortunately Swift's Data implementation for "backed-data" is really inefficient, it is not just holding a reference and a slice, it is actually creating a new heap block to hold that reference.
  • Plus there is SR-7378, which requires yet another heap allocated closure.

To recap: To read and transform 4 bytes of data, we are allocating two tiny memory objects on the heap, just to drop them a second later.
If that incoming 1KB block has 100 ints, that is ~200 extra RAM blocks just to parse a single incoming buffer.

In short: The driver was using an inappropriate algorithm aka a plain bug (please correct me if I'm wrong). A low level driver should probably not use Data at all, but just use NIO's builtin ByteBuffer reading facilities (or peek at its raw data).

P.S.: Small ad: Apache aka mod_swift is really good because of this. It allows you to automatically and gracefully recycle its request handling subprocesses if they peak either due to a RC bug, due to fragmentation, or due to excessive single time RAM use :->


(Johannes Weiss) #51

Hi @tlane, thanks for your questions! I'm sure lots of people have the same questions, so I'll try to answer some of them.

No. Swift programs behave here exactly like C/C++/Rust/most other compiled languages. The thing that still holds onto space is the allocator, ie. the malloc implementation that comes with your operating system. It might hold only some more bytes for a multitude of reasons but the two most common ones are:

  1. it chooses to hold onto some extra bytes to service future allocations more quickly
  2. it has to hold onto some extra bytes because of memory fragmentation

Yes, you can absolutely write a program that loads of couple of GB into memory. If your program requires a couple of gigabytes in memory, that's cool. In this case however the program did not require 400 MB of stuff to be in memory at the same time but it still did. Even then it's not really an issue (because the memory will just be re-used later) but it was very surprising because no one expected a 400MB RSS after the requests had been processed.

If I understand @IanPartridge correctly he actually says that these issues are easier to spot with Swift because they're not hidden behind a GC. In a GC'd language you wouldn't think anything if you saw 400MB RSS, that's just totally normal. Usually, the GC will only do work once the heap space hits the configured maximum. In Swift however, because we're closer to the metal, we could identify an issue pretty early on that might have been hidden for quite a long time in a GC'd language. As Ian points out, in a GC'd language the only signal would've been that the rate of garbage collection is too high. But to notice a too high GC rate, you need to have a good feeling of what the GC rate should be and also you would need to look at the GC logs. That's much harder and requires much more intricate knowledge than just running ps.

However, as we've all learned, RSS is also not always a good approximation of how much memory is allocated. Especially on Linux.

Swift absolutely releases memory and it did in this case. A grand total of 113 kB were stored in some caches, that's next to nothing. After the first set of requests it didn't grow a single byte anymore which is great. The allocator (that comes with the operating system's libc and which is used in C/C++/Rust/...) however might 1) choose to not return everything to the OS 2) might be unable to return everything to the OS because of fragmentation.

This is not an ARC issue. All the memory (but 113kB) was given back to the allocator.

Regarding GCs being better for heap fragmentation: GC'd systems that have a copying GC can reduce fragmentation by copying live objects closer together in a GC cycle. But don't be fooled, as Chris points out GC'd systems use a lot more memory than ARC systems do (because running the GC is expensive so they try to run it only if necessary) so whilst you might be (partially) solving one problem (memory fragmentation), you're buying into a multitude of other issues that GCs bring. Both, GCs & ARC are great solutions that optimise for slightly different things. But ARC will use less memory than a GC'd system.


Please feel free to ignore the rest of this post but I just wanted to post some pretty awful C code that demonstrates some of the things:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#ifdef __linux__
#include <malloc.h>
#endif

const int no_allocs = 100000;
void *allocs[no_allocs] = { NULL };
size_t sizes[no_allocs] = { 0 };
size_t still_there = 0;

int main() {
    char *string = NULL;

    for (int i=0; i<no_allocs; i++) {
        size_t size = ((i % 10000) == 0) ? 100 : 1000;
        allocs[i] = malloc(size);
        allocs[i] || (abort(), 0);
        sizes[i] = size;

        memset(allocs[i], 0x41, size);
    }

    for (int i=0; i<no_allocs; i++) {
        if (sizes[i] >= 1000) {
            free(allocs[i]);
            allocs[i] = NULL;
        } else {
            still_there += sizes[i];
        }
    }

    asprintf(&string, "ps uxw -p %d", getpid());
    system(string);
#ifdef __linux__
    {
    struct mallinfo mi = mallinfo();
    printf("Total allocated space (bytes): %d\n", mi.uordblks);
#endif
    printf("expected still alloc'd       : %lu\n", still_there);
#ifdef __linux__
    printf("DIFF                         : %ld\n", (long)mi.uordblks - (long)still_there);
    }
#endif

    return 0;
}

This program does 100,000 allocations. Every 10,000th allocation (so 10 of them) will be 100 bytes, all others will be 1,000 bytes. The program then frees all the 1,000 byte long allocations and then prints out some stats. Because it never frees the 100 bytes allocations, we'd expect a grand total of 1,000 bytes to be still allocated when it prints out the stats.

Let's run it on Linux:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  7.0  2.2  94648 91488 pts/0    Ss+  10:49   0:00 ./a.out
Total allocated space (bytes): 8912
expected still alloc'd       : 1000
DIFF                         : 7912

So our RSS is 91 MB, we expect 1000 bytes to be still allocated and in reality it's 8900 bytes. The missing 7900 bytes will (just like in Swift's case) be in some random C/OS runtime thing/cache, let's ignore them. The interesting thing is that we have 8900 bytes alloated but or RSS is 91 MB! So a very similar "issue" to what we've seen before: The allocator chose not hand the memory back to the kernel and that's totally fine :slight_smile:. It did choose to return some memory because the peak memory required for the above program is higher than 91MB but it mostly held onto those bytes. Also, the allocator itself contains data-structures that contain information about the heap, they use up space too.

But let's run it on Darwin:

$ bash -c ' clang memtest.c && ./a.out '
USER       PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
johannes 50563  15.4  0.0  4281820  15332 s005  S+   11:51am   0:00.10 ./a.out
expected still alloc'd       : 1000

Ha, on Darwin our RSS is only 15MB, we don't know exactly how many bytes were still allocated because there's no mallinfo on Darwin and I just didn't go look what the appropriate API is but let's assume it's also in the ballpark of the 8900 bytes that Linux told us.
The real point is: As pointed out before: RSS will always be at least as big as the number of allocated bytes and the allocator (that comes with the operating) system chooses what to hand back to the kernel and what not.

What can we learn from this:

  • watching RSS can make sense because it can be a signal that peak memory usage is too high.
  • reducing peak memory usage is always a good idea
  • let's not stress :slight_smile:, the only issue was that the database driver used too much memory and conveniently that's already fixed :raised_hands:

(Ian Partridge) #52

Absolutely. However it's not clear to me that malloc() is necessarily the best way for ARC to handle this. It may be possible to achieve better performance by using anonymous memory mapping and sub-allocating into the mapping. Then the mapped memory could be backed by large pages (MAP_HUGETLB etc.)


(Johannes Weiss) #53

Sure, you can do those tricks and malloc can do those tricks too. If you're not happy with glibc's default allocator you're not alone, you could for example use jemalloc today. I haven't tried it but jemalloc should just work with Swift. In NIO's 'allocation counter' test suite I also replace malloc/free/realloc/... with a custom allocator without any issues.

Swift could obviously offer first-class support for choosing between different allocators. Then you could allocate 'special' objects on their own allocators which might perform better for that particular workload. But:

  • I don't think that's our biggest issue at all today
  • using Swift's seamless interfacing with C, you can do much of that already today
  • I could totally see such support coming in future Swift versions. I imagine this might be quite useful for certain embedded workloads(?)