Swift RSS memory consumption is much greater than heap memory

taylorswift · July 11, 2022, 1:11am

i’m trying to profile a swift application with heaptrack, and i’m seeing a peak RSS memory consumption that is almost 4x larger than the peak heap memory consumption:

total runtime: 323.71s.
calls to allocation functions: 368371213 (1137980/s)
temporary memory allocations: 304018730 (939181/s)
peak heap memory consumption: 259.81M
peak RSS (including heaptrack overhead): 1.11G

my deployment target only has about 700 MB of memory available for swift, which means even if all the heap allocations were removed, the app still wouldn’t run.

is this normal? how do you debug RSS memory consumption?

lukasa · July 11, 2022, 10:43am

Swift does not own the allocator, so it is not capable of controlling how that allocator operates. It is not uncommon for the default Linux allocator (glibc) to overcommit memory to some degree. Much of this memory is likely to be marked reclaimable by the OS (MADV_FREE), which gives the OS the option to reclaim but not the obligation to, where it will typically only free that memory under memory pressure.

If you're unhappy with this behaviour you can change the allocator. tcmalloc and jemalloc are both better behaved at returning memory to the OS.

The real question is: is the memory usage you're seeing actually a problem?

taylorswift · July 11, 2022, 6:08pm

yes, it is a problem, on my deployment target, my web server daemon gets repeatedly killed by the OS with out-of-memory error. this is the reason why swiftinit.org has been somewhat unstable recently.

luckily i discovered last night that the memory leak is actually not in my application but in SwiftShell, which is one of its dependencies. i was using SwiftShell to run git to get a commit hash for various resource files for etagging purposes, and when i turned off the git querying, peak memory consumption fell from 1.2 GB to around 250 MB.

lukasa · July 12, 2022, 10:49am

Are you sure it was leaking, as opposed to just causing RSS to inflate?

taylorswift · July 12, 2022, 5:27pm

is there a difference?

John_McCall · July 12, 2022, 6:31pm

RSS generally reflects peak memory usage and not current memory usage. The OS has little ability to understand the latter unless the allocator actually remaps/unmaps pages.

taylorswift · July 12, 2022, 7:32pm

well, it definitely behaved like a leak, because each time i loaded a file, the RSS would grow, but it would never decrease. this was true even if the contents of the file were discarded and repeatedly reloaded.

John_McCall · July 12, 2022, 7:59pm

If repeating an operation and throwing the result away doesn't rapidly converge towards some constant memory use, I agree that sounds like a leak. But that's the question that Cory was asking.

lukasa · July 13, 2022, 12:24am

Concretely, the difference is that inefficient allocator behaviour approaches a limit over time, while a leak does not. As a practical difference, changing your allocator (say, to tcmalloc) cannot solve a memory leak, but it can absolutely solve an allocator inefficiency.

Linking tcmalloc should take about 30s to test, and will conclusively tell you the difference. (It’s available on Ubuntu as google-perftools, then you can use LD_PRELOAD to override the allocator and point at libtcmalloc.so.)

johannesweiss · July 13, 2022, 8:43am

tl;dr: There are tools to figure out what's going on with memory & RSS. On Linux, /proc/pid/smaps is of particular interest. Specifically the difference between Private_Dirty/Private_Clean and Lazy_Free.

To expand here a little bit: Application memory can be in a few different states and whilst everything said here is true I think we should add that you can check what the problem is.

But let's first start with a few states of the memory. If you allocate memory in your application, this will likely go through malloc(3) or friends (posix_memalign, ...) and hopefully (unless it's in use forever or you have a leak) eventually be freed using free(3). Many people I speak to assume that memory from malloc comes straight from the kernel and gets released back to the kernel after free. This is not true. malloc, free etc are implemented by an allocator, the default one usually comes with your libc. These are complicated beasts, can be tuned and replaced which is what @lukasa is alluding to.

Just to be clear, in Swift you usually don't actually call malloc/free yourself, you typically create instances of classes or data structures that are backed by them (Array, Dictionary, String, ...). Once their reference count drops back to zero, Swift's runtime will free them for you. But the Swift runtime won't keep malloc'd memory around so for the sake of this discussion we can ignore Swift's runtime.

Conceptually, if you (or the Swift runtime) call malloc, the allocator will likely look if it already has spare memory that it previously requested from the kernel (usually using mmap). If it does, it will assign that memory to you without having to ask to the kernel for anything. It will just mark this memory as in use in its data structures and your malloc will return it.

Similarly, if you call free the allocator can and will absolutely hold on to that memory, i.e. it will usually not immediately return this to the kernel (or even tell the kernel about that fact). In fact it often can't return the memory because you can only ever return at the very least full pages (usually 4 kB or 16 kB) to the kernel. But even if you free fully page-aligned memory that is a multiple of your page size the allocator might (and will) still hold onto it. If the allocator's heuristics decide that something is worth returning to the kernel, then they will and only then is there a chance for RSS going down.

But, as Cory points out, there are different ways of returning memory to the kernel:

munmap: which invalidates that page mapping, returns the pages to the kernel immediately. Memory returned using munmap will immediately decrease your RSS.
madvise(..., MADV_FREE) does not invalidate any mappings/pages, it merely tells the kernel that you do not need the contents of that memory anymore (so the physical memory pages backing this memory can be reused by the kernel when it wants). But it's important to understand that these mappings are still owned by the application. The application can even still use (read/write) that memory (which will re-dirty it). Memory "returned" using madvise(..., MADV_FREE) does not immediately decrease your RSS size. It will eventually decrease your RSS if the kernel runs into memory pressure which means it'll actually reuse the physical pages for something else (and you'll likely get mapped a copy-on-write zero page).
madvise(..., MADV_DONTNEED) is like madvise(..., MADV_FREE) except that it does all the returning memory immediately. I.e. your RSS will decrease immediately.

Okay, we covered a few basics. Let's discuss again a few (non exhaustive states of memory):

allocated (using malloc) and in use: obviously counts towards RSS, you're using it [you, the allocator and the kernel agree that it's in use]
allocated using malloc and leaked: counts towards RSS [you might think it's not in use but the allocator and the kernel don't know that]
not allocated using malloc but either in the allocator's pools or sharing a page with something that's allocated. Counts towards RSS [you think it's freed, the allocator thinks it's freed, kernel thinks it's allocated]
not allocated & returned using madvise(..., MADV_FREE). This counts towards RSS because until the kernel has actually reused that memory [you think it's freed, the allocator knows it's freed, the kernel knows that you don't need the data (so it can just drop the physical memory pages) but your application still has the mappings]. Essentially, you no longer control your exact RSS size after MADV_FREE because the kernel can reduce it when it wants which is usually when it suffers memory pressure.
not allocated & returned using madvise(..., MADV_DONTNEED). This does not count towards your RSS [you think it's freed, the allocator knows it's freed, the kernel knows that you don't need the data and has unmapped the physical pages)]. You still get to keep the virtual memory mappings though.
not allocated & returned using munmap. This does not count towards RSS [you, the allocator and the kernel know this isn't allocated]. If your application were to touch the memory after munmaping it, it would crash (SIGSEGV).

And now, let's look into how we can figure out how much memory in your application is in what state.

If you want the kernel's understanding of (1), (2) and (3) you can grep ^Private_Dirty: /proc/YOUR_PID/smaps | grep -v ' 0 kB' and see all mappings that are "private" (not shared with other processes) & "dirty" (you actually used this memory, kernel can't just forget the contents).

Unfortunately, we can't easily tell (1) and (2) apart without introspecting the actual data of your address space. valgrind/heaptrack/... can do this partially for Swift and heap/leaks on macOS can do this pretty well.

(3) is actually quite similar because the kernel is fully oblivious to this. The only way to get information of memory that's allocated & in use vs. memory that's allocated & in the allocators pools/fragmentation is from the allocator itself. On Linux, mallinfo/mallinfo2 can help here, on macOS it's heap.

(4) MADV_FREE'd pages that haven't been reclaimed are fortunately quite easy to spot on Linux because the kernel has understanding of it. You can look into /proc/YOUR_PID/smaps and check all all entries that are LazyFree: grep ^LazyFree: /proc/YOUR_PID/smaps. Each line that appears is something that your allocator allocated using mmap earlier and then told the kernel that it's no longer needed using madvise(..., MADV_FREE). Over time (when the kernel actually reclaims those pages, they become like (5))

(5) MADV_DONTNEED'd pages (and MADV_FREE'd ones that have been reclaimed by the kernel) are a little bit harder to spot but they don't increase your RSS so maybe you just ignore them. If you're curious, those would appear as mappings that have a size (e.g. Size: 132 kB) but show up as Private_Dirty, Private_Clean, Shared_Dirty, Shared_Clean all 0 kB. That means that whilst you still own the virtual memory mappings, no physical pages are mapped to those. You can read/write them but you'll suffer page faults.

All the memory in (6) (munmap'd memory) doesn't appear in /proc/YOUR_PID/smaps at all and is just like memory that never got mapped.

Lastly, it's worth noting that not all allocators even use MADV_FREE and sometimes you can configure if you want it to use MADV_FREE or not. In general, MADV_FREE is a good idea because it will make reclaiming previously "returned" memory much cheaper in certain cases. For example if the kernel hasn't gotten around to really unmapping the physical pages from your MADV_FREE'd memory then you can just reuse it without doing a syscall or suffering a page fault.
But if you use a resource limiting system that limits based on your RSS, then MADV_FREE might be an issue for you.

Especially with containers MADV_FREE could be an issue if your hosting kernel has access to a load of memory but your container is limited by RSS. That would mean that the kernel itself never really runs into memory pressure so it might never actually reclaim the MADV_FREE'd pages but you still "exhaust" your resources because your RSS stays high.

@taylorswift if you share a copy of your /proc/YOUR_PID/smaps in the bad state, I'm happy to have a look too.

lukasa · July 13, 2022, 11:14am

As a practical matter, many of the high-profile adopters of MADV_FREE have subsequently unadopted it due to this reason, as well as due to the fact that they get a lot of "my program uses way too much memory" bug reports.

johannesweiss · July 13, 2022, 12:08pm

@lukasa I just checked glibc's history and from what I can tell it has actually never used MADV_FREE and always used MADV_DONTNEED. I grepped through the whole history of glibc (from git://sourceware.org/git/glibc.git) and the only additions/deletions with MADV_FREE I could find are the ones where they added the constants.

Did you observe some specific Linux distro applying a patch to glibc that makes it use MADV_FREE?

$ git log -p > /tmp/libchist
$ grep '^[+-]' /tmp/libchist | grep MADV_FREE
+	[__USE_MISC] (MADV_FREE): Likewise.
-# define MADV_FREE	  8	/* Free pages only if memory pressure.  */
-	* bits/mman-linux.h [__USE_MISC] (MADV_FREE): New macro.
-	(MADV_FREE): Likewise.
+	* bits/mman-linux.h [__USE_MISC] (MADV_FREE): New macro.
+	(MADV_FREE): Likewise.
-	* bits/mman-linux.h [__USE_MISC] (MADV_FREE): New macro.
-	(MADV_FREE): Likewise.
+	* bits/mman-linux.h [__USE_MISC] (MADV_FREE): New macro.
+	(MADV_FREE): Likewise.
+	* bits/mman-linux.h [__USE_MISC] (MADV_FREE): New macro.
+	(MADV_FREE): Likewise.
+# define MADV_FREE	  8	/* Free pages only if memory pressure.  */
+# define MADV_FREE	  8	/* Free pages only if memory pressure.  */
-# define MADV_FREE	  5	/* Content can be freed (Solaris).  */
-# define MADV_FREE	 5	/* Content can be freed (Solaris).  */
+# define MADV_FREE	  5	/* Content can be freed (Solaris).  */
+# define MADV_FREE	 5	/* Content can be freed (Solaris).  */
-# define MADV_FREE	 5	/* Content can be freed (Solaris).  */
+# define MADV_FREE	 5	/* Content can be freed (Solaris).  */

FWIW, doing the same for MADV_DONTNEED you will find a lot of hits, specifically the ones that matter:

$ grep '^[+-]' /tmp/libchist | grep MADV_DONTNEED | grep madvise
+    __madvise (mem, freesize - PTHREAD_STACK_MIN, MADV_DONTNEED);
+      __madvise ((void*) freeblock, freesize, MADV_DONTNEED);
-    __madvise (pd->stackblock, freesize - PTHREAD_STACK_MIN, MADV_DONTNEED);
-      __madvise (freeblock, freesize, MADV_DONTNEED);
+      __madvise (freeblock, freesize, MADV_DONTNEED);
-    __madvise ((char *)h + new_size, diff, MADV_DONTNEED);
+    __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
-		    __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
+                    __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
-    madvise ((char *)h + new_size, diff, MADV_DONTNEED);
+    __madvise ((char *)h + new_size, diff, MADV_DONTNEED);
-		    madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
+		    __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
-    madvise (pd->stackblock, freesize - PTHREAD_STACK_MIN, MADV_DONTNEED);
+    __madvise (pd->stackblock, freesize - PTHREAD_STACK_MIN, MADV_DONTNEED);
+    madvise (pd->stackblock, freesize - PTHREAD_STACK_MIN, MADV_DONTNEED);
+		    madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
-      madvise ((char *)h + new_size, -diff, MADV_DONTNEED);
+    madvise ((char *)h + new_size, diff, MADV_DONTNEED);
+      madvise ((char *)h + new_size, -diff, MADV_DONTNEED);
-      madvise ((char *)h + new_size, -diff, MADV_DONTNEED);
+      madvise ((char *)h + new_size, -diff, MADV_DONTNEED);
-      madvise ((char *)h + new_size, -diff, MADV_DONTNEED);
+      madvise ((char *)h + new_size, -diff, MADV_DONTNEED);

TCMalloc has the option to use either MADV_FREE or MADV_DONTNEED.

lukasa · July 13, 2022, 12:24pm

Nope, I was pointing to Go’s allocator and Chrome’s allocator, both of which adopted and then walked it back.

johannesweiss · July 13, 2022, 12:29pm

Oh, okay I see. So do we think that MADV_FREE is then not relevant for the vast majority of server-side Swift programs because they will have to run on glibc and are likely using glibc's allocator, right?

In other words: If the RSS is bigger than expected then the memory is fully allocated and considered in use by the kernel. As in the allocator deliberately decided to keep hold of the memory (either because it's really in use/leaked or because it's in its pools or just fragmentation) and didn't madvise(..., MADV_FREE) or madvise(..., MADV_DONTNEED) or even munmap it. Agreed?

lukasa · July 13, 2022, 12:31pm

Based on your investigation I believe so, yes.

johannesweiss · July 13, 2022, 12:33pm

Okay, I guess with a copy of /proc/pid/smaps from @taylorswift we can confirm, never know if that version of glibc maybe got patched or something.

David_Smith · July 13, 2022, 5:10pm

Another useful distinction to make is leaked ("unreachable via any live pointer") vs abandoned ("reachable but will never but needed again") memory. People often forget about the latter, and leak detection tools don't directly* identify it, so significant problems can go unnoticed.

*on the Darwin side, Instruments' ObjectAlloc tool has a "snapshot" feature aimed at this, where you record what's live on the heap at various points and then compare deltas to look for growth. But it's still not unambiguous detection like it does for leaks.

hassila · July 13, 2022, 6:22pm

Just a reflection is that in general the memory analytics tools seems very difficult to get close to as good as normal cpu analytics. Usually one wants to answer the on the surface very simple question when confronted with a large memory footprint program: where does it all go? Could I please have a map? But even with instruments/leaks/heaptrack/Valgrind/all Malloc debug libraries, etc, etc - it still truly isn’t easy (especially as the reason can be any of x different things, most which require their own approach to analyze) (huge transient footprint bumping rss but subsequently deallocated, still-live pointers but semantically dead data, classical leaks, implicit allocated memory from types containing types, containing types, ….). Sorry, no useful help, i can express empathy - best advice I can give is to amass a toolbox with these different techniques and tools on multiple platforms and it becomes somewhat manageable.

Then a short question: @johannesweiss you mentioned glibc would always be used, is there any issue with interposing e.g. mimalloc/tcmalloc/jemalloc that I’m missing, or did I misunderstand your comment? I seem to remember having played with interposing mimalloc successfully on Linux with measurable performance improvements when testing swiftnio before.

johannesweiss · July 13, 2022, 8:20pm

Sorry, no, there is no issue, it should just work and it will help in certain scenarios. I just doubt that many people are doing this and if they were, they'd probably mention it.

Agreed, it's not super easy but leaks/heap on macOS go a long way. They understand most of the things you've mentioned. leaks walks all objects to see what's reachable and heap understands how many instances of what type are allocated and how big they are collectively (and on avg). It also tells you the amount of memory fragmentation and sum of allocated bytes (currently malloc'd) vs. heap size (RSS). Unfortunately, we don't have anything comparable to these tools available on Linux yet.

t089 · July 14, 2022, 9:50am

I can share some experience of running a swift app in a container on Kubernetes using libjemalloc. Without libjemalloc the app was quite "memory hungry" and was killed/evicted by the Kubernetes runtime regularly because it exceeded it's configured memory limit. With libjemalloc the app can spike with large memory consumptions when needed but then quickly drops down to a much lower baseline. This way we don't see evictions anymore. As an experiment I disabled libjemalloc last night and enabled again this morning. On the graph below you can see the region where it was disabled in between the two blue/red markers. The graph shows the container_memory_working_set_bytes metric. It is a massive win. The app does on demand image resizing (using Swift wrappers around libvips and libraw), especially RAW files can require quite some memory for a short amount of time, so it is nice to see that with libjemalloc the container memory behaves quite "elastic".

There is some scaling happening in the graph (new pods created / destroyed) and it shows total memory over all instances of the app, but still gives a good impression of the effect.

For reference, once libjemalloc is installed, all you need to do to use it is something like this start.sh

#!/bin/bash

export LD_PRELOAD=/usr/lib/$(uname -p)-linux-gnu/libjemalloc.so.2
exec ./resized 8080

where resized in this case is my Swift app.