tl;dr: There are tools to figure out what's going on with memory & RSS. On Linux, `/proc/PID/smaps` is of particular interest, specifically the difference between `Private_Dirty`/`Private_Clean` and `LazyFree`.
To expand here a little bit: application memory can be in a few different states, and whilst everything said above is true, I think it's worth adding that you can actually check what the problem is.
But let's first start with a few states of the memory. If you allocate memory in your application, this will likely go through `malloc(3)` or friends (`posix_memalign`, ...) and hopefully (unless it's in use forever or you have a leak) eventually be freed using `free(3)`. Many people I speak to assume that memory from `malloc` comes straight from the kernel and gets released back to the kernel after `free`. This is not true. `malloc`, `free`, etc. are implemented by an allocator; the default one usually comes with your libc. These are complicated beasts that can be tuned and replaced, which is what @lukasa is alluding to.
Just to be clear, in Swift you usually don't actually call `malloc`/`free` yourself; you typically create instances of classes or data structures that are backed by them (`Array`, `Dictionary`, `String`, ...). Once their reference count drops back to zero, Swift's runtime will free them for you. But the Swift runtime won't keep `malloc`'d memory around, so for the sake of this discussion we can ignore Swift's runtime.
Conceptually, if you (or the Swift runtime) call `malloc`, the allocator will likely check if it already has spare memory that it previously requested from the kernel (usually using `mmap`). If it does, it will assign that memory to you without having to ask the kernel for anything: it will just mark this memory as in use in its data structures and your `malloc` will return it.

Similarly, if you call `free`, the allocator can and will absolutely hold on to that memory, i.e. it will usually not immediately return it to the kernel (or even tell the kernel about that fact). In fact, it often can't return the memory because you can only ever return whole pages (usually 4 kB or 16 kB) to the kernel. But even if you `free` fully page-aligned memory that is a multiple of your page size, the allocator might (and will) still hold onto it. Only if the allocator's heuristics decide that something is worth returning to the kernel will it actually do so, and only then is there a chance of RSS going down.
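To make this concrete, here's a minimal C sketch (Linux-only; the chunk counts/sizes are just for this demo and the exact behaviour depends on your libc and its tuning) that shows RSS typically staying high after `free`:

```c
// Minimal sketch (Linux): RSS usually doesn't drop back after free() because
// the allocator holds on to the memory. Numbers depend on your libc/tuning.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long rss_kb(void) {  // read VmRSS from /proc/self/status
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    while (f && fgets(line, sizeof line, f))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    if (f) fclose(f);
    return kb;
}

int main(void) {
    enum { CHUNKS = 1024, CHUNK_SIZE = 64 * 1024 }; // 64 MiB total, in chunks
    static char *chunks[CHUNKS];                    // small enough to come from
                                                    // the heap, not one big mmap
    printf("at start:    %ld kB\n", rss_kb());
    for (int i = 0; i < CHUNKS; i++) {
        chunks[i] = malloc(CHUNK_SIZE);
        memset(chunks[i], 1, CHUNK_SIZE);           // touch -> counts towards RSS
    }
    printf("allocated:   %ld kB\n", rss_kb());
    for (int i = 0; i < CHUNKS; i += 2)             // free every other chunk:
        free(chunks[i]);                            // freed, but the surviving
                                                    // chunks keep those pages
                                                    // pinned in the allocator
    printf("half freed:  %ld kB\n", rss_kb());      // usually (nearly) unchanged
    return 0;
}
```

The every-other-chunk `free` pattern is deliberate: the freed memory shares pages with still-live allocations, so the allocator can't hand anything back even if it wanted to.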
But, as Cory points out, there are different ways of returning memory to the kernel (there's a small demo sketch right after this list):

- `munmap`: invalidates the page mapping and returns the pages to the kernel immediately. Memory returned using `munmap` will immediately decrease your RSS.
- `madvise(..., MADV_FREE)`: does not invalidate any mappings/pages, it merely tells the kernel that you no longer need the contents of that memory (so the physical pages backing it can be reused by the kernel whenever it wants). But it's important to understand that these mappings are still owned by the application. The application can even still use (read/write) that memory (which will re-dirty it). Memory "returned" using `madvise(..., MADV_FREE)` does not immediately decrease your RSS. It will eventually decrease your RSS if the kernel runs into memory pressure, at which point it actually reuses the physical pages for something else (and you'll likely get a copy-on-write zero page mapped in their place).
- `madvise(..., MADV_DONTNEED)`: like `madvise(..., MADV_FREE)` except that it returns the memory immediately, i.e. your RSS will decrease immediately.
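Here's the demo promised above: a small Linux-only C sketch (it assumes a kernel that has `MADV_FREE`, i.e. 4.5+) that makes the RSS difference between the two `madvise` flavours visible:

```c
// Sketch: mmap a region, dirty it, then hand it back two different ways and
// watch what happens to RSS. Linux-only; MADV_FREE needs kernel >= 4.5.
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static long rss_kb(void) {  // read VmRSS from /proc/self/status
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    while (f && fgets(line, sizeof line, f))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    if (f) fclose(f);
    return kb;
}

int main(void) {
    size_t n = 64 * 1024 * 1024; // 64 MiB
    char *p = mmap(NULL, n, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    memset(p, 1, n);                      // dirty every page
    printf("dirtied:        %ld kB\n", rss_kb());

#ifdef MADV_FREE
    madvise(p, n, MADV_FREE);             // lazy return: RSS typically unchanged
    printf("MADV_FREE:      %ld kB\n", rss_kb());
#endif

    madvise(p, n, MADV_DONTNEED);         // eager return: RSS drops immediately
    printf("MADV_DONTNEED:  %ld kB\n", rss_kb());

    munmap(p, n);                         // mapping is gone entirely now
    return 0;
}
```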
Okay, we've covered a few basics. Let's now go through a few (non-exhaustive) states of memory:
1. Allocated (using `malloc`) and in use: obviously counts towards RSS, you're using it [you, the allocator, and the kernel agree that it's in use].
2. Allocated using `malloc` and leaked: counts towards RSS [you might think it's not in use, but the allocator and the kernel don't know that].
3. Not allocated using `malloc` but either in the allocator's pools or sharing a page with something that's allocated: counts towards RSS [you think it's freed, the allocator thinks it's freed, the kernel thinks it's allocated].
4. Not allocated & returned using `madvise(..., MADV_FREE)`: counts towards RSS until the kernel has actually reused that memory [you think it's freed, the allocator knows it's freed, the kernel knows that you don't need the data (so it can just drop the physical memory pages) but your application still has the mappings]. Essentially, you no longer control your exact RSS after `MADV_FREE` because the kernel can reduce it whenever it wants, which is usually when it's under memory pressure.
5. Not allocated & returned using `madvise(..., MADV_DONTNEED)`: does not count towards your RSS [you think it's freed, the allocator knows it's freed, the kernel knows that you don't need the data and has unmapped the physical pages]. You still get to keep the virtual memory mappings though.
6. Not allocated & returned using `munmap`: does not count towards RSS [you, the allocator, and the kernel know this isn't allocated]. If your application were to touch the memory after `munmap`ing it, it would crash (`SIGSEGV`).
And now, let's look into how we can figure out how much memory in your application is in what state.
If you want the kernel's understanding of (1), (2) and (3), you can `grep ^Private_Dirty: /proc/YOUR_PID/smaps | grep -v ' 0 kB'` and see all mappings that are "private" (not shared with other processes) & "dirty" (you actually used this memory, so the kernel can't just forget the contents).
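For reference, a single `smaps` entry looks roughly like this (abbreviated, and the values here are made up for illustration):

```
7f2b4c000000-7f2b50000000 rw-p 00000000 00:00 0
Size:              65536 kB
Rss:               65536 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:     65536 kB
...
LazyFree:              0 kB
```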
Unfortunately, we can't easily tell (1) and (2) apart without introspecting the actual data of your address space. `valgrind`/`heaptrack`/... can do this partially for Swift, and `heap`/`leaks` on macOS can do this pretty well.
(3) is actually quite similar because the kernel is fully oblivious to it. The only way to get information about memory that's allocated & in use vs. memory that's allocated & in the allocator's pools/fragmentation is from the allocator itself. On Linux, `mallinfo`/`mallinfo2` can help here; on macOS it's `heap`.
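For example, with glibc a quick sketch could look like this (assumes glibc 2.33+ for `mallinfo2(3)`; if I remember the man page correctly, it mostly reports on the main arena only):

```c
// Rough sketch using glibc's mallinfo2(3) to see how much of the heap is in
// use vs. sitting in the allocator's free lists (glibc >= 2.33, main arena).
#include <malloc.h>
#include <stdio.h>

int main(void) {
    struct mallinfo2 mi = mallinfo2();
    printf("total heap (arena):       %zu bytes\n", mi.arena);
    printf("in use (uordblks):        %zu bytes\n", mi.uordblks);
    printf("free in allocator pools:  %zu bytes\n", mi.fordblks);
    return 0;
}
```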
(4) `MADV_FREE`'d pages that haven't been reclaimed are fortunately quite easy to spot on Linux because the kernel knows about them. You can look into `/proc/YOUR_PID/smaps` and check all entries that have a non-zero `LazyFree` value: `grep ^LazyFree: /proc/YOUR_PID/smaps`. Each line that appears is something that your allocator allocated using `mmap` earlier and then told the kernel it no longer needs using `madvise(..., MADV_FREE)`. Over time (when the kernel actually reclaims those pages), they become like (5).
(5) `MADV_DONTNEED`'d pages (and `MADV_FREE`'d ones that have been reclaimed by the kernel) are a little bit harder to spot, but they don't increase your RSS, so maybe you can just ignore them. If you're curious, those would appear as mappings that have a size (e.g. `Size: 132 kB`) but show `Private_Dirty`, `Private_Clean`, `Shared_Dirty`, and `Shared_Clean` all as `0 kB`. That means that whilst you still own the virtual memory mappings, no physical pages are mapped to them. You can read/write them but you'll suffer page faults.
All the memory in (6) (`munmap`'d memory) doesn't appear in `/proc/YOUR_PID/smaps` at all and is just like memory that never got mapped.
Lastly, it's worth noting that not all allocators even use `MADV_FREE`, and sometimes you can configure whether you want them to use `MADV_FREE` or not. In general, `MADV_FREE` is a good idea because it makes reclaiming previously "returned" memory much cheaper in certain cases. For example, if the kernel hasn't gotten around to actually unmapping the physical pages from your `MADV_FREE`'d memory, then you can just reuse it without doing a syscall or suffering a page fault.
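As a sketch of what that cheap reuse looks like (assuming `MADV_FREE` is available on your kernel):

```c
// Sketch: after MADV_FREE, the pages can simply be written to again. If the
// kernel hasn't reclaimed them yet, no syscall or page fault is needed; the
// write just re-dirties the old physical pages and cancels the MADV_FREE.
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t n = 1 << 20; // 1 MiB
    char *p = mmap(NULL, n, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;
    memset(p, 1, n);                 // dirty the pages
#ifdef MADV_FREE
    madvise(p, n, MADV_FREE);        // lazily hand them back to the kernel
    memset(p, 2, n);                 // re-dirty: reuses any pages the kernel
                                     // hadn't reclaimed yet, no syscall needed
#endif
    munmap(p, n);
    return 0;
}
```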
But if you use a resource-limiting system that limits based on your RSS, then `MADV_FREE` might be an issue for you.
Especially with containers, `MADV_FREE` could be an issue: if your host kernel has access to loads of memory but your container is limited by RSS, the kernel itself may never really run into memory pressure, so it might never actually reclaim the `MADV_FREE`'d pages, yet you still "exhaust" your resources because your RSS stays high.
@taylorswift if you share a copy of your `/proc/YOUR_PID/smaps` in the bad state, I'm happy to have a look too.