Memory leaking in Vapor app


(Ilja Iwas) #1

I recently complained about memory leaks in our Vapor 3 app, that we are unable to overcome:

@tomerd suggested I come here to ask for help. We extracted an example request into a demo project, and provided a script that triggers the "leak", even though Xcode doesn't show any leaks:

Our assumption is that it has to do something with how threads are handled in SwiftNIO or Vapor. Maybe somebody more knowledgable than us feels like looking into this.


(Tanner) #2

I've done some digging into the sample app @iljaiwas provided. (Thanks for putting that together!)

macOS, no leaks

From what I can tell memory usage looks normal on macOS, with no leaks. The memory usage graph does make it seem like some memory is leaking, but I believe this is due to Foundation's internal caching for JSON serialization and Date components. The memory graph debugger and leak debugger show no new allocations (after enough requests to load all lazy services in).

Linux, leaking

On Linux however, it's a much different story. The first request takes the application from 2.9MiB in the System Monitor to 374.2MiB. From what I can tell, that memory is never released.

I ran valgrind on the executable using the following command:

valgrind --leak-check=full .build/release/Run

It was able to detect 434,075,912 bytes "definitely lost". Here is the output for that loss record:

==30059== 434,075,912 (719,640 direct, 433,356,272 indirect) bytes in 9,995 blocks are definitely lost in loss record 1,218 of 1,218
==30059==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==30059==    by 0x51F1C91: swift_slowAlloc (in /home/tanner/swift/swift-5.0-RELEASE-ubuntu18.04/usr/lib/swift/linux/libswiftCore.so)
==30059==    by 0x51F1CFE: _swift_allocObject_(swift::TargetHeapMetadata<swift::InProcess> const*, unsigned long, unsigned long) (in /home/tanner/swift/swift-5.0-RELEASE-ubuntu18.04/usr/lib/swift/linux/libswiftCore.so)
==30059==    by 0x3FCFC9: $s10Foundation4DataV11bytesNoCopy5count11deallocatorACSv_SiAC11DeallocatorOtcfCTf4nngd_n (in /home/tanner/dev/sandbox/vapor-memory-leak/.build/x86_64-unknown-linux/release/Run)
==30059==    by 0x3FCB8A: $s3NIO10ByteBufferV19NIOFoundationCompatE7getData2at6length10Foundation0G0VSgSi_SitF (in /home/tanner/dev/sandbox/vapor-memory-leak/.build/x86_64-unknown-linux/release/Run)
==30059==    by 0x3FCA80: $s3NIO10ByteBufferV19NIOFoundationCompatE8readData6length10Foundation0G0VSgSi_tF (in /home/tanner/dev/sandbox/vapor-memory-leak/.build/x86_64-unknown-linux/release/Run)
==30059==    by 0x3249F6: $s5MySQL0A21SQLBinaryResultsetRowV5bytes7columnsAC3NIO10ByteBufferVz_SayAA0A21SQLColumnDefinition41VGtKcfCTf4nnd_n (in /home/tanner/dev/sandbox/vapor-memory-leak/.build/x86_64-unknown-linux/release/Run)
==30059==    by 0x31EEB2: $s5MySQL0A16SQLPacketDecoderC23decodeStatementProtocol3ctx6buffer14statementState12capabilities3NIO08DecodingK0OAI21ChannelHandlerContextC_AI10ByteBufferVzAA0a12SQLStatementgK0OAA0A15SQLCapabilitiesVtKF (in /home/tanner/dev/sandbox/vapor-memory-leak/.build/x86_64-unknown-linux/release/Run)
==30059==    by 0x31BCC5: $s5MySQL0A16SQLPacketDecoderC6decode3ctx6buffer3NIO13DecodingStateOAG21ChannelHandlerContextC_AG10ByteBufferVztKF (in /home/tanner/dev/sandbox/vapor-memory-leak/.build/x86_64-unknown-linux/release/Run)
==30059==    by 0x3202F8: $s5MySQL0A16SQLPacketDecoderC3NIO013ByteToMessageD0AadEP6decode3ctx6bufferAD13DecodingStateOAD21ChannelHandlerContextC_AD0F6BufferVztKFTW (in /home/tanner/dev/sandbox/vapor-memory-leak/.build/x86_64-unknown-linux/release/Run)
==30059==    by 0x38CA08: $sTA (in /home/tanner/dev/sandbox/vapor-memory-leak/.build/x86_64-unknown-linux/release/Run)
==30059==    by 0x38CA58: $sTA.15 (in /home/tanner/dev/sandbox/vapor-memory-leak/.build/x86_64-unknown-linux/release/Run)

And the final leak summary:

==30059== LEAK SUMMARY:
==30059==    definitely lost: 720,720 bytes in 9,999 blocks
==30059==    indirectly lost: 433,356,272 bytes in 29,836 blocks
==30059==      possibly lost: 7,782,952 bytes in 171 blocks
==30059==    still reachable: 26,334,436 bytes in 182,329 blocks
==30059==         suppressed: 0 bytes in 0 blocks

From what I can tell, the leak seems related to Foundation.Data.init(bytesNoCopy:count:deallocator:), though I'm not sure how or why. Any additional insight would be greatly appreciated, thanks!


(Simon Evans) #3

Which version of Swift was this tested with? There was quite a large reworking of Data in Swift 5 in swift-corelibs-foundation so if the above is from 4.2 it may be worth testing against 5.0 as well.


(Ilja Iwas) #4

We initially discovered the issue on Linux with Swift 4.2, but @tanner0101 seems to be also running into it with Swift 5:

....(in /home/tanner/swift/swift-5.0-RELEASE-ubuntu18.04/usr/lib/swift/linux/libswiftCore.so)

(Joe Groff) #5

cc @Philippe_Hausler. Thanks for the analysis, Tanner!


(Gábor Sebestyén) #6

Just ran the test twice on my Arch Linux with Swift 5. I got the following summary back from valgrind:

==31918== LEAK SUMMARY:
==31918==    definitely lost: 1,144 bytes in 6 blocks
==31918==    indirectly lost: 0 bytes in 0 blocks
==31918==      possibly lost: 86,960 bytes in 20 blocks
==31918==    still reachable: 3,946,396 bytes in 1,848 blocks
==31918==         suppressed: 0 bytes in 0 blocks

According to this I see no sigificant leak.
I was using the following command:

valgrind --leak-check=full \
  --show-leak-kinds=all \
  --track-origins=yes \
  --verbose \
  --log-file=valgrind-out.txt \
./.build/release/Run

(John Holdsworth) #7

There was a leak in foundation due to a retain cycle but that was fixed over a year ago. I’ve just checked and it is still fixed in the 5.0 release for Linux.

Seems like you have the stack trace. It should be possible to box this in:

swift_slowAlloc
_swift_allocObject_(swift::TargetHeapMetadata<swift::InProcess>
function signature specialization <Arg[2] = Owned To Guaranteed, Arg[3] = Dead> of Foundation.Data.init(bytesNoCopy: Swift.UnsafeMutableRawPointer, count: Swift.Int, deallocator: Foundation.Data.Deallocator) -> Foundation.Data
(extension in NIOFoundationCompat):NIO.ByteBuffer.getData(at: Swift.Int, length: Swift.Int) -> Foundation.Data?
(extension in NIOFoundationCompat):NIO.ByteBuffer.readData(length: Swift.Int) -> Foundation.Data?
function signature specialization <Arg[2] = Dead> of MySQL.MySQLBinaryResultsetRow.init(bytes: inout NIO.ByteBuffer, columns: [MySQL.MySQLColumnDefinition41]) throws -> MySQL.MySQLBinaryResultsetRow
MySQL.MySQLPacketDecoder.decodeStatementProtocol(ctx: NIO.ChannelHandlerContext, buffer: inout NIO.ByteBuffer, statementState: MySQL.MySQLStatementProtocolState, capabilities: MySQL.MySQLCapabilities) throws -> NIO.DecodingState
MySQL.MySQLPacketDecoder.decode(ctx: NIO.ChannelHandlerContext, buffer: inout NIO.ByteBuffer) throws -> NIO.DecodingState
protocol witness for NIO.ByteToMessageDecoder.decode(ctx: NIO.ChannelHandlerContext, buffer: inout NIO.ByteBuffer) throws -> NIO.DecodingState in conformance MySQL.MySQLPacketDecoder : NIO.ByteToMessageDecoder in MySQL

Somewhere in the MySQL driver? - that's complete speculation


(Tanner) #8

The confusing part is that this only happens on Linux and the MySQL driver (vapor/mysql) is the same code for both platforms. There's only one case of #if os in the package, and we already tested to make sure it's not related.


(Johannes Weiss) #9

Just a quick one: Foundation.Data.init(bytesNoCopy:...) calling malloc is expected. Data is backed by class __DataStorage. So creating a Data will still allocate the class instance for the backing store:

    @inlinable // This is @inlinable as a trivial initializer.
    public init(bytesNoCopy bytes: UnsafeMutableRawPointer, count: Int, deallocator: Deallocator) {
        let whichDeallocator = deallocator._deallocator
        if count == 0 {
            deallocator._deallocator(bytes, count)
            _representation = .empty
        } else {
            _representation = _Representation(__DataStorage(bytes: bytes, length: count, copy: false, deallocator: whichDeallocator, offset: 0), count: count)
        }
    }

(link to source)

What's definitely unexpected is that it never seems to be deallocated on Linux. It seems that something is holding on to the Data or the __DataStorage. I'll try to find time later to look into this.


(John Holdsworth) #10

Seems like your best bet is to try to replicate the leak under instruments and bring the full macOS tooling to bear. If you clone swift-corelibs-foundation and swift-corelibs-xctest you should be able to build the “Linux” Foundation for macOS (SwiftFoundation) and “-F" it into the vapor project renaming the imports. There are quite a few problems which I’ve not been able to sort out involving entities defined in both macOS Swift and Linux’s SwiftFoundation but this is one route you could try.


(Ian Partridge) #11

We (IBM) have fixed a leak in NSData before ( https://github.com/apple/swift-corelibs-foundation/pull/1455 ), but this one looks different. @weissi - let us know what you find out - if we can help we will.


#12

On my service I have the same issue on Linux as @iljaiwas and from time to time I have even a crash like this (Ubuntu 16.04 with Swift 4.2) :

*** Error in `./myservice': double free or corruption (fasttop): 0x00007f2418074730 ***,
======= Backtrace: =========,
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f243c5087e5],
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f243c51137a],
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f243c51553c],
/usr/lib/libswiftCore.so(+0x3c909b)[0x7f243ec6709b],
./myservice(+0xe5bef)[0x55d19d8cebef],
./myservice(+0xe5686)[0x55d19d8ce686],
./myservice(+0x12ca56)[0x55d19d915a56],
./myservice(+0x119a05)[0x55d19d902a05],
./myservice(+0x119a44)[0x55d19d902a44],
/usr/lib/libdispatch.so(+0x42ca1)[0x7f243ef78ca1],
/usr/lib/libdispatch.so(+0x3b122)[0x7f243ef71122],
/usr/lib/libdispatch.so(+0x42d0d)[0x7f243ef78d0d],
/usr/lib/libdispatch.so(+0x42ce0)[0x7f243ef78ce0],
/usr/lib/libdispatch.so(+0x40131)[0x7f243ef76131],
/usr/lib/libdispatch.so(+0x33c0e)[0x7f243ef69c0e],
/usr/lib/libdispatch.so(+0x521fb)[0x7f243ef881fb],
/usr/lib/libdispatch.so($S8Dispatch0A5QueueC4sync7executeyyyXE_tF+0x9e)[0x7f243ef7353e],
/usr/lib/libdispatch.so(+0x3fa81)[0x7f243ef75a81],
/usr/lib/libdispatch.so(+0x3afcb)[0x7f243ef70fcb],
/usr/lib/libdispatch.so($S8Dispatch0A5QueueC4sync7executexxyKXE_tKlF+0xae)[0x7f243ef7148e],
./myservice(+0x1164a2)[0x55d19d8ff4a2],
./myservice(+0xeada2)[0x55d19d8d3da2],
./myservice(+0x16c704)[0x55d19d955704],
./myservice(+0x16c190)[0x55d19d955190],
./myservice(+0x1346af)[0x55d19d91d6af],
./myservice(+0x16c0c1)[0x55d19d9550c1],
....
Aborted (core dumped),

It seems to happen in Dispatch and especially in this method

/usr/lib/libdispatch.so($S8Dispatch0A5QueueC4sync7executeyyyXE_tF+0x9e)[0x7f243ef7353e]

Hoping it can help


(Johannes Weiss) #13

I haven't had chance yet (and today is pretty busy) to look into this but

At least the crash part is different issue, we're looking at heap corruption (possibly due to a double free) here. Can you reproduce this? If yes, could you run this under lldb or even better valgrind?

If you can't reproduce. Would you mind trying to run this with address sanitizer? Running under address sanitizer can be done the following way:

swift build --sanitize=address

and then run the resulting binary.


(Helge Heß) #14

Shouldn't those things get moved into some GitHub issue or Jira?
I mean we can leave this open for reference, but the analysis should go into a proper bug tracker?


(Joannis Orlandos) #15

I've looked into it and the culprit seems to be this https://github.com/vapor/mysql/blob/3/Sources/MySQL/Utilities/ByteBuffer%2BMySQL.swift#L84

I'm diving into NIO now to see why that causes a memory leak.


(Johannes Weiss) #16

Are you sure this is causing the leak and isn't just the thing that is leaked?

If you think this line is causing the leak, you should be able to reproduce this without Vapor, MySQL, ... in a tiny program of just a couple of lines, right?


(Joannis Orlandos) #17

It's pretty odd, so I got this leak only one time. But when I did it filled my memory up with 4GB of data. But then another leak with Foundation.Data popped up which originated from byteBuffer.getData. I'm not sure why they're not deallocated yet.


(Joannis Orlandos) #18

Just to be sure people don't misunderstand me, I'm not suspecting NIO as the origin of the bug. But I'm very sure that it's got to do with the MySQLPacketDecoder + ByteBuffer.


(Johannes Weiss) #19

I think that's along the lines of what @tanner0101 was seeing too. The specialty of byteBuffer.getData is that is doesn't copy the bytes but creates a Data backed by a ByteBuffer. I think Tanner ruled out that it might be some special code path in Data that might not be tested that well by replacing it with something that copies the bytes Data(byteBuffer.viewBytes(at: byteBuffer.readerIndex, length: whatever).

Oh, I'd be very happy for the bug to be in NIO because then we could enhance our allocation counting test suite with a test that reproduces that precise scenario and make sure it never happens again.

And thanks so much for looking into this! I don't seem to be able to find enough time to help with this today unfortunately...


(Tim) #20

Either way a test should be added wherever the bug is to stop a regression!