Swift 5.x and Coroutines

Hi,

Not really sure where else to put this. We use Swift on the server with a coroutine library called libdill. This has worked wonderfully for the past 5 years, but since migrating to the Swift 5.x toolchain we've begun to see some major memory leaks.

It's hard to give an exact example of the memory leak, but it's related to non-escaping closures:

var jsonEnum = ...
return JSONSerializer { writable, deadline in 
  try writable.write(jsonEnum)
}

In our case we have a large enum (decoded JSON) that is passed into a non-escaping closure, which is used to write to a socket.

If the socket connection drops and the closure throws, the entire enum leaks.

This never occurred in Swift 4.2, so I was curious whether we're missing something at the level of processor state when our coroutine library saves/restores the stack?

I know it's a long shot -- I'll try to put together a sample project.

Cheers,
Robert

So some additional info that may be helpful to anyone:

We use

#define dill_setjmp(ctx) sigsetjmp(ctx, 0)
#define dill_longjmp(ctx) siglongjmp(ctx, 1)

to save and restore the stack. I'm wondering if something is playing funny with stack unwinding, resulting in certain destructors potentially not being called?

When you say "the closure throws", does that cause your scheduler to drop the associated coroutine context entirely? None of the setjmp/longjmp variations trigger stack unwinding on Darwin, and even if they did, they would never cause Swift cleanups to occur. If you abandon a Swift context, then it is expected that any resources it held would leak. That has always been the case, though, and I wouldn't expect a Swift compiler update to change that. Is it possible to test your server with 1:1 threading to see if the leak still occurs independent of your use of coroutines? There may simply be a compiler bug at play causing the leak.

To emphasise what @Joe_Groff is saying here: libdill is unsafe to mix with Swift code, and always has been. Swift does not expect you to perform arbitrary setjmp/longjmp calls and if you do, things may fail to behave the way you expect.

Unlike Joe I am not surprised a compiler update has triggered this: compiler updates have a habit of moving swift_release calls around, and if one got moved the wrong side of a coroutine jump then bad things could well happen here.

Even if this does happen to be a compiler bug causing your leak, libdill still won’t be safe to use from Swift.

To be fair, while libdill does jump for the coroutines, it always allows the routine to finish. I've seen no evidence that what comes after "sleeping for IO" is lost and not run.

We've been using libdill in production with server-side Swift for about 3-4 years, so I guess we were lucky that it caused no issues until Swift 5.3?

I'm not a compiler engineer, but I can't conceptually see what sigsetjmp followed by siglongjmp wouldn't restore that Swift needs to keep track of memory allocations. Hence I was curious whether there are any changes in Swift 5.x that these two functions might miss.

Anyway, long term we were planning to migrate off libdill, so this may just accelerate our timeframe.

Using setjmp/longjmp as control flow within Swift code is problematic, but using it as a below-the-fold implementation detail to switch Swift contexts ought to be fine, as long as you ensure those contexts eventually resume exactly once, and you aren't violating any OS-level assumptions about where the stack is. Swift shouldn't make the latter issue any worse than it is for plain C. If you're starting to see leaks, I suspect this is either a real compiler bug independent of the use of coroutines, or a bug in the scheduling of coroutines that is causing a context to be abandoned. If there's a straightforward way to test the same code in a 1:1 model without the use of coroutines, that would be good to see whether coroutines are even relevant to the problem.

Hey Joe,

I'm going to try to get a sample project that exhibits the error here. We're definitely not calling setjmp/longjmp within Swift code; all of the coroutine stuff is managed in C, and the contexts are always resumed exactly once after being paused (no context is ever abandoned).

Again, up until Swift 5.3 we had no issues, and if we use the same code with threads instead of coroutines we don't see the issue either. Hence I was curious whether Swift 5.3 uses any registers or anything specific that sigsetjmp/siglongjmp might not snapshot.

The weird thing about the leak is that it seems to stabilise based on the peak number of concurrent TCP connections to our server. I know this sounds like an error in our application code, but again, if we switch to threads there's no issue at all, and the exact same code (with the exception of required upgrade changes) on Swift 4.2 doesn't exhibit the issue.

I need to pull out some more memory debugging tools and really home in on what is actually leaking. I'll come back to this thread once I do.

Swift's calling convention does use some nontraditional registers for the self argument, and for returning errors, but the registers it uses are callee-preserved in the standard C calling convention so that normal C calls do not disturb them. The change over to the Swift calling convention happened back in the 4.x era, so I wouldn't expect it to have changed as recently as 5.3.

Hey @Joe_Groff @lukasa

I managed to strip this back and discovered this bug has nothing to do with coroutines at all. It appears that on Swift for Linux there are some ARC/memory issues when using while loops that run indefinitely.

On Darwin we can use an autoreleasepool to force the memory to be released, but on Linux there doesn't seem to be anything we can use to force that?

I stripped our project back to the bare minimum, using threads and blocking I/O with as little code as possible.

Reproduction steps (Ubuntu 18.04 server)

  1. Compile with swift build --configuration=release
  2. Run ./runner.sh
  3. In another terminal on the server run ps aux | grep server and take note of memory usage
  4. On another machine use wrk to benchmark the endpoint wrk -c 10 -t 1 -d 10s http://<ip address>:3000
  5. Wait until wrk finishes and then a little bit of time, check logs to ensure all tcp connections are closed
  6. Check memory usage again using ps aux | grep server

Expected results:

Memory usage should be similar to when starting

Actual results:

Memory usage is much higher; it peaks based on the number of concurrent connections (the -c 10 flag in wrk) ever connected. Sometimes it will drop back down, but most times it stays at the peak.


In my observations, if you re-run wrk with the same connection limit, the peak usage doesn't change. By triggering the loops again, the memory is reclaimed.

I don't believe this is strictly related to Data objects. In our actual server code all of our reference-type objects were appropriately deinit'd, but value types seemed to be causing the growth.

You can download the sample code here:

It would be very valuable to try to identify whether the objects in question are leaked or whether the memory was used by the process and never released back to the OS. Memory fragmentation limits the ability of the allocator to return memory to the OS, and so in some cases you will see memory usage behave this way: it will jump up under load, and then settle somewhat but your process will appear to still be consuming lots of memory. Internally the allocator has plenty of space to allocate, but it's fragmented and so the allocator cannot return pages to the OS.
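One way to tell the two cases apart on Linux: glibc's malloc_trim asks the allocator to return any free pages it is caching back to the OS. If RSS drops after calling it, the memory wasn't leaked, merely retained by the allocator. A rough C sketch, assuming a glibc system (malloc_trim is glibc-specific):

```c
#include <malloc.h>   /* malloc_trim -- glibc-specific */
#include <stdlib.h>

/* Allocate and free a burst of heap memory, then ask glibc to return
 * free pages to the OS. Returns malloc_trim's result: nonzero if any
 * memory was actually released. */
static int trim_demo(void) {
    enum { N = 1024 };
    char *blocks[N];
    for (int i = 0; i < N; i++)
        blocks[i] = malloc(64 * 1024);   /* ~64 MiB, below the mmap threshold */
    for (int i = 0; i < N; i++)
        free(blocks[i]);
    /* Without this call, glibc may keep the freed pages cached in its
     * arenas, so ps/RSS still charges them to the process. */
    return malloc_trim(0);
}
```

If RSS stays high even after a malloc_trim(0), the memory is genuinely still referenced (or leaked) rather than merely cached by the allocator.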

Tools like valgrind or the address sanitiser should be capable of detecting true leaks. It may be worth running those tools to get a better idea of what's going on.

Hi Lukasa,

As far as I can tell, it's Swift structures not being released when expected. If you swap Data out for an UnsafeMutableRawBufferPointer with a deferred deallocation, you see no memory growth.

Again, what happens is that because we have an endless loop, the memory used by the previous request isn't deallocated until the loop "comes around" for the next request. It's not permanently lost, but it does mean that if we get a burst of high-load traffic, the process crashes due to this "phantom" memory being in play.

The only thing Valgrind reports is a lot of 8-byte unsafe accesses around swift_retain and swift_release calls.

I can't seem to find any way to "fool" the compiler into properly releasing the Data object back to the OS, and unfortunately I don't think it's strictly related to Data internals: in our production app we see this growth happening after our JSON parsing, which results in a large number of enum/dictionary/array allocations.

Is there anything else I can do to try and inspect where the Swift compiler is putting in retain/release calls?

Sure, you can disassemble the binary, or you can hook the function calls.

What do you mean by "a deferred deallocation"? Are you slicing Data objects and storing them anywhere?

I'm not the greatest with a disassembler but I'll give it a go.

This code shows the memory growth that isn't released back until next iteration of the while loop:

while true {
    let request = socket_read_request() // blocking IO
    var body: Data! = try! Data(contentsOf: URL(<10 mb file>))
    socket_write_data(body) // blocking IO
    body = nil
}

At the point the loop is waiting for socket IO for the next request you'd expect the body from the previous request to be deallocated but that is not the case.

However if you use a raw pointer with deferred deallocation:

while true {
    let request = socket_read_request() // blocking IO
    let body = UnsafeMutableRawBufferPointer.allocate(byteCount: 10*1024*1024, alignment: MemoryLayout<UInt8>.alignment)
    defer { body.deallocate() }
    socket_write_data(body) // blocking IO
}

The memory is freed immediately before the process idles waiting for the next request.

Can you try allocating a large repeated empty data, something like Data(repeating: 0, count: 10 * 1024 * 1024)? Data(contentsOf:) is a huge, complex method that does an enormous number of allocations, and it's quite unlike allocating and freeing a pointer.

Same issue using Data(repeating: 0, count: 10*1024*1024)

That's very surprising. I'd be very curious if socket_write_data is storing a slice of the Data anywhere.

You can check the sample code I posted; it's extremely small (4 files), as I tried to reduce a working example to its simplest parts.

The Data object is used with withUnsafeBytes to write to the raw socket using the C APIs.

@lukasa Actually managed to simplify this even further:

func main() throws {
    let semaphore = DispatchSemaphore(value: 0)
    for _ in 0..<100 {
        Thread.detachNewThread {
            var data: Data! = Data(repeating: 0, count: 10*1024*1024)
//            let data = UnsafeMutableRawBufferPointer.allocate(byteCount: 10*1024*1024, alignment: MemoryLayout<UInt8>.alignment)
            print("data count: \(data.count)")
            data = nil
//            data.deallocate()
            semaphore.signal()
            Thread.sleep(forTimeInterval: 600)
        }
        semaphore.wait()
    }

    print("Idle now...")
    Thread.sleep(forTimeInterval: 40)
    print("Exiting")
}
try! main()

If you swap between the Data and UnsafeMutableRawBufferPointer allocation methods, you can see the difference in memory usage during the process's idle phase.

With raw pointers, ps aux shows no memory usage beyond baseline, while with Data it continues to show several hundred megabytes.

And to confirm this is running on Linux, right? I just want to be absolutely sure we're on the same page.

Yup, on Linux -- Ubuntu 18.04 server with the Swift 5.3 toolchain.
