Swift 5.x and Coroutines

It's weird because Valgrind reports that nothing is amiss at all, yet ps aux still shows a lot of RAM usage.

This matters for us because we run in Docker: ps aux (or whatever underlying OS mechanism produces the numbers ps aux reports) is effectively what Docker uses to decide whether a container is using more RAM than it should, and to kill it if so.
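For cross-checking what ps aux reports from inside the process, here is a Linux-only sketch that reads the resident set size straight out of /proc (the /proc/self/statm layout is a kernel convention, not a Swift API, so this is an assumption about the environment):

```swift
import Foundation
#if canImport(Glibc)
import Glibc
#endif

// Current resident set size in bytes, read from /proc/self/statm
// (Linux only). The file's second field is the resident page count.
func residentBytes() -> Int? {
    guard let statm = try? String(contentsOfFile: "/proc/self/statm", encoding: .utf8) else {
        return nil
    }
    let fields = statm.split(separator: " ")
    guard fields.count > 1, let residentPages = Int(fields[1]) else { return nil }
    return residentPages * Int(sysconf(Int32(_SC_PAGESIZE)))
}

if let rss = residentBytes() {
    print("RSS: \(rss / 1024) KiB")
}
```

Printing this before and after each allocate/free cycle shows the same numbers ps aux would, without shelling out.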

I might try a test that creates a few hundred thousand structs instead of Data and see if it makes any difference.

Yeah, so using plain structs instead of Data can cause it as well.

I modified the main.swift file a bit, with commented-out sections for the three types of test runs, and reduced the timers so it can run fully under Valgrind.

import Foundation

// at least 32 bytes
struct MyAwesomeStruct {
    let num0: Int64 // 8 bytes
    let num1: Int64 // 8 bytes
    let num2: Int64 // 8 bytes
    let num3: Int64 // 8 bytes
}

let semaphore = DispatchSemaphore(value: 0)
for _ in 0..<100 {
    Thread.detachNewThread {
        // structs
        var data: [MyAwesomeStruct]! = [MyAwesomeStruct].init(repeating: MyAwesomeStruct(num0: 0, num1: 1, num2: 2, num3: 3), count: 100_000)
        print("data count: \(data.count)")
        data = nil

//        // data
//        var data: Data! = Data(repeating: 0, count: 10*1024*1024)
//        print("data count: \(data.count)")
//        data = nil

//        // pointer
//        let data = UnsafeMutableRawBufferPointer.allocate(byteCount: 10*1024*1024, alignment: MemoryLayout<UInt8>.alignment)
//        print("data count: \(data.count)")
//        data.deallocate()

        semaphore.signal()
        Thread.sleep(forTimeInterval: 10)
    }
    semaphore.wait()
}

print("Idle now...")
Thread.sleep(forTimeInterval: 15)
print("All threads should be gone...")
Thread.sleep(forTimeInterval: 5)

So small examples are very helpful because they allow us to do really detailed analysis. In the case above we can pretty safely compile, repro, and look at both the ASM and the SIL.

In both cases, the underlying heap allocation for the Data is unequivocally being released. The memory seems to be kept alive by the thread closure itself. If you remove the Thread.sleep at the end, allowing the thread to exit, you'll find memory usage drops back to where you want it to be. So it's not that the data is never freed; it's just not freed when we want it to be.

As a curiosity: what do you see if you allocate slightly more data? Try changing 10*1024*1024 to 100*1024*1024.

If I allocate 100MB it doesn't exhibit the issue.

Interesting, though -- when you say it's not being freed when we want it to be, is that a fault of the compiler, in where it retains and/or releases the memory, or is it the way we've coded this sample?

We can't really let the threads exit just whenever, as we use a thread per TCP connection and each one continues to serve requests until the client closes the connection or it times out.

@lukasa

For what it's worth, I tried using a Thread subclass instead of a closure-based Thread to see if it helps, but it doesn't seem to make a difference. Also, if we allow the thread to exit immediately without the sleep, it sometimes results in less memory usage and sometimes it doesn't; it seems a bit like a race condition determines that.

But that really doesn't make heaps of sense either: these threads do actually exit before the program finishes, so I can't see why, if they sleep first, the memory usage is then "stuck".

import Foundation

// at least 32 bytes
struct MyAwesomeStruct {
    let num0: Int64 // 8 bytes
    let num1: Int64 // 8 bytes
    let num2: Int64 // 8 bytes
    let num3: Int64 // 8 bytes
}

class Executor : Thread {
    override func main() {
        super.main()
        self.dataTest()
    }

    private func structTest() {
        var data: [MyAwesomeStruct]! = [MyAwesomeStruct].init(repeating: MyAwesomeStruct(num0: 0, num1: 1, num2: 2, num3: 3), count: 100_000)
        print("data count: \(data.count)")
        data = nil
    }

    private func dataTest() {
        var data: Data! = Data(repeating: 0, count: 10*1024*1024)
        print("data count: \(data.count)")
        data = nil
    }

    private func rawPointerTest() {
        let data = UnsafeMutableRawBufferPointer.allocate(byteCount: 10*1024*1024, alignment: MemoryLayout<UInt8>.alignment)
        print("data count: \(data.count)")
        data.deallocate()
    }

    deinit {
        print("Executor deinit")
    }
}

for _ in 0..<100 {
    let thread = Executor()
    thread.start()
}
Thread.sleep(forTimeInterval: 5)
print("All threads should be gone...")
Thread.sleep(forTimeInterval: 15)

The fact that allocating larger objects makes the problem go away points at the allocator, not the compiler. If I had to guess I'd say that 10MB allocations are not being served directly from mmap, and the allocator is choosing to keep hold of the allocated pages rather than return them to the OS when the data is freed. Again, I believe the pointer is being freed, the memory just isn't returning to the OS.
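If that guess is right, on Linux you can ask glibc for the retained pages back explicitly. A sketch that mimics the 10 MB Data test and then trims the heap (malloc_trim is a glibc extension, so the #if guard and the behaviour described in the comments are Linux/glibc assumptions):

```swift
import Foundation
#if canImport(Glibc)
import Glibc
#endif

// Mimic the 10 MB Data test from above.
var data: Data! = Data(repeating: 0, count: 10 * 1024 * 1024)
print("data count: \(data.count)")
data = nil

#if canImport(Glibc)
// glibc keeps the freed pages on its heap rather than unmapping them.
// malloc_trim(0) asks it to return whatever it can to the OS; it
// returns 1 if any memory was released, 0 otherwise.
let released = malloc_trim(0)
print("malloc_trim released memory: \(released == 1)")
#endif
```

If RSS (as seen by ps aux) drops after the trim but not after the plain free, that confirms the pages were sitting in the allocator's cache rather than being leaked.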

@lukasa

Right, that makes more sense. Is there any info about how Swift allocates structs/Data? I'd really like to be able to replicate the issue using raw C allocations, because with raw memory pointers via UnsafeMutableRawBufferPointer the RAM is always released back to the OS.
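To take Swift's Data/Array machinery out of the picture entirely, here is a sketch that goes straight through C malloc/free with the same 10 MB size as the tests above (the allocator behaviour described in the comments assumes Linux/glibc):

```swift
#if canImport(Glibc)
import Glibc
#else
import Darwin
#endif

let size = 10 * 1024 * 1024
for _ in 0..<100 {
    // Plain C allocation, bypassing Data/Array entirely.
    guard let p = malloc(size) else { fatalError("malloc failed") }
    // Touch every byte so the kernel actually backs the pages with RAM.
    memset(p, 0, size)
    free(p)
}
print("done")
```

If RSS stays high after this loop too, it's the allocator (not Swift) keeping the pages; if it drops, Swift's allocation path is doing something different, e.g. landing on a different side of the mmap threshold.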

Interestingly, if I swap the memory allocator out for jemalloc the issue goes away entirely. @lukasa thanks heaps for your help here; it looks like it's simply the allocator not releasing the pages back to the OS.

I wonder if my manual allocations via UnsafeMutableRawBufferPointer are not kept because they are not aligned to the memory page size… hard to know.

EDIT:

A bit more education on memory allocators, and this is definitely the reason. I suppose a lot of Swift's allocations are small, and those tend never to be released back to the OS, while larger ones do get released, and the largest allocations are released back immediately, which is why your 100MB test works!
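On glibc that size threshold is even tunable: lowering M_MMAP_THRESHOLD makes medium-sized allocations mmap-backed, so free() unmaps them and RSS drops immediately. A sketch of the idea (the -3 constant is glibc's value for M_MMAP_THRESHOLD from malloc.h, redeclared here on the assumption that the C macro may not be imported into Swift):

```swift
import Foundation
#if canImport(Glibc)
import Glibc

// glibc's M_MMAP_THRESHOLD parameter (-3 in <malloc.h>), redeclared
// here in case the C macro isn't visible from Swift.
let M_MMAP_THRESHOLD_PARAM: Int32 = -3

// Ask glibc to serve any allocation of 1 MB or more directly via mmap,
// so free() returns it to the OS immediately. mallopt returns 1 on success.
if mallopt(M_MMAP_THRESHOLD_PARAM, 1 * 1024 * 1024) != 1 {
    print("mallopt failed")
}
#endif

// With the lower threshold, this 10 MB buffer should be mmap-backed
// and unmapped as soon as it's freed.
var data: Data! = Data(repeating: 0, count: 10 * 1024 * 1024)
print("data count: \(data.count)")
data = nil
```

The flip side is more mmap/munmap syscall traffic, which is exactly the cost glibc's page-caching default is trying to avoid, so this is a trade-off rather than a fix.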
