I have written a Swift interface to the Apache Arrow data format, which is popular for high-performance computing and data science with file types like .arrow, .feather, and .parquet. My library works end-to-end Swift -> CArrow -> file -> CArrow -> Swift. However, it is very memory-inefficient: a dataset that can be loaded from disk as C types in 30 GB of RAM takes >120 GB of RAM to decode into Swift types. This must be remedied for my library to be useful.
First, I'm trying to understand the memory requirements of some basic Swift types. I'm using a helper function to print the current Swift process's memory usage:
import Foundation

#if canImport(Darwin)
enum MachError: Error {
    case FailedToGetMemory(String)
}

/// Returns the resident set size of the current process in bytes, or nil if the Mach call fails.
func getMemoryUsage() -> UInt64? {
    var taskInfo = mach_task_basic_info()
    // task_info expects the count in units of integer_t (4 bytes), not in bytes.
    var count = mach_msg_type_number_t(MemoryLayout<mach_task_basic_info>.size) / 4
    let kerr: kern_return_t = withUnsafeMutablePointer(to: &taskInfo) {
        $0.withMemoryRebound(to: integer_t.self, capacity: 1) {
            task_info(mach_task_self_, task_flavor_t(MACH_TASK_BASIC_INFO), $0, &count)
        }
    }
    if kerr == KERN_SUCCESS {
        return taskInfo.resident_size
    } else {
        return nil
    }
}

/// Formats the resident size for logging, e.g. "827.7 MB".
func getMemoryUsageString() -> String? {
    if let memoryUsage: UInt64 = getMemoryUsage() {
        return ByteCountFormatter().string(fromByteCount: Int64(memoryUsage))
    } else {
        return nil
    }
}
#else
func getMemoryUsageString() -> String? {
    // TODO: Implement this for Linux
    return ""
}
#endif
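(On Linux the helper just returns an empty string for now. A fallback I'm considering for that TODO, sketched below under the assumption that VmRSS from /proc/self/status is a close enough analogue of Mach's resident_size; getMemoryUsageStringLinux is just a placeholder name:)
func getMemoryUsageStringLinux() -> String? {
    // /proc/self/status contains a line like "VmRSS:    123456 kB".
    guard let status = try? String(contentsOfFile: "/proc/self/status", encoding: .utf8) else {
        return nil
    }
    for line in status.split(separator: "\n") where line.hasPrefix("VmRSS:") {
        let fields = line.split(whereSeparator: { $0 == " " || $0 == "\t" })
        if fields.count >= 2, let kiloBytes = Int64(fields[1]) {
            return ByteCountFormatter().string(fromByteCount: kiloBytes * 1024)
        }
    }
    return nil
}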
And here is the code whose memory usage I'm testing:
print(Date(), getMemoryUsageString()!, "Creating random large column values...")
let numRows = 50_000_000
let doublesColumn: [Double] = (0..<numRows).map { Double.random(in: 0.0...Double($0)) }
let intsColumn: [Int] = (0..<numRows).map { Int.random(in: 0...$0) }
print(intsColumn.count, doublesColumn.count)
print(Date(), getMemoryUsageString()!, "Done creating random columns")
let largeColumns: [[BaseArrowArrayElement]] = [doublesColumn, intsColumn]
print(Date(), getMemoryUsageString()!, "Done creating array of arrays")
Where BaseArrowArrayElement is super simple:
public protocol BaseArrowArrayElement: CustomStringConvertible {}
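(Double and Int conform via empty extensions, which I've left out above since they already satisfy CustomStringConvertible; for completeness they look like this:)
extension Double: BaseArrowArrayElement {}
extension Int: BaseArrowArrayElement {}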
Here is the output:
2020-10-19 19:45:49 +0000 27.1 MB Creating random large column values...
50000000 50000000
2020-10-19 19:47:57 +0000 827.7 MB Done creating random columns
2020-10-19 19:48:05 +0000 4.83 GB Done creating array of arrays
50M Doubles and 50M Int(64)s is 2 columns * 8 bytes * 50M values = 800M bytes = 0.8 GB, so the memory usage after creating the arrays is exactly as expected: 827.7 MB (the 0.8 GB of column data on top of the 27.1 MB baseline).
What, then, am I doing that causes it to balloon to 4.83 GB just by putting them into an array of arrays? I was hoping copy-on-write would save me here, or that at worst it would double the size from 0.8 GB to 1.6 GB; instead it's roughly a 6x increase!
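One thing that seems worth checking is how big a single boxed element is compared to a raw Double; a quick check like the following (the numbers in the comments are my assumptions about the 64-bit existential layout, not measured output):
// Size of one raw element vs. one element stored behind the protocol type.
print(MemoryLayout<Double>.stride)                 // 8 bytes per raw Double
print(MemoryLayout<BaseArrowArrayElement>.stride)  // expecting 40 bytes: 3-word inline buffer
                                                   // + type metadata + 1 witness table pointer
// If that is right: 2 columns * 50M elements * 40 bytes ≈ 4 GB of existential boxes,
// plus the original 0.8 GB arrays still alive, which is close to the observed 4.83 GB.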
This is on my macOS 10.15.6 MacBook Pro with Swift for TensorFlow 0.11:
$ which swift
/Library/Developer/Toolchains/swift-tensorflow-RELEASE-0.11.xctoolchain/usr/bin/swift
$ swift --version
Swift version 5.3-dev (LLVM db8896f3f345af2, Swift 61684f62a6132c0)
Target: x86_64-apple-darwin19.6.0
I have a slew of follow-up questions but will try to take this one step at a time. Are there any resources on writing memory-constrained Swift code where safety can be sacrificed in the interest of operating on terabytes of data in as little memory as possible? To make what I mean concrete, I've put a rough sketch of the kind of pattern I'm imagining below. Any pointers will be helpful, thanks!
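Here is that sketch: map a file of raw Doubles and read it in place, never materializing a Swift Array (the path is a placeholder and this is not code from my library; I'm also knowingly skirting the strict typed-memory rules, which is exactly the kind of safety trade-off I'm asking about):
import Foundation
// Force-try and force-unwrap are only to keep the sketch short.
let mapped = try! Data(contentsOf: URL(fileURLWithPath: "/tmp/doubles.bin"),
                       options: .alwaysMapped)
mapped.withUnsafeBytes { (raw: UnsafeRawBufferPointer) in
    let count = raw.count / MemoryLayout<Double>.stride
    // View the mapped bytes directly as Doubles; no copy into a Swift Array is made.
    let doubles = UnsafeBufferPointer(
        start: raw.baseAddress!.assumingMemoryBound(to: Double.self),
        count: count)
    print(Array(doubles.prefix(5)))
}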