I had prototyped a script that processed files one at a time; initially it simply loaded each file into memory with Data(contentsOf:). Then I needed to process large files, so I updated the script to read them in chunks using FileHandle. For example:
import Foundation

let filePath = "/path/to/large/file"
let chunkSize = 1024 * 1024

let url = URL(fileURLWithPath: filePath)
let handle = try FileHandle(forReadingFrom: url)
defer { try? handle.close() }

var totalBytes: UInt64 = 0
while true {
    let data = try handle.read(upToCount: chunkSize) ?? Data()
    if data.isEmpty { break }
    totalBytes += UInt64(data.count)
}
print("Read \(totalBytes) bytes")
But, surprisingly (to me), this didn't work: when processing a 16 GB video file, the script uses up all system memory and freezes every running app. Eventually I gave Codex the go-ahead to drop down to POSIX calls, which solved the problem.
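For reference, here's a minimal sketch of the kind of POSIX loop it ended up with (my reconstruction, not the exact code Codex produced). read(2) fills a caller-owned buffer that gets reused, so nothing accumulates between iterations:

```swift
import Foundation

// Sketch of a POSIX read(2) loop (my reconstruction, not the exact generated
// code). The buffer is caller-owned and reused, so memory stays flat.
let fd = open("/path/to/large/file", O_RDONLY)
precondition(fd >= 0, "open failed: \(String(cString: strerror(errno)))")
defer { close(fd) }

var buffer = [UInt8](repeating: 0, count: 1024 * 1024)
var totalBytes: UInt64 = 0
while true {
    let n = buffer.withUnsafeMutableBytes { read(fd, $0.baseAddress, $0.count) }
    if n <= 0 { break }  // 0 means EOF, -1 means error
    totalBytes += UInt64(n)
}
print("Read \(totalBytes) bytes")
```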
Is this a bug in FileHandle? What's going on here?
Huh, strange: while iterating on this issue I tried autoreleasepool and thought it didn't solve the problem, but now it does, so evidently I was doing something wrong before.
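For anyone landing here later: on Apple platforms read(upToCount:) returns autoreleased Data, and a tight loop never drains the pool, so every chunk stays alive until the loop exits. Wrapping each iteration in autoreleasepool fixes it. A sketch of the original loop with that one change:

```swift
import Foundation

let url = URL(fileURLWithPath: "/path/to/large/file")
let chunkSize = 1024 * 1024
let handle = try FileHandle(forReadingFrom: url)
defer { try? handle.close() }

var totalBytes: UInt64 = 0
var done = false
while !done {
    // Drain the autorelease pool every iteration so each chunk's Data is
    // released before the next read, keeping memory usage flat.
    try autoreleasepool {
        if let data = try handle.read(upToCount: chunkSize), !data.isEmpty {
            totalBytes += UInt64(data.count)
        } else {
            done = true
        }
    }
}
print("Read \(totalBytes) bytes")
```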
The main reason was simply that I didn't know about it, and ChatGPT pointed me to FileHandle when I asked what the standard way of reading a file in chunks is.
But I just tried it out, and my initial reaction to the API shape seems to be confirmed: at least in my usage, it is incredibly slow compared to the FileHandle approach.
I'm computing hashes:
import CryptoKit
import Foundation

func sha256Hash(forFileAt url: URL) async throws -> SHA256Hash {
    var hasher = SHA256()
    for try await byte in url.resourceBytes {
        hasher.update(data: Data(repeating: byte, count: 1))
    }
    return SHA256Hash(rawValue: Data(hasher.finalize()))
}
At first I had written Data([byte]) and gave up on waiting for the script to finish, so I tried Data(repeating: byte, count: 1), thinking it would be at least somewhat more performant. But that's also taking several times longer than the FileHandle approach and seems nowhere near done, so I'll probably abort it too and give up on resourceBytes, despite how convenient it is syntax-wise.
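For comparison, the chunked FileHandle version I'm measuring against looks roughly like this (a sketch; SHA256Hash is my own wrapper type from the snippet above):

```swift
import CryptoKit
import Foundation

// Chunked hashing via FileHandle (a sketch of the approach I'm comparing
// against). Each read is wrapped in autoreleasepool to keep memory flat.
func sha256Hash(forFileAt url: URL, chunkSize: Int = 1024 * 1024) throws -> SHA256Hash {
    let handle = try FileHandle(forReadingFrom: url)
    defer { try? handle.close() }

    var hasher = SHA256()
    var done = false
    while !done {
        try autoreleasepool {
            if let data = try handle.read(upToCount: chunkSize), !data.isEmpty {
                hasher.update(data: data)  // one update per chunk, not per byte
            } else {
                done = true
            }
        }
    }
    return SHA256Hash(rawValue: Data(hasher.finalize()))
}
```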
I take it you're not aware of any similar API that allows me to request chunks of bytes, rather than one byte at a time?
There’s also AsyncBufferedByteIterator from Swift Async Algorithms, but I’m not sure how useful that is for you. It’s a pity that resourceBytes is too slow for your use case, as it’s otherwise a very nice API.
P.S. From my understanding, now that we have ~Copyable types, there’s nothing standing in the way of a chunked version.
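To illustrate, here's a rough sketch of what a chunked API could look like if hand-rolled over FileHandle today (FileChunks is a hypothetical name, not an existing API):

```swift
import Foundation

// Hypothetical chunked async sequence over FileHandle; FileChunks is my own
// name, not a standard API. It yields Data chunks instead of single bytes.
struct FileChunks: AsyncSequence {
    typealias Element = Data
    let url: URL
    var chunkSize = 1024 * 1024

    struct Iterator: AsyncIteratorProtocol {
        let handle: FileHandle
        let chunkSize: Int
        mutating func next() async throws -> Data? {
            try autoreleasepool {
                guard let data = try handle.read(upToCount: chunkSize),
                      !data.isEmpty else { return nil }
                return data
            }
        }
    }

    func makeAsyncIterator() -> Iterator {
        // makeAsyncIterator can't throw, so a fuller implementation would
        // defer the open and surface errors from next(); this sketch just
        // traps if the file can't be opened.
        Iterator(handle: try! FileHandle(forReadingFrom: url), chunkSize: chunkSize)
    }
}

// Usage: for try await chunk in FileChunks(url: url) { hasher.update(data: chunk) }
```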