FileHandle fails to reduce memory usage

I had prototyped a script that processed files one at a time; initially it simply loaded each file fully into memory with Data(contentsOf:). Then I needed to process large files, so I tried to update the script to read the files in chunks using FileHandle. For example:

import Foundation

let filePath = "/path/to/large/file"
let chunkSize = 1024 * 1024

let url = URL(fileURLWithPath: filePath)
let handle = try FileHandle(forReadingFrom: url)

defer { try? handle.close() }

var totalBytes: UInt64 = 0
while true {
    let data = try handle.read(upToCount: chunkSize) ?? Data()
    if data.isEmpty { break }
    totalBytes += UInt64(data.count)
}

print("Read \(totalBytes) bytes")

But, surprisingly (to me), this didn't work: when processing a 16 GB video file, the script used up all system memory and froze every running app. Eventually I gave Codex the go-ahead to drop down to POSIX calls, which solved the problem.
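For reference, the POSIX fallback can be sketched roughly like this (the helper and error names here are mine, not from the actual script; error handling is simplified). The key difference is that read(2) fills a fixed, reused buffer, so nothing is allocated per chunk:

```swift
import Foundation

// Hedged sketch of the POSIX approach: open(2)/read(2) into a fixed,
// reused buffer, so no autoreleased object is created per chunk.
struct POSIXReadError: Error { let code: Int32 }

func posixByteCount(path: String, chunkSize: Int = 1024 * 1024) throws -> UInt64 {
    let fd = open(path, O_RDONLY)
    guard fd >= 0 else { throw POSIXReadError(code: errno) }
    defer { close(fd) }

    var buffer = [UInt8](repeating: 0, count: chunkSize)
    var total: UInt64 = 0
    while true {
        let n = buffer.withUnsafeMutableBytes { read(fd, $0.baseAddress, chunkSize) }
        if n < 0 { throw POSIXReadError(code: errno) }
        if n == 0 { break } // EOF
        total += UInt64(n)
    }
    return total
}
```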

Is this a bug in FileHandle? What's going on here?

Just curious, any particular reason you didn’t choose url.resourceBytes?

P.S. @tera has an excellent post comparing different methods for reading files: What is the best way to work with the file system? - #18 by tera

2 Likes

One of those cases where you need autoreleasepool { ... }

3 Likes

Huh, strange. While iterating on this issue I did try autoreleasepool, but evidently I was doing something wrong at the time: I thought it didn't solve the issue, but now it does.
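For anyone landing here later, a sketch of the loop from the question with the fix applied (byteCount is a name I made up). Each read happens inside its own pool, so the autoreleased chunk backing each Data is freed every iteration instead of accumulating until the script exits:

```swift
import Foundation

// Sketch: the original byte-counting loop, with each read wrapped in
// autoreleasepool so per-chunk autoreleased storage is drained promptly.
func byteCount(of url: URL, chunkSize: Int = 1024 * 1024) throws -> UInt64 {
    let handle = try FileHandle(forReadingFrom: url)
    defer { try? handle.close() }

    var total: UInt64 = 0
    while true {
        let done = try autoreleasepool { () -> Bool in
            guard let data = try handle.read(upToCount: chunkSize),
                  !data.isEmpty else { return true }
            total += UInt64(data.count)
            return false
        }
        if done { break }
    }
    return total
}
```

With this change, peak memory stays at roughly one chunk rather than growing with the file.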

The main reason was simply that I didn't know about it, and ChatGPT pointed me to FileHandle when I asked what the standard way of reading a file in chunks is.

But I just tried it out, and my initial reaction to the API's shape seems to be confirmed: it is incredibly slow compared to the FileHandle approach, at least in my usage of it.

I'm computing hashes:

func sha256Hash(forFileAt url: URL) async throws -> SHA256Hash {
    var hasher = SHA256()
    for try await byte in url.resourceBytes {
        hasher.update(data: Data(repeating: byte, count: 1)) // allocates a one-byte Data per byte read
    }
    return SHA256Hash(rawValue: Data(hasher.finalize()))
}

At first I had written Data([byte]), and gave up on waiting to see how long the script would ultimately take. I then tried Data(repeating: byte, count: 1), thinking it would be at least somewhat more performant, but that is also taking several times longer than the FileHandle approach and seems nowhere near done. So I'll probably abort it too and give up on resourceBytes, despite how convenient it is, syntax-wise.

I take it you're not aware of any similar API that allows me to request chunks of bytes, rather than one byte at a time?
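In the meantime, what I'm effectively using is a small chunk-streaming helper along these lines (names are my own, not a standard API), combining FileHandle with the autoreleasepool fix from earlier in the thread:

```swift
import Foundation

// Hypothetical helper: streams a file to a closure in Data chunks,
// draining the autorelease pool on each iteration. A hasher's
// update(data:) call would go inside `process`.
func forEachChunk(
    of url: URL,
    chunkSize: Int = 1024 * 1024,
    process: (Data) throws -> Void
) throws {
    let handle = try FileHandle(forReadingFrom: url)
    defer { try? handle.close() }
    while true {
        let done = try autoreleasepool { () -> Bool in
            guard let chunk = try handle.read(upToCount: chunkSize),
                  !chunk.isEmpty else { return true }
            try process(chunk)
            return false
        }
        if done { break }
    }
}
```

The hashing loop then becomes one hasher.update(data: chunk) call per megabyte chunk instead of one per byte.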

This was my solution in swift-mmio: swift-mmio/Tests/SVDTests/FileManager+Hashing.swift at main · apple/swift-mmio · GitHub

See:

  func hashOfFile<H>(
    at url: URL,
    using hashFunction: H.Type = H.self
  ) throws -> H.Digest where H: HashFunction {

It should probably be updated to use Span.

1 Like

There’s also AsyncBufferedByteIterator from Swift Async Algorithms, but I’m not sure how useful that is for you. It’s a pity that resourceBytes is too slow for your use case, as it’s otherwise a very nice API.

P.S. From my understanding, now that we have ~Copyable types, there’s nothing standing in the way of a chunked version.

A simple wrapper which hides autoreleasing under the hood:

extension FileHandle {
    func readFixed(upToCount count: Int) throws -> Data? {
        try autoreleasepool { 
            try read(upToCount: count)
        }
    }
}

I'd love it if some simple (even non-cryptographic) file hash were available in O(1) time from the file system itself.

Pseudocode:

func _writeBlock(newBytes) {
    let oldBytes = _readUnderlyingBlock()
    _writeUnderlyingBlock(newBytes)
    let oldHash = calculatePartialHash(oldBytes)
    let newHash = calculatePartialHash(newBytes)
    fileHash ^= newHash ^ oldHash
}
4 Likes
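The pseudocode above can be modeled in a few lines of runnable Swift. This is only a toy illustration of the scheme, not a real file system; the type and helper names are made up, and FNV-1a stands in for whatever per-block hash such a system might use. Because XOR is self-inverse, removing the old block's contribution and adding the new one keeps the running hash consistent in O(1) per block write:

```swift
import Foundation

// Toy model of an incrementally maintained, non-cryptographic file hash:
// each block contributes its hash via XOR, so overwriting a block only
// requires XOR-ing out the old contribution and XOR-ing in the new one.
struct HashedBlockFile {
    private var blocks: [Int: Data] = [:]
    private(set) var fileHash: UInt64 = 0

    // FNV-1a as a simple stand-in for the per-block hash function.
    private func partialHash(_ bytes: Data) -> UInt64 {
        var h: UInt64 = 0xcbf29ce484222325
        for b in bytes {
            h ^= UInt64(b)
            h = h &* 0x100000001b3
        }
        return h
    }

    mutating func writeBlock(index: Int, newBytes: Data) {
        let oldBytes = blocks[index] ?? Data()
        blocks[index] = newBytes
        // XOR out the old block's hash, XOR in the new one.
        fileHash ^= partialHash(oldBytes) ^ partialHash(newBytes)
    }
}
```

A nice side effect of the XOR construction is that the resulting hash is independent of the order in which blocks were written.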