Difficulties With Efficient Large File Parsing

JCurtisRMIT · January 19, 2021, 4:38am

I’ve been looking for profiling on these kinds of tasks this week - this thread is great, thanks!

I’m essentially trying to do a similar task, but I’m parsing into 2D arrays (Array<Array<Float>>) because I need to preserve the order of the lines and access them by index (~500,000 indices). Each line in my txt file contains 6 floats at double precision, but I’m dropping the precision to try to speed up the process.

These are immutable, my app doesn’t need write access to them, just read access. They only need to be parsed and stored once - which I’ve been doing at run time after the first build into Core Data. I’m getting out of memory crashes, so I’m certain this isn’t the way to approach it.

How should large immutable 2D arrays be persisted in Swift? Some info suggests that Core Data isn’t the correct choice and that txt files should be used (not plists), but my load times are 120 seconds per file in some cases. There are dozens of these files.

Should I reexport the parsed data into a format that could be “literal eval”ed somehow? Hard to find a clear answer on this. Thanks!

TellowKrinkle · January 19, 2021, 5:51am

If you know your arrays will always be arrays of exactly 6 floats, you could store them on disk as a flat array of native-endian floats and mmap the file. Then you will only use the RAM associated with the rows you're actually accessing, as the OS can unload data whenever it wants.
Then write a wrapper class that treats the mmap'd pointer as an array of floats:

import Foundation

class FileBackedFloatArray {
	let ptr: UnsafeBufferPointer<Float>

	init(file: URL) throws {
		let handle = try FileHandle(forReadingFrom: file)
		// Seek to the end to learn the file's length in bytes.
		let length = Int(handle.seekToEndOfFile())
		// handleErrors is a helper (not shown) that throws if mmap returns MAP_FAILED.
		let result = try handleErrors(mmap(nil, length, PROT_READ, MAP_PRIVATE | MAP_FILE, handle.fileDescriptor, 0))
		ptr = UnsafeRawBufferPointer(start: result, count: length).bindMemory(to: Float.self)
	}
	deinit {
		// Unmap the same region that init mapped.
		let buffer = UnsafeMutableRawBufferPointer(mutating: UnsafeRawBufferPointer(ptr))
		munmap(buffer.baseAddress, buffer.count)
	}
}

extension FileBackedFloatArray: RandomAccessCollection {
	typealias Element = Float

	var startIndex: Int { return 0 }
	var endIndex: Int { return ptr.count }

	subscript(position: Int) -> Element {
		precondition(ptr.indices.contains(position))
		return ptr[position]
	}
}

struct FileBacked6FloatArray: RandomAccessCollection {
	var base: FileBackedFloatArray
	typealias Element = Slice<FileBackedFloatArray>
	var startIndex: Int { return 0 }
	var endIndex: Int { return base.endIndex / 6 }

	subscript(position: Int) -> Slice<FileBackedFloatArray> {
		return base[position*6 ..< position*6 + 6]
	}
}

JCurtisRMIT · January 19, 2021, 6:29am

Wow, that’s amazing! Thanks! Really neat and portable solution :)

I have some questions about how to relate this to an entity in Core Data - but I’ll give this a shot first and experiment. Some stuff in here I’d not even known was possible in Swift

Thanks again!

gonsolo · January 19, 2021, 8:29am

I'm reading large files too and it used to be slow. I ended up with a custom Parser based on InputStream, UnsafeMutablePointer and Array. You can have a look here:

github.com/gonsolo/gonzales/blob/c6dd8960236629d48779d70e40ca9d225bfb8d7d/Sources/gonzales/Api/Parser.swift#L188

let e: UInt8 = 101

enum PbrtScannerError: Error {
	case noFile
	case unsupported
}

final class PbrtScanner {

	init(path: String) throws {
		guard let s = InputStream(fileAtPath: path) else {
			throw PbrtScannerError.noFile
		}
		stream = s
		stream.open()
		if stream.streamStatus == .error {
			throw PbrtScannerError.noFile
		}
		var bytes = Array<UInt8>(repeating: 0, count: bufferLength)
		buffer = UnsafeMutablePointer<UInt8>.allocate(capacity: bufferLength)
		buffer.initialize(from: &bytes, count: bufferLength)

eskimo · January 19, 2021, 8:54am

store them on disk as a flat array of native-endian floats

Using native-endian is a good idea but it does come with a gotcha: If the user moves this file from one machine to another, and the other machine uses the opposite endianness, you have to swap everything. It’s easy to adapt your code to do that, the tricky part is determining whether you should do it.

Having lived through two endian transitions, I’m a big fan of planning for this stuff in advance (-:

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

TellowKrinkle · January 19, 2021, 9:20am

True, it's probably better to make FileBackedFloatArray's ptr an UnsafeBufferPointer<UInt32> and put this in the subscript:

return Float(bitPattern: UInt32(littleEndian: ptr[position]))

Then your file is guaranteed to be treated as little endian, and you won't have to worry when you ship your app as a PowerPC+ARM universal binary or whatever other transition happens next
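As a minimal sketch of that round trip (the sample values are made up for illustration), the write side stores each bit pattern in little-endian order and the read side reverses it, so the result does not depend on the host's byte order:

```swift
// Encode: take each Float's bit pattern and force little-endian byte order.
let values: [Float] = [1.5, -2.25, 1024]
let stored: [UInt32] = values.map { $0.bitPattern.littleEndian }

// Decode: interpret the stored word as little-endian, then rebuild the Float.
let decoded: [Float] = stored.map { Float(bitPattern: UInt32(littleEndian: $0)) }

// In memory (and therefore on disk) the least significant byte comes first,
// whatever the host's native byte order is.
let firstBytes = withUnsafeBytes(of: stored[0]) { Array($0) }
```

On a little-endian machine both conversions are no-ops, so the extra safety should cost nothing there.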

JCurtisRMIT · January 19, 2021, 8:08pm

Great point I hadn't even considered - future self thanks you for solving an issue that would have stumped me later on down the road. Thanks to both of you!

I'm actually trying to find more info about memory mapping files in Swift, but it seems that mmap is perhaps undocumented? I found some stuff from 2011, which is pre-Swift, and digging into the definition in Xcode, it looks to be straight C? (edit: didn't realise it was a system call)

Not even quite sure how to approach writing throw conditions for try handleErrors(...) without trial and error. Any tips on finding documentation for this?

Related: What’s the recommended way to memory-map a file? - #5 by benrimmington

tim1724 · January 19, 2021, 8:32pm

It's part of the POSIX standard. Any modern UNIXish OS should have a man page for it. Or see the description in POSIX.1-2017.
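The man page describes the failure mode: mmap returns MAP_FAILED, which is (void *)-1 in C, and sets errno. A minimal sketch of what a handleErrors helper could look like under that contract follows. The helper body and the MmapError type are assumptions, since the thread never shows the original implementation.

```swift
import Foundation

// Hypothetical error type; the thread does not say what handleErrors throws.
struct MmapError: Error {
    let code: Int32   // errno value captured right after the failed call
}

// Hypothetical helper: returns the mapped address, or throws if the result
// is nil or equal to MAP_FAILED, i.e. (void *)-1.
func handleErrors(_ result: UnsafeMutableRawPointer?) throws -> UnsafeMutableRawPointer {
    let failed = UnsafeMutableRawPointer(bitPattern: -1)
    guard let pointer = result, pointer != failed else {
        throw MmapError(code: errno)
    }
    return pointer
}
```

The captured code can be compared against the constants listed in the ERRORS section of the man page (EACCES, EBADF, EINVAL, ENOMEM and so on) to decide how to react.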

eskimo · January 20, 2021, 9:46am

Any modern UNIXish OS should have a man page for it.

Indeed. There’s an ongoing problem preventing us from including man pages in the standard Apple documentation (r. 16512537), hence this: Reading UNIX Manual Pages.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

phoneyDev · January 20, 2021, 11:31pm

Simply googling 'man mmap' will find the pages for you. It won't necessarily be an Apple page, but it should get you going for mmap or any other tool or POSIX API.

eskimo · January 21, 2021, 9:25am

It won't necessarily be an Apple page

If you’re targeting Apple platforms you should use Apple man pages:

In some cases these APIs have Apple-specific extensions that are super useful. For example, the Apple man page for mmap covers MAP_JIT, which you won’t find documented anywhere else.
If you use another platform’s man pages, you may try to use its non-standard extensions on Apple platforms, which won’t end well.

If you’re writing code that you expect to run on multiple platforms, it’s best to use the standard man pages, per the link posted by Tim Buchheim above.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

JCurtisRMIT · January 22, 2021, 2:04am

Thanks everyone, helpful and friendly culture on this forum :)

I've implemented the above solution from @TellowKrinkle in my app and the load times are incredible. For the sake of others who might discover this post but might be unfamiliar with some of the methods being used as I was, here is the implementation from TK for saving the parsed flat array to disk:

// yourTextFileToFlatArrayParser, inputURL and outputURL are placeholders for your own parser and file locations.
let array: [Float] = yourTextFileToFlatArrayParser(yourFile: inputURL)
// Store each Float's bit pattern as a little-endian UInt32, matching the reader above.
let littleEndian = array.map { $0.bitPattern.littleEndian }
let data = littleEndian.withUnsafeBytes { Data($0) }
try data.write(to: outputURL)

Another advantage I didn't consider is that these files are much smaller (~50%) than the same data stored as plaintext. This cut down my app's footprint a great deal, since there are dozens of these files.
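The parser itself isn't shown in this thread. As a rough sketch only, assuming one row per line with six numbers separated by spaces, tabs or commas (the real file layout may differ), a hypothetical parseFlatFloats could look like this:

```swift
// Hypothetical error type: reports which non-blank row failed to parse.
struct RowParseError: Error {
    let row: Int
}

// Hypothetical parser. Blank lines are ignored; any other line that does not
// hold exactly `columns` numbers throws, so row indices stay aligned.
func parseFlatFloats(_ text: String, columns: Int = 6) throws -> [Float] {
    var flat: [Float] = []
    var row = 0
    for line in text.split(whereSeparator: \.isNewline) {
        let fields = line.split(whereSeparator: { $0 == " " || $0 == "\t" || $0 == "," })
        let numbers = fields.compactMap { Float($0) }
        guard fields.count == columns, numbers.count == columns else {
            throw RowParseError(row: row)
        }
        flat.append(contentsOf: numbers)
        row += 1
    }
    return flat
}
```

At 6 floats of 4 bytes each, every row takes 24 bytes in the binary file, so 500,000 rows come to about 12 MB.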

For any fellow Swift noobs discovering this, I lost some time fighting the compiler while trying to instantiate the FileBacked6FloatArray and use it correctly. I wasn't too familiar with Slice as a return type and was trying to cast to every variation of [Float] I could think of. Accessing a Slice feels a bit strange, since its indices are shared with the base collection, though the enumeration is nice.

let fileBackedFloatArray = try FileBacked6FloatArray(base: FileBackedFloatArray(file: url))
// This works: row 0 is the slice with indices 0..<6.
let valAtLineZeroIndexThree = fileBackedFloatArray[0][3]

// Out of bounds: row 1 is the slice with indices 6..<12, so index 3 traps.
// let outOfBounds = fileBackedFloatArray[1][3]

// This works, but is essentially the same as accessing the base.
let line = 1
var idx = 3
idx += fileBackedFloatArray[line].startIndex
let valAtLineOneIndexThree = fileBackedFloatArray[line][idx]

I'm sure the value of the Slice type will become clearer - still fairly new to Swift. This is a really powerful method for efficiently handling large files and a good learning experience. Thanks again @TellowKrinkle
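The shared-index behaviour can be reproduced with a plain array, which may make it easier to experiment with. This sketch uses ArraySlice rather than Slice<FileBackedFloatArray>, on the assumption that the indexing rules are the same for both:

```swift
// Twelve floats standing in for two rows of six.
let flat: [Float] = (0..<12).map { Float($0) }

// A slice keeps the indices of its base, so row 1 covers 6..<12, not 0..<6.
let row1 = flat[6..<12]
let start = row1.startIndex

// Offset from startIndex to address a column within the row.
let column3 = row1[row1.index(row1.startIndex, offsetBy: 3)]

// Copying into an Array gives zero-based indices, at the cost of an allocation.
let rowCopy = Array(row1)
let sameValue = rowCopy[3]
```

Here start is 6, and both column3 and sameValue are 9.0. Copying each row defeats some of the point of memory mapping, so offsetting from startIndex is probably the better habit for hot paths.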

eskimo · January 22, 2021, 9:55am

I've implemented the above solution from TellowKrinkle in my app and
the load times are incredible.

That’s great to hear.

There’s one further gotcha I need to call out: Can you guarantee that the file you’re reading is never going to become inaccessible? If you can’t make this guarantee then memory mapping the file is not safe. For example, if the user just happens to select a file on a USB thumb drive, it’s not safe to memory map because the user might pull out the thumb drive without notice. Once that happens the VM subsystem will be unable to satisfy page faults from that file. If your program then accesses a new page, one that’s not cached, that’ll be reflected by a memory access exception [1] which takes down your entire process O-:

So, when is it safe to memory map a file? The easy answer here is to use Data, which supports the mappedIfSafe option. The downside here is that, if the file isn’t safe, Data will read it all into memory (which would be problematic, say, for a 400 MB file on iOS).
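A minimal sketch of the Data route, assuming the little-endian file layout discussed earlier in the thread. The file name and the float(at:in:) helper are made up for illustration:

```swift
import Foundation

// Write a small little-endian test file so the example is self-contained.
let url = FileManager.default.temporaryDirectory.appendingPathComponent("floats-example.bin")
let sample: [Float] = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
let words = sample.map { $0.bitPattern.littleEndian }
try words.withUnsafeBytes { Data($0) }.write(to: url)

// Ask Foundation to map the file when it considers that safe, and to read it otherwise.
let data = try Data(contentsOf: url, options: .mappedIfSafe)

// Hypothetical accessor: assembles each value byte by byte, which avoids
// alignment assumptions and reads little-endian regardless of the host.
func float(at index: Int, in data: Data) -> Float {
    let base = data.startIndex + index * 4
    var bits: UInt32 = 0
    for i in 0..<4 {
        bits |= UInt32(data[base + i]) << UInt32(8 * i)
    }
    return Float(bitPattern: bits)
}

let third = float(at: 2, in: data)
```

The byte-by-byte accessor is written for clarity rather than speed. Whether the file was actually mapped or read into memory is Foundation's decision and isn't visible to the caller.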

If you want to stick with mmap, my suggestion is:

On iOS and friends, only use it if you can guarantee that the file is within your app’s container or a shared app group container. Files selected by the user, via iOS’s document architecture say, could be on an external drive or network file system, and thus unsafe to map.
On macOS, only use it if the file is on the root volume.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

[1] This is a machine exception, not a language exception, which makes it virtually impossible to catch [2].

[2] Yeah, I know, you can catch it with signal handlers (or, better yet, a Mach exception handler) but doing that in Swift is going to be… tricky.

JCurtisRMIT · January 22, 2021, 10:42am

Thanks Quinn! I'd been curious about this - for this application all of the files I'm reading to memory live inside the app's Bundle resources - they're strictly read only, never moved or written to. We'd like to push content updates via the app store where we'd add more of these files to the same location, but we're not building any systems where content updates can be applied while the app is running.

One question: I'm wondering if supplying strange binary files and calling potentially hazardous mmap calls could create problems for the review process when we're ready to submit to the app store?

eskimo · January 25, 2021, 10:26am

all of the files I'm reading to memory live inside the app's Bundle
resources

Yeah, that’s an ideal use case for mmap. If the volume containing your app goes away, all bets are off. Indeed, I should’ve listed that in my previous post!

I'm wondering if supplying strange binary files and calling
potentially hazardous mmap calls could create problems for the
review process when we're ready to submit to the app store?

I don’t work for App Review and they are the only folks who can give you definitive answers about what will or won’t be allowed on the store. Having said that, I can’t see any problem here; the difference between mmap and simply reading a file into memory doesn’t seem to intersect with any of App Review’s usual concerns.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

nickasd · November 5, 2023, 6:20pm

Thanks a lot for all these suggestions. They helped me dramatically reduce the time spent parsing my text file. This should be mentioned prominently somewhere in the Swift documentation. Just one note: I initially tested these various methods in Xcode with a debug build, and the performance was the same as my previous implementation or even worse. Only when creating a release build (using the Xcode Profile command) did I notice the huge improvement.

I was wondering what the difference is between String(decoding:as:) and String(data:encoding:), since the latter is a failable initializer. When should I use each one of them? Is it perhaps that the first one makes no checks and should not be used when opening arbitrary documents created by any app, but rather only when opening documents created by my app so that I know that the data is valid?

David_Smith · November 6, 2023, 4:06am

The difference is more historical than anything else. The first one is the “built in” method from the Swift standard library, and the second one is imported from Foundation and used to operate internally by bridging to ObjC NSStrings (I believe it still does for non-ASCII non-UTF8 encodings, but that may have changed since I last looked at it, a lot of the bridging-under-the-hood methods did get rewritten to not do that).
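One observable difference, shown here as a small sketch with made-up input bytes, is how the two initializers treat invalid UTF-8:

```swift
import Foundation

// "hi" followed by 0xFF, a byte that can never appear in valid UTF-8.
let bytes: [UInt8] = [0x68, 0x69, 0xFF]

// The standard library initializer always succeeds and repairs invalid input
// by substituting U+FFFD (the replacement character).
let repaired = String(decoding: bytes, as: UTF8.self)

// The Foundation initializer is failable and returns nil for the same bytes.
let strict = String(data: Data(bytes), encoding: .utf8)
```

So String(decoding:as:) does validate its input, but it repairs rather than rejects. If silently repaired text would be a problem for arbitrary documents, the failable initializer lets you detect the bad input instead.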

nickasd · November 6, 2023, 11:20am

Thanks for your answer! I was just wondering: is there an equivalent to convert a string to UTF8 data? I've read online that when using a Unicode encoding (such as .utf8) String.data(using:allowLossyConversion:) will always be a valid string that can be safely force unwrapped, but couldn't find any official documentation. I also read that one can use Data(string.utf8) to avoid force unwrapping, which looks cleaner to me... but perhaps it has a drawback? Is there an advantage in each of these two methods?

David_Smith · November 6, 2023, 5:05pm

Do you actually need Data specifically? Could you use some RandomAccessCollection<UInt8> instead? Then you could just pass string.utf8 directly and save an allocation + copy.
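A rough sketch of that idea, using a hypothetical byteSum function as the consumer. It is constrained to Collection here, which String.UTF8View satisfies:

```swift
import Foundation

// Hypothetical consumer that accepts any collection of bytes, not just Data.
func byteSum<Bytes: Collection>(_ bytes: Bytes) -> Int where Bytes.Element == UInt8 {
    bytes.reduce(0) { $0 + Int($1) }
}

let string = "h\u{E9}llo"

// Both conversions produce the same UTF-8 bytes; only the second is Optional.
let viaUTF8View = Data(string.utf8)
let viaFoundation = string.data(using: .utf8)

// The UTF-8 view can be passed directly, with no intermediate Data.
let direct = byteSum(string.utf8)
let copied = byteSum(viaUTF8View)
```

Data(string.utf8) returns a non-optional value, so there is nothing to force unwrap. When an API really does require Data, that copy is still needed.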

nickasd · January 19, 2024, 11:09am

Do you actually need Data specifically?

Most of the time I need to write the output to disk using Data.write(to:options:). Can you do this without first converting to Data?