Array index access to mapped Data > 32 bits?

I've been working on a pure-Swift GeoTIFF parser off and on for the last year. I started by loading the file using Data(contentsOf: url, options: .alwaysMapped), and then using a Reader object I created that keeps an Int index and allows for relative reads and writes of the file. I had hoped this would push the burden of buffering file reads onto the OS, although I'm not at all sure how .alwaysMapped behaves with very large files.

But today I learned that some GeoTIFF files are BigTIFF, where all the offsets and some of the counts are 64-bit rather than 32-bit. Unfortunately, this is starting to wreak havoc with my approach to reading the data, because it seems Swift arrays are limited by choice to Int.max. That means my code can only work on 64-bit systems, and even then there's the potential to run into Int vs UInt limits. I don’t think it’s unreasonable to want to access more than 2 GB of data on a 32-bit system, is it?

Barring that, is there a platform-independent buffered file reader in the Swift standard library?

Before we dig into your main issue I need to address this:

I had started by loading the file using Data(contentsOf: url, options: .alwaysMapped)

Relying on memory mapping in library code is tricky because you don’t know the origin of the file. Consider this:

  1. A developer writes a Mac app using your library.

  2. A user runs the app and opens a file on a network volume.

  3. The user’s guinea pig chews through their Ethernet cable.

  4. User does something in the app that tries to access a page in the file that’s not already cached. This triggers a page fault.

  5. The virtual memory (VM) system tries to resolve this by reading the file; that read fails with an error because of the activities of the aforementioned guinea pig.

  6. Because the VM system can’t satisfy this page fault its only recourse is to trigger a memory access exception in the app.

IMPORTANT This isn’t a Swift error, nor is it an Objective-C / C++ language exception. Rather, it’s a machine exception. It’s possible to catch this (via a Mach exception handler or a signal handler) but that’s extremely hard to do correctly.

So, if your library wants to work with arbitrary files you must not rely on memory mapping. This is why we have the .mappedIfSafe option, but that’s not suitable for potentially large files on platforms that don’t support anonymous VM (like iOS and its descendants).


With regard to your main issue, you wrote:

I don’t think it’s unreasonable to want to access more than 2 GB of
data on a 32-bit system, is it?

Personally I’d ignore this (well, implement a check in your code that explicitly catches and rejects files larger than 2 GiB when running 32-bit). Most 32-bit platforms will not let you map a file that large:

  • Non-Apple systems tend to split a process’s 32-bit address space between user and kernel ranges, meaning you don’t have access to the full 4 GiB. In many cases this split is 2/2, so it’s simply impossible to map a file that large. Some use a 3/1 split, so it’s theoretically possible to do this, but you’ll end up hitting a limit somewhere in the 2..<3 GiB range. I don’t think it’s worthwhile writing a whole bunch o’ code for that one extra GiB.

  • iOS and its descendants limit the amount of VM that a single process can use and that limit is way less than 2 GiB.

  • macOS is one of the very few 4/4 systems out there, where the user space and kernel have their own completely separate address spaces, each of which can host up to 4 GiB. However, this is irrelevant to you because Swift doesn’t support 32-bit macOS.
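
The explicit check suggested above might be sketched like this. The names FileTooLargeError and checkedIndex(forOffset:) are hypothetical, made up for illustration; the idea is just to convert each 64-bit BigTIFF offset to Int with a failable initializer, so an oversized offset becomes a thrown error rather than a trap:

```swift
// Hypothetical error type for offsets that don't fit in Int on this platform.
struct FileTooLargeError: Error {
    let offset: UInt64
}

// Converts a 64-bit file offset (as stored in a BigTIFF) into an Int index.
// Int(exactly:) returns nil when the value is not representable, which on a
// 32-bit platform means anything above Int32.max.
func checkedIndex(forOffset offset: UInt64) throws -> Int {
    guard let index = Int(exactly: offset) else {
        throw FileTooLargeError(offset: offset)
    }
    return index
}
```

Funnelling every offset read from the file through one helper like this keeps the 32-bit rejection in a single place.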

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

So much great info from you Quinn, as always!

I always figured memory mapping would be a stop-gap, at best, but for now there’s no platform-independent file API, right? And that scenario you describe is pretty awful. Seems like something that could be handled with a lot of work all the way down to the lowest levels of the OS, but I’m not going to count on that.

It seems one could write a sparse file reader implementation behind the subscript operator, even on a 32-bit platform, no? And why is the index Int and not UInt? In any case, at this point, I guess I'm writing my own subscript method, and I can make that take a UInt64, right?

Thanks!

there’s no platform-independent file API, right?

There are at least two:

  • Foundation’s FileHandle

  • Swift System’s FileDescriptor

I’d probably go with the latter because its read(fromAbsoluteOffset offset:…) method (the equivalent of pread) lets you atomically read from an offset. Then again, FileHandle has a much more generous deployment target.
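
For comparison, a positioned read with FileHandle could be sketched roughly as follows. readBytes(from:at:count:) is a hypothetical helper, not a Foundation API, and it assumes the throwing seek(toOffset:) and read(upToCount:) methods, which as far as I know need macOS 10.15.4 or later on Apple platforms:

```swift
import Foundation

// Hypothetical helper: reads up to `count` bytes starting at byte `offset`.
// Unlike pread, this is a seek followed by a read, so it is not atomic if
// the same handle is used from more than one thread.
func readBytes(from handle: FileHandle, at offset: UInt64, count: Int) throws -> Data {
    try handle.seek(toOffset: offset)
    // read(upToCount:) returns nil at end of file; treat that as "no bytes".
    return try handle.read(upToCount: count) ?? Data()
}
```

Note that the offset parameter is UInt64 regardless of platform, so the offset itself is not limited by the width of Int; only the amount read in one call is.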

It seems one could write a sparse file reader implementation behind
the subscript operator … no?

Correct. This sort of thing was very common prior to the widespread adoption of 64-bit systems, and it can still make sense today. You are effectively implementing your own VM system and, while it’s not a good idea to bet against the VM system in general, you can win if you exploit deep knowledge of your file format.
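
As a rough illustration of that idea, here is a minimal sketch of a block cache behind a subscript. Everything in it (the PagedReader name, the loadBlock closure, the first-in-first-out eviction) is an assumption made for the example, not an existing API; a real reader would probably load blocks with a positioned file read and pick an eviction policy suited to the file format:

```swift
// Hypothetical sketch: a tiny block cache addressed by 64-bit offsets.
final class PagedReader {
    let blockSize: Int
    let maxCachedBlocks: Int
    // Loads one block given its block number. Assumed to return exactly
    // `blockSize` bytes; a short final block would need extra bounds checks.
    private let loadBlock: (UInt64) -> [UInt8]
    private var cache: [UInt64: [UInt8]] = [:]
    private var order: [UInt64] = []   // block numbers, oldest first

    init(blockSize: Int, maxCachedBlocks: Int,
         loadBlock: @escaping (UInt64) -> [UInt8]) {
        precondition(blockSize > 0 && maxCachedBlocks > 0)
        self.blockSize = blockSize
        self.maxCachedBlocks = maxCachedBlocks
        self.loadBlock = loadBlock
    }

    // Returns the byte at `offset`, loading its block on a cache miss.
    subscript(offset: UInt64) -> UInt8 {
        let blockNumber = offset / UInt64(blockSize)
        let indexInBlock = Int(offset % UInt64(blockSize))
        if let block = cache[blockNumber] {
            return block[indexInBlock]
        }
        if order.count >= maxCachedBlocks {
            // Evict the oldest block to stay within the cache budget.
            cache[order.removeFirst()] = nil
        }
        let block = loadBlock(blockNumber)
        cache[blockNumber] = block
        order.append(blockNumber)
        return block[indexInBlock]
    }
}
```

Only the index within a block is ever converted to Int, so the same code works for offsets beyond Int.max on a 32-bit platform.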

And why is the index Int and not UInt?

Because Swift has a strong preference for Int as its currency type. Remember that:

  • The extra bit only matters on 32-bit systems.

  • Swift ‘grew up’ in a world of 64-bit systems.

  • Even on 32-bit systems, the extra bit is rarely useful in practice because of the address space limitations I mentioned previously.

I guess I'm writing my own subscript method, and I can make that take
a UInt64, right?

If you’re going to build an abstraction you should, IMO, go all the way and make a custom index type. That’ll avoid any chance of mixups.
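
A custom index type along those lines might look like the sketch below. FileOffset is a hypothetical name, and the exact set of operations is a guess at what a parser would need:

```swift
// Hypothetical index type wrapping a 64-bit byte offset into a file.
// There is deliberately no initializer from Int and no integer-literal
// conformance, so an array index can't be passed by accident.
struct FileOffset: Comparable, Hashable {
    let rawValue: UInt64

    init(_ rawValue: UInt64) {
        self.rawValue = rawValue
    }

    static func < (lhs: FileOffset, rhs: FileOffset) -> Bool {
        lhs.rawValue < rhs.rawValue
    }

    // Returns the offset `distance` bytes further into the file,
    // or nil if the addition would overflow UInt64.
    func advanced(by distance: UInt64) -> FileOffset? {
        let (sum, overflow) = rawValue.addingReportingOverflow(distance)
        return overflow ? nil : FileOffset(sum)
    }
}
```

A reader's subscript can then take FileOffset instead of UInt64, and mixing up a file offset with an in-memory index becomes a compile-time error.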

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

Thanks for all the amazing info!

Well, I’m going to try to build this as a SwiftUI app, so the app, at least, will require macOS 11+. That’s not to say that the file parsing code needs to be restricted as such, but I’ll burn that bridge when I come to it.

Welp, it seems you can’t read more than Int32.max bytes at a time. If you try, you get EINVAL, which the manpage for read() says means “The sum of the iov_len values in the iov array over-flowed a 32-bit integer.”

I realize this is a limitation of read() on macOS (maybe elsewhere too), but it’s frustrating, because the FileDescriptor API implies I can read more than 2GiB-1 bytes at a time.

Given you won't be able to allocate a full UInt32.max anyway, what would you read it into? Think of it in practice:

The entire OS and your app need to be loaded into RAM before you can even get to this read call. So your address space is guaranteed to be < UInt32.max.

You'll have to split up anything you read into much smaller blocks anyway, as the OS has limits on individual allocated blocks, and iOS without virtual memory comes with fairly little RAM. A "big" iPhone 11 has 6GB RAM and 512GB storage, part of which is occupied by the OS and apps, so you are unlikely to encounter a file > 500GB, and can allocate < 6GB (even less on older devices, which came with 1GB or less). Apple Watch (the only current 32-bit platform Apple sells) came with 512MB to 1GB.

Unless the code is running on a Mac, or a Linux box, or something bigger than an iOS device…

The last 32-bit-only Macs that Apple shipped were the Core Solo/Duo models, discontinued 15 years ago.

I'm sure you can still find 32-bit Linux boxes out there. If that's your target audience, then you gotta find a way to support them, I suppose. In my experience it is rarely worth trying to micro-optimize for these systems; at most, it's good enough that you're segmenting your image reading code so that it at least still works. Most 32-bit-only systems I encounter are in "small" contexts (like wearables), so they have little RAM and swap anyway.

As a workaround, I suppose you could for example read in chunks of Int32.max into a larger buffer.
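
That workaround could be sketched as a simple loop. readInChunks(from:total:chunkSize:) is a hypothetical helper, shown here with FileHandle to keep the example to Foundation; the same loop shape should carry over to other read APIs, though I haven't checked each one's per-call limit:

```swift
import Foundation

// Hypothetical helper: reads up to `total` bytes from the handle's current
// position, asking for at most `chunkSize` bytes per read call.
func readInChunks(from handle: FileHandle, total: Int,
                  chunkSize: Int = Int(Int32.max)) throws -> Data {
    precondition(total >= 0 && chunkSize > 0)
    var result = Data()
    while result.count < total {
        let wanted = min(chunkSize, total - result.count)
        guard let chunk = try handle.read(upToCount: wanted), !chunk.isEmpty else {
            break   // reached end of file before `total` bytes were available
        }
        result.append(chunk)
    }
    return result
}
```

The caller still needs enough memory for the whole result, so in practice a much smaller chunk size, processed piece by piece, is likely the better fit for large images.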
