How to calculate the alignment of an unsafe pointer for heterogeneous data?

Hello, I'm trying to parse a binary file from the Jet Propulsion Laboratory (JPL). They provide binary files known as Double Precision Array Files (DAF), which contain planetary ephemeris data. Each file is a contiguous sequence of mixed integers, doubles and strings (with different byte counts). The structure is detailed here: DAF Reference

I'm using swift-system and its FileDescriptor struct to read the data piece by piece without loading it entirely into memory, since it's several MB in size.

var buffer = ...
let _ = try descriptor.read(fromAbsoluteOffset: Int64(0), into: buffer)

The read method takes an UnsafeMutableRawBufferPointer as its parameter, here with a count of 1024.

The only way I found to instantiate the buffer is through:

UnsafeMutableRawBufferPointer.allocate(byteCount: 1024, alignment: ???)

But I don't know what to pass for alignment, or what its role is. Is there a way to determine the best alignment for a set of heterogeneous data in a binary file?

The first part of the DAF file describes its properties, laid out as follows:

  1. LOCIDW (8 characters, 8 bytes): An identification word.

  2. ND ( 1 integer, 4 bytes) [Address 8]

  3. NI ( 1 integer, 4 bytes) [Address 12]

  4. LOCIFN (60 characters, 60 bytes) [Address 16]

  5. FWARD ( 1 integer, 4 bytes) [Address 76]

  6. BWARD ( 1 integer, 4 bytes) [Address 80]

  7. FREE ( 1 integer, 4 bytes) [Address 84]

  8. LOCFMT ( 8 characters, 8 bytes) [Address 88]

  9. PRENUL ( 603 characters, 603 bytes) [Address 96]

  10. FTPSTR ( 28 characters, 28 bytes) [Address 699]

  11. PSTNUL ( 297 characters, 297 bytes) [Address 727]
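
As a sanity check on the addresses listed above, here is a small sketch that recomputes each field's starting offset from the field sizes. The names `headerFields` and `fieldOffsets` are made up for illustration; only the field names and sizes come from the list.

```swift
// Field names and sizes are copied from the list above.
let headerFields: [(name: String, size: Int)] = [
    ("LOCIDW", 8), ("ND", 4), ("NI", 4), ("LOCIFN", 60),
    ("FWARD", 4), ("BWARD", 4), ("FREE", 4), ("LOCFMT", 8),
    ("PRENUL", 603), ("FTPSTR", 28), ("PSTNUL", 297),
]

// Walk the fields in order, recording the byte offset where each one starts.
func fieldOffsets(_ fields: [(name: String, size: Int)]) -> (offsets: [String: Int], total: Int) {
    var offsets: [String: Int] = [:]
    var next = 0
    for field in fields {
        offsets[field.name] = next
        next += field.size
    }
    return (offsets, next)
}

let layout = fieldOffsets(headerFields)
print(layout.total)               // 1024: the fields fill one whole record
print(layout.offsets["FTPSTR"]!)  // 699: an odd address, so not 4-byte aligned
```

The integer fields all start at multiples of 4, while the trailing character fields start at odd addresses.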

For this kind of case, it’s best to just use the largest alignment of anything you’ll load from the file.

It’s possible that objects in the file won’t actually be aligned relative to the start of the file, of course, so loads will generally need to be unaligned. But often loads that work on unaligned addresses still get the performance benefits of aligned loads if they’re from addresses that happen to be well-aligned.
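
To make the aligned/unaligned distinction concrete, here is a minimal sketch. It assumes Swift 5.7 or later (for `loadUnaligned`) and little-endian example bytes; `alignmentDemo` is a hypothetical name, and the values written into the buffer are invented for the demonstration.

```swift
func alignmentDemo() -> (aligned: Int32, unaligned: Int32) {
    // Allocate with the largest alignment of anything we expect to load.
    let alignment = max(MemoryLayout<Int32>.alignment, MemoryLayout<Double>.alignment)
    let buffer = UnsafeMutableRawBufferPointer.allocate(byteCount: 1024, alignment: alignment)
    defer { buffer.deallocate() }
    buffer.initializeMemory(as: UInt8.self, repeating: 0)

    buffer[8] = 2     // low byte of a little-endian Int32 at offset 8
    buffer[699] = 7   // low byte of a little-endian Int32 at odd offset 699

    // Offset 8 is a multiple of 4 from an aligned base, so a plain load is valid.
    let aligned = Int32(littleEndian: buffer.load(fromByteOffset: 8, as: Int32.self))
    // Offset 699 is not a multiple of 4, so the load must be an unaligned one.
    let unaligned = Int32(littleEndian: buffer.loadUnaligned(fromByteOffset: 699, as: Int32.self))
    return (aligned, unaligned)
}
```

Using `loadUnaligned` everywhere is the conservative choice when offsets come from a file format rather than from Swift's own layout rules.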

That’s good to hear. Just to check that I understood what you meant: alignment is an optimization for loading data, so the base address of whatever I want to extract should be a multiple of the alignment I choose. If the largest thing in the file is a String and MemoryLayout reports its alignment as 8, would that be the best option?

Or should it be 4, since most of the base addresses are multiples of 4? 8, 12, 16, 76, etc…

Or 603, because the largest string here takes up to 603 bytes?

Or 1, since there are odd-numbered addresses?

I think understanding the role of alignment would help, but I don't think I grasp the consequences of data being "unaligned".

In your case the largest alignment is 4 bytes, and "1 integer, 4 bytes" would match Swift's Int32 or UInt32. I wonder, though, how you plan to lay this out in Swift. You could attempt:

struct S {
	let l1, l2, l3, l4, l5, l6, l7, l8: UInt8  // LOCIDW
	let nd: Int32
	let ni: Int32
	let locifn: (UInt8, UInt8, UInt8 ... 60 times)
	let fward: Int32
	let bward: Int32
	let free: Int32
	let locfmt: (UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8)
	let prenul: (UInt8, ... 603 times)
	let ftpstr: (UInt8, ... 28 times)
	let pstnul: (UInt8, ... 297 times)
}

but it has several issues:

  • it is awful

  • Swift doesn't give guarantees about struct layout (unless the struct is a C struct imported into Swift). I wouldn't be surprised if there were unwanted gaps in that structure, e.g. because Swift places individual fields according to its own alignment preferences.

  • it is awful

I'd recommend one of these (in order of preference):

  • do the deserialising manually instead; it's not hard. Essentially you'd have something like "readInt32(offset)", etc., and the corresponding writes if you need to write the structure back.

  • express this struct in C and import it into Swift:

struct S {
	uint8_t locidw[8];
	uint32_t nd;
	uint32_t ni;
	uint8_t locifn[60];
	uint32_t fward;
	uint32_t bward;
	uint32_t free;
	uint8_t locfmt[8];
	uint8_t prenul[603];
	uint8_t ftpstr[28];
	uint8_t pstnul[297];
};

assert(sizeof(struct S) == 1024); // the field sizes sum to 1024: 8 + 4 + 4 + 60 + 4 + 4 + 4 + 8 + 603 + 28 + 297

In the latter case you need to double-check that the compiler packs this struct without gaps. If it doesn't, that could be due to those odd-sized string fields at the bottom, in which case you may want to combine them into a single 928-byte array.

several MB's is nothing... do you mean several GB's?

It's 16 MB; I thought that counted as a heavy file, although there is one I'll need afterwards that's several GB in size.

I agree, it's awful. I tried creating a struct for this in Swift, but when I converted it to data there were differences because of padding between the properties, which doesn't match the packed binary format. The C struct definitely sounds like a possible solution. I'm interested in what you proposed as readInt32(offset).

I don't know if one is more efficient than the other, but by readInt32(offset) do you mean I should call read for each "property" I want to extract, or should I pull a big chunk, like 1024 bytes, which is an entire file record (there are multiple records inside the file), and then call loadInt32(offset)?

Here's a quick sketch of the first approach (untested):

func readByte(from p: UnsafePointer<UInt8>, offset: Int) -> UInt8 {
    p[offset]
}

func writeByte(_ v: UInt8, to p: UnsafeMutablePointer<UInt8>, offset: Int) {
    p[offset] = v
}

func readInt32(from p: UnsafePointer<UInt8>, offset: Int) -> Int32 {
    (p + offset).withMemoryRebound(to: Int32.self, capacity: 1) { r in
        r.pointee
    }
}

func writeInt32(_ v: Int32, to p: UnsafeMutablePointer<UInt8>, offset: Int) {
    (p + offset).withMemoryRebound(to: Int32.self, capacity: 1) { p in
        p.pointee = v
    }
}

func readData(from p: UnsafePointer<UInt8>, offset: Int, count: Int) -> Data {
    Data(bytes: p + offset, count: count)
}

func writeData(_ data: Data, to p: UnsafeMutablePointer<UInt8>, offset: Int) {
    _ = data.withUnsafeBytes { (bp: UnsafeRawBufferPointer) in
        memmove(p + offset, bp.baseAddress!, data.count)
    }
}

func readString(from p: UnsafePointer<UInt8>, offset: Int, count: Int) -> String {
    let data = readData(from: p, offset: offset, count: count)
    return String(data: data, encoding: .ascii)!
}

func writeString(_ s: String, to p: UnsafeMutablePointer<UInt8>, offset: Int) {
    let data = s.data(using: .ascii)!
    writeData(data, to: p, offset: offset)
}

struct S {
    var area: UnsafePointer<UInt8>
    
    var fward: Int32 {
        readInt32(from: area, offset: 76)
    }
    var locifn: String {
        readString(from: area, offset: 16, count: 60)
    }
    ...
}

The records seem to be fixed size (1024? 1016 bytes?).
I'd read one block and then use the wrapper structure above to get individual fields out of it.

I'm trying to test it, but I haven't found how to turn an UnsafeMutableRawBufferPointer into an UnsafePointer<UInt8>. Is there a way to go back and forth between these types?

let x: UnsafeMutableRawBufferPointer = ...
S(area: x.baseAddress!.assumingMemoryBound(to: UInt8.self))

or change the initialiser:

struct S {
	let area: UnsafeMutablePointer<UInt8>

	init(_ p: UnsafeMutableRawBufferPointer) {
		self.area = p.baseAddress!.assumingMemoryBound(to: UInt8.self)
	}
	...
}

Brilliant, your concept works perfectly. Thank you very much for your insight. I'm also going to give the C struct a go, at least for kicks, but I believe your solution is the appropriate one.

I'll post the solutions that worked here in case someone else needs them:

let descriptor = try FileDescriptor.open(path, .readOnly)

let buffer = UnsafeMutableRawBufferPointer.allocate(byteCount: 1024, alignment: 4)
defer { buffer.deallocate() }

// read returns the number of bytes actually read; it is ignored here.
_ = try descriptor.read(fromAbsoluteOffset: 0, into: buffer)

// Option 1: bind the memory to UInt8.
let pointer = buffer.bindMemory(to: UInt8.self).baseAddress!
// Option 2 (alternative to option 1):
// let pointer = buffer.baseAddress!.assumingMemoryBound(to: UInt8.self)
let data = Data(bytes: pointer, count: 8)

// The following also seemed to work, but since it's not typed it might not be the preferred way. I'm guessing both work since we're using the smallest unit, UInt8.
let sameData = Data(buffer[0..<8])

I'll try to use the same pattern with the Double type and report back.

Note you may also do the Data initialiser:

try! Data(contentsOf: url)

and then data.withUnsafeBytes ...

Double should be no problem.

Which endian is your source material? You may need to convert it if there's a mismatch.

What’s the difference between using Array(data) and data.withUnsafeBytes? I ask because of your example.

Endianness is a… funny part. It can be either, so I first have to check locfmt, which indicates the endianness, before I can parse the rest of the contents.

The above sketch needs raw bytes (either UnsafeRawPointer or UnsafeRawBufferPointer). Where you get those from is not very important, and if you start with either Data or Array, both have withUnsafeBytes to give you the underlying bytes. You may also restructure your code to work directly with either Array's or Data's subscripts.

I see. Then you'd need some htonl / Int32's bigEndian / or the equivalent logic of your own here and there, applied conditionally.
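
A rough sketch of that conditional conversion is below. It assumes Swift 5.7 or later for `loadUnaligned`, and it assumes that LOCFMT (8 bytes at address 88) holds the string "LTL-IEEE" or "BIG-IEEE", which is my reading of the DAF documentation and should be checked against it. `ByteOrder`, `byteOrder(ofRecord:)` and `readInt32(_:offset:order:)` are illustrative names.

```swift
enum ByteOrder { case little, big }

// Inspect LOCFMT (8 bytes at address 88) to decide how to decode the record.
// The two format strings are an assumption taken from the DAF documentation.
func byteOrder(ofRecord record: UnsafeRawBufferPointer) -> ByteOrder? {
    let locfmt = String(decoding: record[88..<96], as: UTF8.self)
    switch locfmt {
    case "LTL-IEEE": return .little
    case "BIG-IEEE": return .big
    default: return nil
    }
}

// Load four raw bytes, then reinterpret them according to the file's byte order.
func readInt32(_ record: UnsafeRawBufferPointer, offset: Int, order: ByteOrder) -> Int32 {
    let raw = record.loadUnaligned(fromByteOffset: offset, as: Int32.self)
    return order == .little ? Int32(littleEndian: raw) : Int32(bigEndian: raw)
}
```

`Int32(littleEndian:)` and `Int32(bigEndian:)` swap bytes only when the file's order differs from the host's, so the same code works on either kind of machine.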

BTW, another option would be to memory map that file and treat it as memory.

These days you can memory map files even bigger than 4GB (I remember there was an explicit opt-in for that; whether it was required on iOS or macOS I don't remember offhand).

I remember that wrong, that entitlement is to get access to more than 5GB memory on iOS (com.apple.developer.kernel.increased-memory-limit | Apple Developer Documentation). Has nothing to do with the size of file that you can memory map, so you can probably have a very large file (like 10GB) and memory map it. NSData has mappedIfSafe option and NSData is bridgeable to Data.
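
For reference, a minimal sketch of the memory-mapped route using Foundation's `Data(contentsOf:options:)` with `.mappedIfSafe`. The function name `mappedRecord`, the error type, the zero-based record index and the 1024-byte record size are assumptions for illustration.

```swift
import Foundation

struct RecordOutOfRange: Error {}

// Map the file if the system considers it safe to do so, then copy out one fixed-size record.
// `index` is zero-based here, which is a choice of this sketch.
func mappedRecord(at url: URL, index: Int, recordSize: Int = 1024) throws -> Data {
    let data = try Data(contentsOf: url, options: .mappedIfSafe)
    let start = index * recordSize
    let end = start + recordSize
    guard index >= 0, end <= data.count else { throw RecordOutOfRange() }
    // subdata copies just this record, so the result does not keep the mapping alive.
    return data.subdata(in: start..<end)
}
```

Whether the file is actually mapped is up to the system; with `.mappedIfSafe` it may fall back to an ordinary read.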

It’s been some time since I’ve done any iOS programming, so I was not aware of this. It’s good to know; my aim is to make a Swift Package that can handle these files so that it can run on different systems.
The sample file I’m using seems to be encoded as little-endian, and it works as-is, so I’m guessing that matches the native byte order of my machine. I’ll need to handle the conversion for big-endian files as you mentioned.

I hope I can make it pure Swift at some point as well, but that’s for the future :)

I remember that wrong

Hey hey, you were right the first time, you just found the wrong entitlement. The droid you’re looking for is Extended Virtual Addressing Entitlement.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

Thank you, good to know. What is the limit without that opt-in?

C is the way to guarantee layout. Swift @frozen struct layout is unspecified but stable, so if you check that it does what you want, you can rely on it staying that way.

area should be UnsafeRawPointer, and all those APIs that read values from a byte stream should take UnsafeRawPointer, not UnsafePointer<UInt8>.

UnsafePointer<UInt8> means that you're literally pointing to a value that's already declared as UInt8, or storage of Array<UInt8>. It never makes sense for a byte buffer.

Please don't use either bindMemory or assumingMemoryBound(to:). Why do you need an UnsafePointer<UInt8>? That's not the right type to use for a byte buffer.

Thank you. Do you know why my version works if it's doing something wrong? Is there anything I can do to cause it to break? E.g. certain bit patterns in the input, or debug mode plus some sanitiser options? Or is it doing something wrong at a conceptual level, or something that works today but may break in future versions of Swift or the compiler? I thought everything is essentially "an array of bytes" (conceptually speaking), even when it is, say, a heterogeneous struct with Doubles and Ints. So I assume it is safe to treat the original sender's data structure as "an array of bytes", and to get, say, FWARD (Int32 at offset 76) I can just do:

bytes[76]*2^0  + bytes[77]*2^8 + bytes[78]*2^16 + bytes[79]*2^24

(pseudocode, ignore the endian issues, etc), which is equivalent to the current implementation of readInt32 (methinks)
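
Spelled out in Swift, that pseudocode could look like the sketch below. It handles the little-endian case only, and `int32FromLittleEndianBytes` is an illustrative name, not something from the earlier posts.

```swift
// Combine four consecutive bytes into an Int32, least significant byte first.
func int32FromLittleEndianBytes(_ bytes: [UInt8], at offset: Int) -> Int32 {
    let b0 = UInt32(bytes[offset])
    let b1 = UInt32(bytes[offset + 1]) << 8
    let b2 = UInt32(bytes[offset + 2]) << 16
    let b3 = UInt32(bytes[offset + 3]) << 24
    // Assemble as unsigned, then reinterpret the bit pattern as a signed value.
    return Int32(bitPattern: b0 | b1 | b2 | b3)
}
```

Because it only ever reads individual bytes, this version has no alignment requirement at all.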

Please provide your implementation for readInt32.

UnsafeRawBufferPointer

I really like @tera's answer. If you want to go with UnsafeRawBufferPointer then:

import Foundation

struct FileRecord {
  /// (8 characters, 8 bytes): An identification word (`DAF/xxxx').
  let locidw: String
  /// ( 1 integer, 4 bytes): The number of double precision components in each array summary. [Address 8]
  let nd: Int32
  /// ( 1 integer, 4 bytes): The number of integer components in each array summary. [Address 12]
  let ni: Int32

  // etc…
}

func readString(_ ptr: UnsafeRawBufferPointer, offset: Int, length: Int) -> String {
  let slice = ptr[offset..<(offset + length)]
  let data = Data(slice)
  return String(data: data, encoding: .ascii)!
}

func readInt32(_ ptr: UnsafeRawBufferPointer, offset: Int) -> Int32 {
  return ptr.load(fromByteOffset: offset, as: Int32.self)
}

func parse(ptr: UnsafeRawBufferPointer) -> FileRecord {
  let locidw = readString(ptr, offset: 0, length: 8)
  let nd = readInt32(ptr, offset: 8)
  let ni = readInt32(ptr, offset: 12)
  return FileRecord(locidw: locidw, nd: nd, ni: ni)
}

// ============== TEST ==============

// This is the 1st line from 'The File Record' section of your doc.
// '0o' means octal (because they used 'od -cbv', where -b means octal output).
// I will just use Foundation.Data instead of the file.
let data = Data([
  // 0000  D      A      F      /      S      P      K          002       \0   \0   \0   006    \0   \0   \0
           0o104, 0o101, 0o106, 0o057, 0o123, 0o120, 0o113, 0o040, 0o002, 000, 000, 000, 0o006, 000, 000, 000,
])

let result = data.withUnsafeBytes(parse(ptr:))
print(result.locidw) // DAF/SPK
print(result.nd) // 2
print(result.ni) // 6

Obviously, this is a dummy mock; in real life it will probably be more complicated.

Unit tests

If you want unit tests then you can create something like:

protocol BinaryFile {
  func read(offset: Int, count: Int) -> UnsafeRawBufferPointer
  /// Release buffer.
  func close()
}

func parse(file: BinaryFile) -> FileRecord { things and stuff }

Then in unit tests you just create BinaryFileMock that is backed by Foundation.Data (just like I did in example).
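
One possible shape for such a mock is sketched below. The protocol is repeated so the snippet compiles on its own, and the mock is backed by a plain byte array copied into an allocated buffer rather than Foundation.Data, purely to keep the returned pointer valid until `close()`. `BinaryFileMock`'s internals are an assumption, not part of the original suggestion.

```swift
protocol BinaryFile {
  func read(offset: Int, count: Int) -> UnsafeRawBufferPointer
  /// Release buffer.
  func close()
}

final class BinaryFileMock: BinaryFile {
  // Owned copy of the test bytes; stays valid until close() is called.
  private let storage: UnsafeMutableRawBufferPointer

  init(bytes: [UInt8]) {
    storage = UnsafeMutableRawBufferPointer.allocate(byteCount: bytes.count, alignment: 8)
    storage.copyBytes(from: bytes)
  }

  func read(offset: Int, count: Int) -> UnsafeRawBufferPointer {
    // Rebase the slice so the returned buffer is indexed from 0.
    UnsafeRawBufferPointer(rebasing: storage[offset..<(offset + count)])
  }

  // Call exactly once; buffers returned by read are invalid afterwards.
  func close() {
    storage.deallocate()
  }
}
```

A test would create the mock from known bytes, parse, compare the fields, and then call `close()`.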

Alternative

Though the best option may be to use an existing C library to do the work for you.
Swift has a really nice C interop, the code may look a little bit ugly, but it works nicely.

Could you provide an example of the option you're referring to?

@tera Btw, I was able to make a working parser of the DAF file. Thank you very much

P.S.: I do have a question: should I call the buffer's deinitialize method before deallocate inside the defer block?