How to handle endianness?

Terje · November 18, 2021, 10:36am

Application: a database written from scratch in Swift
Progress: Reading/writing individual bytes works (thus far) but now I have to make sure I get/return them in the correct order.
Experience dealing with endianness: very little.

I add features and improve correctness each time I refactor my code. This time around I want to make sure it's cross-platform, so that means dealing with endianness.

At some point I have to write integers of various sizes to disk. These represent page numbers (UInt32), page offsets (UInt16), raw enum values, or actual user data (Int). The code below works fine as far as I can tell, but it doesn't take endianness into account.

The FixedWidthInteger protocol has bigEndian and littleEndian properties. Should I just pick one and call ...Endian on every value before applying these functions?

I picked ExpressibleByIntegerLiteral because that covers floating point values as well. Floating points do not have ...Endian, but I could always add that in an extension or handle floating points separately.

Also, I only have an Intel Mac (and an iPhone and iPad). How do I unit test this? Endianness comes down to the hardware platform, right?

Pages are (or should become) fixed size bags of bytes. Content can be written over multiple pages.
This implies that e.g. the 8 bytes representing an Int could be written on multiple pages.
For Strings I just use their UTF8 representation. That should be fine, right?

    func read<T>(type: T.Type = T.self) throws -> T
    where T: ExpressibleByIntegerLiteral //because floating points are ExpressibleByIntegerLiteral
    {
        var value: T = 0
        cursor += try Swift.withUnsafeMutableBytes(of: &value) { try self.copy(range: self.cursor ..< self.cursor + MemoryLayout<T>.size, to: $0) }
        return value
    }

    @discardableResult
    func write<T>(value: T) throws -> Int
    where T: ExpressibleByIntegerLiteral
    {
        let endIndex = cursor + MemoryLayout<T>.size
        try Swift.withUnsafeBytes(of: value) { try self.replace(range: self.cursor ..< endIndex, with: $0) }
        cursor = endIndex
        return MemoryLayout<T>.size
    }

    //because why waste disk storage if the majority of values would fit in UInt8 or UInt16
    //okay, disks are big enough these days, but fitting more data onto a page should be faster
    func read<T>(compressed: T.Type) throws -> (value: T, count: Int)
    where T: FixedWidthInteger //compressing individual floating points makes little sense.
    {
        var count = 0
        var value = T.zero
        while cursor < endIndex
        {
            let byte = try self.value(at: cursor)
            value |= T(byte & 127) << (7*count)
            count += 1
            cursor += 1
            if byte < 128 { return (value, count) }
        }
        throw Error.outOfBounds()
    }

    @discardableResult
    func write<T>(compressed value: T) throws -> Int
    where T: FixedWidthInteger
    {
        let marker = cursor
        var value = UInt(truncatingIfNeeded: value) //sign-extends like Int(value) did, but does not trap for UInt64 values above Int.max
        while value > 127 && cursor < endIndex
        {
            try set(value: UInt8( (value & 127) | 128 ), at: cursor)
            value >>= 7
            cursor += 1
        }
        try set(value: UInt8(value & 127), at: cursor)
        cursor += 1
        return cursor - marker
    }
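
The compressed format above handles one byte at a time using shifts and masks, so its on-disk layout does not depend on the byte order of the machine. The following is a minimal standalone sketch of the same 7-bits-per-byte scheme over a plain byte array. The names encodeVarint and decodeVarint are hypothetical, and it is restricted to UInt64 for brevity rather than being generic like the methods above.

```swift
// Hypothetical standalone helpers; not part of the database type above.
// Each byte carries 7 payload bits; the high bit means "more bytes follow".
func encodeVarint(_ value: UInt64) -> [UInt8] {
    var remaining = value
    var bytes: [UInt8] = []
    while remaining > 127 {
        bytes.append(UInt8(remaining & 127) | 128)
        remaining >>= 7
    }
    bytes.append(UInt8(remaining))
    return bytes
}

// Returns the decoded value and the number of bytes consumed,
// or nil if the input ends before a terminating byte is found.
func decodeVarint(_ bytes: [UInt8]) -> (value: UInt64, count: Int)? {
    var value: UInt64 = 0
    for (index, byte) in bytes.enumerated() {
        guard index < 10 else { return nil } // a UInt64 needs at most 10 groups of 7 bits
        value |= UInt64(byte & 127) << (7 * index)
        if byte < 128 { return (value, index + 1) }
    }
    return nil
}

let encoded = encodeVarint(300)      // [0xAC, 0x02]
let decoded = decodeVarint(encoded)  // value 300, count 2
```

Because the least significant group is always written first, the same bytes decode to the same value on any host.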

Edit: Forgot to add: generic advice, guidelines, suggestions and hints are fine and probably preferred. I am not expecting anybody to write this code for me. I just lack experience/knowledge in this particular narrow topic. I would rather ask now than have this blow up in my face years down the road.

Karl · November 18, 2021, 12:02pm

Endianness determines how a bunch of bytes are interpreted as a number, and applies to both integers and floating-point values. So if you're going from a numeric value -> bytes, or bytes -> numeric value, you should ensure a consistent endianness.

Note that even if you write an integer using a hex or binary literal, such as 0xDEADBEEF, that is a numeric literal and not a bytes-in-memory literal. Printing and parsing also works at the numeric level, etc.
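
To make the difference between the numeric value and its bytes concrete, here is a small sketch using only standard library calls. It shows that the literal names a number, and that the byte sequence only appears once an encoding is chosen.

```swift
let value: UInt32 = 0xDEADBEEF

// Bytes of a little-endian encoding: least significant byte first.
let littleEndianBytes = withUnsafeBytes(of: value.littleEndian) { Array($0) }
// Bytes of a big-endian encoding: most significant byte first.
let bigEndianBytes = withUnsafeBytes(of: value.bigEndian) { Array($0) }

print(littleEndianBytes.map { String($0, radix: 16) })  // ["ef", "be", "ad", "de"]
print(bigEndianBytes.map { String($0, radix: 16) })     // ["de", "ad", "be", "ef"]

// Printing works at the numeric level, so it is the same everywhere.
print(String(value, radix: 16))                         // deadbeef
```

Both arrays come out the same on any host, because littleEndian and bigEndian describe the layout of the data, not of the machine.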

Yes, endianness comes down to the hardware platform, and also the operating system. Bi-endian machines will sometimes have big- and little-endian variants of the OS.

Testing is a big problem, though. If you look at Debian's list of ports, you'll notice that even for bi-endian machines, only the little-endian variants are supported or actively maintained (e.g. MIPS, MIPS64, PPC64, RISCV), and the big-endian variants are all discontinued. The notable holdout is IBM's s390x, used in mainframes that most people don't have access to.

You can basically ignore big-endian machines. Writing serialised data with a defined endianness is nice, and about as much effort as you need to put into it; don't bother to actually write the code to handle running on a big-endian machine.

I had a similar issue with a library I was working on, and I decided it wasn't even worth worrying about. In the very unlikely case that somebody releases a new big-endian machine and wants to port your database to it, they can figure out the issues at that time. They'll already be busy porting half the universe to their system, and the chances are your untestable code won't work first time anyway.

Personally, if portability is the goal, I would rather invest my time supporting other OSes such as Windows, not obscure hardware platforms. That's a system people actually use and you can actually test.

Hope that helps!

Terje · November 18, 2021, 1:16pm

Hope that helps!

Yes and no. It confirms what I have read but an intermediate step to practicality would be helpful.

Personally, if portability is the goal, I would rather invest my time supporting other OSes such as Windows, not obscure hardware platforms.

Oh, I don't have such lofty goals. I'll be happy if it works on a Mac and an iPhone/iPad (and Linux/Windows, sure, why not). But the Mac is Intel and the others are ARM. These (can) have different endianness and that is giving me a bit of anxiety.

My train of thought:
So my UInt32 value has bytes ABCD. Swift.withUnsafeBytes(of: value) will give me those bytes in that order: A - B - C - D. So I write them to the file in that order. If I look at my file with a hex editor, I see them in that order. When I read them back and put them in the UInt32 container with Swift.withUnsafeMutableBytes(of: &value), I have respected the byte order, so of course I get the original value.

But suppose now that I write these four bytes on my Mac. And on my iPhone I read these 4 bytes and stick them in the UInt32 container. Will the UInt32 have the same value on both platforms? Why should they? ARM can be either endianness but in practice ARM and Intel have different endianness.

Mac: UInt32 (ABCD) => write (ABCD) <= file format => read (ABCD) => iPhone: UInt32 (ABCD), because I put them in there in that order. But maybe they are wrongly interpreted as UInt32 (DCBA) and I should swap them?

Is this a correct summary(?) of your answer and the link you gave:
Don't worry about endianness. Pick one for my file format and be consistent. Big-endian is being discontinued anyway, so it's not worth the trouble. It's platform specific and the compiler takes care of it.

Just use #if _endian(big) and therefore refactor as:

    func read<T>(type: T.Type = T.self) throws -> T
    where T: FixedWidthInteger
    {
        var value: T = 0
        cursor += try Swift.withUnsafeMutableBytes(of: &value) { try self.copy(range: self.cursor ..< self.cursor + T.extent, to: $0) }
        #if _endian(big)
            value = value.byteSwapped //because I chose little-endian for my file format but this platform happens to be big-endian
        #endif
        return value
    }

    @discardableResult
    func write<T>(value: T) throws -> Int
    where T: FixedWidthInteger
    {
        let endIndex = cursor + T.extent
        #if _endian(big)
            //parameters are immutable, so shadow the parameter with the swapped value
            let value = value.byteSwapped //because I chose little-endian for my file format but this platform happens to be big-endian
        #endif
        try Swift.withUnsafeBytes(of: value) { try self.replace(range: self.cursor ..< endIndex, with: $0) }
        cursor = endIndex
        return T.extent
    }

Karl · November 18, 2021, 1:38pm

Actually, I should perhaps make this clearer: endianness is a property of your data. You can choose any endianness you like, and the endianness of the computer is irrelevant.

From Rob Pike:

The byte order of the computer doesn't matter much at all except to compiler writers and the like, who fuss over allocation of bytes of memory mapped to register pieces. Chances are you're not a compiler writer, so the computer's byte order shouldn't matter to you one bit.

Notice the phrase "computer's byte order". What does matter is the byte order of a peripheral or encoded data stream, but--and this is the key point--the byte order of the computer doing the processing is irrelevant to the processing of the data itself. If the data stream encodes values with byte order B, then the algorithm to decode the value on computer with byte order C should be about B, not about the relationship between B and C.

...

The entire Plan 9 system ran, without architecture-dependent #ifdefs of any kind, on dozens of computers of different makes, models, and byte orders. I promise you, your computer's byte order doesn't matter even at the level of the operating system.

But it seems like you understood it anyway:

Yes

In practice, every ARM machine you see will also be little-endian. Explicitly writing withUnsafeBytes(of: value.littleEndian) is good practice just to pick something. It will be a no-op on basically every system, but that's just a nice extra - really, we're saying we want the data to be little-endian, regardless of what the machine uses.

When reading, use copyBytes to read the bytes into a correctly-aligned integer, then T(littleEndian: value) to swap if necessary. Again, a no-op on most machines, but what's really important is that we're saying the value in the data is little-endian.
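
As a rough sketch of that write/read pair, assuming a plain [UInt8] in place of the database's page storage: the helper names littleEndianBytes(of:) and decodeLittleEndian(_:from:) are made up for illustration, while littleEndian, init(littleEndian:) and copyBytes(from:) are standard library API.

```swift
// Hypothetical helper: the bytes of `value` in little-endian order.
func littleEndianBytes<T: FixedWidthInteger>(of value: T) -> [UInt8] {
    return withUnsafeBytes(of: value.littleEndian) { Array($0) }
}

// Hypothetical helper: rebuild a T from bytes stored in little-endian order.
func decodeLittleEndian<T: FixedWidthInteger>(_ type: T.Type = T.self, from bytes: [UInt8]) -> T? {
    guard bytes.count == MemoryLayout<T>.size else { return nil }
    var raw: T = 0
    // Copy into a correctly aligned T first ...
    withUnsafeMutableBytes(of: &raw) { $0.copyBytes(from: bytes) }
    // ... then declare that those bytes were little-endian data.
    return T(littleEndian: raw)
}

let pageNumber: UInt32 = 0x0102_0304
let stored = littleEndianBytes(of: pageNumber)              // [0x04, 0x03, 0x02, 0x01]
let loaded = decodeLittleEndian(UInt32.self, from: stored)  // Optional(0x01020304)
```

There is no #if in either direction: the code only states the byte order of the data, and the standard library decides whether a swap is needed on the current host.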

Pike's suggestion is to use (for LE data):

i = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24);

Which is nice in C, but annoying in Swift because we have generic integers and don't do implicit conversions. But either way is fine, and has the same benefit that the same code runs on every system.
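
For comparison, a generic Swift version of the shifting approach might look like the sketch below. The function name is hypothetical, and T(truncatingIfNeeded:) is the explicit conversion that C performs implicitly.

```swift
// Hypothetical helper: decode little-endian bytes using shifts only,
// never reinterpreting memory, so host byte order cannot matter.
func decodeLittleEndianByShifting<T: FixedWidthInteger>(_ type: T.Type = T.self, from bytes: [UInt8]) -> T? {
    guard bytes.count == MemoryLayout<T>.size else { return nil }
    var result: T = 0
    for (index, byte) in bytes.enumerated() {
        // Byte 0 is the least significant, byte 1 is shifted by 8 bits, and so on.
        result |= T(truncatingIfNeeded: byte) << (8 * index)
    }
    return result
}

let decoded = decodeLittleEndianByShifting(UInt32.self, from: [0xEF, 0xBE, 0xAD, 0xDE])
// decoded is Optional(0xDEADBEEF)
```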

Swift actually has a really good API for this because it's all about the endianness of the data, not the machine running the code.
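
Floating-point types have no littleEndian property of their own, but they expose their raw bits as a fixed-width integer through bitPattern, so the same API can be reused. A minimal sketch for Double follows; the two function names are hypothetical.

```swift
// Hypothetical helper: the IEEE 754 bits of `value`, stored in little-endian order.
func encodeLittleEndian(_ value: Double) -> [UInt8] {
    return withUnsafeBytes(of: value.bitPattern.littleEndian) { Array($0) }
}

// Hypothetical helper: rebuild a Double from 8 bytes stored in little-endian order.
func decodeLittleEndianDouble(from bytes: [UInt8]) -> Double? {
    guard bytes.count == MemoryLayout<UInt64>.size else { return nil }
    var raw: UInt64 = 0
    withUnsafeMutableBytes(of: &raw) { $0.copyBytes(from: bytes) }
    return Double(bitPattern: UInt64(littleEndian: raw))
}

let encoded = encodeLittleEndian(1.5)   // bit pattern 0x3FF8_0000_0000_0000, low byte first
let decoded = decodeLittleEndianDouble(from: encoded)
```

Float works the same way with its UInt32 bit pattern.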

Terje · November 18, 2021, 1:47pm

Yes, that should do it.
Thank you for your time.