Which way are the endian APIs supposed to go?

In a playground I tried:

var g: UInt16 = 0  // 0
g <<= 8  // 0
g |= UInt16(0x12)  // 18
g <<= 8  // 4608
g |= UInt16(0xAB)  // 4779
String(g, radix: 16, uppercase: true)  // "12AB"
String(UInt16(bigEndian: g), radix: 16, uppercase: true)  // "AB12"
String(UInt16(littleEndian: g), radix: 16, uppercase: true)  // "12AB"
String(g.bigEndian, radix: 16, uppercase: true)  // "AB12"
String(g.littleEndian, radix: 16, uppercase: true)  // "12AB"

I loaded the octets into g with the more significant one first, so I thought I should use the big-endian setting. But that's wrong; I needed the little-endian one. BTW, I'm using an (Intel) Mac, so my system is a little-endian one. Would my assumption of converting a manually big-endian construction with the big-endian initializer hold if I were on a big-endian system? (Maybe someone with a big-endian Linux system could check?)

In my main code, I ended up accepting my manual big-endian setup and byte-swapping when little endian is needed:

public mutating func next() -> Element? {
    var result: Element = 0
    for _ in 0 ..< MemoryLayout<Element>.size {
        guard let byte = base.next() else { return nil }

        result <<= 8
        result |= Element(byte)
    }
    switch endian {
    case .big:
        return result
    case .little:
        return result.byteSwapped
    }
}

I hope this works on all architectures.

Regardless of how you loaded them in, you want the endianness that matches your system. result <<= 8 does not just shift bytes to the left in memory; it shifts them toward the more significant end of the value. On a little-endian machine, that means the bits shift to the left within each byte and then overflow into the next byte to the right.

You can use this code to inspect what your bytes actually look like:

import Foundation  // String(format:) lives in Foundation

func inspect(_ x: inout UInt32) {
    // Print the bytes in the order they actually sit in memory.
    withUnsafeBytes(of: &x) { buffer in
        let bytes = buffer.map { String(format: "%02X", $0) }.joined()
        print("0x\(bytes)")
    }
}

var x: UInt32 = 0xAB
inspect(&x) // 0xAB000000

x <<= 1

inspect(&x) // 0x56010000

The only time you should need to call the bigEndian: or littleEndian: initializers is if you're using the UnsafePointer APIs to re-interpret existing bytes in memory as an integer. (For example, if you read 4 bytes out of a file, which are supposed to represent a big-endian Int32.)
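For example, something like this rough sketch (the fileBytes array stands in for 4 bytes you've already read from the file, UInt32 is just for illustration, and loadUnaligned(fromByteOffset:as:) needs Swift 5.7 or later):

let fileBytes: [UInt8] = [0xAB, 0xCD, 0xEF, 0x11]

let value: UInt32 = fileBytes.withUnsafeBytes { raw -> UInt32 in
    // Load the bytes exactly as they sit in memory...
    let rawValue = raw.loadUnaligned(as: UInt32.self)
    // ...then declare that those bytes were big-endian. On a little-endian
    // host this byte-swaps; on a big-endian host it's a no-op.
    return UInt32(bigEndian: rawValue)
}

String(value, radix: 16, uppercase: true)  // "ABCDEF11" on any architecture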

So, do I have next() right, or backwards?

var bigEndian says "assuming this value is host-endian, give me a big-endian representation of it". init(bigEndian:) says "assuming this argument is big-endian, get its value (as host-endian)". These actually perform the same operation at the CPU level (either "nothing" or "byte order swap" depending on what the host endianness is), but imply different things about which value is in which format.
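A little sketch of that distinction, using the values from the playground above (assuming a little-endian host, where both spellings compile down to the same byte swap):

let hostValue: UInt16 = 0x12AB

// "hostValue is host-endian; give me its big-endian representation."
let forTheWire = hostValue.bigEndian         // bit pattern 0xAB12 on a little-endian host

// "This argument is big-endian; give me its value as host-endian."
let decoded = UInt16(bigEndian: forTheWire)  // back to 0x12AB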

(I'll note, uselessly, that this is not my preferred design for endian-ness APIs. I would have preferred struct BigEndian<Value: FixedWidthInteger>.)


If the bytes are [0xAB, 0xCD, 0xEF, 0x11], you're logically constructing the following values on each pass through the loop:

  • 0x000000AB
  • 0x0000ABCD
  • 0x00ABCDEF
  • 0xABCDEF11

Note that this result is the same independent of the endianness of your CPU. Internally, on a big-endian system, 0xABCDEF11 is represented by the consecutive bytes [0xAB, 0xCD, 0xEF, 0x11]. On a little-endian system, it's represented by the consecutive bytes [0x11, 0xEF, 0xCD, 0xAB]. But regardless, as a logical UInt32 value, it's the number 0xABCDEF11 either way.
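A quick sketch of that claim, using the same shift-and-or accumulation as next() and withUnsafeBytes(of:) to dump the raw storage (the byte values are the ones from the list above):

let bytes: [UInt8] = [0xAB, 0xCD, 0xEF, 0x11]

var result: UInt32 = 0
for byte in bytes {
    result <<= 8
    result |= UInt32(byte)
}

String(result, radix: 16, uppercase: true)  // "ABCDEF11" on any architecture

withUnsafeBytes(of: result) { raw in
    print(raw.map { String($0, radix: 16, uppercase: true) })
    // ["11", "EF", "CD", "AB"] on a little-endian host
    // ["AB", "CD", "EF", "11"] on a big-endian host
}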

If you want the number 0xABCDEF11 for that sequence of bytes, then you want to always return result directly.

If you want the number 0x11EFCDAB for that sequence of bytes, then you want to always return result.byteSwapped.

If you want the sequence of bytes to be interpreted differently based on the endianness of the system you're running on, then your implementation of next() is acceptable. But it seems unlikely to me that you actually want that.

In this case, you are constructing a number byte-by-byte, so it's fair to be wondering about endianness concerns. But it's the endianness of the bytes you're receiving that matters, not the endianness of the host CPU. Whoever is providing those bytes to you probably intends them to represent a particular number, and serialized those bytes in a particular order to represent that number.

My favorite example for explaining endianness is a hypothetical Fraction struct. Imagine the following definitions:

struct Fraction1 {
    var numerator: Int
    var denominator: Int
}

struct Fraction2 {
    var denominator: Int
    var numerator: Int
}

The two structs are semantically equivalent; they both represent a fraction by storing a numerator and a denominator. The only difference is the order of the internal fields, but that order is irrelevant to anyone who is not inspecting the raw bytes underlying the struct.

In a similar way, little/big-endianness only affects the order of the raw bytes underlying various Int types. It does not affect their semantic value at all. 0xAB << 8 is semantically 0xAB00 regardless of the order of the underlying bytes.

// Two representations of 0xABCDEF11
struct MyBigEndianInt {
    var highOrderByte: UInt8       // 0xAB
    var mediumHighOrderByte: UInt8 // 0xCD
    var mediumLowOrderByte: UInt8  // 0xEF
    var lowOrderByte: UInt8        // 0x11
}

struct MyLittleEndianInt {
    var lowOrderByte: UInt8        // 0x11
    var mediumLowOrderByte: UInt8  // 0xEF
    var mediumHighOrderByte: UInt8 // 0xCD
    var highOrderByte: UInt8       // 0xAB
}
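
And to check the earlier claim that 0xAB << 8 is semantically 0xAB00 everywhere, here's a small sketch reusing the inspect(_:) helper from earlier in the thread:

var v: UInt32 = 0xAB << 8

String(v, radix: 16, uppercase: true)  // "AB00" on any architecture

inspect(&v)  // prints 0x00AB0000 on a little-endian host,
             // 0x0000AB00 on a big-endian one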