[Pitch] Safe loading of integer values from `RawSpan`

Thanks for working on this proposal. I have definitely needed these APIs when working with RawSpan.

Endianness

I agree with @fclout that endianness does not belong in this proposal. RawSpan is about access to raw memory. It is not a tool for serialization (swift-binary-parsing is better for that), and the load/store with endianness is only a partial solution anyway. I think we should remove Endianness/ByteOrder and the load/store operations that use it from the proposal.

Overloads vs. FullyInhabited

I do not think we should introduce a slew of concrete overloads. It makes the API unnecessarily overwhelming and has a detrimental effect on both usability and compile-time performance. We know we need FullyInhabited, so we should figure out its design and include it here.
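To make the contrast concrete, here is a rough sketch of what a single generic entry point could look like in place of per-type overloads. Everything in it is hypothetical: the FullyInhabited protocol below is an empty stand-in (its real design is exactly the open question), loadValue is an invented helper, and a [UInt8] array stands in for RawSpan. It assumes Swift 6 for BitwiseCopyable.

```swift
// Hypothetical marker protocol; the real FullyInhabited design is still undecided.
protocol FullyInhabited: BitwiseCopyable {}

extension UInt8: FullyInhabited {}
extension UInt16: FullyInhabited {}
extension UInt32: FullyInhabited {}
extension Float: FullyInhabited {}

// One generic, bounds-checked load instead of one overload per concrete type.
// `loadValue` is an invented name; a [UInt8] stands in for RawSpan.
func loadValue<T: FullyInhabited>(
  from bytes: [UInt8], atByteOffset offset: Int = 0, as type: T.Type
) -> T {
  precondition(offset >= 0 && MemoryLayout<T>.size <= bytes.count - offset,
               "load would read out of bounds")
  return bytes.withUnsafeBytes { buffer in
    // Unaligned load in native byte order.
    buffer.loadUnaligned(fromByteOffset: offset, as: T.self)
  }
}

let sampleBytes: [UInt8] = [0x01, 0x02, 0x03, 0x04, 0x00, 0x00, 0x80, 0x3F]
let firstByte = loadValue(from: sampleBytes, as: UInt8.self)
let nativeHalfWord = loadValue(from: sampleBytes, atByteOffset: 1, as: UInt16.self)
let nativeFloat = loadValue(from: sampleBytes, atByteOffset: 4, as: Float.self)
```

The call sites read the same for every conforming type, which is the usability argument for the protocol over a family of overloads.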

Subscript please?

Can we have a subscript for (Mutable)RawSpan to access the bytes? I've found myself adding

extension RawSpan {
  subscript(index: Int) -> UInt8 { get }
}

extension MutableRawSpan {
  subscript(index: Int) -> UInt8 { get set }
}

because sometimes you just need access to the bytes, and load(fromByteOffset: index, as: UInt8.self) is absolutely unwieldy.

Doug


Aren’t (de)serializers likely to be built directly atop RawSpan?

Let’s say you’re implementing a TCP/IP stack in Swift. As is well-known, integer fields in the TCP header are always sent over the network in big-endian order, regardless of any host’s endianness. The NIC receives a packet, makes it available over the PCIe bus, and signals an interrupt to the CPU to wake up the driver.

The driver creates a RawSpan around the physical memory region mapped to the NIC’s incoming packet buffer. As proposed, this is how the driver starts pulling values into struct TCPHeader:

struct TCPHeader {
  var sourcePort: UInt16
  var destPort: UInt16
  var seqNum: UInt32
  // etc.
}

func handleIncomingTCPPacket(packetData: borrowing RawSpan) -> TCPHeader {
  return TCPHeader(
    sourcePort: packetData.load(as: UInt16.self, endianness: .bigEndian),
    destPort: packetData.load(fromByteOffset: 2, as: UInt16.self, endianness: .bigEndian),
    seqNum: packetData.load(fromByteOffset: 4, as: UInt32.self, endianness: .bigEndian),
    // etc.
  )
}

If the suggestion is that handleIncomingTCPPacket should delegate to an intermediary deserialization library, how should that library implement decoding of the packet header into struct TCPHeader?

Pulling in a binary serialization library is overkill for this situation. The Linux kernel provides a set of operations on integers (I forget now if they’re macros or inline functions, but they’re trivial) for endian conversions. There’s probably no reason to combine the endian conversion with a load or store operation (if you do provide the latter for convenience, you still want byte swap operations on integers as the primitives). In Swift you could imagine an extension of FixedWidthInteger perhaps could provide such operations.
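For reference, FixedWidthInteger already carries the byte-swap primitives (byteSwapped, bigEndian, littleEndian, init(bigEndian:), init(littleEndian:)). A minimal sketch of Linux-style helpers layered on top of them follows; the names networkToHost and hostToNetwork are made up for illustration, loosely after ntohs/htons, and the native load is simulated with a byte copy rather than RawSpan.

```swift
extension FixedWidthInteger {
  /// Interprets `self` as a big-endian (network order) bit pattern and returns
  /// the value in host order. Hypothetical name, modeled after ntohs/ntohl.
  var networkToHost: Self { Self(bigEndian: self) }

  /// Converts `self` from host order to big-endian (network order).
  /// Hypothetical name, modeled after htons/htonl.
  var hostToNetwork: Self { self.bigEndian }
}

// Simulate a native-endian load of the two wire bytes 0x1F 0x90
// (port 8080 in network order) by copying them into a UInt16's storage.
let wireBytes: [UInt8] = [0x1F, 0x90]
var rawPort: UInt16 = 0
withUnsafeMutableBytes(of: &rawPort) { $0.copyBytes(from: wireBytes) }

let hostPort = rawPort.networkToHost       // 8080 regardless of host endianness
let backOnWire = hostPort.hostToNetwork    // same bits as rawPort
```

Keeping the conversion separate from the load means the load itself never needs to know about byte order.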

For this specific case a deserialization library would be overkill like @Slava_Pestov mentioned, but generally speaking RawSpan is not sufficient as a proper means of deserialization.

Specifically, many network protocols use variable-size structures and only tell you the size of the data at deserialization time, for example via length-prefix byte(s).

This means that in a lot of situations you don't know where the deserialization will end, but you still need to notify the rest of the code of up to what byte-index you've deserialized.

So essentially, at the very least you need a RawSpan+readerIndex type.

In reality, swift-nio's ByteBuffer has proved to be a great implementation where it not only stores a readerIndex, but also a writerIndex as the end index of the bytes.

This way you can have a "view" of the bytes and restrict the start and end indices to your choosing, which enables you to not have to allocate a new buffer to copy only the bytes you want to, and instead just pass the same ByteBuffer to different places but with 1-2 integers worth of modifications (readerIndex/writerIndex).

In rare cases you'd even want to reset the readerIndex back to somewhere that was already read. So just using RawSpan.extracting can also be too restrictive.
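A minimal sketch of the "bytes + readerIndex" idea, to show the shape rather than any real API: ByteReader and its methods are invented names, a [UInt8] stands in for RawSpan, and this is not how ByteBuffer is implemented.

```swift
// A hypothetical reader: a byte buffer plus a readerIndex that advances on every read.
struct ByteReader {
  let bytes: [UInt8]
  private(set) var readerIndex = 0

  init(_ bytes: [UInt8]) { self.bytes = bytes }

  var remaining: Int { bytes.count - readerIndex }

  /// Reads a big-endian integer and advances; returns nil (without advancing) if too short.
  mutating func readInteger<T: FixedWidthInteger>(as type: T.Type) -> T? {
    let size = MemoryLayout<T>.size
    guard remaining >= size else { return nil }
    var value: T = 0
    for byte in bytes[readerIndex ..< readerIndex + size] {
      value = (value << 8) | T(truncatingIfNeeded: byte)
    }
    readerIndex += size
    return value
  }

  /// Reads a one-byte length prefix followed by that many payload bytes.
  mutating func readLengthPrefixedBytes() -> ArraySlice<UInt8>? {
    let start = readerIndex
    guard let length = readInteger(as: UInt8.self), remaining >= Int(length) else {
      readerIndex = start   // rewind so a failed read consumes nothing
      return nil
    }
    let payload = bytes[readerIndex ..< readerIndex + Int(length)]
    readerIndex += Int(length)
    return payload
  }
}

var reader = ByteReader([0x00, 0x50, 0x03, 0x61, 0x62, 0x63, 0xFF])
let parsedPort = reader.readInteger(as: UInt16.self)   // big-endian 0x0050
let parsedName = reader.readLengthPrefixedBytes()      // the three bytes after the prefix
let resumeAt = reader.readerIndex                      // where the caller should continue
```

The caller never computes an offset: the variable-length field determines where the next read starts, and the rewind in readLengthPrefixedBytes shows why being able to move readerIndex backwards is useful.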

Right. To do deserialization, you want an abstraction that is a RawSpan + readerIndex, and reading moves forward. So you don't write, e.g.,

func handleIncomingTCPPacket(packetData: borrowing RawSpan) -> TCPHeader {
  return TCPHeader(
    sourcePort: packetData.load(as: UInt16.self, endianness: .bigEndian),
    destPort: packetData.load(fromByteOffset: 2, as: UInt16.self, endianness: .bigEndian),
    seqNum: packetData.load(fromByteOffset: 4, as: UInt32.self, endianness: .bigEndian),
    // etc.
  )
}

where you are hardcoding offsets, but instead you read forward like this:

func handleIncomingTCPPacket(packetData: inout ParserSpan) -> TCPHeader {
  return TCPHeader(
    sourcePort: packetData.read(as: UInt16.self, endianness: .bigEndian),
    destPort: packetData.read(as: UInt16.self, endianness: .bigEndian),
    seqNum: packetData.read(as: UInt32.self, endianness: .bigEndian),
    // etc.
  )
}

swift-binary-parsing does this pretty well.

If we want to improve handling of endianness for the integer types beyond init(bigEndian:), that's fine (and separable), but bespoke load/store APIs on RawSpan aren't the place for this functionality.

Doug


The question I’m getting at is whether endianness is an irreducible complexity of reading even a single multibyte value from memory. If I understand correctly, the implication of your position is that endianness is only a concern when you consider higher-level context such as the provenance of the memory being read from or its participation in a larger data structure. This higher-level context will come with its own abstraction, where endianness is an appropriate API.

I keep getting stuck on how this intermediate abstraction will actually implement the endianness conversion. The solution of using UInt16.init(bigEndian:) makes me uncomfortable because it implies temporarily modeling data in the correct type but with an invalid value. When interpreting raw memory, I would like the type system to help me distinguish between ready-to-use values and dangerous, intermediate values. For non-integer types like Char or Float, trying to force the byte-swapped interpretation into a temporary value of the same type might be lossy or even impossible.

My counterposition is that while it is reasonable to assume native endianness by default, it is a possible concern at any point where a value can be created from raw memory. Therefore any operation that creates a value from raw memory ought to admit an endianness argument.


Exactly this. If we had avoided the misadventures that are .init(bigEndian:), the .bigEndian property and their converses, there would be no way around providing byte order control in the integer-loading operations of the standard library. We shouldn’t add designs with holes that need to be papered over by old bad ideas; we should add the best designs we can.


It's worth recalling that these APIs are in the standard library only by accretion and were not approved by the core team as a "forever" addition when they reviewed the proposal for integer APIs: indeed, they specifically asked for the APIs to be reworked down the line:


Up to now, working on untyped memory using the native representation is the only thing that RawSpan can do. This is a suitable foundation to build byte parsing apparatus because the native interpretation of a byte is near-universal. Special-casing integers so that you can request to read them in byte-swapped order is a step that increases the scope of RawSpan.

I am slightly uncomfortable that the best one-liner I have to merge the old scope with the new one is “works with untyped data in non-native representations” because it is not a tight fit over the pitched API. I think people will be confused and either wonder why these APIs exist at all, or wonder why there aren’t more of them. In particular, the “why there aren’t more of them” part is an invitation to extend RawSpan with methods that load more complex types with some amount of divergence from the native representation. As I understand your position, that would become a parsing problem, which is misuse.

Secondarily, in the face of a byte-swapped integer, one sensible strategy is to build your integer by reading one byte at a time and shifting it into place, which is no different from what you would do for more complex encodings (like ULEB128). Another sensible strategy is to load a BigEndian<T: FixedWidthInteger> instead and give it a toNative() -> T method (this would be unsafe because safe loads are limited to built-in integer/floating-point types, but it is resolved in the full implementation of RawSpan). Another reasonable option is to use a binary parsing framework. The point is that loading a byte-swapped integer and flipping it is a choice, and that choice is tied to how much hand-crafted parsing we think should be encouraged on RawSpan.
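The first two strategies can be sketched roughly as follows. This is illustrative only: [UInt8] stands in for RawSpan, the function names are invented, and BigEndian/toNative() are the hypothetical names from the paragraph above rather than an existing type.

```swift
// Strategy 1: assemble the value one byte at a time. The same loop shape handles
// a fixed-width big-endian integer and a variable-length encoding such as ULEB128.
func decodeBigEndianUInt32(_ bytes: [UInt8], at offset: Int) -> UInt32? {
  guard offset >= 0, bytes.count - offset >= 4 else { return nil }
  var value: UInt32 = 0
  for byte in bytes[offset ..< offset + 4] {
    value = (value << 8) | UInt32(byte)
  }
  return value
}

func decodeULEB128(_ bytes: [UInt8], at offset: Int) -> (value: UInt64, length: Int)? {
  var value: UInt64 = 0
  var shift: UInt64 = 0
  var index = offset
  while index >= 0, index < bytes.count, shift < 64 {
    let byte = bytes[index]
    value |= UInt64(byte & 0x7F) << shift   // low 7 bits carry payload
    index += 1
    if byte & 0x80 == 0 { return (value, index - offset) }  // high bit clear: last byte
    shift += 7
  }
  // Ran out of bytes, or the encoding is longer than 10 bytes.
  // (Overflow inside the 10th byte is not detected in this sketch.)
  return nil
}

// Strategy 2: a wrapper whose storage holds the bits exactly as they sit in memory.
struct BigEndian<T: FixedWidthInteger> {
  var storage: T                        // big-endian bit pattern, not a usable value
  func toNative() -> T { T(bigEndian: storage) }
}

let encoded: [UInt8] = [0x12, 0x34, 0x56, 0x78, 0xE5, 0x8E, 0x26]
let word = decodeBigEndianUInt32(encoded, at: 0)
let leb = decodeULEB128(encoded, at: 4)
let wrapped = BigEndian<UInt16>(storage: UInt16(0x1F90).bigEndian)
let native = wrapped.toNative()
```

The wrapper keeps the not-yet-converted bits in a distinct type, so they cannot be mistaken for a ready-to-use integer.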

Are there examples of types that we can statically determine are FullyInhabited other than the integer types? (Swift has the benefit of modernity and can ignore weird integer storage, unlike C… right?)

Pointers aren't FullyInhabited, but optional pointers are. Would the compiler be able to determine that and conditionally conform Optional to FullyInhabited? But then what of other optional types that it might be able to see are FullyInhabited, e.g. a frozen enum with 255 cases?
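A small sketch of the layout fact behind the optional-pointer question. It only shows that nil reuses the null bit pattern; it does not show that every other bit pattern is a usable pointer, which is a separate matter.

```swift
// An optional pointer occupies the same storage as a non-optional one:
// the all-zero (null) bit pattern is the representation of `nil`.
let sameSize =
  MemoryLayout<UnsafeRawPointer?>.size == MemoryLayout<UnsafeRawPointer>.size

// For contrast, a type with no spare bit patterns needs extra storage for `nil`.
let intSize = MemoryLayout<Int>.size
let optionalIntSize = MemoryLayout<Int?>.size   // larger than intSize

// Reinterpreting zero bits as an optional pointer yields nil.
let fromZeroBits = unsafeBitCast(UInt(0), to: UnsafeRawPointer?.self)
```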

Maybe I'm navel-gazing? :upside_down_face:

Off the top of my head:

  • Floating-point types
  • Tuples of fully-inhabited types
  • Concrete SIMD vectors of fully-inhabited types
  • InlineArrays of fully-inhabited types

Don’t the last three of those depend on the alignment of the element type?

I don’t think we have total clarity about the status of padding bytes just yet, so “maybe”. But the compiler knows the alignment of a concrete type statically, so no matter what the answer is, it can be statically determined.


Are they though? The bitwise layout of this tuple is not fully inhabited:

typealias PaddingBytesGalore = (UInt64, UInt8, UInt64, UInt16)

Swift basically ignores the entire concept of signalling NaNs, and I'm not sure they ought to be considered valid inhabitants. (That's way off-topic though.)

Padding bytes in the PaddingBytesGalore example are simply ignored. There is no attempt to interpret them as values, and this is fine!
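The amount of padding in that tuple can be checked with MemoryLayout. A quick sketch; the exact numbers depend on the platform's alignment rules, and on a typical 64-bit platform the expectation is a size of 26 and a stride of 32 against 19 bytes of actual payload.

```swift
typealias PaddingBytesGalore = (UInt64, UInt8, UInt64, UInt16)

// Bytes that actually carry values: 8 + 1 + 8 + 2.
let payloadBytes = MemoryLayout<UInt64>.size + MemoryLayout<UInt8>.size +
  MemoryLayout<UInt64>.size + MemoryLayout<UInt16>.size

let tupleSize = MemoryLayout<PaddingBytesGalore>.size      // includes interior padding
let tupleStride = MemoryLayout<PaddingBytesGalore>.stride  // also includes tail padding
let interiorPadding = tupleSize - payloadBytes

print("payload \(payloadBytes), size \(tupleSize), stride \(tupleStride), padding \(interiorPadding)")
```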


I think the sense of "fully inhabited" we care about for memory safety is whether there are any bit patterns that do not correspond to valid values of the type, not whether there are multiple bit patterns that correspond to the same valid value of the type. (Floating-point types would also violate the latter unless you distinguish NaN payloads.) So that would allow for padding in the type.

Also, I don't believe any of our SIMD vectors allow elements that would require internal padding.


That's fair. Any future definition of a FullyInhabited marker protocol should make the distinction clear.

Conforming to FullyInhabited should require the conforming type to be safe. The whole point is to have a safe operation to load values from bytes, so storing the bytes directly to an unsafe type would defeat the purpose.

Every bit pattern is a valid Float16/32/64/80. (Float80 is a little bit subtle, but true. The others are not even subtle.)
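A short illustration for Float, using only Float(bitPattern:), which is non-failable: every UInt32 maps to some Float, and several distinct patterns map to NaN. The particular patterns are just examples.

```swift
// Zero, 1.0, +infinity, a quiet NaN, a signaling NaN, and all bits set.
let patterns: [UInt32] = [
  0x0000_0000, 0x3F80_0000, 0x7F80_0000, 0x7FC0_0000, 0x7FA0_0001, 0xFFFF_FFFF,
]

// The initializer cannot fail: there is no bit pattern it rejects.
let floats = patterns.map { Float(bitPattern: $0) }

// Many bit patterns share the "NaN" meaning; that is redundancy, not invalidity.
let nanCount = floats.filter { $0.isNaN }.count
```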


And also, pointer types are clearly not fully inhabited. An arbitrary sequence of 64 bits is not a valid pointer.
