[Pitch] Safe loading of integer values from `RawSpan`

I just opened a PR proposing new safe-loading API for reading concrete integer values from RawSpan and family: Safe loading of integer values for `RawSpan` by glessard · Pull Request #3065 · swiftlang/swift-evolution

Motivation

In SE-0447, we introduced RawSpan along with some unsafe functions to load values of arbitrary types. While it is safe to load any of the native integer types with those functions, the unsafe annotation introduces an element of doubt for users of the standard library. Furthermore, the endianness of a load cannot be controlled at the point of serialization, which causes further confusion. This proposal adds the ability to safely load integer values with ergonomic endianness control, without the doubt introduced by unsafe functions.

Proposed solution

RawSpan

RawSpan will gain a series of concretely typed load(as:) functions to obtain numeric values from the underlying memory, with no alignment requirement. These load(as:) functions can be safe because they return values from fully-inhabited types, meaning that these types have a valid value for every bit pattern of their underlying bytes.

The load(as:) functions will be bounds-checked, being a safe RawSpan API. For example,

extension RawSpan {
  func load(fromByteOffset: Int = 0, as: UInt8.Type) -> UInt8

  func load(
    fromByteOffset: Int = 0, as: UInt16.Type, endianness: Endianness? = nil
  ) -> UInt16
}

@frozen
public enum Endianness: Equatable, Hashable, Sendable {
  case big, little
}

The loadable types will be UInt8, Int8, UInt16, Int16, UInt32, Int32, UInt64, Int64, UInt, Int, Float32 (aka Float) and Float64 (aka Double). On platforms that support them, loading Float16, Float80, UInt128, and Int128 values will also be supported. These are not atomic operations.
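To make the bounds-checked, unaligned, endianness-aware behavior concrete, here is a minimal sketch that emulates it over a plain [UInt8] array rather than a RawSpan. The names EndiannessSketch and loadUInt16(from:atByteOffset:endianness:) are hypothetical and exist only for this illustration; the pitched API is the load(as:) family shown above, and its real implementation may well differ.

```swift
// Hypothetical stand-in for the pitched `Endianness` enum.
enum EndiannessSketch {
  case big, little
}

// Emulates a bounds-checked, unaligned load of a UInt16 from a byte array.
// A `nil` endianness means "use the platform's native byte order".
func loadUInt16(
  from bytes: [UInt8],
  atByteOffset offset: Int = 0,
  endianness: EndiannessSketch? = nil
) -> UInt16 {
  // Bounds check: the whole value must lie within the buffer.
  precondition(
    offset >= 0 && offset <= bytes.count - MemoryLayout<UInt16>.size,
    "load out of bounds"
  )
  // Read the bytes in memory order, with no alignment requirement.
  let raw = bytes.withUnsafeBytes {
    $0.loadUnaligned(fromByteOffset: offset, as: UInt16.self)
  }
  // Interpret the in-memory bytes according to the requested byte order.
  switch endianness {
  case .big?: return UInt16(bigEndian: raw)
  case .little?: return UInt16(littleEndian: raw)
  case nil: return raw
  }
}
```

Given the bytes [0x12, 0x34, 0x56], a big-endian load at offset 1 yields 0x3456 even though the offset is odd, because the load carries no alignment requirement.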

The concrete load(as:) functions will not have equivalents with unchecked byte offset. If that functionality is needed, the generic unsafeLoad(fromUncheckedByteOffset:as:) is already available.

The load(as:) functions will also be available for MutableRawSpan and OutputRawSpan.

MutableRawSpan and OutputRawSpan

MutableRawSpan will gain a series of concretely typed storeBytes() functions that accept an endianness parameter, while OutputRawSpan will have matching append() functions:

extension MutableRawSpan {
  mutating func storeBytes(
    of value: UInt16,
    toByteOffset offset: Int = 0,
    as type: UInt16.Type,
    endianness: Endianness
  )
}

extension OutputRawSpan {
  mutating func append(
    _ value: UInt16,
    as type: UInt16.Type,
    endianness: Endianness
  )
}

These functions do not have a default value for their endianness argument, because the existing generic MutableRawSpan.storeBytes(of:toByteOffset:as:) and OutputRawSpan.append(_:as:) functions already use the native endianness and so address that need.

These concrete implementations will support UInt16, Int16, UInt32, Int32, UInt64, Int64, UInt, Int, Float32 (aka Float) and Float64 (aka Double). On platforms that support them, Float16, Float80, UInt128, and Int128 values will also be supported. These are not atomic operations.

The concrete storeBytes(of:as:) functions will not have an equivalent with unchecked byte offset. If that functionality is needed, the generic storeBytes(of:toUncheckedByteOffset:as:) is already available.
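As a rough illustration of what an endianness-aware store does, the sketch below writes a UInt16 into a [UInt8] array byte by byte. EndiannessSketch and storeUInt16(_:into:atByteOffset:endianness:) are hypothetical names used only here; they are not the pitched storeBytes() or append() functions, and the sketch assumes nothing about how those would be implemented.

```swift
// Hypothetical stand-in for the pitched `Endianness` enum.
enum EndiannessSketch {
  case big, little
}

// Emulates a bounds-checked store of a UInt16 with an explicit byte order.
func storeUInt16(
  _ value: UInt16,
  into bytes: inout [UInt8],
  atByteOffset offset: Int = 0,
  endianness: EndiannessSketch
) {
  // Bounds check: both destination bytes must lie within the buffer.
  precondition(
    offset >= 0 && offset <= bytes.count - MemoryLayout<UInt16>.size,
    "store out of bounds"
  )
  // Split the value into its most and least significant bytes.
  let high = UInt8(truncatingIfNeeded: value >> 8)
  let low = UInt8(truncatingIfNeeded: value)
  switch endianness {
  case .big:
    // Big-endian: most significant byte at the lowest address.
    bytes[offset] = high
    bytes[offset + 1] = low
  case .little:
    // Little-endian: least significant byte at the lowest address.
    bytes[offset] = low
    bytes[offset + 1] = high
  }
}
```

Because the byte order is always spelled out at the call site, the resulting bytes are the same on every platform.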

Span

Span will gain a series of concrete initializers init(viewing: RawSpan) to allow viewing a range of untyped memory as a typed Span, when Span.Element is a numeric type. These conversions will check for alignment and bounds. For example,

extension Span {
  @_lifetime(borrow span)
  init(viewing bytes: borrowing RawSpan) where Element == UInt32
}

The supported element types will be UInt8, Int8, UInt16, Int16, UInt32, Int32, UInt64, Int64, UInt, Int, Float32 (aka Float) and Float64 (aka Double). On platforms that support them, initializing Span instances with Float16, Float80, UInt128, and Int128 elements will also be implemented.

The conversions from RawSpan to Span only support well-aligned views with the native endianness. The swift-binary-parsing package provides a more fully-featured ParserSpan type.
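The alignment and bounds checks mentioned above could look something like the following sketch, written against UnsafeRawBufferPointer since that is available on every toolchain. The helper canView(_:as:) is hypothetical, and the exact conditions are an assumption: here the base address must be aligned for the element type and the byte count must be a whole multiple of its stride. The real initializers may check different or additional conditions.

```swift
// Hypothetical helper: reports whether `buffer` could be viewed as a
// contiguous sequence of `T` values without misaligned or partial elements.
func canView<T>(_ buffer: UnsafeRawBufferPointer, as type: T.Type) -> Bool {
  // Assumption: an empty buffer with no base address is trivially viewable.
  guard let base = buffer.baseAddress else { return true }
  // The first element must start at an address aligned for `T`.
  let isAligned = Int(bitPattern: base) % MemoryLayout<T>.alignment == 0
  // The byte count must cover a whole number of elements.
  let isWholeCount = buffer.count % MemoryLayout<T>.stride == 0
  return isAligned && isWholeCount
}
```

A view that starts one byte into an aligned allocation fails the first condition for UInt32, and a 6-byte view fails the second.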

Please see the full proposal draft for the detailed design, alternatives and future directions.

Two comments on this—

  1. I think many peer languages call this just Endian rather than Endianness, which would also be supported by our naming guidelines since it'd be more succinct (and easier to say) without any loss of meaning.
    • The naming guidelines also tell us not to repeat the type in the argument label, so load(..., endianness: Endianness?) is not ideal; load(..., endianness: Endian) would be better. (And if we named the cases bigEndian and littleEndian as some other peer languages do, we could omit the label altogether.)
  2. Using Endianness? = nil as the convention to mean native endianness feels awkward. Since the enum is equatable, this would also mean that native endianness compares not-equal to both big and little, which doesn't seem right. Peer languages have a .native, which we could also have here as a static property.

Side note: We could make these additions generic over FixedWidthInteger & _ExpressibleByBuiltinIntegerLiteral, or some other similar internal-only constraint, no? Then it would be a matter of simply stating as policy that the standard library itself will never vend non-fully-inhabited conformers (which is doable).

I am not sure that we have ascertained the endianness options on RawSpan are the way we want to go:

  • There is no Span variant that lets you control endianness; it’s annoying that we would say “use Span if you’re cool with native endianness, and if you’re not, the blessed pattern is to use RawSpan and count your offsets by hand”
  • There is no way to control endianness for anything other than integers, which creates a weird cliff if you’re trying to load a struct (assuming that one day we do get to that fully inhabited protocol)
    • There are binary formats that use different endianness in different parts (or even for two integers next to one another), like Photoshop documents, so there is just no coherent way that specifying endianness for a struct can work in all cases
  • Integers already have init(bigEndian:) and init(littleEndian:) for endianness conversion, which are good ways to deal with the problem (floating point types don’t, but they should)
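For reference, the existing integer initializers mentioned in the last bullet can already be combined with a bit-pattern conversion to cover floating point. The sketch below reads a big-endian Float from a [UInt8] array; loadBigEndianFloat(from:atByteOffset:) is a hypothetical helper for illustration only, not an existing or pitched API.

```swift
// Hypothetical helper: reads a Float stored in big-endian byte order.
func loadBigEndianFloat(
  from bytes: [UInt8],
  atByteOffset offset: Int = 0
) -> Float {
  // Bounds check: all four bytes must lie within the buffer.
  precondition(
    offset >= 0 && offset <= bytes.count - MemoryLayout<UInt32>.size,
    "load out of bounds"
  )
  // Read the four bytes in memory order, with no alignment requirement.
  let raw = bytes.withUnsafeBytes {
    $0.loadUnaligned(fromByteOffset: offset, as: UInt32.self)
  }
  // Convert from big-endian with the existing integer initializer, then
  // reinterpret the resulting bit pattern as a Float.
  return Float(bitPattern: UInt32(bigEndian: raw))
}
```

The byte-order conversion happens on the integer bit pattern; Float itself has no init(bigEndian:), which is the gap noted above.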

I think this pushes RawSpan into parsing territory more than being the primitive for working with untyped memory, so I’m less hot for it.

When it comes to the core functionality: we do need safe load methods on RawSpan, and how to stage the functionality is not really my business. However, I think there’s a ton of value in a layout constraint or marker protocol that a type is fully inhabited. Loading things from a RawSpan is one super important use case of some kind of FullyInhabited constraint, but it’s not the only one. For instance, we should also have bitCast<From: BitwiseCopyable, To: FullyInhabited>. So, the main thing I’d like to check here is that if we think that FullyInhabited is going to happen one way or another, we’re not adding API surface to RawSpan that will make it unwieldy to prefer/transition to that later on. (I think it’s fine, but I wanted to bring it up in case somebody else sees something I’m not seeing.)
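To sketch what such a constraint might enable, the example below models FullyInhabited as an ordinary protocol and builds a checked bitCast on top of it. Everything here is hypothetical: FullyInhabitedSketch is a plain protocol standing in for whatever layout constraint or marker protocol might eventually exist, the conformances are declared by hand, and the size check is an assumption about how such a function would behave. BitwiseCopyable requires a Swift 6 toolchain.

```swift
// Hypothetical marker: conforming types promise that every bit pattern
// of their storage is a valid value.
protocol FullyInhabitedSketch {}
extension UInt32: FullyInhabitedSketch {}
extension Float: FullyInhabitedSketch {}

// Reinterprets the bits of `value` as a `To`. This can be offered as a safe
// operation only because `To` accepts every possible bit pattern.
func bitCast<From: BitwiseCopyable, To: FullyInhabitedSketch>(
  _ value: From,
  to type: To.Type
) -> To {
  // Assumption: source and destination must have exactly the same size.
  precondition(
    MemoryLayout<From>.size == MemoryLayout<To>.size,
    "bitCast requires types of equal size"
  )
  return unsafeBitCast(value, to: To.self)
}
```

With this in place, bitCast(Float(1.0), to: UInt32.self) produces the IEEE 754 bit pattern 0x3F800000 without the caller writing any unsafe code.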

If we include a .native case, then the Endianness type wouldn’t be usable by the binary-parser package, for example. It would be perfectly fine to overload a one-parameter function with another that has two, including a non-defaulted Endianness parameter. I feel fine with the comparison awkwardness, because we don’t want to encourage endianness-dependent code.

A different question is whether we want to have “native” as a default. It seems to me this question has already been answered by the existing store(_:as:) functions, which have no endianness control.

Naming: ByteOrder might be clearer and less jargony than re-using a joke from Jonathan Swift.

I have a problem with these (and the related .bigEndian and .littleEndian properties), in that they conflate types with their serialized representation. Endianness should be considered while serializing, which is where we store/load values to/from untyped buffers.

That seems fine; if endianness is a concern for anything other than integers, then it has become a parsing problem, and values should be reconstructed safely with initializers.

Span's goal is to represent the computer’s memory representation as directly as possible. To control endianness would be a transformation, intrinsically indirect. It would be perfectly fine to write a transforming wrapper that does something else, but it shouldn’t be Span that does it.

Parsing-adjacent use cases were always a major justification for RawSpan; parsing is mentioned in the first sentence introducing RawSpan in SE-0447!

Can you explain?

.native is only useful for parsing a stream that comes from your own process or another live process on your machine. When doing the kinds of tasks intended for the binary parsing package, .native would be an undesirable option to have to handle; they’d either trap on that branch or simply not use our type at all.

I'm not sure I entirely understand. As I said above, I'm thinking of .native as a static property which aliases either .big or .little depending on the platform; there would be nothing to 'handle' from the callee's side.
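A minimal sketch of that idea, using a hypothetical EndiannessSketch type in place of the pitched enum: the enum keeps exactly two cases, and native is a static property that resolves to one of them, so a callee switching over the value never sees a third option.

```swift
// Hypothetical stand-in for the pitched `Endianness` enum.
enum EndiannessSketch: Equatable, Hashable, Sendable {
  case big, little

  // Not a third case: this aliases whichever case matches the host platform.
  static var native: EndiannessSketch {
    // On a little-endian host, `littleEndian` leaves the value unchanged.
    UInt16(1).littleEndian == 1 ? .little : .big
  }
}
```

Unlike a nil default, native here compares equal to exactly one of big or little on any given platform.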


That's nice: I think there's good precedent here given CFByteOrder.

The cases would naturally be spelled ByteOrder.bigEndian and ByteOrder.littleEndian, and no argument labels would then be needed for the functions as the call sites will read something like: load(..., as: Int.self, .bigEndian).

And on that note, an optional value would be a self-documenting and idiomatic way to model in modern Swift what CF has as a separate case, CFByteOrderUnknown. But I'm queasy about nil representing an actual know(n|able) endianness.

I see. That might be okay. I’d personally prefer not to have an even more prominent way to conditionalize on endianness, but to be fair we can already ask whether 1.bigEndian == 1.

I think this is a good case for introducing an Endianness type, although I prefer ByteOrder to Endianness as its name.
I'd also want a .native (or .current or .host) property, though I can't confidently say it's necessary.


Context

An Endianness type including a .native property was once proposed as part of another proposal:

...and I personally thought it was worth discussing such a type separately:

One of the comments on that thread says:

*Span types are "good ByteBuffer" ones, aren't they?

However, on the other hand, some concern was raised:

_Querying_ native byte order is often (though not always) wrong, but using the native byte order can be correct (usually when you have to interact with other software that is not aware of byte order).

I had forgotten about @YOCKOW’s tiny pitch. I’ll add a link to that thread in the proposal.

Thank you for clarifying.
Then, as an API design, func load(..., endianness: Endianness = .native) is not so odd. Am I right?
I want to echo @xwu's comments:

I agree that nil (Optional<Endianness>.none) seems to correspond to CFByteOrderUnknown.

Why only integer types? Why not anything BitwiseCopyable? Are the loads aligned or unaligned? If aligned, can we also get unaligned equivalents?

See also:

I missed that future direction. :zipper_mouth_face:

I’ve updated the document with a number of clarifications, and a rename of the Endianness enum to ByteOrder, along with the necessary changes in prose.

My read is that we’re kind of trying to have it both ways:

  • There is no way to control endianness for anything other than integers

That seems fine; if endianness is a concern for anything other than integers, then it has become a parsing problem, and values should be reconstructed safely with initializers.

I think this pushes RawSpan into parsing territory more than being the primitive for working with untyped memory, so I’m less hot for it.

Parsing-adjacent use cases were always a major justification for RawSpan; parsing is mentioned in the first sentence introducing RawSpan in SE-0447!

I expect that more people will have Jonathan’s “why does this only work for integer types” reaction.