[Pitch 2] Safe loading of values from `RawSpan`

This is a second pitch thread for this feature, after it was significantly reworked. The first pitch thread was [Pitch] Safe loading of integer values from `RawSpan`.

The most significant change is the introduction of a new layout constraint, FullyInhabited. The safe APIs are now constrained by this protocol, or by this protocol in combination with FixedWidthInteger.

Safe loading API for RawSpan

Introduction

We propose the introduction of a set of safe APIs to load values of certain safe types from the memory represented by RawSpan instances, as well as safe conversions from RawSpan to Span for the same types.

Motivation

In SE-0447, we introduced RawSpan along with some unsafe functions to load values of arbitrary types. While it is safe to load any of the native integer types with those functions, the unsafe annotation introduces an element of doubt for users of the standard library. This proposal aims to provide clarity for safe uses of byte-loading operations.

Proposed solution

FullyInhabited

We propose a new layout constraint, FullyInhabited, to refine BitwiseCopyable. A FullyInhabited type is a safe type with a valid value for every bit pattern that can fit in its representation.

By conforming to FullyInhabited, a type declares that it has the following characteristics:

  • Its stored properties all themselves conform to FullyInhabited.
  • It is frozen if its containing module is resilient.
  • There are no semantic constraints on the values of its stored properties.

The standard library's FixedWidthInteger and BinaryFloatingPoint types will conform to FullyInhabited, as well as Never.

For example, a type representing two-dimensional Cartesian coordinates, such as struct Point { var x, y: Int } could conform to FullyInhabited. Its stored properties are Int, which is FullyInhabited. There are no semantic constraints between the x and y properties: any combination of Int values can represent a valid Point.

In contrast, Range<Int> could not conform to FullyInhabited, even though on the surface it has the same composition as Point. There is a semantic constraint between the two stored properties of Range: lowerBound must be less than or equal to upperBound. This makes it unable to conform to FullyInhabited.

Other examples of types that cannot conform to FullyInhabited are UnicodeScalar (some bit patterns are invalid), a hypothetical UTF8-encoded SmallString (the sequencing of the constituent bytes matters), and UnsafeRawPointer (it is marked with @unsafe).

In the initial release of FullyInhabited, the compiler will not validate conformances to it. Validation will be implemented in a later version of Swift.
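As a rough illustration of the rules above, the following sketch uses a hypothetical marker protocol, `FullyInhabitedSketch`, as a stand-in for the proposed constraint. The protocol name and the `ClosedInterval` type are assumptions made for this example only; the real constraint would live in the standard library and refine BitwiseCopyable.

```swift
// Hypothetical stand-in for the proposed constraint. The real protocol would
// refine BitwiseCopyable; that refinement is omitted here for simplicity.
protocol FullyInhabitedSketch {}
extension Int: FullyInhabitedSketch {}

// Every pair of Int values is a valid Point, so the conformance is sound.
struct Point: FullyInhabitedSketch {
  var x, y: Int
}

// Same stored properties as Point, but with a semantic constraint
// (lowerBound <= upperBound), so it must NOT adopt the marker.
struct ClosedInterval {
  let lowerBound, upperBound: Int

  init?(lowerBound: Int, upperBound: Int) {
    guard lowerBound <= upperBound else { return nil }
    self.lowerBound = lowerBound
    self.upperBound = upperBound
  }
}
```

Since the pitch says conformances are not validated initially, nothing in a sketch like this would stop `ClosedInterval` from adopting the marker; the soundness of the conformance rests on the author.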

RawSpan and MutableRawSpan

RawSpan and MutableRawSpan will have a new, generic load(as:) function that returns FullyInhabited values read from the underlying memory, with no pointer-alignment restriction. Because the returned values are FullyInhabited and the request is bounds-checked, this load(as:) function is safe.

extension RawSpan {
  func load<T: FullyInhabited>(
    fromByteOffset: Int = 0,
    as: T.Type = T.self
  ) -> T
}
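To make the intended semantics concrete, here is a minimal sketch of a bounds-checked, alignment-agnostic load. It is not the proposed implementation: a `[UInt8]` array plays the role of RawSpan, and `FullyInhabitedSketch` and `loadSketch` are hypothetical names introduced for this example.

```swift
// Hypothetical stand-in for the proposed FullyInhabited constraint.
protocol FullyInhabitedSketch {}
extension UInt32: FullyInhabitedSketch {}

// Bounds-checked load with no alignment requirement.
// `[UInt8]` plays the role of RawSpan in this sketch.
func loadSketch<T: FullyInhabitedSketch>(
  from bytes: [UInt8],
  fromByteOffset offset: Int = 0,
  as type: T.Type = T.self
) -> T {
  // Trap on out-of-bounds requests, as the pitched API is described to do.
  precondition(
    offset >= 0 && offset <= bytes.count
      && MemoryLayout<T>.size <= bytes.count - offset,
    "load out of bounds")
  return bytes.withUnsafeBytes {
    $0.loadUnaligned(fromByteOffset: offset, as: T.self)
  }
}

let bytes: [UInt8] = [0x01, 0x02, 0x03, 0x04, 0x05]
// Offset 1 is not 4-byte aligned; the load is still well-defined.
let value = loadSketch(from: bytes, fromByteOffset: 1, as: UInt32.self)
```

The result is in the platform's native byte order, which is why the pitch adds a separate overload with an explicit byte-order argument for integers.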

Additionally, a special version of load() will have an additional argument to control the byte order of the value being loaded, for values of types conforming to both FullyInhabited and FixedWidthInteger:

extension RawSpan {
  func load<T: FullyInhabited & FixedWidthInteger>(
    fromByteOffset: Int = 0,
    as: T.Type = T.self,
    _ byteOrder: ByteOrder
  ) -> T
}

@frozen
public enum ByteOrder: Equatable, Hashable, Sendable {
  case bigEndian, littleEndian
  
  static var native: Self { get }
}
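One plausible way to implement the byte-order overload is to load the value natively and then reinterpret it with the existing `init(bigEndian:)` and `init(littleEndian:)` initializers. The sketch below assumes that approach; `ByteOrderSketch` and `loadSketch` are hypothetical names, a `[UInt8]` array stands in for RawSpan, and the FullyInhabited constraint is left out.

```swift
// Hypothetical mirror of the pitched ByteOrder enum.
enum ByteOrderSketch {
  case bigEndian, littleEndian
}

// Load the raw bits natively, then interpret them in the requested byte order.
// Intended for the built-in integer types only.
func loadSketch<T: FixedWidthInteger>(
  from bytes: [UInt8],
  fromByteOffset offset: Int = 0,
  as type: T.Type = T.self,
  _ byteOrder: ByteOrderSketch
) -> T {
  precondition(
    offset >= 0 && offset <= bytes.count
      && MemoryLayout<T>.size <= bytes.count - offset,
    "load out of bounds")
  let raw = bytes.withUnsafeBytes {
    $0.loadUnaligned(fromByteOffset: offset, as: T.self)
  }
  switch byteOrder {
  case .bigEndian: return T(bigEndian: raw)
  case .littleEndian: return T(littleEndian: raw)
  }
}

let header: [UInt8] = [0x12, 0x34]
let big = loadSketch(from: header, as: UInt16.self, .bigEndian)       // 0x1234
let little = loadSketch(from: header, as: UInt16.self, .littleEndian) // 0x3412
```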

The list of standard library types to conform to FullyInhabited & FixedWidthInteger is UInt8, Int8, UInt16, Int16, UInt32, Int32, UInt64, Int64, UInt, Int, UInt128, and Int128.

The load() functions are not atomic operations.

The load(as:) functions will not have equivalents with unchecked byte offset. If that functionality is needed, the function unsafeLoad(fromUncheckedByteOffset:as:) is already available.

MutableRawSpan and OutputRawSpan

MutableRawSpan will gain a storeBytes() function that accepts a byte order parameter:

extension MutableRawSpan {
  mutating func storeBytes<T: FullyInhabited & FixedWidthInteger>(
    of value: T,
    toByteOffset offset: Int = 0,
    as type: T.Type,
    _ byteOrder: ByteOrder
  )
}

OutputRawSpan will have a matching append() function:

extension OutputRawSpan {
  mutating func append<T: FullyInhabited & FixedWidthInteger>(
    _ value: T,
    as type: T.Type,
    _ byteOrder: ByteOrder
  )
}

These functions do not need a default value for their byteOrder parameter, as the existing generic MutableRawSpan.storeBytes(of:toByteOffset:as:) and OutputRawSpan.append(_:as:) functions use the native byte order.
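The storing direction can be sketched the same way: convert the value to the requested byte order, then copy its bytes. As before, this is only an illustration under assumptions; `ByteOrderSketch` and `storeBytesSketch` are hypothetical names and an `inout [UInt8]` stands in for MutableRawSpan.

```swift
// Hypothetical mirror of the pitched ByteOrder enum.
enum ByteOrderSketch {
  case bigEndian, littleEndian
}

// Bounds-checked store of an integer in an explicit byte order.
func storeBytesSketch<T: FixedWidthInteger>(
  of value: T,
  into bytes: inout [UInt8],
  toByteOffset offset: Int = 0,
  _ byteOrder: ByteOrderSketch
) {
  precondition(
    offset >= 0 && offset <= bytes.count
      && MemoryLayout<T>.size <= bytes.count - offset,
    "store out of bounds")
  // Rearrange the value so its in-memory bytes follow the requested order.
  let ordered = byteOrder == .bigEndian ? value.bigEndian : value.littleEndian
  withUnsafeBytes(of: ordered) { source in
    for (index, byte) in source.enumerated() {
      bytes[offset + index] = byte
    }
  }
}

var buffer = [UInt8](repeating: 0, count: 4)
storeBytesSketch(of: UInt16(0x1234), into: &buffer, toByteOffset: 1, .bigEndian)
// buffer is now [0x00, 0x12, 0x34, 0x00]
```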

Span

Span will have a new initializer init(viewing: RawSpan) to allow viewing a range of untyped memory as a typed Span, when Span.Element conforms to FullyInhabited. These conversions will check for alignment and bounds.

extension Span where Element: FullyInhabited {
  @_lifetime(borrow bytes)
  init(viewing bytes: borrowing RawSpan)
}

The conversions from RawSpan to Span only support well-aligned views with the native byte order. The swift-binary-parsing package provides a more fully-featured ParserSpan type for use cases beyond reinterpreting memory in-place.
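The pitch does not spell out the exact checks here, so the following is only an assumed reading of "alignment and bounds": the base address must be aligned for the element type, and the byte count must cover a whole number of elements. `canViewSketch` is a hypothetical helper, and an UnsafeRawBufferPointer stands in for RawSpan.

```swift
// Returns true when `buffer` could be reinterpreted in place as elements of T.
// Assumed checks: base address aligned for T, byte count a multiple of T's stride.
func canViewSketch<T>(_ buffer: UnsafeRawBufferPointer, as type: T.Type) -> Bool {
  guard let base = buffer.baseAddress else { return true } // empty buffer
  let isAligned = UInt(bitPattern: base) % UInt(MemoryLayout<T>.alignment) == 0
  let isWholeCount = buffer.count % MemoryLayout<T>.stride == 0
  return isAligned && isWholeCount
}

let storage = UnsafeMutableRawBufferPointer.allocate(byteCount: 16, alignment: 8)
let whole = UnsafeRawBufferPointer(storage)

// Aligned start, 16 bytes: four whole UInt32 elements.
let aligned = canViewSketch(whole, as: UInt32.self)
// Start shifted by one byte: misaligned for UInt32.
let shifted = canViewSketch(UnsafeRawBufferPointer(rebasing: whole[1..<9]), as: UInt32.self)
// Aligned start, but 6 bytes is not a whole number of UInt32 elements.
let ragged = canViewSketch(UnsafeRawBufferPointer(rebasing: whole[0..<6]), as: UInt32.self)

storage.deallocate()
```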

Please see the full proposal draft for the detailed design, alternatives and future directions.

IIUC there'll be a problem specifying the correct byte order for multi-component types like the struct Point { var x, y: Int } mentioned above.

Consider having two sets of APIs: one set for integers/floats with a byte order parameter, another for other types without a byte order parameter.

What happens when loading or storing from an invalid offset? I guess a trap? †

By the same logic, what if working with Range<Int> was allowed and worked fine when the bounds play fair, and traps when lowerBound is greater than upperBound? (The actual checking/trapping could be implemented in some later compiler version as in proposal.)


† BTW, have we considered making those APIs throwing instead?

That is exactly what we are proposing. The APIs with the byte order parameter are constrained to FullyInhabited & FixedWidthInteger.

Yes, those are simply attempts to access memory out of bounds, and trap.

This would require every such type to implement its own validation in a new protocol. This can be implemented just as well as parsing: load one bound, then load the other, then construct a Range.
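A minimal sketch of that parsing approach, assuming two consecutive native-endian Int values: load each bound, check the invariant, and only then build the Range. `parseRangeSketch` and `nativeBytes` are hypothetical helpers, and `[UInt8]` stands in for RawSpan.

```swift
// Parse a Range<Int> from two native-endian Ints, validating the invariant
// instead of reinterpreting the bytes as a Range directly.
func parseRangeSketch(from bytes: [UInt8]) -> Range<Int>? {
  let size = MemoryLayout<Int>.size
  guard bytes.count >= 2 * size else { return nil }
  let (lower, upper) = bytes.withUnsafeBytes {
    ($0.loadUnaligned(fromByteOffset: 0, as: Int.self),
     $0.loadUnaligned(fromByteOffset: size, as: Int.self))
  }
  // The semantic constraint that keeps Range from being FullyInhabited.
  guard lower <= upper else { return nil }
  return lower..<upper
}

// Helper producing the native byte representation of an Int.
func nativeBytes(_ value: Int) -> [UInt8] {
  withUnsafeBytes(of: value) { Array($0) }
}

let valid = parseRangeSketch(from: nativeBytes(3) + nativeBytes(7))   // 3..<7
let invalid = parseRangeSketch(from: nativeBytes(7) + nativeBytes(3)) // nil
```

Returning an optional here is a choice made for the sketch; a parser could just as well throw or trap on invalid input.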

The established stdlib behaviour for out-of-bounds accesses is to trap. This makes the simple case simple. For the case where you need to check whether an index would trap, doing so requires no more ceremony than a safe non-trapping alternative would, since you would need to unwrap an optional, or catch an error, etc.

Is it possible for a Swift struct to safely implement FullyInhabited? AFAIK the layout, alignment, and padding are never guaranteed by the language.

As such, I think it could only be implemented for primitive types, and those imported from other languages which might provide those guarantees. So the proposal should probably forbid manual conformance to it (which might be relaxed in the future, e.g. by allowing @layout(c) on a struct declaration).

Alignment and padding are not relevant for FullyInhabited. What matters is using all the bit patterns for the bytes you do use. There is also a requirement to be @frozen, which means the representation is constant. But this is not a replacement for Codable.

The semantic requirement means that `FullyInhabited` can only be implemented for simple types. Most types require parsing for safety and/or correctness.

Yes

I was also thinking about opening this to a "wider audience" of user types like enums with associated values, optionals, etc... Consider a user struct with a bunch of int fields and ... a Bool.

Trying to load enums without parsing is a fun gateway to undefined behaviour, so enums will not be FullyInhabited any time soon, probably ever. Bool is also right out because of the way LLVM treats it.

Hmm, indeed. Merely marking my struct FullyInhabited would be cheating:

struct Workaround: FullyInhabited { // hohoho
    var bool: Bool // not FullyInhabited
}

Maybe it's achievable via a "negative marker" protocol? Multi-field types that consist of FullyInhabited would be treated FullyInhabited by default unless you opt-out and mark the type "not FullyInhabited" (somehow) – then it is not. (And types that contain non FullyInhabited components are automatically non FullyInhabited, no need to opt-out).

No floats though?

This makes sense, but doesn't match my intuition about what "fully inhabited" means...

Floating-point byte order does come up from time to time, but is considerably less common to need to wrangle with in cases where you shouldn’t just admit that you’re writing a parser and write an actual parser.

Even then, Float(bitPattern: span.load(.bigEndian)) is pretty straightforward.
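Spelled out with APIs that exist today (a `[UInt8]` array in place of a span, since the pitched load overload is not available yet), the same idea looks roughly like this:

```swift
// IEEE 754 binary32 encoding of 1.0, stored big-endian.
let bytes: [UInt8] = [0x3F, 0x80, 0x00, 0x00]

// Load the raw bits natively, fix up the byte order on the integer,
// then reinterpret the bit pattern as a Float.
let raw = bytes.withUnsafeBytes { $0.loadUnaligned(as: UInt32.self) }
let value = Float(bitPattern: UInt32(bigEndian: raw))
```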

I like the direction of this pitch; it's good to reduce the need for unsafe code.


I don't think Never should conform to FullyInhabited. In theory, it should always be okay to perform an unsafeBitCast if the destination type conforms to FullyInhabited and has the same size as the source type (and if both types are BitwiseCopyable). But unsafeBitCast(Void(), to: Never.self) is always undefined behavior because it constructs a value of type Never, despite the fact that Never and Void have the same size.


The proposal draft mentions a storeBytes method, bound to FullyInhabited & FixedWidthInteger. But it isn't always safe to store a value that conforms to FullyInhabited (even if it also conforms to FixedWidthInteger), because the value may contain padding bytes. That would require another protocol; let's call it HasNoPadding. The protocol would be orthogonal to FullyInhabited; for example, the Bool type would conform to HasNoPadding, even though it doesn't conform to FullyInhabited.

A subtlety with HasNoPadding is that even if a value has no padding bytes, an inline array of values could have padding bytes. For example, Void has no padding bytes, but [1 of Void] does (this is because the "size" and "stride" of Void are different). To solve the problem with full generality, we would need a protocol like ArrayHasNoPadding. Another solution is to just make it so HasNoPadding requires arrays to have no padding, so Void wouldn't conform to HasNoPadding.
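The size/stride distinction can be observed directly with MemoryLayout. In the sketch below, `Trailing` is a made-up type; its exact numbers reflect how current toolchains lay out such a struct, not a language guarantee.

```swift
// A made-up struct whose last field leaves room before the next element.
struct Trailing {
  var b: Int32
  var a: Int8
}

// Void occupies no bytes, yet consecutive array elements are 1 byte apart.
let voidLayout = (size: MemoryLayout<Void>.size, stride: MemoryLayout<Void>.stride)

// For Trailing, stride exceeds size: the difference is padding that only
// appears between elements of an array.
let trailingLayout = (size: MemoryLayout<Trailing>.size, stride: MemoryLayout<Trailing>.stride)

print(voidLayout, trailingLayout)
```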


While the names FullyInhabited and HasNoPadding are unambiguous, they're also quite technical. We tried to avoid confusing names for BitwiseCopyable, which originally could've been called TriviallyCopyable (which is the name of the concept in C++ jargon). Thinking about these protocols in terms of what people would do with them, maybe FullyInhabited could be called BitwiseLoadable, and HasNoPadding could be called BitwiseStorable. (And ArrayHasNoPadding could be called something like BitwiseArrayStorable).


There is some prior art for these ideas in Rust: the bytemuck, zerocopy, and safe_transmute libraries, and the work done by the safe transmute project group, primarily the Safer Transmute RFC. These efforts are trying to use the type system to allow users to safely do more and more kinds of bit manipulation. As a result, they use some very complex protocol hierarchies and other type system tricks. For example, the zerocopy library has a FromZeros protocol, for types whose values can be safely "zeroed out".

All of that makes me think even if supporting safe bit manipulation for integer types is simple enough, we should probably draw a line on how much we want to generalize it (at least for now), because going too far could easily introduce a lot of complexity. Personally, I think restricting it to integer and floating-point types, and inline arrays of integer types, as the pitch currently does, would be good enough for now.

I also think maybe it would be good to discourage users from relying on layout details that aren't guaranteed by the language. The benefit of restricting the FullyInhabited protocol to integer and floating-point types, and inline arrays of these types, is that those types all have a predictable memory layout.


Maybe the functions with a FixedWidthInteger requirement should drop the FullyInhabited requirement. Then those functions would need to manually do bit manipulations (possibly using the BinaryInteger.words property, maybe with an optimization for the built-in integer types) instead of just directly writing the bit pattern of the value. But the benefit is that those functions would be interoperable with custom types that conform to FixedWidthInteger, even if they don't conform to FullyInhabited. Maybe they could even generalize to BinaryInteger, relying on the per-value BinaryInteger.bitWidth property.

Good catch. Here you probably meant some future bitCast that accepts FullyInhabited type as a second parameter.

I don't see a need here. It's "safe" to store an Int and load that as a struct with UInt8, 3 bytes of padding, and UInt32 - the overall size matches, the UInt8 and UInt32 fields will be set appropriately, and the padding bytes will be filled with some values but that shouldn't be a problem (unless there is a rule that padding bytes must be zeroed). Vice versa is not a problem either, just the Int will get some bytes of padding in its bytewise representation.

The problem is that the padding bytes are uninitialized memory. Reading uninitialized memory is not memory-safe, because the uninitialized memory may have leftover data from previous allocations, including sensitive data which would cause security vulnerabilities if exposed. (It's also undefined behavior unless a special "freeze" operation is performed beforehand.) Another way to solve the problem is to set the padding bytes to zero before storing them, but I don't know if it's possible to reliably do that for all types.

This sounds like a concern. Is there no clause that relaxes this rule for bitwise copyable types?

The security vulnerabilities you are talking about are not related to "memory safety" per se and could be exposed by other (unsafe) means like memcpy or unsafeBitCast, so by virtue of those other means it is kind of not a problem (at least not a new problem).

All of my yes. I've been wanting FullyInhabited for years. The safe loads/stores from RawSpan is also obviously needed.

I am a little concerned about allowing FullyInhabited on types with internal (non-tail) padding. I know it's not allowed in any of the Rust libraries with an equivalent Trait, but I can't remember if that was because LLVM poisons the padding bits, or if it just wasn't sound enough for the Rustaceans.

Hmm. I appreciate the guidance of the FixedWidthInteger constraint, but I have a lot of types which are just structs wrapping UInt64 values. I don’t want clients working with them as integers (so I’d like to avoid conforming the wrapper type), but I can see situations where loading with byte order would be needed. (They’re rare enough that I’m not sure I’d change the load API constraints, though.)

Could you outline some code illustrating the problem?

This:

struct S {
    var a: Int8 = 1
    /* padding */
    var b: Int32 = 2
}

span.storeBytes(of: S(), as: S.self)
span.load(as: Int64.self)

looks safe, i.e. memory for S (including padding bytes) was initialised.

Could it be the case that the only way to get yourself into "uninitialised memory" territory is by using unsafe APIs – but then all bets are off anyway... the discussed load / store are only safe provided their arguments are safe.

The memory for the S was initialized, but the compiler may have done so by storing an Int8 to offset 0 and an Int32 to offset 4, rather than storing 8 bytes to offset zero. In that case, bytes 1, 2, and 3 still contain whatever was there before the storeBytes (e.g. three bytes of someone’s password that was previously stored to that address in a String or [UInt8] or whatever that no longer exists).
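The offsets mentioned here can be checked with MemoryLayout. The numbers below are what current toolchains produce for this struct, not something the language guarantees:

```swift
struct S {
  var a: Int8 = 1
  /* padding */
  var b: Int32 = 2
}

// Where the compiler places `b` relative to the start of the struct.
let offsetOfB = MemoryLayout<S>.offset(of: \S.b)

// Bytes inside S that belong to neither field. A field-by-field
// initialization never writes to them.
let paddingBytes = MemoryLayout<S>.size
  - MemoryLayout<Int8>.size
  - MemoryLayout<Int32>.size
```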

Allowing reads from uninitialized bytes is a different flavor of memory safety vulnerability than wild pointer accesses or buffer overruns or use after free, but it can have significant consequences. Aside from the case I mentioned above, leaking (partial) pointers to known code addresses is often a component of ASLR bypass, and loads from uninitialized memory produce an undef, which will generally result in UB when used in further computations.

There are a variety of options for what to do about this. As @ellie20 has mentioned, the Rust and LLVM communities have had ongoing discussions about these issues. The simplest and most conservative option is to simply restrict the FullyInhabited constraint to require that there are no padding bytes. There’s a whole spectrum of other options with various degrees of safety associated with them. Freezing the loaded value makes it so that it is no longer undef but rather a fixed, unspecified value; this may still violate some program invariants, however, because consecutive loads of the same byte that are separately frozen can end up with distinct observable values.

I would still be able to read those leaked bytes unsafely with withUnsafeBytes / unsafeBitCast, etc., no?

If you have an enum with exactly 256 cases (or exactly 65536, etc), and none of them have any associated values, would it be possible to do then?

I don't think this is feasible, as (at least as far as I'm aware) there's no requirement that a fixed-width integer stores its words in-line. I could see e.g. Int1024 that's just a wrapper around an array of integers.