[Pitch 2] Safe loading of values from `RawSpan`

But how can you cover the padding constraints consistently on different ABIs?

Could we have marker protocols that influence how the compiler generates padding?

// bike-shed names ahead
protocol HasNoPadding {} // aka packed
protocol HasPaddingInitialized {}

struct S: HasNoPadding {
    var a: UInt8
    // no padding here
    var b: UInt16
}

struct S2: HasPaddingInitialized {
    var a: UInt8
    // as if var _: UInt8 = 0 was here
    var b: UInt16
}

HasNoPadding could imply HasPaddingInitialized (all padding bytes are zeroed; there are just none of them), and vice versa: HasPaddingInitialized could imply HasNoPadding (HasPaddingInitialized is equivalent to manually adding fields, so there are no padding bytes left).

HasNoPadding works like packed in other languages: it could have a significant impact on performance due to the lack of natural alignment (except in cases where natural field alignment is undisturbed). OTOH, the performance impact of HasPaddingInitialized is not so significant; it is the same as if I had added the padding bytes manually. The advantage over a manual implementation is that the compiler knows the platform layout better than I do, so it will make a more correct decision about where the padding is on a given platform and fill it in.
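To make the padding in question visible, here is a minimal sketch using only today's MemoryLayout. The type names Natural and ManuallyPadded are made up for illustration, and the marker protocols pitched above do not exist; the second struct just spells out by hand what HasPaddingInitialized would presumably do.

```swift
// Natural layout: the compiler inserts one padding byte after `a`
// so that `b` lands on a 2-byte boundary.
struct Natural {
    var a: UInt8
    var b: UInt16
}

// Hand-written stand-in for the pitched HasPaddingInitialized:
// the padding byte becomes a real, zero-initialized field.
struct ManuallyPadded {
    var a: UInt8
    var pad: UInt8 = 0
    var b: UInt16
}

let fieldBytes = MemoryLayout<UInt8>.size + MemoryLayout<UInt16>.size
let naturalPadding = MemoryLayout<Natural>.size - fieldBytes
let offsetOfB = MemoryLayout<Natural>.offset(of: \.b)

print(naturalPadding)                     // 1
print(offsetOfB as Any)                   // Optional(2)
print(MemoryLayout<ManuallyPadded>.size)  // 4
```

Both structs occupy the same number of bytes; the difference is only whether the byte at offset 1 has a defined value.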

FullyInhabited could either imply or require HasPaddingInitialized, and it would be possible for the padding bytes to take arbitrary bit patterns (when used with unsafeBitCast or when coercing from a wrong type via RawSpan). The presence of explicit (though inaccessible) padding bytes is merely to get rid of the undef / UB situation.

Standard types like Int, String, Double could automatically conform to both HasPaddingInitialized and HasNoPadding. User types could specify conformances explicitly for guaranteed padding behaviour.

The most obvious cases where this would come up are Int and CGFloat, since they don’t have the same alignment on all supported platforms. A type that’s fully inhabited in a 32-bit build might not be fully inhabited in a 64-bit build, though the other way around would always be fine. I think this is fine because it’s a problem that you will only run into as you add support for new platforms.
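A quick sketch of the platform dependence, assuming a made-up struct named Mixed: the amount of interior padding depends on how wide Int is for the build target.

```swift
// Hypothetical example type: an Int32 followed by a pointer-width Int.
struct Mixed {
    var tag: Int32
    var count: Int   // 4 bytes on 32-bit targets, 8 bytes on 64-bit targets
}

let fieldBytes = MemoryLayout<Int32>.size + MemoryLayout<Int>.size
let paddingBytes = MemoryLayout<Mixed>.size - fieldBytes

// When Int is 4 bytes wide there is no gap after `tag`;
// when Int is 8 bytes wide, `count` must be pushed to offset 8.
print("Int alignment: \(MemoryLayout<Int>.alignment), padding bytes: \(paddingBytes)")
```

So the same source declaration has no padding in a 32-bit build and four padding bytes in a 64-bit build.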

Separately, I think this reinforces that RawSpan is only a good idea for data that’s in the native representation.

If you really want to force it:

func undefined() -> Int {
  withUnsafeTemporaryAllocation(of: Int.self, capacity: 1) { $0[0] }
}

Passing this function to the compiler with -emit-ir produces:

; Function Attrs: mustprogress nofree norecurse nosync nounwind sspreq willreturn memory(none)
define hidden swiftcc i64 @"$s3tmp9undefinedSiyF"() local_unnamed_addr #1 {
entry:
  ret i64 undef
}

UB follows, of course, so don't do this in production.

I think there actually should be a general-purpose feature for parsing an integer from bytes, including choice of byte order and support for arbitrary BinaryInteger types. It's a common enough operation that it should be supported in the standard library, even if advanced deserialization isn't.

We can look at String for comparison. The String type doesn't have facilities for advanced parsing, but we still have the ability to parse an integer from a string using the FixedWidthInteger.init?(_: String) initializer. There's even a way to specify an arbitrary radix.
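For reference, the existing string-parsing initializers look like this in use; both are failable and return nil for input that does not fit the type.

```swift
let fromDecimal = Int("42")
let fromHex = UInt8("ff", radix: 16)
let outOfRange = UInt8("256")   // does not fit in 8 bits

print(fromDecimal as Any)  // Optional(42)
print(fromHex as Any)      // Optional(255)
print(outOfRange as Any)   // nil
```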

We can also look at Python for prior art. In Python, advanced deserialization is delegated to the struct module in the standard library. But parsing an integer from bytes is common enough that the built-in int type has a from_bytes class method.
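As a rough sketch of what such a general-purpose operation could look like in Swift: the ByteOrder enum and the parseInteger function below are hypothetical names invented for this example, not standard library API. The function builds the value one byte at a time, so it needs no unsafe code and no alignment.

```swift
// Hypothetical names; nothing here exists in the standard library.
enum ByteOrder { case bigEndian, littleEndian }

func parseInteger<T: FixedWidthInteger>(
    _ type: T.Type, from bytes: [UInt8], byteOrder: ByteOrder
) -> T? {
    // Require exactly as many bytes as the integer type occupies.
    guard bytes.count == MemoryLayout<T>.size else { return nil }
    // Normalize to most-significant-byte-first order.
    let ordered = byteOrder == .bigEndian ? bytes : Array(bytes.reversed())
    var result: T = 0
    for byte in ordered {
        // Shift in one byte at a time; shifting never traps on overflow.
        result = (result << 8) | T(truncatingIfNeeded: byte)
    }
    return result
}

let big = parseInteger(UInt32.self, from: [0x12, 0x34, 0x56, 0x78], byteOrder: .bigEndian)
let little = parseInteger(UInt32.self, from: [0x12, 0x34, 0x56, 0x78], byteOrder: .littleEndian)
let signed = parseInteger(Int16.self, from: [0xFF, 0xFE], byteOrder: .bigEndian)
let wrongLength = parseInteger(UInt16.self, from: [0x01], byteOrder: .bigEndian)
```

Here big is 0x12345678, little is 0x78563412, signed is -2, and wrongLength is nil.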

Given that floating-point numbers have a standardized binary representation, I think it would also be useful to support general-purpose deserialization for those as well, including choice of byte order and support for arbitrary BinaryFloatingPoint types, including Float80.
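The floating-point case can be sketched the same way for Float, going through the existing Float(bitPattern:) initializer. The parseFloat function is a hypothetical helper; generalizing it to arbitrary BinaryFloatingPoint types (and Float80 in particular) is not attempted here.

```swift
// Hypothetical helper: decode an IEEE 754 binary32 value from 4 bytes.
func parseFloat(from bytes: [UInt8], bigEndian: Bool) -> Float? {
    guard bytes.count == MemoryLayout<UInt32>.size else { return nil }
    // Normalize to most-significant-byte-first order.
    let ordered = bigEndian ? bytes : Array(bytes.reversed())
    var bits: UInt32 = 0
    for byte in ordered {
        bits = (bits << 8) | UInt32(byte)
    }
    return Float(bitPattern: bits)
}

let one = parseFloat(from: [0x3F, 0x80, 0x00, 0x00], bigEndian: true)
let oneLE = parseFloat(from: [0x00, 0x00, 0x80, 0x3F], bigEndian: false)
```

Both calls produce 1.0, since 0x3F800000 is the binary32 encoding of 1.0.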

I sympathize with the argument that RawSpan should be limited to manipulating raw memory, not deserialization, especially because it's ambiguous at present whether a RawSpan should be guaranteed to point to fully-initialized memory (if so, it would currently be unsound). I think it's okay if RawSpan ends up being limited to lower-level unsafe operations over memory that may be partially uninitialized. But abstractions over fully-initialized binary data, with safe operations like the ones proposed, would still be useful even if RawSpan doesn't end up filling that role. For prior art, we could look at Python's bytes, bytearray, and memoryview types, and JavaScript's ArrayBuffer, TypedArray, and DataView types.

well, strictly speaking, nothing about the current behavior of RawSpan or MutableRawSpan is unsound. MutableRawSpan lets you safely write uninitialized memory, but because you can only read that memory with @unsafe, there's no way to violate memory safety in safe code with just a MutableRawSpan.

having said that, any API that safely provides a MutableRawSpan is completely useless, as the API may never read from it again (@unsafe or no). the user may have (safely!) written uninitialized padding bytes to it using storeBytes(of:toByteOffset:as:).

[Mutable]RawSpan are significantly better types if they're required to be fully initialized. a performant and safe way to process (some types of) raw memory would be extremely useful, and having zerocopy-like functionality in the standard library would be fantastic!

there are multiple dimensions of safety, so an @unsafe [Mutable]RawSpan would still be safer than an Unsafe[Mutable]RawPointer, but there's literally no downside to requiring them to be fully initialized and safe. a safe [Mutable]RawSpan would still have its unsafe load/store methods, and could still be left partially uninitialized over @unsafe API boundaries, so it wouldn't lose any functionality.

I think it's a tradeoff. There are two major use cases: working with raw memory (which generally includes padding for most types), and working with fully-initialized binary data. If RawSpan guarantees that it points to fully-initialized memory, then it would be suitable for the latter case, but unsuitable for the former case.

A benefit of RawSpan losing its initialization guarantee is that the standard library APIs wouldn't have to change. But a downside is that third-party APIs that depend on full initialization would have to change instead. If RawSpan keeps its initialization guarantee, then at the very least, some standard library APIs would have to be marked @unsafe; but personally, I think that would be inadequate, because it would only have an effect if strict memory safety checking is turned on.

Personally, I think another downside of RawSpan keeping its initialization guarantee is that its name evokes raw memory access, so it's more fitting for a type intended to work with raw memory than a type intended to work with fully-initialized binary data. For the latter kind of type, I think a better name would be ByteSpan.

I'm also thinking that it would be nice for Swift to have comparable support for working with binary data as it currently does with text. Python has both str and bytes/bytearray, and comparable support for both in the standard library. A naming convention centered on the word "byte" would easily extend to multiple types:

  • A ByteArray type, analogous to String
  • A ByteSlice type, analogous to Substring
  • A ByteSpan type, analogous to UTF8Span

I think it would be error prone for a type to have that kind of "transiently unsound" state (that is, a state where the safety invariants don't hold, so safe operations can cause undefined behavior), even though it's technically sound for such a state to exist and be reachable through unsafe code. It breaks the normal expectation that a value is guaranteed to have the safety invariants of its type. I can't think of a type that currently publicly exposes a "transiently unsound" state in Swift.

There are "transiently unsound" states in Rust, such as the Vec type having the set_len method, but those are generally only supposed to be used within short sections of code. Rust has features and practices that are intended to avoid putting values in a "transiently unsound" state in the long term. For example, some APIs use references to MaybeUninit, even though (to my knowledge) it technically isn't undefined behavior for a reference to an invalid value to exist.

i guess i just can't think of any cases where someone would want to deal with uninitialized bytes in a [Mutable]RawSpan for very long. most of the things i want to use this type for fall under "working with fully-initialized binary data". anything involving possibly-uninitialized memory would only be for those short sections of code where the 'transient unsoundness' is manageable.

to me, this seems like the niche the Span family of types is aiming for. i'd be curious to know if lots of people want a more 'raw memory access'-style type though.

To close the loop: as of last week, there have been corrections for some standard library RawSpan APIs to be marked unsafe: 1, 2. These changes fit squarely within the approved proposal for auditing and marking stdlib APIs as @unsafe.

You do raise a good point that, typically, unsafe APIs on types not themselves called Unsafe (or something related like Unmanaged, etc.) are expected to have some indication in their name rather than just an @unsafe annotation, which only affects people who use the opt-in strict memory safety checking mode.

As renaming would be an API change, though, that would require a proposal which is separable from this one.

I have revised the proposal (at Swift Evolution #3065), adapting it to the two-protocol solution I outlined above.

We now have the protocols a) ConvertibleToRawBytes, which is largely automatable in the vein of BitwiseCopyable (though it will not be initially), and b) ConvertibleFromRawBytes, which embodies the prohibition on semantic constraints originally attached to FullyInhabited.
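To illustrate the shape of the two-protocol split, here is a toy sketch. SketchToRawBytes, SketchFromRawBytes and sketchBitCast are stand-in names invented for this example; they are not the proposal's declarations, and the conformances below are chosen only to show the asymmetry.

```swift
// Stand-in marker protocols; assumptions for illustration only.
protocol SketchToRawBytes {}     // no padding: every byte is initialized
protocol SketchFromRawBytes {}   // every bit pattern is a valid value

extension UInt8: SketchToRawBytes, SketchFromRawBytes {}
extension UInt32: SketchToRawBytes, SketchFromRawBytes {}
extension Float: SketchToRawBytes, SketchFromRawBytes {}
// Bool has no padding, but only two of its 256 byte patterns are valid,
// so in this sketch it converts to raw bytes but not from them.
extension Bool: SketchToRawBytes {}

func sketchBitCast<From: SketchToRawBytes, To: SketchFromRawBytes>(
    _ value: From, to type: To.Type
) -> To {
    precondition(MemoryLayout<From>.size == MemoryLayout<To>.size)
    return unsafeBitCast(value, to: To.self)
}

let floatBits = sketchBitCast(Float(1.0), to: UInt32.self)
let boolByte = sketchBitCast(true, to: UInt8.self)
print(String(floatBits, radix: 16), boolByte)  // 3f800000 1
```

With these conformances, sketchBitCast(UInt8(2), to: Bool.self) does not compile, which is the kind of mistake the split is meant to rule out.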

The API additions have been expanded accordingly, including additions to OutputSpan and OutputRawSpan, as well as the list of conformances in the standard library.

This version looks great to me. Finally, a type-safe way to initialize values from random bit-streams that I don't need to validate myself.

Regarding ConvertibleToRawBytes:

A type can conform to ConvertibleToRawBytes if its memory representation includes no padding. In other words, the sum of the size of its stored properties is equal to its size.

Nit: under MemoryLayout terminology, should this be “the sum of the size of its stored properties is equal to its stride”?

More important: the current pitch should address whether this allows bit padding. Bool does not conform to ConvertibleToRawBytes, but Optional (conditionally) does. I would expect that they both do or don’t.


Regarding ConvertibleFromRawBytes:

The compiler cannot enforce the semantic requirements of ConvertibleFromRawBytes, therefore types outside the standard library can only conform with an unsafe conformance.

That’s a broad statement. The compiler can enforce at least that all constituents of a type with ConvertibleFromRawBytes are ConvertibleFromRawBytes themselves. Does it do that? If it doesn’t, is it something we would consider doing at some point?

And although the compiler can’t prove that there are no semantic constraints on the values of stored properties, it can get in your way by requiring all stored properties to be var and as visible as the type itself (ie, public var in public types), and ensuring that there is always a non-overridable synthesized memberwise initializer that is also as visible as the type itself. This reflects that the program has no meaningful control over loading an instance from raw bytes, and as a sleight of hand, you can say “loading from raw bytes is the same as calling the memberwise initializer with all arguments loaded from raw bytes”. Is this considered? There’s a bit on ConvertibleToRawBytes in alternatives considered, but not on ConvertibleFromRawBytes.


I believe that under this proposal, types imported from C (except basic types with equivalent Swift types) would never be Convertible{From,To}RawBytes because they are not part of the standard library and you can only declare conformance in the module that declares them. Can you clarify if this is the case? If yes, should there be a mention of it in future directions?


What are we trying to enable by allowing ConvertibleFromRawBytes conformances but not ConvertibleToRawBytes conformances? I guess that lets you safely bitCast a built-in type to small project-owned structs, like UInt128 to an eventual MyUUID?


RawSpan and MutableRawSpan will have a new, generic load(as:) function that returns ConvertibleFromRawBytes values read from the underlying memory, with no pointer-alignment restriction. Because the returned values are ConvertibleFromRawBytes and the request is bounds-checked, this load(as:) function is safe.

Given that ConvertibleFromRawBytes accepts padding, is the bounds check against the size or the stride? I am almost certain it’s the stride, but I think it’s worth clarifying.


The load(as:) functions will not have equivalents with unchecked byte offset. If that functionality is needed, the unsafeLoad(fromUncheckedByteOffset:as:) function is already available.

Is there an unsafeLoad(fromByteOffset:as:) to load a non-ConvertibleFromRawBytes type from a checked offset?


Span will have a new initializer init(viewing: RawSpan) to allow viewing a range of untyped memory as a typed Span, when Span.Element conforms to ConvertibleFromRawBytes. These conversions will check for alignment and bounds.

The behavior of the check for alignment is self-evident to me, but not the check for bounds. Does it trap if the RawSpan’s length is not a multiple of the stride of Span.Element, or does it truncate to a multiple of it? I don’t have a strong preference, but I think it should be explicitly documented.
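The two candidate behaviors differ only in what happens to a nonzero remainder. A small arithmetic sketch, where splitByteCount is a hypothetical helper and not part of the proposal:

```swift
// Hypothetical helper: how many whole elements fit, and how many bytes are left over.
func splitByteCount(_ byteCount: Int, stride: Int) -> (elements: Int, leftoverBytes: Int) {
    let (quotient, remainder) = byteCount.quotientAndRemainder(dividingBy: stride)
    return (quotient, remainder)
}

// 10 bytes viewed as UInt32 (stride 4): two whole elements, two bytes left over.
let split = splitByteCount(10, stride: MemoryLayout<UInt32>.stride)
print(split.elements, split.leftoverBytes)  // 2 2
```

A trapping initializer would reject this input because leftoverBytes is nonzero; a truncating one would produce a two-element span and silently ignore the last two bytes.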


The existing bytes and mutableBytes accessors will have safe overloads for when Element conforms to ConvertibleToRawBytes.

I believe you can’t overload on @unsafe. You might need to use a different name.


Tuples composed of ConvertibleToRawBytes types should themselves be ConvertibleToRawBytes. The same applies to ConvertibleFromRawBytes. The standard library's SIMD types also seem to be naturally suited to these protocols.

Tuples with padding can’t conform to ConvertibleToRawBytes. In a similar vein, SIMD3 probably can’t because it has a whole Element of padding. I imagine we don’t need to discuss them too much since they’re not actually being proposed here.
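Both cases are easy to check with MemoryLayout today. The typealias name PaddedTuple is made up for this example.

```swift
typealias PaddedTuple = (UInt8, UInt16)

let tupleFieldBytes = MemoryLayout<UInt8>.size + MemoryLayout<UInt16>.size
let simd3FieldBytes = 3 * MemoryLayout<Float>.size

// The tuple has one interior padding byte before the UInt16.
print(tupleFieldBytes, MemoryLayout<PaddedTuple>.size)    // 3 4
// SIMD3<Float> is stored like a four-element vector.
print(simd3FieldBytes, MemoryLayout<SIMD3<Float>>.size)   // 12 16
```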


The Span initializers require a correctly-aligned RawSpan; there should be utilities to identify the offsets that are well-aligned for a given type.

Arguably, they should be added now. Would-be users have no way to defend themselves against misaligned RawSpans.
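Such a utility is mostly address arithmetic. A minimal sketch, with firstAlignedOffset as a hypothetical name, working on a plain address value rather than on RawSpan itself:

```swift
// Hypothetical helper: number of bytes to skip from `address`
// to reach the next address that is a multiple of `alignment`.
func firstAlignedOffset(address: UInt, alignment: Int) -> Int {
    precondition(alignment > 0 && alignment & (alignment - 1) == 0,
                 "alignment must be a power of two")
    let mask = UInt(alignment) - 1
    return Int((UInt(alignment) - (address & mask)) & mask)
}

// An address ending in ...3 needs one more byte to reach a 4-byte boundary.
let misaligned = firstAlignedOffset(address: 0x1003, alignment: MemoryLayout<UInt32>.alignment)
// An already aligned address needs no adjustment.
let aligned = firstAlignedOffset(address: 0x1008, alignment: MemoryLayout<UInt32>.alignment)
print(misaligned, aligned)  // 1 0
```

With a real buffer, the address could come from UInt(bitPattern:) applied to the base address inside withUnsafeBytes; how a safe RawSpan would expose the equivalent is exactly the open question.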


For shed color consideration: I think the only other “Convertible” protocols that we have are Custom{,Debug}StringConvertible. Is Copyable{From,To}RawBytes in better alignment with BitwiseCopyable? When endianness is not involved, there’s no conversion happening, it’s just a bit cast.

I can't imagine why. Tail padding shouldn't be a problem as long as you never write more than MemoryLayout<T>.size bytes.

I shouldn’t have left Optional in there, I had a momentary lapse in sanity. In general its size isn’t equal to its stride. On the other hand, Bool should conform, since its size and stride are the same.
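The size/stride difference is easy to confirm. UInt32? is used here because its layout does not depend on the pointer width.

```swift
// Optional<UInt32> needs an extra tag byte, then rounds up to its alignment.
print(MemoryLayout<UInt32?>.size, MemoryLayout<UInt32?>.stride)  // 5 8
// Bool occupies a single byte with no tail padding.
print(MemoryLayout<Bool>.size, MemoryLayout<Bool>.stride)        // 1 1
```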

(both fixed.)

I added a future direction here. We can teach the Clang importer to add the protocols to appropriate C types, but for custom types it would be good to allow conformances to the protocols somehow. (I also added a future direction about ConvertibleFromRawBytes, which I’d forgotten to do.)

The preference is to make all the features available, but it’s preferable to hold back rather than let libraries make promises they can’t keep. The promise you make with ConvertibleToRawBytes is hard to keep, since the compiler could decide to change your struct layout at the next revision. For that reason I thought it would be best left as an automatic thing. As for ConvertibleFromRawBytes, I came from the point of view that it is not fully enforceable by the compiler, so it would always be a promise by the implementer anyway. ConvertibleFromRawBytes also won’t stop being true because the compiler lays out your fields in a different order.

This will be against the size. Will clarify.
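A sketch of what a size-based bounds check looks like, written against a plain [UInt8] with today's loadUnaligned(fromByteOffset:as:) (available since Swift 5.7). The function name is hypothetical, and the little-endian interpretation is an assumption added so the result is the same on every host.

```swift
// Hypothetical helper: bounds-checked, alignment-free load of a UInt32.
func checkedLoadLittleEndianUInt32(from bytes: [UInt8], atByteOffset offset: Int) -> UInt32? {
    let size = MemoryLayout<UInt32>.size
    // Check against the size (not the stride) of the loaded type,
    // phrased so that the arithmetic cannot overflow.
    guard offset >= 0, bytes.count >= size, offset <= bytes.count - size else {
        return nil
    }
    let raw = bytes.withUnsafeBytes { buffer in
        buffer.loadUnaligned(fromByteOffset: offset, as: UInt32.self)
    }
    return UInt32(littleEndian: raw)
}

let data: [UInt8] = [0x00, 0x78, 0x56, 0x34, 0x12]
let loaded = checkedLoadLittleEndianUInt32(from: data, atByteOffset: 1)
let pastEnd = checkedLoadLittleEndianUInt32(from: data, atByteOffset: 2)
```

Here loaded is 0x12345678 even though offset 1 is misaligned for UInt32, and pastEnd is nil because only three bytes remain at offset 2.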

Yes, there is an existing unconstrained generic function of that name.

This is simply not decided yet.

They have different type signatures by virtue of the distinct protocol constraints. I tried this overload with FixedWidthInteger and it worked fine. This being said, we might want to use a different name for the old unsafe ones, because there is no indicator unless one uses “-strict-memory-safety”.
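The overload-by-constraint behavior can be seen in isolation with two toy functions. The describe name is made up, and this sketch does not involve @unsafe at all; it only shows that the more constrained overload wins when it applies.

```swift
// Unconstrained generic: the fallback.
func describe<T>(_ value: T) -> String { "unconstrained" }
// Same name, extra constraint: preferred whenever T is a FixedWidthInteger.
func describe<T: FixedWidthInteger>(_ value: T) -> String { "FixedWidthInteger" }

let forInteger = describe(42)
let forString = describe("text")
print(forInteger)  // FixedWidthInteger
print(forString)   // unconstrained
```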

This is a good suggestion, I like it.

And there’s the rub: tail padding for T is internal padding for [2 of T]. We really must disallow tail padding in order to be able to build aggregates of ConvertibleToRawBytes things.
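A sketch of that effect, using a tuple as the aggregate so it compiles without the [2 of T] sugar. Tail and Pair are made-up names.

```swift
// Five bytes of fields, alignment 4: three bytes of tail padding per stride.
struct Tail {
    var value: UInt32
    var flag: UInt8
}
typealias Pair = (Tail, Tail)

let tailPadding = MemoryLayout<Tail>.stride - MemoryLayout<Tail>.size
let pairFieldBytes = 2 * MemoryLayout<Tail>.size
let pairInteriorPadding = MemoryLayout<Pair>.size - pairFieldBytes

print(MemoryLayout<Tail>.size, MemoryLayout<Tail>.stride)  // 5 8
// The second element must start on a 4-byte boundary, so the first
// element's tail padding ends up in the middle of the aggregate.
print(pairInteriorPadding > 0)                             // true
```

On its own, Tail has a size equal to the sum of its fields; the uninitialized bytes only show up once two of them sit next to each other.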

I shouldn’t have left Optional in there, I had a momentary lapse in sanity. In general its size isn’t equal to its stride. On the other hand, Bool should conform, since its size and stride are the same.

Thanks. CollectionOfOne.Iterator has one Element? field, so I think it means it needs to lose the conformance too. I think ClosedRange.Index might need to lose it too because it has the same shape as Optional.

The promise you make with ConvertibleToRawBytes is hard to keep, since the compiler could decide to change your struct layout at the next revision. For that reason I thought it would be best left as an automatic thing.

My interpretation of swift-evolution rules is that this proposal actually ties the hands of the compiler because it makes layout changes source-breaking. The bar for source-breaking changes is on the higher side. This also probably deserves a callout (or a refutation).

Something else I wanted to sit with for the night and see if I felt any different in the morning: AnyObject is ConvertibleToRawBytes. There are two consequences that I think are not entirely obvious:

  • It’s the reference that converts to raw bytes, not the object itself;
  • Almost everything can convert to raw bytes if AnyObject does, but very few things can convert from raw bytes.

For instance, Array could be ConvertibleToRawBytes: it only contains an _ArrayBridgeStorage that only contains a _BridgeStorage that only contains a reference. If we end up with a Sendable-like design where non-public structs have their raw bytes conformances inferred automatically, a type like Array would qualify.

I think that we don’t want Array to conform, though. I’m not 100% on articulating the reasons (storage probably shouldn’t be ConvertibleToRawBytes if the convertible part is the reference rather than the contents?) but I nebulously feel that they apply to all reference types.

The two-protocol design we have is the most semantically precise one, but I’m a little concerned that if people see that virtually all types are safely convertible to raw bytes but just a handful are convertible from raw bytes, they will think that missing a ConvertibleFromRawBytes conformance is an implementation limitation rather than a conscious design decision and they will use unsafe conversions without trying to understand what’s going on.

I think one convenient reason that AnyObject conforms to ConvertibleToRawBytes is that it lets you bitCast them to integer types safely. If that’s the main appeal, I think we could get that benefit with a func bitCast<From: AnyObject, To: ConvertibleFromRawBytes> overload and not have to deal with the rest.
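For comparison, the reference-to-integer conversion is already available safely today through ObjectIdentifier, which also makes the first point above concrete: the bits identify the reference, not the object's contents. Box is a made-up class for the example.

```swift
final class Box {
    var value: Int
    init(_ value: Int) { self.value = value }
}

let first = Box(1)
let alias = first      // second reference to the same object
let other = Box(1)     // distinct object with equal contents

let firstBits = UInt(bitPattern: ObjectIdentifier(first))
let aliasBits = UInt(bitPattern: ObjectIdentifier(alias))
let otherBits = UInt(bitPattern: ObjectIdentifier(other))

print(firstBits == aliasBits)  // true: same reference
print(firstBits == otherBits)  // false: equal contents, different reference
```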

Second thing I wanted to sit with for the night: bitCast has a consuming argument and that made me realize we didn’t clarify the interactions of this proposal with ~Copyable types much. We already know that ~Copyable types are never ConvertibleFromRawBytes because ~Copyable types are never BitwiseCopyable. I would argue that ~Copyable types should also never be ConvertibleToRawBytes because ConvertibleToRawBytes is a copy that discards semantics.

We might not be able to prevent that today because compiler support for the raw byte protocols is pretty early, but it might point to consuming being the wrong argument modifier if the argument is always expected to be copyable.

Regardless of whether non-copyable types should be allowed to conform to ConvertibleToRawBytes, you do still want the argument to be consuming for the same reason arguments to init default to consuming. It lets the caller handle the question of whether a copy needs to be made of the passed in value. If the value is created only to be the argument to bitCast, then no copy needs to be made at all.
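A small sketch of that ownership argument with a copyable type (the consuming modifier needs Swift 5.9 or later). Payload and byteCount are made-up names standing in for the bitCast argument.

```swift
struct Payload {
    var bytes: [UInt8]
}

// The callee takes ownership of its argument.
func byteCount(of payload: consuming Payload) -> Int {
    payload.bytes.count
}

// A temporary created just for the call can be handed over as-is.
let fromTemporary = byteCount(of: Payload(bytes: [1, 2, 3]))

// A value that is still used afterwards gets copied by the caller instead.
let kept = Payload(bytes: [4, 5])
let fromKept = byteCount(of: kept)
print(fromTemporary, fromKept, kept.bytes.count)  // 3 2 2
```

Either way the call site reads the same; the modifier only decides who is responsible for the copy when one is needed.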