SE-0447: Span: Safe Access to Contiguous Storage

A non-optional UnsafePointer that has all bits zero is just as undefined. If we add dedicated NonZeroUInt32 and such like Rust has, the same will be true there. Bool might be unique in having many invalid representations, but it’s something that’s been in Swift for a long time, and it’s hard to change while still allowing Option<UnsafeRawPointer>.none to be all zeros and not having to check for structural validity on every raw read.

EDIT: as you pointed out, this doesn’t just affect reads of Bool, but also compound types containing Bool, like Result<Option<(Bool, Bool, Int)>>. So the structural validity check is now conditional and multi-step.

This might be good for a sanitizer, though!

3 Likes

As I said, any enum whose discriminator doesn't use every possible bit pattern has invalid states and will cause UB if those states are created. As well as non optional pointers, as pointed out by @jrose. This has been the status quo of the language since it's release, and is also how Rust works.

UTF8View.Element is just a UInt8 though, so it doesn't have any invalid states. It's not like Rust's char type that requires valid non-surrogate UTF-8 code points. I also don't think Swift cares about the value of padding bits, but I could be wrong there.

1 Like

We want to fulfill the initialization need with OutputSpan, which would provide a safe way to initialize storage in order. An OutputSpan would be separated into two parts: an initialized prefix that is accessible, and an uninitialized suffix that is inaccessible. We've used a prototype of this to good effect in swift-foundation.

1 Like

Does the mere existence of such a value trigger UB, or is it only UB if you perform any operation on such a value other than a bitwise copy?

Specifically, I think the language should have well-defined behavior for the following: unsafeBitCast(unsafeBitCast(0, to: UnsafeRawPointer), UnsafeRawPointer?).

You won't invoke UB related to an element type mismatch until you actually read from or write to the memory.
With UnsafePointer<T> it's UB to access memory bound to an incompatible type even if the bit pattern is valid for both types.
With Span<T: BitwiseCopyable> it's only UB to access memory when the bit pattern is invalid. It doesn't matter what type the memory is bound to.

We could offer a safe Span<T: AnyBitPattern>.view(as: T.Type), but we may want to come up with a more general rule that handles enums. That's still under consideration.

2 Likes

That’s good, because it allows you to load a buffer as a compound BitwiseCopyable value and then check for validity as you go.

1 Like

I stand corrected.

A value can't exist with an invalid bit pattern, so unsafeBitCast(0, to: UnsafeRawPointer) is already UB. Pointing at an invalid bit pattern is fine as long as you don't load it.

To be clear, you would have to do that validation by loading the memory as a fully-inhabited type first.

4 Likes

That seems to be in conflict with @Andrew_Trick’s comment above. My understanding of the definition of whether a value of type T exists is “memory at addresses [A..<(A+MemoryLayout<T>.size)] is bound to type T”. This definition is compatible with Andrew’s explanation, and does not make any statements about the bit pattern in memory.

Andrew said that UB is invoked when the program tries to read from or write to that memory. That’s sufficient for my use case of checking an “is valid” flag before attempting to access a subsequent value of partially-inhabited type, but it can be made even stronger with the additional qualification of “reading from or writing to that memory using safe operations`. That would then extend defined behavior to the unsafe bit-cast dance too.

It might be better to think of it like your program is running on a CPU with a register file that can only hold the valid values of type T. Memory can be bound to T regardless of what's in the memory at the time, but when you load, there has to be something in the memory that the notional register can represent.

1 Like

That interpretation is still compatible with “a value can exist with an invalid bit pattern”: the value is “whatever is stored in register N”. This allows us to divide operations into “safe” and “unsafe” operations: “safe” operations are ones that assume that the value stored in register N (or memory location A) has a valid bit representation for that type, and preserve that property for all valid types; “unsafe” operations are allowed to choose their own behavior for any bit representation, valid or invalid. I would prefer unsafeBitCast() to be defined as a preserving bit representation, and to not count as “accessing” the value.

I think we'd be fairly limited in what operations we could designate to be "unsafe", since passing a value to a function is a "safe" operation by your definition, and most operations are modeled as function calls at some level. Maybe copying and destroying a bitwise-copyable value could be guaranteed "unsafe" by your definition, so if you load a value and do nothing else with it you're OK, but I can't think of much else.

Perhaps we could have a standard AnyBitPattern<T: BitwiseCopyable> wrapper that can safely hold any bit pattern with the size and alignment of T, and then APIs like Span could have the ability to loadAny<T>() -> AnyBitPattern<T>, and the AnyBitPattern type could have a validate() -> T? method that uses the compiler's layout info to check whether the bit pattern is safe to load as a T.

2 Likes

I interpreted Andrew’s post as saying that this is in fact the case.

Dereferencing the pointer is an "access":
_ = pointer.pointee // this is an access

A safe decoder should only access memory using a type that allows all valid bit patterns.

We don't formally distinguish between loading an aggregate as a whole vs reading a particular field within that aggregate. Although, naturally, the later is more likely break in practice.

We could think about different levels of correctness violations:

  1. API violation, but not UB: verification may fail
  2. Theoretical UB: no well-defined meaning
  3. Practical UB: will likely cause problems
struct Layout {
  var code: UInt8
  var flag: Bool
}
// Given `rawp: UnsafeRawPointer` that points to an unknown bit pattern:
let value = rawp.load(as: Layout.self) // UB, in theory
foo(value.flag) // Practical UB

let codePointer = rawp.bindMemory(to: UInt8.self, capacity: 1)
let code = codePointer.pointee // OK

// API violation because memory is bound to UInt8: verification can fail
let badLayoutPointer1 = rawp.assumingMemoryBound(to: Layout.self) 

let badLayout1 = badLayoutPointer1.pointee // UB, in theory
foo(badLayout1.flag) // Practical UB

// OK to bind memory, regardless of the in-memory bitpattern
let badLayoutPointer2 = rawp.bindMemory(to: Layout.self, capacity: 1)

let badLayout = badLayoutPointer2.pointee // UB, in theory
foo(badLayout.flag) // Practical UB

Similary, zero is an invalid pointer value, so:

let rawp = unsafeBitCast(0, to: UnsafeRawPointer.self) // UB, in theory
return UnsafeRawPointer?(rawp) != nil // Practical UB
3 Likes

Hey folks,

Based on the ongoing discussion, the authors have provided a pull request with some updates to the API for Span:

  • Add Index type and indices property to Span.
  • Add byteOffsets to RawSpan.
  • Removed the index validation utilities (e.g., boundsContain, isWithin) from Span and RawSpan.
  • Provide more detail on how the subscript accessor is likely to evolve (it's specified as _read).
  • Some wordsmithing around unsafeLoad.

We'll extend this review until October 15, 2024.

Doug Gregor
Review Manager

14 Likes

I'm very happy to see the indices and byteOffsets properties being added. I think their addition makes Span's API more 'Swifty' and nicer to use, though I'm a bit curious how this interacts with the concerns others have mentioned. (Namely that a noncopyable-supporting collection protocol may require an indices property with a different type)

1 Like

After consideration, a concrete indices property (without a typealias) seems nearly risk-free. The namespace clobbering I had worried about is about the typealias, which we sidestep here. We did define the typealias for Index, as this is completely in line with lots of extant random-access containers.

3 Likes