Pitch: Add implementation for `withContiguousStorageIfAvailable()` to `Foundation.ContiguousBytes`-conforming sequence types

I was pretty surprised to discover that certain types in the Foundation overlay which are used to represent unstructured data—particularly, the Data struct—do not support Sequence's withContiguousStorageIfAvailable method, instead falling back on Sequence's built-in implementation which just returns nil:

> Data([0,1,2]).withContiguousStorageIfAvailable { _ in return 1 } 
$R15: Int? = nil

This seems non-ideal for a few reasons:

  1. When reading bytes from arbitrary sequences of UInt8, withContiguousStorageIfAvailable provides an important performance optimization, including in one of Data's own initializers. With Data not supporting this method, special logic needs to be used to check for Data and/or ContiguousBytes (as Data's sequence-taking initializer does), since otherwise the code will take the slow path.

  2. Since Data and ContiguousBytes are only available through importing the Foundation library, a Swift package or library that wishes to be compatible with apps that do not import Foundation cannot explicitly check for them, with the result that passing a Data to an API in such a library will result in inferior performance.
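To make the fast-path/slow-path distinction above concrete, here is a minimal sketch of a Foundation-free helper. The name `byteSum` and the `usedFastPath` flag are hypothetical and exist only for illustration. Going by the behavior described in this post, a Data passed to such a function would currently land in the slow branch.

```swift
/// Sums the bytes of any UInt8 sequence and reports which path was taken.
/// Hypothetical helper; it does not import Foundation.
func byteSum<S: Sequence>(_ bytes: S) -> (sum: Int, usedFastPath: Bool)
where S.Element == UInt8 {
    // Fast path: the sequence exposes contiguous typed storage.
    let fast = bytes.withContiguousStorageIfAvailable { buffer in
        buffer.reduce(0) { $0 + Int($1) }
    }
    if let sum = fast { return (sum, true) }
    // Slow path: element-by-element iteration.
    return (bytes.reduce(0) { $0 + Int($1) }, false)
}

let fromArray = byteSum([1, 2, 3] as [UInt8])                       // Array has contiguous storage
let fromStride = byteSum(stride(from: UInt8(1), through: 3, by: 1)) // falls back to iteration
```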

Fortunately, this seems as if it should be fairly easy to solve by adding something along these lines to the Data and DispatchData.Region types (in both corelibs-foundation and the Darwin overlay):

public func withContiguousStorageIfAvailable<R>(
    _ body: (UnsafeBufferPointer<UInt8>) throws -> R
) rethrows -> R? {
    try self.withUnsafeBytes {
        try body($0.bindMemory(to: UInt8.self))
    }
}

This would allow libraries to efficiently access data storage without necessarily needing to depend on, and thus force all clients to also depend on, Foundation.

The issue with this implementation, and the reason we didn't support withContiguousStorageIfAvailable as-is, is that Data is explicitly a collection of raw bytes, which means the underlying memory can't be safely bound in this way under the rules of the language. For instance, at the moment, you can create a Data whose buffer is a raw pointer pointing to memory which is already bound to a type T != UInt8 (with Data.init(bytesNoCopy:count:deallocator:)). The semantics of the language guarantee that it's safe to access the bytes of T as raw bytes, but binding the memory to UInt8 is undefined behavior.

Theoretically, the language could make a special case for Int8/UInt8 specifically when it comes to binding and rebinding memory, treating memory bound to these types as if they were raw access, but it doesn't at the moment (and I'm not 100% sure what the ramifications of that would be... /cc @andrew_trick).

The only really safe way to do this would be to create a temporary copy of the underlying buffer, bind that memory, and pass it to the closure, but that defeats the entire purpose of this fast-access method.
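A rough sketch of the distinction, using only the standard library: memory bound to UInt32 can always be read through a raw buffer, and the "copy first" workaround hands out freshly allocated storage that really is bound to UInt8. The function name `rawBytesAndCopy` is made up for this illustration.

```swift
/// Reads the bytes of memory bound to UInt32 without rebinding it.
/// Hypothetical helper for illustration only.
func rawBytesAndCopy() -> (raw: [UInt8], copiedSum: Int) {
    // This allocation is bound to UInt32, not UInt8.
    let typed = UnsafeMutablePointer<UInt32>.allocate(capacity: 1)
    typed.initialize(to: 0x0102_0304)
    defer { typed.deinitialize(count: 1); typed.deallocate() }

    // Raw access is always allowed, whatever type the memory is bound to.
    let raw = UnsafeRawBufferPointer(start: UnsafeRawPointer(typed),
                                     count: MemoryLayout<UInt32>.size)

    // Copying into an Array yields storage that is genuinely bound to UInt8,
    // so handing out a typed buffer is fine, at the cost of the copy.
    let copy = Array(raw)
    let copiedSum = copy.withUnsafeBufferPointer { buffer in
        buffer.reduce(0) { $0 + Int($1) }
    }
    return (copy, copiedSum)
}

let result = rawBytesAndCopy()
```

The byte order in `result.raw` depends on the platform's endianness, so only order-independent properties are worth checking.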

(Theoretically, too, the ContiguousBytes protocol could be sunk down to the standard library along with some of the conformances added in Foundation, but it would be up for those teams to decide whether that's a move they would support.)

Argh... I see. In C, a pointer-to-bytes would be what you would use to access the raw bytes, but that's just because you can't dereference a void * at all, whereas with UnsafeRawPointer, you can (to get bytes, anyway). If the system did have an exception for UInt8 pointers, it would sure be a lot friendlier to us geezers who have been writing Mac code since the 90s ;-) but I suspect that's not a battle I'll win, as frustrating as that may be.

With that said, I do think this optimization ought to be possible. In your opinion, which of these three pitches would be likely to draw the least amount of friction?

  1. Put ContiguousBytes in the standard library

  2. Add a withContiguousBytesIfAvailable method on Sequence that does the same thing as withContiguousStorageIfAvailable, but gives a raw pointer

  3. Make a byte-reading exception for UnsafePointer<UInt8> (hey, maybe I'm wrong about no one liking this one? 🤷)

One additional thing I would like to request, regardless, is that some kind of warning be added to the documentation for withContiguousStorageIfAvailable on Data and similar constructs like DispatchData, DispatchData.Region, et al., letting the reader know that these will always just return nil. The current Xcode documentation just shows the description inherited from Sequence, which feels rather deceptive to me, particularly since intuitively, Data would be exactly the sort of thing where you'd expect this to work.

I don't disagree! Though C pointers have their fair share of wackiness too that Swift has been working hard to avoid...

As someone with 0 authority or input on the matter... I'd say of the three options, it'd probably be easiest to go with (1).

Option (2) was something briefly considered at the time and rejected as unsafe. In general, it's not helpful to expose arbitrary Sequences as sequences of bytes. (Think [String] or Sequence<UnsafeRawPointer> — those aren't useful to traverse as bytes in pretty much any scenario, so you wouldn't want an unconstrained withContiguousBytesIfAvailable exposed on those types.) If it were possible to constrain the extension to Sequence where Element: Trivial (where Trivial is a compiler-controlled marker protocol to indicate POD types) then maybe, but that's not a feature supported by the language right now.
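For the sake of discussion, a constrained variant might look roughly like the sketch below. Since no Trivial marker protocol exists, FixedWidthInteger is used here purely as a stand-in constraint, and `withContiguousBytesIfAvailable` is the hypothetical method name from option (2), not an existing API. Because it forwards to withContiguousStorageIfAvailable, it would still return nil for Data today.

```swift
// Hypothetical: FixedWidthInteger stands in for the nonexistent `Trivial` constraint.
extension Sequence where Element: FixedWidthInteger {
    /// Exposes contiguous typed storage as raw bytes, if the sequence has any.
    func withContiguousBytesIfAvailable<R>(
        _ body: (UnsafeRawBufferPointer) throws -> R
    ) rethrows -> R? {
        try withContiguousStorageIfAvailable { typed in
            // Viewing typed memory as raw bytes does not rebind it.
            try body(UnsafeRawBufferPointer(typed))
        }
    }
}

let byteCount = ([1, 2] as [UInt16]).withContiguousBytesIfAvailable { $0.count }
let noStorage = stride(from: UInt16(1), through: 2, by: 1)
    .withContiguousBytesIfAvailable { $0.count }
```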

(3) might also be possible, but I have no idea what that work might entail, or whether that's something the language wants to undergo. Someone like @andrew_trick would need to weigh in on this because the subject is out of my wheelhouse.

This is likely worth filing Feedback for, though I don't believe this is currently possible in the documentation system. Since Data and these other types don't override this method, I don't know if it's possible to attach documentation to the inherited method...

Adding a ContiguousBytes protocol is the most logical way to handle containers that can be viewed as contiguous bytes. But there may be some concern about the runtime overhead associated with adding more conformances to generic types. I'm not sure how valid those concerns still are given the current state of runtime optimization. Nonetheless, it's a significant language change.

As of Feb 5, 2022, it is reasonably safe to implement Foundation.Data's withContiguousStorageIfAvailable by using withMemoryRebound: SE-0333 Expand usability of withMemoryRebound
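As a small sketch of what SE-0333 enables, assuming a toolchain that ships the raw-buffer overload of withMemoryRebound (Swift 5.7 or later): the helper name `sumAsUInt8` is made up, and an Array's raw bytes stand in for Data so that the example needs no Foundation import.

```swift
/// Temporarily views raw memory as UInt8 for the duration of the closure.
/// Hypothetical helper; requires the SE-0333 additions to UnsafeRawBufferPointer.
func sumAsUInt8(_ raw: UnsafeRawBufferPointer) -> Int {
    raw.withMemoryRebound(to: UInt8.self) { typed in
        typed.reduce(0) { $0 + Int($1) }
    }
}

// The array's storage is bound to UInt16; it is only rebound temporarily.
let words: [UInt16] = [0x0101, 0x0202]
let total = words.withUnsafeBytes { sumAsUInt8($0) }
```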

It is now legal to pass a Swift UnsafePointer into a C function taking char *. That solves the problem of calling into (abusive) C APIs that assume char * can alias with anything.

It is extremely important for Swift programmers, whether they are C experts or not, to realize that casting typed pointers is the wrong way to reinterpret types. There is a safe and convenient way to do this using raw pointers. If they see common examples of code converting typed pointers, rather than using raw pointers, then that is actively misleading. It will lead to more undefined behavior in Swift code as they find ways to force the compiler to cast pointers to other types. Special-casing the semantics of Swift's generic unsafe pointer types based on an element type, which is not always known at compile time, would add a bizarre source of complexity to the language.

Remember that it's common for C code to have undefined behavior related to this. People seldom notice this problem because of limited optimization scope across C libraries.

What is that safe and convenient way?

Is there a documentation page for it that I can reference when this topic inevitably comes up again?

UnsafeRawPointer.load and UnsafeMutableRawPointer.storeBytes.

The official API docs:
https://developer.apple.com/documentation/swift/unsaferawpointer/

The original SE proposal:

We are definitely missing a Swift tutorial for working with byte buffers. We have a few ongoing SE proposals to revise the UnsafePointer APIs right now for usability, then I think we need to prioritize that.

In the meantime, there's a 2020 WWDC talk called "Safely Manage Pointers in Swift".
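A minimal sketch of the load/storeBytes approach, reinterpreting a Float's bits as a UInt32 through a raw buffer instead of casting a typed pointer. The variable names are illustrative only.

```swift
var value: Float = 1.0

// Get raw access to the variable's storage, read it as UInt32, and write back.
let originalBits = withUnsafeMutableBytes(of: &value) { raw -> UInt32 in
    let bits = raw.load(as: UInt32.self)
    // Set the IEEE 754 sign bit and store the result back through the raw pointer.
    raw.storeBytes(of: bits | 0x8000_0000, as: UInt32.self)
    return bits
}
```

After this runs, `originalBits` holds the bit pattern of 1.0 and `value` has had its sign flipped.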

This is fantastic, and something of which I was, until now, unaware. I've wondered for a while why the raw pointer types didn't support withMemoryRebound, and am delighted to see that that is changing.

Would you be opposed to my re-pitching this pitch using this new functionality instead of bindMemory? Should I make a new thread or just edit the OP in this one?

I am unable to find an answer in those links.

Is there a simple example showing exactly how to take an existing typed pointer, say UnsafeMutablePointer<Foo> and treat the memory it points to as having type Bar, including mutations of that memory?

From what you wrote it sounds like there is some process involving “Get a raw pointer to the same memory, call load on the raw pointer, perform whatever operations you want on the loaded value, then call store on the raw pointer.”

So perhaps something like this?

let rawPtr = UnsafeMutableRawPointer(fooPtr + offset)
var bar = rawPtr.load(as: Bar.self)
bar.mutate()
rawPtr.storeBytes(of: bar, as: Bar.self)

If you say that’s safe, great, but I certainly would not call it convenient.

It’s mostly boilerplate, and it only gives one element at a time. Getting a second element increases the boilerplate, and if we actually need a pointer to a buffer of Bars it doesn’t appear to help.

If anything, withMemoryRebound seems far more convenient.

Probably the best "official" documentation I've found so far is that WWDC20 video, although even that leaves a lot of questions unanswered. Honestly, I've found the most helpful info, such that it exists, to be in old discussion threads on this forum that come up in a search. 🤷

This is great! I have to admit, I haven't had much time to keep up with many recent proposals, but this is excellent news. I'll take the time to read through; very exciting. 😄

Since Data belongs to Foundation and adding this would just require adding an implementation of withContiguousStorageIfAvailable to the type, this isn't something that would go through a pitch — your best bet will be to file Feedback on this. It's something the Foundation team will need to prioritize and pick up. (/cc @Tony_Parker in case there's interest)

Theoretically, too, this could be added to ContiguousBytes rather than on concrete types directly:

extension ContiguousBytes where Self: Sequence, Self.Element == UInt8 {
    public func withContiguousStorageIfAvailable<R>(_ body: (UnsafeBufferPointer<UInt8>) throws -> R) rethrows -> R? {
        try self.withUnsafeBytes { rawBuffer in
            try rawBuffer.withMemoryRebound(to: UInt8.self) { typedBuffer in
                try body(typedBuffer)
            }
        }
    }
}

(caveat emptor: code compiled in browser, may not build)

That was actually the first thing I tried when testing this out on my end, but it didn't get called when I passed the Data to a function taking a generic Sequence, only when the function was taking a Data itself. I think it's because an extension like this will only get called via static dispatch?

Hmm, I had the same concern before posting and tested out a minimized version to check the dynamic dispatch; on my machine (w/ Swift 5.5.2):

protocol S { func f() }
extension S { func f() { print("S.f") }}

protocol C {}
extension C where Self: S { func f() { print("C.f") }}

struct X: S, C {}

func generic<T: S>(_ v: T) { v.f() }
func existential(_ v: S) { v.f() }

generic(X()) // C.f
existential(X()) // C.f

If the behavior here is due to the fact that these are all in the same file and module, then I may have been misled. (Though if it's okay for S and C to be in different modules, and the implementation on C where Self: S will get picked up if the definition of C and the extension are in the same file, then this will work for ContiguousBytes.)

I am unable to find an answer in those links.

Sorry. There is a lot more background information to be found here on the forums than in the official docs. Although I should have linked to:
https://developer.apple.com/documentation/swift/unsafemutablerawpointer
https://developer.apple.com/documentation/swift/unsafemutablerawbufferpointer/

UnsafeMutableRawBufferPointer is a collection of bytes. Its API is almost the same as UnsafeMutableBufferPointer, if that's what you're looking for.

There are two proposals underway to add more convenience and fill in holes in the API, namely with initialization and slicing:

You may want to reinterpret a byte buffer as a typed mutable collection. This is a more refined version of the basic skeleton for that collection presented in "Safely Manage Pointers in Swift":

struct UnsafeBufferView<Element> : RandomAccessCollection {
  // Note: Always build raw memory containers on top of
  // Unsafe[Mutable]RawPointer, not Unsafe[Mutable]RawBufferPointer,
  // which requires two Optional unwrapping checks on every access!
  let rawBytes: UnsafeMutableRawPointer
  let count: Int

  init(reinterpret rawBytes: UnsafeMutableRawBufferPointer, as: Element.Type) {
    assert(rawBytes.baseAddress != nil || rawBytes.count == 0)
    self.rawBytes = rawBytes.baseAddress ?? UnsafeMutableRawPointer(bitPattern: -1)!
    self.count = rawBytes.count / MemoryLayout<Element>.stride
    precondition(self.count * MemoryLayout<Element>.stride == rawBytes.count)
    precondition(Int(bitPattern: rawBytes.baseAddress).isMultiple(of: MemoryLayout<Element>.alignment))
  }

  public var startIndex: Int { 0 }

  public var endIndex: Int { count }

  subscript(unchecked index: Int) -> Element {
    get {
      rawBytes.load(fromByteOffset: index * MemoryLayout<Element>.stride, as: Element.self)
    }
    nonmutating set(newValue) {
      rawBytes.storeBytes(of: newValue, toByteOffset: index * MemoryLayout<Element>.stride,
        as: Element.self)
    }
  }
  // Unlike Unsafe[Mutable]RawBufferPointer, subscripts should be
  // bounds-checked by default in release builds.
  subscript(index: Int) -> Element {
    get {
      precondition(index >= 0)
      precondition(index < count)
      return self[unchecked: index]
    }
    nonmutating set(newValue) {
      precondition(index >= 0)
      precondition(index < count)
      self[unchecked: index] = newValue
    }
  }
}

This isn't in the standard library yet because unsafe pointers don't get much attention. Also, for flexible API design, a "buffer view" should be compatible with both safe and unsafe buffers, which requires move-only types.

What I am looking for is a simple self-contained example which starts with a typed pointer, say UnsafeMutablePointer<Foo>, and calls a function which expects a differently typed pointer, say UnsafeMutablePointer<Bar>, when I, the programmer, know that the bytes of a buffer of Foo are also valid as the bytes of a buffer of Bar.

Noted. Assuming you can't change those functions, then you're describing withMemoryRebound:

https://developer.apple.com/documentation/swift/unsafepointer/2430863-withmemoryrebound
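A self-contained sketch of that pattern, with Int32 and UInt32 standing in for Foo and Bar (the two types have the same size and alignment, which is what makes the rebinding valid). The function `bumpAll` is a made-up stand-in for an API you can't change.

```swift
/// Hypothetical API that insists on a UInt32 pointer.
func bumpAll(_ pointer: UnsafeMutablePointer<UInt32>, count: Int) {
    for i in 0..<count { pointer[i] &+= 1 }
}

var values: [Int32] = [-1, 0, 41]
values.withUnsafeMutableBufferPointer { buffer in
    guard let base = buffer.baseAddress else { return }
    let count = buffer.count
    // The memory is bound to UInt32 only for the duration of the closure,
    // then rebound to Int32 when the closure returns.
    base.withMemoryRebound(to: UInt32.self, capacity: count) { rebound in
        bumpAll(rebound, count: count)
    }
}
```

The mutations made through the rebound pointer are visible in `values` afterwards; the wrapping add turns -1 into 0 because both types share the same bit pattern.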

Are these functions both defined in Swift? And why do they expect different types? We need to accumulate the common use cases for a tutorial.

I originally wrote explanatory text and an example in the API docs, but they were stripped out. We need to decide whether that should be in the formal docs or a separate tutorial.
