Foundation Data does not support an internal representation as contiguous storage?

rlovelett · April 11, 2021, 6:21pm

When running this code I expected to see it print "Contiguous storage!". However, no such thing prints.

import Foundation
let data = Data(bytes: [0x48, 0x65, 0x6C, 0x6C, 0x6F] as [UInt8], count: 5)
data.withContiguousStorageIfAvailable { _ in
    print("Contiguous storage!")
}

Does this mean that Data does not support an internal representation as contiguous storage? That feels odd, since Data conforms to the ContiguousBytes protocol. And I could have sworn that there was an announcement about Data guaranteeing contiguous bytes.

What am I doing wrong?

itaiferber · April 11, 2021, 8:02pm

A subtle point that could likely be called out a little bit more explicitly in documentation, but the purpose of withContiguousStorageIfAvailable is to offer access to the underlying storage of a Sequence or Collection if it is typed. This is implicit in the parameter to the closure passed to withContiguousStorageIfAvailable: it accepts an UnsafeBufferPointer bound to the Element type of the Sequence/Collection.

Data, however, is untyped (AKA, raw), and cannot provide a typed pointer without binding its underlying storage, which is not strictly valid if the buffer is already bound to another data type. Instead, through ContiguousBytes, it offers an UnsafeRawBufferPointer with raw byte access to its underlying buffer.
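A minimal sketch of the two access routes described above. The helper name byteSum(of:) is made up for illustration. It tries the typed route first and falls back to the raw one, so it works whether or not Data offers typed storage on a given toolchain:

```swift
import Foundation

// Hypothetical helper: sums the bytes of `data`, preferring typed storage when offered.
func byteSum(of data: Data) -> Int {
    // Typed route: the closure takes an UnsafeBufferPointer<UInt8>.
    // A nil result means no typed contiguous storage was made available.
    let typed = data.withContiguousStorageIfAvailable { (buffer: UnsafeBufferPointer<UInt8>) -> Int in
        buffer.reduce(0) { $0 + Int($1) }
    }
    if let typed = typed { return typed }

    // Raw route: Data hands out an UnsafeRawBufferPointer without binding anything.
    return data.withUnsafeBytes { (raw: UnsafeRawBufferPointer) -> Int in
        raw.reduce(0) { $0 + Int($1) }
    }
}

let total = byteSum(of: Data([0x48, 0x65, 0x6C, 0x6C, 0x6F]))
print(total)
```

Either branch computes the same sum; the only difference is the pointer type the closure receives.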

rlovelett · April 11, 2021, 8:18pm

Not to say you are wrong, that could very well be the intention and indeed the reason, but that feels incongruent with the rest of Data to me.

I mean Data is, by way of RandomAccessCollection, a Sequence where the Element = UInt8. So sometimes it is raw (apparently here) and sometimes it is not (pretty much all the element-based operations on Data).

This is surprising.

xwu · April 11, 2021, 8:30pm

It makes sense how that may seem inconsistent, but the salient point is that the underlying contiguous storage is not (and cannot be) available typed.

itaiferber · April 11, 2021, 8:34pm

[For a lot of context on bound memory, I highly recommend watching @Andrew_Trick's WWDC session on pointers in Swift — it's really really helpful for grokking bound vs. raw memory.]

I agree that the state of things feels a bit pedantic, especially since Data is a Sequence of UInt8 values. At its core, we wanted Data to abstract over raw memory to fit in line with the rest of Swift's strongly-typed pointer system: we didn't want Data to make it easy to subvert the rules of typed memory and invoke undefined behavior. Data's initializers all take raw memory buffers, and where they don't, they copy the contents of the given pointers to ensure raw memory access.

Importantly, though, Data isn't guaranteed to be the owner of the underlying buffer: it has an initializer which allows it to temporarily hold on to a buffer it doesn't own. Although the buffer is passed in as raw, it could validly be a raw pointer pointing to typed memory, which Data wouldn't be allowed to rebind. This is sort of the key point: once a pointer is bound to a specific type of memory, it cannot be unbound and rebound until the memory is deallocated. Binding is a permanent operation which affects how Swift sees and treats a pointer.
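To make the non-owning case concrete, here is a small sketch that wraps memory bound to UInt32 in a Data without copying. The variable names are illustrative, and the Data value must not escape the closure, since it only borrows the array's storage:

```swift
import Foundation

var samples: [UInt32] = [1, 2, 3, 4]

let (viewCount, viewSum) = samples.withUnsafeMutableBytes { (raw: UnsafeMutableRawBufferPointer) -> (Int, Int) in
    // Data borrows the buffer; `.none` means Data will not free it.
    // The memory underneath is still bound to UInt32, so Data must treat it as raw.
    let view = Data(bytesNoCopy: raw.baseAddress!, count: raw.count, deallocator: .none)
    // Reading bytes through Data's collection interface is fine; no rebinding occurs.
    return (view.count, view.reduce(0) { $0 + Int($1) })
}
print(viewCount, viewSum)
```

If Data handed out an UnsafeBufferPointer<UInt8> here, it would be presenting memory bound to UInt32 as if it were bound to UInt8, which is exactly the situation described above.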

@Andrew_Trick has mentioned in the past that we could codify in language rules that it is always possible to rebind a pointer to one of a trivial type (e.g. it's always safe to rebind to UnsafeBufferPointer<UInt8> or UnsafeBufferPointer<Double>, or others), but the operation is currently not permitted (and we don't have a set definition for trivial types, or guarantees in the type system on what might be considered trivial).

So, erring on the side of caution, Data prevents this sort of binding.

Note too that UnsafeRawBufferPointer is itself a Sequence of UInt8. It's always safe to read trivial data (UInt8, UInt32, etc.) out of an UnsafeRawBufferPointer, so Data's conformance to Sequence in the same way is not a violation of this. We just can't rebind the pointer itself, which is what withContiguousStorageIfAvailable would force us to do.
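A quick way to see that the two byte views agree, as a sketch with illustrative names: iterate the raw buffer and compare it element by element with the Data it came from.

```swift
import Foundation

let hello = Data([0x48, 0x65, 0x6C, 0x6C, 0x6F])

// UnsafeRawBufferPointer is a Collection of UInt8, just like Data.
let sameBytes = hello.withUnsafeBytes { (raw: UnsafeRawBufferPointer) -> Bool in
    raw.elementsEqual(hello)
}
let firstByte: UInt8 = hello[0]
print(sameBytes, firstByte)
```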

rlovelett · April 11, 2021, 9:01pm

That link to WWDC talk is a link to init(bytesNoCopy:count:deallocator:) docs.

Should it be to WWDC20 - Unsafe Swift with the follow up of WWDC20 - Safely manage pointers in Swift?

itaiferber · April 11, 2021, 9:04pm

Apologies! Fixed the link — it's the second talk, Safely manage pointers in Swift.

rlovelett · April 11, 2021, 10:02pm

I guess I will need to watch that talk and think on it some more.

I get that there are 2 different APIs for working with raw vs typed pointers. What still is not clear to me is how it is safe to rebind (I am not even sure if that is what is happening or the right terminology) Data to have an Element type of UInt8 for the purposes of RandomAccessCollection, while simultaneously not being safe to rebind it to UInt8 for the purposes of withContiguousStorageIfAvailable.

This whole thing started for me when looking at trying to fwrite some information. Sometimes that information comes in the form of a String, sometimes in the form of the Data output of JSONEncoder. I tried to unify that by way of withContiguousStorageIfAvailable, clearly unsuccessfully. In light of this discussion, it feels like JSONEncoder should be returning UnsafeBufferPointer<UInt8> and not Data.

Or more generally, I am now struggling with the use of Data if it is indeed making everything raw (and effectively erasing the storage type). It now feels like there is a missing typed Data. Or maybe there needs to be a new "raw" contiguous storage API and a contiguous storage API. Similar to what we have with pointers.

When I started walking down this road it felt simple. How naïve I was.

itaiferber · April 11, 2021, 11:02pm

The key to this is that you can read arbitrary data from a raw pointer without the need to bind it. The right terminology here would be that you can load data from the pointer, as with UnsafeRawBufferPointer.load(fromByteOffset:as:).

Reframed slightly: when you bind memory in Swift, you are asserting to the compiler and optimizer that that memory can only contain data of a certain type, be it UInt8, Double, String, or MyCustomType. Memory can only be bound to a single type at a time, cannot be arbitrarily rebound (see UnsafePointer.withMemoryRebound(to:capacity:) for some more nuance), and cannot be unbound without being deallocated.

Raw memory, on the other hand, does not have these restrictions, and can be accessed byte-wise as any type, so long as you get the stride and alignment correct. From the UnsafeRawBufferPointer docs:

Each byte in memory is viewed as a UInt8 value independent of the type of values held in that memory. Reading from memory through a raw buffer is an untyped operation.

In addition to its collection interface, an UnsafeRawBufferPointer instance also supports the load(fromByteOffset:as:) method provided by UnsafeRawPointer, including bounds checks in debug mode.

Leaving memory unbound makes reading from it significantly more manual (you have to correctly manage byte offsets and ensure your stride and alignment are correct), but it means that you can safely read anything you want out of it.

So this is how both UnsafeRawBufferPointer and Data can be sequences of UInt8: the UInt8 Element type is really a "byte" type which you're getting raw access to. The difference between that and UnsafeBufferPointer<UInt8> is... subtle... if there is truly a meaningful difference. The language could guarantee that the compiler and optimizer treat UnsafeBufferPointer<UInt8> as UnsafeRawBufferPointer and vice versa, and guarantee that rebinding that way is always safe; it just doesn't, yet. If and when it does, Data can certainly hand out an UnsafeBufferPointer<UInt8> safely; until then, the safest thing to do is have it hand out UnsafeRawBufferPointer exclusively.
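As a small illustration of loading versus binding, the sketch below reads from an array's storage through a raw buffer. The array and variable names are illustrative. Offsets are kept at multiples of the stride so that each load is correctly aligned:

```swift
let values: [UInt32] = [10, 20, 30]

let (rawCount, secondValue, byteTotal) = values.withUnsafeBytes { (raw: UnsafeRawBufferPointer) -> (Int, UInt32, Int) in
    let stride = MemoryLayout<UInt32>.stride
    // Typed read without binding: load a UInt32 from the second element's offset.
    let second = raw.load(fromByteOffset: 1 * stride, as: UInt32.self)
    // Byte-wise read: the raw buffer is a collection of UInt8.
    let total = raw.reduce(0) { $0 + Int($1) }
    return (raw.count, second, total)
}
print(rawCount, secondValue, byteTotal)
```

Nothing here changes what the memory is bound to; the array's storage stays bound to UInt32 throughout.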

rlovelett:

This whole thing started for me when looking at trying to fwrite some information. Sometimes that information comes in the form of a String, sometimes in the form of the Data output of JSONEncoder. I tried to unify that by way of withContiguousStorageIfAvailable, clearly unsuccessfully. In light of this discussion, it feels like JSONEncoder should be returning UnsafeBufferPointer<UInt8> and not Data.

I don't have access to a machine with Swift on it at the moment to verify, but IIRC, fwrite takes a const void *, which I believe should be imported into Swift as an UnsafeRawPointer. If this is the case, you should still be able to abstract over these types using a custom protocol. Because UnsafeBufferPointer<UInt8> itself conforms to ContiguousBytes, you should be able to get a consistent buffer from both String and Data, and write that out.

FWIW, this is definitely a really thorny topic! I think life would be a lot simpler if the language could make some clearer guarantees about convertibility between UInt8 pointers and raw pointers, but with the goal here being safety, and raising the bar relative to how easy it is to make memory-aliasing mistakes in other languages like C, it can be a bit tough to prevent easy mistakes.

rlovelett · April 12, 2021, 1:40am

I finally watched the talk by @Andrew_Trick. I think I am starting to finally understand that subtlety between load and bind. Maybe. Good suggestion; I will be bookmarking that.

To test whether I actually learned something from that video and all the suggestions you've given, I set out to try your suggestion. I think, to some degree, I have been able to do so.

The imported signature for fwrite on Darwin is:

func fwrite(_ __ptr: UnsafeRawPointer!, _ __size: Int, _ __nitems: Int, _ __stream: UnsafeMutablePointer<FILE>!) -> Int

So I have come up with a write(data:) method.

func write(data: ContiguousBytes) {
  ...
  let result = data.withUnsafeBytes { buffer in
    fwrite(buffer.baseAddress!, 1, buffer.count, file)
  }
  ...
}

mutableString.withUTF8 {
  write(data: $0)
}

write(data: Data(bytes: [0x48, 0x65, 0x6C, 0x6C, 0x6F] as [UInt8], count: 5))

This seems to work and I am reasonably happy with this.
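For anyone who wants to try this end to end, here is a self-contained sketch of the same idea. It is not the exact code above: the generic parameter name Bytes, the returned byte count, and the use of tmpfile() as a throwaway stream are choices made for the example, and the FILE pointer type is assumed to import as it does on Darwin.

```swift
import Foundation

// Writes any ContiguousBytes value to a C stream; returns the number of bytes written.
func write<Bytes: ContiguousBytes>(_ bytes: Bytes, to file: UnsafeMutablePointer<FILE>) -> Int {
    bytes.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) -> Int in
        // An empty buffer may have a nil baseAddress, so avoid force-unwrapping it.
        guard let base = buffer.baseAddress else { return 0 }
        // Size 1, count = number of bytes: the raw buffer is always counted in bytes.
        return fwrite(base, 1, buffer.count, file)
    }
}

var greeting = "Hello"
let file = tmpfile()!

// String route: withUTF8 provides an UnsafeBufferPointer<UInt8>, which is ContiguousBytes.
let fromString = greeting.withUTF8 { write($0, to: file) }
// Data route: Data is ContiguousBytes itself.
let fromData = write(Data([0x48, 0x65, 0x6C, 0x6C, 0x6F]), to: file)
fclose(file)
print(fromString, fromData)
```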

Though I cannot seem to convince myself after watching @Andrew_Trick's video that the 1 for size is always right for all types that conform to ContiguousBytes. Is it? How can I convince myself?

Andrew_Trick · April 12, 2021, 1:43am

@itaiferber's answers here are perfect; just adding some commentary.

withContiguousStorageIfAvailable gets you into this mess by giving you a typed pointer. A typed pointer's Element type needs to match the type that the memory is bound to. That rule falls out of the fact that UnsafePointer is used for C interop, so it needs to be at least as conservative as strict aliasing in C to avoid being broken by C compilers. That rule also means that it's incompatible with any "bag of bytes" data type like Data that doesn't completely control its own memory.

For a bag of bytes, you need something like withContiguousBytes.

A collection's Element type does not need to imply anything about the memory's bound type for anything other than Unsafe[Buffer]Pointer. So it would be easy to provide some other withContiguousStorageView that gives you a typed view like this rather than an unsafe pointer:

// A typed, read-only view over raw bytes that never binds the memory.
struct BufferView<Element>: RandomAccessCollection {
  let rawBytes: UnsafeRawBufferPointer
  let count: Int

  init(reinterpret rawBytes: UnsafeRawBufferPointer, as: Element.Type) {
    self.rawBytes = rawBytes
    self.count = rawBytes.count / MemoryLayout<Element>.stride
    // The byte count must be a whole number of elements.
    precondition(self.count * MemoryLayout<Element>.stride == rawBytes.count)
    // load(fromByteOffset:as:) requires correctly aligned memory.
    precondition(Int(bitPattern: rawBytes.baseAddress).isMultiple(of: MemoryLayout<Element>.alignment))
  }

  var startIndex: Int { 0 }

  var endIndex: Int { count }

  // Each access is a raw load, so the view never rebinds the memory.
  subscript(index: Int) -> Element {
    rawBytes.load(fromByteOffset: index * MemoryLayout<Element>.stride, as: Element.self)
  }
}

Andrew_Trick · April 12, 2021, 1:52am

As much as I hate the idea of needing to rebind memory, given the APIs we already have, especially withContiguousStorageIfAvailable and Data, we should really add this feature:
[SR-11087] Add a closure taking API: UnsafeRaw[Mutable][Buffer]Pointer.withMemoryRebound(to:[capacity:])
https://bugs.swift.org/browse/SR-11087

It would go a long way toward helping people work around these problems.
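For readers on a newer toolchain: a closure-taking withMemoryRebound(to:) on raw buffer pointers is, to my knowledge, available from Swift 5.7 (SE-0333). The sketch below assumes such a toolchain and uses illustrative names; it will not compile on the compilers current at the time of this thread.

```swift
import Foundation

let payload = Data([1, 2, 3, 4])

let reboundTotal = payload.withUnsafeBytes { (raw: UnsafeRawBufferPointer) -> Int in
    // Temporarily rebind the raw bytes to UInt8 for the duration of the inner closure.
    // Assumes Swift 5.7 or later.
    raw.withMemoryRebound(to: UInt8.self) { (typed: UnsafeBufferPointer<UInt8>) -> Int in
        typed.reduce(0) { $0 + Int($1) }
    }
}
print(reboundTotal)
```

The typed pointer is only valid inside the inner closure, which keeps the rebinding scoped.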

Andrew_Trick · April 12, 2021, 2:03am

Yep, that's the right way to work with bytes in Swift.

rlovelett:

    fwrite(buffer.baseAddress!, 1, buffer.count, file)
  }
This seems to work and I am reasonably happy with this.

Though I cannot seem to convince myself after watching @Andrew_Trick's video that the 1 for size is always right for all types that conform to ContiguousBytes. Is it? How can I convince myself?

ContiguousBytes is not explicitly documented in this respect. But you need to trust that withUnsafeBytes gives you a buffer whose count equals MemoryLayout<Element>.stride * self.count.
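One way to check this for a concrete type is to compare the two numbers directly. A sketch for Array, with illustrative names:

```swift
let doubles: [Double] = [1.5, 2.5, 3.5]

// The raw buffer is counted in bytes, not in elements.
let doubleByteCount = doubles.withUnsafeBytes { (raw: UnsafeRawBufferPointer) -> Int in
    raw.count
}
let expectedByteCount = MemoryLayout<Double>.stride * doubles.count
print(doubleByteCount, expectedByteCount)
```

Since the count is already in bytes, passing 1 as the size and buffer.count as the number of items to fwrite covers the whole buffer.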

Nevin · April 12, 2021, 2:09am

I would like to understand this perspective better.

You are clearly the expert here, and I have relatively little experience with low-level programming.

My mental model of a computer would indicate that memory is memory, and for a given chunk of memory, sometimes we want to access it as one data-type, and other times we want to access that same memory as a different data-type.

This seems like a basic, fundamental operation on pointers.

Given a pointer to a region of memory, I would expect a systems-level programming language to provide simple, straightforward APIs for getting a pointer to that same region of memory interpreted as any data-type the programmer wants.

itaiferber · April 12, 2021, 2:21am

Andy can give a much more detailed answer, but just wanted to say that this is a really big topic. The answer is much less about actual physical memory access and much more about compiler implementations and optimizations. Along with Andy's talk, if you're somewhat comfortable with C, this is very easily demonstrated in C with its pointer aliasing rules (which we are at least as strict about). Once you have more than one pointer pointing to the same memory address but with different types at the same time, it can get extremely difficult (if not impossible) for the compiler to determine which pointers point where, and analyzing reads and writes through those pointers can become practically impossible. In some cases, it means that you can hardly optimize a program at all, or worse, if you do try to optimize it based on some simple-seeming rules, you end up with completely incorrect behavior.

So, to have any hope of applying many reasonable optimizations, the very short answer to this is that C (and Swift, for similar reasons) disallows having multiple pointers of different types pointing to the same memory location at once. Swift goes a step further and formalizes this as "bound" memory.

If you're comfortable following some C, some examples of pointer aliasing and what can go wrong:

Pointers in C, Part III: The Strict Aliasing Rule - Approxion
The joys and perils of C and C++ aliasing, Part 1
The Strict Aliasing Situation is Pretty Bad (less of an example but more showing how optimizations can differ between compilers)
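On the Swift side, the same ideas look roughly like this. It is a sketch with made-up names showing the three operations discussed in this thread: binding raw memory to a type, temporarily rebinding it to a layout-compatible type, and loading from it raw without any binding.

```swift
func boundMemoryDemo() -> (signed: Int32, byte: UInt8) {
    let count = 4
    let raw = UnsafeMutableRawPointer.allocate(
        byteCount: count * MemoryLayout<UInt32>.stride,
        alignment: MemoryLayout<UInt32>.alignment)
    defer { raw.deallocate() }

    // Bind: from here on the memory is bound to UInt32.
    let words = raw.bindMemory(to: UInt32.self, capacity: count)
    words.initialize(repeating: UInt32.max, count: count)

    // Temporary rebind: Int32 has the same size, stride, and alignment as UInt32,
    // and the rebinding ends when the closure returns.
    let signed = words.withMemoryRebound(to: Int32.self, capacity: count) { $0[0] }

    // Raw load: reading a byte through the raw pointer needs no binding at all.
    let byte = raw.load(as: UInt8.self)
    return (signed, byte)
}

let demo = boundMemoryDemo()
print(demo.signed, demo.byte)
```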

rlovelett · April 12, 2021, 2:27am

It took @itaiferber's excellent tutelage and @Andrew_Trick's WWDC video to finally get that to click for me. Seriously I suggest watching it.

Not to continue to beat the same drum but this feels spot on after having watched the "Safely manage pointers in Swift" WWDC talk mentioned previously. It really talks about how some of these constructs live in the compiler and are part of what makes Swift a safe language. It also shows how you walk down the levels into progressively more unsafe code, with examples of why you might need the different levels of unsafeness.

Nevin · April 12, 2021, 3:47am

Yes, I have watched Andy’s WWDC video (and read many of his posts on these forums), and I have a fair amount of experience with C pointer programming, including what can go wrong with pointer aliasing and how to use the restrict keyword.

I am saying that I would expect a systems-level language to treat “Accessing the same memory first as one type, then as another” as a basic operation, and provide easy-to-understand facilities for doing so.

Therefore, it is surprising to me that an expert in the field would express such opposition to what seems like it should be a fundamental operation on computer memory.

Karl · April 12, 2021, 4:28am

Hehe, it's interesting that you say that. I was watching this talk a little while ago, about adding a TypeSanitizer to LLVM to try and catch TBAA violations, and when he introduces the problem, he says pretty much word-for-word what you just said.

Programmers have this mental model of how the computer works, and how memory works, but those models have a large, compiler-shaped hole in the middle. TBAA is nothing to do with how the computer works, and everything to do with how modern compilers work. Alias analysis is too valuable for optimisation, and if the compiler couldn't rely on type information to do it, it would have almost nothing to work with.

FWIW, the LLVM differential for the sanitiser mentioned in that talk is here. It seems the original author no longer has the time to take it further, but hopefully somebody else does one day. You hear all these massive corporations explaining that memory safety bugs are such a big issue for them, and how they need to make sweeping changes to their software and infrastructure to deal with them (like... uh... creating a new, "safe" language), but at least AFAICT, there seems to be essentially no tooling to help developers find issues that may be lurking in their code.

You're right that it can be quite unintuitive, though. TBAA rules are something that I always spend a lot of time researching and working through until I understand them, until 8-12 months later, when I encounter some code and don't feel confident saying whether it's safe or not. At which point I spend a lot of time researching and working through it until I understand it again. Andy's talks and posts are extremely valuable, but they can't be a substitute for proper tooling IMO.

Nevin · April 12, 2021, 5:06am

Right, the default of non-aliasing is fine.

I’m saying I would expect a language with low-level pointers to have a simple way for the programmer to tell the compiler, “Hey, I’m doing some aliasing here.”

Swift seems to have this, with the various memory-binding APIs, but (a) it is not obvious how to use them, (b) whenever they are discussed on the forums, almost invariably the response is, “No, not like that, you’re breaking the rules and creating undefined behavior,” and (c) the person most familiar with them has just said in this thread that he doesn’t like the idea of needing to rebind memory.

Karl · April 12, 2021, 6:18am

I think the issue with rebinding is that it actually changes the type of the pointer (like a cast in C), which as we've seen, can have subtle side-effects because the type information is important. That said, what you really want most of the time is to load and store values of a different type than the pointer is bound to (type punning). You don't care if the memory is bound to a different type elsewhere in the program. AIUI, we'd need 2 things in order to do that safely:

1. The ability to load/store arbitrary types. We kind-of have that with UnsafeRawPointer.load(fromByteOffset:as:), but it requires an aligned pointer so it doesn't really support arbitrary types. We would need support for unaligned loads and stores.
2. A generic wrapper type which could wrap a raw pointer, but whose subscript would perform an unaligned load/store using the desired Element type. I'm not entirely sure, but I don't think that type would be able to implement withContiguousStorageIfAvailable either, since it wouldn't actually bind its memory to anything or change what it was already bound to.

(Bonus: generic constraints which allow the above to be limited to POD types)
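Until something like that exists, an unaligned load can be approximated by copying bytes into a local value. The sketch below is one way to do it, not an existing API: the function name loadUnalignedInteger is made up, and FixedWidthInteger stands in for the missing "POD only" constraint.

```swift
// Hypothetical helper: reads a T from any byte offset, aligned or not, by copying its bytes.
func loadUnalignedInteger<T: FixedWidthInteger>(
    from raw: UnsafeRawBufferPointer, atByteOffset offset: Int, as type: T.Type
) -> T {
    let size = MemoryLayout<T>.size
    precondition(offset >= 0 && offset + size <= raw.count, "read out of bounds")
    var value: T = 0
    // Copy the bytes into a properly aligned local instead of loading through the pointer.
    withUnsafeMutableBytes(of: &value) { destination in
        destination.copyMemory(from: UnsafeRawBufferPointer(rebasing: raw[offset ..< offset + size]))
    }
    return value
}

// Offset 1 is not 4-byte aligned, so load(fromByteOffset:as:) would not be valid here.
let packed: [UInt8] = [0xAA, 0x01, 0x00, 0x00, 0x00]
let unalignedValue = packed.withUnsafeBytes { (raw: UnsafeRawBufferPointer) -> UInt32 in
    // Interpret the copied bytes as little-endian so the result does not depend on the host.
    UInt32(littleEndian: loadUnalignedInteger(from: raw, atByteOffset: 1, as: UInt32.self))
}
print(unalignedValue)
```

A wrapper type along the lines of item 2 could call a helper like this from its subscript instead of load(fromByteOffset:as:).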