How to use Data.withUnsafeBytes in a well-defined manner?

Data has a method withUnsafeBytes(_:) which allows you to access its contents through a typed UnsafePointer<T>, where the type T is satisfied by the caller (there's also withUnsafeMutableBytes(_:) that gives you mutable access to the underlying data).

However it's not documented exactly what we're allowed to substitute for the type T. Is the following fairly innocuous looking example well-defined?

import Foundation

let data = Data([0xFF])
let i = data.withUnsafeBytes { (ptr: UnsafePointer<Int8>) in
 ptr.pointee
}
print(i) // -1

Looking at the source, assumingMemoryBound(to:) is used in order to get the typed pointer to the underlying bytes – however AFAIK that means that T and UInt8 must be both be related types, and UInt8 must be layout compatible with T (and for withUnsafeMutableBytes(_:) this would need to extend to mutual layout compatibility).

From that I would conclude that the above example exhibits undefined behaviour due to the violation of strict aliasing, assuming that Int8 and UInt8 are unrelated types. Is this correct?

If so, it would be great if the documentation for both withUnsafeBytes(_:) and withUnsafeMutableBytes(_:) could be updated to make the user aware of this very easy to fall into pit (especially given the fact the user doesn't always have to spell out the type T explicitly, it can be inferred).

I did notice a proposal from December 2016 that aimed to replace withUnsafeBytes(_:)'s UnsafePointer<T> argument with an UnsafeRawBufferPointer (note the link in the post is broken, you have to expand the body), which I would support – but it was deferred until sometime after Swift 4. Now that we're in the window for Swift 5, would now be a good time to re-visit this proposal?

Strictly speaking, Foundation should probably use withMemoryRebound(to:capacity:) rather than assumingMemoryBound(to:), but that's on us, not the user. It's not supposed to be required for the user to write this, and it should not be documented as a pitfall.

(@Andrew_Trick, @Philippe_Hausler?)

Thanks, though even with withMemoryRebound(to:capacity:), I believe something like this, which still looks fairly innocuous, would be undefined:

let data = Data([0xFF, 0xFF])
let j = data.withUnsafeBytes { (ptr: UnsafePointer<UInt16>) in
  ptr.pointee
}

as withMemoryRebound requires a type with the same size and alignment. Is it intentional for the above usage of withUnsafeBytes to be undefined? (or is it another case of being an implementation detail?).

Regardless, I really do think the documentation in general for both withUnsafeBytes and withUnsafeMutableBytes needs to be updated specifying what preconditions are needed for the type T in order for them to work in a well-defined manner.

Hamish is right. Thank you for pointing this out.

Data.withUnsafeBytes(_) cannot use withMemoryReboundTo(to:capacity:) because that requires a typed pointer to begin with. Data is a raw byte buffer so has no idea how the memory may have previously been typed.

Originally, the Data.withUnsafeBytes used bindMemory, which avoided undefined
behavior in the above example. (It still encouraged undefined behavior
when using Data(bytesNoCopy:) though). For some reason, that was changed to assumingMemoryBound(to:). I either didn't notice that change or can't remember it.

FYI: here's an open bug on fixing the Foundation.Data API. Feel free to file more, particularly one to change the implementation back to bindMemory!

To summarize the strategy that I think makes sense: Anyone using Data.withUnsafeBytes() to manipulate the bytes should be reading/writing memory through an Unsafe[Mutable]RawPointer, unless either:

(a) they avoid the Data(bytesNoCopy:) initializer.

(b) they have some special, certain knowledge of the how the non-copied memory
is used outside of the Data object.

3 Likes

Thanks Andrew – I went ahead and filed SR-7726.

Would it be worth filing a separate bug to update the documentation for withUnsafe(Mutable)Bytes in order to make users aware of the fact that it will be rebinding memory (therefore they need to ensure layout compatibility, and need to be careful with Data(bytesNoCopy:))?

I'm not sure where to best document these tips, but agree it should be explained somewhere. It seems worthy of a bug. I could help with the wording and review...

cc @nnnnnnnn @Philippe_Hausler

Thanks Andy, this isn't well covered. I'm particularly not happy with the way the name inverts the naming from the standard library (i.e. bytes, but you get a typed pointer). I'll take a look and ping you.

To be clear, this isn't as horrible as the example above makes it look. Data initializes it's own copy of the underlying storage using a raw pointer. So, the only way you get undefined behavior is if the user calls withUnsafeBytes with different types on the same Data object (and even that's ok if it's read-only):

import Foundation

let data = Data([0xFF])
let i = data.withUnsafeBytes { (ptr: UnsafePointer<Int8>) in
 ptr.pointee
}
// Now *this* is bad.
data.withUnsafeBytes { (ptr: UnsafePointer<UInt8>) in
 ptr.pointee = 3
}
print(i) // -1

Ah, I see. I didn't quite appreciate before that the underlying memory in Data([0xFF]) is completely untyped – I incorrectly presumed it would be bound to UInt8 in that case, but looking at the implementation I now see that malloc and memmove are used. Thanks for clarifying that point!

How then does this play with Data's subscript that returns a UInt8? Currently the implementation uses assumingMemoryBound(to: UInt8.self) on the raw pointer to the underlying memory in order to access the byte at a given index (this is also used in handful of other places in the implementation in places such as appending and getting the hash).

Now that I think about it, if we changed the implementation of withUnsafeBytes back to using bindMemory, wouldn't we have undefined behaviour for those other places in the implementation if withUnsafeBytes is used with an unrelated type? Or would we also need to change the implementation in those cases too? (e.g using load(as:) and storeBytes(of:as:) for the subscript logic).

I'm not sure I completely follow you here; are you saying that

let data = Data([0xFF])
let i = data.withUnsafeBytes { (ptr: UnsafePointer<Int8>) in
 ptr.pointee
}

on its own is well-defined with the current implementation of withUnsafeBytes (using assumingMemoryBound(to:), that is)? Is it not UB to perform typed operations on untyped memory?

Also, I filed a bug for the documentation: SR-7755

Yes. Data is a wrapper around UnsafeMutableRawBufferPointer that provides interop with frameworks. Viewing it as strongly typed memory is just wrong unless we provide a typed wrapper on top of it. If all we care about is the convenience of using a typed pointer, we could easily introduce a typed buffer pointer that always accesses memory using raw memory operations (i.e. the API would be typed but the backing store would be raw bytes).

The implementation of Data's subscript, and most of the other uses of assumingMemoryBound(to:) should just use the UnsafeRawPointer API internally. Here's where you say "should I file a bug" and I say "yes please". I think separate smaller bugs are more likely to get results than my old bug that seems to have died.

Yes, exactly, the internal Data implementation should only work with raw memory.

Well, I suppose we could write the spec that way. But I don't think that would be a useful rule. The goal of the memory model is to allow the compiler to make all of its usual, helpful assumptions about typed memory access while providing a set of rules that can be precisely verified at runtime. The rule can simply be that we don't have multiple accesses to the same memory location of unrelated types where at least one access modifies memory.

So, we want to strongly discourage using assumingMemoryBound on untyped memory (for that matter we want to strongly discourage using assumingMemoryBound ever)--it's a sign of a broken API. But we don't need to declare it undefined behavior.

1 Like

Thanks Andrew, that greatly clarified things :) I've filed a few more bugs: SR-7849, SR-7850, SR-7851.

Is there currently a timeline for the writing up a formal spec specifying the rules of Swift's memory model; in particular, the rules over what types are considered related and/or layout compatible?

2 Likes

Thank you for the bug reports Hamish!

No, there is no timeline for a formal spec (that would require a lot of careful attention from and deliberation among people who are busy with urgent work).

I'm afraid you'll have to make due with this informal writeup:

https://github.com/atrick/swift/blob/type-safe-mem-docs/docs/TypeSafeMemory.rst

It captures my own understanding as an implementer. It's decidedly not part of the language, so there's no future guarantee. And it hasn't seen much review, so it's likely incomplete in areas. But I think it's unlikely that we'll want to make more aggressive assumptions in the compiler than what I've already proposed there.

1 Like