Foundation Data does not support an internal representation as contiguous storage?

lukasa · April 12, 2021, 7:48am

I’m not @Andrew_Trick and won’t speak for him, but I’ll add a voice here: ideally, you never alias. The moment you allow aliasing pointers you greatly hamstring the compiler’s ability to make optimizations: in particular, loads and stores cannot be arbitrarily omitted or re-ordered because the compiler has no idea if they affect each other. In practice you create a zone of the program within which all loads/stores must be maximally pessimistic and assume that any operation that touches any other memory may have affected the load/store that you’re about to do.

So in general, much better than aliasing is…well…not aliasing. You can almost always avoid aliasing by performing bytewise copies and then binding the new memory location. That allows you to spill to the stack, for example, to create appropriately typed temporaries. While the copy is a mild performance penalty, it unlocks substantial optimization possibility within the compiler that will almost certainly pay itself back.
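As a rough illustration of the copy-instead-of-alias approach, here is a minimal sketch. The function name `readUInt32(from:at:)` and the sample bytes are made up for this example; only the standard library calls are real.

```swift
// Hypothetical helper: read a UInt32 out of byte storage without ever
// creating a typed UInt32 pointer into that storage.
func readUInt32(from bytes: [UInt8], at offset: Int) -> UInt32 {
    let size = MemoryLayout<UInt32>.size
    precondition(offset >= 0 && offset + size <= bytes.count, "offset out of range")

    var value: UInt32 = 0  // appropriately typed temporary, typically on the stack
    withUnsafeMutableBytes(of: &value) { destination in
        bytes.withUnsafeBytes { source in
            // Bytewise copy: the UInt8 storage is only ever seen as raw bytes.
            let slice = UnsafeRawBufferPointer(rebasing: source[offset ..< offset + size])
            destination.copyMemory(from: slice)
        }
    }
    return value
}

let sampleBytes: [UInt8] = [0x01, 0x02, 0x03, 0x04, 0x05, 0x06]
let copied = readUInt32(from: sampleBytes, at: 1)
```

The copy also removes any alignment concern, since the temporary is always correctly aligned for its own type.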

So, to clarify: if Swift seems to have the ability to say “let me alias”, that is an unfortunate confusion. Swift’s aliasing rules are very simple: you may never alias typed pointers with other typed pointers. If you have a typed memory region, you may only access it either as that type or as a raw pointer. You may never alias it. From this, the rules of the Swift operations follow:

(1) bindMemory(to:) establishes a new binding. If the new binding matches other bound pointers in the program, no alias has been established and those other pointers may be used freely. If the new binding is to a different type, all other bound pointers are invalidated and using them invokes undefined behaviour, i.e. aliasing is forbidden.
(2) assumingMemoryBound is bindMemory, but you invoke undefined behaviour earlier in the process, at the point of making the assumption. In principle Swift could have a runtime memory aliasing checker that would validate this assumption for you. It’s only safe to use when you know where the pointer came from: otherwise it’s unsafe.
(3) withMemoryRebound is the union of two bind operations: bind to the new pointer type, followed by bind back to the original type. As it’s a composition of two bind operations the semantics should follow from (1): during the block it is undefined behaviour to access this memory as its original type.
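A small sketch of the three operations side by side may help. The function `demonstrateBindings()` is a hypothetical name, and the choice of UInt32 and Int32 (two trivial types of identical size) is an assumption made to keep the rebinding uncontroversial.

```swift
// Hypothetical demonstration of bindMemory, assumingMemoryBound and withMemoryRebound.
func demonstrateBindings() -> (first: UInt32, reinterpreted: Int32) {
    let count = 4
    let raw = UnsafeMutableRawPointer.allocate(
        byteCount: count * MemoryLayout<UInt32>.stride,
        alignment: MemoryLayout<UInt32>.alignment)
    defer { raw.deallocate() }

    // bindMemory: establish the binding to UInt32, then initialize.
    let words = raw.bindMemory(to: UInt32.self, capacity: count)
    words.initialize(repeating: UInt32.max, count: count)

    // assumingMemoryBound: valid only because we know the binding above exists.
    let sameWords = raw.assumingMemoryBound(to: UInt32.self)
    let first = sameWords[0]

    // withMemoryRebound: bind to Int32 for the closure, bind back on exit.
    let reinterpreted = words.withMemoryRebound(to: Int32.self, capacity: count) { signed in
        // `words` and `sameWords` must not be accessed inside this closure.
        signed[0]
    }
    return (first, reinterpreted)
}

let bindingResult = demonstrateBindings()
```

Note that the typed pointer is never touched while the memory is rebound; that is the part the compiler relies on.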

So I agree with @Andrew_Trick and @Karl: we do not want to make it easy to alias memory, we want to make it easier to perform type punning, and easier to interoperate with C. Andy has marked this out above with the most obvious missing piece: temporarily binding a raw pointer, which would ease C interop enormously. Other than that, the advice for type punning is usually to use raw memory and load/store: those should be enhanced to allow unaligned loads/stores.

Nevin · April 12, 2021, 1:46pm

Okay, say I have a typed pointer to a buffer of T, and I want to access its memory as both T and U, interleaved with each other. So, type-punning.

If I understand correctly, you are saying I should

(i) Get a raw pointer to the same memory.
(ii) Load individual bytes to construct a U (or in the future, perform an unaligned load).
(iii) When done with that instance of U, store it back (currently byte-for-byte).
(iv) Repeat those operations, interleaved with normal accesses through the original pointer.

Already, the original T-typed pointer and the raw pointer clearly alias, so the compiler cannot optimize as aggressively as if they did not. That’s fine, they really do alias.
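For concreteness, steps (i) to (iv) might look roughly like this, taking T as UInt32 and U as UInt16. The function name `swapHalves(in:at:)` and the sample values are invented for illustration.

```swift
// Hypothetical helper: swap the two 16-bit halves of one UInt32 element,
// going through a raw view of the buffer rather than a UInt16-typed pointer.
func swapHalves(in words: UnsafeMutableBufferPointer<UInt32>, at index: Int) {
    precondition(words.indices.contains(index), "index out of range")

    // (i) Get a raw pointer to the same memory.
    let raw = UnsafeMutableRawBufferPointer(words)
    let offset = index * MemoryLayout<UInt32>.stride

    // (ii) Load U values from the raw bytes (these offsets are 2-byte aligned).
    let firstHalf = raw.load(fromByteOffset: offset, as: UInt16.self)
    let secondHalf = raw.load(fromByteOffset: offset + 2, as: UInt16.self)

    // (iii) Store them back through the raw pointer.
    raw.storeBytes(of: secondHalf, toByteOffset: offset, as: UInt16.self)
    raw.storeBytes(of: firstHalf, toByteOffset: offset + 2, as: UInt16.self)
}

var punnedStorage: [UInt32] = [0x0001_0002, 0xAAAA_BBBB]
punnedStorage.withUnsafeMutableBufferPointer { words in
    swapHalves(in: words, at: 0)
    // (iv) Ordinary typed access through the original pointer.
    words[1] &+= 1
}
```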

• • •

That seems needlessly clunky to me.

I can picture an API along the lines of “Given a typed pointer, create a type-punned pointer aliasing it.”

The compiler would know they alias, just like the raw pointer version, so there’s no change to performance or correctness. But there is a substantial simplification for the programmer:

(i) Get a type-punned pointer to the same memory.
(ii) Use both pointers to read and write as desired.

• • •

I could be misunderstanding, but the impression I get is that some of the people most familiar with low-level pointer programming here are opposed to making such a simple API, for reasons I do not grasp.

As near as I can tell, those reasons sound a lot like, “I wish it were never necessary, therefore I do not want to make it easy to do correctly.”

But for users of the language, the result of not making it easy to do correctly is that it ends up being done incorrectly, which from my perspective is what we should really strive to avoid.

The way to avoid incorrect type-punning is not by making all type-punning difficult, but rather by making correct type-punning easy.

• • •

Edit: perhaps you were saying “Step (iv) should actually be more loads and stores from the raw pointer, to construct instances of T.”

In that case, with everything going through the raw pointer, there would no longer be aliasing problems, which is great. But then the programming experience is even clunkier.

The equivalent simple API would be a type-punning pointer that can access its memory as either T or U.

The point remains: whatever the actual recommended semantics, we can make it easy for programmers to do correctly.

lukasa · April 12, 2021, 2:30pm

Ah, yes, that API can absolutely be provided. But you add:

Yes, an API can be provided that does this. But it will necessarily be implemented on top of the raw pointer API. There is potentially some use to this (though I dispute how useful it will be), but it cannot help but be a wrapper on top of that type.

This is what UnsafeRawPointer is: a type-punning pointer that can access its memory as any T. You seem to implicitly want something else though, which I derive from this:

This seems to me to get to the heart of the matter. When you talk about making "correct type punning easy", I agree that it should be easy to do. But, importantly, I also believe that none of Unsafe[Mutable][Buffer]Pointer<T> should do it. Those pointers are used for interoperability with C, and so their aliasing rules must be at least as strict as C's. Additionally, for the same performance reasons that C strict aliasing exists, Swift should implement the same strict aliasing rules for these types, as they are used for foundational data structures that benefit from those optimisations.

Thus, any pointer that can be used to type-pun must not be one of these bound type pointers, but instead be something else. That something else will be fundamentally similar to UnsafeRawPointer, and indeed almost certainly build directly on top of that type. Put another way, the typed pointers implicitly include in their contract that they are assumed to follow strict aliasing. The raw pointers explicitly include in their contract that they may alias any other pointer in the program and so must always perform all loads and stores explicitly.
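To make "a wrapper on top of the raw pointer" concrete, here is one possible shape. `PunningPointer` is a hypothetical type, not something in the standard library, and the sketch assumes only trivial (bitwise-copyable) types are read or written at suitably aligned offsets.

```swift
/// Hypothetical type-punning wrapper; every access is an explicit raw load or store.
struct PunningPointer {
    let base: UnsafeMutableRawPointer

    /// Reads a value of any trivial type from the given byte offset.
    func read<T>(_ type: T.Type, atByteOffset offset: Int = 0) -> T {
        base.load(fromByteOffset: offset, as: T.self)
    }

    /// Writes a value of any trivial type at the given byte offset.
    func write<T>(_ value: T, atByteOffset offset: Int = 0) {
        base.storeBytes(of: value, toByteOffset: offset, as: T.self)
    }
}

let punnedMemory = UnsafeMutableRawPointer.allocate(byteCount: 8, alignment: 8)
let punned = PunningPointer(base: punnedMemory)
punned.write(Double(1.0))                 // store as Double
let bits = punned.read(UInt64.self)       // read the same bytes as UInt64
punnedMemory.deallocate()
```

Because no typed pointer is ever formed, nothing here makes a strict-aliasing promise that could be broken.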

Providing syntactic sugar around this operation is helpful, but I don't think it's the most important thing to do. My instinct is that the vast majority of times people need to change the type of a pointer, it is interaction with a C API. Here, the thing we want is to take a raw pointer and temporarily bind it. Within this scope we are making a promise to the compiler: we promise that we are not aliasing this memory. The API to achieve this is @Andrew_Trick's proposed UnsafeRawPointer.withMemoryRebound(to:) API.
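A sketch of what temporarily binding a raw pointer for a C-style call could look like. This assumes a toolchain where withMemoryRebound(to:capacity:_:) is available on UnsafeRawPointer (Swift 5.7 or later, as far as I know); `cStyleLength` is a stand-in written in Swift for an imported C function, and the other names are invented.

```swift
// Stand-in for an imported C function such as `size_t strlen(const char *)`.
func cStyleLength(_ string: UnsafePointer<CChar>) -> Int {
    var length = 0
    while string[length] != 0 { length += 1 }
    return length
}

// Hypothetical helper: bind the raw memory to CChar only for the duration of the call.
func nulTerminatedLength(at raw: UnsafeRawPointer, capacity: Int) -> Int {
    raw.withMemoryRebound(to: CChar.self, capacity: capacity) { chars in
        // Within this scope we promise nothing else accesses the memory as another type.
        cStyleLength(chars)
    }
}

let payload: [UInt8] = [0x68, 0x69, 0x00]  // "hi" followed by a NUL terminator
let measuredLength = payload.withUnsafeBytes { buffer in
    nulTerminatedLength(at: buffer.baseAddress!, capacity: buffer.count)
}
```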

lukasa · April 12, 2021, 2:40pm

With all of the above said, I think there may be some bugs in the alias analysis of swiftc.

scanon · April 12, 2021, 2:43pm

Concretely, with the exception of assembly, I don't think any mainstream "systems" language views memory this way (OK, that's a lie, BASIC did/does, but with a much simpler type system, so it's a much weaker change). You always have to use some specific blessed mechanism, rather than it being the default behavior for normal accesses.

As an assembly programmer at heart, I often find this frustrating, but it's definitely the norm for compiled languages.

Dante-Broggi · April 12, 2021, 3:32pm

Rust does not have any TBAA, so accessing memory as multiple types is primitive.

IIUC, this is because the common case is &mut and in that case the memory can only be accessed by one type anyway.

Karl · April 12, 2021, 5:29pm

Oh that's interesting. It seems like they allow immutable borrows to alias, and guarantee that mutable borrows never alias anything else (Rust doesn't have classes - only value types and "traits"). It seems like it may be possible to violate that using UnsafeCell, but it doesn't expose memory rebinding APIs, so even if you manage to write incorrect code, aliasing mutable pointers won't differ by type.

I don't see it as being extremely different from the guarantees that Swift offers. I see a Rust mutable borrow as more like an inout T than an UnsafePointer<T> in Swift, and the law of exclusivity gives us similar non-aliasing guarantees in that case (modulo classes).

I am a bit uncomfortable with how often we use the UnsafePointer<T> types, though - for instance, I'm not a fan of how we use these APIs as a shorthand for any and all contiguous storage (withContiguousStorageIfAvailable, String.withUTF8, etc), and how we often don't even make the word "unsafe" very prominent in the name. We should have a better type which doesn't tempt people to abuse it by exposing typed pointers and memory rebinding APIs.