ManagedBuffer alignment?

taylorswift · August 5, 2020, 3:32am

i would like to use a ManagedBuffer instance as a raw buffer with 16-byte alignment. The buffer is heterogeneous so semantically, the Element type should be UInt8, but the documentation says that wouldn’t produce any alignment. Should i force the alignment by setting the Element to something like SIMD16<UInt8> or is there a better way to do this?

this is the memory layout (per element) i am trying to implement:

      +0 ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
         │UInt8╎UInt8╎UInt8╎UInt8╎UInt8╎UInt8╎UInt8╎UInt8│    loads as
      +8 ├─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┤   SIMD16<UInt8>
         │UInt8╎UInt8╎UInt8╎UInt8╎UInt8╎UInt8╎UInt8╎UInt8│
         └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
     +16 ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
         │         UInt32        ╎   UInt16  │     ╎     │
     +24 ├─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┤ 
         │         UInt32        ╎   UInt16  │     ╎     │
     +32 ├─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┤ 
         │         UInt32        ╎   UInt16  │     ╎     │
     +40 ├─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┤
         ╷                     . . .                     ╷
    +128 ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
         │         UInt32        ╎   UInt16  │     ╎     │
    +136 ├─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┤ 
         │         UInt32        ╎   UInt16  │   UInt16  │
    +144 └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

lorentey · August 5, 2020, 5:27am

I think what this use case needs is a ManagedRawBuffer class, with a create method that takes an explicit alignment, like UnsafeMutableRawBufferPointer.allocate. Ideally, when we add this, we'd also flesh out the existing raw pointer APIs with e.g. new helpers for figuring out alignments within such heterogeneous storage. (This would resolve problem #4 in the (only tangentially related) pitch I just posted in Evolution.)
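For comparison, here is a minimal sketch of what an explicit-alignment raw allocation looks like today without ManagedBuffer, using UnsafeMutableRawBufferPointer.allocate(byteCount:alignment:). The function name rawBlockRoundTrip and the 144-byte block size (taken from the diagram above) are only illustrative assumptions, and unlike a ManagedBuffer the allocation here has to be freed by hand.

```swift
// Hypothetical helper: allocates one 144-byte block with 16-byte alignment,
// writes a vector and the first row key, and reads both back.
func rawBlockRoundTrip() -> (aligned: Bool, vector: SIMD16<UInt8>, key: UInt32) {
    let raw = UnsafeMutableRawBufferPointer.allocate(byteCount: 144, alignment: 16)
    defer { raw.deallocate() }   // manual lifetime management, unlike ManagedBuffer

    // The allocation honors the requested alignment.
    let aligned = UInt(bitPattern: raw.baseAddress!) % 16 == 0

    // Bytes 0..<16 hold the vector; the first integer row starts at byte 16.
    raw.storeBytes(of: SIMD16<UInt8>(repeating: 7), toByteOffset: 0,
                   as: SIMD16<UInt8>.self)
    raw.storeBytes(of: UInt32(42), toByteOffset: 16, as: UInt32.self)

    return (aligned,
            raw.load(fromByteOffset: 0, as: SIMD16<UInt8>.self),
            raw.load(fromByteOffset: 16, as: UInt32.self))
}
```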

Alternatively, we could provide standard ManagedBuffer variants that allow multiple Element types, but that could be difficult to do well without variadic generics.

If you only ever need to store a single SIMD16<UInt8> value, it is most likely a better idea to move it to Header, and set the Element type to something that holds one row of integers. (Say, UInt64, (UInt32, UInt16, UInt16), or an equivalent struct.)
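A rough sketch of that arrangement, assuming a single vector per buffer: the SIMD16<UInt8> value lives in the Header and each Element is one row. IntegerRow, its field names, and makeBuffer are made-up names for illustration, and a Swift struct does not come with the layout guarantees of a C-imported one.

```swift
// Hypothetical row type: UInt32 + UInt16 + UInt16, 8 bytes per row.
struct IntegerRow {
    var key: UInt32
    var value: UInt16
    var extra: UInt16
}

// Creates a buffer whose header is the vector and whose elements are rows.
func makeBuffer(rows: Int) -> ManagedBuffer<SIMD16<UInt8>, IntegerRow> {
    let buffer = ManagedBuffer<SIMD16<UInt8>, IntegerRow>.create(
        minimumCapacity: rows) { _ in SIMD16<UInt8>(repeating: 0) }

    // Element storage starts out uninitialized, so initialize every row.
    buffer.withUnsafeMutablePointerToElements {
        $0.initialize(repeating: IntegerRow(key: 0, value: 0, extra: 0),
                      count: rows)
    }
    return buffer
}
```

Because IntegerRow is a trivial type, no deinitialization is needed when the buffer is destroyed.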

If you have strict layout requirements for the integer rows, then representing them with UInt64 or a struct imported from C is probably the best option.

If you need multiple SIMD16 values before the plain integers (or if you need to guarantee that the SIMD value is laid out immediately before the integer rows), then setting Element to SIMD16<UInt8> is probably the best you can do! You can then rebind the parts that correspond to the plain integer rows to one of the types above. Be careful when calculating the capacity of the buffer -- it will need some slightly creative accounting.
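The "creative accounting" could look something like the sketch below, which assumes 8-byte rows and rounds each block up to a whole number of 16-byte elements. The names elementCount and makeStorage are hypothetical, and the rebinding of the row regions is left out to keep the example short.

```swift
// Hypothetical accounting helper: how many SIMD16<UInt8> elements are needed
// to cover `blocks` blocks of one 16-byte vector plus `rowsPerBlock` rows.
func elementCount(blocks: Int, rowsPerBlock: Int) -> Int {
    let vectorBytes = MemoryLayout<SIMD16<UInt8>>.stride   // 16
    let rowBytes = 8                                        // assumed row size
    let bytesPerBlock = vectorBytes + rowBytes * rowsPerBlock
    // Round each block up to a whole number of 16-byte elements.
    let elementsPerBlock = (bytesPerBlock + vectorBytes - 1) / vectorBytes
    return blocks * elementsPerBlock
}

// Allocates zero-initialized storage whose Element type forces 16-byte alignment.
func makeStorage(blocks: Int, rowsPerBlock: Int) -> ManagedBuffer<Void, SIMD16<UInt8>> {
    let count = elementCount(blocks: blocks, rowsPerBlock: rowsPerBlock)
    let buffer = ManagedBuffer<Void, SIMD16<UInt8>>.create(
        minimumCapacity: count) { _ in () }
    buffer.withUnsafeMutablePointerToElements {
        $0.initialize(repeating: SIMD16<UInt8>(repeating: 0), count: count)
    }
    return buffer
}
```

With 14 rows per block, one block is exactly 8 elements (128 bytes); with 16 rows it is 9 elements (144 bytes).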

taylorswift · August 5, 2020, 8:30pm

The layout in the diagram is for one element, meaning there is an SIMD16<UInt8> at the beginning of each block. originally i was going to store 16 rows per block, but now i think it would make more sense to store 14, so that the element lines up with the cache lines. of course, this would require some way of producing a 128-byte alignment…

There’s no cost to raw pointer rebinding calls, right? meaning i could just write subscripts that each have their own withUnsafe____ context inside them, and it would be the same as loading values directly at the use site from within a single raw pointer context?

lorentey · August 5, 2020, 10:03pm

This sounds more like a homogeneous buffer with a relatively large Element type. This simplifies things, because you don't need to mess with bindings at all -- all you need is to define an Element struct that has the layout you want. This will probably require defining it in C.

Swift doesn't support alignments larger than 16 (or is it 32?) bytes. You can work around this by allocating more elements than you actually need, and manually offsetting pointers so that the elements start on a 128-byte boundary.
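A sketch of that workaround, assuming the storage uses SIMD16<UInt8> elements (so it starts 16-byte aligned) and each block is 128 bytes: seven spare elements are enough to reach the next 128-byte boundary. Both alignedUp and withAlignedBlocks are hypothetical helpers, not standard library API.

```swift
// Rounds a pointer up to the next multiple of `alignment` (a power of two).
func alignedUp(_ pointer: UnsafeMutableRawPointer,
               to alignment: Int) -> UnsafeMutableRawPointer {
    let address = UInt(bitPattern: pointer)
    let mask = UInt(alignment) - 1
    let aligned = (address + mask) & ~mask
    return pointer + Int(aligned - address)
}

// Over-allocates by 7 elements (112 bytes) so that `blocks` full 128-byte
// blocks fit after the first 128-byte boundary inside the allocation.
func withAlignedBlocks<R>(blocks: Int,
                          _ body: (UnsafeMutableRawPointer) -> R) -> R {
    let elementsPerBlock = 128 / MemoryLayout<SIMD16<UInt8>>.stride   // 8
    let slack = elementsPerBlock - 1                                   // 7
    let capacity = blocks * elementsPerBlock + slack
    let buffer = ManagedBuffer<Void, SIMD16<UInt8>>.create(
        minimumCapacity: capacity) { _ in () }

    return buffer.withUnsafeMutablePointerToElements {
        (elements: UnsafeMutablePointer<SIMD16<UInt8>>) -> R in
        elements.initialize(repeating: SIMD16<UInt8>(repeating: 0),
                            count: capacity)
        // Hand out the first 128-byte-aligned address within the storage.
        return body(alignedUp(UnsafeMutableRawPointer(elements), to: 128))
    }
}
```

The aligned pointer is only valid inside the closure, and the offset has to be recomputed for every new allocation.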

Rebinding calls don't have a runtime cost, but I wouldn't want to repeatedly rebind things on every access, even if a use case required binding tricks. (I think of bindMemory(to:) as performing a mutation of the abstract execution state, even if I'm only doing it to do read-only accesses. It's like painting the memory locations to a particular color -- I wouldn't want to put on a new coat of paint if it already looks right...) Once the memory is correctly bound and initialized, you'd just need to use assumingMemoryBound(to:) to get a typed pointer that relies on the existing binding without changing it.
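In code, the bind-once pattern might look like the sketch below. The Row type (a UInt32 key and a UInt16 value, stride 8) and the function name are assumptions for illustration; the block layout follows the 16-byte vector plus 14 rows discussed above.

```swift
// Hypothetical row type: size 6, stride 8, alignment 4.
struct Row {
    var key: UInt32
    var value: UInt16
}

func bindOnceThenAssume() -> UInt16 {
    let base = UnsafeMutableRawPointer.allocate(byteCount: 128, alignment: 16)
    defer { base.deallocate() }

    // Bind and initialize the row region exactly once, at setup time.
    let rows = (base + 16).bindMemory(to: Row.self, capacity: 14)
    rows.initialize(repeating: Row(key: 0, value: 0), count: 14)
    rows[3] = Row(key: 99, value: 5)

    // Later accesses rely on the existing binding instead of rebinding.
    let sameRows = (base + 16).assumingMemoryBound(to: Row.self)
    return sameRows[3].value
}
```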

taylorswift · August 6, 2020, 2:16am

i don’t see how this prevents the need to rebind memory. the fixed-size array in the C struct would get imported as a ((UInt32, UInt16), (UInt32, UInt16), ..., (UInt32, UInt16)) tuple, which would still need to be rebound to a buffer of (UInt32, UInt16) elements.

i’m also not sure how this would interact with mutation, as the struct is 128B in size, and _modify only works with structs that are less than 32B in size.

taylorswift · August 9, 2020, 4:16am

i know this is late, but my experiments have shown the opposite: swift seems to have a hard time condensing address calculations into compact lea/mov instructions when using “properly” bound pointers. for some reason, it seems to think differently-typed pointers into the same struct have no relation to each other. For example, compare the generated assembly for find(key:) using this computed property:

// accessor:
extension General.Dictionary.District 
{
    var items:UnsafeMutablePointer<Row> 
    {
        (self.base + 16).bindMemory(to: Row.self, capacity: 14)
    } 
}
// call site:
                    guard district.items[i].key == key 
                    else 
                    {
                        ...
                    }
                    
                    return district.items[i].value

assembly:

to this version which just does it on-the-fly with UnsafeMutableRawPointer.load/storeBytes:

// accessor:
extension General.Dictionary.District 
{
    subscript(index:Int) -> Row 
    {
        get 
        {
            (self.base).load(
                fromByteOffset: 16 &+ 8 &* index, as: Row.self)
        }
        nonmutating 
        set(value)
        {
            (self.base).storeBytes(of: value, 
                toByteOffset: 16 &+ 8 &* index, as: Row.self)
        }
    }
}

// call site:
                    guard district[i].key == key 
                    else 
                    {
                        ...
                    }
                    
                    return district[i].value

assembly:

compared to the first version, the second version overwrites one fewer callee-save register and has simpler address calculations.