Use rowBytes method from MPSMatrixDescriptor for matrix multiplication

I'm using Metal Performance Shaders to perform matrix multiplication as shown below. I know such small matrices are not well suited for Metal, but I'm using them so I can print the result and check the calculation.

// main.swift

import MetalPerformanceShaders

// Arrays and their rows and columns

let a: [Float] = [1, 2, 3,
                  4, 5, 6,
                  7, 8, 9]

let b: [Float] = [1, 2, 3,
                  4, 5, 6,
                  7, 8, 9]

let rowsA = 3
let columnsA = 3

let rowsB = columnsA
let columnsB = 3

let rowsC = rowsA
let columnsC = columnsB

// Setup the Metal matrices

guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("Failed to get GPU (Metal device)")
}

let bufferA = device.makeBuffer(bytes: a, length: rowsA * columnsA * MemoryLayout<Float>.stride, options: [])!
let bufferB = device.makeBuffer(bytes: b, length: rowsB * columnsB * MemoryLayout<Float>.stride, options: [])!
let bufferC = device.makeBuffer(length: rowsC * columnsC * MemoryLayout<Float>.stride, options: [])!

let descA = MPSMatrixDescriptor(dimensions: rowsA, columns: columnsA, rowBytes: columnsA * MemoryLayout<Float>.stride, dataType: .float32)
let descB = MPSMatrixDescriptor(dimensions: rowsB, columns: columnsB, rowBytes: columnsB * MemoryLayout<Float>.stride, dataType: .float32)
let descC = MPSMatrixDescriptor(dimensions: rowsC, columns: columnsC, rowBytes: columnsC * MemoryLayout<Float>.stride, dataType: .float32)

let matrixA = MPSMatrix(buffer: bufferA, descriptor: descA)
let matrixB = MPSMatrix(buffer: bufferB, descriptor: descB)
let matrixC = MPSMatrix(buffer: bufferC, descriptor: descC)

// Perform matrix multiplication using Metal

let commandBuffer = device.makeCommandQueue()!.makeCommandBuffer()!

let mul = MPSMatrixMultiplication(device: device, resultRows: rowsC, resultColumns: columnsC, interiorColumns: columnsA)
mul.encode(commandBuffer: commandBuffer, leftMatrix: matrixA, rightMatrix: matrixB, resultMatrix: matrixC)

commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// Print result

let rawPointer = matrixC.data.contents()
let floatPointer = rawPointer.bindMemory(to: Float.self, capacity: rowsC * columnsC)
let bufferPointer = UnsafeBufferPointer(start: floatPointer, count: rowsC * columnsC)
let arrayC = Array(bufferPointer)

for i in 0..<matrixC.rows {
    for j in 0..<matrixC.columns {
        print(arrayC[i * matrixC.columns + j], terminator: "  ")
    }
    print("")
}

This prints the following:

30.0  36.0  42.0  
66.0  81.0  96.0  
102.0  126.0  150.0  

The Metal docs suggest using the rowBytes(forColumns:dataType:) type method to determine the recommended matrix row stride, in bytes, for a given number of columns. So I tried the following:

let rowBytesA = MPSMatrixDescriptor.rowBytes(forColumns: columnsA, dataType: .float32)
let rowBytesB = MPSMatrixDescriptor.rowBytes(forColumns: columnsB, dataType: .float32)
let rowBytesC = MPSMatrixDescriptor.rowBytes(forColumns: columnsC, dataType: .float32)

let bufferA = device.makeBuffer(bytes: a, length: rowsA * rowBytesA, options: [])!
let bufferB = device.makeBuffer(bytes: b, length: rowsB * rowBytesB, options: [])!
let bufferC = device.makeBuffer(length: rowsC * rowBytesC, options: [])!

let descA = MPSMatrixDescriptor(rows: rowsA, columns: columnsA, rowBytes: rowBytesA, dataType: .float32)
let descB = MPSMatrixDescriptor(rows: rowsB, columns: columnsB, rowBytes: rowBytesB, dataType: .float32)
let descC = MPSMatrixDescriptor(rows: rowsC, columns: columnsC, rowBytes: rowBytesC, dataType: .float32)

But this gives me a weird result such as:

38.0  14.0  4.441763e+08  
0.0  98.0  46.0  
1.0364113e+09  0.0  -nan(0x1fffff)  

I think the rowBytes(forColumns:dataType:) stride is different from the original array's row length in memory, which is causing the wrong printed values. But I'm not sure about this. Does anyone have suggestions on how to properly use this method?

I believe if you use a custom row stride, then you need to adjust your data source to actually match that value.

For example, if rowBytesA is 16, then you cannot simply create bufferA from a because it does not have the corresponding row stride.

You can of course create an empty buffer and stuff data into it, as long as you obey the row stride.
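For example, a minimal sketch of that idea in plain Swift (no Metal required); `padRows` and its parameter names are my own, not an MPS API:

```swift
// Copy a row-major [Float] into storage whose rows are padded out to
// rowStrideElements elements (i.e. rowBytes / MemoryLayout<Float>.stride).
// padRows is a hypothetical helper, not part of MPS.
func padRows(_ source: [Float], rows: Int, columns: Int, rowStrideElements: Int) -> [Float] {
    precondition(rowStrideElements >= columns)
    var padded = [Float](repeating: 0, count: rows * rowStrideElements)
    for row in 0..<rows {
        for col in 0..<columns {
            padded[row * rowStrideElements + col] = source[row * columns + col]
        }
    }
    return padded
}

// With rowBytes of 16 (4 floats per row), a 3x3 matrix gets one padding
// element per row.
let paddedA = padRows([1, 2, 3, 4, 5, 6, 7, 8, 9], rows: 3, columns: 3, rowStrideElements: 4)
```

The padded array can then be handed to makeBuffer(bytes:length:options:) with length rows * rowBytes.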


FWIW, on my computer Metal prefers column counts divisible by 4, except for a single column:

for columns in 0 ..< 100 {
    let rowBytes = MPSMatrixDescriptor.rowBytes(forColumns: columns, dataType: .float32)
    let recommended = rowBytes/MemoryLayout<Float>.stride
    print("columns: \(String(format: "%2d", columns)), recommended columns: \(recommended)")
}

columns:  0, recommended columns: 0
columns:  1, recommended columns: 1
columns:  2, recommended columns: 4
columns:  3, recommended columns: 4
columns:  4, recommended columns: 4
columns:  5, recommended columns: 8
columns:  6, recommended columns: 8
columns:  7, recommended columns: 8
columns:  8, recommended columns: 8
columns:  9, recommended columns: 12
columns: 10, recommended columns: 12
columns: 11, recommended columns: 12
columns: 12, recommended columns: 12
columns: 13, recommended columns: 16
...

but I guess this is computer specific and you could get a different result on another computer. Would love to see what you are getting on Intel.
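If that holds, the pattern above amounts to rounding the column count up to the next multiple of 4 floats, with a single column as the exception. A sketch of that observation (it mirrors one machine's output and is not an MPS guarantee):

```swift
// Reproduce the recommended column counts observed above by rounding up
// to a multiple of 4; columns <= 1 are left alone. Machine-specific
// observation, not a documented MPS rule.
func observedRecommendedColumns(_ columns: Int) -> Int {
    columns <= 1 ? columns : (columns + 3) / 4 * 4
}
```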

Right, the data you pass has to actually have the layout that you say it does. If your descriptor says that rowBytes is, say, 16, then the buffer itself would have to be laid out like:

[1, 2, 3, x,
 4, 5, 6, x,
 7, 8, 9, x]

where x is a don't-care, unreferenced value. (If you're familiar with BLAS/LAPACK conventions, rowBytes is just the leading dimension, expressed in bytes.)
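To make the leading-dimension analogy concrete: the byte offset of element (row, col) is computed from rowBytes rather than from the column count. A tiny sketch (byteOffset is an illustrative helper, not an MPS call):

```swift
// Byte offset of element (row, col) when consecutive rows start
// rowBytes apart; rowBytes plays the role of LAPACK's leading
// dimension, measured in bytes.
func byteOffset(row: Int, col: Int, rowBytes: Int) -> Int {
    row * rowBytes + col * MemoryLayout<Float>.stride
}

// With rowBytes = 16, element (1, 2) is at byte 16 + 8 = 24,
// i.e. float index 6 rather than index 5 in the unpadded layout.
```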

So I need to pad the input arrays (a and b) based on rowBytes, create each MPSMatrix from the padded arrays, perform the Metal matrix multiplication, convert the result to a regular Swift array, then un-pad that array to the expected dimensions. Is that correct?

I'm familiar with padding arrays in NumPy where I use the pad() function. See example below. But I'm not aware of such a feature in Swift.

import numpy as np
a = np.array([[1, 2, 3,], [4, 5, 6], [7, 8, 9]])
p = np.pad(a, ((0,0),(0,1)))

# p is the following padded array
# [[1 2 3 0]
#  [4 5 6 0]
#  [7 8 9 0]]

Usually you wouldn't do any conversions, you just translate the indexing to access the input and result buffers with the correct padding:

// strawman for illustration purposes only, written in a browser window and
// untested, so take with a grain of salt.
struct Matrix {
  var storage: [Float]
  var rows: Int
  var columns: Int
  var rowStride: Int

  init(rows m: Int, columns n: Int, repeating value: Float) {
    // figure out desired rowStride
    let rowBytes = MPSMatrixDescriptor.rowBytes(forColumns: n, dataType: .float32)
    // If rowBytes isn't a multiple of MemoryLayout<Float>.stride, just use n.
    // (I don't think that this can actually happen.)
    let floatStride = MemoryLayout<Float>.stride
    rowStride = rowBytes % floatStride == 0 ? rowBytes / floatStride : n
    rows = m
    columns = n
    // Don't actually need to init padding values, but it's easy enough.
    storage = Array(repeating: value, count: m * rowStride)
  }

  subscript(row: Int, col: Int) -> Float {
    get {
      precondition(0 <= col && col < columns)
      // array access will check row for us, since col is in bounds.
      storage[row*rowStride + col]
    }
    // ... etc
  }
}

I don't know if there's a similar pad function in Swift. For your case I would manually write code like this:

func stuff(_ from: [Float], to: UnsafeMutableRawPointer, _ rowCount: Int, _ columnCount: Int, _ rowStride: Int) {
    // I chose manual byte-copying
    from.withUnsafeBytes { bytes in
        for row in 0..<rowCount {
            let toStart = to + row * rowStride
            let fromStart = bytes.baseAddress! + row * columnCount * MemoryLayout<Float>.stride
            let byteCount = columnCount * MemoryLayout<Float>.stride
            // don't care about padding bytes
            toStart.copyMemory(from: fromStart, byteCount: byteCount)
        }
    }
}
let bufferA = device.makeBuffer(length: rowsA * rowBytesA, options: [])!
// this works for a shared-storage buffer
stuff(a, to: bufferA.contents(), rowsA, columnsA, rowBytesA)
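Going the other way, here's a sketch of a counterpart that compacts a padded buffer back into a plain row-major array; `unstuff` is my own name for it, and it is untested against a real MTLBuffer:

```swift
// Read rowCount rows of columnCount floats from a buffer whose rows
// start rowStride bytes apart, dropping the padding bytes.
// unstuff is a hypothetical counterpart to stuff(_:to:...) above.
func unstuff(_ from: UnsafeRawPointer, rowCount: Int, columnCount: Int, rowStride: Int) -> [Float] {
    var result = [Float](repeating: 0, count: rowCount * columnCount)
    result.withUnsafeMutableBytes { dest in
        for row in 0..<rowCount {
            let src = from + row * rowStride
            let dst = dest.baseAddress! + row * columnCount * MemoryLayout<Float>.stride
            dst.copyMemory(from: src, byteCount: columnCount * MemoryLayout<Float>.stride)
        }
    }
    return result
}
// e.g. let arrayC = unstuff(bufferC.contents(), rowCount: rowsC, columnCount: columnsC, rowStride: rowBytesC)
```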

I cooked a simple example for you:

import MetalPerformanceShaders

extension MPSMatrix {
    convenience init(floats: [Float], rows: Int, columns: Int) {
        let rowBytes = columns * MemoryLayout<Float>.stride
        let buffer = Self.makeBuffer(floats: floats, rows: rows, columns: columns)
        let descriptor = MPSMatrixDescriptor(rows: rows, columns: columns, rowBytes: rowBytes, dataType: .float32)
        self.init(buffer: buffer, descriptor: descriptor)
        data.contents().bindMemory(to: Float.self, capacity: rows * effectiveColumns)
    }
    convenience init(rows: Int, columns: Int, withPadding: Bool = true) {
        let rowBytes = withPadding ? MPSMatrixDescriptor.rowBytes(forColumns: columns, dataType: .float32) : columns * MemoryLayout<Float>.stride
        let buffer = MTLCreateSystemDefaultDevice()!.makeBuffer(length: rows * rowBytes, options: [])!
        let descriptor = MPSMatrixDescriptor(rows: rows, columns: columns, rowBytes: rowBytes, dataType: .float32)
        self.init(buffer: buffer, descriptor: descriptor)
        data.contents().bindMemory(to: Float.self, capacity: rows * effectiveColumns)
        clean()
    }
    func clean(_ value: Float = .nan) {
        for y in 0 ..< rows {
            for x in 0 ..< effectiveColumns {
                self[x: x, y: y, withPadding: true] = value
            }
        }
    }
    var effectiveColumns: Int {
        precondition((rowBytes % MemoryLayout<Float>.stride) == 0)
        return rowBytes / MemoryLayout<Float>.stride
    }
    subscript(x x: Int, y y: Int, withPadding withPadding: Bool = false) -> Float {
        get {
            precondition(x >= 0 && x < columns(withPadding: withPadding) && y >= 0 && y < rows, "out of bounds")
            let p = data.contents().assumingMemoryBound(to: Float.self)
            return p[y*effectiveColumns + x]
        }
        set {
            precondition(x >= 0 && x < columns(withPadding: withPadding) && y >= 0 && y < rows, "out of bounds")
            let p = data.contents().assumingMemoryBound(to: Float.self)
            p[y*effectiveColumns + x] = newValue
        }
    }
    func columns(withPadding: Bool) -> Int {
        withPadding ? effectiveColumns : columns
    }
    func log(_ title: String) {
        let padding = columns == effectiveColumns ? "no padding" : "plus padding"
        print("\(title), \(rows)x\(columns) \(padding)")
        for y in 0 ..< rows {
            for x in 0 ..< columns {
                let r = self[x: x, y: y, withPadding: false]
                print(r, terminator: " ")
            }
            if columns < effectiveColumns {
                print(" | ", terminator: "")
                for x in columns ..< effectiveColumns {
                    let r = self[x: x, y: y, withPadding: true]
                    print(r, terminator: " ")
                }
            }
            print()
        }
        print()
    }
    static func makeBuffer(floats: [Float], rows: Int, columns: Int) -> MTLBuffer {
        floats.withUnsafeBufferPointer {
            let rowBytes = columns * MemoryLayout<Float>.stride
            return MTLCreateSystemDefaultDevice()!.makeBuffer(bytes: $0.baseAddress!, length: rows * rowBytes, options: [])!
        }
    }
}

Usage:

var a = MPSMatrix(floats: [1, 2, 3, 4, 5, 6], rows: 2, columns: 3)
var b = MPSMatrix(rows: 2, columns: 3, withPadding: true)
b[x: 0, y: 0] = 10
b[x: 1, y: 0] = 20
b[x: 2, y: 0] = 30
b[x: 0, y: 1] = 40
b[x: 1, y: 1] = 50
b[x: 2, y: 1] = 60
var c = MPSMatrix(rows: 1, columns: 1, withPadding: false)
a.log("a")
b.log("b")
c.log("c")
c = a
c.log("c=a")
c = b
c.log("c=b")

Outputs:

Here I am using `nan` for padding / unassigned elements, and the `log` function visually separates the padding elements.


Do you notice any difference in performance when using padding vs without padding?

Interestingly, no speed difference (on M1 at least). I just tried with 8001x8001 matrices, with and without padding.

Yeah, I'm not seeing a difference either.

APIs like this are often about future-proofing; obviously no one wants performance to depend on padding, since that would sometimes necessitate copies to get the best performance. So we all try to build implementations that give the best performance no matter what, but we want people to use these APIs in case there's some future hardware where that isn't possible.

Also, the padding is determined independently of the operation you're doing; gemm usually doesn't matter much because gemm has enough data reuse that an implementation can amortize any misalignment penalty. Operations that use each element only O(1) times (like ger/gemv/etc.) are much more likely to exhibit a difference.
