Accelerate in Swift is slower than NumPy in Python for matrix multiplication

@scanon Thank you for all of the replies. They are very helpful. I'm working on a Swift package that provides Vector and Matrix types for performing linear algebra computations on Apple devices. Kind of like NumPy but for Swift. As I started working on the matrix multiplication, I was curious about Accelerate's performance compared to other languages/packages which is how this post got started. The example I posted is just something I came up with off the top of my head. If you have suggestions for a better benchmark then please let me know. Also, you mentioned the following:

8000 x 8000 is much more than large enough for every BLAS implementation to be multithreaded

So are you saying the cblas_dgemm() function in Accelerate is multithreaded for large matrices? If it is multithreaded, then what is the difference between vDSP_mmulD() and cblas_dgemm()? The Apple developer docs say vDSP_mmulD() is multithreaded for large data but I don't see anything in the docs about multithreaded BLAS.

Yes; broadly speaking this is true of every BLAS implementation unless you explicitly opt out of threading via some nonstandard API.

None whatsoever. The vDSP interfaces predate the availability of BLAS in macOS (they actually predate Mac OS X--they came into vDSP via Mercury's SAL library, which sort of lives on as OpenSAL, back in the Mac OS 9.x days). In some older OS releases (before something like 10.9--in any case, old enough to predate any version that software written today is likely to run on) they had a separate implementation, but now vDSP_mmul[D] just calls [s|d]gemm.
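
For reference, here is a minimal sketch of what calling the BLAS routine directly looks like from Swift (the multiply helper and its row-major, square-matrix layout are my own choices, not from the thread):

import Accelerate

// Sketch: C = A * B for row-major n x n matrices of Double.
// vDSP_mmulD forwards to this same routine, so the two should
// perform identically.
func multiply(_ a: [Double], _ b: [Double], n: Int) -> [Double] {
    var c = [Double](repeating: 0, count: n * n)
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                Int32(n), Int32(n), Int32(n),
                1.0, a, Int32(n),
                b, Int32(n),
                0.0, &c, Int32(n))
    return c
}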

4 Likes

I compiled and ran your Metal example with:

swiftc -Ounchecked mainmetal.swift
./mainmetal

But it fails to run with the following error:

/AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSMatrix.mm:241: failed assertion `[MPSMatrix initWithBuffer:descriptor:] buffer may not be nil.'
zsh: abort      ./mainmetal

Also, the left and right matrices in your example are full of zeros. In my original post, the matrix multiplication is performed with matrices that contain random values. Can you rerun the example with matrices filled with random values? I'm curious as to how much that would affect the elapsed time to perform the multiplication.

I did run it in Xcode previously and it was OK. Now I checked compiling in the terminal and indeed got this crash. No idea why there's a difference.

I did the randomisation initially, then tried with all zeroes and got the same timing, so I removed the randomisation for simplicity. Looks like multiplying 0.0 * 0.0 is as quick as multiplying 0.1209452932173 * 0.1782379143424.

Interestingly, running Xcode's built executable from terminal also crashes.

1 Like

Just checked. Same here.

CmdLine/MetalMatrix.swift:17: Fatal error: Unexpectedly found nil while unwrapping an Optional value
Illegal instruction

extension MPSMatrix {
    static func makeBuffer(floats: [Float], rows: Int, columns: Int) -> MTLBuffer {
        floats.withUnsafeBufferPointer { bp in
            let array = bp.baseAddress!
            let rowBytes = columns * MemoryLayout<Float>.stride

            // Line 17 - Fatal error: Unexpectedly found nil while unwrapping an Optional value
            return MTLCreateSystemDefaultDevice()!.makeBuffer(bytes: array, length: rows * rowBytes, options: [])!
        }
    }
}

Execution from the command line requires setting an environment variable?

Just link with CoreGraphics.framework:

MTLCreateSystemDefaultDevice()

Returns the device instance Metal selects as the default.

Discussion

In macOS, in order for the system to provide a default Metal device object, you must link to the Core Graphics framework. You usually need to do this explicitly if you’re writing apps that don’t use graphics by default, such as command line tools.

3 Likes

@tera, thank you for the Metal example.

What's the optimal way to get the values out of the MPSMatrix?

Is this good enough?

extension MPSMatrix {
    var values: [Float] {
        let size = rows * columns
        let v = data.contents().withMemoryRebound(to: Float.self, capacity: size) {
            var u = [Float]()
            var p = $0
            for _ in 0..<size {
                u.append(p.pointee)
                p = p.successor()
            }
            return u
        }
        return v
    }
}
1 Like

Oh thanks! This works for me:

swiftc -framework CoreGraphics -Ounchecked metalmultiply.swift
2 Likes

Very much expected; floating-point addition and multiplication are fully-pipelined high-throughput operations on most hardware (and certainly all non-embedded CPUs and GPUs), so there's really no opportunity to add early-outs for zero data--the resulting hiccup in an otherwise regular instruction retirement pattern would kill any performance win. Some lower-clocked wider HW designs (GPUs and ML accelerators and even some SIMD designs) will power down circuits when they see a zero input, so there can be a measurable energy impact, but there's rarely any performance difference (except in the smallest CPUs, where floating-point might be emulated in software).

6 Likes

Yep, something like that.

I didn't bother optimising that part when I tested my implementation, was using a quick and dirty (and very non optimal):

extension MPSMatrix {
    subscript(x x: Int, y y: Int) -> Float {
        get {
            let p = data.contents().bindMemory(to: Float.self, capacity: rows * columns)
            return p[y*columns + x]
        }
        set {
            let p = data.contents().bindMemory(to: Float.self, capacity: rows * columns)
            p[y*columns + x] = newValue
        }
    }
}

which is obviously slower than needed. On the bright side, it doesn't allocate memory.
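
For what it's worth, a single bulk copy avoids both the per-element loop and the repeated bindMemory calls. A sketch (bulkValues is a hypothetical name; this assumes rowBytes has no padding, as in the code above):

import MetalPerformanceShaders

extension MPSMatrix {
    // Copy the whole backing buffer in one shot instead of walking it
    // element by element. Assumes rowBytes == columns * stride.
    var bulkValues: [Float] {
        let count = rows * columns
        let p = data.contents().bindMemory(to: Float.self, capacity: count)
        return Array(UnsafeBufferPointer(start: p, count: count))
    }
}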

1 Like

I ran your Metal example and didn't get any speed up compared to Accelerate on my Intel Mac. The exact code I ran is shown below.

import MetalPerformanceShaders

extension MPSMatrix {
    static func makeBuffer(floats: [Float], rows: Int, columns: Int) -> MTLBuffer {
        floats.withUnsafeBufferPointer { bp in
            let array = bp.baseAddress!
            let rowBytes = columns * MemoryLayout<Float>.stride
            return MTLCreateSystemDefaultDevice()!.makeBuffer(bytes: array, length: rows * rowBytes, options: [])!
        }
    }
    convenience init(floats: [Float], rows: Int, columns: Int) {
        let buffer = Self.makeBuffer(floats: floats, rows: rows, columns: columns)
        let rowBytes = columns * MemoryLayout<Float>.stride
        let descriptor = MPSMatrixDescriptor(rows: rows, columns: columns, rowBytes: rowBytes, dataType: .float32)
        self.init(buffer: buffer, descriptor: descriptor)
    }
    convenience init(rows: Int, columns: Int) {
        let rowBytes = columns * MemoryLayout<Float>.stride
        let buffer = MTLCreateSystemDefaultDevice()!.makeBuffer(length: rows * rowBytes, options: [])!
        let descriptor = MPSMatrixDescriptor(rows: rows, columns: columns, rowBytes: rowBytes, dataType: .float32)
        self.init(buffer: buffer, descriptor: descriptor)
    }
    static func * (lhs: MPSMatrix, rhs: MPSMatrix) -> MPSMatrix {
        let device = lhs.device
        precondition(device === rhs.device)
        precondition(lhs.columns == rhs.rows)
        let result = MPSMatrix(rows: lhs.rows, columns: rhs.columns)
        let commandBuffer = device.makeCommandQueue()!.makeCommandBuffer()!
        let mul = MPSMatrixMultiplication(device: device, resultRows: lhs.rows, resultColumns: rhs.columns, interiorColumns: lhs.columns)
        mul.encode(commandBuffer: commandBuffer, leftMatrix: lhs, rightMatrix: rhs, resultMatrix: result)
        commandBuffer.commit()
        commandBuffer.waitUntilCompleted()
        return result
    }
}

func runExample() {
    let n = 3
    let a: [Float] = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    let b: [Float] = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    let left = MPSMatrix(floats: a, rows: n, columns: n)
    let right = MPSMatrix(floats: b, rows: n, columns: n)
    let c = left * right
    var cPointer = c.data.contents().bindMemory(to: Float.self, capacity: n * n)
    for _ in 0..<n*n {
        let y = Float(cPointer.pointee)
        print(y, terminator: " ")
        cPointer = cPointer.advanced(by: 1)
    }
    print("")
}

func runBenchmark() {
    let n = 8000
    let left = MPSMatrix(floats: .init(repeating: 0, count: n*n), rows: n, columns: n)
    let right = MPSMatrix(floats: .init(repeating: 0, count: n*n), rows: n, columns: n)
    let tic = ContinuousClock().now
    _ = left * right
    let toc = ContinuousClock().now
    print("metal elapsed \(toc - tic) (w/o random)")
}

runExample()
runBenchmark()
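
For reference, a variant of the benchmark with random inputs would look like the sketch below (runRandomBenchmark is a hypothetical addition, not part of the code above; earlier in the thread the timing came out the same either way):

// Same measurement as runBenchmark, but with random data.
func runRandomBenchmark() {
    let n = 8000
    let a = (0..<n*n).map { _ in Float.random(in: 0..<1) }
    let b = (0..<n*n).map { _ in Float.random(in: 0..<1) }
    let left = MPSMatrix(floats: a, rows: n, columns: n)
    let right = MPSMatrix(floats: b, rows: n, columns: n)
    let tic = ContinuousClock().now
    _ = left * right
    let toc = ContinuousClock().now
    print("metal elapsed \(toc - tic) (random)")
}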

Compile and run with:

swiftc -framework CoreGraphics -Ounchecked mainmetal.swift
./mainmetal   

Gives the following:

30.0 36.0 42.0 66.0 81.0 96.0 102.0 126.0 150.0 
metal elapsed 4.259647533 seconds (w/o random)

Interesting. This is what I was using for the test:

(About this Mac -> More Info -> General -> About -> System Report -> Graphics/Displays)

  Apple M1 Pro
  Chipset Model: Apple M1 Pro
  Type: GPU 
  Bus: Built-In 
  Total Number of Cores: 16
  Metal Support: Metal 3 

I'm on an old 2019 Mac with the following:

AMD Radeon Pro 5500M:

  Chipset Model:	AMD Radeon Pro 5500M
  Type:	GPU
  Bus:	PCIe
  PCIe Lane Width:	x16
  VRAM (Total):	4 GB
  Vendor:	AMD (0x1002)
  Device ID:	0x7340
  Revision ID:	0x0040
  ROM Revision:	113-D3220E-190
  VBIOS Version:	113-D32206U1-019
  Option ROM Version:	113-D32206U1-019
  EFI Driver Version:	01.A1.190
  Automatic Graphics Switching:	Supported
  gMux Version:	5.0.0
  Metal Support:	Metal 3

One thing I've learned from this post is that I just need to get a new Mac.

1 Like

One thing I have learned from this exercise is that for matrices up to 7 x 7, doing the multiplication locally on the CPU yields results much faster than using Metal. This is not unexpected though, due to the cost of data transfer to and from the GPU.

Is that true for Apple Silicon machines? The unified memory is supposed to minimize the transfer cost.

1 Like

For isolated operations, you still have the synchronization cost, even when moving memory is cheap / unnecessary. For small matrix multiplies, it will never be advantageous to use an implementation that has to talk to another part of the SoC, unless you group them together with other stuff to amortize that overhead.
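
For example (a sketch, not from the thread, reusing the MPSMatrix helpers defined in the code above): encode many small multiplies into a single command buffer and synchronize once, instead of paying the round trip per multiply.

import MetalPerformanceShaders

// Amortize the GPU synchronization cost over a batch of multiplies.
func batchedMultiply(_ pairs: [(MPSMatrix, MPSMatrix)]) -> [MPSMatrix] {
    let device = MTLCreateSystemDefaultDevice()!
    let commandBuffer = device.makeCommandQueue()!.makeCommandBuffer()!
    var results: [MPSMatrix] = []
    for (l, r) in pairs {
        let out = MPSMatrix(rows: l.rows, columns: r.columns)  // convenience init from above
        let mul = MPSMatrixMultiplication(device: device, resultRows: l.rows,
                                          resultColumns: r.columns, interiorColumns: l.columns)
        mul.encode(commandBuffer: commandBuffer, leftMatrix: l, rightMatrix: r, resultMatrix: out)
        results.append(out)
    }
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()  // one synchronization for the whole batch
    return results
}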

A 7x7 matrix multiply is only 343 multiply-adds. A totally naive three-nested-for-loops scalar implementation does that in about 100 cycles. Almost any form of synchronization is more expensive.
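
For concreteness, the naive version looks something like this (a sketch; naiveMultiply is a hypothetical name):

// Three nested loops over row-major n x n matrices; for n = 7 this is
// 7^3 = 343 multiply-adds.
func naiveMultiply(_ a: [Float], _ b: [Float], n: Int) -> [Float] {
    var c = [Float](repeating: 0, count: n * n)
    for i in 0..<n {
        for j in 0..<n {
            var sum: Float = 0
            for k in 0..<n {
                sum += a[i * n + k] * b[k * n + j]
            }
            c[i * n + j] = sum
        }
    }
    return c
}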

6 Likes

Totally unsurprising... For small jobs, algorithms with smaller overhead win. E.g., linear search is quicker than binary search for small N despite having worse O() complexity.

1 Like