Accelerate in Swift is slower than NumPy in Python for matrix multiplication

I'm performing matrix multiplication in Swift using the vDSP_mmulD and cblas_dgemm functions from Accelerate, as shown below. Each matrix is a square N x N matrix. The code is in a file named main.swift, compiled with swiftc -Xcc -DACCELERATE_NEW_LAPACK -Ounchecked main.swift, and executed with ./main.

// main.swift

import Accelerate

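// Fill a vector of length n with uniform random values in (0, 1) using LAPACK's dlarnv.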
func random(_ n: Int) -> [Double] {
    var idist: Int32 = 1

    var iseed: [Int32] = (0..<3).map { _ in Int32.random(in: 1..<4095) }
    iseed += [2 * (Int32.random(in: 1..<4095) / 2) + 1 ]

    var num = Int32(n)
    var vec = [Double](repeating: 0.0, count: n)
    dlarnv_(&idist, &iseed, &num, &vec)

    return vec
}

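// Multiply two square size x size matrices (row-major) using vDSP_mmulD.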
func mmul(_ a: [Double], _ b: [Double], size: Int) -> [Double] {
    let len = vDSP_Length(size)
    let m: vDSP_Length = len
    let n: vDSP_Length = len
    let p: vDSP_Length = len
    let stride = 1

    var c = [Double](repeating: 0, count: a.count)
    vDSP_mmulD(a, stride, b, stride, &c, stride, m, n, p)

    return c
}

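// Multiply two square size x size matrices (row-major) using BLAS dgemm.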
func dgemm(_ a: inout [Double], _ b: inout [Double], size: Int32) -> [Double] {
    let m: Int32 = size
    let n: Int32 = size
    let k: Int32 = size

    var c = [Double](repeating: 0, count: a.count)
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, 1.0, &a, size, &b, size, 0.0, &c, size)

    return c
}

// Example to compare with Python result
// Notice that n is for NxN matrix
func run_mmul_example() {
    let n = 3
    let a: [Double] = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    let b: [Double] = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    let c = mmul(a, b, size: n)
    print(c)
}

// Example to compare with Python result
// Notice that n is for NxN matrix
func run_dgemm_example() {
    let n: Int32 = 3  // where n is for NxN matrix
    var a: [Double] = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    var b: [Double] = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    let c = dgemm(&a, &b, size: n)
    print(c)
}

func run_mmul_benchmark() {
    let tic = ContinuousClock().now

    let n = 8000
    let a = random(n * n)
    let b = random(n * n)
    let c = mmul(a, b, size: n)
    print(c[0])
    print(c.last!)

    let toc = ContinuousClock().now
    print("elapsed \(toc - tic)")
}

func run_dgemm_benchmark() {
    let tic = ContinuousClock().now

    let n = 8000
    var a = random(n * n)
    var b = random(n * n)
    let c = dgemm(&a, &b, size: Int32(n))
    print(c[0])
    print(c.last!)

    let toc = ContinuousClock().now
    print("elapsed \(toc - tic)")
}

run_mmul_example()
run_dgemm_example()

run_mmul_benchmark()
run_dgemm_benchmark()

I compared the Swift/Accelerate results to Python using the NumPy package; see the Python code below. The code is in a file named main.py and is executed with python main.py. The Swift and Python results are equivalent, so I'm confident that both programs are doing the same calculation.

# main.py

import numpy as np
import time


def run_example():
    # Example to compare with Swift result
    a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    c = np.matmul(a, b)
    print(c)


def run_benchmark():
    # Benchmark
    tic = time.perf_counter()

    n = 8000
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    c = np.matmul(a, b)
    print(c[0, 0])
    print(c[-1, -1])

    toc = time.perf_counter()
    print(f"elapsed {toc - tic} sec")


if __name__ == "__main__":
    run_example()
    run_benchmark()

However, for a large 8000 x 8000 matrix, the Swift/Accelerate code is about one second slower than the Python/NumPy code. The table below shows the elapsed time in seconds from the benchmark functions for three runs. The results are from a 2019 MacBook Pro with a 2.6 GHz 6-core Intel CPU and 32 GB of RAM running on macOS 14.4.1 Sonoma.

Run    vDSP_mmulD    cblas_dgemm    NumPy
1      5.69          5.38           4.42
2      5.77          5.51           4.40
3      5.73          5.48           4.43

Even if I ignore the generation of the random matrices, the Swift code is still slower than the Python code. I know NumPy has some optimized code under the hood, but I'm surprised it's faster than Accelerate. I thought Accelerate was optimized for Apple hardware, so I was expecting better performance than NumPy. Is there something else I should do to improve performance in Swift/Accelerate for matrix multiplication?

5 Likes

Here are the results when I exclude the random number generation from the benchmarks. Again, the Swift code is about one second slower than the Python code.

// Swift benchmarks that exclude timing the random number generation

func run_mmul_benchmark2() {
    let n = 8000
    let a = random(n * n)
    let b = random(n * n)

    let tic = ContinuousClock().now
    let c = mmul(a, b, size: n)
    let toc = ContinuousClock().now

    print(c[0])
    print(c.last!)
    print("elapsed \(toc - tic)")
}

func run_dgemm_benchmark2() {
    let n = 8000
    var a = random(n * n)
    var b = random(n * n)

    let tic = ContinuousClock().now
    let c = dgemm(&a, &b, size: Int32(n))
    let toc = ContinuousClock().now

    print(c[0])
    print(c.last!)
    print("elapsed \(toc - tic)")
}

# Python benchmark that excludes timing the random number generation

def run_benchmark2():
    n = 8000
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    tic = time.perf_counter()
    c = np.matmul(a, b)
    toc = time.perf_counter()

    print(c[0, 0])
    print(c[-1, -1])
    print(f"elapsed {toc - tic} sec")

Run    vDSP_mmulD    cblas_dgemm    NumPy
1      4.42          4.37           3.27
2      4.63          4.64           3.20
3      4.81          4.72           3.18

That's because NumPy is multithreaded while Accelerate is thread-unaware (single-threaded). You could try disabling NumPy's multithreading to see if that makes it slower than Accelerate.

According to the Accelerate docs, vDSP_mmulD is multithreaded depending on the size of the data; see the last paragraph of the Overview section in vDSP | Apple Developer Documentation.

2 Likes

The threshold is undocumented, so it may well be that your data size is not large enough. I'd say check the profiler to see whether it does or doesn't multithread.

I'd also recommend pre-transposing the right matrix (and changing the multiplication algorithm accordingly) for better cache locality; see the sketch below.
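For illustration, a minimal untested sketch of the pre-transpose using Accelerate's vDSP_mtransD (the helper name is mine; a hand-written multiply would then walk both operands row by row):

import Accelerate

// Hypothetical sketch: transpose a square size x size row-major matrix.
func transposed(_ b: [Double], size: Int) -> [Double] {
    var bT = [Double](repeating: 0, count: b.count)
    // vDSP_mtransD(source, sourceStride, destination, destinationStride, destRows, destCols)
    vDSP_mtransD(b, 1, &bT, 1, vDSP_Length(size), vDSP_Length(size))
    return bT
}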

I'd probably not do any of the above, though, and instead use a method that utilises the GPU rather than the CPU; it should be much faster.

2 Likes

What does a time profile of the two applications say?

1 Like

Seems that architecture is a big factor. On my M1 Max MBP, for the calculation-only benchmarks, I get:

mmul    dgemm    numpy
2.27    2.25     3.51
2.30    2.33     3.45
2.35    2.29     3.47

During the Python runs, all 10 cores are going full-tilt. During the Swift runs, it's harder to tell, but it seems like 2 cores are going full tilt and another 6 are sorta up to stuff. So, on this M1, I'm seeing much better performance while using fewer resources.

5 Likes

How are you checking the usage of the CPU cores? Are you using the CPU Usage and CPU History windows from the Activity Monitor?

I ran the tests 20 times in a loop and watched the CPU meter. I made sure no other processes were up to much.

1 Like

The CPU Usage and CPU History windows in Activity Monitor show 12 cores on my 2019 MacBook Pro, which has a 2.6 GHz 6-core Intel Core i7 CPU. I guess there are two hyper-threads per core, so each physical core appears as two logical cores. Anyway, looking at the CPU Usage and CPU History windows, it looks like the Swift code is using all 12 cores and the Python code is using 6 cores on my machine.
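In case it helps, here is a small untested sketch for printing logical vs. physical core counts from Swift (hw.physicalcpu is, as far as I know, the standard sysctl key on macOS):

import Foundation

// Query an Int32-valued sysctl key, returning -1 on failure.
func sysctlInt(_ name: String) -> Int {
    var value: Int32 = 0
    var size = MemoryLayout<Int32>.size
    guard sysctlbyname(name, &value, &size, nil, 0) == 0 else { return -1 }
    return Int(value)
}

print("logical cores:", ProcessInfo.processInfo.activeProcessorCount)
print("physical cores:", sysctlInt("hw.physicalcpu"))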

It wouldn't surprise me if it's actually a performance loss to use hyperthreading for efficient SIMD code. It should be basically bandwidth-limited (or thermally limited) such that SMT offers no possible benefit and only adds overheads.

That said, it's surprising that Accelerate wouldn't "know" that, and that the difference is seemingly that severe. So I suspect it's something else [also] going on here.

It's good that Accelerate performs comparatively better on AArch64 (given that's the more important architecture) but that might say more about NumPy's lack of optimisation there, than anything else. A lot of the "Linux" world (into which I loosely lump most of Python) is still stuck in the stone age (i.e. x86-64 centric, if not exclusive).

3 Likes

I ran the Swift code through the profiler and see that the Accelerate calls split out into two threads, and the execution itself is distributed among the performance cores evenly, hopping frequently from one to the next. (At each moment in time, only 2 cores are active, but with the frequent hopping, the CPU meter made the average use look higher than the fine-grained analysis shows.) Still, a very stark contrast to the resource use of the Python version.

2 Likes

GPU matrix multiplication takes ~0.3 sec on an M1 Pro Mac for an 8000 x 8000 matrix:


import MetalPerformanceShaders

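// Helpers for building MPSMatrix objects from Float arrays and multiplying them on the GPU.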
extension MPSMatrix {
    static func makeBuffer(floats: [Float], rows: Int, columns: Int) -> MTLBuffer {
        floats.withUnsafeBufferPointer { bp in
            let array = bp.baseAddress!
            let rowBytes = columns * MemoryLayout<Float>.stride
            return MTLCreateSystemDefaultDevice()!.makeBuffer(bytes: array, length: rows * rowBytes, options: [])!
        }
    }
    convenience init(floats: [Float], rows: Int, columns: Int) {
        let buffer = Self.makeBuffer(floats: floats, rows: rows, columns: columns)
        let rowBytes = columns * MemoryLayout<Float>.stride
        let descriptor = MPSMatrixDescriptor(rows: rows, columns: columns, rowBytes: rowBytes, dataType: .float32)
        self.init(buffer: buffer, descriptor: descriptor)
    }
    convenience init(rows: Int, columns: Int) {
        let rowBytes = columns * MemoryLayout<Float>.stride
        let buffer = MTLCreateSystemDefaultDevice()!.makeBuffer(length: rows * rowBytes, options: [])!
        let descriptor = MPSMatrixDescriptor(rows: rows, columns: columns, rowBytes: rowBytes, dataType: .float32)
        self.init(buffer: buffer, descriptor: descriptor)
    }
    static func * (lhs: MPSMatrix, rhs: MPSMatrix) -> MPSMatrix {
        let device = lhs.device
        precondition(device === rhs.device)
        precondition(lhs.columns == rhs.rows)
        let result = MPSMatrix(rows: lhs.rows, columns: rhs.columns)
        let commandBuffer = device.makeCommandQueue()!.makeCommandBuffer()!
        let mul = MPSMatrixMultiplication(device: device, resultRows: lhs.rows, resultColumns: rhs.columns, interiorColumns: lhs.columns)
        mul.encode(commandBuffer: commandBuffer, leftMatrix: lhs, rightMatrix: rhs, resultMatrix: result)
        commandBuffer.commit()
        commandBuffer.waitUntilCompleted()
        return result
    }
}

func matrixTest() {
    let n = 8000
    let left = MPSMatrix(floats: .init(repeating: 0, count: n*n), rows: n, columns: n)
    let right = MPSMatrix(floats: .init(repeating: 0, count: n*n), rows: n, columns: n)
    let result = left * right
}

I didn't try your CPU-based code on this machine, so I don't know for sure whether the GPU method is faster or slower (and to what degree).

Another crucial deciding factor: is Float32 precision enough for you? At this time, GPU matrix multiplication supports 32-bit floats only.
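If Float32 is acceptable, the Double matrices could be narrowed before building the MPSMatrix, for example with vDSP's double-to-single conversion (untested sketch, helper name is mine):

import Accelerate

// Convert a Double array to Float using vDSP_vdpsp.
func toFloat(_ a: [Double]) -> [Float] {
    var out = [Float](repeating: 0, count: a.count)
    vDSP_vdpsp(a, 1, &out, 1, vDSP_Length(a.count))
    return out
}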

5 Likes

Thank you for the Metal example. You can definitely get massive performance boosts for certain operations on the GPU compared to the CPU. However, in this post I'm interested in CPU comparisons such as Accelerate vs NumPy. I will keep the Metal/GPU approach in mind, but for now I want to do everything on the CPU because it's easier to program.

As @wadetregaskis suggests, it's generally impossible to make informed statements about the relative performance of programs without profiling them.

How do I profile this code? I know Xcode has profiling features but this is just a Swift file that I compile and run on the command line. Is there a way to profile it from the command line?

You can run Instruments targeting anything, whether from Xcode or not. You can also use sample or spindump.

1 Like

I went ahead and looked at a trace, and all of the time here is being spent in Accelerate, so there's really not a "Swift" issue to discuss. IIRC NumPy defaults to using either OpenBLAS or MKL on Intel machines. Depending on the workload, the specific routines used, and the architecture on which you're running, they may be faster or slower than Accelerate; in particular, I've seen cases in the past where MKL or OpenBLAS was faster than Accelerate on an otherwise quiet Intel system, but Accelerate performed better when the system was under heterogeneous load. All of which is to say that performance optimization and measurement is an inexact science, and measuring the tasks that you're actually using a library for is key.

On Apple Silicon, Accelerate's BLAS is generally more efficient than other implementations today, but of course many of the same caveats apply.

Well, not really, because we can compare like-for-like to convince ourselves that this isn't the case (gemm running on x86 vs gemm running on Apple Silicon). GEMM has its faults as a benchmark, but it is indisputably a fixed workload, and therefore excellent for this sort of shootout.

5 Likes

8000 x 8000 is much more than large enough for every BLAS implementation to be multithreaded. Thresholds vary with uArch and BLAS implementation and library version (which is why they are generally not documented), but are typically somewhere around 64-512, for a rough order-of-magnitude.

dgemm implementations do this for you; since there's only O(n²) work to do transposing but O(n³) work to do for the multiplication itself, the cost of the transpose is easily amortized for sufficiently large matrices (8000 x 8000 is certainly sufficiently large).
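To make that concrete (a hypothetical sketch, not code from this thread): if you did already hold the right operand in transposed storage, you could tell dgemm about it via the transB flag rather than transposing it back yourself:

import Accelerate

// Compute C = A * B, where bT holds B in transposed (row-major) storage.
func dgemmBT(_ a: inout [Double], _ bT: inout [Double], size: Int32) -> [Double] {
    var c = [Double](repeating: 0, count: a.count)
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                size, size, size, 1.0,
                &a, size,   // A is size x size
                &bT, size,  // B is supplied as its transpose
                0.0, &c, size)
    return c
}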

GPU isn't really relevant, since the question is about dgemm, and relatively few GPUs offer double-precision throughput worth writing home about.

3 Likes

@wigging What are you actually trying to measure? The benchmark as set up here is just measuring raw BLAS performance for gemm of one (somewhat unusual¹) problem size, which tells us something, but probably isn't representative of whatever you're actually interested in. If you can provide some more context, I can try to help you come up with more representative measurements.


¹ 8k x 8k is in a sort of no-man's land of "not a properly big matrix, but also impractically large for a small matrix". Matrices this size do come up in real problems from time to time, of course, but they're kind of anomalous.

2 Likes