Performance of Accelerate framework vs Swift on Apple Silicon

Can you share the BLAS example that you benchmarked? I don't think the BLAS code I shared earlier is doing the same thing as the Swift and vDSP examples if you put it in a for-loop and time it.

Here are my examples where the for-loop is timed.

// Accelerate vDSP example
// swiftc accel.swift -Ounchecked -o build/accel && ./build/accel

import Accelerate

func main() {
    // setup
    let n = 1_000_000
    let m = 10_000

    let a = Array(repeating: 2.5, count: n)
    let b = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 0.0, count: n)

    // warmup
    for _ in 0..<3 {
        vDSP.add(a, b, result: &c)
    }

    // benchmark
    print("Accelerate vDSP")

    let tic = ContinuousClock.now

    for _ in 0..<m {
        vDSP.add(a, b, result: &c)
    }

    let toc = tic.duration(to: .now)

    print("first", c[0], "last", c[n-1])
    print(toc)
}

main()

// Swift example
// swiftc basic.swift -Ounchecked -o build/basic && ./build/basic

func main() {
    // setup
    let n = 1_000_000
    let m = 10_000

    let a = Array(repeating: 2.5, count: n)
    let b = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 0.0, count: n)

    // warmup
    for _ in 0..<3 {
        for i in 0..<n {
            c[i] = a[i] + b[i]
        }
    }

    // benchmark
    print("Swift")

    let tic = ContinuousClock.now

    for _ in 0..<m {
        for i in 0..<n {
            c[i] = a[i] + b[i]
        }
    }

    let toc = tic.duration(to: .now)

    print("first", c[0], "last", c[n-1])
    print(toc)
}

main()

// Accelerate BLAS example
// swiftc blas.swift -Xcc -DACCELERATE_NEW_LAPACK -Xcc -DACCELERATE_LAPACK_ILP64 -Ounchecked -o build/blas && ./build/blas

import Accelerate

func main() {
    // setup
    let n = 1_000_000
    let m = 10_000

    let a = Array(repeating: 2.5, count: n)
    let b = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 0.0, count: n)

    // warmup
    for _ in 0..<3 {
        cblas_dcopy(n, b, 1, &c, 1)
        cblas_daxpy(n, 1.0, a, 1, &c, 1)
    }

    // benchmark
    print("Accelerate BLAS")

    let tic = ContinuousClock.now

    for _ in 0..<m {
        // c = b, then c = 1.0 * a + c; the copy resets c so every pass computes a + b
        cblas_dcopy(n, b, 1, &c, 1)
        cblas_daxpy(n, 1.0, a, 1, &c, 1)
    }

    let toc = tic.duration(to: .now)

    print("first", c[0], "last", c[n-1])
    print(toc)
}

main()

Here are the benchmark results for n = 1_000_000 and m = 10_000. The BLAS code is about 1.7x slower than the Swift and vDSP code because it copies the b array into c before every daxpy call. The Swift and vDSP code perform about the same.

Accelerate vDSP
first 4.38 last 4.38
1.93274625 seconds

Swift
first 4.38 last 4.38
1.95879725 seconds

Accelerate BLAS
first 4.38 last 4.38
3.33451375 seconds
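
To illustrate why the BLAS version needs the copy, here is a minimal sketch of what dcopy and daxpy compute with unit strides, written as plain Swift loops. The names referenceCopy and referenceAxpy are made up for this illustration and say nothing about how Accelerate implements the routines; the point is only that daxpy updates its output in place (y = alpha * x + y), so the result keeps growing across iterations unless c is reset first.

```swift
// Illustrative stand-ins for the two BLAS level-1 routines used above.
// These are hypothetical helpers, not Accelerate's implementation.

/// y = x, which is what cblas_dcopy computes with unit strides.
func referenceCopy(_ x: [Double], into y: inout [Double]) {
    for i in 0..<x.count {
        y[i] = x[i]
    }
}

/// y = alpha * x + y, which is what cblas_daxpy computes with unit strides.
func referenceAxpy(_ alpha: Double, _ x: [Double], into y: inout [Double]) {
    for i in 0..<x.count {
        y[i] += alpha * x[i]
    }
}

let a = [2.5, 2.5, 2.5]
let b = [1.88, 1.88, 1.88]

// With the copy: c is reset to b on every pass, so each pass yields a + b.
var withCopy = [0.0, 0.0, 0.0]
for _ in 0..<3 {
    referenceCopy(b, into: &withCopy)
    referenceAxpy(1.0, a, into: &withCopy)
}

// Without the copy: c accumulates another a on every pass.
var withoutCopy = b
for _ in 0..<3 {
    referenceAxpy(1.0, a, into: &withoutCopy)
}

print("with copy:", withCopy[0])
print("without copy:", withoutCopy[0])
```

With the copy, the first element stays at roughly 4.38 no matter how many passes run, which matches the benchmark output. Without it, three passes give roughly 1.88 + 3 * 2.5 = 9.38. The price is a second pass over memory per iteration.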

With n = 500_000 for the array size, the BLAS code is about 1.7x faster than the other two. The benchmark results are below. My guess is that the smaller arrays fit in cache, where the extra copy in the BLAS code is much cheaper.

Accelerate vDSP
first 4.38 last 4.38
0.943721291 seconds

Swift
first 4.38 last 4.38
0.951104833 seconds

Accelerate BLAS
first 4.38 last 4.38
0.55452775 seconds
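
As a rough check on the cache explanation, the sketch below computes how many bytes the three Double arrays occupy at each size. The helper workingSetBytes is a made-up name for this illustration. Whether 12 MB actually fits in cache depends on the chip, which I have not verified, so treat this as a back-of-the-envelope number rather than proof.

```swift
// Bytes occupied by the arrays a, b, and c for a given element count.
// workingSetBytes is a hypothetical helper for this estimate only.
func workingSetBytes(elementCount n: Int, arrayCount: Int = 3) -> Int {
    return n * arrayCount * MemoryLayout<Double>.stride
}

for n in [500_000, 1_000_000] {
    let bytes = workingSetBytes(elementCount: n)
    print("n = \(n): \(bytes) bytes (\(Double(bytes) / 1_000_000) MB)")
}
// n = 500000: 12000000 bytes (12.0 MB)
// n = 1000000: 24000000 bytes (24.0 MB)
```

Halving n halves the working set from 24 MB to 12 MB, so it is at least plausible that the smaller case stays in a cache level that the larger one overflows.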