Performance of Accelerate framework vs Swift on Apple Silicon

I finally upgraded from Intel to Apple Silicon with an M4 MacBook Air. I was curious about the performance of Accelerate on the new Mac, so I ran some benchmarks of large vector addition.

Swift example

Below is a basic Swift example that adds two arrays of doubles. The size of the arrays is defined by n, which is 800,000,000. The arrays a and b are initialized with repeated doubles. The for-loop adds each element of the arrays and stores the result in the c array. The first and last items in the c array are printed to check the result.

// Swift example
// swiftc basic.swift -Ounchecked -o build/basic && ./build/basic

func main() {
    let n = 800_000_000

    let a = Array(repeating: 2.5, count: n)
    let b = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 0.0, count: n)

    for i in 0..<n {
        c[i] = a[i] + b[i]
    }

    print("first", c[0], "last", c[n-1])
}

main()

Accelerate examples

Here is the same example using vDSP from the Accelerate framework.

// Accelerate vDSP example
// swiftc accel.swift -Ounchecked -o build/accel && ./build/accel

import Accelerate

func main() {
    let n = 800_000_000

    let a = Array(repeating: 2.5, count: n)
    let b = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 0.0, count: n)

    vDSP.add(a, b, result: &c)

    print("first", c[0], "last", c[n-1])
}

main()

Here is the same example using BLAS daxpy from the Accelerate framework. The cblas_daxpy function accumulates into the result array in place (it computes y ← αx + y), so c is initialized with the same value as b; I still create three arrays to keep the number of arrays consistent with the previous examples.

// Accelerate BLAS example
// swiftc blas.swift -Xcc -DACCELERATE_NEW_LAPACK -Ounchecked -o build/blas && ./build/blas

import Accelerate

func main() {
    let n = 800_000_000

    let a = Array(repeating: 2.5, count: n)
    let _ = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 1.88, count: n)

    cblas_daxpy(Int32(n), 1.0, a, 1, &c, 1)

    print("first", c[0], "last", c[n-1])
}

main()

Benchmarks

I benchmarked the above examples using the hyperfine tool, with the Makefile shown below. Here are the specs of the computer where I ran the benchmarks:

  • MacBook Air 15-inch, M4, 2025
  • macOS Sequoia 15.5 arm64
  • CPU: Apple M4 (10) @ 4.46 GHz
  • GPU: Apple M4 (10) @ 1.47 GHz [Integrated]
  • Memory: 32.00 GiB
  • swift-driver version: 1.120.5 Apple Swift version 6.1.2, swiftlang-6.1.2.1.2, clang-1700.0.13.5
  • Target: arm64-apple-macosx15.0

# Makefile

benchmark:
    mkdir -p build
    swiftc basic.swift -Ounchecked -o build/basic
    swiftc accel.swift -Ounchecked -o build/accel
    swiftc blas.swift -Xcc -DACCELERATE_NEW_LAPACK -Ounchecked -o build/blas
    hyperfine --warmup 3 'build/basic' 'build/accel' 'build/blas'

clean:
    rm -rf build

Below are the results from the benchmarks. And yes, the benchmarks include the time to create the arrays.

mkdir -p build
swiftc basic.swift -Ounchecked -o build/basic
swiftc accel.swift -Ounchecked -o build/accel
swiftc blas.swift -Xcc -DACCELERATE_NEW_LAPACK -Ounchecked -o build/blas
hyperfine --warmup 3 'build/basic' 'build/accel' 'build/blas'

Benchmark 1: build/basic
  Time (mean ± σ):      1.659 s ±  0.004 s    [User: 0.643 s, System: 1.016 s]
  Range (min … max):    1.651 s …  1.667 s    10 runs

Benchmark 2: build/accel
  Time (mean ± σ):      1.645 s ±  0.003 s    [User: 0.650 s, System: 0.994 s]
  Range (min … max):    1.640 s …  1.649 s    10 runs

Benchmark 3: build/blas
  Time (mean ± σ):      1.594 s ±  0.004 s    [User: 0.671 s, System: 0.923 s]
  Range (min … max):    1.585 s …  1.599 s    10 runs

Summary
  build/blas ran
    1.03 ± 0.00 times faster than build/accel
    1.04 ± 0.00 times faster than build/basic

I also benchmarked the examples where the code performs the addition operation several times, to make sure more time is spent doing the addition than creating the arrays (see below). But even with these changes, the benchmark timings were similar.

// Swift addition
for _ in 0..<100 {
    for i in 0..<n {
        c[i] = a[i] + b[i]
    }
}

// Accelerate addition
for _ in 0..<100 {
    vDSP.add(a, b, result: &c)
}

Questions

Based on the examples, I don't see any major performance gains from Accelerate compared to plain Swift code when compiling with -Ounchecked. I tried other large vector arithmetic operations and saw similar benchmark results. When I had my old Intel MacBook Pro, there were noticeable performance benefits with Accelerate, but I don't see these benefits with the Apple Silicon MacBook Air. So what is going on here? Is Swift code more efficient on the new M-series Macs, negating the need for the Accelerate framework for certain operations? Are there certain compiler options for Accelerate that I need to use to take advantage of the Apple Silicon architecture?


My guess is that the Array constructors are the main time sink, along with that print statement. I’d suggest not benchmarking a whole binary, with all its overhead, but rather just timing around the math functions.
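For example (an untested sketch; the array size here is arbitrary, and ContinuousClock is just one way to get a monotonic timestamp):

// Time only the addition loop, excluding array construction and process startup
let n = 10_000_000
let a = Array(repeating: 2.5, count: n)
let b = Array(repeating: 1.88, count: n)
var c = Array(repeating: 0.0, count: n)

let start = ContinuousClock.now
for i in 0..<n {
    c[i] = a[i] + b[i]
}
print("loop:", start.duration(to: .now), "check:", c[0], c[n-1])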

At a guess, I’d say these arrays are much larger than your CPU cache, and therefore the execution time is dominated by shunting data back and forth from main memory.

If you repeat with arrays that fit in cache, then you might see a difference. However, that difference might be a result of using the trapping “+” operator, and switching to the wrapping “&+” might make things close again.


I agree that the creation of the arrays takes up most of the time. But I also ran examples (see below) that perform the arithmetic operation several times, and I still see similar results between Swift and Accelerate. If the math operation were more efficient with Accelerate, I would expect these examples to show a difference, but the benchmark results are still similar.

// Swift example
// swiftc basic.swift -Ounchecked -o build/basic && ./build/basic

func main() {
    let n = 5_000_000
    let m = 1000

    let a = Array(repeating: 2.5, count: n)
    let b = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 0.0, count: n)

    for _ in 0..<m {
        for i in 0..<n {
            c[i] = a[i] + b[i]
        }
    }

    print("first", c[0], "last", c[n-1])
}

main()

// Accelerate vDSP example
// swiftc accel.swift -Ounchecked -o build/accel && ./build/accel

import Accelerate

func main() {
    let n = 5_000_000
    let m = 1000

    let a = Array(repeating: 2.5, count: n)
    let b = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 0.0, count: n)

    for _ in 0..<m {
        vDSP.add(a, b, result: &c)
    }

    print("first", c[0], "last", c[n-1])
}

main()

What is the size of the CPU cache on the M4 MacBook Air? Do large arrays need to be chunked into smaller sizes, then operated on, then combined back together? I would expect Accelerate functions to already do such things, but maybe they don't. Regarding the use of &+, I'm using doubles here, and I don't think overflow operators like &+ can be used with Double; they are for integer operations.


A quick web search says the L2 cache is 16MB per performance core, and 4MB per efficiency core.

Assuming you’re testing on a performance core, since you have 3 arrays involved, that suggests a limit of about 5MB per array. At 8 bytes per element, that comes out somewhere in the ballpark of 600,000 elements per array.

So if you run the test again with half a million or so elements, you might see a difference. On the other hand, it’s also possible that Array operations get auto-vectorized under the hood, so they might already be fast.
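Back-of-envelope, something like this (a sketch; the 16MB figure is just the number from that web search):

// Element budget if three Double arrays are to share a 16 MB L2 cache
let l2Bytes = 16 * 1024 * 1024
let perArrayBytes = l2Bytes / 3                            // ~5.3 MB per array
let elements = perArrayBytes / MemoryLayout<Double>.size   // ~699,000 elements
print(elements)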

I’m not an expert on cache management, so I don’t have an answer here.

I would expect, however, that if you’re doing a lot of computation on a large amount of data, then in general it would make sense to operate on chunks that fit in cache.
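The chunked version might look something like this (an untested sketch; the 4,096-element chunk size is an arbitrary cache-friendly guess):

// Process the arrays in cache-sized slices using the C-style vDSP_vaddD
import Accelerate

let n = 5_000_000
let chunk = 4_096   // 4_096 doubles = 32 KB per slice

let a = [Double](repeating: 2.5, count: n)
let b = [Double](repeating: 1.88, count: n)
var c = [Double](repeating: 0.0, count: n)

a.withUnsafeBufferPointer { ap in
    b.withUnsafeBufferPointer { bp in
        c.withUnsafeMutableBufferPointer { cp in
            var i = 0
            while i < n {
                let m = min(chunk, n - i)
                // add one slice at a time: c[i..<i+m] = a[i..<i+m] + b[i..<i+m]
                vDSP_vaddD(ap.baseAddress! + i, 1,
                           bp.baseAddress! + i, 1,
                           cp.baseAddress! + i, 1,
                           vDSP_Length(m))
                i += m
            }
        }
    }
}
print("first", c[0], "last", c[n - 1])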

Oh yeah, I was thinking of integers, you’re right.

Here's the generated code for the loop:

LBB1_5:
	ldp	q0, q1, [x8, #-16]
	ldp	q2, q3, [x10, #-16] 
	fadd.2d	v0, v0, v2 
	fadd.2d	v1, v1, v3
	stp	q0, q1, [x9, #-16] 
	add	x8, x8, #32
	add	x9, x9, #32
	add	x10, x10, #32
	subs	x11, x11, #4
	b.ne	LBB1_5

So it is indeed autovectorized and unrolled 2x, processing 256 bits (four doubles) of each array per iteration. I expect that's probably optimal or close to it.


"L1 cache is a beer in hand, L3 is fridge, main memory is walking to the store".

This benchmark is obviously dominated by RAM access speed.

Could you make it faster? No. But you could do much more per RAM access than just “+”; that's how to make the overall processing faster.

Can you elaborate on this? What do you mean by doing more per RAM access?

You could likely do much more arithmetic than just an add without slowing it down at all, because the extra arithmetic would just fill up the time that was previously spent stalled waiting for memory loads.
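For example (an untested sketch, not from this thread; sizes are arbitrary): both loops below stream the same three arrays, but the second does several extra floating-point operations per element. If the loop is memory-bound, the two timings should come out close.

// Extra ALU work can hide in the time spent waiting for memory
let n = 50_000_000
let a = Array(repeating: 2.5, count: n)
let b = Array(repeating: 1.88, count: n)
var c = Array(repeating: 0.0, count: n)

for i in 0..<n { c[i] = a[i] + b[i] }   // warm up all three buffers

var tic = ContinuousClock.now
for i in 0..<n {
    c[i] = a[i] + b[i]                  // one add per element
}
print("add only:", tic.duration(to: .now))

tic = ContinuousClock.now
for i in 0..<n {
    let s = a[i] + b[i]
    c[i] = s * s - a[i] * b[i] + 1.0    // several ops, same memory traffic
}
print("more math:", tic.duration(to: .now), "check", c[0], c[n - 1])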


Nice!

The last time I was messing around with SIMD stuff, @scanon showed me a magical incantation to let the compiler use extra-wide registers, but I don’t remember it off the top of my head.

So if the Swift code is already vectorized, then what is the point of using the Accelerate functions? It doesn’t seem like there is any benefit to using Accelerate when doing basic arithmetic operations on vectors (arrays).

For something simple like this you don’t need Accelerate, yeah.

It might be interesting to see if it helps at much larger sizes, but in general I’d expect you need more complex work than just addition for it to really matter.

Of course it does have the benefit that it’ll still be fast in debug builds, which won’t be the case for the automatic vectorization.

[edit] Also, it means that if there’s suddenly a new chip to target that has different tradeoffs, you’ll already be optimized for it, since Accelerate will have a version for that chip.


As an aside, when I benchmarked this (just the calculations, explicitly eliminating the array building), I found that the BLAS rendition was ~4× faster on my M4 Mac mini. The vDSP and manual versions were basically the same as each other.


My technique for measuring time was:

let n = 800_000_000

let a = Array(repeating: 2.5, count: n)
let b = Array(repeating: 1.88, count: n)
var c = Array(repeating: 0.0, count: n)

let start = ContinuousClock.now
for i in 0..<n {
    c[i] = a[i] + b[i]
}
let duration = start.duration(to: .now)

Can you share the BLAS example that you benchmarked? I don't think the BLAS code I shared earlier is doing the same thing as the Swift and vDSP examples if you put it in a for-loop and time it.

Here are my examples where the for-loop is timed.

// Accelerate vDSP example
// swiftc accel.swift -Ounchecked -o build/accel && ./build/accel

import Accelerate

func main() {
    // setup
    let n = 1_000_000
    let m = 10_000

    let a = Array(repeating: 2.5, count: n)
    let b = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 0.0, count: n)

    // warmup
    for _ in 0..<3 {
        vDSP.add(a, b, result: &c)
    }

    // benchmark
    print("Accelerate vDSP")

    let tic = ContinuousClock.now

    for _ in 0..<m {
        vDSP.add(a, b, result: &c)
    }

    let toc = tic.duration(to: .now)

    print("first", c[0], "last", c[n-1])
    print(toc)
}

main()

// Swift example
// swiftc basic.swift -Ounchecked -o build/basic && ./build/basic

func main() {
    // setup
    let n = 1_000_000
    let m = 10_000

    let a = Array(repeating: 2.5, count: n)
    let b = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 0.0, count: n)

    // warmup
    for _ in 0..<3 {
        for i in 0..<n {
            c[i] = a[i] + b[i]
        }
    }

    // benchmark
    print("Swift")

    let tic = ContinuousClock.now

    for _ in 0..<m {
        for i in 0..<n {
            c[i] = a[i] + b[i]
        }
    }

    let toc = tic.duration(to: .now)

    print("first", c[0], "last", c[n-1])
    print(toc)
}

main()

// Accelerate BLAS example
// swiftc blas.swift -Xcc -DACCELERATE_NEW_LAPACK -Xcc -DACCELERATE_LAPACK_ILP64 -Ounchecked -o build/blas && ./build/blas

import Accelerate

func main() {
    // setup
    let n = 1_000_000
    let m = 10_000

    let a = Array(repeating: 2.5, count: n)
    let b = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 0.0, count: n)

    // warmup
    for _ in 0..<3 {
        cblas_dcopy(n, b, 1, &c, 1)
        cblas_daxpy(n, 1.0, a, 1, &c, 1)
    }

    // benchmark
    print("Accelerate BLAS")

    let tic = ContinuousClock.now

    for _ in 0..<m {
        cblas_dcopy(n, b, 1, &c, 1)
        cblas_daxpy(n, 1.0, a, 1, &c, 1)
    }

    let toc = tic.duration(to: .now)

    print("first", c[0], "last", c[n-1])
    print(toc)
}

main()

Here are the benchmark results for n = 1_000_000 and m = 10_000. The BLAS code is slower than the Swift and vDSP versions because it also copies the b array on every iteration. The Swift and vDSP code performs about the same.

Accelerate vDSP
first 4.38 last 4.38
1.93274625 seconds

Swift
first 4.38 last 4.38
1.95879725 seconds

Accelerate BLAS
first 4.38 last 4.38
3.33451375 seconds

If I use n = 500_000 for the array size, then the BLAS code is almost 2x faster than the others. Below are the benchmark results. I think this is because the arrays now fit in cache memory, and the copy in the BLAS code is much faster when it stays in cache.

Accelerate vDSP
first 4.38 last 4.38
0.943721291 seconds

Swift
first 4.38 last 4.38
0.951104833 seconds

Accelerate BLAS
first 4.38 last 4.38
0.55452775 seconds

I'll dig into this thread and write up some details later, but this has come up a few times in the past, and I really cannot emphasize it enough: if you care about low-level performance, don't benchmark streaming operations on million-element arrays.

ALUs are much, much faster than memory and L3 caches, and for many operations they are faster than L2 caches. If you want to analyze optimization of core computational loops, your working set must be resident in L1 cache.

The exact size of L1 varies across architecture and uArch, but 32KB is generally a safe target (Apple Silicon has a larger L1, but 32KB will keep you in L1 on a relatively diverse set of architectures). That means that if you have three buffers of doubles, your buffer size should be at most 1000 or so. If your buffers are much bigger than that, you are only measuring the speed of the cache hierarchy, which is independent of the implementation used.

In addition, your output buffers must be allocated ahead of time and already warmed up with a write, and you have to warm up your input buffers by reading from them at least once. The normal way to do this is to run your benchmark multiple times and discard the first (few) run(s), or simply take the minimum across all runs; for a benchmark like this, that is the most relevant statistic to gather.
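A minimal harness along those lines might look like this (a sketch based on the advice above; the sizes and repeat counts are my own guesses):

// Three 1,000-element Double buffers (~24 KB total) should stay L1-resident
let n = 1_000
let runs = 100

let a = Array(repeating: 2.5, count: n)
let b = Array(repeating: 1.88, count: n)
var c = Array(repeating: 0.0, count: n)   // output allocated ahead of time

var best: Duration = .seconds(1)
for _ in 0..<runs {
    let tic = ContinuousClock.now
    for _ in 0..<1_000 {                  // repeat so timer resolution doesn't dominate
        for i in 0..<n {
            c[i] = a[i] + b[i]
        }
    }
    let elapsed = tic.duration(to: .now)
    if elapsed < best { best = elapsed }  // keep the minimum; early runs double as warmup
}
print("best:", best, "check:", c[0], c[n - 1])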


It was your original BLAS example, but measuring duration around the operation itself, excluding the start up of the app, the preparation of the arrays, etc.:

import Accelerate

func main() {
    let n = 800_000_000

    let a = Array(repeating: 2.5, count: n)
    let _ = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 1.88, count: n)

    let start = ContinuousClock.now
    cblas_daxpy(Int32(n), 1, a, 1, &c, 1)
    let duration = start.duration(to: .now)

    …
}

My point was merely to say that you reported that your first three examples produced indistinguishable results, but that was not my experience.

(FWIW, in my benchmarks, although I hadn't mentioned it, I did the same warmup as you have in your subsequent examples.)

No, it’s not. That’s why cblas_daxpy is faster: although it’s doing extra work (the multiplication by “alpha”), rather than referring to three distinct vectors (two inputs and one output), it only needs to reference two (one is “read”, the other is “read/write”), which offers better cache locality.

You then propose:

I can see why introducing cblas_dcopy might feel like it should balance the scales, but this is not “doing the same thing as the Swift and vDSP examples” either. It just shifts your thumb from one side of the scales to the other. Doing a full copy of the array first, and then proceeding with cblas_daxpy, is not the same thing as the vDSP and manual looping examples.


So, my ā€œtake homeā€ messages:

  1. The appropriate choice of cBLAS or vDSP comes down to the specific requirements of your use-case. I would hesitate to make any categorical statements about one’s performance benefits over the other.

    The only general statement I would make is that the more distinct memory structures you need to reference, the more overhead that entails.

  2. Regarding why Accelerate yielded greater benefits on Intel architecture, I wonder if that might vary based upon your project’s “Enable Additional Vector Extensions” (CLANG_X86_VECTOR_INSTRUCTIONS) build setting, which is only applicable for Intel targets. Accelerate would use vector instructions regardless, whereas your own Swift code’s use of vector instructions on Intel targets was subject to this build setting.

    Apple Silicon renders this build setting unnecessary.
