Can you share the BLAS example that you benchmarked? I don't think the BLAS code I shared earlier is doing the same thing as the Swift and vDSP examples if you put it in a for-loop and time it.
Here are my examples where the for-loop is timed.
// Accelerate vDSP example
// swiftc accel.swift -Ounchecked -o build/accel && ./build/accel

import Accelerate

func main() {
    // setup
    let n = 1_000_000
    let m = 10_000

    let a = Array(repeating: 2.5, count: n)
    let b = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 0.0, count: n)

    // warmup
    for _ in 0..<3 {
        vDSP.add(a, b, result: &c)
    }

    // benchmark
    print("Accelerate vDSP")
    let tic = ContinuousClock.now
    for _ in 0..<m {
        vDSP.add(a, b, result: &c)
    }
    let toc = tic.duration(to: .now)

    print("first", c[0], "last", c[n-1])
    print(toc)
}

main()
// Swift example
// swiftc basic.swift -Ounchecked -o build/basic && ./build/basic

func main() {
    // setup
    let n = 1_000_000
    let m = 10_000

    let a = Array(repeating: 2.5, count: n)
    let b = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 0.0, count: n)

    // warmup
    for _ in 0..<3 {
        for i in 0..<n {
            c[i] = a[i] + b[i]
        }
    }

    // benchmark
    print("Swift")
    let tic = ContinuousClock.now
    for _ in 0..<m {
        for i in 0..<n {
            c[i] = a[i] + b[i]
        }
    }
    let toc = tic.duration(to: .now)

    print("first", c[0], "last", c[n-1])
    print(toc)
}

main()
// Accelerate BLAS example
// swiftc blas.swift -Xcc -DACCELERATE_NEW_LAPACK -Xcc -DACCELERATE_LAPACK_ILP64 -Ounchecked -o build/blas && ./build/blas

import Accelerate

func main() {
    // setup
    let n = 1_000_000
    let m = 10_000

    let a = Array(repeating: 2.5, count: n)
    let b = Array(repeating: 1.88, count: n)
    var c = Array(repeating: 0.0, count: n)

    // warmup
    for _ in 0..<3 {
        cblas_dcopy(n, b, 1, &c, 1)        // c = b
        cblas_daxpy(n, 1.0, a, 1, &c, 1)   // c = 1.0 * a + c
    }

    // benchmark
    print("Accelerate BLAS")
    let tic = ContinuousClock.now
    for _ in 0..<m {
        cblas_dcopy(n, b, 1, &c, 1)        // c = b
        cblas_daxpy(n, 1.0, a, 1, &c, 1)   // c = 1.0 * a + c
    }
    let toc = tic.duration(to: .now)

    print("first", c[0], "last", c[n-1])
    print(toc)
}

main()
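To double-check that the copy-then-axpy sequence produces the same values as a direct element-wise add, the two BLAS calls can be modeled in plain Swift. This is only a sketch: `copyModel` and `axpyModel` are hypothetical stand-ins written for illustration, not the Accelerate routines, and they mirror only the unit-stride behavior (y = x, then y = alpha * x + y).

```swift
// Plain-Swift stand-ins for the unit-stride BLAS calls.
// These are hypothetical helpers, not the Accelerate implementations.
func copyModel(_ x: [Double], into y: inout [Double]) {
    for i in x.indices { y[i] = x[i] }                  // y = x
}

func axpyModel(_ alpha: Double, _ x: [Double], into y: inout [Double]) {
    for i in x.indices { y[i] = alpha * x[i] + y[i] }   // y = alpha * x + y
}

// Small arrays with the same values as the benchmarks.
let n = 8
let a = Array(repeating: 2.5, count: n)
let b = Array(repeating: 1.88, count: n)

// Direct element-wise add, as in the Swift and vDSP versions.
var direct = Array(repeating: 0.0, count: n)
for i in 0..<n { direct[i] = a[i] + b[i] }

// Copy-then-axpy, as in the BLAS version.
var viaBlas = Array(repeating: 0.0, count: n)
copyModel(b, into: &viaBlas)
axpyModel(1.0, a, into: &viaBlas)

print("same result:", direct == viaBlas)
```

With alpha = 1.0 the multiplication is exact, so both paths perform the same floating-point addition per element. The difference between the versions is the extra pass over the data, not the values they produce.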
Here are the benchmark results for n = 1_000_000 and m = 10_000. The BLAS version is slower than the Swift and vDSP versions, most likely because it makes two passes over the data: it first copies b into c, then adds a to c. The Swift and vDSP versions perform about the same.
Accelerate vDSP
first 4.38 last 4.38
1.93274625 seconds
Swift
first 4.38 last 4.38
1.95879725 seconds
Accelerate BLAS
first 4.38 last 4.38
3.33451375 seconds
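A back-of-the-envelope count of array reads and writes per iteration is roughly consistent with these timings. The sketch below assumes the run is limited by memory traffic and ignores caches, write-allocate behavior, and prefetching, so treat it as an estimate and not a measurement. The traffic counts are in units of one full array (n doubles).

```swift
// Rough count of full-array reads and writes per benchmark iteration.
// Assumption: the run is memory-bound, and cache effects are ignored.
let addTraffic = 2 + 1      // vDSP.add / Swift loop: read a, read b, write c
let copyTraffic = 1 + 1     // cblas_dcopy: read b, write c
let axpyTraffic = 2 + 1     // cblas_daxpy: read a, read c, write c
let blasTraffic = copyTraffic + axpyTraffic

// Ratio predicted by the traffic count versus the ratio of the timings above.
let predicted = Double(blasTraffic) / Double(addTraffic)
let measured = 3.33451375 / 1.93274625   // BLAS seconds / vDSP seconds

print("predicted BLAS/vDSP ratio:", predicted)
print("measured BLAS/vDSP ratio:", measured)
```

The count predicts a ratio of about 1.67 and the timings give about 1.73, which fits the idea that the extra copy pass accounts for most of the gap at this array size.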
If I use n = 500_000 for the array size, the BLAS version is about 1.7x faster than the others. The benchmark results are below. I think this is because the arrays now fit in cache, where the copy in the BLAS version is much cheaper.
Accelerate vDSP
first 4.38 last 4.38
0.943721291 seconds
Swift
first 4.38 last 4.38
0.951104833 seconds
Accelerate BLAS
first 4.38 last 4.38
0.55452775 seconds
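To put numbers on the cache hypothesis, the combined size of the three arrays and the scaling of each timing can be computed from the figures above. This is a sketch only: `workingSetMB` is a hypothetical helper, and whether a given working set fits in cache depends on the machine, which is not stated here.

```swift
// Combined size of the arrays a, b, and c in megabytes (10^6 bytes).
// Hypothetical helper for illustration.
func workingSetMB(n: Int, arrays: Int = 3) -> Double {
    Double(n * arrays * MemoryLayout<Double>.size) / 1_000_000
}

print("n = 1_000_000:", workingSetMB(n: 1_000_000), "MB")
print("n = 500_000:  ", workingSetMB(n: 500_000), "MB")

// How much each version sped up when n was halved (timings from above).
let vdspScaling = 1.93274625 / 0.943721291
let blasScaling = 3.33451375 / 0.55452775

// BLAS advantage over vDSP at n = 500_000.
let blasAdvantage = 0.943721291 / 0.55452775

print("vDSP speedup from halving n:", vdspScaling)
print("BLAS speedup from halving n:", blasScaling)
print("BLAS vs vDSP at n = 500_000:", blasAdvantage)
```

Halving n roughly halves the vDSP time (about 2.05x), as expected for linear scaling, while the BLAS time drops by about 6x. That disproportionate drop suggests something other than the amount of work changed between the two sizes, which is consistent with a cache effect but does not prove it.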