I have been happily using the Accelerate framework to speed up some financial operations involving thousands of data points. Sadly, there is no vDSP.sum(_:) for signed integer numbers (only for floating-point numbers). The naive reduce() implementation is too slow for my current purposes, so I decided to use this opportunity to learn how to use AVX intrinsics with Swift. So far I have this:
import Accelerate
import _Builtin_intrinsics.intel

extension vDSP {
    /// Returns the wrapping sum of all `Int32` elements in the vector.
    @_transparent static func sum<U>(_ vector: U) -> Int32 where U: AccelerateBuffer, U.Element == Int32 {
        vector.withUnsafeBufferPointer { (buffer) -> Int32 in
            // An empty buffer may have a nil base address.
            guard let base = buffer.baseAddress else { return 0 }
            let (iterations, remaining) = (buffer.count / 8, buffer.count % 8)
            // Sum eight 32-bit lanes at a time using unaligned loads.
            var result: Int32 = base.withMemoryRebound(to: __m256i.self, capacity: iterations) {
                var accumulator = _mm256_setzero_si256()
                for i in 0..<iterations {
                    let element = _mm256_loadu_si256($0 + i)
                    accumulator = _mm256_add_epi32(accumulator, element)
                }
                // Horizontal reduction of the eight lanes (wrapping on overflow).
                let values = unsafeBitCast(accumulator, to: SIMD8<Int32>.self)
                return values[0] &+ values[1] &+ values[2] &+ values[3] &+ values[4] &+ values[5] &+ values[6] &+ values[7]
            }
            // Add the tail elements with the same wrapping semantics as the vector part.
            for i in 0..<remaining {
                result &+= buffer[iterations * 8 + i]
            }
            return result
        }
    }
}
The current code has several shortcomings, and somehow I am unable to use some AVX2 intrinsics, such as _mm256_extracti128_si256. I would like to use some horizontal adds and extract parts of the values.
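To make the intent concrete, here is a minimal sketch of the extract-and-add reduction, written with the standard library's SIMD types instead of intrinsics. The function name horizontalSum is hypothetical, and this only illustrates the shape of the reduction (split the eight lanes into halves, add, repeat); it is not the _mm256_extracti128_si256 version and makes no claim about the instructions the compiler emits.

```swift
/// Hypothetical helper: reduces eight Int32 lanes to one value by
/// repeatedly adding the upper half onto the lower half (wrapping on overflow).
func horizontalSum(_ vector: SIMD8<Int32>) -> Int32 {
    // 8 lanes -> 4 lanes: conceptually "extract the high 128 bits and add".
    let quad = vector.lowHalf &+ vector.highHalf
    // 4 lanes -> 2 lanes.
    let pair = quad.lowHalf &+ quad.highHalf
    // 2 lanes -> scalar.
    return pair.x &+ pair.y
}
```

The standard library's wrappedSum() on integer SIMD vectors should give the same result, so the helper mainly serves to spell out the steps.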
You can see the current compiler output on Godbolt.
Concretely, I have several questions:
- Is someone out there actively using vector extensions with Swift?
- How can I activate AVX2 compilation per function? The -Xcc -Xclang -Xcc -target-feature -Xcc -Xclang -Xcc +avx2 flags seem too heavy-handed (likewise for setting the project-wide "Enable Additional Vector Extensions" attribute).
- What is the best way to implement an Int32 sum with AVX2?
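For comparison, here is a minimal sketch of the same sum written without intrinsics, using SIMD8<Int32> from the standard library. The function name wrappingSum is hypothetical, it takes a plain array rather than an AccelerateBuffer to stay self-contained, and it assumes Swift 5.7 or later for loadUnaligned(fromByteOffset:as:). Whether this lowers to AVX2 instructions depends on the target features enabled for the build, so treat it as a baseline to measure against rather than a known-best implementation.

```swift
/// Hypothetical portable baseline: wrapping sum of Int32 values,
/// accumulated eight lanes at a time.
func wrappingSum(_ values: [Int32]) -> Int32 {
    values.withUnsafeBytes { raw -> Int32 in
        let lanes = SIMD8<Int32>.scalarCount                 // 8 lanes per chunk
        let chunkBytes = lanes * MemoryLayout<Int32>.stride  // 32 bytes per chunk
        let fullChunks = values.count / lanes

        // Vector part: unaligned loads, lane-wise wrapping adds.
        var accumulator = SIMD8<Int32>()                     // all lanes zero
        for chunk in 0..<fullChunks {
            let block = raw.loadUnaligned(fromByteOffset: chunk * chunkBytes,
                                          as: SIMD8<Int32>.self)
            accumulator = accumulator &+ block
        }

        // Horizontal reduction of the eight lanes.
        var result = accumulator.wrappedSum()

        // Scalar tail for the elements that do not fill a whole chunk.
        for index in (fullChunks * lanes)..<values.count {
            result = result &+ values[index]
        }
        return result
    }
}
```

A quick sanity check is to compare the result against values.reduce(0, &+) on the same input, since both use wrapping addition.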