I suspect that the version of adler32 that's in the system library you're comparing against via DataCompression isn't identical to the plain C version, but uses SIMD intrinsics like this version from Chromium.
Here's a Gist with my incredibly naïve port of that C code into Swift, using the SIMD types wherever I could and only falling back to the low-level intrinsics when it wasn't obvious that an operation was supported (this is getting outside my area of expertise). When I run it in your benchmark, I get:
Neon: 0.00966799259185791s
Library: 0.010444998741149902s
...
Neon is 0.9256097421792077x slower than library.
Neon is 180.8131342400217x faster than worst.
One thing I left out: there's a block of code that's supposed to handle any non-16-byte-aligned prefix of the data. When I uncomment it, the codegen goes wild and it brings performance back down to ~7x the library implementation's time. I wasn't able to figure out what was going on there, but it didn't seem to matter for the test cases in your benchmark, so I left it commented out.
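If it helps to see the shape of the trick without chasing the Gist, here's a much-simplified sketch using only the portable SIMD types (the function name is mine, and it won't reach the intrinsics version's throughput): process 16 bytes per iteration, use a weighted lane multiply for the b contribution, and take the modulo once per block rather than once per byte.

import Foundation

// Sketch only: adler32 processed 16 bytes at a time with SIMD16<UInt32>.
// Weights 16,15,...,1 give each byte's contribution to the "b" sum.
func adler32SIMDSketch(_ data: Data) -> UInt32 {
    let mod: UInt32 = 65521
    var a: UInt32 = 1, b: UInt32 = 0
    let weights = SIMD16<UInt32>((1...16).reversed().map { UInt32($0) })
    data.withUnsafeBytes { (raw: UnsafeRawBufferPointer) in
        var i = 0
        while i + 16 <= raw.count {
            var block = SIMD16<UInt32>()
            for j in 0 ..< 16 { block[j] = UInt32(raw[i + j]) }
            // b picks up 16 copies of the old a plus the weighted byte sum
            b = (b &+ 16 &* a &+ (block &* weights).wrappedSum()) % mod
            a = (a &+ block.wrappedSum()) % mod
            i += 16
        }
        while i < raw.count { // scalar tail for the trailing < 16 bytes
            a = (a &+ UInt32(raw[i])) % mod
            b = (b &+ a) % mod
            i += 1
        }
    }
    return (b << 16) | a
}

The real speedups come from wider blocks, keeping the accumulators in vector registers, and deferring the modulo much further, which is what the NEON/SSE versions do.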
Given that the operation runs over large sequences of bytes, I wondered if we could vectorize it in some way. I'll need to ship both Intel and ARM versions, so I'll investigate what supporting both looks like. Thanks.
I wonder about the "big endian" business. Yes, I can see that "The bytes are stored in network order", but it's not quite obvious what that means in this context. The example on the Wikipedia page calculates the adler32 checksum of the word "Wikipedia" and gets the value 300286872, which I can replicate with their reference implementation without calling "bigEndian" on the result:
func adler32(_ data: Data) -> UInt32 {
    var a: UInt32 = 1, b: UInt32 = 0
    for v in data {
        a = (a + UInt32(v)) % 65521
        b = (b + a) % 65521
    }
    return (b << 16) | a
}
let v = adler32("Wikipedia".data(using: .ascii)!) // 300286872, as per wikipedia
let big = v.bigEndian // 2550392337 - something else when running on little endian computer
Unless I'm misunderstanding something, surely adler32("Wikipedia") should give the same result on big and little endian platforms... but that means I should NOT mess with the endianness of the final result when doing the calculation on a little endian computer.
Some naïve side questions / ideas on how to make this (or similar †) checksum calculation faster:
could Apple's Accelerate framework help here to make it faster?
would it be reasonable to spawn N threads (where N is the number of CPU cores) to do the checksum calculation in chunks? († ok, ok, maybe not for this particular calculation if it's impossible to work on N chunks independently, but for some other checksum calculation that does allow working on N chunks independently and then combining the individual chunk results into the final result.)
is it reasonable to speed up this calculation using GPU?
Most likely it's the use of the resulting value in the deflate body encoding that requires bigEndian: the checksum has to appear in network order when it's encoded into the body itself. I'll look into it more next week as I try more optimizations.
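In other words, the checksum value itself is the same on every platform; it's only when the 32-bit value is serialized into the stream as four bytes that it has to go out most-significant byte first. A minimal sketch of that serialization (the helper name is mine):

import Foundation

// Write a checksum into an output buffer in network (big-endian) byte order.
func appendNetworkOrder(_ checksum: UInt32, to output: inout Data) {
    withUnsafeBytes(of: checksum.bigEndian) { output.append(contentsOf: $0) }
}

var stream = Data() // ... the deflate body would already be in here ...
appendNetworkOrder(300286872, to: &stream)
// stream now ends in the bytes 11 E6 03 98 regardless of host endianness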
Using Accelerate might be possible. I need to familiarize myself with the vector instructions and the core algorithm anyway, so I'll take a look. For this case, parallel or GPU calculation seems unlikely to help.
Well, I feel pretty dumb. Strangely, the DataCompression library I was comparing against uses Apple's Compression library but somehow doesn't realize zlib is available without dynamic loading. And of course using zlib directly is the fastest way (barring vectorization). Apple's implementation (at the bottom) is also simpler than the libz version I found, so perhaps I'll do another port to compare.
Thanks everyone. I do wish I could've gained more insight into the performance difference here (feel free to let me know if you find anything) but this is largely solved for my needs.
Answering myself on this one: yes, that's possible, although the speed difference is only observable on large amounts of data (20+ megabytes). In the particular case of adler32 there's a special call that combines partial checksums: adler32_combine.
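The reason this works: an adler32 value is just the pair (a, b) packed as (b << 16) | a, and to append a second chunk's checksum you only need that chunk's pair and its length, because each of its b-steps was computed without the first chunk's final contribution. A rough Swift rendering of the combine math (my own sketch, not zlib's code):

// Combine adler32 checksums of two adjacent chunks (BASE = 65521).
func combineAdler32(_ adler1: UInt32, _ adler2: UInt32, _ len2: Int) -> UInt32 {
    let base: UInt64 = 65521
    let a1 = UInt64(adler1 & 0xffff), b1 = UInt64(adler1 >> 16)
    let a2 = UInt64(adler2 & 0xffff), b2 = UInt64(adler2 >> 16)
    let len = UInt64(len2) % base
    // a2 already contains the initial 1, so drop one of the two.
    let a = (a1 + a2 + base - 1) % base
    // Each of chunk 2's len2 steps was short by (a1 - 1).
    let b = (b1 + b2 + len * ((a1 + base - 1) % base)) % base
    return UInt32((b << 16) | a)
}

So combineAdler32(adler32(c1), adler32(c2), c2.count) should match adler32 of the two chunks concatenated.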
A quick & dirty proof-of-concept test:
import Foundation
import zlib

func adler32(_ data: Data) -> UInt32 {
    data.withUnsafeBytes {
        // zlib's adler32, seeded with the initial value 1
        UInt32(adler32(1, $0.bindMemory(to: UInt8.self).baseAddress, UInt32($0.count)))
    }
}
func adler32_async(_ data: Data, threadCount: Int = 10, execute: @escaping (UInt32) -> Void) {
    var threads: [Thread] = []
    var results = [UInt32](repeating: 0, count: threadCount)
    precondition(threadCount > 1)
    precondition(data.count >= threadCount) // TODO
    let chunkLen = data.count / threadCount
    precondition(chunkLen >= 1)
    let lastLen = data.count - (chunkLen * (threadCount - 1))
    var doneCount = 0
    for i in 0 ..< threadCount {
        let t = Thread {
            let start = i * chunkLen
            let end = i == threadCount - 1 ? data.count : start + chunkLen
            let v = adler32(data[start ..< end])
            DispatchQueue.main.async {
                results[i] = v
                doneCount += 1
                if doneCount == threadCount {
                    var r = results[0]
                    for i in 1 ..< threadCount {
                        let len = i == threadCount - 1 ? lastLen : chunkLen
                        r = UInt32(adler32_combine(UInt(r), UInt(results[i]), len))
                    }
                    execute(r)
                }
            }
        }
        threads.append(t)
        t.start()
    }
}
func test() {
    // print("initializing data")
    let size = 1024*1024*1024
    let p = malloc(size)!                  // contents are arbitrary; fine for a throughput test
    let data = Data(bytes: p, count: size)
    free(p)                                // Data(bytes:count:) copies, so the buffer can go
    // print("start test \(size)")
    let start1 = Date()
    let v1 = adler32(data) // cold run
    let elapsed1 = Date().timeIntervalSince(start1)
    // print("adler32 (cold run) \(elapsed1)")
    let start2 = Date()
    let v2 = adler32(data) // hot run
    let elapsed2 = Date().timeIntervalSince(start2)
    print("adler32 size: \(size), time: \(elapsed2)")
    let start3 = Date()
    adler32_async(data) { result in
        precondition(result == v2)
        let elapsed3 = Date().timeIntervalSince(start3)
        print("adler32_async size: \(size), time: \(elapsed3), ~\(elapsed2/elapsed3) times faster")
        print("test ended")
        exit(0)
    }
    RunLoop.main.run(until: .distantFuture)
}

test()
Got these results when using 8 threads (on a computer with 8 performance cores + 2 efficiency cores):
1000MB, orig: 0.080558, threaded: 0.011235, ~7.17 times faster
100MB, orig: 0.007674, threaded: 0.001697, ~4.52 times faster
10MB, orig: 0.000628, threaded: 0.000680, ~0.92 times faster (a bit slower)
1MB, orig: 6.20e-05, threaded: 0.000578, ~0.107 times faster (9x slower)
You should see increased performance on smaller inputs and more stable results on larger inputs. You'd also be reusing threads GCD has already spawned, rather than starting and stopping your own just for this parallel computation.
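A rough sketch of that approach, assuming DispatchQueue.concurrentPerform is what's meant here and reusing the adler32(_:) wrapper and adler32_combine call from the code above (the function name is mine):

import Foundation

// Same chunking scheme, but letting GCD schedule the work.
func adler32_concurrent(_ data: Data, chunkCount: Int = 8) -> UInt32 {
    precondition(chunkCount >= 1 && data.count >= chunkCount)
    let chunkLen = data.count / chunkCount
    var partials = [UInt32](repeating: 0, count: chunkCount)
    partials.withUnsafeMutableBufferPointer { buffer in
        let out = buffer.baseAddress!
        DispatchQueue.concurrentPerform(iterations: chunkCount) { i in
            let start = i * chunkLen
            let end = i == chunkCount - 1 ? data.count : start + chunkLen
            out[i] = adler32(data[start ..< end]) // each iteration writes its own slot
        }
    }
    // Fold the partial checksums left to right with adler32_combine.
    var r = partials[0]
    for i in 1 ..< chunkCount {
        let start = i * chunkLen
        let end = i == chunkCount - 1 ? data.count : start + chunkLen
        r = UInt32(adler32_combine(UInt(r), UInt(partials[i]), end - start))
    }
    return r
}

concurrentPerform blocks until every iteration has finished, so there's no completion callback or run loop to keep alive, and the worker threads come from GCD's existing pool.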
Ah, I searched under the apple org but didn't find anything. Yeah, looks like the latest version matches the version I found.
For completeness' sake, how did you get the intrinsics to work? I tried to copy the code in, but the compiler says they're not enabled (module '_Builtin_intrinsics.arm.neon' requires feature 'neon'), and I can't find a way to enable it.
Try updating your Xcode project's settings to build only for arm64. I had to tweak that because "Standard architectures" tells it to build for both arm64 and x86_64, and it was the x86_64 slice that the error message was coming from.
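Alternatively, if you do need a fat binary, you could fence the arm-specific code off with conditional compilation so the x86_64 slice never tries to resolve that import. A sketch (checksum and adler32Neon are placeholder names; adler32 stands for the scalar/zlib version from earlier in the thread):

import Foundation

#if arch(arm64)
import _Builtin_intrinsics.arm.neon   // only resolved when compiling the arm64 slice

func checksum(_ data: Data) -> UInt32 { adler32Neon(data) } // the NEON port
#else
func checksum(_ data: Data) -> UInt32 { adler32(data) }     // plain fallback
#endif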