I suspect that the version of adler32 that's in the system library you're comparing against via DataCompression isn't identical to the plain C version, but uses SIMD intrinsics like this version from Chromium.
Here's a Gist with my incredibly naïve port of that C code into Swift, using the SIMD types wherever I could and only falling back to the low-level intrinsics when it wasn't obvious that an operation was supported (this is getting outside my area of expertise). When I run it in your benchmark, I get:
Neon: 0.00966799259185791s
Library: 0.010444998741149902s
...
Neon is 0.9256097421792077x slower than library.
Neon is 180.8131342400217x faster than worst.
One thing I left out: there's a block of code that's supposed to handle any non-16-byte-aligned prefix of the data. When I uncomment it, the codegen goes wild and it brings performance back down to ~7x the library implementation's time. I wasn't able to figure out what was going on there, but it didn't seem to matter for the test cases in your benchmark, so I left it commented out.
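If it helps to see the shape of the trick without chasing the Gist, here's a much-simplified sketch using only the portable SIMD types (the function name is mine, and it won't reach the intrinsics version's throughput): process 16 bytes per iteration, use a weighted lane multiply for the b contribution, and take the modulo once per block rather than once per byte.

import Foundation

// Sketch only: adler32 processed 16 bytes at a time with SIMD16<UInt32>.
// Weights 16,15,...,1 give each byte's contribution to the "b" sum.
func adler32SIMDSketch(_ data: Data) -> UInt32 {
    let mod: UInt32 = 65521
    var a: UInt32 = 1, b: UInt32 = 0
    let weights = SIMD16<UInt32>((1...16).reversed().map { UInt32($0) })
    data.withUnsafeBytes { (raw: UnsafeRawBufferPointer) in
        var i = 0
        while i + 16 <= raw.count {
            var block = SIMD16<UInt32>()
            for j in 0 ..< 16 { block[j] = UInt32(raw[i + j]) }
            // b picks up 16 copies of the old a plus the weighted byte sum
            b = (b &+ 16 &* a &+ (block &* weights).wrappedSum()) % mod
            a = (a &+ block.wrappedSum()) % mod
            i += 16
        }
        while i < raw.count { // scalar tail for the trailing < 16 bytes
            a = (a &+ UInt32(raw[i])) % mod
            b = (b &+ a) % mod
            i += 1
        }
    }
    return (b << 16) | a
}

The real speedups come from wider blocks, keeping the accumulators in vector registers, and deferring the modulo much further, which is what the NEON/SSE versions do.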
Given that the operation runs over large sequences of bytes, I wondered if we could vectorize it in some way. I'll need to ship both Intel and ARM versions, so I'll investigate what supporting both looks like. Thanks.
I wonder about the "big endian" business. Yes, I can see that "The bytes are stored in network order", but it's not quite obvious what that means in this context. The example on the Wikipedia page calculates the adler32 checksum of the word "Wikipedia" and gets the value 300286872, which I can replicate with their reference implementation without calling "bigEndian" on the result:
func adler32(_ data: Data) -> UInt32 {
    var a: UInt32 = 1, b: UInt32 = 0
    for v in data {
        a = (a + UInt32(v)) % 65521
        b = (b + a) % 65521
    }
    return (b << 16) | a
}
let v = adler32("Wikipedia".data(using: .ascii)!) // 300286872, as per wikipedia
let big = v.bigEndian // 2550392337 - something else when running on little endian computer
Unless I'm misunderstanding something, surely adler32("Wikipedia") should give the same result on big and little endian platforms... but that means I should NOT mess with the endianness of the final result when doing the calculation on a little endian computer.
Some naïve side questions / ideas on how to make this (or similar †) checksum calculation faster:
could Apple's Accelerate framework help here to make it faster?
would it be reasonable to spawn N threads (where N is the number of CPU cores) to do the checksum calculation in chunks? († ok, ok, maybe not for this particular calculation if it's impossible to work on N chunks independently, but for some other checksum calculation that does allow working on N chunks independently and then combining the individual chunk results into the final result.)
is it reasonable to speed up this calculation using GPU?
Most likely it's the use of the resulting value in the deflate body encoding that requires bigEndian: the checksum has to appear in network order when it's encoded into the body itself. I'll look into it more next week as I try more optimizations.
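In other words, the checksum value itself is the same on every platform; it's only when the 32-bit value is serialized into the stream as four bytes that it has to go out most-significant byte first. A minimal sketch of that serialization (the helper name is mine):

import Foundation

// Write a checksum into an output buffer in network (big-endian) byte order.
func appendNetworkOrder(_ checksum: UInt32, to output: inout Data) {
    withUnsafeBytes(of: checksum.bigEndian) { output.append(contentsOf: $0) }
}

var stream = Data() // ... the deflate body would already be in here ...
appendNetworkOrder(300286872, to: &stream)
// stream now ends in the bytes 11 E6 03 98 regardless of host endianness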
Using Accelerate might be possible. I need to familiarize myself with the vector instructions and the core algorithm anyway, so I'll take a look. For this case, parallel or GPU calculation seems unlikely to help.
Well, I feel pretty dumb. Strangely, the DataCompression library I was comparing against uses Apple's Compression library but somehow doesn't realize zlib is available without dynamic loading. And of course using zlib directly is the fastest way (barring vectorization). Apple's implementation (at the bottom) is also simpler than the libz version I found, so perhaps I'll do another port to compare.
Thanks everyone. I do wish I could've gained more insight into the performance difference here (feel free to let me know if you find anything) but this is largely solved for my needs.
Answering myself on this one: yes, that's possible, although the speed difference is only observable on large amounts of data (20+ megabytes). In the particular case of adler32 there's a special call that combines partial checksums: adler32_combine.
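The reason this works: an adler32 value is just the pair (a, b) packed as (b << 16) | a, and to append a second chunk's checksum you only need that chunk's pair and its length, because each of its b-steps was computed without the first chunk's final contribution. A rough Swift rendering of the combine math (my own sketch, not zlib's code):

// Combine adler32 checksums of two adjacent chunks (BASE = 65521).
func combineAdler32(_ adler1: UInt32, _ adler2: UInt32, _ len2: Int) -> UInt32 {
    let base: UInt64 = 65521
    let a1 = UInt64(adler1 & 0xffff), b1 = UInt64(adler1 >> 16)
    let a2 = UInt64(adler2 & 0xffff), b2 = UInt64(adler2 >> 16)
    let len = UInt64(len2) % base
    // a2 already contains the initial 1, so drop one of the two.
    let a = (a1 + a2 + base - 1) % base
    // Each of chunk 2's len2 steps was short by (a1 - 1).
    let b = (b1 + b2 + len * ((a1 + base - 1) % base)) % base
    return UInt32((b << 16) | a)
}

So combineAdler32(adler32(c1), adler32(c2), c2.count) should match adler32 of the two chunks concatenated.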
A quick & dirty proof-of-concept test:
import Foundation
import zlib

func adler32(_ data: Data) -> UInt32 {
    data.withUnsafeBytes {
        // zlib's adler32, seeded with the initial value 1
        UInt32(adler32(1, $0.bindMemory(to: UInt8.self).baseAddress, UInt32($0.count)))
    }
}
func adler32_async(_ data: Data, threadCount: Int = 10, execute: @escaping (UInt32) -> Void) {
    var threads: [Thread] = []
    var results = [UInt32](repeating: 0, count: threadCount)
    precondition(threadCount > 1)
    precondition(data.count >= threadCount) // TODO
    let chunkLen = data.count / threadCount
    precondition(chunkLen >= 1)
    let lastLen = data.count - (chunkLen * (threadCount - 1))
    var doneCount = 0
    for i in 0 ..< threadCount {
        let t = Thread {
            let start = i * chunkLen
            let end = i == threadCount - 1 ? data.count : start + chunkLen
            let v = adler32(data[start ..< end])
            DispatchQueue.main.async {
                results[i] = v
                doneCount += 1
                if doneCount == threadCount {
                    var r = results[0]
                    for i in 1 ..< threadCount {
                        let len = i == threadCount - 1 ? lastLen : chunkLen
                        r = UInt32(adler32_combine(UInt(r), UInt(results[i]), len))
                    }
                    execute(r)
                }
            }
        }
        threads.append(t)
        t.start()
    }
}
func test() {
    // print("initializing data")
    let size = 1024*1024*1024
    let p = malloc(size)!                  // contents are arbitrary; fine for a throughput test
    let data = Data(bytes: p, count: size)
    free(p)                                // Data(bytes:count:) copies, so the buffer can go
    // print("start test \(size)")
    let start1 = Date()
    let v1 = adler32(data) // cold run
    let elapsed1 = Date().timeIntervalSince(start1)
    // print("adler32 (cold run) \(elapsed1)")
    let start2 = Date()
    let v2 = adler32(data) // hot run
    let elapsed2 = Date().timeIntervalSince(start2)
    print("adler32 size: \(size), time: \(elapsed2)")
    let start3 = Date()
    adler32_async(data) { result in
        precondition(result == v2)
        let elapsed3 = Date().timeIntervalSince(start3)
        print("adler32_async size: \(size), time: \(elapsed3), ~\(elapsed2/elapsed3) times faster")
        print("test ended")
        exit(0)
    }
    RunLoop.main.run(until: .distantFuture)
}

test()
Got these results when using 8 threads (on a computer with 8 performance cores + 2 efficiency cores):
1000MB, orig: 0.080558, threaded: 0.011235, ~7.17 times faster
100MB, orig: 0.007674, threaded: 0.001697, ~4.52 times faster
10MB, orig: 0.000628, threaded: 0.000680, ~0.92 times faster (a bit slower)
1MB, orig: 6.20e-05, threaded: 0.000578, ~0.107 times faster (9x slower)
You should see increased performance on smaller inputs and more stable results on larger inputs. You'd also be reusing threads GCD has already spawned, rather than starting and stopping your own just for this parallel computation.
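A rough sketch of that approach, assuming DispatchQueue.concurrentPerform is what's meant here and reusing the adler32(_:) wrapper and adler32_combine call from the code above (the function name is mine):

import Foundation

// Same chunking scheme, but letting GCD schedule the work.
func adler32_concurrent(_ data: Data, chunkCount: Int = 8) -> UInt32 {
    precondition(chunkCount >= 1 && data.count >= chunkCount)
    let chunkLen = data.count / chunkCount
    var partials = [UInt32](repeating: 0, count: chunkCount)
    partials.withUnsafeMutableBufferPointer { buffer in
        let out = buffer.baseAddress!
        DispatchQueue.concurrentPerform(iterations: chunkCount) { i in
            let start = i * chunkLen
            let end = i == chunkCount - 1 ? data.count : start + chunkLen
            out[i] = adler32(data[start ..< end]) // each iteration writes its own slot
        }
    }
    // Fold the partial checksums left to right with adler32_combine.
    var r = partials[0]
    for i in 1 ..< chunkCount {
        let start = i * chunkLen
        let end = i == chunkCount - 1 ? data.count : start + chunkLen
        r = UInt32(adler32_combine(UInt(r), UInt(partials[i]), end - start))
    }
    return r
}

concurrentPerform blocks until every iteration has finished, so there's no completion callback or run loop to keep alive, and the worker threads come from GCD's existing pool.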
Ah, I searched under the apple org but didn't find anything. Yeah, looks like the latest version matches the version I found.
For completeness' sake, how did you get the intrinsics to work? I tried to copy the code in, but the compiler says they're not enabled (module '_Builtin_intrinsics.arm.neon' requires feature 'neon'), and I can't find a way to enable it.
Try updating your Xcode project's settings to build only for arm64. I had to tweak that because "Standard architectures" tells it to build for both arm64 and x86_64, and it was the x86_64 slice that the error message was coming from.
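Alternatively, if you do need a fat binary, you could fence the arm-specific code off with conditional compilation so the x86_64 slice never tries to resolve that import. A sketch (checksum and adler32Neon are placeholder names; adler32 stands for the scalar/zlib version from earlier in the thread):

import Foundation

#if arch(arm64)
import _Builtin_intrinsics.arm.neon   // only resolved when compiling the arm64 slice

func checksum(_ data: Data) -> UInt32 { adler32Neon(data) } // the NEON port
#else
func checksum(_ data: Data) -> UInt32 { adler32(data) }     // plain fallback
#endif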