Optimizing Swift: Adler32 Checksum

Ah, okay. I was thinking "is that supposed to be 1 << 16??" but if it's a hardcoded constant in a general implementation that makes sense!

Replacing the hot inner loop with this is actually slower than the while loop. Even manually doing the work without the for loops is slower.

let firstEight = pointer.load(fromByteOffset: position, as: UInt64.self)
position += 8
let secondEight = pointer.load(fromByteOffset: position, as: UInt64.self)
position += 8

for shift in stride(from: 0, through: 56, by: 8) {
    s1 += UInt32((firstEight >> shift) & 0xFF)
    s2 += s1
}

for shift in stride(from: 0, through: 56, by: 8) {
    s1 += UInt32((secondEight >> shift) & 0xFF)
    s2 += s1
}

Here is the implementation swift-png uses:

    s1 += UInt32((firstEight >> shift) & 0xFF)

This expression calls the following initializer:

  @_semantics("optimize.sil.specialize.generic.partial.never")
  @inlinable // FIXME(inline-always)
  @inline(__always)
  public init<T: BinaryInteger>(_ source: T) {
    // This check is potentially removable by the optimizer
    if T.isSigned {
      _precondition(source >= (0 as T), "Negative value is not representable")
    }
    // This check is potentially removable by the optimizer
    if source.bitWidth >= Self.bitWidth {
      _precondition(source <= Self.max,
        "Not enough bits to represent the passed value")
    }
    self.init(truncatingIfNeeded: source)
  }

Those “potentially”s worry me. Try calling init(truncatingIfNeeded:) directly instead.
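
Something like this, reusing the s1/s2/firstEight names from the snippet above (just a sketch of the suggested change, not something I've measured):

for shift in stride(from: 0, through: 56, by: 8) {
    // The truncating initializer has no sign or range preconditions for the optimizer to remove.
    s1 += UInt32(truncatingIfNeeded: (firstEight >> shift) & 0xFF)
    s2 += s1
}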

I assume you are using reasonably sized Data chunks, at least a few KB rather than 10- or 100-byte blocks; otherwise the overhead would be significant.

Some suggestions for you:

  • check with "-enforce-exclusivity=none"
  • check on Intel. Last time I checked, loops were not automatically unrolled on ARM.
  • unroll loops manually to see if there's a difference (see the sketch after this list).
  • double-check that the speed of that C code is on par with the libz implementation (if not, you may need a better starting point for your Swift port).
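
For the unrolling item, a rough sketch of what a manual 8-way unroll of the hot byte loop could look like, assuming pointer is an UnsafeRawBufferPointer and end is a hypothetical bound for the current run (names borrowed from the earlier snippet; untested):

// Hypothetical 8-way manual unroll of the inner byte loop.
while position + 8 <= end {
    s1 &+= UInt32(pointer[position]);     s2 &+= s1
    s1 &+= UInt32(pointer[position + 1]); s2 &+= s1
    s1 &+= UInt32(pointer[position + 2]); s2 &+= s1
    s1 &+= UInt32(pointer[position + 3]); s2 &+= s1
    s1 &+= UInt32(pointer[position + 4]); s2 &+= s1
    s1 &+= UInt32(pointer[position + 5]); s2 &+= s1
    s1 &+= UInt32(pointer[position + 6]); s2 &+= s1
    s1 &+= UInt32(pointer[position + 7]); s2 &+= s1
    position += 8
}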

Nice, this is even faster than the complex version I had ported from C.

func taylor_adler32Checksum(of data: Data) -> UInt32 {
    data.withUnsafeBytes { pointer in
        // 5552 is the largest number of bytes that can be accumulated before the
        // 32-bit sums could overflow, so the modulo by 65521 (the largest prime
        // below 2^16) only has to run once per 5552-byte block.
        let (q, r): (Int, Int) = data.count.quotientAndRemainder(dividingBy: 5552)
        // Split the initial adler value of 1 into its low and high 16-bit halves.
        var (single, double): (UInt32, UInt32) = (1 & 0xffff, (1 >> 16) & 0xffff)

        for i: Int in 0 ..< q {
            for j: Int in 5552 * i ..< 5552 * (i + 1) {
                single &+= .init(pointer[j])
                double &+= single
            }
            single %= 65521
            double %= 65521
        }

        // Remaining bytes that don't fill a whole 5552-byte block.
        for j: Int in 5552 * q ..< 5552 * q + r {
            single &+= .init(pointer[j])
            double &+= single
        }

        return ((double % 65521) << 16 | (single % 65521)).bigEndian
    }
}

This implementation is 6.2x slower than libz (through DataCompression), while the previous version ported from C is 7.8x slower.


Yes, the tests are running on a 16MB block of randomized data. At very small sizes the Swift implementations are actually faster, but there may be other overhead for libz there.

Last I checked there was no difference when turning off exclusivity or bounds checking. In some cases performance was actually worse.

Given that the vast majority of Alamofire use is on ARM, it seems to be the best comparison platform. I also don't have an Intel machine set up with this environment.

I tried copying the manual unrolling that the libz implementation does, and performance was worse.

The C code is libz. The DataCompression library loads libz with dyld and calls its implementation of adler32.

Not my experience: in my tests, turning off exclusivity checks had a quite significant positive impact on performance.

Also doesn't match my experience: on ARM, manual loop unrolling had a significant positive impact on performance.

In my tests the speed difference between C code and the same code ported to Swift was typically around 10-30%. How would you explain a 7x slowdown? Something in your analysis doesn't check out.

Do they calculate partial checksums in 8 threads and then combine the results?

I don't know how to explain the 7x slowdown; that's why I posted this thread. For instance, in Taylor's implementation, not using the wrapping operators dropped performance by quite a bit, while using the wrapping operators in my other implementations had no effect at all. I just tested again: turning off exclusivity had no effect, and disabling the safety checks actually regresses performance for the C-port version but leaves Taylor's version alone (weird).

Feel free to rerun my tests or take a look at the libz implementation; I linked it in the first post. I've pushed the latest implementations.

Generally I'd advise using a less optimised version of an implementation as a starting point to port to Swift. For example, that version is better than what you have in the git repo, but I'd still choose something simpler, even the slow reference implementation (and once I got that working I'd add various optimisations on top, like loop unrolling, etc.).

Note that the linked implementation gives a different result compared to a reference implementation (e.g. the one found on Wikipedia); it might be related to moving the truncation of the single and double variables out of the inner loop †. Perhaps, to correctly move that modulo out of the loop, the intermediate results should be kept in UInt64 (so you can tolerate adding without overflow for quite a while), not UInt32 as in the code.†

† Edit: ah, no. The difference is caused by bigEndian. It looks strange that it's used there, but I'll leave that up to you.

Edit2: switching to UInt64 might still be worth it: instead of doing the work in 5552-byte chunks you'd be able to do it in chunks of 5552 * 2^32 bytes, which is about 2^44 bytes or 21TB if I am not mistaken††, so you may not even need the inner loop anymore, as Data will never be that large. That's assuming 64-bit arithmetic (e.g. division/remainder) is as fast as 32-bit arithmetic; worth checking beforehand whether that's the case.

†† Edit3: correction to Edit2 above. A simple test shows that switching to UInt64 would give chunk sizes of about 256MB (plus or minus). Data can easily be bigger than that, so the inner loop will still be required.
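
A minimal sketch of that UInt64 idea (untested; the 256 MiB block size and the names are my own, chosen conservatively below the point where the 64-bit sum could overflow):

import Foundation

func adler32_wide(_ data: Data) -> UInt32 {
    data.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) -> UInt32 in
        var single: UInt64 = 1
        var double: UInt64 = 0
        // With 64-bit accumulators the modulo only needs to run roughly once
        // per 256 MiB instead of once per 5552 bytes.
        let block = 256 * 1024 * 1024
        var index = 0
        while index < buffer.count {
            let end = min(index + block, buffer.count)
            while index < end {
                single &+= UInt64(buffer[index])
                double &+= single
                index += 1
            }
            single %= 65521
            double %= 65521
        }
        return UInt32(double << 16 | single)
    }
}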

I suspect that the version of adler32 that's in the system library you're comparing against via DataCompression isn't identical to the plain C version, but uses SIMD intrinsics like this version from Chromium.

Here's a Gist with my incredibly naïve port of that C code into Swift, using the SIMD types wherever I could and only falling back to the low-level intrinsics when it wasn't obvious that an operation was supported (this is getting outside my area of expertise). When I run it in your benchmark, I get

Neon:      0.00966799259185791s
Library:   0.010444998741149902s
...
Neon is      0.9256097421792077x slower than library.
Neon is      180.8131342400217x faster than worst.

One thing I left out was that there's a block of code that's supposed to handle any non-16-byte-aligned prefix of the data. When I uncomment that, the codegen goes wild and it brings the performance back down to ~7x the library implementation. I wasn't able to figure out what was going on there, but it didn't seem to matter for the test cases in your benchmark, so :man_shrugging:


Given that the operation is over large sequences of bytes, I wondered if we could vectorize it in some way. I'll need to ship both Intel and ARM versions, so I'll investigate what it looks like to support both. Thanks.


Thanks for looking into it.

I wonder about the "big endian" business. Yes, I can see that "The bytes are stored in network order", but it's not quite obvious what that means in this context. The example on the Wikipedia page calculates the adler32 checksum of the word "Wikipedia" and gets the value 300286872, which I can replicate with their reference implementation without calling "bigEndian" on the result:

func adler32(_ data: Data) -> UInt32 {
    var a: UInt32 = 1, b: UInt32 = 0
    for v in data {
        a = (a + UInt32(v)) % 65521
        b = (b + a) % 65521
    }
    return (b << 16) | a
}

let v = adler32("Wikipedia".data(using: .ascii)!) // 300286872, as per wikipedia
let big = v.bigEndian // 2550392337 - something else when running on little endian computer

Unless I'm misunderstanding something, surely adler32("Wikipedia") should give the same result on big- and little-endian platforms... but that means I should NOT mess with the endianness of the final result when doing the calculation on a little-endian computer.


Some naïve side questions / ideas on how to make this (or similar †) checksum calculation faster:

  • could Apple's Accelerate framework help here to make it faster?
  • would it be reasonable to spawn N threads (where N is the number of CPU cores) to do the checksum calculation in chunks? († okay, maybe not for this particular calculation if it's impossible to work on N chunks independently here, but for some other checksum that does allow working on N chunks independently and then combining the individual chunk results into the final result.)
  • is it reasonable to speed up this calculation using GPU?

Most likely it's the use of the resulting value in the deflate body encoding that requires bigEndian to match network order when encoded into the body itself. I'll look into it more next week as I try more optimizations.
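
If that's the case, a hypothetical sketch of where the conversion would live (illustrative only, not Alamofire's actual encoding code; reuses the adler32(_:) function defined earlier in the thread). The RFC 1950 zlib container stores the Adler-32 value after the deflate body, most significant byte first, so the swap belongs at serialization time rather than inside the checksum function:

import Foundation

let payload = Data("Wikipedia".utf8)
var container = Data()                        // zlib header + deflate blocks would go here
let checksum = adler32(payload)               // plain UInt32, no endian fiddling
// Append the checksum in network (big-endian) byte order, per RFC 1950.
withUnsafeBytes(of: checksum.bigEndian) { container.append(contentsOf: $0) }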

Using Accelerate might be possible. I need to familiarize myself with the vector instructions and the core algorithm anyway, so I'll take a look. For this case parallel or GPU calculations seem unlikely.

Accelerate doesn't provide adler32, but Apple's zlib implementation contains an adler32 that you can use directly. This should look something like:

import zlib
import Foundation

func adler32(_ data: Data) -> UInt32 {
  data.withUnsafeBytes {
    let mut = UnsafeMutableRawPointer(mutating: $0.baseAddress!)
    return UInt32(adler32(1, mut, UInt32($0.count)))
  }
}

Yeah, I was fantasising about some low level magic vectorising operation (pseudocode):

data.vectorises_magically { byte in
    // do something with byte
}

BTW, this also gives 300286872 for "Wikipedia", which firmly suggests there is no need for an explicit "bigEndian" conversion in the code.


Well, I feel pretty dumb. Strangely, the DataCompression library I was comparing to uses Apple's Compression library but somehow doesn't realize zlib is available without dynamic loading. And of course using it directly is the fastest way (barring vectorization). Apple's implementation (at the bottom) is also simpler than the libz version I found, so perhaps I'll do another port to compare.

Thanks everyone. I do wish I could've gained more insight into the performance difference here (feel free to let me know if you find anything), but this is largely solved for my needs.


Minor nitpicking: adler32() takes a const pointer argument for the buffer, so the mutable pointer is not needed:

func adler32(_ data: Data) -> UInt32 {
    data.withUnsafeBytes {
        UInt32(adler32(1, $0.baseAddress, UInt32($0.count)))
    }
}

Answering myself on this one: yes, that's possible, although the speed difference is only observable on large amounts of data (20+ megabytes). In the particular case of adler32 there's a special call that combines partial checksums: adler32_combine.

Quick & dirty proof-of-concept test:
import Foundation
import zlib

func adler32(_ data: Data) -> UInt32 {
    data.withUnsafeBytes {
        UInt32(adler32(1, $0.baseAddress!, UInt32($0.count)))
    }
}

func adler32_async(_ data: Data, threadCount: Int = 10, execute: @escaping (UInt32) -> Void) {
    var threads: [Thread] = []
    var results = [UInt32](repeating: 0, count: threadCount)

    precondition(threadCount > 1)
    precondition(data.count >= threadCount) // TODO
    let chunkLen = data.count / threadCount
    precondition(chunkLen >= 1)
    let lastLen = data.count - (chunkLen * (threadCount - 1))
    var doneCount = 0
    
    for i in 0 ..< threadCount {
        let t = Thread {
            let start = i * chunkLen
            let end = i == threadCount - 1 ? data.count : start + chunkLen
            let v = adler32(data[start ..< end])
            DispatchQueue.main.async {
                results[i] = v
                doneCount += 1
                if doneCount == threadCount {
                    var r = results[0]
                    for i in 1 ..< threadCount {
                        let len = i == threadCount - 1 ? lastLen : chunkLen
                        r = UInt32(adler32_combine(UInt(r), UInt(results[i]), len))
                    }
                    execute(r)
                }
            }
        }
        threads.append(t)
        t.start()
    }
}

func test() {
//    print("initializing data")
    let size = 1024*1024*1024
    let p = malloc(size)!
    let data = Data(bytes: p, count: size)
    
//    print("start test \(size)")
    let start1 = Date()
    let v1 = adler32(data) // cold run
    let elapsed1 = Date().timeIntervalSince(start1)
//    print("adler32 (cold run) \(elapsed1)")
    
    let start2 = Date()
    let v2 = adler32(data) // hot run
    let elapsed2 = Date().timeIntervalSince(start2)
    print("adler32 size: \(size), time: \(elapsed2)")
    
    let start3 = Date()
    adler32_async(data) { result in
        precondition(result == v2)
        let elapsed3 = Date().timeIntervalSince(start3)
        print("adler32_async size: \(size), time: \(elapsed3), ~\(elapsed2/elapsed3) times faster")
        print("test ended")
        exit(0)
    }
    RunLoop.main.run(until: .distantFuture)
}

test()

Got these results when using 8 threads (on a computer with 8 performance cores + 2 efficiency cores):

1000MB, orig: 0.080558, threaded: 0.011235, ~7.17 times faster
100MB,  orig: 0.007674, threaded: 0.001697, ~4.52 times faster
10MB,   orig: 0.000628, threaded: 0.000680, ~0.92 times faster (a bit slower)
1MB,    orig: 6.20e-05, threaded: 0.000578, ~0.107 times faster (9x slower)

You can also split the workload into more chunks than you have cores and use DispatchQueue.concurrentPerform(iterations:execute:). This is especially useful on heterogeneous hardware where you have performance and efficiency cores, or when something else is running on the system and eating up some CPU. The WWDC session "Modernizing Grand Central Dispatch Usage" goes into great detail on how to choose a good iteration count at 4:52.

You should see increased performance on smaller inputs and more stable performance on larger inputs. You will also reuse already-spawned threads from GCD and won't need to start and stop your own just for this parallel computation.
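
For reference, a rough sketch of what that could look like, reusing the adler32(_:) wrapper and the adler32_combine folding from the proof of concept above (the chunkCount default and the names here are my own, untested):

import Foundation
import zlib

// Split the data into more chunks than cores, checksum the chunks with
// concurrentPerform, then fold the partial checksums together in order.
func adler32_concurrent(_ data: Data, chunkCount: Int = 32) -> UInt32 {
    precondition(chunkCount > 0 && data.count >= chunkCount)
    let chunkLen = data.count / chunkCount
    var partials = [UInt32](repeating: 0, count: chunkCount)
    var lengths = [Int](repeating: 0, count: chunkCount)

    partials.withUnsafeMutableBufferPointer { partialsBuf in
        lengths.withUnsafeMutableBufferPointer { lengthsBuf in
            DispatchQueue.concurrentPerform(iterations: chunkCount) { i in
                let start = i * chunkLen
                let end = i == chunkCount - 1 ? data.count : start + chunkLen
                partialsBuf[i] = adler32(data[start ..< end])   // zlib wrapper from above
                lengthsBuf[i] = end - start
            }
        }
    }

    var result = partials[0]
    for i in 1 ..< chunkCount {
        result = UInt32(adler32_combine(UInt(result), UInt(partials[i]), lengths[i]))
    }
    return result
}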