While implementing request body compression for Alamofire, I needed the Adler32 checksum algorithm, since it's part of the deflate encoding. Alternatively, I could have dynamically loaded libz to get its implementation, which is what libraries like DataCompression do, but I felt a native Swift version would be best, depending on the complexity. It was also an interesting challenge. In the end I ported the libz implementation to Swift. Unfortunately, while this implementation is far better than DataCompression's naive implementation (which is 86x slower than libz), it's still 7x slower than libz. Perhaps the community can do better?
Repo: https://github.com/jshier/Adler32 (Swift Adler32 checksum implementations for performance testing).
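For reference, this is a rough sketch of what delegating to libz's own adler32() looks like, assuming the zlib module is importable (it is on Apple platforms; on Linux it would typically need a system-library target). The adler32ViaZlib name is just illustrative:

import Foundation
import zlib

// Illustrative baseline: call libz's adler32() directly for comparison.
func adler32ViaZlib(_ data: Data) -> UInt32 {
    data.withUnsafeBytes { buffer -> UInt32 in
        let seed = adler32(0, nil, 0) // ask libz for the required initial value (1)
        let bytes = buffer.baseAddress?.assumingMemoryBound(to: Bytef.self)
        return UInt32(adler32(seed, bytes, uInt(buffer.count)))
    }
}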
Current fastest implementation:
import Foundation

// Swift port of libz algorithm from https://github.com/madler/zlib/blob/04f42ceca40f73e2978b50e93806c2a18c1281fc/adler32.c#L63
@inline(__always)
func adler32Checksum(of data: Data) -> UInt32 {
    var s1: UInt32 = 1 & 0xffff
    var s2: UInt32 = (1 >> 16) & 0xffff
    let prime: UInt32 = 65521

    data.withUnsafeBytes { pointer in
        if pointer.count == 1, let byte = pointer.first {
            s1 += UInt32(byte)
            s1 %= prime
            s2 += s1
            s2 %= prime
        } else if pointer.count < 16 {
            var position = 0
            while position < pointer.count {
                s1 += UInt32(pointer[position])
                s2 += s1
                position += 1
            }
            s1 %= prime
            s2 %= prime
        } else {
            var remainingCount = pointer.count
            // maxIteration is the largest n such that 255n(n + 1) / 2 + (n + 1)(prime - 1) <= 2^32 - 1, per libz.
            let maxIteration = 5552
            var position = 0
            while remainingCount >= maxIteration {
                remainingCount -= maxIteration
                var loops = maxIteration / 16 // evenly divisible
                repeat {
                    var innerPosition = 0
                    while innerPosition < 16 {
                        s1 += UInt32(pointer[position + innerPosition])
                        s2 += s1
                        innerPosition += 1
                    }
                    position += innerPosition
                    loops -= 1
                } while loops > 0
                // The bound above lets the modulo be deferred to once per maxIteration block, as libz does.
                s1 %= prime
                s2 %= prime
            }
            if remainingCount > 0 {
                while position < pointer.count {
                    s1 += UInt32(pointer[position])
                    s2 += s1
                    remainingCount -= 1
                    position += 1
                }
                s1 %= prime
                s2 %= prime
            }
        }
    }

    return ((s2 << 16) | s1).bigEndian
}
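Since the result is already byte-swapped to big-endian, attaching it to a request body is just a matter of appending its raw bytes. A minimal sketch of that step, with placeholder data and an assumed zlib header (not Alamofire's actual code):

import Foundation

let uncompressedPayload = Data("example body".utf8)
let rawDeflateData = Data() // placeholder; a real implementation would deflate uncompressedPayload here

var body = Data([0x78, 0x9c])                           // a common zlib header (assumed compression settings)
body.append(rawDeflateData)                             // raw deflate blocks
let checksum = adler32Checksum(of: uncompressedPayload) // already big-endian per the function above
withUnsafeBytes(of: checksum) { body.append(contentsOf: $0) }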
As far as I can tell from Instruments, most of the remaining time is spent either in the algorithm's actual arithmetic or in the UnsafeRawBufferPointer's element access. I even ported libz's manually unrolled 16-byte loops, and that just slowed things down. I've also tried various unchecked settings, which either had no effect or actually slowed things down, so I'm not sure where to go from here aside from micro-optimizations.
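For illustration, this is roughly what that unrolled 16-byte step looks like (mirroring libz's DO16 macro); the add16 helper here is just a sketch, not the exact code from the repo, and as noted, it benchmarked slower than the counted loop:

@inline(__always)
func add16(_ bytes: UnsafeRawBufferPointer, _ start: Int, _ s1: inout UInt32, _ s2: inout UInt32) {
    // Sixteen unrolled add steps, one per byte.
    s1 += UInt32(bytes[start + 0]); s2 += s1
    s1 += UInt32(bytes[start + 1]); s2 += s1
    s1 += UInt32(bytes[start + 2]); s2 += s1
    s1 += UInt32(bytes[start + 3]); s2 += s1
    s1 += UInt32(bytes[start + 4]); s2 += s1
    s1 += UInt32(bytes[start + 5]); s2 += s1
    s1 += UInt32(bytes[start + 6]); s2 += s1
    s1 += UInt32(bytes[start + 7]); s2 += s1
    s1 += UInt32(bytes[start + 8]); s2 += s1
    s1 += UInt32(bytes[start + 9]); s2 += s1
    s1 += UInt32(bytes[start + 10]); s2 += s1
    s1 += UInt32(bytes[start + 11]); s2 += s1
    s1 += UInt32(bytes[start + 12]); s2 += s1
    s1 += UInt32(bytes[start + 13]); s2 += s1
    s1 += UInt32(bytes[start + 14]); s2 += s1
    s1 += UInt32(bytes[start + 15]); s2 += s1
}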