Optimizing Swift: Adler32 Checksum

In implementing request body compression for Alamofire, I had to implement the Adler32 checksum algorithm, as it's part of the deflate encoding. Alternatively, I could have dynamically loaded libz to use its implementation, as libraries like DataCompression do, but I felt a native Swift version would be best, depending on the complexity. It was also an interesting challenge. In the end I ported the libz implementation to Swift. Unfortunately, while this implementation is far better than DataCompression's naive implementation (which is 86x slower than libz), it's still 7x slower than libz. Perhaps the community can do better?
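
(For reference: Adler32, as specified in RFC 1950, keeps two running sums modulo 65521, the largest prime below 2^16. s1 starts at 1 and accumulates each byte; s2 starts at 0 and accumulates every intermediate value of s1; the final checksum is (s2 << 16) | s1.)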

Repo: GitHub - jshier/Adler32: Swift Adler32 checksum implementations for performance testing.
Current fastest implementation:

// Swift port of libz algorithm from https://github.com/madler/zlib/blob/04f42ceca40f73e2978b50e93806c2a18c1281fc/adler32.c#L63
@inline(__always)
func adler32Checksum(of data: Data) -> UInt32 {
    var s1: UInt32 = 1 & 0xffff
    var s2: UInt32 = (1 >> 16) & 0xffff
    let prime: UInt32 = 65521

    data.withUnsafeBytes { pointer in
        if pointer.count == 1, let byte = pointer.first {
            s1 += UInt32(byte)
            s1 %= prime
            s2 += s1
            s2 %= prime
        } else if pointer.count < 16 {
            var position = 0
            while position < pointer.count {
                s1 += UInt32(pointer[position])
                s2 += s1
                position += 1
            }

            s1 %= prime
            s2 %= prime
        } else {
            var remainingCount = pointer.count
            // maxIteration is the largest n such that 255n(n + 1) / 2 + (n + 1)(prime - 1) <= 2^32 - 1, per libz.
            let maxIteration = 5552
            var position = 0

            while remainingCount >= maxIteration {
                remainingCount -= maxIteration
                var loops = maxIteration / 16 // evenly divisible

                repeat {
                    var innerPosition = 0
                    while innerPosition < 16 {
                        s1 += UInt32(pointer[position + innerPosition])
                        s2 += s1
                        innerPosition += 1
                    }
                    s1 %= prime
                    s2 %= prime
                    position += innerPosition
                    loops -= 1
                } while loops > 0
            }

            if remainingCount > 0 {
                while position < pointer.count {
                    s1 += UInt32(pointer[position])
                    s2 += s1
                    remainingCount -= 1
                    position += 1
                }

                s1 %= prime
                s2 %= prime
            }
        }
    }

    return ((s2 << 16) | s1).bigEndian
}

Most of the remaining time is spent either in the actual work of the algorithm or in the UnsafeRawBufferPointer's element access, at least as far as I can tell from Instruments. I even ported libz's manually unrolled 16-byte loops and that just slowed things down. Additionally, I've tried various unchecked settings, which either had no effect or actually slowed things down, so I'm not sure where else to go from here aside from micro-optimizations.

Have you tried using the wrapping integer operators like &+? The standard operators like + perform overflow checks by default, which incurs a performance penalty.

Also, there's probably nothing to gain from declaring this function @inline(__always). It is not generic, nor is it so simple that the procedure call overhead dwarfs the execution time spent in the function. Inlining it might even make matters worse if you call it from many places.
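
For illustration, the body of the 16-byte inner loop from the first post would then look something like this (just a sketch; s1, s2, pointer, position, and innerPosition are the variables from that function, and the existing modulo every 5552 bytes keeps the 32-bit sums from ever actually overflowing):

// Sketch: same loop body, but with the checked + / += swapped for the wrapping
// operators, which skip the overflow trap the standard operators emit.
s1 &+= UInt32(pointer[position + innerPosition])
s2 &+= s1
innerPosition &+= 1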

This has similar performance to the attempts I made with the unchecked compiler flags.

Complex is 7.800220188699348x slower than library.
Complex is 12.029237769157454x faster than worst.
Wrapping is 12.587039645803985x slower than library.
Wrapping is 7.454548960042887x faster than worst.

Turning off the inlining for this test seems to make no difference.

What does Instruments say is taking more time in Swift vs C?

Instruments can't see inside libz, so it's hard to say. Instruments also crashes when trying to see the assembly view of the various Swift code at the hottest paths, so I can't really tell what's going on. This is what I can get to on the Swift side:

6.88 s   99.9%	2.07 s	 	                 closure #1 in adler32Checksum(of:)
  937.00 ms   13.6%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    937.00 ms   13.6%	937.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  456.00 ms    6.6%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    456.00 ms    6.6%	456.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  306.00 ms    4.4%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    306.00 ms    4.4%	306.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  301.00 ms    4.3%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    301.00 ms    4.3%	301.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  300.00 ms    4.3%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    300.00 ms    4.3%	300.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  288.00 ms    4.1%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    288.00 ms    4.1%	288.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  279.00 ms    4.0%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    279.00 ms    4.0%	279.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  265.00 ms    3.8%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    265.00 ms    3.8%	265.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  260.00 ms    3.7%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    260.00 ms    3.7%	260.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  251.00 ms    3.6%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    251.00 ms    3.6%	251.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  247.00 ms    3.5%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    247.00 ms    3.5%	247.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  246.00 ms    3.5%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    246.00 ms    3.5%	246.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  238.00 ms    3.4%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    238.00 ms    3.4%	238.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  169.00 ms    2.4%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    169.00 ms    2.4%	169.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  150.00 ms    2.1%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    150.00 ms    2.1%	150.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)
  118.00 ms    1.7%	0 s	 	                  UnsafeRawBufferPointer.subscript.getter
    118.00 ms    1.7%	118.00 ms	 	                   specialized UnsafeRawPointer.load<A>(fromByteOffset:as:)

Anything else to try here?

Oh, one thing to try would be to replace this with a version that loads 4 or 8 bytes at a time using

let value = pointer.load(fromByteOffset: position, as: UInt64.self) // or UInt32.self

since it looks like reading one byte at a time is what's using up a ton of cycles.

Is there an efficient way to do the required math without manual bit shifting to access the relevant byte? I would think a loop would add overhead, but I'll try it and compare.

At the very least, this (1 >> 16) looks like a bug; it may be worth re-checking that you're getting the same results as your reference implementation! [I was wrong and Jon explained below.]

Try compiling your file with -S, e.g. xcrun swiftc -Ounchecked -S adler.swift. I get the output below for the closure passed to withUnsafeBytes, which is where the inner loop is contained. I don't see anything obviously wrong here except for the byte-at-a-time memory accesses that @harlanhaskins already pointed out.

_$s5adler15adler32Checksum2ofs6UInt32V10Foundation4DataV_tFySWXEfU_:
	sub	sp, sp, #16
	cbz	x0, LBB2_20
	sub	x8, x1, x0
	cmp	x8, #1
	b.ne	LBB2_4
	cmp	x1, x0
	b.ne	LBB2_7
	mov	w9, #0
	cmp	x1, x0
	b.ne	LBB2_8
	b	LBB2_9
LBB2_4:
	cmp	x8, #15
	b.le	LBB2_9
	mov	w9, #5552
	cmp	x8, x9
	b.ge	LBB2_12
	mov	x9, #0
	b	LBB2_19
LBB2_7:
	ldrb	w9, [x0]
	cmp	x1, x0
	b.eq	LBB2_9
LBB2_8:
	ldr	w8, [x2]
	add	w8, w8, w9
	mov	w9, #32881
	movk	w9, #32775, lsl #16
	umull	x9, w8, w9
	lsr	x9, x9, #47
	mov	w10, #65521
	msub	w8, w9, w10, w8
	str	w8, [x2]
	ldr	w9, [x3]
	add	w8, w9, w8
	b	LBB2_21
LBB2_9:
	cmp	x8, #1
	b.lt	LBB2_20
	mov	x9, #0
LBB2_11:
	ldrb	w10, [x0, x9]
	ldr	w11, [x2]
	add	w10, w11, w10
	str	w10, [x2]
	ldr	w11, [x3]
	add	w10, w11, w10
	str	w10, [x3]
	add	x9, x9, #1
	cmp	x9, x8
	b.lt	LBB2_11
	b	LBB2_20
LBB2_12:
	mov	x9, #0
	mov	w10, #32881
	movk	w10, #32775, lsl #16
	mov	w11, #65521
	mov	x12, #-5552
	mov	w13, #11103
	mov	x15, x8
LBB2_13:
	mov	x14, x15
	mov	w15, #348
LBB2_14:
	add	x16, x0, x9
	ldrb	w17, [x16]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #1]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #2]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #3]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #4]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #5]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #6]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #7]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #8]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #9]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #10]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #11]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #12]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #13]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w17, [x16, #14]
	ldr	w1, [x2]
	add	w17, w1, w17
	str	w17, [x2]
	ldr	w1, [x3]
	add	w17, w1, w17
	str	w17, [x3]
	ldrb	w16, [x16, #15]
	ldr	w17, [x2]
	add	w16, w17, w16
	str	w16, [x2]
	ldr	w17, [x3]
	add	w16, w17, w16
	str	w16, [x3]
	ldr	w16, [x2]
	mul	x17, x16, x10
	lsr	x17, x17, #47
	msub	w16, w17, w11, w16
	str	w16, [x2]
	ldr	w16, [x3]
	mul	x17, x16, x10
	lsr	x17, x17, #47
	msub	w16, w17, w11, w16
	str	w16, [x3]
	add	x9, x9, #16
	sub	x15, x15, #1
	cmp	x15, #1
	b.hi	LBB2_14
	add	x15, x14, x12
	cmp	x14, x13
	b.gt	LBB2_13
	mov	w10, #5553
	cmp	x14, x10
	b.ge	LBB2_19
	add	sp, sp, #16
	ret
LBB2_18:
	ldrb	w10, [x0, x9]
	ldr	w11, [x2]
	add	w10, w11, w10
	str	w10, [x2]
	ldr	w11, [x3]
	add	w10, w11, w10
	str	w10, [x3]
	add	x9, x9, #1
LBB2_19:
	cmp	x9, x8
	b.lt	LBB2_18
LBB2_20:
	ldr	w8, [x2]
	mov	w9, #32881
	movk	w9, #32775, lsl #16
	mul	x9, x8, x9
	lsr	x9, x9, #47
	mov	w10, #65521
	msub	w8, w9, w10, w8
	str	w8, [x2]
	ldr	w8, [x3]
LBB2_21:
	mov	w9, #32881
	movk	w9, #32775, lsl #16
	umull	x9, w8, w9
	lsr	x9, x9, #47
	mov	w10, #65521
	msub	w8, w9, w10, w8
	str	w8, [x3]
	add	sp, sp, #16
	ret

That logic is copied from DataCompression's naive Adler32 implementation, which starts with 1 for the start value:

private func slowAdler32(start: UInt32, data: Data) -> UInt32
{
  var s1: UInt32 = start & 0xffff
  var s2: UInt32 = (start >> 16) & 0xffff
  let prime: UInt32 = 65521

  for byte in data {
    s1 += UInt32(byte)
    if s1 >= prime { s1 = s1 % prime }
    s2 += s1
    if s2 >= prime { s2 = s2 % prime }
  }
  return (s2 << 16) | s1
}
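
(Concretely, with start == 1 those two lines evaluate to s1 = 1 & 0xffff = 1 and s2 = (1 >> 16) & 0xffff = 0, which is the standard Adler-32 starting state.)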

I am ensuring I'm matching that library's result, but I haven't tried loading libz directly and comparing against it.

Ah, okay. I was thinking "is that supposed to be 1 << 16??" but if it's a hardcoded constant in a general implementation that makes sense!

Replacing the hot inner loop with this is actually slower than the while loop. Even manually doing the work without the for loops is slower.

let firstEight = pointer.load(fromByteOffset: position, as: UInt64.self)
position += 8
let secondEight = pointer.load(fromByteOffset: position, as: UInt64.self)
position += 8

for shift in stride(from: 0, through: 56, by: 8) {
    s1 += UInt32((firstEight >> shift) & 0xFF)
    s2 += s1
}

for shift in stride(from: 0, through: 56, by: 8) {
    s1 += UInt32((secondEight >> shift) & 0xFF)
    s2 += s1
}

here is the implementation swift-png uses:

    s1 += UInt32((firstEight >> shift) & 0xFF)

This expression calls the following initializer:

  @_semantics("optimize.sil.specialize.generic.partial.never")
  @inlinable // FIXME(inline-always)
  @inline(__always)
  public init<T: BinaryInteger>(_ source: T) {
    // This check is potentially removable by the optimizer
    if T.isSigned {
      _precondition(source >= (0 as T), "Negative value is not representable")
    }
    // This check is potentially removable by the optimizer
    if source.bitWidth >= Self.bitWidth {
      _precondition(source <= Self.max,
        "Not enough bits to represent the passed value")
    }
    self.init(truncatingIfNeeded: source)
  }

Those “potentially”s worry me. Try calling init(truncatingIfNeeded:) directly instead.
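
Applied to the 8-byte-load snippet above, that suggestion would look roughly like this (a sketch only; the masked value is already in 0...255, so truncation can't lose information here):

// Sketch: call the truncating initializer directly instead of the range-checked
// BinaryInteger initializer that UInt32(_:) resolves to.
s1 += UInt32(truncatingIfNeeded: (firstEight >> shift) & 0xFF)
s2 += s1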

I assume you are using reasonably sized Data chunks, not 10- or 100-byte blocks but at least a few KB, otherwise the overhead would be significant.

Some suggestions for you:

  • check with "-enforce-exclusivity=none"
  • check on Intel. Last time I checked, loops are not automatically unrolled under ARM.
  • unroll loops manually to see if there's a difference (see the sketch after this list).
  • double check that the speed of that C code is "on par" with the libz implementation (if not, you may need a better starting point for your Swift port).
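
To make the unrolling suggestion concrete, here's one way the 16-byte inner block from the first post could be partially unrolled by hand (a sketch only, reusing the s1, s2, pointer, position, and prime variables from that function; whether it helps is exactly what's worth measuring):

// Sketch: process four bytes per iteration instead of one, with wrapping adds,
// so the loop bookkeeping runs a quarter as often.
var innerPosition = 0
while innerPosition < 16 {
    s1 &+= UInt32(pointer[position + innerPosition]);     s2 &+= s1
    s1 &+= UInt32(pointer[position + innerPosition + 1]); s2 &+= s1
    s1 &+= UInt32(pointer[position + innerPosition + 2]); s2 &+= s1
    s1 &+= UInt32(pointer[position + innerPosition + 3]); s2 &+= s1
    innerPosition += 4
}
s1 %= prime
s2 %= prime
position += innerPosition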

Nice, this is even faster than the complex version I had ported from C.

func taylor_adler32Checksum(of data: Data) -> UInt32 {
    data.withUnsafeBytes { pointer in
        let (q, r): (Int, Int) = data.count.quotientAndRemainder(dividingBy: 5552)
        var (single, double): (UInt32, UInt32) = (1 & 0xffff, (1 >> 16) & 0xffff)

        for i: Int in 0 ..< q {
            for j: Int in 5552 * i ..< 5552 * (i + 1) {
                single &+= .init(pointer[j])
                double &+= single
            }
            single %= 65521
            double %= 65521
        }

        for j: Int in 5552 * q ..< 5552 * q + r {
            single &+= .init(pointer[j])
            double &+= single
        }

        return ((double % 65521) << 16 | (single % 65521)).bigEndian
    }
}

This implementation is 6.2x slower than libz (through DataCompression), while the previous version ported from C is 7.8x.

Yes, the tests are running on a 16MB block of randomized data. At very small sizes the Swift implementations are actually faster, but there may be other overhead for libz there.

Last I checked there was no difference when turning off exclusivity or bounds checking. In some cases performance was actually worse.

Given that the vast majority of Alamofire use is on ARM, it seems to be the best comparison platform. I also don't have an Intel machine set up with this environment.

I tried copying the manual unrolling from the libz implementation and performance was worse.

The C code is libz. The DataCompression library loads libz with dyld and calls its implementation of adler32.

Not my experience: in my tests, turning off exclusivity checks had a quite significant positive impact on performance.

Also doesn't match my experience: on ARM, manual loop unrolling had a significant positive impact on performance.

In my tests, the speed difference between C code and that same code ported to Swift was typically 10 - 30%. How would you explain a 7x slowdown? Something in your analysis doesn't check out.

Do they calculate partial checksums in 8 threads and then combine the results?

I don't know how to explain the 7x slowdown; that's why I posted this thread. For instance, in Taylor's implementation, not using the wrapping operators dropped performance by quite a bit, while using the wrapping operators in my other implementations had no effect at all. I just tested again: turning off exclusivity had no effect, and disabling the safety checks actually regresses performance for the C-port version but leaves Taylor's version alone (weird).

Feel free to rerun my tests or take a look at the libz implementation, I linked it in the first post. I've pushed the latest implementations.

Generally I'd advise using a less optimised version of an implementation as a starting point for a Swift port. For example, that version is better than what you have in the git repo, but I'd still choose something simpler, even the slow reference implementation (and once I got that working, I'd add various optimisations on top, like loop unrolling, etc.).

Note that the linked implementation gives a different result compared to a reference implementation (e.g. the one found on Wikipedia); it might be related to moving the truncation of the single and double variables out of the inner loop †. Perhaps to correctly move that division out of the loop, the intermediate results should be kept in UInt64 (so you can tolerate adding without an overflow for quite a while), not UInt32 as in the code.†

† Edit: ah, no. The difference is caused by bigEndian. It looks strange that it's used there, but I'll leave that up to you.

Edit2: switching to UInt64 might still be worth it, as instead of doing the work in 5552-byte chunks you'd be able to do it in chunks of 5552 * 2^32 bytes (which is about 2^44, or 21TB, if I am not mistaken††), and you may not even need the inner loop anymore, as Data will never be that large. That's assuming that 64-bit arithmetic (e.g. division / remainder) is as fast as 32-bit arithmetic; worth checking beforehand if that's the case.

†† Edit3 - correction to Edit2 above: a simple test shows that switching to UInt64 would give chunk sizes of about 256MB (plus or minus). Data can easily be bigger than that, so the inner loop will still be required.
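
For anyone who wants to check that bound themselves, here's a small self-contained sketch (the helper is mine, not something from the repo) that recomputes the libz run-length limit, i.e. the largest n with 255n(n + 1) / 2 + (n + 1)(65521 - 1) not exceeding the accumulator's maximum, for a given accumulator width:

// Sketch: find the largest run length n that can be summed before the deferred
// modulo must run, for an accumulator that can hold values up to `limit`.
func maxRunLength(fittingIn limit: UInt64) -> UInt64 {
    // Evaluates 255 * n * (n + 1) / 2 + (n + 1) * 65520, or nil if that math
    // itself overflows UInt64.
    func bound(_ n: UInt64) -> UInt64? {
        let (product, o1) = n.multipliedReportingOverflow(by: n + 1)
        if o1 { return nil }
        let (triangular, o2) = (product / 2).multipliedReportingOverflow(by: 255)
        if o2 { return nil }
        let (tail, o3) = (n + 1).multipliedReportingOverflow(by: 65520)
        if o3 { return nil }
        let (total, o4) = triangular.addingReportingOverflow(tail)
        return o4 ? nil : total
    }

    // Binary search; any feasible n for a 64-bit accumulator is far below 2^32.
    var low: UInt64 = 1
    var high: UInt64 = 1 << 32
    while low < high {
        let mid = (low + high + 1) / 2
        if let value = bound(mid), value <= limit {
            low = mid
        } else {
            high = mid - 1
        }
    }
    return low
}

// maxRunLength(fittingIn: UInt64(UInt32.max)) returns 5552, matching the constant used above.
// maxRunLength(fittingIn: UInt64.max) comes out in the hundreds of millions of bytes, the same
// ballpark as the ~256MB estimate above, so with 64-bit accumulators the modulo would only
// need to run every few hundred MB.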