Munchausen benchmark

I saw this simple integer benchmark on lobste.rs and just submitted a Swift version. If you see any way to speed it up, let me know. Purely aesthetically, I didn't like that I had to do the two numeric conversions and that I couldn't initialize the cached array at compile time, as in Zig, though maybe that is possible now with a macro? Clean-up and speed-up patches welcome.


Which platform did you test on? On M3 Max I get basically the same result as C:

hassila@min ~/S/swift (master)> time ./main
0
1
3435
438579088

________________________________________________________
Executed in    1.35 secs    fish           external
   usr time    1.33 secs    0.24 millis    1.33 secs
   sys time    0.01 secs    2.29 millis    0.00 secs
hassila@min ~/S/c (master)> time ./main
0
1
3435
438579088

________________________________________________________
Executed in    1.32 secs    fish           external
   usr time    1.30 secs    0.24 millis    1.30 secs
   sys time    0.01 secs    1.72 millis    0.01 secs

hassila@min ~/S/c (master)> 

Linux x86_64 and Android AArch64, using the native OSS toolchain for each.

The most obvious overhead is bounds checking on the Array subscripting. Removing those checks (via withUnsafeBufferPointer) improves performance by about 12% on an M2 MacBook Air, i.e. it gives you the same performance with -O as your version gets with -Ounchecked.

That makes it pretty much exactly as fast as the C version.

let MAX: UInt = 440_000_000

var cache: [UInt] = [0, 1, 4, 27, 256, 3_125, 46_656, 823_543, 16_777_216, 387_420_489]

cache.withUnsafeBufferPointer { fastCache in
    func is_munchausen(number: UInt) -> Bool {
        var n = number
        var total: UInt = 0

        while n > 0 {
            total += fastCache[Int(n % 10)]

            if total > number {
                return false
            }

            n /= 10;
        }

        return total == number
    }

    for k in 0..<MAX where is_munchausen(number: k) {
        print(k)
    }
}

There's no benefit to using 32-bit arithmetic instead of native word width (exactly the same instruction latency and throughput on Apple Silicon), so I switched to native word width. It does mean the cache is technically twice the size, but it still trivially fits into the lowest-level cache, since it's a fixed 10 entries.

The conversion to Int is also easier then, although really it doesn't cost anything either way (the compiler seems to permit a straight reinterpretation of the value, without any bounds checking - i.e. the same results as if you used Int(bitPattern:) manually).
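For reference, the explicit form of that reinterpretation is just this (a trivial sketch, not code from the benchmark):

let n: UInt = 42
// Same-width conversion: the bit pattern is reused directly,
// so no range check is needed.
let i = Int(bitPattern: n)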

Precalculating the cache ahead of time (hardcoding the values rather than computing them at startup) doesn't make any meaningful difference to performance, but it makes the code simpler.

I tried a few other things, but nothing made a meaningful difference. It's not immediately apparent to me how to optimise the remaining hotspots: the inner-loop comparisons and the cache lookup. That annoying ldr bugs me - the whole cache could fit into unused registers, but I don't know how to make the compiler realise that, nor whether it'd actually be usable that way, as AArch64 doesn't (to my knowledge) have a version of ldr that indexes the register space rather than memory space.


Also, I noticed the readme in the repo mentions the size of the binary - maybe someone can try Embedded Swift to check? :upside_down_face:
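For anyone who wants to try it, something along these lines should work on a recent nightly toolchain (the exact flags are an assumption on my part; Embedded Swift is still experimental and requires whole-module optimisation):

swiftc -Osize -wmo -enable-experimental-feature Embedded main.swift -o main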

Nothing I did to the swiftc arguments could convince it to vectorise this, even though it feels like it should be vectorisable.

So, I took a crude stab at manually vectorising this for NEON, but the result is 25% slower than the scalar version. I suspect that's because it's not utilising NEON well (the generated assembly looks pretty horrific in general, and I couldn't figure out any way to generate a vector gather instruction).

I posit that this'd work substantially better with SVE2, but unless I really missed something, none of Apple's chips support SVE at all. :pensive:

let MAX: UInt32 = 440_000_000

var cache: [UInt32] = [0, 1, 4, 27, 256, 3_125, 46_656, 823_543, 16_777_216, 387_420_489]

cache.withUnsafeBufferPointer { fastCache in
    func is_munchausen(numbers: SIMD4<UInt32>) -> SIMDMask<SIMD4<UInt32>.MaskStorage> {
        var n = numbers
        var totals = SIMD4<UInt32>(repeating: 0)

        while any(n .> 0 .& totals .<= numbers) {
            let remainders = n % 10
            n /= 10;

            totals &+= SIMD4(fastCache[Int(remainders.x)],
                             fastCache[Int(remainders.y)],
                             fastCache[Int(remainders.z)],
                             fastCache[Int(remainders.w)])
        }

        return totals .== numbers
    }

    assert(0 == MAX % 4)

    for k in stride(from: 0, to: MAX, by: 4) {
        let numbers = SIMD4(k, k + 1, k + 2, k + 3)
        let matches = is_munchausen(numbers: numbers)

        for i in matches.indices {
            if matches[i] {
                print(numbers[i])
            }
        }
    }
}

I was really surprised by how much I had to baby the compiler, when using the SIMDn types. It feels like the optimiser takes a holiday as soon as it sees one. e.g. I had to manually rearrange the vectorised conditionals (re. the inner loop) to avoid a hefty performance loss. I suspect part of this is the inherent nature of SIMD (being more sensitive than scalar code, because the instructions are intrinsically heavier) but still, it felt like the compiler wasn't entirely pulling its weight.

P.S. I also made an eight-wide version, but it performed nearly twice as badly. Surprisingly, I can't seem to find any tech specs for Apple's M2 indicating what it supports re. NEON and vector widths and all that (is NEON fixed at 128 bits? It seems like that from glancing at the AArch64 docs from Arm, but that really surprises me).

Yes, NEON is fixed-width 128-bit vectors. SVE is ARM's scalable vector extension. The ARM ARM goes into quite a bit more detail for the curious.

This is not a great operation for SIMD optimization, because integer division and remainder are not implemented in SIMD on any ISA. You (or the compiler) can work around that by replacing them with multiply-high arithmetic sequences, but then you want to do a table lookup, which could almost be a PSHUFB/TBL but doesn't quite fit in a register and requires splatting out lookup indices - and none of that stuff maps into a cross-platform abstraction very nicely, so you really want to write intrinsics (which you can totally do).

This is all totally surmountable, but it's not a nice tidy SIMD demonstration problem.
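For the curious, here's a scalar sketch of that multiply-high replacement for division by 10 (illustrative only - the magic constant is the standard 32-bit unsigned reciprocal, and a vector version would do the same per lane):

// n / 10 via multiply-high: 0xCCCC_CCCD is ceil(2^35 / 10), so
// (n * magic) >> 35 == n / 10 for every 32-bit n.
func div10(_ n: UInt32) -> UInt32 {
    let magic: UInt32 = 0xCCCC_CCCD
    let high = n.multipliedFullWidth(by: magic).high  // (n * magic) >> 32
    return high >> 3                                  // remaining shift: 35 - 32
}

assert(div10(438_579_088) == 43_857_908)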


Ah, I did wonder about that; I noticed a lack of vectorised forms of those in the output.

It seems very weird, then, that Swift's SIMD types have division & modulus operators even for integer elements. It gives the wrong idea.

Try with reserveCapacity?

let divisor = 10
var cache: [UInt32] = [0, 1, 4]
cache.reserveCapacity(divisor)

and then replace magic number 10 with divisor.

I'd be curious if that helps with the bounds checking issue.

EDIT: @hassila LOL - of course! Fixed. I was comparing against the C, which needed them; it's hard when you keep switching. I need to get serious about using SwiftLint.

(And (very) tiny nit, I’d drop the semicolons :-) )

reserveCapacity doesn't change the valid bounds of the array. It's merely a hint regarding the underlying memory allocation. The compiler would still need to be convinced that those extra 7 indices were actually initialised (appended to the array).
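A quick illustration of the distinction (hypothetical snippet, not from the benchmark):

var cache: [UInt32] = [0, 1, 4]
cache.reserveCapacity(10)
print(cache.count)  // still 3 - capacity is only an allocation hint
// cache[5] would trap here: indices 3 through 9 were never initialised.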

How about getting the static result from a function and hoping that COW means that the compiler still has the pointer location for count in its hot little hands when it goes to modulo?

let cache: [UInt32] = setCache()
let divisor = cache.count

func setCache() -> [UInt32] {
    // Removing the hardcoded 0, 1, 4 because no one else does that.
    var cache: [UInt32] = []
    cache.reserveCapacity(10)
    for i in 0...9 {
        var power = 1
        for _ in 0..<i { power *= i }
        // The benchmark uses the 0^0 == 0 convention, so special-case zero.
        cache.append(i == 0 ? 0 : UInt32(power))
    }
    return cache
}

That's along the lines of what the C code does.

And Rust uses an inout parameter. Not sure what that'd get us.

func is_munchausen(number: UInt32, cache: inout [UInt32]) -> Bool

Julia, another top performer, has @inbounds, which would be comparable to using withUnsafeBufferPointer in current Swift, I think.

Two ideas:

  • quotientAndRemainder(dividingBy: 10), which yields both results from a single operation (see the sketch after this list)

  • make the array 256 entries long (padding the tail with some garbage) and form the index as a UInt8 (cast to Int to use as an array subscript) - a number that can't possibly go outside the array bounds, so the bounds checks should disappear
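For the first idea, the inner loop could be rewritten like this (a sketch against the earlier is_munchausen, assuming the same cache array; untested):

func is_munchausen(number: UInt32) -> Bool {
    var n = number
    var total: UInt32 = 0
    while n > 0 {
        // One call produces both the quotient and the remainder.
        let (q, r) = n.quotientAndRemainder(dividingBy: 10)
        total += cache[Int(r)]
        if total > number { return false }
        n = q
    }
    return total == number
}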

I tried that. The compiler is smart enough to produce basically the same result with the code as-is - maybe even identical; I didn't diff the disassembly. Certainly it makes no measurable difference to performance.

For the built-in integer types it's probably always fine, but quotientAndRemainder(dividingBy:) can be quite slow, e.g. for BigInt implementations.

On Linux x86_64, my version is about 10% faster than yours, probably because of the 32-bit usage. If I simply modify my version to remove all use of UInt32, thus defaulting to Int, it slows down another 10% on top of yours.

On Android AArch64, I see no difference between my two versions, but yours is still slower than both. :person_shrugging:

Try them out on x86_64 and let me know what you find.

Added a version using Swift concurrency too for fun, as there was a subdirectory with such implementations:
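(A minimal sketch of the general shape, not the submitted code - the chunk size is my assumption, and MAX and is_munchausen are taken from the earlier posts:)

// Hypothetical: split the search range into chunks and scan each chunk
// in a child task; needs main.swift on a toolchain with top-level await.
let chunk: UInt32 = 10_000_000
let results = await withTaskGroup(of: [UInt32].self) { group in
    for start in stride(from: 0 as UInt32, to: MAX, by: Int(chunk)) {
        group.addTask {
            var found: [UInt32] = []
            for k in start..<min(start + chunk, MAX) where is_munchausen(number: k) {
                found.append(k)
            }
            return found
        }
    }
    // Child tasks complete in arbitrary order, so collect and sort.
    var all: [UInt32] = []
    for await found in group { all.append(contentsOf: found) }
    return all.sorted()
}
for k in results { print(k) }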

Measuring on my laptop, the reference implementation in Swift:

hassila@min ~/S/swift (master)> hyperfine ./main --warmup 2 --runs 10
Benchmark 1: ./main
  Time (mean ± σ):      1.404 s ±  0.005 s    [User: 1.398 s, System: 0.002 s]
  Range (min … max):    1.399 s …  1.414 s    10 runs

Concurrent one:

hassila@min ~/S/_/swift (master)> hyperfine ./main --warmup 2 --runs 10
Benchmark 1: ./main
  Time (mean ± σ):     116.3 ms ±   2.8 ms    [User: 1731.8 ms, System: 4.3 ms]
  Range (min … max):   113.5 ms … 123.4 ms    10 runs

is_munchausen is doing a modulus and a divide. You can use the result from the divide to avoid doing the modulus.

func is_munchausen(number: UInt32) -> Bool {
    var n = number
    var total: UInt32 = 0
    while n > 0 {
        let p = n / 10
        total += cache[Int(n - p * 10)]
        if total > number {
            return false
        }
        n = p
    }

    return total == number
}

Gave me an improvement from 1.79s to 1.43s, although I'm not sure that would be allowed, as the repo owner states:

In the implementations I tried to use the same (simple) algorithm in order to make the comparisons as fair as possible.

That makes the UInt version nearly 10% slower (on an M2), but it does improve the UInt32 version by about 14%.

Results that are comparable, run on the repo's reference machines, were published just now, after @Finagolfin's PR was merged:


  • C with clang and Rust are faster, while C++ is slightly slower, in the single-threaded case

  • Rust is slightly faster in the multi-core case, but not too bad

In terms of binary size Swift beats all three of those...

Not too shabby (albeit a very limited test of course).


I’m curious why the cache initializer went from 10 values to 3?