This is really an LLVM / Rosetta question rather than a Swift question, so it would belong better on an LLVM list or the Apple developer forums, but it is somewhat surprising behavior. There are a few plausible avenues for further investigation, but the behavior is pretty subtle; I'll have to discuss it with some folks to get a clearer understanding of what's going on here.

I should note, however, that on both x86_64 and arm64 you can make this computation significantly faster by using a 64b full multiply instead of a 32b x 32b -> 64b product. I'll write up an example of that for you later.

It appears that this is bumping up against a weird corner case for accumulator forwarding in the madd instruction (x86_64 doesn't have an integer multiply-add instruction, so the Rosetta instruction stream contains separate multiply and add operations, which avoids the hazard). I'll follow up with the LLVM folks to see if we can find a way to avoid the bad case in native codegen.

Happily, you can avoid the problem entirely and go quite a bit faster on both x86_64 and arm64 by simply using a full-width 64b multiply instead of a 32x32 -> 64 product, like so:
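The exact snippet isn't shown here, but based on the `multiplyShort` call in the allocation example below, the inner loop would look something like this (my sketch; `unsigned __int128` is a Clang/GCC extension used to get the 64x64 -> 128 product):

```c
#include <stdint.h>
#include <stddef.h>

// Multiply the `active`-limb little-endian number in `product` by `n`,
// returning the new limb count. Assumes `product` has room for one more limb.
static size_t multiplyShort(uint64_t *product, size_t active, uint64_t n) {
    uint64_t carry = 0;
    for (size_t i = 0; i < active; i++) {
        // 64x64 -> 128 full multiply; adding the carry cannot overflow
        // 128 bits, because (2^64-1)^2 + (2^64-1) < 2^128.
        unsigned __int128 p = (unsigned __int128)product[i] * n + carry;
        product[i] = (uint64_t)p;
        carry = (uint64_t)(p >> 64);
    }
    if (carry != 0) { product[active++] = carry; }
    return active;
}
```

The compiler lowers the full multiply to a single `mul`/`umulh` pair on arm64 (or `mulx` on x86_64), so each limb costs one wide multiply instead of two narrow ones.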

With this change, the computation is about 3x as fast as the previous version when compiled to native code.

I would be remiss not to note that you can avoid reallocating entirely, because Stirling's approximation tells us up front how much memory n! requires (log2(n!) is a bit less than n·log2(n), so n·⌈log2(n)⌉ bits always suffice). A very naive application of this further reduces the runtime by another 25% or so by getting rid of all the reallocations; something like this:

```c
#include <stdint.h>
#include <stdlib.h>

static int ilog2(uint64_t n) {
    return 63 - __builtin_clzll(n);
}

static void faculty1(uint64_t fakultät) {
    // Use Stirling's approximation to figure out how much space we have to
    // allocate; this is an over-estimate, we should really include the linear
    // term as well to make it tighter ...
    uint64_t *result = calloc(8, (fakultät * ilog2(fakultät) + 63) / 64);
    result[0] = 1;
    size_t active = 1;
    for (uint64_t ii = 2; ii <= fakultät; ii++) {
        active = multiplyShort(result, active, ii);
    }
    // result[0..<active] contains the factorial.
    free(result);
}
```

Obviously it's possible to go quite a bit faster still, but this is 4x faster than where we started without any drastic changes to the algorithm. (I have only tested this very hastily, so do your own evaluation before using it, etc., etc.)

... and just to tie this back to Swift, here's a very direct translation into Swift:

```swift
func mul1(_ product: inout [UInt], _ n: UInt) {
    var carry: UInt = 0
    for i in product.indices {
        let (h, l) = n.multipliedFullWidth(by: product[i])
        let (r, c) = carry.addingReportingOverflow(l)
        carry = h &+ (c ? 1 : 0)
        product[i] = r
    }
    if carry != 0 { product.append(carry) }
}

func ceilLog2(_ n: Int) -> Int {
    precondition(n > 0)
    return Int.bitWidth - n.leadingZeroBitCount
}

func factorial(_ n: Int) {
    var result = [UInt]()
    result.reserveCapacity((n * ceilLog2(n) + UInt.bitWidth - 1) / UInt.bitWidth)
    result.append(1)
    for i in 2 ... UInt(n) {
        mul1(&result, i)
    }
    print(result.count)
}
```

I haven't tried to optimize this at all, so it's about 20% slower than the C implementation above on my M1 MBP, but it ought to be pretty easy to claw that back (and it's still much, much faster than we started with).