Rosetta faster than Native

I have a small program which, surprisingly, runs much faster (by a factor of 1.6) under Rosetta on my MacBook Pro (M1).

Although this is a C program, I was advised to ask here because it might be a compiler problem.

In Debug mode (-O0) Rosetta runs as fast as native.

95 % of the time is spent in multiplyShort, which is just a loop with three lines (see code below).

When I remove any of these lines, Rosetta becomes slower than M1 (as expected).

This seems to indicate to me that the Apple Silicon optimizer somehow fails to handle this rather simple loop.

Rosetta [s]   M1 [s]    Rosetta / M1
2.1           3.4       0.62          complete loop
0.00379       0.00247   1.5           Line 1 removed
0.00658       0.00268   2.5           Line 2 removed
0.0076        0.0039    1.9           Line 3 removed

Here is the complete code:

#include <stdio.h>	//	printf
#include <stdlib.h>	//	malloc
#include <string.h>	//	memcpy
#include <stdint.h>	//	uint32_t, uint64_t
@import Foundation;	//	NSDate

static size_t multiplyShort(uint32_t *res, size_t len, uint32_t n)
{
	//	Instruments → Time Profiler → 95 % of time spent here
	
	uint64_t rest = 0;
	size_t pos;
	for (pos = 0; pos < len; pos++)
	{
		rest += (uint64_t)n * res[pos];				//	Line 1
		res[pos] = (uint32_t)(rest & UINT32_MAX);	//	Line 2
		rest >>= 32;								//	Line 3
	}
	
	if (rest > 0) res[pos++] = (uint32_t)rest;
	return pos;
}

static void faculty1(uint32_t fakultät)
{
	const size_t s32 = sizeof(uint32_t);
	
	size_t len = 1;
	const size_t resLen = len * s32;
	uint32_t *result = malloc(resLen);
	*result = 1;
	
	for (uint32_t ii = fakultät; ii > 1; ii--)
	{
		const size_t resLen = len * s32;
		uint32_t *next = malloc (resLen + s32);
		memcpy (next, result, resLen);
		free(result);
		len = multiplyShort(next, len, ii);
		result = next;
	}

	if ( len < 17 )	//	print result in hex
	{
		printf("%u ! = 0x", fakultät);
		for (size_t ii = len - 1;; ii--)
		{
			printf("%08x ", result[ii]);
			if (ii == 0)
			{
				printf("\n");
				break;
			}
		}
	}
	else			//	print size of result
	{
		printf("%u ! → %lu uint32_t\n", fakultät, len);
	}
	free(result);
}

int main(int argc, const char * argv[]) 
{
	int fac = 100000;
	printf("faculty1(%d) start\n", fac);
	NSDate *date = [NSDate date];
	faculty1(fac);
	NSTimeInterval elaps = -date.timeIntervalSinceNow;
	printf("faculty1(%d) done in %.3g sec\n", fac, elaps);
	return 0;
}

This is really an LLVM / Rosetta question rather than a Swift question, so it would belong better either on an LLVM board or on the Apple developer forums, but it is somewhat surprising behavior. There are a few plausible avenues for further investigation, but the behavior is pretty subtle. I'll have to discuss with some folks to get a clearer understanding of what's going on here.

I should note, however, that on both x86_64 and arm64, you can make this computation significantly faster by using a 64b full multiply instead of a 32b x 32b -> 64b product. I'll write up an example of that for you later.

It appears that this is bumping up against a weird corner case for accumulator forwarding in the madd instruction (x86_64 doesn't have a multiply-add instruction, so the Rosetta instruction stream contains separate multiply and add operations, which avoids the hazard). I'll follow up with the LLVM folks to see if we can find a way to avoid the bad case in native codegen.
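
If you want to poke at this yourself, one purely diagnostic experiment is to split the 32x32 -> 64 product from the carry addition at the source level, which roughly mirrors the separate multiply and add in the translated stream. This is only a sketch (multiplyShortSplit is just a name I'm making up for it), and the optimizer is free to fuse the two statements back into a single multiply-add, so check the generated assembly before drawing any conclusions:

static size_t multiplyShortSplit(uint32_t *res, size_t len, uint32_t n)
{
	//	Diagnostic sketch: same arithmetic as multiplyShort above, but with
	//	the multiply and the carry addition written as separate statements.
	uint64_t rest = 0;
	size_t pos;
	for (pos = 0; pos < len; pos++)
	{
		const uint64_t prod = (uint64_t)n * res[pos];	//	multiply only
		rest += prod;									//	add the carry separately
		res[pos] = (uint32_t)(rest & UINT32_MAX);
		rest >>= 32;
	}
	if (rest > 0) res[pos++] = (uint32_t)rest;
	return pos;
}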

Happily, you can avoid the problem entirely and go quite a bit faster on both x86_64 and arm64 by simply using a full-width 64b multiply instead of a 32x32 -> 64 product, like so:

static size_t multiplyShort(uint64_t *res, size_t len, uint64_t n) {
	// __uint128_t is a clang/gcc extension
	__uint128_t rest = 0;
	size_t pos;
	for (pos = 0; pos < len; pos++) {
		rest += (__uint128_t)n * res[pos];
		res[pos] = rest;
		rest >>= 64;
	}
	if (rest) res[pos++] = rest;
	return pos;
}

With this change, the computation is about 3x as fast as before when compiled to native code.

I would be remiss not to note that you can avoid reallocating entirely, because Stirling's approximation tells us up front how much memory n! requires. A very naive application of this reduces the runtime by another 25% or so by getting rid of all the reallocations, something like this:

static int ilog2(uint64_t n) {
	return 63 - __builtin_clzl(n);
}

static void faculty1(uint64_t fakultät) {
	// Use Stirling's approximation to figure out how much space we have to
	// allocate; this is an over-estimate, we should really include the linear
	// term as well to make it tighter ...
	uint64_t *result = calloc(8, (fakultät * ilog2(fakultät) + 63) / 64);
	result[0] = 1;
	size_t active = 1;
	for (uint64_t ii = 2; ii <= fakultät; ii++) {
		active = multiplyShort(result, active, ii);
	}
	// result[0..<active] contains the factorial.
}

Obviously it's possible to go quite a bit faster still, but this is 4x faster than where we started, without any drastic changes to the algorithm. (I have only tested this very hastily, so do your own evaluation before using it, etc, etc.)
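
In case it helps with that evaluation, here's a minimal, untested sketch of a driver. It assumes the uint64_t multiplyShort and the preallocating faculty1 above are in the same file, and it uses clock_gettime rather than NSDate so it stays plain C:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
	const uint64_t fac = 100000;
	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	faculty1(fac);	//	the preallocating version above
	clock_gettime(CLOCK_MONOTONIC, &t1);
	const double elaps = (double)(t1.tv_sec - t0.tv_sec)
	                   + 1e-9 * (double)(t1.tv_nsec - t0.tv_nsec);
	printf("faculty1(%llu) done in %.3g sec\n", (unsigned long long)fac, elaps);
	return 0;
}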

... and just to tie this back to Swift, here's a very direct translation:

func mul1(_ product: inout [UInt], _ n: UInt) {
	var carry: UInt = 0
	for i in product.indices {
		let (h, l) = n.multipliedFullWidth(by: product[i])
		let (r, c) = carry.addingReportingOverflow(l)
		carry = h &+ (c ? 1 : 0)
		product[i] = r
	}
	if carry != 0 { product.append(carry) }
}

func ceilLog2(_ n: Int) -> Int {
	precondition(n > 0)
	return Int.bitWidth - n.leadingZeroBitCount
}

func factorial(_ n: Int) {
	var result = [UInt]()
	result.reserveCapacity((n*ceilLog2(n) + UInt.bitWidth - 1) / UInt.bitWidth)
	result.append(1)
	for i in 2 ... UInt(n) {
		mul1(&result, i)
	}
	print(result.count)
}

I haven't tried to optimize this at all, so it's about 20% slower than the C implementation above on my M1 MBP, but it ought to be pretty easy to claw that back (and it's still much, much faster than we started with).

First I have to thank you for your insightful explanation and your very helpful suggestions.

I have a very small nitpick with your code: ceilLog2 should really be called floorLog2PlusOne.

I have modified the reserveCapacity computation as follows:

#if false	//	rough estimate
let bits = Int(n) * (floorLog2PlusOne(n) - 1)  
#else		//	less excess capacity
let dn = Double(n)
let expTemp = dn * ( log2(dn) - 1.344698375 ) 
let bits = expTemp <= 1 ? 1 : Int( expTemp.rounded(.up) )
#endif
let capacity = (bits + UInt.bitWidth - 1) / UInt.bitWidth