Okay, I’m happy with the speed now, so I’ve been trying to shorten the generated assembly (as shown on swift.godbolt.org at -O optimization).
Changing the computation of the product of the dice sizes from this:
bound = UInt64(n)
for i in 1 ..< batchSize {
    bound *= UInt64(truncatingIfNeeded: n &- i)
}
to this:
bound = 1
for i in 0 ..< batchSize {
    bound *= UInt64(truncatingIfNeeded: n &- i)
}
reduces the assembly from 396 to 380 lines on aarch64 with Swift 6.3, from 417 to 399 lines on x86-64 with Swift 6.3, and from 363 to 356 lines on x86-64 with Swift nightly, according to Godbolt.
That change has no effect on the benchmark speeds (which makes sense since it’s not on the hot path), so I’ll go ahead and use it.
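For anyone who wants to sanity-check that the two forms compute the same product, here is a minimal standalone sketch. It is not the actual implementation: the function names are made up for illustration, and I've assumed n is a UInt64 and batchSize is an Int, whereas the real code may well be generic. Note that the two forms only agree when batchSize is at least 1, since the range 1 ..< batchSize traps at runtime when batchSize is 0.

```swift
// Hypothetical stand-ins for the two versions of the loop.
// Assumption: `n` is UInt64 and `batchSize` is Int (the real code may differ).

// Original form: seed the product with n, then multiply by n-1, n-2, ...
func boundSeededWithN(n: UInt64, batchSize: Int) -> UInt64 {
    var bound = UInt64(n)
    for i in 1 ..< batchSize {
        bound *= UInt64(truncatingIfNeeded: n &- UInt64(i))
    }
    return bound
}

// New form: seed the product with 1, then multiply by n, n-1, n-2, ...
func boundSeededWithOne(n: UInt64, batchSize: Int) -> UInt64 {
    var bound: UInt64 = 1
    for i in 0 ..< batchSize {
        bound *= UInt64(truncatingIfNeeded: n &- UInt64(i))
    }
    return bound
}

// Both forms should agree for every batchSize >= 1.
for batchSize in 1 ... 5 {
    precondition(
        boundSeededWithN(n: 52, batchSize: batchSize)
            == boundSeededWithOne(n: 52, batchSize: batchSize)
    )
}
```

The small values here are chosen so the product stays far below UInt64.max; the plain *= would trap on overflow in both versions.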
• • •
Strangely, adding this do-nothing line just before the while n > limit loop (immediately after batchSize gets assigned a value):
for i in 0 ..< batchSize { _ = dice[i] }
shrinks the assembly from 380 to 292 lines on aarch64 with Swift 6.3, from 399 to 360 lines on x86-64 with Swift 6.3, and leaves it unchanged at 356 lines on x86-64 with Swift nightly.
However, it also makes the benchmarks significantly slower with xoshiro and the system RNG (but, oddly enough, makes no difference with PCG).
So I will not be using that.
• • •
I think that’s about the limit of my optimization skills, so I’m going to polish up the implementation and add some explanatory comments. Then I’ll be ready to open a PR.
If anyone has further input, please let me know.