Godbolt link.
Let's look at the disassembly for the first function:
output.globalCellIndex(forGridIndex: Swift.UInt8) -> Swift.UInt32:
push rbp
mov rbp, rsp
lea ecx, [rdi + rdi]
mov eax, -131072
sar eax, cl
and eax, 87360
pop rbp
ret
First I'll trim out the stack frame stuff, since that's irrelevant to your question, and I'll annotate what's going on:
// On function entry, i is in the low 8 bits of rdi; the result is returned in eax.
lea ecx, [rdi + rdi] // ecx <-- 2*i
mov eax, -131072 // eax <-- Int32(0b11111111111111100000000000000000)
sar eax, cl // eax &>>= cl (i.e. eax <-- mask)
and eax, 87360 // result <-- eax & 0b10101010101000000
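For reference, here's roughly what that computation corresponds to in Swift, reconstructed from the disassembly (the actual source is behind the Godbolt link, so treat this as an approximation rather than the exact original):

```swift
// Reconstructed from the assembly above: take fifteen ones in the high bits
// (-131072 == 0xFFFE0000 as an Int32 bit pattern), arithmetic-shift right by
// 2*i, then mask with 0b1_0101_0101_0100_0000 (87360).
func globalCellIndex(forGridIndex i: UInt8) -> UInt32 {
    let mask = Int32(bitPattern: 0xFFFE_0000) &>> (2 &* Int32(i))  // lea + sar
    return UInt32(bitPattern: mask) & 87360                        // and
}
```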
The critical path for this algorithm is LEA -> SAR -> AND. We can look up the latency and throughput of these operations for recent architectures in Intel's optimization manual:
| instruction | latency | uOps | notes |
|-------------|---------|------|-------|
| LEA         | 1       | 1    | Timing for a "simple LEA", which this is. |
| SAR         | 2       | 3    | Non-constant shifts are surprisingly expensive on recent x86, for reasons that are outside the scope of this note. |
| AND         | 1       | 1    | |
The MOV instruction adds a single uOp, but it doesn't contribute to the latency chain, and if this function is heavily used and inlined (as I would expect it to be if its performance matters), then the MOV should be hoisted and its (already small) cost amortized. So, as a good approximation, this requires 5 uOps (1 for LEA, 3 for SAR, 1 for AND) and has a latency of 4 cycles (1 + 2 + 1 along the LEA -> SAR -> AND chain).
Now let's look at the other implementation:
// This block of instructions is generating the results tuple.
movaps xmm0, xmmword ptr [rip + .LCPI2_0]
movaps xmmword ptr [rbp - 32], xmm0
movabs rax, 374933465158656
mov qword ptr [rbp - 16], rax
mov dword ptr [rbp - 8], 87360
movzx eax, dil // zero-extend i to fill register
mov eax, dword ptr [rbp + 4*rax - 32] // load results[i]
At first glance, this looks a lot worse (and it's objectively bad codegen; there's no need for the compiler to manufacture results like it's doing--that whole first block of five instructions is entirely superfluous). But, just like above, in a context where the performance matters, this would be inlined and the set-up code would get hoisted, so it does not contribute to the observed performance. That leaves only the MOVZX and MOV instructions. These are both single uOp, so we have two uOps total (vs. 5 for the previous algorithm).
Note that the MOVZX will also be eliminated by the compiler when this function gets inlined, assuming that the compiler can prove that i is already in the range of a UInt8 (likely!).
Latency is harder to talk about here, because the second instruction is a load, and the latency of loads isn't a totally well-defined concept, but assuming this code is heavily used, it will always hit the L1 cache, so "4 cycles" is a reasonable approximation.
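For reference, the table version corresponds to something like the following in Swift. This is a hedged reconstruction: the entries for i = 4...6 are visible in the immediates above (374933465158656 is 87040 and 87296 packed into one 64-bit store, and 87360 is stored directly), while the entries for i = 0...3 live in the .LCPI2_0 constant and are assumed here to be whatever the shift-and-mask version would produce. The original presumably builds the table as a local tuple (hence the compiler materializing it on the stack), but an array conveys the same idea:

```swift
// Sketch of the lookup-table version. Entries for i = 0...3 are inferred by
// evaluating the shift-and-mask form; entries for i = 4...6 match the
// immediates visible in the disassembly. Assumes i is in 0...6, as the
// seven-entry table implies.
let results: [UInt32] = [0, 65536, 81920, 86016, 87040, 87296, 87360]

func globalCellIndex(forGridIndex i: UInt8) -> UInt32 {
    return results[Int(i)]
}
```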
Thus, compared to the first algorithm, this algorithm:
- uses significantly fewer uOps (1 or 2 vs. 5)
- has approximately the same latency (4-ish cycles vs. 4 cycles)
There's one more difference that's probably the most important: it uses different uOps. The uOps used by the first algorithm all have to execute on the ALU ports (0, 1, 5 and 6), which means they're competing with all the surrounding arithmetic operations for execution resources. The load uOp used by the second algorithm executes on a different group of ports (2 or 3), so it does not contend with surrounding arithmetic operations (though it could contend with surrounding load operations).
In particular, in a tight timing loop, the first will contend with the loop arithmetic itself, while the second will not, which can easily account for a 2x or greater difference in a stand-alone timing harness.
Ultimately, the only way to get conclusive answers to this sort of question is to use the performance counters, but this is a simple enough case that I can be pretty confident in an answer given from first principles like this one.
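If you do want to measure it, on Linux something along these lines would show the cycle and uOp counts directly (./bench is a hypothetical binary that hammers one implementation in a loop, and uops_issued.any is an Intel-specific event name; the per-port dispatch events, whose names vary by microarchitecture, would additionally expose the ALU-port-versus-load-port split described above):

```
perf stat -e cycles,instructions,uops_issued.any ./bench
```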