Hello,
I'm wondering whether I should expect division by 2 on a BinaryFloatingPoint number (e.g. a Double) to be faster than division by non-power-of-2 divisors (as long as the result is not a subnormal number)?
From this Wikipedia paragraph I'd guess so, but it doesn't seem to be the case from my measurements.
When dividing a Double by the constant 3, the compiler emits a divsd (division) instruction.
When dividing a Double by the constant 2, it emits a mulsd (multiplication) instruction.
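You can reproduce this with a pair of tiny functions like the following (function names are mine, just for illustration) and dumping the optimized assembly:

```swift
// Compile with optimizations and emit assembly, e.g.:
//   swiftc -O -emit-assembly divide.swift
// On x86-64 I'd expect `divsd` in divideBy3 and `mulsd` in divideBy2.
func divideBy3(_ x: Double) -> Double {
    return x / 3  // 1/3 has no exact Double representation, so a real division is emitted
}

func divideBy2(_ x: Double) -> Double {
    return x / 2  // 0.5 is exact, so this can become a multiply by 0.5
}

print(divideBy2(7.0))  // 3.5
print(divideBy3(9.0))  // 3.0
```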
Be aware that this optimization only happens if the divisor is a constant, and you compile with optimizations enabled!
Although I'd expect the power-of-2 case to get a more specialized (and faster) optimization than a multiplication by an arbitrary factor.
If I replace the / 3 in your example with a multiplication by any non-power-of-2 number, I get the same assembly as for the / 2.
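The reason the compiler only does this for powers of 2: rewriting x / c as x * (1/c) is only a correct transformation when 1/c is exactly representable, which in binary floating point means c must be a power of two. A quick check (my own illustration):

```swift
// x / 2 and x * 0.5 always round identically, because 0.5 is exact:
// both are the correctly rounded value of the same real number.
for i in 1...1000 {
    let x = Double(i)
    assert(x / 2 == x * 0.5)
}

// 1/3 is NOT exactly representable, so multiplying by the rounded
// reciprocal gives a different result than dividing for many inputs;
// the compiler therefore can't substitute the multiply.
let oneThird = 1.0 / 3.0
var mismatches = 0
for i in 1...1000 {
    let x = Double(i)
    if x / 3 != x * oneThird { mismatches += 1 }
}
print(mismatches > 0)  // true: the two operations disagree for some inputs
```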
There isn't any replacement operation that would be "faster" than multiplication. You could subtract one from the exponent using an integer operation, but that doesn't handle zero, infinity, NaN, or subnormal results correctly, so you'd have to fix those up; and as soon as you need more than one instruction, a floating-point multiply is significantly faster.
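To make that concrete, here's a sketch of the integer trick (my own illustration, not something any compiler actually emits for Double division): decrementing the biased exponent field halves a positive, normal, finite value, but the edge cases come out wrong, which is exactly the fix-up problem described above.

```swift
// Halve a Double by decrementing its biased exponent field, which
// starts at bit 52 of the IEEE 754 binary64 encoding.
// Only valid for positive, normal, finite inputs.
func unsafeHalve(_ x: Double) -> Double {
    // `&-` wraps instead of trapping, mirroring what a raw
    // integer subtraction in hardware would do.
    return Double(bitPattern: x.bitPattern &- (1 << 52))
}

print(unsafeHalve(3.0))  // 1.5 — fine for a normal number
print(unsafeHalve(0.0))  // -inf — the subtraction wraps into the sign and exponent bits
```

So a correct version needs branches or masking around the bit trick, at which point the single mulsd wins easily.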
Floating-point addition, subtraction, multiplication, and fused multiply-add are all among the fastest operations on modern cores. In particular, they are fully pipelined, meaning that one or more of them can begin every single cycle, and they have a latency of just 3-5 cycles on typical hardware designs (the world seems to be settling on a uniform 4 cycles for them, though some 3 cycle adders are around).
Floating-point division and square root are somewhat slower, but not a lot slower; on recent Intel and Apple cores, one of these operations can begin every 2 or 3 cycles, and the total latency is in the neighborhood of ten cycles. So more expensive than other operations, but cheap enough that avoiding them isn't usually worth the effort if it means using more instructions.
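The pipelining point can be seen in a rough micro-benchmark (a sketch, not cycle-accurate; timings will vary by core): a dependent chain of divisions pays the full latency of each one, while independent divisions can overlap, so the second loop below should run noticeably faster despite doing the same number of divisions.

```swift
import Foundation

let n = 1_000_000

// Dependent chain: each division needs the previous result,
// so the loop is serialized on the divider's latency.
var chained = 1e300
let t0 = Date()
for _ in 0..<n { chained /= 1.000000001 }
let chainedTime = Date().timeIntervalSince(t0)

// Four independent chains: the hardware can overlap them,
// so throughput rather than latency dominates.
var a = 1e300, b = 1e300, c = 1e300, d = 1e300
let t1 = Date()
for _ in 0..<(n / 4) {
    a /= 1.000000001; b /= 1.000000001
    c /= 1.000000001; d /= 1.000000001
}
let independentTime = Date().timeIntervalSince(t1)

print("dependent:   \(chainedTime)s")
print("independent: \(independentTime)s")
```

Compile with `-O`, and don't read too much into absolute numbers; the relative gap is the interesting part.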