Floating point addition with result rounded down?

I need a function like this:

extension BinaryFloatingPoint {
    func addingRoundingDown(_ incr: Self) -> Self {
        // ...

(Note: This is not about rounding to integer values, it's about rounding to representable floating point values.)

Here's what I want it to do: Starting at the exact real result, return the first representable floating point value in direction towards negative infinity (unless the exact result is representable).
To be clear, the expected behavior is such that the following should be true:

Float(1).nextDown.addingRoundingDown(1) == Float(2).nextDown

Note that the behavior of the regular (round to nearest tie to even) addition operator is such that the following is true:

Float(1).nextDown + 1 == Float(2)

And this is not what I want.

I only need it to work for self >= 0 && incr >= 0, although it would be nice if it worked for any values. I'd like it to be reasonably fast and work for any BinaryFloatingPoint type though.

What would be the best way to achieve this in Swift today (perhaps by using the regular + and some extra code to compensate for the default rounding, or by bit manipulation, or by using C)?

EDIT: I've read this old thread and the following will work for Float32 and Float64, but not for Float80 for some reason, and only if it's annotated with @inline(never) to prevent static evaluation using standard rounding, and I assume it will be quite slow to change the rounding mode at each call:

extension BinaryFloatingPoint {
    func addingRoundingDown(_ incr: Self) -> Self {
        let savedRoundingMode = fegetround()
        let r = self + incr
        return r

cc @scanon, @Nevin

As you surmise, changing the rounding mode is somewhat expensive, and the C rounding mode isn't precisely modeled by clang/llvm, so there's no guarantee that it will even work all the time.

The best way to do this for isolated operations in both Swift and C is (swift-flavored pseudocode, written in the forums so untested, not carefully optimized, intended only to show the basic algorithm, doesn't attempt to deal with infinity/nan edge cases, though some of them will fall out correctly):

var result = a + b
let (large, small) = (maximumMagnitude(a, b), minimumMagnitude(a, b))
let error = (large - result) + small
// result + error is exactly a + b, if overflow didn't occur (by Sterbenz'
// lemma), and the desired result is either result or result.nextDown.
// We just need to check the sign of error to pick which one.
if error < 0 { result = result.nextDown }

(Note that this algorithm does not work for tiny results if subnormals are not supported on the platform--e.g. many 32b ARM targets, or x86 if someone sets the FZ or DAZ bits in MXCSR).

Can you give a little more context about what the problem you're actually trying to solve is, though? There may be a better way to do this if you can give more specifics.


Just to clarify the above, unless I'm mistaken this should be:

let (large, small) = abs(a) > abs(b) ? (a, b) : (b, a)
let error = (large - result) + small

Ie, large and small should keep the sign of a and b. (I first misinterpreted maximumMagnitude(a, b) to mean "the greatest magnitude of a and b".)

I simply needed it as a tool while investigating some numerical code, and to improve my intuition for this stuff via experimentation / observation.

Yes, that's the definition of .maximumMagnitude.


:man_facepalming: I thought it was pseudocode ...

In fairness, I left out the .

Note that there's active work going into modeling rounding modes properly in Clang and LLVM, so in a year or so you'll probably be able to just call a C function that uses #pragma STDC FENV_ROUND FE_TOWARDZERO. That will probably come with a substantial performance cost, though.

1 Like

It's been "in a year or so" for about 7 years now, so don't hold your breath. It'll land right after the Fortran front end gets upstreamed, I'm sure. =)

More seriously, even once it's available, the method I sketched out is generally a better option, unless you can arrange your code to batch a lot of operations together in a single rounding mode.

1 Like

I've been reviewing a lot of patches for it, it's not just theoretical at this point. But yes, I'm sure there are better alternatives.