What is the Float overflow behavior?

I see documentation about integer overflow (it causes a crash, or you can use special operators like &+).

What's the deal with Float? It seems that sometimes you get infinity, and sometimes you get the maximum value, so you can get a + 1 == a. Is this defined and documented somewhere?

let a = Float.greatestFiniteMagnitude
let b = a + 1.0
print("\(a), \(b), \(a == b)")  // shows a and b are the same
let c = a * 1.0000001           
print("\(c), \(a == c)")        // inf, false

I believe that is all defined as part of IEEE 754.

https://en.wikipedia.org/wiki/IEEE_754

The standard defines:
...

  • exception handling: indications of exceptional conditions (such as division by zero, overflow, etc.)

The arithmetic behavior of Float (and any type conforming to the FloatingPoint protocol) is fully defined by the IEEE 754 standard. This binding is stated in the documentation comments on the protocol requirements:

  /// Adds two values and produces their sum, rounded to a
  /// representable value.
  ///
  /// The addition operator (`+`) calculates the sum of its two arguments. For
  /// example:
  ///
  ///     let x = 1.5
  ///     let y = x + 2.25
  ///     // y == 3.75
  ///
  /// The `+` operator implements the addition operation defined by the
  /// [IEEE 754 specification][spec].
  ///
  /// [spec]: http://ieeexplore.ieee.org/servlet/opac?punumber=4610933
  ///
  /// - Parameters:
  ///   - lhs: The first value to add.
  ///   - rhs: The second value to add.
  override static func +(lhs: Self, rhs: Self) -> Self

IEEE 754 basic arithmetic operations (like + and *) are defined to be correctly rounded. You interpret the input values as exact real numbers, compute the result of the operation on the reals, and then round that result to the closest representable floating-point value. There's a special rule at the upper limit of the floating-point number line: for the purposes of rounding, infinity is treated as though it were the next representable value that would exist if the exponent range were unbounded.
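As a rough illustration of that rounding rule at the top of the range, here is a minimal sketch. It assumes the default round-to-nearest, ties-to-even mode; the names `below` and `tie` are only for this example.

```swift
let a = Float.greatestFiniteMagnitude   // 2^128 - 2^104
let ulp = a.ulp                         // spacing between Floats near a: 2^104

// A quarter of an ulp above a: the exact sum is closer to a than to 2^128,
// so it rounds back down to a.
let below = a + ulp / 4

// Exactly half an ulp above a: the exact sum is halfway between a and 2^128.
// The tie goes to the neighbor with the even significand, which is 2^128.
// That value is out of range, so the result is +infinity.
let tie = a + ulp / 2

print(below == a)        // true
print(tie == .infinity)  // true
```

So the smallest addend that pushes `a` to infinity is half an ulp of `a`, not 1.0.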

Let's work through your examples:

let a = Float.greatestFiniteMagnitude
// a has the value 2^128 - 2^104
let b = a + 1.0
// the exact result of a + 1 (as a real number) is:
// 2^128 - 2^104 + 1
// that's not representable as Float, so we have to round it.
// The closest finite value is a itself, and for the purposes of
// rounding, we treat the next larger value, infinity, as though
// it were 2^128. a + 1 is much closer to a than it is to this
// value, so the result is a.
let c = a * 1.0000001
// First, 1.0000001 is actually 1.00000011920928955078125,
// or 1 + 2^(-23), since that's the closest representable value.
// When we multiply that by a in exact real arithmetic, we get:
//
// (2^128 - 2^104)(1 + 2^(-23)) = 2^128 + 2^105 - 2^104 - 2^81
//                              = 2^128 + 2^104 - 2^81
//
// Remember that, for the purposes of detecting overflow, we
// pretend that infinity is `2^128` (what would otherwise be the
// next representable value after `.greatestFiniteMagnitude`).
// Since this value is bigger than that (and hence definitely
// bigger than the halfway point between a and that value), the
// computation overflows and returns +infinity.
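
If you want to check the arithmetic in the multiplication example, one way (a sketch, not the only way) is to redo it in Double. Both operands have 24-bit significands, so their product fits in Double's 53 bits and is computed exactly. The names `m`, `exact` and `twoTo128` are just for this example.

```swift
let a = Float.greatestFiniteMagnitude
let m: Float = 1.0000001

// The literal rounds to the Float just above 1, which is 1 + 2^(-23).
print(m == 1 + Float.ulpOfOne)      // true

// 24-bit x 24-bit significands need at most 48 bits, so this product is exact.
let exact = Double(a) * Double(m)   // 2^128 + 2^104 - 2^81
let twoTo128: Double = 0x1p128

// The exact result lies above 2^128, so rounding it to Float overflows.
print(exact > twoTo128)             // true
print(Float(exact) == .infinity)    // true
print(a * m == .infinity)           // true
```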

If you write it in hex,

  • greatestFiniteMagnitude would be 0x1.fffffe * 2^127 or 0x1.fffffep127
  • 1.0 would be 0x1p0

For most additions, you can

  1. Shift the radix point of the smaller one so the exponents match.
2^127 * 1.fffffe
2^127 * 0.00000000000000000000000000000002   (that is 2^-127: hex digit 2 in position 32)
  2. Round each number (for Float it is 24 bits, including the leading 1)
2^127 * 1.fffffe
2^127 * 0.000000
  3. Add them together,
2^127 * 1.fffffe
  4. Shift the radix point so that the leading hex digit is 1, and round the result to 24-bit precision. (It already is, so there is nothing to do here.)
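
Swift accepts hexadecimal floating-point literals, so the hex forms above can be checked directly. This is only a small sketch; the names `a` and `one` are mine.

```swift
let a: Float = 0x1.fffffep127
let one: Float = 0x1p0

print(a == .greatestFiniteMagnitude)  // true
print(one == 1.0)                     // true

// The two parts of the hex form: exponent 127 and significand 1.fffffe.
print(a.exponent == 127)              // true
print(a.significand == 0x1.fffffep0)  // true

// 1.0 is far below the last significand bit of a, so the sum rounds back to a.
print(a + one == a)                   // true
```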

Edit: Correction as discussed below

radix point.


Do you include the leading 1? If so, wouldn't the intermediate result have an extra bit?

No, 24 bits total. 23 explicit, one implicit.
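
A quick way to see the 23 + 1 split from Swift, as a sketch (the name `big` is just for this example, and the last check assumes the default round-to-nearest, ties-to-even mode):

```swift
// Only the 23 explicit fraction bits are stored.
print(Float.significandBitCount)                    // 23

// greatestFiniteMagnitude has all 23 stored bits set; the leading 1 is implicit.
let a = Float.greatestFiniteMagnitude
print(String(a.significandBitPattern, radix: 16))   // 7fffff

// With 24 bits of precision, every integer up to 2^24 is exact,
// but 2^24 + 1 would need 25 bits, so it rounds back to 2^24.
let big: Float = 16_777_216   // 2^24
print(big - 1 == 16_777_215)  // true
print(big + 1 == big)         // true
```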
