RFC: On compile-time diagnosis of erroneous floating point operations

The objective of this discussion is to identify erroneous (or problematic) floating point operations that are worth flagging as compile-time errors or warnings. The following are some examples that could be detected at compile time.

  1. Overflows due to use of large floating point literals e.g.
    let x: Float32 = 1e39 // x has a value of +inf.
    let x: Float80 = 1e5000 // x is +inf (exceeds Float80's finite range)

  2. Underflows due to use of small floating point literals:
    let x: Float32 = 1e-46 // x is 0 here

  3. Overflows and underflows that happen during arithmetic operations on constants (or variables whose values are known at compile time)
    let x: Float32 = 1e38
    let y: Float32 = 10
    x * y // will reduce to inf

  4. Operations over constants (or variables) that are guaranteed to yield Infinity or NaN e.g. 3.0 / 0.0, 0.0 / 0.0 etc.
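For reference, none of these expressions is rejected today; under IEEE 754 default exception handling they quietly produce infinity or NaN. A minimal sketch:

```swift
// Constant expressions guaranteed to produce inf/NaN at runtime.
// Swift currently compiles these without any diagnostic.
let inf = 3.0 / 0.0   // +infinity (divideByZero)
let nan = 0.0 / 0.0   // NaN (invalid operation)

print(inf.isInfinite) // true
print(nan.isNaN)      // true
```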

These scenarios are probably signs of something broken in the code under compilation, but could sometimes be intentional or unavoidable. My question is which of these cases would warrant a compile-time warning or error.

It appears to me that overflows and underflows during literal assignment (cases 1 and 2) could reasonably be flagged as compile-time errors (as there are better ways to achieve the same result if desired). I wonder whether case (1) is even undefined behavior?

I am not sure about the remaining cases. Should they be compile-time warnings or should they be ignored? (Note that the focus here is only on operations that are performed on values known at compile time.)

I would appreciate any thoughts on these. Feel free to share other kinds of floating point errors that are worth catching at compile time.



3 and 4 are very difficult to do well, because they are necessarily fragile warnings / errors. They will come and go as code is rearranged; seemingly innocuous changes will cause them to appear and disappear.

These are also not errors per se in the IEEE 754 model. They are exceptional conditions, but the whole point of inf and nan is to allow computation to continue under such conditions, and actual errors to be detected by the programmer when applicable. We could warn for these, but treating them as logic errors is not really a sound interpretation of IEEE 754 semantics.
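To illustrate that intended workflow with a sketch: NaN propagates through a computation and the programmer checks the final result once, rather than each intermediate step being treated as an error:

```swift
// NaN quietly propagates through arithmetic; the caller inspects the
// final result once, instead of every intermediate operation failing.
let samples: [Double] = [1.0, .nan, 3.0]
let total = samples.reduce(0, +)   // NaN contaminates the sum

if total.isNaN {
    print("a sample was invalid")  // detection happens at the end
}
```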

1 and 2, sure you can warn for those.


One thing you have to decide up front is whether you are using IEEE-754. AFAIK Swift doesn’t specify this (there’s no mention of it in The Swift Programming Language book, Swift 4.1) and neither does LLVM. This is something I find peculiar (especially as LLVM optimises floating-point as if it were IEEE-754, modulo bugs). However, AFAIK all targets of interest for Swift use IEEE-754, so assuming it is probably fine…

Cases 1 and 2 seem worthy of a warning as it is very unlikely that a programmer would want a literal to be drastically different from what they wrote. If they want particular values very close to zero or infinity we could perhaps recommend hex float notation?
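For example (a sketch): hex float literals let the programmer spell out the exact bits of extreme values instead of hoping a decimal literal rounds the right way:

```swift
// Hex float literals pin down extreme values exactly.
let tiny: Float = 0x1p-149          // smallest positive subnormal Float32
let huge: Float = 0x1.fffffep+127   // largest finite Float32

print(tiny == .leastNonzeroMagnitude)    // true
print(huge == .greatestFiniteMagnitude)  // true
```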

Case 4 could potentially be a warning. I’m not sure about case 3; as Steve mentions, this might be unstable. It might also be wrong, given that your analysis will probably be at the SIL level but LLVM might optimise differently.

Seeing as you’re looking at this, it is probably worth pointing out that underflow is a little bit subtle.

Intuitively you might think that underflow is when the exact absolute value of a computation (i.e. as if the operands to the operation and the output were reals) lies between zero and the smallest representable floating-point number (the smallest subnormal number).

However, this is not the definition IEEE-754 uses. In fact it has two different definitions, and implementations have to pick one (sigh…). The conditions for the underflow exception are detailed in §7.5 (Underflow) of IEEE-754 2008. The definitions are (assuming the base is 2) roughly as follows:

  1. after rounding - when a non-zero result computed as though the exponent range were unbounded would lie strictly between +/- 2^{e_{min}} (the smallest positive normal / largest negative normal floating-point number).

  2. before rounding - when a non-zero result computed as though the exponent range and the precision were unbounded would lie strictly between +/- 2^{e_{min}} (the smallest positive normal / largest negative normal floating-point number).

The standard also notes that under “default exception handling” if the result is exact the underflow exception doesn’t actually get flagged.

So for underflow you would need to decide which definition you want to use.
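In Swift terms (a sketch), the threshold 2^{e_{min}} in both definitions is `Float.leastNormalMagnitude`; any non-zero result strictly below it in magnitude is tiny:

```swift
// The tininess threshold 2^{e_min} from the two definitions.
let threshold = Float.leastNormalMagnitude   // 0x1p-126 for Float32
let tiny = threshold / 2                     // 0x1p-127: below the threshold

print(threshold == 0x1p-126)  // true
print(tiny.isSubnormal)       // true
```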

The header doc on the FloatingPoint functions and types in the stdlib references IEEE 754 something like 70 times (for that matter, the llvm language reference says “The binary format of half, float, double, and fp128 correspond to the IEEE-754-2008 specifications for binary16, binary32, binary64, and binary128 respectively”). I’m not sure how much more explicit we can be =)

To be really pedantic, you need to distinguish between the underflow exception, which is raised when a result is tiny, and the underflow flag, which is raised when a result is both tiny and inexact, under default exception handling. The two criteria you list are for detection of tininess.

No new implementation should choose to detect tininess after rounding. It’s what x86 hardware does for complex legacy reasons (short version: x87 excess precision), but no new implementation should use it. The standard should make this really explicit, but it doesn’t, sorry. You’re supposed to magically infer it from the fact that decimal formats only allow detecting tininess before rounding.


My bad. I only looked at Swift programming language book.

Last time I read the LLVM language reference, that wasn’t there :man_shrugging:. It looks like that landed sometime in March 2018. I’m glad to see that the data format is now properly specified, but operations on the data are still underspecified. For example fadd makes no mention of whether or not it has IEEE-754 semantics.

I like really pedantic :grin:. I’m really curious about under what circumstances tininess is detected after rounding. Does that apply to x86_64 too? I’m also confused by your statement (x86 detects tininess after rounding) because x86 formats aren’t decimal. I’m afraid that might derail this thread somewhat though.

Anyway. It sounds like for Ravi’s work (should he decide to warn on underflow) he should assume that:

  • tininess is detected before rounding, and
  • warnings are emitted only when the result is both tiny and inexact.

Does that apply to x86_64 too?

Yes, x86_64 floating-point is identical to x86.

I’m also confused by your statements (x86 detects tininess after rounding) because x86 formats aren’t decimal.

Apologies for the typo. Decimal is always before rounding: “For decimal formats, tininess is detected before rounding” (7.5). Also, the total population of people who really understand the distinction between “before” and “after rounding”* is about 20, which leaves out most of the committee.

[*] the key is to realize that “after rounding” is after rounding the significand to full-width without consideration of the exponent, and not after rounding to the actual result format.

LLVM definitely uses IEEE-754 for its floating point types, other than the PPC double double format.


LLVM definitely uses IEEE-754 for its floating point types, other than the PPC double double format.

Although the LLVM langref now makes the data types IEEE-754 binary formats, the floating-point instructions themselves are still very underspecified. For example, fadd doesn’t mention what the semantics of the addition are (e.g. IEEE-754 addition). I suspect the reasons for that are the PPC double-double format and that it would be very unusual to use an IEEE-754 binary encoding but then not perform IEEE-754 operations on that data. However, I think LLVM really should be more explicit.

This is probably a bit off topic though… one for llvm-dev maybe?

Thanks Steve, Dan and Chris for your inputs.

I have a follow-up question regarding emitting warnings on floating-point underflows, e.g. scenario 2 mentioned in my original post. Coming from the perspective of diagnostics, I am wondering what sort of underflows would be worth reporting to the user. More specifically, I am wondering whether it is overkill to warn on all cases where the underflow flag will be set (regardless of “before”/“after rounding” semantics).

It seems to me that if a (non-zero) literal that the user is trying to assign drops to zero, then it may be worth warning.
However, if a literal is represented by an inexact value (regardless of whether it is normal or subnormal), shouldn’t we treat it like any other value that is inexactly represented (which is to simply ignore it in the diagnostics)?
It would be very helpful if you could let me know your thoughts.

For underflow, I would warn if and only if the result differs from the value that would be produced if the exponent range were unbounded (i.e. only when tininess causes the literal to have a different value than it otherwise would).

I’m not sure that we actually have all the machinery we need in APFloat to do this at present, however.

I see. That is interesting. Just for more clarity, I have listed below a positive case where we need a warning and a slightly tricky negative case where we do not need one. Let me know if my understanding is correct:

Positive example (where tininess results in loss of precision)


// The significand has a bit width of 23, and here its last bit (LSB) is 1, which will be lost in the subnormal representation, whereas it would be preserved if the exponent were unbounded. So we need a warning here, right?
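The snippet itself seems to have been lost from this post; a Float32 literal matching the comment’s description (my reconstruction, not necessarily the original) could be:

```swift
// Hypothetical reconstruction of the lost snippet.
// (1 + 2^-23) * 2^-127: the fraction's LSB is set, but at this subnormal
// exponent only 22 fraction bits survive, so the literal rounds to 0x1p-127.
let x: Float32 = 0x1.000002p-127

print(x == 0x1p-127)  // true: the LSB was lost, so a warning is warranted
print(x.isSubnormal)  // true
```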

Negative example (here there is a loss of precision but it is not because of tininess of the value)


// The significand has a bit width of 23, and even with an unbounded exponent (but bounded mantissa) the number would be approximated as 0x1p-127. This value is captured exactly by the subnormal representation, and hence we do not need a warning, right?
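Again the snippet appears to be missing; a literal fitting this description (my reconstruction, not necessarily the original) could be:

```swift
// Hypothetical reconstruction of the lost snippet.
// (1 + 2^-24) * 2^-127: even with an unbounded exponent, the 24th fraction
// bit forces rounding (ties-to-even) to 0x1p-127, and 0x1p-127 is exactly
// representable as a subnormal, so tininess loses nothing extra.
let y: Float32 = 0x1.000001p-127

print(y == 0x1p-127)  // true, but not because of tininess: no warning needed
```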

Regarding APFloat

It seems that APFloat does not have support for this, though it can tell us when the underflow flag will be set (which is not exactly what we want, as it will be set in my “negative example” as well). However, if we extract the significand and exponent bit patterns from APFloat then, I guess, we can check for this property in the SIL diagnostics phase (e.g. by comparing the significand precision before and after truncation, after taking into account the subnormal representation). Let me know if this is a reasonable approach to take.

Yes, that’s what I had in mind. But what would you think about warning for any hex literal that isn’t exact? The whole point of a hex-float literal is to be exact, after all.
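For instance (a sketch), this hex literal needs 24 fraction bits, so it cannot be a Float32 value exactly; a warning on inexact hex literals would flag it:

```swift
// A hex literal that Float32 cannot represent exactly: the 24th fraction
// bit forces rounding (ties-to-even) down to 1.0.
let z: Float = 0x1.000001p0

print(z == 1.0)  // true: the literal was silently rounded
```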

Makes sense to me.