The main issue with Float8 is that it just doesn’t make a lot of sense: for almost all uses, a fixed-point format works better once you go below 16 bits. The main advantage of floating-point is its wide dynamic range, and that goes out the window with these very narrow types. (I might carve out an exception for very specific uses of types that are mostly exponent, but those are weird little niches.)

That’s why I wrote:

Without doing a detailed review, I'll observe that your basic implementation strategy for arithmetic (promoting to `Float`, then doing the work, then rounding to `Float8`) is sound and will produce correctly-rounded results for any operation that isn't dependent on the specific details of the format (`.ulp`, `.nextUp`, `.nextDown`, etc.). So that looks good.

I would personally probably use the same approach more aggressively for some other operations, like `init(sign: FloatingPointSign, exponent: Int, significand: Float8)`, but what you have looks like it's probably correct.
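The promote-compute-round pattern described above can be sketched like this, using standard `Float`/`Double` as a stand-in for `Float8`/`Float` so the snippet is self-contained (this is my illustration, not the code under review):

```swift
// Promote to a wider type, do the arithmetic there, round once on the
// way back. Double is more than wide enough that a single Float
// addition computed this way is correctly rounded.
func addViaPromotion(_ a: Float, _ b: Float) -> Float {
    return Float(Double(a) + Double(b))
}

let sum = addViaPromotion(1.5, 2.25)  // 3.75
```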

Thanks!

Btw, I found a stupid mistake:

```
static prefix func -(lhs: Float8) -> Float8 {
    var lhs = Float8(lhs.magnitude)
    lhs.bitPattern ^= 0b1_000_0000
    return lhs
}
```

should be:

```
static prefix func -(lhs: Float8) -> Float8 {
    return Float8(bitPattern: lhs.bitPattern ^ 0b1_000_0000)
}
```

(unless that proves to be wrong too ... I had to implement my own because my reuse of `Float` caused an infinite recursion with infix `(-)` otherwise)
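The sign-bit XOR trick generalizes to any IEEE-style format; here is the same idea demonstrated on standard `Float`, where the sign bit is the top bit of the 32-bit pattern (my own illustration, `0x8000_0000` playing the role of `Float8`'s `0b1_000_0000`):

```swift
// Negation as a pure bit operation: flip only the sign bit.
let x: Float = 1.5
let negated = Float(bitPattern: x.bitPattern ^ 0x8000_0000)

// It also produces a proper negative zero, which `0 - x` style
// tricks can get wrong.
let minusZero = Float(bitPattern: Float(0).bitPattern ^ 0x8000_0000)
```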

Will correct the code in the previous post.

I've noticed some more cases like that, where promoting to `Float`, or rather converting back from `Float`, can cause an infinite recursion just for some particular values, for example:

```
init(floatLiteral value: Float) {
    // NOTE: Infinite recursion here, caused by:
    //   `Float8(-Float(0))`
    // but not by e.g.:
    //   `Float8(-Float(1))` or `Float8(Float(0))` ...
    self.init(value) // <-- So I guess this init is calling back to this (floatLiteral) init when `value` is negative zero.
}
```

So I turned it into this:

```
init(floatLiteral value: Float) {
    // There was an infinite recursion here for e.g. `Float8(-Float(0))`,
    // but not for `Float8(-Float(1))` or `Float8(Float(0))`.
    // Note that `value == -Float(0)` would be true for positive zero
    // too (since -0.0 == 0.0), so check the sign bit explicitly.
    // This takes care of that particular case, but are there more?
    if value.isZero && value.sign == .minus {
        self.init(bitPattern: 0b1_000_0000)
    } else {
        self.init(value) // <-- Will this call back to this for some other value?
    }
}
```
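One subtlety worth calling out with any zero check here: under IEEE 754, `-0.0 == 0.0` is true, so plain equality cannot distinguish the two zeros; the sign has to be inspected directly. A small demonstration on standard `Float` (my own snippet):

```swift
let negZero: Float = -0.0
let posZero: Float = 0.0

// Equality treats the two zeros as the same value...
let equal = negZero == posZero                       // true

// ...but the sign property tells them apart.
let signsDiffer = negZero.sign == .minus && posZero.sign == .plus
```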

If I'm not mistaken, this means that when I am promoting to `Float` in some func/init/property `A`, and do `self.init(f)` (or `Float8(f)`), I am just lucky if that `self.init(f)` (which I have no control over) doesn't, and won't ever in the future, call back to `A` for any `f` that I haven't taken care of as above ...

In this particular case, `self.init(f)` (with `f` being `-Float(0)`) ends up calling `_convert`, which will then call back to my `init(floatLiteral:)`, and we have infinite recursion.

```
@inlinable
public // @testable
static func _convert<Source: BinaryFloatingPoint>(
    from source: Source
) -> (value: Self, exact: Bool) {
    guard _fastPath(!source.isZero) else {
        return (source.sign == .minus ? -0.0 : 0, true) // <-- Here! That `-0.0` calls back to my floatLiteral init.
    }
    ...
```

It seems kind of hard, or even impossible, to protect against this kind of infinite recursion when promoting to `Float` ...

I wrote about this here:

In this float representation, we always have (for positive values): `Self < (1 << Self.significandBitCount)`. This breaks at least the `random()` implementation, which requires being able to represent twice as much.

Too bad; a float that is visualizable by the human brain (like this one) could have had such educational power for understanding the internals of floats.

Perhaps it would be better with four exponent and three significand bits?

Agreed, so I did a quick test where I modified `Float8` to have:

- 4 exponent bits
- 3 significand bits

(instead of the other way around)

This choice feels like the better one.

- It can represent 240 finite values, from `-240` to `240`, with `Float8(240).ulp == Float8(128).ulp == 16.0`
- Has `ulp <= 1` between `-16` and `16`
- Least nonzero magnitude is `0.001953125`

So it works with Swift's Random API implementation.
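Those numbers can be sanity-checked against the usual IEEE-style formulas; this is my own sketch (variable names are mine), assuming the all-ones exponent is reserved for infinity/NaN:

```swift
import Foundation

let exponentBits = 4
let significandBits = 3

let bias = (1 << (exponentBits - 1)) - 1        // 7
let emax = ((1 << exponentBits) - 2) - bias     // 7 (all-ones exponent reserved)

// Greatest finite magnitude: (2 - 2^-p) * 2^emax
let greatest = (2.0 - pow(2.0, Double(-significandBits))) * pow(2.0, Double(emax))

// Least nonzero magnitude (smallest subnormal): 2^(1 - bias - p)
let leastNonzero = pow(2.0, Double(1 - bias - significandBits))

// ulp in the top binade: 2^(emax - p)
let ulpAtMax = pow(2.0, Double(emax - significandBits))

print(greatest, leastNonzero, ulpAtMax)  // 240.0 0.001953125 16.0
```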

I might post the code later.

By the way, on the subtopic of trying to avoid accidental infinite recursion:

Is there any tool or something (like Xcode's Static Analyzer) we can use to automatically report any potential loops in a call graph?

Could this be an over-specification on `random()`’s part?

The following seems to be true for all standard `Float` types:

- `FloatX(1.0).exponentBitPattern == (1 << (FloatX.exponentBitCount - 1)) - 1`
- `FloatX.greatestFiniteMagnitude > (1 << FloatX.significandBitCount)`
- `FloatX.significandBitCount > FloatX.exponentBitCount`
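For what it's worth, all three do hold for the standard `Float` type; a quick check (my own snippet):

```swift
// Float: 8 exponent bits, 23 significand bits, bias 127.
// The exponent field of 1.0 is exactly the bias.
assert(Float(1.0).exponentBitPattern == UInt((1 << (Float.exponentBitCount - 1)) - 1)) // 127

// greatestFiniteMagnitude (~3.4e38) far exceeds 2^23.
assert(Float.greatestFiniteMagnitude > Float(1 << Float.significandBitCount))

// 23 > 8
assert(Float.significandBitCount > Float.exponentBitCount)
```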

So there could be more hidden requirements in the `FloatingPoint` protocol implementation.

Making the protocol implementation support all possible variants of `exponentBitCount`, `significandBitCount`, and `_exponentBias` would likely be unwise, as only the ones supported by hardware are really useful in real life (IMHO). Which is why I did not blame the `random()` implementation.

The last one of those shouldn’t be required. The other two are generally desirable properties for a floating-point number system, however (and note that `[Binary]FloatingPoint` specifically binds IEEE 754 formats, which always have those properties).

I have some more questions in this vein:

Would it be valid to make a `BinaryFloatingPoint` type where both `RawExponent` and `RawSignificand` conform to `FixedWidthInteger`, and…

a) `RawSignificand.bitWidth == 0`?
b) `significandBitCount == 0`?
c) `RawSignificand.bitWidth == significandBitCount`?
d) `RawExponent.bitWidth == 0`?
e) `exponentBitCount == 0`?
f) `RawExponent.bitWidth == exponentBitCount`?

(I’m writing some generic code that would need special cases to handle these if they’re valid.)

a. No, because an IEEE 754 binary format needs to be able to differentiate qNaN and sNaN, and neither the sign nor exponent bits may be used for that purpose.

b. See previous

c. Yes, this is allowed (but no IEEE 754 basic format has this situation).

d. No, IEEE 754 imposes the following constraints on the exponent field (where w is the width in bits, emin is the minimum normal exponent, and emax is the maximum finite exponent):

- emin = 1 - emax
- emin ≤ emax
- emax = 2**(w-1) - 1

If w is 0 or 1, these constraints are violated. The smallest allowed `RawExponent.bitWidth` is 2.
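Plugging small widths into those formulas makes the violation concrete (my own quick check):

```swift
// emax = 2^(w-1) - 1, emin = 1 - emax; IEEE 754 requires emin <= emax.
for w in 1...3 {
    let emax = (1 << (w - 1)) - 1
    let emin = 1 - emax
    print("w=\(w): emin=\(emin), emax=\(emax), ok=\(emin <= emax)")
}
// w=1 yields emin=1 > emax=0, violating emin <= emax;
// w=2 (emin=0, emax=1) is the first width that satisfies all three.
```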

e. See previous

f. Definitely permitted (but no IEEE 754 basic format is in this situation).

Great, thanks!

Also, I just noticed that `Float` has `UInt` as its `RawExponent` type, but `UInt32` as its `RawSignificand`.

What’s the rationale for making the significand fixed-size and the exponent platform-word-sized?

Very weak. There were some convenience factors owing to the limitations of the (radically different) integer protocols we had at the time that work was done, but also: `UInt` is big enough for every floating-point type you're likely to encounter in normal use, ever, so having a single type that matches is a nice convenience. That's not true of any efficient type for significands. (The better question, then, is why we have the `RawExponent` associated type at all, instead of just using `UInt`, and the answer to that is basically "an overabundance of caution".)

Is it mandatory that the significand bits are used for that purpose, or could there be an additional dedicated bit in the type?

For example, `Float80` has an extra bit which indicates, essentially, `isNormal`. Could the same strategy be used for distinguishing NaNs?

You're going a bit off the rails of what IEEE 754 defined, historically.

However, the new (2018) standard comes to the rescue with some clarity:

NOTE 3—For binary formats, the precision p should be at least 3, as some numerical properties do not hold for lower precisions.

Similarly, emax should be at least 2 to support the operations listed in 9.2.

So the significand should have at least three bits (one of which may be implicit), separate from the need to encode qNaN and sNaN.