Has anyone implemented a Float8 / Quarter type?

Agreed, so I did a quick test where I modified Float8 to have

  • 4 bits exponent
  • 3 bits significand

(instead of the other way around)

This choice feels like a better one.

  • It can represent 240 finite values, from -240 to 240
  • Float8(240).ulp == Float8(128).ulp == 16.0
  • Has ulp <= 1 between -16 and 16
  • Least nonzero magnitude is 0.001953125
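
These headline numbers follow directly from the 1-4-3 format parameters; here's a quick sketch of the arithmetic (my own, using Double to stand in, since Float8 is hypothetical):

```swift
// Format parameters of the 1-4-3 Float8: 1 sign, 4 exponent, 3 significand bits.
let exponentBits = 4
let significandBits = 3
let bias = (1 << (exponentBits - 1)) - 1                         // 7
let emax = bias                                                  // largest finite exponent

// Greatest finite magnitude: significand 1.111₂ = 1.875 at exponent emax.
let greatestFinite = 1.875 * Double(1 << emax)                   // 240.0

// ulp in the top binade: 2^emax / 2^significandBits.
let ulpAtTop = Double(1 << emax) / Double(1 << significandBits)  // 16.0

// Least nonzero (subnormal) magnitude: 2^(1 - bias - significandBits) = 2^-9.
let leastNonzero = 1.0 / Double(1 << (bias + significandBits - 1))

print(greatestFinite, ulpAtTop, leastNonzero)  // 240.0 16.0 0.001953125
```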

So it works with Swift's Random API implementation.

I might post the code later.


By the way, on the subtopic of trying to avoid accidental infinite recursion:
Is there any tool (like Xcode's Static Analyzer) we can use to automatically report potential cycles in a call graph?


Could this be an over-specification on random()’s part?

The following seems to be true for all standard Float

  • FloatX(1.0).exponentBitPattern == (1 << (FloatX.exponentBitCount - 1)) - 1
  • FloatX.greatestFiniteMagnitude > (1 << FloatX.significandBitCount)
  • FloatX.significandBitCount > FloatX.exponentBitCount
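
These observations can be checked generically (a sketch of my own; FloatX above stands for any standard type):

```swift
// Verify the three observed invariants for any BinaryFloatingPoint type.
func holds<F: BinaryFloatingPoint>(_ type: F.Type) -> Bool {
    // The exponent bias is 2^(exponentBitCount - 1) - 1.
    let bias = (F.RawExponent(1) << (F.exponentBitCount - 1)) - 1
    return F(1).exponentBitPattern == bias
        && F.greatestFiniteMagnitude > F(1 << F.significandBitCount)
        && F.significandBitCount > F.exponentBitCount
}

print(holds(Float.self), holds(Double.self))  // true true
```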

So, there could be more hidden requirements in the FloatingPoint protocol implementation.

Making the protocol implementation support all possible variants of exponentBitCount, significandBitCount, and _exponentBias would likely be unwise, as only the ones supported by hardware are really useful in real life (IMHO), which is why I did not blame the random() implementation.


The last one of those shouldn’t be required. The other two are generally desirable properties for a floating-point number system, however (and note that [Binary]FloatingPoint specifically binds IEEE 754 formats, which always have those properties).


I have some more questions in this vein:

Would it be valid to make a BinaryFloatingPoint type where both RawExponent and RawSignificand conform to FixedWidthInteger, and…

a) RawSignificand.bitWidth == 0 ?
b) significandBitCount == 0 ?
c) RawSignificand.bitWidth == significandBitCount ?
d) RawExponent.bitWidth == 0 ?
e) exponentBitCount == 0 ?
f) RawExponent.bitWidth == exponentBitCount ?

(I’m writing some generic code that would need special cases to handle these if they’re valid.)


a. No, because an IEEE 754 binary format needs to be able to differentiate qNaN and sNaN, and neither the sign nor exponent bits may be used for that purpose.
b. See previous
c. Yes, this is allowed (but no IEEE 754 basic format has this situation).
d. No, IEEE 754 imposes the following constraints on the exponent field (where w is the width in bits, emin is the minimum normal exponent, and emax is the maximum finite exponent):

  1. emin = 1 - emax
  2. emin ≤ emax.
  3. emax = 2**(w-1)-1

If w is 0 or 1, these constraints are violated. The smallest allowed RawExponent.bitWidth is 2.
e. See previous
f. Definitely permitted (but no IEEE 754 basic format is in this situation).
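
The exponent-width constraints in (d) can be illustrated numerically (my own sketch, not from the standard's text):

```swift
// For an exponent field of width w: emax = 2^(w-1) - 1, emin = 1 - emax,
// and emin ≤ emax must hold.
for w in 1...4 {
    let emax = (1 << (w - 1)) - 1
    let emin = 1 - emax
    print("w=\(w): emax=\(emax), emin=\(emin), valid=\(emin <= emax)")
}
// w=1 gives emax = 0 and emin = 1, violating emin ≤ emax;
// w=2 (emax = 1, emin = 0) is the smallest width that satisfies all three.
```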


Great, thanks!

Also, I just noticed that Float has UInt as its RawExponent type, but UInt32 as its RawSignificand.

What’s the rationale for making the significand fixed-size and the exponent platform-word-sized?

Very weak. There were some convenience factors owing to the limitations of the (radically different) integer protocols we had at the time that work was done, but also: UInt is big enough for every floating-point type you're likely to encounter in normal use, ever, so having a single type that matches is a nice convenience. That's not true of any efficient type for significands. (The better question, then, is why do we have the RawExponent associated type at all, instead of just using UInt, and the answer to that is basically "an overabundance of caution").


Is it mandatory that the significand bits are used for that purpose, or could there be an additional dedicated bit in the type?

For example, Float80 has an extra bit which indicates, essentially, isNormal. Could the same strategy be used for distinguishing NaNs?

You're going a bit off the rails of what IEEE 754 defined, historically.

However, the new (2018) standard comes to the rescue with some clarity:

NOTE 3—For binary formats, the precision p should be at least 3, as some numerical properties do not hold for lower precisions.
Similarly, emax should be at least 2 to support the operations listed in 9.2.

So the significand should have at least three bits (one of which may be implicit), separate from the need to encode qNaN and sNaN.


I’m not an expert on the IEEE–754 definitions, that’s why I’m asking these questions.

I’m writing Swift code where I would like to be able to assume that significandBitCount > 0. If the floating-point standards and/or the Swift protocols don’t officially rule out it being zero, then I’ll have to perform extra checks and provide alternate code-paths to handle that.

Excellent, this is exactly the information I was looking for. Thanks again!

If you're getting into these sorts of low-level details, it would be well-worth your time to read the standard. It's gotten longer over the years, but clauses 1-3, which contain the answers to all of these questions, are less than 20 pages in total.

So, in a Float8 with 4 exponent bits and 3 significand bits, this would be OK:

    static var nan: Float8 { Float8(bitPattern: 0b0_1111_100) }
    static var signalingNaN: Float8 { Float8(bitPattern: 0b0_1111_010) }

(Meaning that the nan/snan payload will be just one bit.)
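
For reference, here's how such a pattern decodes (a sketch assuming the 1-4-3 bit layout; the field extraction is my own illustration, not Float8 API):

```swift
// 0b0_1111_100: exponent field all ones with a nonzero significand ⇒ NaN.
// The significand's leading bit set marks it quiet; the low bits are payload.
let bits: UInt8 = 0b0_1111_100
let exponentField = (bits >> 3) & 0b1111
let significandField = bits & 0b111

let isNaN = exponentField == 0b1111 && significandField != 0
let isQuiet = significandField & 0b100 != 0

print(isNaN, isQuiet)  // true true
```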

Yes, that's fine. The following would perhaps be slightly nicer:

    static var nan: Float8 { Float8(bitPattern: 0b0_1111_110) }
    static var signalingNaN: Float8 { Float8(bitPattern: 0b0_1111_010) }

We don't currently do this for standard library floating-point types for weird historical reasons that are surprisingly uninteresting, but it's an infinitesimally better choice looking into the future.¹

¹ So infinitesimal that I hesitate to bring it up, because it's due to oddball runtime shenanigans that are extremely unlikely to ever be used and therefore really not worth wasting everyone's attention on. There's like a .001% chance of it ever mattering, so basically forget we had this talk.


I noticed that my Float8 implementation doesn't quite match the behavior of Float and Double when it comes to rounding, e.g.:

typealias F = Float8
let a = F.greatestFiniteMagnitude
let b = a.ulp / 2
                                 // F = Float  F = Float8
print(a + b.nextDown == a)       //   true       true
print(a + b == .infinity)        //   true       false
print(a + b.nextUp == .infinity) //   true       true

(It's not specific to near-infinity values or the + operator; it simply rounds a value exactly in the middle between two representable values differently.)

The cause of this turns out to be the way I convert values of another floating-point type to Float8. The standard library uses this (gyb template) for, e.g., Float32.init(_ other: Float64):

  public init(_ other: ${That}) {
%   if srcBits > bits:
    _value = Builtin.fptrunc_FPIEEE${srcBits}_FPIEEE${bits}(other._value)
%   elif srcBits < bits:
    _value = Builtin.fpext_FPIEEE${srcBits}_FPIEEE${bits}(other._value)
%   else:
    _value = other._value
%   end

My corresponding Float8.init(_ other: Float32) is this:

init<Source: BinaryFloatingPoint>(_ value: Source) {
    self = Float8._convert(from: value).value
}

(where Float8._convert(from:) is my own copy of the same named standard library method, to prevent unintentional infinite recursion.)

I figured that should perform the same kind of conversion, but it doesn't, as can be demonstrated like this:

let a = Float.greatestFiniteMagnitude
let b = a.ulp / 2
let c = Double(a) + Double(b)
print(Float.init(c))           // inf
print(Float._convert(from: c)) // (value: 3.4028235e+38, exact: false)

Why aren't both inf (or 3.4028235e+38)?

That is, shouldn't
Float._convert(from: myDouble).value
always be equal to
Float(myDouble)?

If not, why are they doing their rounding differently?

Well, because I was the one who implemented the standard library's generic _convert, but I had nothing to do with the concrete Float.init :slight_smile: (the latter being a compiler intrinsic).

Our promise (documented in the standard library) is to create a "new instance from the given value, rounded to the closest possible representation." Now, in your example, c is roughly 3.4028235677973366E+38. The value obtained from _convert is finite and the alternative is infinity; the closer of the two converted values is (obviously) the finite value, so if we take the documentation at face value, my implementation is correct!

@scanon will no doubt explain why that's not the case or shouldn't be.


For the purposes of rounding finite values infinity behaves as though it were the next larger finite value (alternatively, in the language of IEEE 754, overflow is detected by rounding as though the exponent range were unbounded, and checking to see if the resulting rounded exponent is representable).

So, in particular, greatestFiniteMagnitude + ulp/2 should round up to infinity, because it's exactly halfway between greatestFiniteMagnitude and greatestFiniteMagnitude + ulp, and the latter is even.
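
The same tie-to-even behavior shows up away from infinity, too; a quick check with ordinary Float (my own example, relying on the concrete intrinsic conversion):

```swift
// 1.0 + 1.5·ulp(Float) is exactly halfway between 1.0 + ulp (odd significand)
// and 1.0 + 2·ulp (even significand); round-to-nearest-even picks the even one.
let u = Double(Float(1).ulp)   // 2^-23
let tie = 1.0 + 1.5 * u        // exactly representable in Double

print(Float(tie) == Float(1).nextUp.nextUp)  // true: rounded to the even neighbor
```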

This is a bug that should be fixed (and some test cases added).


Yes, I thought so too. Here's a program that prints each pair of mismatching conversions from Double to Float for all Double values in 1.0 ... 2.0:

func concrete(_ value: Double) -> Float {
    return Float.init(value) // Will call intrinsic
}
func generic<T: BinaryFloatingPoint>(_ value: T) -> Float {
    return Float.init(value) // Will call ._convert(from:)
}
extension String {
    func leftPadded(to minCount: Int, with char: Character = " ") -> String {
        return String(repeating: char, count: max(0, minCount - count)) + self
    }
}
extension BinaryFloatingPoint {
    var segmentedBinaryString: String {
        let e = String(exponentBitPattern, radix: 2)
        let s = String(significandBitPattern, radix: 2)
        return [self.sign == .plus ? "0" : "1", "_",
                e.leftPadded(to: Self.exponentBitCount, with: "0"), "_",
                s.leftPadded(to: Self.significandBitCount, with: "0")].joined()
    }
}
func test() {
    var d = Double(1)
    let step = d.ulp
    var mc = 0
    while d <= 2 {
        let a = concrete(d)
        let b = generic(d)
        if a != b {
            print("Found mismatched conversion (after \(mc) matching conversions):")
            print(" Double:  ", d.segmentedBinaryString, d)
            print(" concrete:", a.segmentedBinaryString, a)
            print(" generic: ", b.segmentedBinaryString, b)
            mc = 0
        } else {
            mc &+= 1
        }
        d += step
    }
}
test()

Here's what it will print:

Found mismatched conversion (after 805306368 matching conversions):
 Double:   0_01111111111_0000000000000000000000110000000000000000000000000000 1.0000001788139343
 concrete: 0_01111111_00000000000000000000010 1.0000002
 generic:  0_01111111_00000000000000000000001 1.0000001
Found mismatched conversion (after 1073741823 matching conversions):
 Double:   0_01111111111_0000000000000000000001110000000000000000000000000000 1.0000004172325134
 concrete: 0_01111111_00000000000000000000100 1.0000005
 generic:  0_01111111_00000000000000000000011 1.0000004
Found mismatched conversion (after 1073741823 matching conversions):
 Double:   0_01111111111_0000000000000000000010110000000000000000000000000000 1.0000006556510925
 concrete: 0_01111111_00000000000000000000110 1.0000007
 generic:  0_01111111_00000000000000000000101 1.0000006
Found mismatched conversion (after 1073741823 matching conversions):
 Double:   0_01111111111_0000000000000000000011110000000000000000000000000000 1.0000008940696716
 concrete: 0_01111111_00000000000000000001000 1.000001
 generic:  0_01111111_00000000000000000000111 1.0000008
Found mismatched conversion (after 1073741823 matching conversions):
 Double:   0_01111111111_0000000000000000000100110000000000000000000000000000 1.0000011324882507
 concrete: 0_01111111_00000000000000000001010 1.0000012
 generic:  0_01111111_00000000000000000001001 1.0000011
Found mismatched conversion (after 1073741823 matching conversions):
 Double:   0_01111111111_0000000000000000000101110000000000000000000000000000 1.0000013709068298
 concrete: 0_01111111_00000000000000000001100 1.0000014
 generic:  0_01111111_00000000000000000001011 1.0000013
Found mismatched conversion (after 1073741823 matching conversions):
 Double:   0_01111111111_0000000000000000000110110000000000000000000000000000 1.000001609325409
 concrete: 0_01111111_00000000000000000001110 1.0000017
 generic:  0_01111111_00000000000000000001101 1.0000015
Found mismatched conversion (after 1073741823 matching conversions):
 Double:   0_01111111111_0000000000000000000111110000000000000000000000000000 1.000001847743988
 concrete: 0_01111111_00000000000000000010000 1.0000019
 generic:  0_01111111_00000000000000000001111 1.0000018
Found mismatched conversion (after 1073741823 matching conversions):
 Double:   0_01111111111_0000000000000000001000110000000000000000000000000000 1.0000020861625671
 concrete: 0_01111111_00000000000000000010010 1.0000021
 generic:  0_01111111_00000000000000000010001 1.000002
Found mismatched conversion (after 1073741823 matching conversions):
 Double:   0_01111111111_0000000000000000001001110000000000000000000000000000 1.0000023245811462
 concrete: 0_01111111_00000000000000000010100 1.0000024
 generic:  0_01111111_00000000000000000010011 1.0000023
Found mismatched conversion (after 1073741823 matching conversions):
 Double:   0_01111111111_0000000000000000001010110000000000000000000000000000 1.0000025629997253
 concrete: 0_01111111_00000000000000000010110 1.0000026
 generic:  0_01111111_00000000000000000010101 1.0000025
Found mismatched conversion (after 1073741823 matching conversions):
 Double:   0_01111111111_0000000000000000001011110000000000000000000000000000 1.0000028014183044
 concrete: 0_01111111_00000000000000000011000 1.0000029
 generic:  0_01111111_00000000000000000010111 1.0000027
Found mismatched conversion (after 1073741823 matching conversions):
 Double:   0_01111111111_0000000000000000001100110000000000000000000000000000 1.0000030398368835
 concrete: 0_01111111_00000000000000000011010 1.0000031
 generic:  0_01111111_00000000000000000011001 1.000003

(I stopped it there (after a couple of minutes) :slight_smile: .)

The generic one does not behave according to its documentation, i.e.:

If two representable values are equally close, the result is the value with more trailing zeros in its significand bit pattern.
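
The first mismatch above is exactly such a tie: the two candidate significands end in …001 and …010, and the documented rule picks the one with more trailing zeros, which is what the concrete conversion does:

```swift
// Significand bit patterns of the two candidates from the first mismatch.
let lower: UInt32 = 0b00000000000000000000001  // generic result (wrong per the doc)
let upper: UInt32 = 0b00000000000000000000010  // concrete result

// The documented tie-break: prefer more trailing zeros.
print(upper.trailingZeroBitCount > lower.trailingZeroBitCount)  // true
```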


Ha! I knew there was a reason I had to be wrong. This is the beauty of floating point.

Yeah, those are bugs. I thought I'd handled this case; I'll need to see where the logic error lies.
