Has anyone implemented a Float8 / Quarter type?

Jens · February 1, 2020, 10:57pm

I'm looking for an 8-bit "quarter precision" floating point type, as described here, conforming to BinaryFloatingPoint etc, to use when investigating numerical code.

I could use the upcoming Float16 for these purposes but a Float8 would be even better.

An 8-bit format, although too small to be seriously practical, is both large enough to be instructive and small enough to be examined in its entirety.

Jens · February 3, 2020, 7:17pm

I have put together a (very) rough implementation of Float8. It conforms to BinaryFloatingPoint and the little testing I've done seems to suggest that it works as expected, but it probably doesn't.

It has

Float8.exponentBitCount == 3
Float8.significandBitCount == 4
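
For anyone who wants to check the limits of the format by hand, here is a rough sketch (not taken from the implementation in this thread) that derives the bias and the extreme magnitudes from just those two bit counts, assuming the usual IEEE 754-style conventions. The constant names are only illustrative.

```swift
// Assumed layout: 1 sign bit, 3 exponent bits, 4 significand bits.
let exponentBitCount = 3
let significandBitCount = 4

// IEEE 754-style bias: 2^(exponentBitCount - 1) - 1.
let bias = (1 << (exponentBitCount - 1)) - 1

// Exponent range of the normal numbers (raw exponent bits 1 through 6).
let leastNormalExponent = 1 - bias
let greatestFiniteExponent = (1 << exponentBitCount) - 2 - bias

// 2^(-significandBitCount) is the weight of the last significand bit.
let lastBit = Double(sign: .plus, exponent: -significandBitCount, significand: 1)

let leastNormal = Double(sign: .plus, exponent: leastNormalExponent, significand: 1)
let leastNonzero = leastNormal * lastBit
let greatestFinite = Double(sign: .plus,
                            exponent: greatestFiniteExponent,
                            significand: 2 - lastBit)

print(bias, leastNormal, leastNonzero, greatestFinite)
// prints "3 0.25 0.015625 15.5"
```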

And here's a list of all 256 possible values / bitPatterns:

   N      Float8   bitPattern  exponent  significand     binade         ulp
---------------------------------------------------------------------------
   0         0.0   0_000_0000   Int.min         0.0         0.0    0.015625
   1    0.015625   0_000_0001        -6         1.0    0.015625    0.015625
   2     0.03125   0_000_0010        -5         1.0     0.03125    0.015625
   3    0.046875   0_000_0011        -5         1.5     0.03125    0.015625
   4      0.0625   0_000_0100        -4         1.0      0.0625    0.015625
   5    0.078125   0_000_0101        -4        1.25      0.0625    0.015625
   6     0.09375   0_000_0110        -4         1.5      0.0625    0.015625
   7    0.109375   0_000_0111        -4        1.75      0.0625    0.015625
   8       0.125   0_000_1000        -3         1.0       0.125    0.015625
   9    0.140625   0_000_1001        -3       1.125       0.125    0.015625
  10     0.15625   0_000_1010        -3        1.25       0.125    0.015625
  11    0.171875   0_000_1011        -3       1.375       0.125    0.015625
  12      0.1875   0_000_1100        -3         1.5       0.125    0.015625
  13    0.203125   0_000_1101        -3       1.625       0.125    0.015625
  14     0.21875   0_000_1110        -3        1.75       0.125    0.015625
  15    0.234375   0_000_1111        -3       1.875       0.125    0.015625
  16        0.25   0_001_0000        -2         1.0        0.25    0.015625
  17    0.265625   0_001_0001        -2      1.0625        0.25    0.015625
  18     0.28125   0_001_0010        -2       1.125        0.25    0.015625
  19    0.296875   0_001_0011        -2      1.1875        0.25    0.015625
  20      0.3125   0_001_0100        -2        1.25        0.25    0.015625
  21    0.328125   0_001_0101        -2      1.3125        0.25    0.015625
  22     0.34375   0_001_0110        -2       1.375        0.25    0.015625
  23    0.359375   0_001_0111        -2      1.4375        0.25    0.015625
  24       0.375   0_001_1000        -2         1.5        0.25    0.015625
  25    0.390625   0_001_1001        -2      1.5625        0.25    0.015625
  26     0.40625   0_001_1010        -2       1.625        0.25    0.015625
  27    0.421875   0_001_1011        -2      1.6875        0.25    0.015625
  28      0.4375   0_001_1100        -2        1.75        0.25    0.015625
  29    0.453125   0_001_1101        -2      1.8125        0.25    0.015625
  30     0.46875   0_001_1110        -2       1.875        0.25    0.015625
  31    0.484375   0_001_1111        -2      1.9375        0.25    0.015625
  32         0.5   0_010_0000        -1         1.0         0.5     0.03125
  33     0.53125   0_010_0001        -1      1.0625         0.5     0.03125
  34      0.5625   0_010_0010        -1       1.125         0.5     0.03125
  35     0.59375   0_010_0011        -1      1.1875         0.5     0.03125
  36       0.625   0_010_0100        -1        1.25         0.5     0.03125
  37     0.65625   0_010_0101        -1      1.3125         0.5     0.03125
  38      0.6875   0_010_0110        -1       1.375         0.5     0.03125
  39     0.71875   0_010_0111        -1      1.4375         0.5     0.03125
  40        0.75   0_010_1000        -1         1.5         0.5     0.03125
  41     0.78125   0_010_1001        -1      1.5625         0.5     0.03125
  42      0.8125   0_010_1010        -1       1.625         0.5     0.03125
  43     0.84375   0_010_1011        -1      1.6875         0.5     0.03125
  44       0.875   0_010_1100        -1        1.75         0.5     0.03125
  45     0.90625   0_010_1101        -1      1.8125         0.5     0.03125
  46      0.9375   0_010_1110        -1       1.875         0.5     0.03125
  47     0.96875   0_010_1111        -1      1.9375         0.5     0.03125
  48         1.0   0_011_0000         0         1.0         1.0      0.0625
  49      1.0625   0_011_0001         0      1.0625         1.0      0.0625
  50       1.125   0_011_0010         0       1.125         1.0      0.0625
  51      1.1875   0_011_0011         0      1.1875         1.0      0.0625
  52        1.25   0_011_0100         0        1.25         1.0      0.0625
  53      1.3125   0_011_0101         0      1.3125         1.0      0.0625
  54       1.375   0_011_0110         0       1.375         1.0      0.0625
  55      1.4375   0_011_0111         0      1.4375         1.0      0.0625
  56         1.5   0_011_1000         0         1.5         1.0      0.0625
  57      1.5625   0_011_1001         0      1.5625         1.0      0.0625
  58       1.625   0_011_1010         0       1.625         1.0      0.0625
  59      1.6875   0_011_1011         0      1.6875         1.0      0.0625
  60        1.75   0_011_1100         0        1.75         1.0      0.0625
  61      1.8125   0_011_1101         0      1.8125         1.0      0.0625
  62       1.875   0_011_1110         0       1.875         1.0      0.0625
  63      1.9375   0_011_1111         0      1.9375         1.0      0.0625
  64         2.0   0_100_0000         1         1.0         2.0       0.125
  65       2.125   0_100_0001         1      1.0625         2.0       0.125
  66        2.25   0_100_0010         1       1.125         2.0       0.125
  67       2.375   0_100_0011         1      1.1875         2.0       0.125
  68         2.5   0_100_0100         1        1.25         2.0       0.125
  69       2.625   0_100_0101         1      1.3125         2.0       0.125
  70        2.75   0_100_0110         1       1.375         2.0       0.125
  71       2.875   0_100_0111         1      1.4375         2.0       0.125
  72         3.0   0_100_1000         1         1.5         2.0       0.125
  73       3.125   0_100_1001         1      1.5625         2.0       0.125
  74        3.25   0_100_1010         1       1.625         2.0       0.125
  75       3.375   0_100_1011         1      1.6875         2.0       0.125
  76         3.5   0_100_1100         1        1.75         2.0       0.125
  77       3.625   0_100_1101         1      1.8125         2.0       0.125
  78        3.75   0_100_1110         1       1.875         2.0       0.125
  79       3.875   0_100_1111         1      1.9375         2.0       0.125
  80         4.0   0_101_0000         2         1.0         4.0        0.25
  81        4.25   0_101_0001         2      1.0625         4.0        0.25
  82         4.5   0_101_0010         2       1.125         4.0        0.25
  83        4.75   0_101_0011         2      1.1875         4.0        0.25
  84         5.0   0_101_0100         2        1.25         4.0        0.25
  85        5.25   0_101_0101         2      1.3125         4.0        0.25
  86         5.5   0_101_0110         2       1.375         4.0        0.25
  87        5.75   0_101_0111         2      1.4375         4.0        0.25
  88         6.0   0_101_1000         2         1.5         4.0        0.25
  89        6.25   0_101_1001         2      1.5625         4.0        0.25
  90         6.5   0_101_1010         2       1.625         4.0        0.25
  91        6.75   0_101_1011         2      1.6875         4.0        0.25
  92         7.0   0_101_1100         2        1.75         4.0        0.25
  93        7.25   0_101_1101         2      1.8125         4.0        0.25
  94         7.5   0_101_1110         2       1.875         4.0        0.25
  95        7.75   0_101_1111         2      1.9375         4.0        0.25
  96         8.0   0_110_0000         3         1.0         8.0         0.5
  97         8.5   0_110_0001         3      1.0625         8.0         0.5
  98         9.0   0_110_0010         3       1.125         8.0         0.5
  99         9.5   0_110_0011         3      1.1875         8.0         0.5
 100        10.0   0_110_0100         3        1.25         8.0         0.5
 101        10.5   0_110_0101         3      1.3125         8.0         0.5
 102        11.0   0_110_0110         3       1.375         8.0         0.5
 103        11.5   0_110_0111         3      1.4375         8.0         0.5
 104        12.0   0_110_1000         3         1.5         8.0         0.5
 105        12.5   0_110_1001         3      1.5625         8.0         0.5
 106        13.0   0_110_1010         3       1.625         8.0         0.5
 107        13.5   0_110_1011         3      1.6875         8.0         0.5
 108        14.0   0_110_1100         3        1.75         8.0         0.5
 109        14.5   0_110_1101         3      1.8125         8.0         0.5
 110        15.0   0_110_1110         3       1.875         8.0         0.5
 111        15.5   0_110_1111         3      1.9375         8.0         0.5
 112         inf   0_111_0000   Int.max         inf         nan         nan
 113   snan(0x1)   0_111_0001   Int.max   snan(0x1)         nan         nan
 114   snan(0x2)   0_111_0010   Int.max   snan(0x2)         nan         nan
 115   snan(0x3)   0_111_0011   Int.max   snan(0x3)         nan         nan
 116        snan   0_111_0100   Int.max        snan         nan         nan
 117   snan(0x1)   0_111_0101   Int.max   snan(0x1)         nan         nan
 118   snan(0x2)   0_111_0110   Int.max   snan(0x2)         nan         nan
 119   snan(0x3)   0_111_0111   Int.max   snan(0x3)         nan         nan
 120         nan   0_111_1000   Int.max         nan         nan         nan
 121    nan(0x1)   0_111_1001   Int.max    nan(0x1)         nan         nan
 122    nan(0x2)   0_111_1010   Int.max    nan(0x2)         nan         nan
 123    nan(0x3)   0_111_1011   Int.max    nan(0x3)         nan         nan
 124         nan   0_111_1100   Int.max         nan         nan         nan
 125    nan(0x1)   0_111_1101   Int.max    nan(0x1)         nan         nan
 126    nan(0x2)   0_111_1110   Int.max    nan(0x2)         nan         nan
 127    nan(0x3)   0_111_1111   Int.max    nan(0x3)         nan         nan
 128        -0.0   1_000_0000   Int.min         0.0        -0.0    0.015625
 129   -0.015625   1_000_0001        -6         1.0    0.015625    0.015625
 130    -0.03125   1_000_0010        -5         1.0     0.03125    0.015625
 131   -0.046875   1_000_0011        -5         1.5     0.03125    0.015625
 132     -0.0625   1_000_0100        -4         1.0      0.0625    0.015625
 133   -0.078125   1_000_0101        -4        1.25      0.0625    0.015625
 134    -0.09375   1_000_0110        -4         1.5      0.0625    0.015625
 135   -0.109375   1_000_0111        -4        1.75      0.0625    0.015625
 136      -0.125   1_000_1000        -3         1.0       0.125    0.015625
 137   -0.140625   1_000_1001        -3       1.125       0.125    0.015625
 138    -0.15625   1_000_1010        -3        1.25       0.125    0.015625
 139   -0.171875   1_000_1011        -3       1.375       0.125    0.015625
 140     -0.1875   1_000_1100        -3         1.5       0.125    0.015625
 141   -0.203125   1_000_1101        -3       1.625       0.125    0.015625
 142    -0.21875   1_000_1110        -3        1.75       0.125    0.015625
 143   -0.234375   1_000_1111        -3       1.875       0.125    0.015625
 144       -0.25   1_001_0000        -2         1.0       -0.25    0.015625
 145   -0.265625   1_001_0001        -2      1.0625       -0.25    0.015625
 146    -0.28125   1_001_0010        -2       1.125       -0.25    0.015625
 147   -0.296875   1_001_0011        -2      1.1875       -0.25    0.015625
 148     -0.3125   1_001_0100        -2        1.25       -0.25    0.015625
 149   -0.328125   1_001_0101        -2      1.3125       -0.25    0.015625
 150    -0.34375   1_001_0110        -2       1.375       -0.25    0.015625
 151   -0.359375   1_001_0111        -2      1.4375       -0.25    0.015625
 152      -0.375   1_001_1000        -2         1.5       -0.25    0.015625
 153   -0.390625   1_001_1001        -2      1.5625       -0.25    0.015625
 154    -0.40625   1_001_1010        -2       1.625       -0.25    0.015625
 155   -0.421875   1_001_1011        -2      1.6875       -0.25    0.015625
 156     -0.4375   1_001_1100        -2        1.75       -0.25    0.015625
 157   -0.453125   1_001_1101        -2      1.8125       -0.25    0.015625
 158    -0.46875   1_001_1110        -2       1.875       -0.25    0.015625
 159   -0.484375   1_001_1111        -2      1.9375       -0.25    0.015625
 160        -0.5   1_010_0000        -1         1.0        -0.5     0.03125
 161    -0.53125   1_010_0001        -1      1.0625        -0.5     0.03125
 162     -0.5625   1_010_0010        -1       1.125        -0.5     0.03125
 163    -0.59375   1_010_0011        -1      1.1875        -0.5     0.03125
 164      -0.625   1_010_0100        -1        1.25        -0.5     0.03125
 165    -0.65625   1_010_0101        -1      1.3125        -0.5     0.03125
 166     -0.6875   1_010_0110        -1       1.375        -0.5     0.03125
 167    -0.71875   1_010_0111        -1      1.4375        -0.5     0.03125
 168       -0.75   1_010_1000        -1         1.5        -0.5     0.03125
 169    -0.78125   1_010_1001        -1      1.5625        -0.5     0.03125
 170     -0.8125   1_010_1010        -1       1.625        -0.5     0.03125
 171    -0.84375   1_010_1011        -1      1.6875        -0.5     0.03125
 172      -0.875   1_010_1100        -1        1.75        -0.5     0.03125
 173    -0.90625   1_010_1101        -1      1.8125        -0.5     0.03125
 174     -0.9375   1_010_1110        -1       1.875        -0.5     0.03125
 175    -0.96875   1_010_1111        -1      1.9375        -0.5     0.03125
 176        -1.0   1_011_0000         0         1.0        -1.0      0.0625
 177     -1.0625   1_011_0001         0      1.0625        -1.0      0.0625
 178      -1.125   1_011_0010         0       1.125        -1.0      0.0625
 179     -1.1875   1_011_0011         0      1.1875        -1.0      0.0625
 180       -1.25   1_011_0100         0        1.25        -1.0      0.0625
 181     -1.3125   1_011_0101         0      1.3125        -1.0      0.0625
 182      -1.375   1_011_0110         0       1.375        -1.0      0.0625
 183     -1.4375   1_011_0111         0      1.4375        -1.0      0.0625
 184        -1.5   1_011_1000         0         1.5        -1.0      0.0625
 185     -1.5625   1_011_1001         0      1.5625        -1.0      0.0625
 186      -1.625   1_011_1010         0       1.625        -1.0      0.0625
 187     -1.6875   1_011_1011         0      1.6875        -1.0      0.0625
 188       -1.75   1_011_1100         0        1.75        -1.0      0.0625
 189     -1.8125   1_011_1101         0      1.8125        -1.0      0.0625
 190      -1.875   1_011_1110         0       1.875        -1.0      0.0625
 191     -1.9375   1_011_1111         0      1.9375        -1.0      0.0625
 192        -2.0   1_100_0000         1         1.0        -2.0       0.125
 193      -2.125   1_100_0001         1      1.0625        -2.0       0.125
 194       -2.25   1_100_0010         1       1.125        -2.0       0.125
 195      -2.375   1_100_0011         1      1.1875        -2.0       0.125
 196        -2.5   1_100_0100         1        1.25        -2.0       0.125
 197      -2.625   1_100_0101         1      1.3125        -2.0       0.125
 198       -2.75   1_100_0110         1       1.375        -2.0       0.125
 199      -2.875   1_100_0111         1      1.4375        -2.0       0.125
 200        -3.0   1_100_1000         1         1.5        -2.0       0.125
 201      -3.125   1_100_1001         1      1.5625        -2.0       0.125
 202       -3.25   1_100_1010         1       1.625        -2.0       0.125
 203      -3.375   1_100_1011         1      1.6875        -2.0       0.125
 204        -3.5   1_100_1100         1        1.75        -2.0       0.125
 205      -3.625   1_100_1101         1      1.8125        -2.0       0.125
 206       -3.75   1_100_1110         1       1.875        -2.0       0.125
 207      -3.875   1_100_1111         1      1.9375        -2.0       0.125
 208        -4.0   1_101_0000         2         1.0        -4.0        0.25
 209       -4.25   1_101_0001         2      1.0625        -4.0        0.25
 210        -4.5   1_101_0010         2       1.125        -4.0        0.25
 211       -4.75   1_101_0011         2      1.1875        -4.0        0.25
 212        -5.0   1_101_0100         2        1.25        -4.0        0.25
 213       -5.25   1_101_0101         2      1.3125        -4.0        0.25
 214        -5.5   1_101_0110         2       1.375        -4.0        0.25
 215       -5.75   1_101_0111         2      1.4375        -4.0        0.25
 216        -6.0   1_101_1000         2         1.5        -4.0        0.25
 217       -6.25   1_101_1001         2      1.5625        -4.0        0.25
 218        -6.5   1_101_1010         2       1.625        -4.0        0.25
 219       -6.75   1_101_1011         2      1.6875        -4.0        0.25
 220        -7.0   1_101_1100         2        1.75        -4.0        0.25
 221       -7.25   1_101_1101         2      1.8125        -4.0        0.25
 222        -7.5   1_101_1110         2       1.875        -4.0        0.25
 223       -7.75   1_101_1111         2      1.9375        -4.0        0.25
 224        -8.0   1_110_0000         3         1.0        -8.0         0.5
 225        -8.5   1_110_0001         3      1.0625        -8.0         0.5
 226        -9.0   1_110_0010         3       1.125        -8.0         0.5
 227        -9.5   1_110_0011         3      1.1875        -8.0         0.5
 228       -10.0   1_110_0100         3        1.25        -8.0         0.5
 229       -10.5   1_110_0101         3      1.3125        -8.0         0.5
 230       -11.0   1_110_0110         3       1.375        -8.0         0.5
 231       -11.5   1_110_0111         3      1.4375        -8.0         0.5
 232       -12.0   1_110_1000         3         1.5        -8.0         0.5
 233       -12.5   1_110_1001         3      1.5625        -8.0         0.5
 234       -13.0   1_110_1010         3       1.625        -8.0         0.5
 235       -13.5   1_110_1011         3      1.6875        -8.0         0.5
 236       -14.0   1_110_1100         3        1.75        -8.0         0.5
 237       -14.5   1_110_1101         3      1.8125        -8.0         0.5
 238       -15.0   1_110_1110         3       1.875        -8.0         0.5
 239       -15.5   1_110_1111         3      1.9375        -8.0         0.5
 240        -inf   1_111_0000   Int.max         inf         nan         nan
 241  -snan(0x1)   1_111_0001   Int.max  -snan(0x1)         nan         nan
 242  -snan(0x2)   1_111_0010   Int.max  -snan(0x2)         nan         nan
 243  -snan(0x3)   1_111_0011   Int.max  -snan(0x3)         nan         nan
 244       -snan   1_111_0100   Int.max       -snan         nan         nan
 245  -snan(0x1)   1_111_0101   Int.max  -snan(0x1)         nan         nan
 246  -snan(0x2)   1_111_0110   Int.max  -snan(0x2)         nan         nan
 247  -snan(0x3)   1_111_0111   Int.max  -snan(0x3)         nan         nan
 248        -nan   1_111_1000   Int.max        -nan         nan         nan
 249   -nan(0x1)   1_111_1001   Int.max   -nan(0x1)         nan         nan
 250   -nan(0x2)   1_111_1010   Int.max   -nan(0x2)         nan         nan
 251   -nan(0x3)   1_111_1011   Int.max   -nan(0x3)         nan         nan
 252        -nan   1_111_1100   Int.max        -nan         nan         nan
 253   -nan(0x1)   1_111_1101   Int.max   -nan(0x1)         nan         nan
 254   -nan(0x2)   1_111_1110   Int.max   -nan(0x2)         nan         nan
 255   -nan(0x3)   1_111_1111   Int.max   -nan(0x3)         nan         nan
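
The Float8 column above can be cross-checked without the type itself. The following is a minimal sketch, assuming the layout described earlier (1 sign bit, 3 exponent bits with bias 3, 4 significand bits); `decodeFloat8` is a hypothetical helper and not part of the implementation posted in this thread. It ignores NaN payloads and the quiet/signaling distinction.

```swift
/// Decodes a raw 8-bit pattern into a Double.
/// Hypothetical helper, assuming sign(1) | exponent(3, bias 3) | significand(4).
func decodeFloat8(_ bits: UInt8) -> Double {
    let sign: Double = (bits & 0b1000_0000) == 0 ? 1 : -1
    let exponentBits = Int((bits >> 4) & 0b111)
    let fraction = Double(bits & 0b1111) / 16

    switch exponentBits {
    case 0:
        // Zero and subnormals: no implicit leading 1, and the exponent is
        // fixed at 1 - bias = -2, i.e. a scale of 0.25.
        return sign * fraction * 0.25
    case 7:
        // All exponent bits set: infinity if the fraction is zero, else NaN.
        return fraction == 0 ? sign * .infinity : .nan
    default:
        // Normal numbers: implicit leading 1, unbiased exponent = bits - 3.
        let scale = Double(sign: .plus, exponent: exponentBits - 3, significand: 1)
        return sign * (1 + fraction) * scale
    }
}

print(decodeFloat8(0b0_000_0001)) // prints "0.015625" (row 1)
print(decodeFloat8(0b0_110_1111)) // prints "15.5" (row 111)
```

Looping over `UInt8.min...UInt8.max` with this function should reproduce the finite values and infinities in the table.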

Jens · February 3, 2020, 7:20pm

This is the implementation, along with the program that produced the output above:

/// An 8-bit floating point type (which probably doesn't work as expected yet).
///
/// This type was put together by an amateur, working from these sources:
/// * https://en.wikipedia.org/wiki/Single-precision_floating-point_format
/// * http://www.cs.jhu.edu/~jorgev/cs333/readings/8-Bit_Floating_Point.pdf
/// * https://raw.githubusercontent.com/apple/swift/master/stdlib/public/core/FloatingPointTypes.swift.gyb
/// and piggybacking on `Float` (`Float32`) wherever that seemed possible.
///
/// `Float8` has three exponent bits and four significand bits.
///
/// Notes used while implementing it:
///
/// ```
/// exponent bit pattern:   0  1  2  3  4  5  6  7
///             exponent: sub -2 -1  0  1  2  3  inf/nan
///                            bias 3
///
/// 0_000_0001 = 0x01 = 2**(-2) * (0 +  1/16) =  0.015625 (least nonzero magnitude)
/// 0_000_1111 = 0x0f = 2**(-2) * (0 + 15/16) =  0.234375 (greatest subnormal magnitude)
/// 0_001_0000 = 0x10 = 2**(-2) * (1 +  0/16) =  0.25 (least normal nonzero magnitude)
/// 0_011_0000 = 0x30 = 2**( 0) * (1 +  0/16) =  1.0
/// 0_110_1111 = 0x6f = 2**( 3) * (1 + 15/16) = 15.5 (greatest finite magnitude)
/// ```
/// See: https://forums.swift.org/t/has-anyone-implemented-a-float8-quarter-type/33337/8
struct Float8 {
    private var bitPattern: UInt8

    init(bitPattern: UInt8) {
        self.bitPattern = bitPattern
    }

}
import Darwin

extension Float8 {
    var float: Float {
        // if isSignalingNaN { return Float.signalingNaN }
        // if isNaN { return Float.nan }
        // let fsign = sign == .minus ? -Float(1) : Float(1)
        // if isInfinite { return Float.infinity * fsign }
        // var zeroOrOne: Float = 1.0
        // var exp = Float(exponentBitPattern) - Float(Self._exponentBias)
        // if isSubnormal {
        //     zeroOrOne = 0.0
        //     exp += 1
        // }
        // let fraction: Float = Float(bitPattern & 0b1111) / 16.0
        // return fsign *
        //     powf(Float(2), exp) * (zeroOrOne + fraction)
        return Float(self)
    }
}

private extension BinaryInteger {

    private func _binaryLogarithm() -> Int {
        precondition(self > (0 as Self))
        var (quotient, remainder) =
            (bitWidth &- 1).quotientAndRemainder(dividingBy: UInt.bitWidth)
        remainder = remainder &+ 1
        var word = UInt(truncatingIfNeeded: self >> (bitWidth &- remainder))
        // If, internally, a variable-width binary integer uses digits of greater
        // bit width than that of Magnitude.Words.Element (i.e., UInt), then it is
        // possible that `word` could be zero. Additionally, a signed variable-width
        // binary integer may have a leading word that is zero to store a clear sign
        // bit.
        while word == 0 {
            quotient = quotient &- 1
            remainder = remainder &+ UInt.bitWidth
            word = UInt(truncatingIfNeeded: self >> (bitWidth &- remainder))
        }
        // Note that the order of operations below is important to guarantee that
        // we won't overflow.
        return UInt.bitWidth &* quotient &+
            (UInt.bitWidth &- (word.leadingZeroBitCount &+ 1))
    }
}


extension Float8 : BinaryFloatingPoint {

    typealias Exponent = Int

    typealias RawSignificand = UInt8

    typealias RawExponent = UInt

    typealias Stride = Self

    typealias Magnitude = Self

    typealias FloatLiteralType = Float

    typealias IntegerLiteralType = Int64

    static var exponentBitCount: Int { 3 }

    static var significandBitCount: Int { 4 }

    static var _exponentBias: UInt { 3 }

    static var nan: Float8 { Float8(bitPattern: 0b0_111_1000) }

    static var signalingNaN: Float8 { Float8(bitPattern: 0b0_111_0100) }

    static var infinity: Float8 { Float8(bitPattern: 0b0_111_0000) }

    /// 0.25
    static var leastNormalMagnitude: Float8 {
        Float8(bitPattern: 0b0_001_0000)
    }

    /// 0.015625
    static var leastNonzeroMagnitude: Float8 {
        Float8(bitPattern: 0b0_000_0001)
    }

    /// 15.5
    static var greatestFiniteMagnitude: Float8 {
        Float8(bitPattern: 0b0_110_1111)
    }

    private static var _infinityExponent: UInt = 0b111
    private static var _significandMask: UInt8 = 0b1111

    /// The mathematical constant pi approximated by the closest representable
    /// `Float8` value less than pi, which is `3.125`.
    static var pi: Float8 {
        return Float8(bitPattern: 0b0_100_1001)
    }

    var exponentBitPattern: UInt { UInt((bitPattern &>> 4) & 0b111) }

    var significandBitPattern: UInt8 { bitPattern & 0b1111 }

    var sign: FloatingPointSign { bitPattern & 128 == 128 ? .minus : .plus }

    var exponent: Int {
        if !isFinite { return .max }
        if isZero { return .min }
        let provisional = Int(exponentBitPattern) - Int(Self._exponentBias)
        if isNormal { return provisional }
        let shift = Self.significandBitCount -
            significandBitPattern._binaryLogarithm()
        return provisional + 1 - shift
    }

    var significand: Float8 {
        if isNaN { return self }
        if isNormal {
            return Float8(sign: .plus,
                        exponentBitPattern: Self._exponentBias,
                        significandBitPattern: significandBitPattern)
        }
        if isSubnormal {
            let shift = Self.significandBitCount -
                    significandBitPattern._binaryLogarithm()
            return Float8(
                sign: .plus,
                exponentBitPattern: Self._exponentBias,
                significandBitPattern: significandBitPattern &<< shift
            )
        }
        // zero or infinity.
        return Float8(
            sign: .plus,
            exponentBitPattern: exponentBitPattern,
            significandBitPattern: 0
        )
    }

    var ulp: Float8 {
        guard isFinite else { return .nan }
        if isNormal {
            let bitPattern_ = bitPattern & Self.infinity.bitPattern
            return Float8(bitPattern: bitPattern_) * 0x1p-4
        }
        return .leastNormalMagnitude * 0x1p-4
    }

    var binade: Float8 {
        guard isFinite else { return .nan }
        if isSubnormal {
            // The approach from FloatingPointTypes.swift.gyb (adapted to this
            // type) does not work here and only produces inf, presumably
            // because 0x1p4 (16.0) exceeds greatestFiniteMagnitude (15.5):
            // let bitPattern_ = (self * 0x1p4).bitPattern
            //     & (-Self.infinity).bitPattern
            // return Float8(bitPattern: bitPattern_) * 0x1p-4
            // So the binade is built directly from the position of the
            // leading significand bit instead.
            // NOTE: the sign bit is dropped here, so negative subnormals
            // report a positive binade (rows 129 through 143 of the table).
            let shifts = (bitPattern & 0b0_000_1111).leadingZeroBitCount
            return Float8(bitPattern: UInt8(1) &<< (7 &- shifts))
        }
        return Float8(bitPattern: bitPattern & (-Self.infinity).bitPattern)
    }

    var significandWidth: Int {
        let trailingZeroBits = significandBitPattern.trailingZeroBitCount
        if isNormal {
            guard significandBitPattern != 0 else { return 0 }
            return Self.significandBitCount &- trailingZeroBits
        }
        if isSubnormal {
            let leadingZeroBits = significandBitPattern.leadingZeroBitCount
            return Self.RawSignificand.bitWidth &-
                (trailingZeroBits &+ leadingZeroBits &+ 1)
        }
        return -1
    }

    var nextUp: Float8 {
        // Silence signaling NaNs, map -0 to +0.
        let x = self + 0
        if _fastPath(x < .infinity) {
            let increment = Int8(bitPattern: x.bitPattern) &>> 7 | 1
            let bitPattern_ = x.bitPattern &+ UInt8(bitPattern: increment)
            return Float8(bitPattern: bitPattern_)
        }
        return x
    }


    init(sign: FloatingPointSign,
         exponentBitPattern: UInt,
         significandBitPattern: UInt8)
    {
        self.bitPattern = (sign == .minus ? 0b1_000_0000 : 0b0_000_0000)
            | (UInt8(truncatingIfNeeded: (exponentBitPattern & 0b111)) << 4)
            | (significandBitPattern & 0b1111)
    }

    init(sign: FloatingPointSign, exponent: Int, significand: Float8) {
        var result = significand
        if sign == .minus { result = -result }
        if significand.isFinite && !significand.isZero {
            var clamped = exponent
            let leastNormalExponent = 1 - Int(Self._exponentBias)
            let greatestFiniteExponent = Int(Self._exponentBias)
            if clamped < leastNormalExponent {
                clamped = max(clamped, 3*leastNormalExponent)
                while clamped < leastNormalExponent {
                    result  *= Self.leastNormalMagnitude
                    clamped -= leastNormalExponent
                }
            }
            else if clamped > greatestFiniteExponent {
                clamped = min(clamped, 3*greatestFiniteExponent)
                let step = Float8(sign: .plus,
                                exponentBitPattern: 6,
                                significandBitPattern: 0)
                while clamped > greatestFiniteExponent {
                    result  *= step
                    clamped -= greatestFiniteExponent
                }
            }
            let scale = Float8(
                sign: .plus,
                exponentBitPattern: UInt(Int(Self._exponentBias) + clamped),
                significandBitPattern: 0
            )
            result = result * scale
        }
        self = result
    }


    mutating func round(_ rule: FloatingPointRoundingRule) {
        var f = self.float
        f.round(rule)
        self = Float8(f)
    }

    static func - (lhs: Float8, rhs: Float8) -> Float8 {
        // NOTE: Promoting to Float here caused infinite recursion for
        // e.g. `let a = Float8(-Float(15.9))`.
        // Implementing the unary minus operator below, instead of relying on
        // the default implementation, solved it.
        return Float8(lhs.float - rhs.float)
    }
    static prefix func - (operand: Float8) -> Float8 {
        // Negate by flipping the sign bit.
        return Float8(bitPattern: operand.bitPattern ^ 0b1_000_0000)
    }

    static func * (lhs: Float8, rhs: Float8) -> Float8 {
        return Float8(lhs.float * rhs.float)
    }

    static func *= (lhs: inout Float8, rhs: Float8) {
        var f = lhs.float
        f *= rhs.float
        lhs = Float8(f)
    }

    static func / (lhs: Float8, rhs: Float8) -> Float8 {
        return Float8(lhs.float / rhs.float)
    }

    static func /= (lhs: inout Float8, rhs: Float8) {
        var f = lhs.float
        f /= rhs.float
        lhs = Float8(f)
    }

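    static func += (lhs: inout Float8, rhs: Float8) {
        var f = lhs.float
        f += rhs.float
        // Store the computed sum (the original assigned `Float8(lhs)`,
        // which discarded the result of the addition).
        lhs = Float8(f)
    }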

    static func + (lhs: Float8, rhs: Float8) -> Float8 {
        return Float8(lhs.float + rhs.float)
    }

    static func -= (lhs: inout Float8, rhs: Float8) {
        var f = lhs.float
        f -= rhs.float
        lhs = Float8(f)
    }


    mutating func formRemainder(dividingBy other: Float8) {
        var f = self.float
        f.formRemainder(dividingBy: other.float)
        self = Float8(f)
    }

    mutating func formTruncatingRemainder(dividingBy other: Float8) {
        var f = self.float
        f.formTruncatingRemainder(dividingBy: other.float)
        self = Float8(f)
    }

    mutating func formSquareRoot() {
        var f = self.float
        f.formSquareRoot()
        self = Float8(f)
    }

    mutating func addProduct(_ lhs: Float8, _ rhs: Float8) {
        var f = self.float
        f.addProduct(lhs.float, rhs.float)
        self = Float8(f)
    }


    func isEqual(to other: Float8) -> Bool {
        return self.float.isEqual(to: other.float)
    }

    func isLess(than other: Float8) -> Bool {
        return self.float.isLess(than: other.float)
    }

    func isLessThanOrEqualTo(_ other: Float8) -> Bool {
        return self.float.isLessThanOrEqualTo(other.float)
    }

    var isNormal: Bool {
        return exponentBitPattern > 0 && isFinite
    }

    var isFinite: Bool {
        return exponentBitPattern < 7
    }

    var isZero: Bool {
        return self.bitPattern & 0b0_111_1111 == 0
    }

    var isSubnormal: Bool {
        return exponentBitPattern == 0 && significandBitPattern != 0
    }

    var isInfinite: Bool {
        return !isFinite && significandBitPattern == 0
    }

    var isNaN: Bool {
        return !isFinite && significandBitPattern != 0
    }

    private static var _quietNaNMask: UInt8 {
        return 1 &<< UInt8(significandBitCount - 1)
    }
    var isSignalingNaN: Bool {
        return isNaN && (significandBitPattern & Self._quietNaNMask) == 0
    }

    var isCanonical: Bool { return true }

    func distance(to other: Float8) -> Float8 {
        return Float8(other.float - self.float)
    }

    func advanced(by n: Float8) -> Float8 {
        return Float8(self.float + n.float)
    }

    var magnitude: Float8 {
        return Float8(self.float.magnitude)
    }

    init(integerLiteral value: Int64) {
        // Sorry:
        let signBit: UInt8 = value < 0 ? 0b1_000_0000 : 0b0_000_0000
        switch value.magnitude {
        case  0: self.init(bitPattern: 0b0_000_0000 | signBit)
        case  1: self.init(bitPattern: 0b0_011_0000 | signBit)
        case  2: self.init(bitPattern: 0b0_100_0000 | signBit)
        case  3: self.init(bitPattern: 0b0_100_1000 | signBit)
        case  4: self.init(bitPattern: 0b0_101_0000 | signBit)
        case  5: self.init(bitPattern: 0b0_101_0100 | signBit)
        case  6: self.init(bitPattern: 0b0_101_1000 | signBit)
        case  7: self.init(bitPattern: 0b0_101_1100 | signBit)
        case  8: self.init(bitPattern: 0b0_110_0000 | signBit)
        case  9: self.init(bitPattern: 0b0_110_0010 | signBit)
        case 10: self.init(bitPattern: 0b0_110_0100 | signBit)
        case 11: self.init(bitPattern: 0b0_110_0110 | signBit)
        case 12: self.init(bitPattern: 0b0_110_1000 | signBit)
        case 13: self.init(bitPattern: 0b0_110_1010 | signBit)
        case 14: self.init(bitPattern: 0b0_110_1100 | signBit)
        case 15: self.init(bitPattern: 0b0_110_1110 | signBit)
        default: fatalError()
        }
    }

    init(floatLiteral value: Float) {
        // There was an infinite recursion here for e.g. `Float8(-Float(0))`,
        // but not for `Float8(-Float(1))` or `Float8(Float(0))`.
        // This check takes care of that particular case, but are there more?
        // NOTE: `value == -Float(0)` is also true for +0.0, so the sign is
        // tested explicitly to keep a positive zero literal positive.
        if value.isZero && value.sign == .minus {
            self.init(bitPattern: 0b1_000_0000)
        } else {
            self.init(value)
        }
    }


}

extension Float8 : CustomStringConvertible, LosslessStringConvertible {
    var description: String { return "\(Float(self))" }
    init?(_ description: String) {
        guard let f32 = Float(description) else { return nil }
        let f8 = Float8(f32)
        if f8.description != description { return nil }
        self = f8
    }
}


//-----------------------------------------------------------------------------
// MARK: - Demo
//-----------------------------------------------------------------------------

extension String {
    func leftPadded(to minCount: Int, with char: Character=" ") -> String {
        return String(repeating: char, count: max(0, minCount-count)) + self
    }
}
extension BinaryFloatingPoint {
    var segmentedBinaryString: String {
        let e = String(exponentBitPattern, radix: 2)
        let s = String(significandBitPattern, radix: 2)
        return [self.sign == .plus ? "0" : "1", "_",
                e.leftPadded(to: Self.exponentBitCount, with: "0"), "_",
                s.leftPadded(to: Self.significandBitCount, with: "0")].joined()
    }
}
extension LosslessStringConvertible {
    func leftPadded(to minCount: Int, with char: Character=" ") -> String {
        return description.leftPadded(to: minCount, with: char)
    }
}



extension Float8 {
    static func debugPrintAllValues() {
        var finCount = 0
        var infCount = 0
        var nanCount = 0
        print("   N      Float8   bitPattern  exponent  significand     binade         ulp")
        print("---------------------------------------------------------------------------")
        for byteValue: UInt8 in .min ... .max {
            let v = Float8(bitPattern: byteValue)
            let expStr: String
            switch v.exponent {
            case .min: expStr = "Int.min"
            case .max: expStr = "Int.max"
            default: expStr = v.exponent.description
            }
            print(
                byteValue.leftPadded(to: 4),
                v.leftPadded(to: 11),
                v.segmentedBinaryString.leftPadded(to: 12),
                expStr.leftPadded(to: 9),
                v.significand.leftPadded(to: 11),
                v.binade.leftPadded(to: 11),
                v.ulp.leftPadded(to: 11)
            )
            if v.isFinite { finCount += 1 }
            if v.isNaN { nanCount += 1 }
            if v.isInfinite { infCount += 1 }
        }
        print("Number of finite values:", finCount)
        print("Number of infinite values:", infCount)
        print("Number of NaNs:", nanCount)
        precondition(finCount + infCount + nanCount == 256)
    }
}
Float8.debugPrintAllValues()

Before possibly cleaning it up and using/trusting it, I'd like to take the opportunity to ask if anyone more skilled than me would like to take a quick look and maybe spot some obvious mistakes.

Edit: Corrected the code.

scanon · February 3, 2020, 7:38pm

The main issue with Float8 is that it just doesn’t make a lot of sense; for almost all uses a fixed-point format works better when you go below 16 bits—the main advantage of floating-point is the wide dynamic range, and that goes out the window with these very narrow types (I might carve out an exception for very specific uses of types that are mostly exponent, but they’re weird little niches).

Jens · February 3, 2020, 7:41pm

That’s why I wrote:

scanon · February 3, 2020, 8:36pm

Without doing a detailed review, I'll observe that your basic implementation strategy for arithmetic, of promoting to Float, then doing the work, then rounding to Float8, is sound and will produce correctly-rounded results for any operation that isn't dependent on the specific details of the format (.ulp, .nextUp, .nextDown, etc). So that looks good.

I would personally probably use the same approach more aggressively for some other operations, like init(sign: FloatingPointSign, exponent: Int, significand: Float8), but what you have looks like it's probably correct.
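
The promote, compute, round-back strategy can be tried out with standard types, since Float8 itself is not needed to see the effect. The sketch below assumes Float plays the role of the narrow type and Double the wide one; `addViaDouble`, `mulViaDouble` and `countMismatches` are hypothetical helper names. It compares the bit patterns of the promoted results against native Float arithmetic for a handful of sample values.

```swift
// Hypothetical helpers: do Float arithmetic by promoting to Double,
// computing there, and rounding back to Float.
func addViaDouble(_ a: Float, _ b: Float) -> Float {
    return Float(Double(a) + Double(b))
}

func mulViaDouble(_ a: Float, _ b: Float) -> Float {
    return Float(Double(a) * Double(b))
}

// Count the sample pairs where the promoted result differs from
// native Float arithmetic (compared bit for bit).
func countMismatches(in samples: [Float]) -> Int {
    var mismatches = 0
    for a in samples {
        for b in samples {
            if addViaDouble(a, b).bitPattern != (a + b).bitPattern { mismatches += 1 }
            if mulViaDouble(a, b).bitPattern != (a * b).bitPattern { mismatches += 1 }
        }
    }
    return mismatches
}

let samples: [Float] = [0.1, 0.2, 1.0, 3.0, -7.5, 1e-3, 16_777_216, 1e20, 1e-30]
let mismatches = countMismatches(in: samples)
print("mismatches:", mismatches)
```

The same reasoning should carry over to Float8 via Float, because Float has far more than twice the significand bits of an 8-bit format.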

Jens · February 3, 2020, 8:44pm

Thanks!

Btw, I found a stupid mistake:

    static prefix func -(lhs: Float8) -> Float8 {
        var lhs = Float8(lhs.magnitude)
        lhs.bitPattern ^= 0b1_000_0000
        return lhs
    }

should be:

    static prefix func -(lhs: Float8) -> Float8 {
        return Float8(bitPattern: lhs.bitPattern ^ 0b1_000_0000)
    }

(unless that proves to be wrong too ... I had to implement my own because my reuse of Float caused an infinite recursion with infix (-) otherwise)
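
The difference between the two versions can be seen with a plain Float, assuming the same layout with the sign in the top bit. `negateViaMagnitude` and `negateViaBitFlip` below are hypothetical stand-ins for the two operator bodies:

```swift
// Mirrors the first version: take the magnitude, then flip the sign bit.
// The result is always negative (or -0), even for a negative input.
func negateViaMagnitude(_ x: Float) -> Float {
    return Float(bitPattern: x.magnitude.bitPattern ^ 0x8000_0000)
}

// Mirrors the corrected version: flip only the sign bit.
func negateViaBitFlip(_ x: Float) -> Float {
    return Float(bitPattern: x.bitPattern ^ 0x8000_0000)
}

print(negateViaMagnitude(-2.5))  // -2.5, the input's sign was lost
print(negateViaBitFlip(-2.5))    // 2.5
print(negateViaBitFlip(0).sign)  // minus, negating +0 gives -0
```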

Will correct the code in the previous post.

Jens · February 3, 2020, 10:32pm

I've noticed some more cases like that, where promoting to Float (or rather, converting back from Float) can cause an infinite recursion for just some particular values, for example:

    init(floatLiteral value: Float) {
        // NOTE: Infinite recursion here caused by:
        // `Float8(-Float(0))`
        // but not for eg:
        // `Float8(-Float(1))` or `Float8(Float(0))` ...
        self.init(value) // <-- So I guess this init is calling back to this (floatLiteral) init when value is negative zero.
    }

So I turned it into this:

    init(floatLiteral value: Float) {
        // There was an infinite recursion here for e.g. `Float8(-Float(0))`,
        // but not for `Float8(-Float(1))` or `Float8(Float(0))`.
        // This check takes care of that particular case, but are there more?
        // NOTE: `value == -Float(0)` is also true for +0.0, so the sign is
        // tested explicitly to keep a positive zero literal positive.
        if value.isZero && value.sign == .minus {
            self.init(bitPattern: 0b1_000_0000)
        } else {
            self.init(value) // <-- Will this call back to this for some other value?
        }
    }
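
One detail worth keeping in mind for a zero check like this one: IEEE 754 comparison treats +0 and -0 as equal, so telling them apart takes an explicit look at the sign. A small sketch with plain Float (`isNegativeZero` is a hypothetical helper name):

```swift
// Hypothetical helper: true only for negative zero.
func isNegativeZero(_ value: Float) -> Bool {
    return value.isZero && value.sign == .minus
}

let positiveZero = Float(0)
let negativeZero = -Float(0)

print(positiveZero == negativeZero)            // true: == cannot tell the zeros apart
print(isNegativeZero(positiveZero))            // false
print(isNegativeZero(negativeZero))            // true
print(negativeZero.bitPattern == 0x8000_0000)  // true: only the sign bit is set
```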

If I'm not mistaken, this means that whenever I promote to Float in some func/init/property A and then call self.init(f) (or Float8(f)), I am just lucky if that self.init(f) (which I have no control over) doesn't, and never will, call back to A for some f that I haven't taken care of as above ...

In this particular case, self.init(f) (with f being -Float(0)) ends up calling _convert which will then call back to my .init(floatLiteral:), and we have infinite recursion.

  @inlinable
  public // @testable
  static func _convert<Source: BinaryFloatingPoint>(
    from source: Source
  ) -> (value: Self, exact: Bool) {
    guard _fastPath(!source.isZero) else {
      return (source.sign == .minus ? -0.0 : 0, true) // <-- Here! That `-0.0` calls back to my floatLiteral init.
    }
    ...

It seems kind of hard/impossible to protect against this kind of infinite recursion when promoting to Float ...

xwu · February 4, 2020, 12:03am

I wrote about this here:

https://numerics.diploid.ca/numeric-protocols.html#default-implementations-and-unintentional-infinite-recursion

mattrips · February 4, 2020, 2:15am

@xwu, what a great resource you’ve created at Notes on Numerics in Swift . Thank you!

Dany_St-Amant · February 4, 2020, 3:07am

In this float representation, every positive value satisfies Self < (1 << Self.significandBitCount). This breaks at least the random() implementation, which needs to be able to represent twice as much.

Too bad, a float which is visualizable by the human brain (like this one) could have had such educational power in understanding the internals of floats.

Jens · February 4, 2020, 6:03am

Perhaps it would be better with four exponent and three significand bits?

Jens · February 4, 2020, 9:40am

Agreed, so I did a quick test where I modified Float8 to have

4 bits exponent
3 bits significand

(instead of the other way around)

This choice feels like a better one.

It can represent 240 finite values, from -240 to 240
Float8(240).ulp == Float8(128).ulp == 16.0
Has ulp <= 1 between -16 and 16
Least nonzero magnitude is 0.001953125

So it works with Swift's Random API implementation.
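
The figures for both layouts can be recomputed from the usual IEEE 754 style relations, assuming bias = 2^(exponentBitCount - 1) - 1 and an all-ones exponent field reserved for infinities and NaNs. `MiniFormat` and `pow2` are hypothetical names used only for this sketch:

```swift
// 2^n as a Double, built directly from sign/exponent/significand.
func pow2(_ n: Int) -> Double {
    return Double(sign: .plus, exponent: n, significand: 1)
}

// Hypothetical description of a small IEEE 754 style binary format.
struct MiniFormat {
    let exponentBitCount: Int
    let significandBitCount: Int

    var bias: Int { return (1 << (exponentBitCount - 1)) - 1 }
    var emax: Int { return bias }
    var emin: Int { return 1 - bias }

    // Largest finite value: all significand bits set in the top binade.
    var greatestFiniteMagnitude: Double {
        return pow2(emax) * (2 - pow2(-significandBitCount))
    }
    // Smallest subnormal.
    var leastNonzeroMagnitude: Double {
        return pow2(emin - significandBitCount)
    }
    // All bit patterns whose exponent field is not all ones, both signs.
    var finiteValueCount: Int {
        return 2 * ((1 << exponentBitCount) - 1) * (1 << significandBitCount)
    }
    // The observation from earlier in the thread: does the range exceed
    // 1 << significandBitCount?
    var exceedsSignificandRange: Bool {
        return greatestFiniteMagnitude > pow2(significandBitCount)
    }
}

let e3s4 = MiniFormat(exponentBitCount: 3, significandBitCount: 4)
let e4s3 = MiniFormat(exponentBitCount: 4, significandBitCount: 3)

print(e3s4.greatestFiniteMagnitude, e3s4.exceedsSignificandRange)  // 15.5 false
print(e4s3.greatestFiniteMagnitude, e4s3.exceedsSignificandRange)  // 240.0 true
print(e4s3.finiteValueCount, e4s3.leastNonzeroMagnitude)           // 240 0.001953125
```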

I might post the code later.

Jens · February 4, 2020, 11:39am

By the way, on the subtopic of trying to avoid accidental infinite recursion:
Is there any tool or something (like Xcode's Static Analyzer) we can use to automatically report any potential loops in a call graph?

CTMacUser · February 7, 2020, 10:18pm

Could this be an over-specification on random()’s part?

Dany_St-Amant · February 8, 2020, 1:11am

The following seems to be true for all the standard Float types:

FloatX(1.0).exponentBitPattern == (1 << (FloatX.exponentBitCount - 1)) - 1
FloatX.greatestFiniteMagnitude > (1 << FloatX.significandBitCount)
FloatX.significandBitCount > FloatX.exponentBitCount

So there could be more hidden requirements in the FloatingPoint protocol implementation.

Making the protocol implementation support every possible combination of exponentBitCount, significandBitCount and _exponentBias would likely be unwise, as only the ones supported by the hardware are really useful in real life (IMHO). That is why I did not blame the random() implementation.
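
A minimal generic sketch for checking the three properties listed above against any BinaryFloatingPoint type; `checkProperties` is a hypothetical helper, not part of the standard library:

```swift
// Hypothetical helper: evaluates the three properties for a given type.
func checkProperties<T: BinaryFloatingPoint>(_ type: T.Type)
    -> (biasedOne: Bool, rangeExceedsSignificand: Bool, moreSignificandBits: Bool)
{
    // 1.0 should be encoded with the bias 2^(exponentBitCount - 1) - 1.
    let expectedBias = (1 << (T.exponentBitCount - 1)) - 1
    let biasedOne = Int(T(1).exponentBitPattern) == expectedBias

    // 2^significandBitCount, built from sign/exponent/significand.
    let twoToSignificandBits = T(sign: .plus,
                                 exponent: T.Exponent(T.significandBitCount),
                                 significand: 1)
    let rangeExceedsSignificand = T.greatestFiniteMagnitude > twoToSignificandBits

    let moreSignificandBits = T.significandBitCount > T.exponentBitCount
    return (biasedOne, rangeExceedsSignificand, moreSignificandBits)
}

print(checkProperties(Float.self))
print(checkProperties(Double.self))
```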

scanon · February 8, 2020, 1:28am

The last one of those shouldn’t be required. The other two are generally desirable properties for a floating-point number system, however (and note that [Binary]FloatingPoint specifically binds IEEE 754 formats, which always have those properties).

Nevin · February 14, 2020, 11:02pm

I have some more questions in this vein:

Would it be valid to make a BinaryFloatingPoint type where both RawExponent and RawSignificand conform to FixedWidthInteger, and…

a) RawSignificand.bitWidth == 0 ?
b) significandBitCount == 0 ?
c) RawSignificand.bitWidth == significandBitCount ?
d) RawExponent.bitWidth == 0 ?
e) exponentBitCount == 0 ?
f) RawExponent.bitWidth == exponentBitCount ?

(I’m writing some generic code that would need special cases to handle these if they’re valid.)

scanon · February 15, 2020, 12:20am

a. No, because an IEEE 754 binary format needs to be able to differentiate qNaN and sNaN, and neither the sign nor exponent bits may be used for that purpose.
b. See previous
c. Yes, this is allowed (but no IEEE 754 basic format has this situation).
d. No, IEEE 754 imposes the following constraints on the exponent field (where w is the width in bits, emin is the minimum normal exponent, and emax is the maximum finite exponent):

emin = 1 - emax
emin ≤ emax
emax = 2**(w-1) - 1

If w is 0 or 1, these constraints are violated. The smallest allowed RawExponent.bitWidth is 2.
e. See previous
f. Definitely permitted (but no IEEE 754 basic format is in this situation).
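
The exponent-field constraints in (d) can be tabulated for small widths. A sketch, where `exponentRange` and `isValidExponentWidth` are hypothetical helpers and w = 0 is rejected up front because 2**(w-1) is not an integer there:

```swift
// Hypothetical helper: derive (emin, emax) from the exponent field width w
// using emax = 2**(w-1) - 1 and emin = 1 - emax.
func exponentRange(width w: Int) -> (emin: Int, emax: Int)? {
    guard w >= 1 else { return nil }  // 2**(w-1) is fractional for w = 0
    let emax = (1 << (w - 1)) - 1
    return (emin: 1 - emax, emax: emax)
}

// A width is usable only if the derived range satisfies emin <= emax.
func isValidExponentWidth(_ w: Int) -> Bool {
    guard let range = exponentRange(width: w) else { return false }
    return range.emin <= range.emax
}

for w in 0...4 {
    print("w =", w, "valid:", isValidExponentWidth(w))
}
```

For w = 1 the relations give emax = 0 and emin = 1, which violates emin ≤ emax; w = 2 gives emin = 0 and emax = 1, the first width that passes.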

Nevin · February 15, 2020, 12:55am

Great, thanks!

Also, I just noticed that Float has UInt as its RawExponent type, but UInt32 as its RawSignificand.

What’s the rationale for making the significand fixed-size and the exponent platform-word-sized?