# How to find rounding error in floating-point integer literal initializer?

I have a type like this:

``````struct ExtraPrecision<T: FloatingPoint> {
var large: T
var small: T
}
``````

Which represents the sum of `large` plus `small`, with `small` being less than `large.ulp`.

I want to conform it to `ExpressibleByIntegerLiteral`, and while `large` is easy to handle, I’m not sure how to calculate `small`:

``````extension ExtraPrecision: ExpressibleByIntegerLiteral {
init(integerLiteral value: T.IntegerLiteralType) {
large = T(integerLiteral: value)
small = ???
}
}
``````

Specifically, `small` should be the nearest representable `T` to the exact result of `value` minus `large`.

Is there any way to achieve this?

1 Like

A working solution:

``````struct ExtraPrecision<T: BinaryFloatingPoint> {
var large: T
var small: T
}

extension ExtraPrecision: ExpressibleByIntegerLiteral where T.IntegerLiteralType: FixedWidthInteger {
init(integerLiteral value: T.IntegerLiteralType) {
typealias I = T.IntegerLiteralType

large = T(value)
while I(exactly: large) == nil {
if value < 0 {
large = large.nextUp
} else {
large = large.nextDown
}
}
small = T(value - I(exactly: large)!)
}
}
``````

It's not exactly pretty, and it loses some precision near Int.max and Int.min

In the ideal solution when trying to represent `Int.max`, `large` would be `Int.max+1`, and small would be `-1`. In my solution `large` becomes `(Int.max+1).nextDown`:(

Hmm, that certainly works for the constraints you’ve included, but I’m not sure I want to require either `T: BinaryFloatingPoint` or `T.IntegerLiteralType: FixedWidthInteger`.

The `ExtraPrecision` mechanism ought to work for any `FloatingPoint` type regardless of radix, and it ought to work with an arbitrary-width literal type as well. For example, a floating-point type might have an arbitrary-width exponent and a fixed-width significand.

What constraint do you have in mind? `ExpressibleByIntegerLiteral.IntegerLiteralType` doesn't seem to have any extra requirement other than `_ExpressibleByBuiltinIntegerLiteral` which isn't terribly useful. I can't even (lossy) roundtrip with that.

1 Like

I used `BinaryFloatingPoint` because I have no idea how non-binary floating point numbers behave. Does it work with non-IEEE floats? No idea. You can remove the constraint, and check!

As for the `FixedWidthInteger`, as far as I know all types that are `_ExpressibleByBuiltinIntegerLiteral` are also `FixedWidthInteger`, so there's nothing to lose.

The initializer requirement for `_ExpressibleByBuiltinIntegerLiteral` takes a `Builtin.IntLiteral` parameter.

I’m having trouble locating the implementation for `Builtin.IntLiteral` (does anyone know where to find it?) but while searching I came across `BuiltinIntegerLiteralType` which is documented as “The builtin arbitrary-precision integer type. Useful for constructing integer literals.”

@John_McCall is the authority on Builtin.IntLiteral. It was introduced here if you want to dig into the history. I should note that since it's in the `Builtin` module, it's not really usable in code outside of the standard library.

1 Like

My hope was that we would eventually "productize" the concept into a library type. Note that it cannot support normal integer operations, though; it is not designed to be a runtime arbitrary-precision type.

It's an oversight that the ABI isn't documented. It is quite simple:

``````// Terminology:
//   chunkWidth := sizeof(size_t) * CHAR_BITS.
//   bit i of an integer is the bit set by (1 << i)
struct IntLiteral {
// Chunks of the integer value.
// Chunks individually have native endianness, but the chunks
// are in little-endian order; i.e. data[i] represents bits
//   (i*chunkWidth) ..< ((i+1)*chunkWidth)
// Negative numbers are stored in two's complement, and
// the last chunk is sign-extended.
// The length of this array is:
//   (bitWidth + chunkWidth - 1) / chunkWidth
size_t *data;
// Bit 0 is whether the value is negative.
// Bits 1..7 are reserved and currently always set to 0.
// Bits 8..<chunkWidth are the bit-width of the integer,
// i.e. the minimum number of bits necessary to represent
// the integer including a sign bit.
size_t flags;
};
``````

So suppose we wanted to represent the number -182716:

``````  182716 == 0x2c9bc == 010 1100 1001 1011 1100 (note leading zero)
-182716            == 101 0011 0110 0100 0100 (note leading one)
``````

In both cases, the bit width is 19. If we were using a 4-bit chunk (for exposition purposes), then the chunks array for -182716 would look like the following; note the sign-extension of the final chunk:

``````  0100, 0100, 0110, 0011, 1101
``````
4 Likes