Floating-point shift operators

I’ve been doing some floating-point programming and found that I often need to scale numbers by a power of 2. In other words, to change the exponent.

If I were using integers this would be easy:

let y = x << n

But that doesn’t work for floats.

Instead we have to write something much more verbose:

let y = Double(sign: .plus, exponent: n, significand: x)

Not only is this quite unwieldy, but it is also not obviously correct:

The spelling of that initializer might suggest that the exponent of the result would be n, but the documentation tells us it will in fact be n + x.exponent, which is exactly what we want.

(Edit: similarly, it “looks like” the sign of the result would equal the sign argument value, but in fact that argument acts as a “multiplier”.)

So it works, but it’s neither concise nor completely clear to readers.

• • •

Conceptually, multiplying and dividing by powers of 2 is the same operation on integers as on floats.

Thus, to improve the clarity and simplicity of my code, I decided to make floating-point versions of the same shifting operators that integers use.

Their effect is to multiply (<<) or divide (>>) the left-hand side by 2 to the power of the right-hand side.

That let me write things like this:

let y = x << n
let z = 1.0 << 256
var w = (a >> (2*i)) + (b << (k+1))
w >>= max(g[j], h[j])

• • •

The implementations for these operators are simple wrappers around the aforementioned initializer.

I put them on BinaryFloatingPoint (rather than FloatingPoint) so that they consistently refer to powers of 2 (instead of powers of some other radix).

And I made them generic over the right-hand side, like the existing integer versions, so they aren’t restricted to just T.Exponent.
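Concretely, such wrappers might look something like this (a minimal sketch; the author’s actual implementation and generic constraints may differ):

```swift
extension BinaryFloatingPoint {
    // Multiply by 2^rhs: wraps the init(sign:exponent:significand:)
    // initializer, which binds the IEEE scaleB operation.
    static func << <I: BinaryInteger>(lhs: Self, rhs: I) -> Self {
        Self(sign: .plus, exponent: Exponent(rhs), significand: lhs)
    }

    // Divide by 2^rhs, i.e. multiply by 2^(-rhs).
    static func >> <I: BinaryInteger>(lhs: Self, rhs: I) -> Self {
        Self(sign: .plus, exponent: -Exponent(rhs), significand: lhs)
    }

    static func <<= <I: BinaryInteger>(lhs: inout Self, rhs: I) { lhs = lhs << rhs }
    static func >>= <I: BinaryInteger>(lhs: inout Self, rhs: I) { lhs = lhs >> rhs }
}
```

Because the significand argument carries the sign and exponent of lhs, passing `.plus` and the shift amount alone is enough: the initializer adds the given exponent to `lhs.exponent`.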

For myself, these operators greatly improved the readability of my floating-point code, and I expect I’ll include them in many future projects.

So I wanted to gauge what other people think, and whether we should consider adding them to Swift.


What's the difference between FloatingPoint and BinaryFloatingPoint? It appears BinaryFloatingPoint is the only descendant of FloatingPoint, so it seems every concrete floating-point type conforms to both. I only know of Float and Double.

By defining your operators on BinaryFloatingPoint, what are you guarding against?

Why are you using bitshifts to do multiplication and division by 2, when you could just use multiplication or division by 2?

Powers of 2

We are multiplying by arbitrary powers of 2, which are not necessarily known at compile-time.

The radix (or “base”) of the value.

BinaryFloatingPoint specifies the radix is 2, meaning values are represented as ±significand × 2^exponent.

FloatingPoint allows any radix. For example, a decimal type which represents values as ±significand × 10^exponent could conform to FloatingPoint, but not to BinaryFloatingPoint.

Shifting operations correspond to “moving the point” which is equivalent to changing the exponent.

I want people to be able to easily reason about the shift operators in terms of powers of 2, rather than powers of an arbitrary radix.


There will probably someday be a DecimalFloatingPoint.

Consulting the BinaryFloatingPoint documentation, we find this list:


I didn’t think this made sense because << shifts the bitwise representation of an integer, but your phrasing it as shifting the radix point convinced me that it’s a reasonable spelling for the operation! Still, is it reasonable enough to be better than a method?


Even still, couldn't you (for example) do d / pow(2, x) for whatever x you want?

Unlike << for integers, shift operators on doubles couldn't just do a bitwise shift under the hood. (Naively bit-shifting a float's representation would produce garbled nonsense.)

You could, but that has a number of drawbacks.

First, pow is “heavy machinery”, conceptually on par with sin and exp. Even if the implementation is optimized for powers of 2, it is still a high-level abstraction which is built out of low-level operations that might include shifts. In other words, using pow for direct manipulation of the exponent value of a float would constitute an inversion of concerns.

Second, pow can lead to spurious overflow. Consider this example:

import Foundation  // for pow

let a = 0x1p1023
let b = 0.5
let n = a.exponent - b.exponent

let u = b << n
let v = b * pow(2, Double(n))

print(u == a)    // true
print(v == a)    // false

The problem is that pow(2, n) evaluates to .infinity. The same can happen the other way with underflow to 0.

The shift version works correctly for all inputs, because it directly adjusts the exponent.

The operation I am describing is a shift operation. It is exactly what you get by moving the decimal point (well, the binary point), aka “shifting the digits”.


I certainly agree that Swift’s initializer spelling for the IEEE scaleB operation isn’t the most natural or obvious, particularly in terms of the behavior of exponent. I’ve felt that way for some time, and I’m glad that I’m not the only one. The documentation has long been due for improvement, but I too wonder if we could do better.

The << and >> operators, though, are semantically bitwise shift operators. This reflects that Int and other integer types represent both an integer and its two’s complement representation as a sequence of bits.

The IEEE binary floating point types, by contrast, don’t in a meaningful sense represent a sequence of bits, and I would argue that this is a significant enough deviation to merit a different spelling, in the same vein that % and truncatingRemainder have been distinguished. While it may make sense to think of scaleB as a binary digit shifting operation, it’s certainly not a bitwise shifting operation.

I do hope, however, that there is some latitude to consider a spelling such as scaled(byExponent:). There may be already either a forum thread or a SR bug about this topic.


Yes, that is why I restricted my implementation to BinaryFloatingPoint rather than the full FloatingPoint protocol.

The operators << and >> operate on binary values, by shifting the bits relative to the decimal point. (I’m just going to use the term “decimal point” and not keep clarifying about “binary point” or “radix point”.)

I contend that binary floating point types emphatically and entirely do represent a sequence of bits.

They represent the bits which correspond to a real number written in binary.

The significand stores the leading bits of that sequence, and the exponent specifies where those bits are located relative to the decimal point.

The digits of a binary number are exactly and precisely the bits of that number. The word “bit” literally means “binary digit”.

• • •

To be abundantly and excessively clear:

The sequence of bits which a binary floating point value logically represents, is distinct from the sequence of bits which the storage format uses to hold that value.

The shift operators I have described operate on the logical sequence of bits which the binary floating point value represents, namely the binary expansion of the number that the value represents.

They shift the bits in the number itself, not the bits in the storage format.


The “bits in the storage format” that you describe corresponds to “specification level 4” outlined in IEEE 754, or what the standard describes as the “encoding.” By “logical sequence of bits,” you seem to be describing something approximating IEEE “specification level 3,” or what the standard describes as the “representation.”

However, the IEEE representation of a finite value is defined as a tuple of (sign, exponent, significand). The significand has a logical sequence of bits, as does the exponent. But, can you explain to me what the “logical sequence of bits” would be for -0.0 and 0.0 that explains the behavior of, say, what would be spelled in your proposal as -0.0 >> 4?

You are correct, I should add, that for integers << and >> semantically operate on the logical sequence of bits and not the actual storage representation. This comes into play for hypothetical BigInt types, which must produce the same results regardless of whether internally a value is stored in two’s complement or sign-magnitude format—to be specific, the operation is performed on the two’s complement representation. (The documentation may still be underspecified here.)

This does raise another difficulty here, of course, in that even if you can describe some sequence of bits to represent every floating point value on which a bitwise shift reproduces the result of the IEEE scaleB operation, it cannot be said to be the two’s complement representation. IEEE 754 itself makes no attempt to define such a sequence of bits and defines the scaleB operation in terms of arbitrary-precision multiplication and exponentiation of numeric values.


Huh, I've never even seen this syntax. Is there any documentation on it?

Good point.

I don't see how it's a shift. Certainly at the low level, it's not a naive shifting of the bits (the border between mantissa and exponent needs to be maintained, as does the sign bit).

Thanks for calling attention to this.

I made a mistake in the verbose initializer example in the first post.

The argument passed in for sign: should always be .plus, not x.sign.

I’ve fixed it now.

With the correct implementation, the shift operators preserve the sign of the left-hand input.

This is yet another piece of evidence for how confusing that initializer is. Even after reading the documentation, I still got it wrong!

Conversely, with the shift operators that sort of mistake simply would never occur.

Shifting means shifting, means multiplying by a power of 2, means offsetting the exponent.


I don’t see that as a difficulty at all.

It doesn’t matter how the sign of the number is stored.

Conceptually, if we like, we can simply think of a real number as the sequence of bits of the binary representation of its magnitude, with a sign prepended.

(Not a sign-bit, but an actual + or - sign.)

Regardless of whether we view a real number’s binary representation in two’s-complement or sign-magnitude form, the effect of bit-shifting is always the same:

Shifting to the left doubles the magnitude of the number for each position moved, and shifting to the right halves the magnitude.

In every case the sign remains unchanged, and shifting is equivalent to multiplying by a power of 2, just like it is for integers.

Because that’s what shifting the bits of a binary number means.


The Swift grammar has a section on floating-point literals which describes hexadecimal floating-point literals like that.

I believe the syntax was introduced in C99.

A floating-point value represents a number.

That number can be written in binary.

It is the bits of that number, written in binary, which are shifted.

Not the bits of the storage format.


This understanding is incomplete.

What you describe doesn’t account for the very particular rounding-down behavior when right-shifting a negative integer by a positive amount (or, equivalently, left-shifting a negative integer by a negative amount). That result is obtained by truncating the two’s complement representation after shifting; it is not what you would get by truncating the sign-magnitude representation after shifting:

2 >> 1   // 1
-2 >> 1  // -1
-2 << -1 // -1
2 >> 4   // 0
-2 >> 4  // -1 (not 0)
-2 << -4 // -1 (not 0)

By contrast, the IEEE scaleB operation is specified to have the same rounding behavior as any other arithmetic operation after a notionally infinite-precision computation.

This distinction parallels that which separates floating-point remainder and integer % (i.e., floating-point truncatingRemainder).

Put another way, right-shifting a negative integer value repeatedly results in -1, but scaling a negative floating-point value repeatedly by a negative amount results in 0.
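To make that concrete, a sketch of both behaviors (using the standard initializer, which binds the scaleB operation; the loop counts are just chosen large enough to reach the fixed points):

```swift
// Integer right shift floors toward negative infinity: -2 >> 1 == -1,
// and -1 >> 1 == -1, so repeated shifts get "stuck" at -1.
var i = -2
for _ in 0..<100 { i >>= 1 }
// i == -1

// scaleB rounds like ordinary arithmetic: repeatedly halving -2.0
// gradually underflows through the subnormals and reaches -0.0.
var d = -2.0
for _ in 0..<1100 { d = Double(sign: .plus, exponent: -1, significand: d) }
// d == -0.0
```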

But there is no need for me to quote IEEE 754 to you piecemeal. The point is that I agree that the current initializer spelling for this operation is not intuitive. However, even if you could demonstrate a way to reconcile the differing semantics here in such a way that scaling a floating-point value is exactly like an integer bitwise shift, it is plain that quite a number of people do not find it intuitive to think of the operation in that way. For that reason, if we’re going to innovate in a way that improves the discoverability and usability of this operation, a clearly named method would be superior to both the status quo as well as <<.


I think adding some sugar for this is perfectly reasonable. I think that using the shift operators for this would be pretty idiosyncratic and not at all obvious to most readers, and also probably hurt typechecker performance somewhat. Some other spelling seems more appropriate, especially as this is a somewhat common operation for library implementors, but not actually that common in most numerical end-user code.

x.scaled(byTwoToThe: n) or similar seems appropriate, and I think would be more-or-less a no-brainer. It should be defined on BinaryFloatingPoint, like you said.†
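A minimal sketch of that spelling, assuming the method is a thin wrapper over the existing initializer (the name here is the proposed one, not existing API):

```swift
extension BinaryFloatingPoint {
    /// Returns self multiplied by 2^n (the IEEE scaleB operation).
    func scaled(byTwoToThe n: Int) -> Self {
        Self(sign: .plus, exponent: Exponent(n), significand: self)
    }
}

let y = 1.5.scaled(byTwoToThe: 4)   // 24.0
```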

As an aside, some history on why the init works like it does: traditionally, IEEE 754 viewed a floating-point number as having the form ±1.xxxxx * 2ⁿ (i.e. a significand in [1, 2)), but C and POSIX viewed a floating-point number as having the form ±0.xxxx * 2ⁿ⁺¹ (i.e. a significand in [0.5, 1.0)). I wanted to accommodate both of these viewpoints, so the initializer accepts either one. But once you do that, you might as well just make it work with an arbitrary floating-point "significand", at which point it can also pretty trivially bind the scalb operation, which saves some work. So here we are. This is probably too clever by half, though.

† The obvious alternative design would be to define it on FloatingPoint but given a name that makes clear that it is just incrementing/decrementing the exponent (i.e. scaling by a power of radix). This might look nicer at first, but I think it will be more confusing for most readers, and I'm really not sure how useful the more generic operation actually is. Outside of some pretty niche uses, in the more generic setting you usually just need to scale by radix itself to move away from the overflow or underflow boundary, so you don't need an arbitrary power; a simple multiply or divide is much easier to read (and easier for the compiler to optimize!)



We already have ulpOfOne, so how do you all feel about scaled(byPowerOfTwo:)?


Well, if the shift operators are out, perhaps it would be simplest to just make the exponent property mutable:

x.exponent += n

Then it could go on FloatingPoint, with no restriction on the radix.
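Since the stdlib’s exponent property is get-only, a hypothetical stand-in can sketch the idea (the property name and semantics here are assumptions, and edge cases like zero and infinity are ignored):

```swift
extension FloatingPoint {
    // Hypothetical stand-in for a mutable `exponent` (which would require
    // a stdlib change); setting it scales self so the new exponent holds.
    var mutableExponent: Exponent {
        get { exponent }
        set { self = Self(sign: .plus, exponent: newValue - exponent, significand: self) }
    }
}

var x = 12.0            // 1.5 × 2^3, so exponent == 3
x.mutableExponent += 2  // scale by 2^2
// x == 48.0
```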


I would expect that to have different behavior than scaleB at the edges, I think: If there’s not enough precision because I decreased the exponent such that the value is now subnormal, I’d nonetheless be surprised to find that setting the exponent caused the significand silently to lose some digits. And if I increment the exponent past the maximum supported, it would likely still be surprising that this would wipe out my entire significand in order to represent infinity.

Overall, I think if what we want is the scaleB operation (and I think we do, rather than making up our own similar operation with different behavior at the edges), the least clever design is probably going to be the most robust.
