How to write this generic BinaryFloatingPoint initializer without it being so slow

Please see (and try) the following little program (to be compiled with optimizations and using a recent dev snapshot). I would like to know how to best write a generic version that is as fast as the specific version for Double, or why it isn't/shouldn't be possible.

import Security
import QuartzCore

// I wish this generic version could be as fast as the Double-specific version:
extension BinaryFloatingPoint {
    /// Create a value by mapping the given unsigned integer value from the
    /// full range of the `RawSignificand` type to the floating point unit
    /// range [0, 1).
    init(unitRange i: RawSignificand) {
        let shifts = MemoryLayout<RawSignificand>.stride &* 8 &-
            (Self.significandBitCount &+ 1)
        self = Self(i >> shifts) * (.ulpOfOne / 2.0) // (Initializer here, very slow.)
    }
}

// This Double-specific version is about 18 times faster than the generic one.
// Uncomment it to let test() use it instead of the generic one:
/*
extension Double {
    static let myFastUlpOfOne = Double.ulpOfOne // Almost 5 x faster … : P
    init(unitRange i: UInt64) {
        self = Double(i >> Double.exponentBitCount) * (.myFastUlpOfOne / 2.0)
    }
}
*/

// Just a helping function for the test:
func unsafeRandomBitPatternElements<T>(count: Int) -> [T]
    where T: ExpressibleByIntegerLiteral
{
    var a = [T](repeating: 0, count: count)
    let r = SecRandomCopyBytes(nil, count * MemoryLayout<T>.stride, &a)
    precondition(r == errSecSuccess)
    return a
}

/// This test function prints the time it takes to convert a 100 million random
/// UInt64 values into Double values in the range [0, 1) and add them.
func test() {
    for _ in 0 ..< 10 {
        var sum = Double(0)
        let a: [UInt64] = unsafeRandomBitPatternElements(count: 100_000_000)
        let t0 = CACurrentMediaTime()
        for e in a { sum += Double(unitRange: e) }
        let t1 = CACurrentMediaTime()
        print("time:", t1 - t0, "seconds, ( sum:", sum, ")")
    }
}

test()

/cc @scanon, @xwu

It's not clear to me that, in the optimized form, the generic version has to be slower. However, the line

static let myFastUlpOfOne = Double.ulpOfOne // Almost 5 x faster … : P

suggests one starting point to figuring out why the generic version is slower. Given that ulpOfOne is constant, that really shouldn't speed things up in concrete code, especially not fivefold, especially not when optimized.

The .ulpOfOne-issue (SR-6919) is not the biggest source of slow down though - it's the initializer here:

init(unitRange i: RawSignificand) {
    let shifts = MemoryLayout<RawSignificand>.stride &* 8 &-
        (Self.significandBitCount &+ 1)
    self = Self(i >> shifts) * (.ulpOfOne / 2.0) // (Initializer here, very slow.)
}

The input i is of type RawSignificand which is only constrained to be an UnsignedInteger. So the following initializer from FloatingPoint will be used here:

public init<Source>(_ value: Source) where Source : BinaryInteger

And my guess is that it ends up being dynamically dispatched?

I have managed to work around both the .ulpOfOne-issue and the slow initializer-issue, but the solution is not pretty:

// A(n ugly) generic version that is as fast as the Double-specific one:
extension BinaryFloatingPoint {
    init(unitRange i: RawSignificand) {
        let shifts = MemoryLayout<RawSignificand>.stride &* 8 &-
            (Self.significandBitCount &+ 1)
        // 1. Reformulation to avoid the slow .ulpOfOne:
        let myUlpOfOne = Self(1)/(Self(UInt64(1) << Self.significandBitCount))
        // 2. Casting i to UInt64 to avoid slow generic initializer:
        self = Self(UInt64(i) >> shifts) * (myUlpOfOne / 2.0)
    }
}

(Since static stored properties are not supported in protocol extensions, we can't use the Double-specific workaround for the .ulpOfOne-issue in the generic version. That's why my code computes ulpOfOne from scratch like that, using only actually static input. And the cast to UInt64 will result in a non generic initializer that I guess the compiler is able to inline. Note however that RawSignificand could possibly be a type with larger bit width than UInt64 which would cause this code to ... trap i guess?)

I am still interested to know if there is a nicer way to write it. If there isn't, then at least this is an example that highlights a couple of problems in the current standard library and/or compiler, ie the ones preventing us from writing generic code like this without having to choose between:

  1. The generic code being much slower than expected / necessary.
  2. The generic code containing ugly workarounds like my version above.

Ah, you’re calling that method. No, it has nothing to do with dynamic dispatch. That’s a method that needs to work with any Source : BinaryInteger (including any hypothetical arbitrary-width type) and therefore doesn’t use any built-in intrinsic operations.

If you want your method truly to work with any BinaryInteger type, then it’s going to take that performance hit (you don’t need this here). If you want your method to work with only RawSignificand types of a certain precision, then exactly something like your workaround which can then be optimized to use intrinsic operations will be faster; or, actually use the initializer that takes sign, raw significand, and raw exponent.

Oh, OK, that makes sense, thanks!