Improving Float.random(in:using:)

Hmm, I wonder why the checksum in the 3rd row changed for Float and Double but not Float80. I would expect it to stay the same for all three types, since the range -greatestFiniteMagnitude ... 0 uses the standard codepath in both versions, which has not changed.

Oops, looks like I accidentally posted results after having edited the third range (tried with -1 ... 0), sorry about that.

Here are the results again, with the correct 3rd range.

I've added the checksums for the previous version of your code after the ones for this latest version.
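
(For context, a harness of roughly this shape would produce the lines below; the actual benchmark code isn't shown in the thread, so the timing loop and the checksum formula here are purely illustrative.)

import Foundation

func measure<F: BinaryFloatingPoint, R: RandomNumberGenerator>(
    _ range: Range<F>, count: Int, using generator: inout R
) where F.RawSignificand: FixedWidthInteger {
    var checksum: UInt64 = 0
    let start = Date()
    for _ in 0 ..< count {
        let v = F.random(in: range, using: &generator)
        // Fold each value's bit pattern into an integer checksum so two runs
        // can be compared for identical output.
        checksum ^= UInt64(truncatingIfNeeded: v.significandBitPattern)
    }
    print("  \(Date().timeIntervalSince(start)) seconds (checksum: \(checksum))")
}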

Float
  0.226241222 seconds (checksum: 138254 (138254))
  0.053931518 seconds (checksum: 5896375 (3957719))
  0.212501097 seconds (checksum: 3631585 (3631585))
  0.051749585 seconds (checksum: 5896271 (3957703))
Double
  0.280503323 seconds (checksum: 2981475560360739 (2981475560360739))
  0.079444674 seconds (checksum: 1323779508271287 (315125084029635))
  0.277631542 seconds (checksum: 759283211863005 (759283211863005))
  0.079980958 seconds (checksum: 1323779508272975 (315125084030247))
Float80
  1.086825454 seconds (checksum: 1370150508309074436 (1370150508309074436))
  0.377088032 seconds (checksum: 1289353272936233143 (8536050357597743157))
  7.324208065 seconds (checksum: 4117734076763979444 (4117734076763979444))
  0.392084622 seconds (checksum: 1289353272936204111 (8536050357597755341))

I’m noticing that the “fast path” is actually slower when using the SystemRandomNumberGenerator.

With both Float and Double, my implementation is about 2x slower than the stdlib’s for 0..<1, but only 1.5x slower for 0..<1.5.

This makes sense, since the “fast path” requires 2 calls to the generator whereas the standard path only needs 1 call. When the generator is slow, that extra call has a major effect.

In contrast, with Xoroshiro128Plus, my implementation is about 4x slower for 0..<1, and about 12x slower for 0..<1.5.

When the generator is fast, a second call to it doesn’t matter and the simpler logic of the fast path does indeed make it faster.
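
For reference, the one-call standard path is roughly a single draw scaled into the range, along these lines (an approximation for a concrete Double range, not the actual stdlib source):

func scaledRandom<R: RandomNumberGenerator>(
    upTo upperBound: Double, using generator: inout R
) -> Double {
    // One generator call: keep 53 random bits, scale them into [0, 1),
    // then scale into [0, upperBound).
    let bits = generator.next() &>> 11
    let unit = Double(bits) * (Double.ulpOfOne / 2)   // one step is 2^-53
    return unit * upperBound
}

The fast path instead draws the exponent and the significand separately, which is where the second next() call comes from.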

Interestingly, with Float80 and the system generator, my implementation is only about 1.2x slower than the stdlib’s on 0..<1, and 1.6x slower on 0..<1.5.

• • •

Edit:

I just added a check so the “fast path” is not used with SystemRandomNumberGenerator. And… it significantly sped up Float and Double on 0..<1, so they’re now only 1.2x slower than the stdlib with the system random number generator.
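
One way such a check might look (a sketch of the idea; not necessarily how the actual code does it):

@inline(__always)
func shouldUseFastPath<R: RandomNumberGenerator>(for _: R.Type) -> Bool {
    // The fast path costs an extra next() call, which is cheap for an inlined
    // PRNG but comparatively expensive for the syscall-backed system generator.
    return R.self != SystemRandomNumberGenerator.self
}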

It did drop Float80 to 1.3x slower than the stdlib, but I’m much more interested in optimizing for Float and Double.

This is getting close!

• • •

Edit 2:

For many subnormal ranges like 0 ..< .leastNormalMagnitude/4, using the system RNG, my implementation is about 0–5% faster than the stdlib for Float and Double, and about 35–40% faster for Float80.

And using xoroshiro, my implementation is 25–30% faster for Float and Double, and a whopping 45–50% faster for Float80.

I’m starting to think this might actually be viable.


Yes, but the fast path could make more efficient use of the random bits and thereby require 1 rather than 2 calls to generator.next(), at least for Float and Double (for Float80 it would require 2 calls), e.g.:

extension BinaryFloatingPoint where
    RawExponent: FixedWidthInteger,
    RawSignificand: FixedWidthInteger
{
    static func randomThroughRawExponent<R: RandomNumberGenerator>(
        _ maxExponentBitPattern: RawExponent,
        using generator: inout R
    ) -> Self
    {
        var bits: UInt64
        var tzSum = 0
        again: repeat {
            bits = generator.next()
            let maxZS = Int(truncatingIfNeeded: maxExponentBitPattern) &- tzSum
            let tzs = min(bits.trailingZeroBitCount, maxZS)
            tzSum &+= tzs
            // The whole word was zeros and the exponent hasn't bottomed out
            // yet, so draw another word and keep counting.
            if tzs == 64 && tzSum < maxExponentBitPattern { continue again }
            let consumedBits = tzs &+ 1
            bits &>>= consumedBits
            // Not enough leftover bits for the significand; draw a fresh word.
            if (64 &- consumedBits) < Self.significandBitCount {
                bits = generator.next()
            }
            break again
        } while true
        let e = maxExponentBitPattern &- RawExponent(truncatingIfNeeded: tzSum)
        let shifts = RawSignificand.bitWidth &- Self.significandBitCount
        let mask = ~RawSignificand(0) &>> shifts
        let s = RawSignificand(truncatingIfNeeded: bits) & mask
        return Self(sign: .plus, exponentBitPattern: e, significandBitPattern: s)
    }
}

Using SystemRandomNumberGenerator(), this implementation will (on average) be as fast as the current stdlib F.random(in: 0 ... powerOfTwo) when F is Float or Double, and it will actually be faster than the current stdlib implementation when F is Float80.

To support -powerOfTwo ..< powerOfTwo ranges, one more bit would be needed, but that shouldn't change the performance.
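
For illustration, that extra bit could be peeled off the same random word before the rest is used for the significand, roughly like this (a sketch, not part of the code above):

@inline(__always)
func takeSignBit(from bits: inout UInt64) -> FloatingPointSign {
    // Spend one bit on the sign so symmetric ranges like -1 ..< 1 don't need
    // a separate Bool.random() or an extra next() call.
    let sign: FloatingPointSign = (bits & 1) == 0 ? .plus : .minus
    bits &>>= 1        // the remaining bits stay available for the significand
    return sign
}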


Using a fast pseudorandom number generator, this implementation of the fast path seems to be about twice as fast as that of the current implementation (per this commit).

But for some reason my code is (significantly) slower for Float than for Double, which seems counterintuitive (I'd expect it to be equally fast or maybe a tiny bit faster for Float).

EDIT: Ah, it turned out that the cause of this was the call to min (somehow) and a < operator between different integer types, as well as the optimizer happening to inline the method for Double but not for Float. So the above implementation was giving me this (counterintuitive) result:

Float
  0.233315511 seconds
Double
  0.153776696 seconds

And this improved version of the method gives me the following (more expected) result:

extension BinaryFloatingPoint where
    RawExponent: FixedWidthInteger,
    RawSignificand: FixedWidthInteger
{
    @inline(__always)
    static func randomThroughRawExponent<R: RandomNumberGenerator>(
        _ maxExponentBitPattern: RawExponent,
        using generator: inout R
    ) -> Self
    {
        let maxExpInt = Int(truncatingIfNeeded: maxExponentBitPattern)
        var bits: UInt64
        var tzSum = 0
        again: repeat {
            bits = generator.next()
            let maxZs = maxExpInt &- tzSum
            let tzs_ = bits.trailingZeroBitCount
            // Homogeneous Int comparison instead of `min` (see the note above).
            let tzs = tzs_ < maxZs ? tzs_ : maxZs
            tzSum &+= tzs
            if tzs == 64 && tzSum < maxExpInt { continue again }
            let consumedBits = tzs &+ 1
            bits &>>= consumedBits
            if (64 &- consumedBits) < Self.significandBitCount {
                bits = generator.next()
            }
            break again
        } while true
        let e = maxExponentBitPattern &- RawExponent(truncatingIfNeeded: tzSum)
        let shifts = RawSignificand.bitWidth &- Self.significandBitCount
        let mask = ~RawSignificand(0) &>> shifts
        let s = RawSignificand(truncatingIfNeeded: bits) & mask
        return Self(sign: .plus, exponentBitPattern: e, significandBitPattern: s)
    }
}
Float
  0.14883914 seconds
Double
  0.148902778 seconds

Looking at the performance of the fast path in your current implementation, I see that it too is slightly slower for Float than for Double, although not as much as was the case for my code. Adding @inline(__always) to the relevant methods fixed that. With this change I see this (with WyRand):

Your fast path:
0.33 seconds for Float and Double, 2.11 seconds for Float80

My fast path (the improved version):
0.14 seconds for Float and Double, 1.01 seconds for Float80

Thus there should be no need to special-case away (an improved implementation of) the fast path for the system generator.


Good point. I can rework the randomUpToExponent function to use the leftover bits for the significand when possible.

I can even introduce an allowNegative argument, so it will use a leftover bit to handle the symmetric-bounds case (e.g. -1 ..< 1), and thus avoid calling Bool.random().

My first crack at it (using the system RNG, and compared to the stdlib’s random(in:)) is 10% slower for Float and Double, and 15% faster for Float80, on both 0..<1 and -1..<1.

And on 0 ..< .leastNormalMagnitude, it’s 10% faster for Float and Double, and a remarkable 3x faster for Float80.

I have not yet eliminated the heterogeneous comparisons, but I did locate and mark all of them in the file with a comment (// Heterogeneous compare). All of them are between RawExponent and either Int or UInt64.

Rather than special-case this one function, if those comparisons are really eating the performance then I’d like to take care of all of them together.

• • •

In your implementation, you unconditionally convert from RawExponent to Int. I’m not really comfortable with that, because (a) the standard library types use UInt for RawExponent, which could in principle exceed Int.max (even though it never actually does for valid floating-point values), and (b) an arbitrary floating-point type could use an even larger integer type.

I’m going to try to special-case this by checking whether exponentBitCount < Int.bitWidth, and if not, then whether maxExp itself fits in an Int (e.g. maxExp._binaryLogarithm() < Int.bitWidth &- 1).

Your bits variable is UInt64, so this can put at most 64 bits into the significand. That’s enough for the standard library types (including Float80), but it’s not sufficient for the general case.

(The fix is pretty easy: first check if the number of available bits is at least significandBitCount, and if not then generate a random RawSignificand.)
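
A sketch of that fallback, assuming the spare bits and their count are passed along from the exponent step (the helper name and signature here are illustrative, not from the actual file):

extension BinaryFloatingPoint where RawSignificand: FixedWidthInteger {
    static func randomSignificandBitPattern<R: RandomNumberGenerator>(
        spareBits: UInt64,
        spareBitCount: Int,
        using generator: inout R
    ) -> RawSignificand {
        let shifts = RawSignificand.bitWidth &- significandBitCount
        let mask = ~RawSignificand(0) &>> shifts
        if spareBitCount >= significandBitCount {
            // Enough leftover randomness: reuse it for the significand.
            return RawSignificand(truncatingIfNeeded: spareBits) & mask
        }
        // Not enough spare bits (e.g. a significand wider than 64 bits):
        // draw a full RawSignificand's worth of fresh random bits instead.
        let fresh: RawSignificand = generator.next()
        return fresh & mask
    }
}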

• • •

Edit:

I just tested my work-in-progress implementation with xoroshiro on 0 ..< .leastNormalMagnitude, and compared to the stdlib mine is 5x faster for Float, 6x faster for Double, and a blazing 14x faster for Float80.


Cool. Not using floating-point * and + probably makes a lot of difference for subnormals, especially for Float80, which on top of the usually slow subnormal arithmetic also seems to be subject to a lot of missed optimization opportunities.

However, note that the stdlib implementation is around 15x (Float and Double) and 72x (Float80) slower for 0 ..< leastNormalMagnitude than it is for e.g. 0 ..< 1 or 0 ..< 1024.

The implementation in my previous post is about 14x faster than the stdlib for the subnormal range for Float and Double, and 120x faster for Float80. But it doesn't handle the sign bit etc., of course.


EDIT:

BTW, I have a very simple sketch for a PseudorandomNumberGenerator protocol, something I think should be added to the Random API of the stdlib, since it's needed for e.g. initializing or restarting a PRNG with some specific state or seed in a generic context. I've needed it for writing tests for this stuff.

Here it is, together with the WyRand generator:
protocol PseudorandomNumberGenerator : RandomNumberGenerator {
    associatedtype State
    init(seed: UInt64)
    init?(state: State)
}
extension PseudorandomNumberGenerator {
    init(seed: UInt64? = nil) {
        self.init(seed: seed ?? UInt64.random(in: .min ... .max))
    }
}

/// The wyrand generator, translated from:
/// https://github.com/wangyi-fudan/wyhash
struct WyRand : PseudorandomNumberGenerator {

    typealias State = UInt64

    private (set) var state: State

    init(seed: UInt64) {
        self.state = seed
    }
    init(state: State) {
        // Every UInt64 value is a valid WyRand state, so there's no need for
        // this initializer to be failable; it will satisfy the requirement anyway.
        self.state = state
    }

    mutating func next() -> UInt64 {
        state &+= 0xa0761d6478bd642f
        let mul = state.multipliedFullWidth(by: state ^ 0xe7037ed1a0b428db)
        return mul.high ^ mul.low
    }
}
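
For what it's worth, here is the kind of use this enables (a usage sketch, not from the thread): capture a generator's state mid-stream and replay from it later, e.g. to reproduce a failing test run.

var g = WyRand(seed: 0xDEADBEEF)
_ = g.next()                       // advance the generator a bit
let saved = g.state                // capture the exact state, not the seed
var replay = WyRand(state: saved)  // restart from that state later
assert(g.next() == replay.next())  // both streams continue identically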

This depends a lot on the platform. On Apple's recent ARM cores, there's zero penalty for operations with subnormal numbers. On x86, it can easily be 100x.

On platforms with slow subnormal operations, everything is going to be slow, so it's not really a concern if random number generation is also slow. This is an extremely niche corner case that is broadly not worth worrying about (and certainly not worth designing around; if fast subnormals fall out naturally from other optimizations, great, but it doesn't make sense to spend software engineering effort on them).

Without any constraints on State, init(state:) is almost useless as a protocol requirement (it remains useful on concrete types, of course). What's the reason to put it on the protocol?


The reason for the associated type State is simply to support this requirement:

init?(state: State)

For example, you might be running tests and want to rerun something from a specific state (not a seed). For the WyRand generator, the state and seed happen to be the same, but that is not true for every PRNG: some will not accept 0 as a state, and the State might be 128 bits or whatever.

But without a public state property or getter, you can't save the state to be able to restart it. Plus, you need to have the constraint that State is POD for any of this to work.

What you definitely do want to have (and seem to be missing) is some mechanism to provide an init() that fills the whole state from the system generator. I can see a few different ways to put that into the protocol, with slightly different tradeoffs.

I get similar performance for WyRand and Xoroshiro128Plus for both Float and Double, but WyRand is 2x slower for Float80.

• • •

In any event, I’ve uploaded my latest version: link to diff.

I gave you a mention in the “author” comment, I marked the lines that have heterogeneous comparisons, and I made randomUpToExponent faster as we’ve been discussing.

The randomExponent function now returns a triple of (exponent, spare bits, bit count), so the extra bits can be repurposed for the significand. I also made it generic over the exponent type, with a specialization for Int. That way if the exponent is small it can first be cast to Int and run fast, but if it’s large the code still works.

Seeing the improvement that comes from avoiding a single heterogeneous comparison, the next optimization target will likely be the other three.
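
Since the diff itself isn't reproduced in the thread, here is a rough sketch of the shape being described, for the Int-specialized case only (the names and details are my guesses, not the uploaded code):

func randomExponent<R: RandomNumberGenerator>(
    upTo maxExponent: Int,
    using generator: inout R
) -> (exponent: Int, spareBits: UInt64, spareBitCount: Int) {
    var remaining = maxExponent
    while true {
        let word = generator.next()
        if word == 0 && remaining > 64 {
            remaining &-= 64          // a whole word of zeros; keep counting
            continue
        }
        // Count trailing zeros (a geometric distribution), clamped so the
        // exponent bottoms out at 0 for the subnormal bucket.
        let zeros = min(word.trailingZeroBitCount, remaining)
        let consumed = min(zeros &+ 1, 64)   // the zeros plus the terminating 1 bit
        let spare = consumed < 64 ? word &>> consumed : 0
        return (remaining &- zeros, spare, 64 &- consumed)
    }
}

The caller would then feed spareBits/spareBitCount into the significand step (and, with the allowNegative idea above, the sign bit).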


I have that protocol too, but instead of the init(seed:) I have a generic init that takes a RandomNumberGenerator inout.

protocol PseudorandomNumberGenerator: RandomNumberGenerator {
    associatedtype State
    var state: State { get }
    init?(state: State)
    init<Source: RandomNumberGenerator>(from: inout Source)
}

extension PseudorandomNumberGenerator {
    public init() {
        var entropy = SystemRandomNumberGenerator()
        self.init(from: &entropy)
    }
}

struct WyRand : PseudorandomNumberGenerator {

    typealias State = UInt64

    private (set) var state: State

    init<Source: RandomNumberGenerator>(from source: inout Source) {
        state = source.next()
    }
    init(state: State) {
        // Every UInt64 value is a valid WyRand state, so there's no need for
        // this initializer to be failable; it will satisfy the requirement anyway.
        self.state = state
    }

    mutating func next() -> UInt64 {
        state &+= 0xa0761d6478bd642f
        let mul = state.multipliedFullWidth(by: state ^ 0xe7037ed1a0b428db)
        return mul.high ^ mul.low
    }
}

I find init(state:) is usually more useful than init(seed:), and if I do want init(seed:) then I usually need to implement it in terms of init(from:) anyway.


Yes, there should be a property. Here's my not-as-sketchy actual protocol:

/// A type that produces a reproducible stream of uniformly distributed
/// pseudorandom bits.
protocol PseudorandomNumberGenerator {

    associatedtype State
    var state: State { get }

    /// Creates a new instance of this generator with a state that is
    /// completely determined by the given seed. Any `UInt64` is allowed and
    /// maps to a valid state. Each seed may or may not map to a unique state
    /// depending on the generator.
    init(seed: UInt64)

    /// Creates a new instance of this generator with the given state.
    /// The initializer fails if the given state is invalid for this generator.
    init?(state: State)

    /// Advances the state of the generator and returns the next 64 uniformly
    /// distributed pseudorandom bits.
    mutating func next() -> UInt64
}

(It doesn't inherit from RandomNumberGenerator for a couple of irrelevant reasons.)

How would you add the constraint that State can be e.g. UInt32 or (UInt64, UInt64)? Without any constraints (and just manually ensuring that any conforming type doesn't have a strange State like Void, Never, or a class type), I can use the getter and init no matter the type of the State, and it has worked for me for some years now.

You mean this, i.e. the init(seed: UInt64? = nil) extension above that seeds from the system generator?

No, because that only consumes 64 bits of randomness, which is insufficient for a lot of applications. The empty init() should set the entire internal state using the system RNG.
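
To illustrate the distinction with a hypothetical generator (not proposed API): a 128-bit state can only receive 64 bits of entropy through init(seed:), however well they are scrambled, while an empty init() can fill every state word from the system RNG.

struct TwoWordState {
    var state: (UInt64, UInt64)

    init(seed: UInt64) {
        // However the seed is scrambled, at most 64 bits of entropy go in.
        state = (seed, seed ^ 0x9E3779B97F4A7C15)
    }

    init() {
        // Fills the whole 128-bit state with fresh entropy.
        var system = SystemRandomNumberGenerator()
        state = (system.next(), system.next())
    }
}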

Sure. At one point I had an initializer that used SecRandomCopyBytes, but I didn't end up using it, so I just kept the one that sets the state from a seed.

Note that init(seed: UInt64) is required/expected to set all bits of the state in a sufficiently scrambled way, given the seed (for example, xoroshiro uses splitmix64 a couple of times to scramble the seed and fill the entire state). For a generator whose state is smaller than 64 bits, some seeds will map to the same state, and for generators with more than 64 bits of state, not all states will be reachable from the available seeds. But in my experience it's useful to have a single common type for the seed, just as it is for the output of next() -> UInt64 (a generator with a 32-bit state would have to advance its state twice to produce each 64-bit output).
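
For reference, the splitmix64 step mentioned above looks like this (standard constants from Vigna's public-domain splitmix64.c); a seeding routine calls it repeatedly to fill successive state words:

func splitmix64(_ x: inout UInt64) -> UInt64 {
    x &+= 0x9E3779B97F4A7C15
    var z = x
    z = (z ^ (z &>> 30)) &* 0xBF58476D1CE4E5B9
    z = (z ^ (z &>> 27)) &* 0x94D049BB133111EB
    return z ^ (z &>> 31)
}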

(This protocol has just been for personal / in-house use, and is only meant as a sketch here, not as a formal proposal.)

Is it even possible to express the POD constraint on an associated type, or is that a requirement that can only be documented?


I'm not at all sure about the heterogeneous comparisons though; I was just trying out various things, and I think it made some difference in my specific case. Do you mean you saw an improvement by avoiding that heterogeneous comparison?

It's not possible today, but it's fairly likely to become possible at some (near) future point, because we'd like to use it in a bunch of other places as well.


So what did you mean by this (that init(state:) is almost useless as a protocol requirement without constraints on State)?

EDIT: I guess what you meant is that it would work given something more like @Nobody1707's:

protocol PseudorandomNumberGenerator: RandomNumberGenerator {
    associatedtype State
    var state: State { get }
    init?(state: State)
    init<Source: RandomNumberGenerator>(from: inout Source)
}

That the function described cannot be used as intended.

There are a variety of workarounds available; you could constrain state to be exposed in the protocol API as [UInt64], or have a FixedWidthInteger constraint, or require init to/from Data or something similar instead of exposing the actual type. All of those options would get you the functionality you want without actually limiting things very much. If the language had a POD constraint, that would be more general, of course, but we don't have to throw our hands up in the air just because it doesn't exist yet =)

Sorry for being slow / tired, but what function, init(state:)?

I use the protocol and init(state: State) and it works. As an example, here is a generator with a larger-than-64-bit state:

/// The jsf64 generator, translated from:
/// www.pcg-random.org/posts/bob-jenkins-small-prng-passes-practrand.html
struct Jsf64 : PseudorandomNumberGenerator {

    typealias State = (a: UInt64, b: UInt64, c: UInt64, d: UInt64)

    private (set) var state: State

    init(seed: UInt64) {
        self.state = (a: 0xf1ea5eed, b: seed, c: seed, d: seed)
        for _ in 0 ..< 20 { _ = next() }
    }

    init(state: State) {
        // Not sure if every state is a valid state … test or search internet.
        self.state = state
    }

    mutating func next() -> UInt64 {
        func rot64(_ x: UInt64, _ k: UInt64) -> UInt64 {
            return ( x &<< k ) | ( x &>> (64 &- k) )
        }
        let e   = state.a &- rot64(state.b,  7)
        state.a = state.b  ^ rot64(state.c, 13)
        state.b = state.c &+ rot64(state.d, 37)
        state.c = state.d &+ e
        state.d = e &+ state.a
        return state.d
    }
}