Well, it's easy enough to inline the nested function manually; it just ends up needing either several nested ifs or a label on the outer if:
Manually inlined
static func uniformRandomInSection<R: RandomNumberGenerator>(_ section: Int64, maxExponent eMax: RawExponent, using generator: inout R) -> Self {
    let n = UInt64(bitPattern: (section < 0) ? ~section : section)
    let b = lowerBound(ofSection: n &+ 1, maxExponent: eMax)
    let x: Self

    label: if (n == 0) && (b.exponentBitPattern != 0) {
        // Section 0 starts at 0 and may span multiple binades.
        x = randomUpToExponent(b.exponentBitPattern, using: &generator)
    } else {
        // Every other section fits within a single binade.
        let a = lowerBound(ofSection: n, maxExponent: eMax)
        if a == b { x = a; break label }

        let low = a.significandBitPattern
        let high = b.nextDown.significandBitPattern
        if (low == high) { x = a; break label }

        let s = RawSignificand.random(in: low...high, using: &generator)
        let e = a.exponentBitPattern
        x = Self(sign: .plus, exponentBitPattern: e, significandBitPattern: s)
    }
    return (section < 0) ? (-x).nextDown : x
}
But eventually I think everything in this implementation is going to need inlining annotations, so I might as well start here. I've added @inline(__always) to the nested function.
Thanks for spotting this and finding the solution!
// Get a random number in a single section (as defined above).
static func uniformRandomInSection<R: RandomNumberGenerator>(_ section: Int64, maxExponent eMax: RawExponent, using generator: inout R) -> Self {
    let n = UInt64(bitPattern: (section < 0) ? ~section : section)
    let b = lowerBound(ofSection: n &+ 1, maxExponent: eMax)

    // This must only be called when the section fits in a single raw binade.
    @inline(__always)
    func singleBinadeRandom() -> Self {
        let a = lowerBound(ofSection: n, maxExponent: eMax)
        if a == b { return a }

        let low = a.significandBitPattern
        let high = b.nextDown.significandBitPattern
        if (low == high) { return a }

        let s = RawSignificand.random(in: low...high, using: &generator)
        let e = a.exponentBitPattern
        return Self(sign: .plus, exponentBitPattern: e, significandBitPattern: s)
    }

    let x: Self

    if (n == 0) && (b.exponentBitPattern != 0) {
        // Section 0 starts at 0 and may span multiple binades.
        x = randomUpToExponent(b.exponentBitPattern, using: &generator)
    } else {
        // Every other section fits within a single binade.
        x = singleBinadeRandom()
    }
    return (section < 0) ? (-x).nextDown : x
}
The first thing I noticed was that uniformRandomInSection made 2 calls to lowerBound(ofSection:) (to get the upper and lower bounds), and I wondered if that was strictly necessary.
About 2am last night I realized that the call for section n+1 could be eliminated, and I jotted it down in my bedside notebook.
…then I couldn't get back to sleep, so I sat up and reworked the code immediately.
I ended up eliminating lowerBound(ofSection:) entirely, moving its logic into uniformRandomInSection and appending random bits directly to get a value uniformly distributed within the section. This also eliminated the nested function.
In the process, I noticed and fixed a latent bug that would only affect types with no spare bits (ie. with significandBitCount == RawSignificand.bitWidth) a tiny portion of the time (when the section spans exactly one full raw binade).
The implementation got a few lines shorter, it should run a little faster since it only needs to calculate one section bound instead of two, and it has one less bug (well, unless my sleep-addled brain introduced more…)
Any idea why the third range (-greatestFiniteMagnitude ... 0) is so slow for Float80?
(I didn't notice this before but that range is slow for both the latest and the previous version.)
Another possible optimization (which I guess you might've already thought about): I think it could be about 4 times faster for ranges that span whole "raw exponent binades" exactly, by using the method described in the paper, which we tried upthread (i.e. choosing the raw exponent by counting consecutive trailing/leading zeroes and then filling the raw significand with random bits).
Interesting. I already have the randomUpToExponent function which uses that approach, but currently it only gets called for section 0. It would be pretty easy to insert a check at the very beginning to use that method when applicable to the whole range.
If we think ranges with a lower bound of 0 and an upper bound with 0 significand (eg. 0..<1) will be far more common than other ranges, we could optimize for it.
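As a minimal sketch, an applicability check for such a fast path could look like this (the helper name is mine, not from the actual implementation): a range 0 ..< b qualifies when b is a positive, finite power of two, i.e. its significand bit pattern is all zeros.

```swift
// Hypothetical helper: does 0 ..< b span whole raw binades exactly?
// True when b is a positive, finite power of two (all significand bits zero).
func isRawBinadeBound<F: BinaryFloatingPoint>(_ b: F) -> Bool {
    b > 0 && b.isFinite && b.significandBitPattern == 0
}

print(isRawBinadeBound(1024.0))  // a power of two, so the fast path would apply
print(isRawBinadeBound(1000.0))  // not a power of two
```

Note this deliberately excludes zero, infinity, and NaN; subnormal bounds would also need separate consideration in the real code.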
• • •
As for general-purpose optimization, when I run the Instruments time profiler I see about 40% of the time being in "Self Weight" of uniformRandomRoundedDown, which makes no sense to me. That function does almost no work itself, mostly just calling uniformRandomInSection, smallRangeUniformRandom, and sectionNumber to perform the actual calculations.
I would've expected those last 3 functions to take up the bulk of the running time, but Instruments says their total combined time is less than the "Self Weight" of uniformRandomRoundedDown.
For example, the last run I tested looked like this:
Weight     Percent   Self Weight   Symbol Name
12.26 s    100.0%    5.38 s        uniformRandomRoundedDown(in:using:)
 2.18 s     17.8%    63.00 ms      uniformRandomInSection(_:maxExponent:using:)
 1.44 s     11.7%    1.06 s        smallRangeUniformRandom(in:using:)
 1.24 s     10.0%    914.00 ms    sectionNumber(maxExponent:)
⋮
Now perhaps I'm misunderstanding what this all means, but it strikes me as quite odd that uniformRandomRoundedDown would have such a high "Self Weight".
So, in my case, the check (which always fails) for the small range path takes a lot more time than for you. Commenting out the small range path takes it down from 9.7 seconds to 7.1 seconds on my MBP:
// if let x = smallRangeUniformRandom(in: range, using: &generator) {
//     return x
// }
I have no idea why that should take so much time, why the rest of the code (i.e. with that check commented away) should still be so slow, or why our time profiles look so different.
BTW: I'm using (the default toolchain of) Xcode 11.4 (11E146) on macOS 10.15.4 (19E266)
I cannot see any clear difference in the timing profile of those two ranges (except that one takes more time than the other). Maybe it has something to do with the implementation of Float80 and/or the builtins that it is calling, like the thing with some floating-point operations being extremely slow for subnormals, for example.
Hmm… thinking about it more, I suspect the small-range path could be simplified a lot without much trouble, and very nearly eliminated.
Its purpose is two-fold: to guarantee that only a single call to the generator is made when the range is small, and to minimize the chances of the while true loop needing more than one pass when the range is large.
The "worst-case" for a large range would be something like 1.nextDown..<1.nextUp, where the endpoints are just barely in different sections (here the sections would be (1<<62)-1* and 1<<62). One of those sections is selected, and then a random value in that section is chosen.
But if there are many possible values in those sections, then we have a problem because only 1 value from each section is actually in the original range, so the while true loop would usually spin multiple times.
However, types with small significands won't have that problem. In particular, 61 or fewer significand bits is sufficient to guarantee those sections are singletons. Thus Float and Double don't need the small-range path.
With 62 significand bits, the upper section is a singleton but the lower one has 2 values, so thereâs a 3/4 chance to land in the range. And with 63 bits, such as Float80, the upper section has 2 values and the lower section has 4, giving a 3/8 chance to land in the range.
However, larger types would rarely land in the range. Float128 has a 112-bit significand, so it would only succeed about 3 times in 2^52, and the while loop would spin a quadrillion times on average.
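Spelling out the arithmetic behind those odds (an illustrative helper, not part of the implementation): with p significand bits and p ≥ 62, the upper section holds 2^(p-62) values and the lower 2^(p-61), only one of each in the range; each pass picks one of the two sections with probability 1/2, so P(success) = 3 · 2^-(p-60).

```swift
// Illustrative only: per-pass success probability for 1.nextDown..<1.nextUp
// on a type with p significand bits (p >= 62):
//   P = (1/2) * 2^-(p-62) + (1/2) * 2^-(p-61) = 3 * 2^-(p-60)
func successProbability(significandBits p: Int) -> Double {
    precondition(p >= 62)
    return 3 * Double(sign: .plus, exponent: -(p - 60), significand: 1)
}
// p = 62 gives 3/4, p = 63 gives 3/8, and p = 112 (Float128) gives 3/2^52,
// matching the figures above.
```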
So we canât remove the small-range path wholesale, but we can be much more selective when it gets used. This should speed things up for Float and Double:
if significandBitCount > Int64.bitWidth - 3 {
    if let x = smallRangeUniformRandom(in: range, using: &generator) {
        return x
    }
}
And for Float80 and other wide types… I'll simplify the small-range code so it only cares about adjacent binades. Then it should be able to fail early much faster.
* Technically, for small types there are many sections between 1.nextDown and 1, so it would be (1<<62) - k for some k, but all those sections would be singletons so it doesn't matter. In the large-significand case k would be 1 and the rest works.
I should maybe have added/restated that the "usual" time for most other Float80 ranges was around 1.6 seconds, so 9.7 s (with the (failing) small path check) and 7.1 s (without the small path check) are both way slower than 1.6 s, and I guess we still don't know why that is.
Speaking only for myself and the projects I'm working on, we have a couple of use cases where getting a 4 x speedup for eg 0 ..< 1024 rather than 0 ..< 1000 would be very nice. And I guess it's also a bit similar to things like x % 1024 being faster than x % 1000.
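The modulo analogy is easy to demonstrate (a tiny illustration, unrelated to the floating-point code itself): for nonnegative integers, remainder by a power of two reduces to a bit-mask, while remainder by 1000 needs an actual division.

```swift
// For nonnegative x, x % 1024 is just a mask of the low 10 bits;
// that's why compilers make the power-of-two case so much cheaper.
let x = 123_456_789
assert(x % 1024 == x & 1023)
print(x % 1024)
```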
Okay, I added fast-paths for full-raw-binade ranges either with 0 as one of the bounds, or with equal-and-opposite bounds (eg. -1..<1).
I also ripped out the vast majority of the small-range handling and made it return early when possible. Instead of trying to be an optimization, it is now exclusively intended to ensure the large-range codepath won't spin the while true loop.
With the current implementation, the worst-case range for wide types (such as Float80) is something like 1.nextDown...1.5, which has less than a 1 in 2^60 chance of needing a second pass through the while loop.
(Narrow types like Float and Double will always succeed on ranges like that: their worst-case is 0x1p-62.nextDown...1, with less than a 1 in 2^62 chance of a second pass.)
Simplifying the small-range handling meant I could entirely eliminate the multi-binade helper functions that care about spare bits, so the file got a lot shorter. I didn't touch the large-range code, so it still has the same performance (except for the aforementioned fast-paths that skip it entirely).
The second and last rows (the fastest) are for full-raw-binade ranges.
That extreme slow-down for the third Float80 range (-greatestFiniteMagnitude ... 0, which might as well be eg -1 ... 0) is still there, and I still have no idea what causes it. Can you reproduce it?
Hmm, I wonder why the checksum in the 3rd row changed for Float and Double but not Float80. I would expect it to stay the same for all three types, since the range -greatestFiniteMagnitude ... 0 uses the standard codepath in both versions, which has not changed.
I'm noticing that the "fast path" is actually slower when using the SystemRandomNumberGenerator.
With both Float and Double, my implementation is about 2x slower than the stdlib's for 0..<1, but only 1.5x slower for 0..<1.5.
This makes sense, since the "fast path" requires 2 calls to the generator whereas the standard path only needs 1 call. When the generator is slow, that extra call has a major effect.
In contrast, with Xoroshiro128Plus, my implementation is about 4x slower for 0..<1, and about 12x slower for 0..<1.5.
When the generator is fast, a second call to it doesn't matter, and the simpler logic of the fast path does indeed make it faster.
Interestingly, with Float80 and the system generator, my implementation is only about 1.2x slower than the stdlib's on 0..<1, and 1.6x slower on 0..<1.5.
• • •
Edit:
I just added a check so the "fast path" is not used with SystemRandomNumberGenerator. And… it significantly sped up Float and Double on 0..<1, so they're only 1.2x slower than the stdlib with the system random number generator now.
It did bump Float80 down to 1.3x slower than the stdlib, but I'm much more interested in optimizing for Float and Double.
This is getting close!
• • •
Edit 2:
For many subnormal ranges like 0 ..< .leastNormalMagnitude/4, using the system RNG my implementation is about 0–5% faster than the stdlib for Float and Double, and about 35–40% faster for Float80.
And using xoroshiro, my implementation is 25–30% faster for Float and Double, and a whopping 45–50% faster for Float80.
I'm starting to think this might actually be viable.
Yes, but the fast path could make more efficient use of the random bits and thereby require 1 rather than 2 calls to generator.next(), at least for Float and Double (for Float80 it would require 2 calls), eg:
extension BinaryFloatingPoint where
    RawExponent: FixedWidthInteger,
    RawSignificand: FixedWidthInteger
{
    static func randomThroughRawExponent<R: RandomNumberGenerator>(
        _ maxExponentBitPattern: RawExponent,
        using generator: inout R
    ) -> Self {
        var bits: UInt64
        var tzSum = 0
        again: while true {
            bits = generator.next()
            let maxZS = Int(truncatingIfNeeded: maxExponentBitPattern) &- tzSum
            let tzs = min(bits.trailingZeroBitCount, maxZS)
            tzSum &+= tzs
            if tzs == 64 && tzSum < maxExponentBitPattern { continue again }
            let consumedBits = tzs &+ 1
            bits &>>= consumedBits
            if (64 &- consumedBits) < Self.significandBitCount {
                bits = generator.next()
            }
            break again
        }
        let e = maxExponentBitPattern &- RawExponent(truncatingIfNeeded: tzSum)
        let shifts = RawSignificand.bitWidth &- Self.significandBitCount
        let mask = ~RawSignificand(0) &>> shifts
        let s = RawSignificand(truncatingIfNeeded: bits) & mask
        return Self(sign: .plus, exponentBitPattern: e, significandBitPattern: s)
    }
}
Using SystemRandomNumberGenerator(), this implementation will (on average) be as fast as the current stdlib F.random(in: 0 ... powerOfTwo) when F is Float or Double, and it will actually be faster than the current stdlib implementation when F is Float80.
To support -powerOfTwo ..< powerOfTwo ranges, one more bit would be needed, but that shouldn't change the performance.
Using a fast pseudo random number generator, this implementation of the fast path seems to be about twice as fast as that of the current implementation (per this commit).
But for some reason my code is (significantly) slower for Float than for Double which seems counterintuitive (I'd expect it to be equally fast or maybe a tiny bit faster for Float).
EDIT: Ah, it turned out that the cause of this was the call to min (somehow) and an < operator between different integer types, as well as the optimizer happening to inline the method for Double but not for Float. So the above implementation was giving me this (counterintuitive) result:
Looking at the performance of the fast path in your current implementation, I see that it too is slightly slower for Float than for Double, although not as much as was the case for my code. Adding @inline(__always) to the relevant methods fixed that. With this change I see this (with WyRand):
Your fast path:
0.33 seconds for Float and Double, 2.11 seconds for Float80
My fast path (the improved version):
0.14 seconds for Float and Double, 1.01 seconds for Float80
Thus there should be no need to special case away (an improved implementation of) the fast path for the system generator.
Good point. I can rework the randomUpToExponent function to use the leftover bits for the significand when possible.
I can even introduce an allowNegative argument, so it will use a leftover bit to handle the symmetric-bounds case (eg. -1..<1), and thus avoid calling Bool.random().
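A rough sketch of that idea, with illustrative names of my own (the real code would draw within a section and handle the bounds via nextDown, rather than building a [0, 1) magnitude like this):

```swift
// Sketch: one generator word yields both a sign and a magnitude, so a
// symmetric range like -1..<1 needs no separate Bool.random() call.
func signedUniform<R: RandomNumberGenerator>(using generator: inout R) -> Double {
    var bits = generator.next()
    let negative = (bits & 1) != 0  // one leftover bit decides the sign
    bits &>>= 1                     // 63 random bits remain
    let magnitude = Double(bits &>> 11) * 0x1p-52  // 52 of those bits -> [0, 1)
    return negative ? -magnitude : magnitude
}

var rng = SystemRandomNumberGenerator()
let sample = signedUniform(using: &rng)  // always strictly between -1 and 1
print(sample)
```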
My first crack at it (using the system RNG, and compared to the stdlib's random(in:)) is 10% slower for Float and Double, and 15% faster for Float80, on both 0..<1 and -1..<1.
And on 0 ..< .leastNormalMagnitude, itâs 10% faster for Float and Double, and a remarkable 3x faster for Float80.
I did not yet eliminate the heterogeneous comparison, but I did locate and mark all of them in the file with a comment (// Heterogeneous compare). All of them are between RawExponent and either Int or UInt64.
Rather than special-case this one function, if those comparisons are really eating the performance then I'd like to take care of all of them together.
• • •
In your implementation, you unconditionally convert from RawExponent to Int. I'm not really comfortable with that, because (a) the standard library types use UInt, which could exceed Int.max (even though it can't actually for floating-point values), and (b) an arbitrary floating-point type could use a larger integer type.
I'm going to try to special-case by checking whether exponentBitCount < Int.bitWidth, and if not, then whether maxExp itself fits in an Int (eg. maxExp._binaryLogarithm() < Int.bitWidth &- 1).
Your bits variable is UInt64, so this can only put at most 64 bits into the significand. That's enough for the standard library types (including Float80), but for the general case it's not sufficient.
(The fix is pretty easy: first check if the number of available bits is at least significandBitCount, and if not then generate a random RawSignificand.)
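That fix might look something like the following sketch (the helper name and the availableBitCount parameter are assumptions of mine; in context it would be 64 minus the bits already consumed for the exponent):

```swift
extension BinaryFloatingPoint where RawSignificand: FixedWidthInteger {
    // Sketch: reuse leftover random bits for the significand when there are
    // enough of them; otherwise draw a whole fresh RawSignificand, which is
    // needed when significandBitCount exceeds what one 64-bit word covers.
    static func significandBits<R: RandomNumberGenerator>(
        fromLeftover bits: UInt64,
        availableBitCount: Int,
        using generator: inout R
    ) -> RawSignificand {
        let shifts = RawSignificand.bitWidth - significandBitCount
        let mask = ~RawSignificand(0) &>> shifts
        if availableBitCount >= significandBitCount {
            return RawSignificand(truncatingIfNeeded: bits) & mask
        }
        return RawSignificand.random(in: 0 ... mask, using: &generator)
    }
}
```

For Double the first branch always suffices (52 bits fit in any leftover word worth keeping); a hypothetical 128-bit type would take the fallback branch.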
• • •
Edit:
I just tested my work-in-progress implementation with xoroshiro on 0 ..< .leastNormalMagnitude, and compared to the stdlib mine is 5x faster for Float, 6x faster for Double, and a blazing 14x faster for Float80.
Cool, not using floating point * and + probably makes a lot of difference for subnormals, especially for Float80 which on top of those usually slow subnormal arithmetic operations also seems to be subject to a lot of missed optimization opportunities.
However, note that the stdlib implementation is around 15x (Float and Double) and 72x (Float80) slower for 0 ..< leastNormalMagnitude than it is for eg 0 ..< 1 or 0 ..< 1024.
The implementation in my previous post is about 14x faster than the stdlib for the subnormal range for Float and Double and 120x faster for Float80. But it doesn't handle the sign bit etc of course.
EDIT:
BTW, I have a very simple sketch for a PseudorandomNumberGenerator protocol, something which I think should be added to the Random API of the stdlib, since it's needed for eg initializing or restarting a PRNG with some specific state or seed in a generic context. I've needed it for writing tests for this stuff.
Here it is, together with the WyRand generator:
protocol PseudorandomNumberGenerator: RandomNumberGenerator {
    associatedtype State
    init(seed: UInt64)
    init?(state: State)
}

extension PseudorandomNumberGenerator {
    init(seed: UInt64? = nil) {
        self.init(seed: seed ?? UInt64.random(in: .min ... .max))
    }
}
/// The wyrand generator, translated from:
/// https://github.com/wangyi-fudan/wyhash
struct WyRand: PseudorandomNumberGenerator {
    typealias State = UInt64
    private(set) var state: State

    init(seed: UInt64) {
        self.state = seed
    }

    init(state: State) {
        // Every UInt64 value is a valid WyRand state, so no need for this
        // initializer to be failable; it satisfies the requirement anyway.
        self.state = state
    }

    mutating func next() -> UInt64 {
        state &+= 0xa0761d6478bd642f
        let mul = state.multipliedFullWidth(by: state ^ 0xe7037ed1a0b428db)
        return mul.high ^ mul.low
    }
}
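A quick note on why seedability matters for tests: the same seed must reproduce the same stream. Here is a self-contained demo, restating the wyrand core from above under a different name so the snippet stands alone:

```swift
// Minimal restatement of the wyrand step, just for this demo.
struct DemoWyRand: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0xa0761d6478bd642f
        let mul = state.multipliedFullWidth(by: state ^ 0xe7037ed1a0b428db)
        return mul.high ^ mul.low
    }
}

// Identically seeded generators produce identical streams, so a test that
// exercises, say, Double.random(in:using:) is fully reproducible.
var g1 = DemoWyRand(seed: 42)
var g2 = DemoWyRand(seed: 42)
let streamsMatch = (0..<8).allSatisfy { _ in g1.next() == g2.next() }
print(streamsMatch)
```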
This depends a lot on the platform. On Apple's recent ARM cores, there's zero penalty for operations with subnormal numbers. On x86, it can easily be 100x.
On platforms with slow subnormal operations, everything is going to be slow, so it's not really a concern if random number generation is also slow. This is an extremely niche corner case, that is broadly not worth worrying about (and certainly not worth designing around--if fast subnormals fall out naturally from other optimizations, great, but it doesn't make sense to spend software engineering effort on them).
Without any constraints on State, init(state:) is almost useless as a protocol requirement (it remains useful on concrete types, of course). What's the reason to put it on the protocol?