Generating Double values in unit range using Random API is about 45 times slower than necessary

Jens · May 26, 2018, 8:48pm

I've been using my own little random api for some years, and as SE-0202: Random Unification is now implemented in recent snapshots, I've started to try it out, adding some PRNGs and doing some performance testing.

My use cases are image and audio processing and other situations where what I need is a PRNG that is as fast as possible but with sufficiently good quality. So the default RNG, Random.default, is not an option for me (it's not seedable and much too slow).

Therefore I do my performance testing using PRNGs that I have previously implemented and tested, such as SplitMix64 and Xoroshiro128Plus.

Now to my observation about generating Double values in the half open unit range [0, 1).

Random floating point values in this particular range (and a couple of other ones) are very common in my use cases. The following command line program demonstrates that using the method provided by the Random API for this:

let rndDoubleInUnitRange = Double.random(in: 0 ..< 1, using: &rg)

is about 45 times slower than using an alternative method (see implementation in the program below):

let rndDoubleInUnitRange = Double.randomInUnitRange(using: &rg)

I think I know why this is: The former method (the one provided by the Random API) calls the generator's next(upperBound:) internally, which performs a loop etc, in order to support any range, and it doesn't have special cases for the unit range or other common ranges for which it suffices to use (some of) the 64 bits from the generator's next() directly.

Note that I haven't investigated this thoroughly so it might be that next(upperBound:) doesn't have to loop (and call .next() repeatedly), for the particular range 0.0 ..< 1.0, and if so, the slow down is caused by something else, or my tests are flawed.

Question: I'd like to be able to use the methods that the Random API provides rather than rolling my own, but it will not be possible in cases like this (since it makes my code 45 times slower if I do). Would it be possible/sensible to modify the implementation of FloatingPoint.random(in: using: ) by adding checks for unit range and perhaps a couple of other common ranges, eg [-1, 1), to see if it would get about 45 times faster for these common ranges, or perhaps the checks (short paths?) should actually go in next(upperBound:)?

Or perhaps none of that is possible or sensisble, and it would be better to add separate methods, like my .randomInUnitRange(using: )?

Or is this considered too niche for motivating changes or additions to the Random API? (If so, then I guess I'll just have to forget about the Random API and continue using my own.)

Here's the demo program, in case anyone would like to have a critical look (don't forget -O / release build! And note that you need a recent dev snapshot with the Random API implementation to compile it):

import AppKit

// Just as a side note, I'm including this protocol even though it isn't
// needed or relevant for this thread or demo, but it might be interesting to
// start a separate thread discussing if we should add something like this
// protocol to the Random API.
protocol PseudoRandomNumberGenerator : RandomNumberGenerator {
    associatedtype State
    init(state: State)
    init(seed: UInt64)
}
extension PseudoRandomNumberGenerator {
    // All PRNGs get an init() which seeds the PRNG with a truly
    // random seed (from the default RNG).
    init() {
        self.init(seed: Random.default.next())
    }
}

// I'll be using the following simple PRNG for this demonstration since it is
// about 200 times faster than the default RNG Random.default (on my machine),
// and it results in less fluctuating timing results than when using Random.default.
// See the comments at the end for my speculations about why that might be.
struct SplitMix64 : PseudoRandomNumberGenerator {
    // Based on http://xoshiro.di.unimi.it/splitmix64.c
    // It is a very fast generator passing BigCrush.
    var state: UInt64
    init(state: UInt64) {
        self.state = state
    }
    init(seed: UInt64) {
        self.init(state: seed)
    }
    mutating func next() -> UInt64 {
        state &+= 0x9e3779b97f4a7c15
        var z = (state ^ (state &>> 30)) &* 0xbf58476d1ce4e5b9
        z = (z ^ (z &>> 27)) &* 0x94d049bb133111eb
        return z ^ (z &>> 31)
    }
}

// This is the interesting part. It is a method that produces a Double value
// in unit range [0, 1), just like Double.random(in: 0 ..< 1, using: &rng).
extension Double {
    // This seems to be almost 50 times faster than
    // Double.random(in: 0 ..< 1, using: &rg)
    static func randomInUnitRange<R>(using rng: inout R) -> Double
        where R: RandomNumberGenerator
    {
        let ui64 = rng.next()
        return Double(ui64 >> 11) * (Double.ulpOfOne / 2)
    }
}

func executionTime<C, R>(
    of count: Int,
    _ operation: (C, inout R) -> C,
    using rng: inout R) -> Double
    where C: Numeric, R: RandomNumberGenerator
{
    var checksum: C = 0
    let t0 = CACurrentMediaTime()
    for _ in 0 ..< count { checksum = operation(checksum, &rng) }
    let t1 = CACurrentMediaTime()
    let time = t1 - t0
    print("  time:", time, "seconds (checksum: \(checksum))")
    return time
}

@_transparent // The non-baseline tests will be about 4 x slower without this …
func test<C, R>(label: String,
                using rng: inout R,
                operation: (C, inout R) -> C) -> Double
    where C: Numeric, R: RandomNumberGenerator
{
    print("\n\(label):")
    var minTime = Double.infinity
    for _ in 0 ..< 7 {
        let time = executionTime(of: 10_000_000, operation, using: &rng)
        minTime = min(minTime, time)
    }
    print("  minimum time:", minTime, "seconds")
    return minTime
}

func test<R: RandomNumberGenerator>(using rng: inout R) {
    let minTimeOfExistingMethod =
        test(label: "Existing method", using: &rng) { (checksum, rg) in
            return checksum + Double.random(in: 0 ..< 1, using: &rg)
    }
    let minTimeOfMyMethod =
        test(label: "My method", using: &rng) { (checksum, rg) in
            return checksum + Double.randomInUnitRange(using: &rg)
    }
    let minTimeOfBaseline =
        test(label: "Baseline", using: &rng) { (checksum, rg) in
            return checksum ^ rg.next()
    }
    
    let ratio = (minTimeOfExistingMethod - minTimeOfBaseline) /
        (minTimeOfMyMethod - minTimeOfBaseline)
    
    print("\nMy method was \(ratio) times faster than the existing method.")
}


var prng = SplitMix64()
test(using: &prng)

// test(using: &Random.default)
// Uncomment the above line if you're interested in seeing the results when
// using Random.default instead of SplitMix64. You'll have to wait much longer
// and the result will be very erratic. Lowering the number of operation-calls per
// test will make it even more erratic. I suppose this is because the time it
// takes to call the underlying random function is fluctuating a bit and
// because it is so large compared to the time it takes to perform
// the UInt64 -> Double in unit range conversions.

Here's an example compile and run on my MBP, 2 GHz i7, late 2013:

$ /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-05-22-a.xctoolchain/usr/bin/swiftc -O -sdk /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk test.swift
$ ./test

Existing method:
  time: 0.30905857200014 seconds (checksum: 4999285.093711603)
  time: 0.30168970900012937 seconds (checksum: 5000657.718667782)
  time: 0.30922266299967305 seconds (checksum: 5000852.352980945)
  time: 0.29837784899973485 seconds (checksum: 4999644.374941994)
  time: 0.29741116300101567 seconds (checksum: 4999527.347985679)
  time: 0.30160584000077506 seconds (checksum: 5001061.41999039)
  time: 0.3061574759994983 seconds (checksum: 5000080.011050173)
  minimum time: 0.29741116300101567 seconds

My method:
  time: 0.016804585999125266 seconds (checksum: 4999748.809723855)
  time: 0.01670135200038203 seconds (checksum: 4999867.282415561)
  time: 0.017937091000931105 seconds (checksum: 5000992.65858493)
  time: 0.0171803799985355 seconds (checksum: 5001887.357391258)
  time: 0.01743405500019435 seconds (checksum: 5001176.764179978)
  time: 0.017931289999978617 seconds (checksum: 4998501.019897718)
  time: 0.017273727000429062 seconds (checksum: 4999930.674665641)
  minimum time: 0.01670135200038203 seconds

Baseline:
  time: 0.011295344000245677 seconds (checksum: 434670723333784857)
  time: 0.011931514000025345 seconds (checksum: 565380713243076264)
  time: 0.01088286199956201 seconds (checksum: 7609415011435692545)
  time: 0.010823972999787657 seconds (checksum: 7875593923504742495)
  time: 0.01154418300029647 seconds (checksum: 17857085083325635480)
  time: 0.011229211000681971 seconds (checksum: 3830828303701629253)
  time: 0.011102663000201574 seconds (checksum: 7544674021828474327)
  minimum time: 0.010823972999787657 seconds

My method was 48.761053179018354 times faster than the existing method.

(I assume that the situation would be about the same for Float, but I haven't tested it, since it is actually worse, since you can get two Float values, or a simd float2, from a single UInt64 bit pattern, and I guess having methods that returns a random (Float, Float) or float2 in unit range (like I have in my own little random api) would be crossing the line of too niche ... which is kind of sad in a way, because scenarios requiring things like this constitutes my (and I guess at least some other people's) only real need for a Random API.)

palimondo · May 26, 2018, 9:34pm

I haven’t run your code yet (will do on Monday).…
What do you see when you run this through Instruments with Time Profiler?

From cursory glance at the code, your test harness has a lot of generic code, which gives compiler plenty of rope to hang itself. I suggest you repackage your tests as 2 standalone benchmarks (system rng and SplitMix64, defined in one file) for Swift Benchmark Suite. If your hunch that there are performance improvements to be extracted from special-casing the unit range, it would be good to add the benchmark to SBS before such improvements are merged to main tree.

Keep your workload small, make the benchmark run in less then 2500 microseconds, otherwise that’s additional source of measurement noise. See my earlier research on the topic for more info.

Jens · May 26, 2018, 9:48pm

FWIW, I originally wrote the same test, but using no generic code at all, getting the same result.
I added the generics after that, in order to make a compact but somewhat "full-fledged" demo.

I had never noticed the need for @_transparent there if I hadn't written the test the simple way first.

Note that this is not to say that my testing is not flawed, it might well be! But I noticed the same issue in a very different situation so I guess there's at least a pretty good chance that I'm at least almost correct. : )

I'm currently leaning more towards a feeling that the real source of the slow down might be in .next(upperBound:), that it might be looping unnecessarily for upperBounds that are powers of two, for which it could just call next() once, perform a shift and return the result. But I might be totally wrong about this.

palimondo · May 26, 2018, 9:50pm

No need to guess, examine it with Instruments.

Jens · May 26, 2018, 9:56pm

I checked "high frequency" and I've selected a section where the slow existing method should be running (click link at bottom bar for high res):

I'm far from an Instruments power user, and I can't make anything out of looking at this, I have no idea what RandomNumberGeneratorbytes is for example.

Alejandro · May 26, 2018, 10:48pm

For types like UInt64, UInt32, etc, this is incorrect. Given that T.max % upperBound == upperBound - 1, we return next() % upperBound. In fact, the odds of this ever looping more than once is probably very very minuscule. The performance drop is most likely in the two % operations that must be done in order to spit out a number from next(upperBound:) which @scanon posted about in the implementation.

Jens · May 26, 2018, 10:55pm

~~Ah, ok, thanks for the explanation.~~

EDIT: After re-reading this part of your reply:

and writing the below (much smaller) reformulation of the test in my original post, I realize that my response should be:

Note that in the case of generating these unit range doubles, we use a random UInt64 value in the range
0 ..< (1 << 53). The (exclusive) upper bound here is a power of two, but it isn't one of
(1 << 8), (1 << 16), (1 << 32) or (1 << 64) as is the case for "types like UInt64, UInt32, etc".

That said, I'm sure you're correct in that it doesn't perform more than one iteration of the loop for upperBound == (1 << 53), and it's probably the % and conditional statements that takes more time than is perhaps necessary.

Anyway, the following program demonstrates the same issue as before but in a different way, which I think makes it clearer that a significant part of the slow down is due to .next(upperBound:) not being as efficient as it probably could be, for upperBounds that are powers of two, like (1 << 53) == 0x00200000_00000000.

It would be nice if `.next(upperBound:) could be rewritten to be faster for these cases (eg perhaps using the fact that .significandWidth == 0 for powers of two, and using shifting to get the right amount of bits, removing branching in favor of calculation if possible, etc.). I tried to find the post in the implementation by @scanon that you referred to but couldn't. Do you have a link?

Here is the reformulated smaller test program:

import AppKit
    
struct SplitMix64 : RandomNumberGenerator {
    // Based on http://xoshiro.di.unimi.it/splitmix64.c
    var state: UInt64
    init(seed: UInt64) { self.state = seed }
    init() { self.state = Random.default.next() }
    mutating func next() -> UInt64 {
        state &+= 0x9e3779b97f4a7c15
        var z = (state ^ (state &>> 30)) &* 0xbf58476d1ce4e5b9
        z = (z ^ (z &>> 27)) &* 0x94d049bb133111eb
        return z ^ (z &>> 31)
    }
}

func test() {
    var prng = SplitMix64(seed: 1234567)
    for _ in 0 ..< 10 {
        var checksum: Double = 0
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 10_000_000 {
            //-----------------------------------------------------------------
            // Uncomment one of the two methods (A or B) for getting a random
            // UInt64 value in the range 0 ..< (1 << 53), which is used for the
            // random Double value in unit range 0.0 ..< 1.0.
            //-----------------------------------------------------------------
            // Note that using A is much faster than using B.
            // This would not be the case if .next(upperBound:) didn't perform
            // unnecessary work for upperBounds that are powers of two.
            // (0x00200000_00000000 == 1 << 53 is a power of two.)
            //-----------------------------------------------------------------
            
            // A | min time about 0.017 seconds on my machine:
            let bp = prng.next() &>> 11 // (64 - 11 == 53)
            
            // B | min time about 0.218 seconds on my machine:
            //let bp = prng.next(upperBound: 0x00200000_00000000 as UInt64)
            
            //-----------------------------------------------------------------
            // We are by-passing `Double.random(in: 0 ..< 1, using: &prng)`:
            //-----------------------------------------------------------------
            let rndDoubleInUnitRange = Double(bp) * (.ulpOfOne / 2)
            checksum += rndDoubleInUnitRange
        }
        let t1 = CACurrentMediaTime()
        print("time:", t1 - t0, "seconds (checksum: \(checksum)")
    }
}

test()

As a side-note, look at the error that you get if you remove "as UInt64" from the following:

let bp = prng.next(upperBound: 0x00200000_00000000 as UInt64)

I saw here that you were wondering about what seems to be the same misleading diagnostics. So I thought you might find it interesting. I filed SR-7786 for this.

benrimmington · May 27, 2018, 11:35am

RandomNumberGenerator.next(upperBound:)

https://github.com/apple/swift/pull/16413#discussion_r186310072

FWIW, in arc4random_uniform, we choose the smallest N such that 2^N is greater than or equal to upperBound, and then pull N bits at a time from the random source until we get a number smaller than upperBound. This avoids needing to compute two % operations, which is pretty significant.

BinaryFloatingPoint.random(in:using:)

https://github.com/apple/swift/pull/16413#discussion_r186310815

This implementation is not quite right; I believe that it has a slight bias to numbers with even significands, will not generate some values when the range spans the origin, and will barf when upperBound - lowerBound overflows. I should be able to provide a fixed implementation sometime later this week.

Jens · May 27, 2018, 4:58pm

Sorry for filling up this thread with intermediate results as I go, but it seems like .next(upperBound:) is spending most of its time in T._random(using:). This can be verified by adding the following test case to the program in my previous post:

        // C | min time about 0.218 seconds on my machine:
        let bp = UInt64._random(using: &prng) &>> 11

So C is identical to A except that it calls UInt64._random(using: &prng) instead of prng.next(), but it is much slower, it's actually "exactly" as slow as B.

For my test, it looks like T._random(using:) should be taking this fast path:

if bitWidth <= UInt64.bitWidth {
    return Self(truncatingIfNeeded: generator.next() as UInt64)
}

so I can't understand why it's so much slower than generator.next().

Might it be because it's missing an @inlinable or something?

Alejandro · May 27, 2018, 6:09pm

I think I see the problem now. The stdlib integer types actually have their own defined version of _random(using:) which calls RandomNumberGenerator._fill(bytes:). Here is where the slowdown may be occuring: swift/Random.swift at main · apple/swift · GitHub (which is missing a @inlinable). CC: @lorentey . I remember we used this design so that Random could implement _fill(bytes:) using _stdlib_random for stdlib integer types. I have a few ideas that we could do to remove this slowdown.