SE-0202: Random Unification

Jens · April 19, 2018, 10:41pm

Hmm, my Xoroshiro128Plus final class implementation seems to be about 40 (!) times faster than your Xorshift128Plus, i.e., it performs 400 million masking additions in the same time as your Xorshift128Plus does 10 million. Yes, they are different generators, but I'm assuming that Xoroshiro128Plus isn't 40 times faster than Xorshift128Plus algorithmically?

Anyway, here's what I did:

I converted your code into a Command Line App (just a few minor adjustments) as I don't like to do performance metrics in any other way (except on some iOS device when necessary of course).

These are the results that I got (on my MBP late 2013, 2 GHz Intel Core i7):

arc4random (2 calls)             1.4396      4294967295
arc4random_buf                   2.2526       648368209
Xorshift128Plus (struct)         0.1782       610931914
Xorshift128Plus                  0.7387       707966945
ThreadLocking<Xorshift128Plus>   1.3608      2348117867
LinearCongruential (struct)      0.1701      4010776896
LinearCongruential               0.5071      1913470656
ThreadLocking<LinearCongruenti   1.0558      4288772416

They look very similar to yours (except that the class versions are not marked with "(class)" by the code in the gist).

Running my entirely separate (but I think essentially equivalent) test program (see below), that is testing only my implementation of Xoroshiro128Plus (and again, it is a different generator than Xorshift128Plus, but I'm guessing that it's not 40 times faster algorithmically), I get this:

time: 0.0169437950244173 seconds (checksum: 11605410945142439131 )
time: 0.0169420730089769 seconds (checksum: 17241852839264922271 )
time: 0.0168698579072952 seconds (checksum: 17685997167798807342 )
time: 0.0165533649269491 seconds (checksum: 8398418167523955030 )
time: 0.0166222950210795 seconds (checksum: 2212566004850123018 )

Would you be interested in checking out my test program, verifying that it is relevant to compare it to your test, and maybe adding Xoroshiro128Plus (as a final class) to your test to see if it somehow gets slower when doing so (or, which I doubt, 40 times faster than Xorshift128Plus)?

Here's my test program (just copy paste it into the main of a fresh Command Line Project in Xcode 9.3, default toolchain):

import AppKit

protocol RandomGenerator : class {
    /// Returns the next random bit pattern and advances the state of the
    /// random generator.
    func next() -> UInt64
}

/// A pseudo random UInt64 bit pattern generator type.
///
/// The generated UInt64 values can be converted to other types by using e.g.
/// extensions on UInt64 for converting it to Double or float2 in unit range.
///
/// A random generator only has to implement two initializers and the
/// next() -> UInt64 method.
protocol PseudoRandomGenerator : RandomGenerator {
    associatedtype State
    
    /// The current state of the random generator.
    var state: State { get }
    
    /// Creates a new random generator with the given state. The initializer
    /// fails if the given state is invalid according to the random generator.
    init?(state: State)
    
    /// Creates a new random generator with a state that is determined by
    /// `seed`. Each `seed` must result in a unique valid state.
    init(seed: UInt64)
}

// NOTE: The SplitMix64 is included here only because it is used to scramble
// the seed of Xoroshiro128Plus, this is the way I have it in my original code
// and I've just copy pasted these in here so ...

/// The splitmix64 generator, translated from:
/// http://xorshift.di.unimi.it/splitmix64.c
final class SplitMix64 : PseudoRandomGenerator {
    var state: UInt64
    /// Every UInt64 value is a valid SplitMix64 state.
    init(state: UInt64) { self.state = state }
    init(seed: UInt64) { self.state = seed }
    func next() -> UInt64 {
        state = state &+ 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> UInt64(30))) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> UInt64(27))) &* 0x94D049BB133111EB
        return z ^ (z >> UInt64(31))
    }
}

final class Xoroshiro128Plus : PseudoRandomGenerator {
    var state: (UInt64, UInt64)
    /// The state of Xoroshiro128Plus must not be all zeros.
    init?(state: (UInt64, UInt64)) {
        guard state.0 != 0 || state.1 != 0 else { return nil }
        self.state = state
    }
    init(seed: UInt64) {
        // Uses SplitMix64 to scramble the given seed into a valid state:
        let sm = SplitMix64(seed: seed)
        state = (sm.next(), sm.next())
    }
    func next() -> UInt64 {
        func rol55(_ x: UInt64) -> UInt64 {
            return (x << UInt64(55)) | (x >> UInt64(9))
        }
        func rol36(_ x: UInt64) -> UInt64 {
            return (x << UInt64(36)) | (x >> UInt64(28))
        }
        let result = state.0 &+ state.1
        let t = state.1 ^ state.0
        state = (rol55(state.0) ^ t ^ (t << UInt64(14)), rol36(t))
        return result
    }
}

extension PseudoRandomGenerator {
    /// Creates a new pseudo random generator seeded with a cryptographically
    /// secure random seed.
    init() {
        var seed: UInt64 = 0
        withUnsafeMutableBytes(of: &seed) { (ptr) -> Void in
            let sc = SecRandomCopyBytes(nil, ptr.count, ptr.baseAddress!)
            precondition(sc == errSecSuccess)
        }
        self.init(seed: seed)
    }
}


func test() {
    let rg = Xoroshiro128Plus()
    let sampleCount = 5 // Number of times to perform the test.
    let iterationCount = 10_000_000 // Number of values to sum.
    for _ in 0 ..< sampleCount {
        var cs = 0 as UInt64
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< iterationCount {
            cs = cs &+ rg.next()
        }
        let t1 = CACurrentMediaTime()
        print("time:", t1 - t0, "seconds (checksum:", cs, ")")
    }
}
test()

EDIT: Gah! I can't believe I actually made the Debug/Release mistake ... (double checked this right after posting the above). The correct results for your test program (converted to Command Line) on my machine are:

arc4random (2 calls)             0.5864      4294967295
arc4random_buf                   2.2445       733145136
Xorshift128Plus (struct)         0.0129      3613348595
Xorshift128Plus                  0.0129      2955512274
ThreadLocking<Xorshift128Plus>   0.2393      3016061566
LinearCongruential (struct)      0.0127      3534163904
LinearCongruential               0.0127      3739340608
ThreadLocking<LinearCongruenti   0.2270      3775278784

Which makes much more sense ...
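
One way to catch this kind of mistake before trusting any timings is to have the test program report whether it was built without optimization. The following is only a sketch: `isDebugBuild` is a made-up helper name, not something from either test program. It relies on the standard library's documented behavior that the condition passed to `assert` is only evaluated in -Onone builds (which is what Xcode's Debug configuration uses by default).

```swift
/// Returns true when the program was compiled without optimization (-Onone).
/// `isDebugBuild` is a hypothetical helper name used for illustration.
func isDebugBuild() -> Bool {
    var debug = false
    // The closure below only runs when assert conditions are evaluated,
    // i.e. in -Onone builds. In -O builds it is skipped entirely.
    assert({ () -> Bool in debug = true; return true }())
    return debug
}

if isDebugBuild() {
    print("warning: unoptimized build, timings will not be representative")
}
```

Calling something like this at the top of the test would have printed the warning for the first set of numbers above.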

I'm leaving this here as a warning example to others, and because: Look! The results for the struct and class versions of your Xorshift128Plus and LinearCongruential are now the same!

(yes, I ran the test a couple of times until they were exactly the same.)

I'm afraid it looks an awful lot like you might have done the old Release/Debug mistake too ... : ) Did you?

If so, the conclusion is that there is indeed no difference between having them as final class vs struct, once the optimizer has been allowed to do its work. This is in line with my previous experience and experiments in this exact matter.
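
For anyone who wants to repeat the struct vs final class comparison on the same generator, here is a minimal sketch. The names `rotl`, `Xoroshiro128PlusStruct`, `Xoroshiro128PlusClass`, `structChecksum`, `classChecksum` and `measure` are made up for this example, the starting state is an arbitrary nonzero value, and timing uses `DispatchTime` instead of `CACurrentMediaTime` so that it does not depend on AppKit. The two variants use the same update step as the `next()` method above, so they must produce identical checksums; the timings are only meaningful in a Release (-O) build.

```swift
import Dispatch

// Rotate left, assuming 0 < k < 64. Hypothetical helper replacing rol55/rol36.
@inline(__always)
func rotl(_ x: UInt64, _ k: UInt64) -> UInt64 {
    return (x << k) | (x >> (64 &- k))
}

// Struct variant: the state is mutated in place via a mutating method.
struct Xoroshiro128PlusStruct {
    var state: (UInt64, UInt64)
    mutating func next() -> UInt64 {
        let result = state.0 &+ state.1
        let t = state.1 ^ state.0
        state = (rotl(state.0, 55) ^ t ^ (t << 14), rotl(t, 36))
        return result
    }
}

// Final class variant with the same update step.
final class Xoroshiro128PlusClass {
    var state: (UInt64, UInt64)
    init(state: (UInt64, UInt64)) { self.state = state }
    func next() -> UInt64 {
        let result = state.0 &+ state.1
        let t = state.1 ^ state.0
        state = (rotl(state.0, 55) ^ t ^ (t << 14), rotl(t, 36))
        return result
    }
}

// Sums `count` outputs of the struct generator with wrapping addition.
func structChecksum(state: (UInt64, UInt64), count: Int) -> UInt64 {
    var rg = Xoroshiro128PlusStruct(state: state)
    var cs: UInt64 = 0
    for _ in 0 ..< count { cs = cs &+ rg.next() }
    return cs
}

// Sums `count` outputs of the class generator with wrapping addition.
func classChecksum(state: (UInt64, UInt64), count: Int) -> UInt64 {
    let rg = Xoroshiro128PlusClass(state: state)
    var cs: UInt64 = 0
    for _ in 0 ..< count { cs = cs &+ rg.next() }
    return cs
}

// Runs `body` once, prints the elapsed time and checksum, returns the checksum.
func measure(_ label: String, _ body: () -> UInt64) -> UInt64 {
    let t0 = DispatchTime.now().uptimeNanoseconds
    let cs = body()
    let t1 = DispatchTime.now().uptimeNanoseconds
    print(label, Double(t1 - t0) / 1e9, "seconds (checksum:", cs, ")")
    return cs
}

let startState: (UInt64, UInt64) = (0x9E3779B97F4A7C15, 0xBF58476D1CE4E5B9) // arbitrary, nonzero
let iterationCount = 10_000_000
let a = measure("struct:     ") { structChecksum(state: startState, count: iterationCount) }
let b = measure("final class:") { classChecksum(state: startState, count: iterationCount) }
precondition(a == b, "both variants must generate the same sequence")
```

The checksum is printed so that the optimizer cannot discard the loop, same as in the test program above.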