Time to make jrand48 and friends available again?

dabrahams · April 3, 2020, 2:56am

I've just discovered that if I want a reasonably-good seeded (repeatable) pseudo-random number generator I have to code it by hand, from the ground up, because such Darwin module facilities as jrand48 are still unavailable. Now that we have easy, accessible, high-quality truly random random numbers, shouldn't we think about making it possible to call those other functions from Swift?

technogen · April 3, 2020, 5:24am

I’d love to have a seeded RNG for debugging purposes! It’s very helpful to replace the true RNG with a seeded one in debug builds, log the seed at the start of the process and allow specifying a custom seed through the command-line arguments. This way, an app that uses RNG misbehaves, one can easily plug the previous seed into the next launch and get the exact same values, which is critical for bug reproducibility!

Jens · April 3, 2020, 7:25am

Would it be motivated to add some kind of PseudorandomNumberGenerator protocol to the stdlib?

There was a short discussion about that yesterday, here:

I guess there are a lot of design decisions to make for such a protocol, ie apart from making it possible to manage state and seeding, it could also support things like skip(count:) / jump-ahead / jump-back for PRNGs where that is available.

jenox · April 3, 2020, 10:21am

Sorry, but how do we have truly random numbers? Anything the computer can generate for us is pseudorandom at best. Maybe SystemRandomNumberGenerator does a better job at appearing truly random — but it's still a PRNG at the end of the day. Or am I missing something obvious here?

benrimmington · April 3, 2020, 10:24am

@scanon wrote in apple/swift-numerics#101:

I plan to provide a random module in SN, including both sampling from the usual distributions and some additional generators beyond what the stdlib has (especially one or more fast seeded generators; I have a branch somewhere with a quick draft of a PCG implementation, for example). Obviously this would be an easy first thing for such a module.

PCG: Permuted Congruential Generator.

Jens · April 3, 2020, 10:44am

It is not seedable / repeatable, which I think is commonly implied when using the term PRNG rather than RNG.

From the documentation of SystemRandomNumberGenerator:

While the system generator is automatically seeded and thread-safe on every platform, the cryptographic quality of the stream of random data produced by the generator may vary. For more detail, see the documentation for the APIs used by each platform.

Apple platforms use arc4random_buf(3) .

Linux platforms use getrandom(2) when available; otherwise, they read from /dev/urandom .

Windows uses BCryptGenRandom .

https://www.2uo.de/myths-about-urandom/

lukasa · April 3, 2020, 10:44am

You aren't missing anything, we don't have truly random numbers, and truly random numbers aren't desirable for almost any use-case anyway.

Also, despite my concerns raised in Clarify the cryptographic properties of SystemRandomNumberGenerator, it's been made explicitly clear that SystemRandomNumberGenerator is not required to be a CSPRNG. That means that the RandomNumberGenerator protocol does not imply a CSPRNG either, and so we don't really need a new protocol for non-cryptographically secure random number generators.

We could have a protocol for a sendable random number generator, but I doubt that's very useful. The only thing it adds is a init(seed:) method to RandomNumberGenerator, and given that the whole point of having seedable generators is that they're replayable, I can't imagine a world where you care what the seed is, but not what the RNG is.

Jens · April 3, 2020, 10:47am

Writing generic code for simulation or testing and adding support for replaying a specific scenario given a certain state and/or seed (having to hard code a specific generator into that code would be strange).

Ie, the pair of a generator and seed / state is what is needed to replay a scenario.

lukasa · April 3, 2020, 10:53am

Sure, but that's supportable today. You're generic over RandomNumberGenerator.

Consider this code, which you can write today:

class Simulation<RNG: RandomNumberGenerator> {
    private var rng: RNG

    init(_ rng: RNG) {
        self.rng = rng
    }

    func simulate() {
        // Do some stuff, generating random values out of RNG
    }   
}

let simulation = Simulation(SystemRandomNumberGenerator())

This is completely expressible today. If you want to use a sendable random number generator, there are two ways to go. One is we can make seedability a property of a protocol and amend to this:

class Simulation<RNG: SeedableRandomNumberGenerator> {
    private var rng: RNG

    init(seed: RNG.Seed) {
        self.rng = RNG.init(seed: seed)
    }

    func simulate() {
        // Do some stuff, generating random values out of RNG
    }   
}

let simulation = Simulation<MersenneTwister>(seed: seed)

Or, alternatively, you can just have an RNG whose init method takes a seed, and then pass it as a value to Simulation. This requires no change to Simulation, you just do this:

let simulation = Simulation(MersenneTwister(seed: seed))

The only reason I can see to make SeedableRandomNumberGenerator a protocol is so that you can instantiate it with a seed in a generic context. But I don't see any case where that's useful: if you're trying to make things replayable, the initial state vector is not seed, it's the tuple of (RNG, seed). Why allow those handles to be controlled separately?

Jens · April 3, 2020, 11:10am

I think this is getting off topic, and I don't want to spend more time motivating the need for a PseudorandomNumberGenerator protocol, you might be right that there is no real need for it (any apparent need would go away by reformulating the problem in some (better?) way) and if so then so be it.

But here is an important semantic difference between an RNG like the system generator and a deterministic seedable PRNG.

func test<R: RandomNumberGenerator>(using generator: inout R) {
    var a = generator
    var b = generator
    print(a.next() == b.next())
}
var srng = SystemRandomNumberGenerator()
var prng = WyRand()
test(using: &srng) // false (except possibly about once in 2^64)
test(using: &prng) // true (always)

This difference could be a semantic requirement of a PseudorandomNumberGenerator protocol.

As a followup question to the one of the OP, is there a reason why eg jrand48() is not available when drand48() is (via Darwin)?

dabrahams · April 3, 2020, 12:20pm

The arc4random man page says, “The subsystem is re-seeded from the kernel random number subsystem on a regular basis, and also upon fork(2).” My understanding has always been that the kernel RNG uses a source of true entropy. Am I mistaken?

dabrahams · April 3, 2020, 12:30pm

Well, my point was that some of these "low-quality" RNG APIs, especially the obscure ones, are probably not much of a security risk now that we have first-class a RNG in the standard library. But upon consideration, I know it's Apple's right to make its own business decisions about what to expose with the Darwin module, and opinions voiced here probably won't change those decisions.

In the meantime, I'm using this:

/// A linear congruential pseudo-random number generator.
struct LinearCongruential: RandomNumberGenerator {
  /// The last value returned by `self.next`, or what `self` was seeded with.
  private var lastValue: UInt64

  /// Creates an instance with the given seed.
  ///
  /// Instances created with the same seed produce the same sequence of
  /// pseuedo-random results from `next()`.
  init(seed: UInt64 = 0) {
    lastValue = seed
  }

  /// Returns a value from a uniform, independent distribution of binary data.
  mutating func next() -> UInt64 {
    // A "good value" chosen from https://arxiv.org/pdf/2001.05304.pdf (Steele,
    // Guy; Vigna, Sebastiano (15 January 2020). "Computationally easy,
    // spectrally good multipliers for congruential pseudorandom number
    // generators". arXiv:2001.05304 [cs.DS].)
    let a: UInt64 = 0xaf251af3b0f025b5
    // As long as “m”, the modulus value, is a power of 2, the generator is
    // insensitive to the value of c, provided they are relatively prime.
    let c: UInt64 = 1
    lastValue = lastValue &* a &+ c
    return lastValue
  }
}

LemonCupcake · April 3, 2020, 1:31pm

That's a good follow-up question. I'm not sure, maybe because its not as performant because drand48() avoids int to double conversion. Also, because of a design decision as it's not required for most applications (quoting @lukasa).

But I'd be interested to know the real reason as well.

scanon · April 3, 2020, 1:50pm

In fairness, jrand48 is objectively not great, even separate from security concerns. 48 bits of internal state is insufficient for many purposes, and simply returning the high-order bits of state doesn't get as much mixing as you'd like to have (producing significant statistical issues), and you're not getting a meaningful speed advantage over other, better choices in exchange. The PCG and Xoroshiro families, in particular, are better choices along basically every dimension, and there are a number of other good choices as well.

I absolutely agree that we should provide some seeded PRNGs out of the box, but jrand48 and other "legacy" PRNGs shouldn't be included in that (both for the reasons I identified above, and because the C library interfaces have implicit state, so they are unpredictable under reentrancy).

In order to make seedable PRNGs truly useful, we also need to improve the APIs that transform the raw PRNG output into numbers (e.g. random(in: Range)) so that they can be updated to improve performance and fix bugs while allowing users to opt into getting exactly the same sequences for all time.

anandabits · April 3, 2020, 3:04pm

A seeded random number generator is one of my top wishes for the standard library. I'd love to see this topic move forward and become a proposal.

Nobody1707 · April 3, 2020, 10:27pm

I'm not sure if any part of Swift uses this, but the RDRAND instruction on my Ryzen uses a series of ring oscillators to produce bits which are truly random (or as close as is physically possible).

scanon · April 3, 2020, 10:38pm

The system is free to use RDRAND to implement the API that the Swift API uses. Some do.

dabrahams · April 4, 2020, 6:57pm

In my case, anything random-ish and repeatable is truly useful, especially when compared to having no way to do it at all. True randomness is just not that important for regression testing most of the time. That's why I started by proposing to make something as “objectively not-great” as jrand48 available. It would at least give me a way of doing this without having to spend an hour or more in research to reimplement an approach that isn't just objectively awful.

scanon · April 4, 2020, 9:17pm

The flaws that I'm talking about are not a departure from true randomness; I'm talking about clear failures on mainstream statistical tests (e.g. testU01 (pdf¹)), which have real implications for how meaningful coverage is when they're employed for test-case generation. Failing these statistical checks means that when you generate test cases using these generators, you are susceptible to getting less meaningful coverage that you would be getting if you used a higher-quality generator. That's not always the end of the world, but in certain catastrophic cases, it can lead to very meaningful breakdowns in test coverage (particularly when sampling a high-dimensional space, it can easily work out that your tests only sample a small number of hyperplanes, leaving most of the space uncovered).

It's important to be clear that the only reason anyone should use any of the rand48 generators today is if they have a hard compatibility guarantee with existing software that uses it. It is far from the Pareto frontier of performance vs. statistical quality. There's no reason to make it available when other seedable generators are available that are better in every way. We should provide interfaces to those in the standard library; in the very short-term, there are numerous public packages that wrap them.

I confess not understanding needing to reimplement it--if I actually needed jrand48 specifically, I would wrap it in a C function and import that.

¹ A particularly relevant snippet:

The first category contains linear congruential generators, which obey the recurrence x_i = (ax_{i−1} + c) mod m with output function u_i = x_i/m; they are denoted by LCG(m, a, c). LCGs with power-of-two modulus m = 2^e are known to be badly behaved, especially in their least significant bits [L’Ecuyer 1990]... LCG(248, 25214903917, 11) is the drand48 from the Unix standard library. The RNG in java.util.Random of the Java standard library is based on the same recurrence, but uses two successive values to build each output ... This is an improvement over drand48, but still not acceptable.

dabrahams · April 4, 2020, 10:06pm

In my case I just need a blob of repeatable numbers that I don't have to invent myself; I'm not using randomness to try to create coverage. If they were all the same value or had a few other regular properties (e.g. a strided sequence) that are trivial to generate, I could imagine fooling myself about the results of my algorithm, but the algorithm isn't taking branches based on the values or deciding the size of the test set or anything. As long as they're not obviously regular, there's not much danger for my use case. I could generate them truly randomly once and read them from a file; that's just a lot of drudge-work.

There's no reason to make it available when other seedable generators are available that are better in every way.

That's the problem: such things aren't (portably) available in Swift for my purposes, AFAICT.

I confess not understanding needing to reimplement it--if I actually needed jrand48 specifically, I would wrap it in a C function and import that.

I think you're missing my meaning. I didn't mean “reimplement jrand48 specifically,” I meant “reimplement the facility it gives me of a repeatable source of irregular data.” And dropping into C to get it would be another productivity drain.

¹ A particularly relevant snippet:

The first category contains linear congruential generators, which obey the recurrence x_i = (ax_{i−1} + c) mod m with output function u_i = x_i/m ; they are denoted by LCG(m, a, c). LCGs with power-of-two modulus m = 2^e are known to be badly behaved, especially in their least significant bits [L’Ecuyer 1990]...

From what I can tell from Wikipedia and the Steele and Vigna paper (a successor to L'Ecuyer's), that statement is painting with rather a broad brush. It seems to me, from my only-several-hours of research, that well-behaved LCGs with power-of-two modulus do exist, which is why I eventually wrote this one based on numbers from Steele and Vigna. Am I mistaken?

I am far from being a numerics expert and I confess to not understanding the papers at a deep level, so I have stopped where I am, with little confidence, having previously wasted an hour or so trying to code against jrand48 before discovering it was unavailable. Personally, especially until/unless some expert validates my LCG, I'd have been better served if that first effort had paid off.