Is Float.random(in:)'s implementation or documentation wrong?

The discussion section of the documentation of

static func random(in range: Range<Float>) -> Float

says:

The random() static method chooses a random value from a continuous uniform distribution in range, and then converts that value to the nearest representable value in this type. Depending on the size and span of range, some concrete values may be represented more frequently than others.

Unless I'm misinterpreting this, the process described would let any representable value within the given range be chosen.

However, this is not the case, as the following program demonstrates:

let range = Float(0) ..< Float(1 << 24)
var lowSamples = [Float]()
while lowSamples.count < 10 {
    let v = Float.random(in: range)
    if v < 100 { lowSamples.append(v) }
}
print(lowSamples)

Here's the output of a few runs:

[3.0, 48.0, 30.0, 82.0, 36.0, 84.0, 74.0, 5.0, 94.0, 95.0]
[95.0, 24.0, 58.0, 19.0, 53.0, 23.0, 47.0, 43.0, 11.0, 99.0]
[4.0, 49.0, 56.0, 2.0, 90.0, 51.0, 64.0, 79.0, 92.0, 21.0]

If, like the documentation says, this method "chooses a random value from a continuous uniform distribution in range, and then converts that value to the nearest representable value in this type", then how come the fractional part of every value is always zero?

We can see why by looking at the implementation, which essentially does this:

let unitRandom = Float(rndUInt64Value >> (32+8)) * Float(0.5).ulp
return unitRandom * (range.upperBound - range.lowerBound) + range.lowerBound

So, since in my example range = Float(0) ..< Float(1 << 24) and range.upperBound.nextDown.ulp == 1.0, the method can only choose among the values [0.0, 1.0, ..., Float((1 << 24) - 1)], and never, e.g., 0.5.

This also means that for, e.g., range = Float(0) ..< Float(1 << 25), it will only ever return even values (since range.upperBound.nextDown.ulp == 2.0).
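
A quick way to check this quantization argument (assuming it's run as a plain Swift script; unitRandom refers to the sketch above):

let upper24 = Float(1 << 24)   // 16777216
let upper25 = Float(1 << 25)   // 33554432
print(upper24.nextDown.ulp)    // 1.0 -> only whole numbers are reachable
print(upper25.nextDown.ulp)    // 2.0 -> only even whole numbers are reachable

// The scaling step maps unitRandom = k * 2^-24 (k in 0 ..< 2^24) onto
// k * 2^-24 * 2^25 == 2k, so every sample in 0 ..< 2^25 comes out even:
let k: Float = 12345
print(k * Float(0.5).ulp * upper25)   // 24690.0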


Question:

Should the implementation change to match the documentation or should the documentation change to match the implemented behavior?

(Related to this thread, which never got a clear answer.)

2 Likes

This is somewhat standard behaviour; Rust, for example, does something similar. C++11 (or, at least, the C++ standard library shipped on macOS by default) is better, but still doesn't have full precision:

#include <random>
#include <iostream>

int main() {
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_real_distribution<float> dist(0.0, 16777216.0);

  int count = 0;
  while (count < 10) {
    auto val = dist(gen);
    if (val < 100) {
      std::cout << val << std::endl;
      count++;
    }
  }

  return 0;
}

Each number printed seems to be "integer/256.0", and indeed bumping the limit up to 16777216.0 * 256 (2³²) gives the same behaviour as Rust and Swift: it generates integers. Its implementation, in essence, generates a 32-bit integer and divides by 2³². (This is slightly different to the Swift and Rust ones, which splat 24 random bits into the mantissa directly.)
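
As a rough side-by-side sketch of the two mappings as described above (bits is just an illustrative raw 64-bit draw, not any library's actual variable):

let bits: UInt64 = 0x1234_5678_9ABC_DEF0

// libc++-style: a 32-bit integer divided by 2^32.
let viaDivision = Float(UInt32(truncatingIfNeeded: bits)) * 0x1p-32

// Swift/Rust-style: the top 24 bits dropped straight into the significand.
let viaTopBits = Float(bits >> 40) * 0x1p-24

print(viaDivision, viaTopBits)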

I think doing it "properly" is potentially quite slow, so I'm inclined to say that the documentation should change.
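
For what it's worth, here is a minimal sketch (my own, not anyone's planned fix) of one known way to draw a full-precision Float in [0, 1): pick the binade [2^-(k+1), 2^-k) with probability 2^-(k+1), then a uniform 24-bit significand inside it. Note that this only covers the unit-interval step; mapping the result exactly onto an arbitrary range, without ever landing on the excluded upper bound, is an additional problem, which is part of why doing it "properly" costs more.

func unitRandomFullPrecision<G: RandomNumberGenerator>(using generator: inout G) -> Float {
    // Leading zeros of a uniform UInt64 are geometrically distributed:
    // P(k) = 2^-(k+1), exactly the probability mass of binade k under U[0, 1).
    let k = generator.next().leadingZeroBitCount
    // 23 fresh fraction bits; the hidden bit is forced on so the value stays
    // inside the chosen binade. (The all-zero-word case, probability 2^-64,
    // is collapsed into a single binade rather than recursing deeper.)
    let significand = 0x80_0000 | UInt32(truncatingIfNeeded: generator.next() >> 41)
    return Float(significand) * Float(sign: .plus, exponent: -24 - k, significand: 1)
}

var g = SystemRandomNumberGenerator()
print(unitRandomFullPrecision(using: &g))   // a value in [0, 1); e.g. 0.5 is now a possible outcome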

3 Likes

The implementation should change to match the documentation (this is on my list of things to address for 5.0). The existing implementation is good-enough and at least as good as what most other languages provide, but we can and should do better.

9 Likes

That's both very interesting and important to know, since it probably means that the answer to the question in the title of my other thread, "When is the implementation of SE-0202: Random Unification allowed to change?", is: whenever it's needed.

Meaning, for example, that if I implement my own pseudo-random number generator, conforming it to the RandomNumberGenerator protocol, then I …

While I do think that the Random API should be able to change and improve whenever it's needed, I also think that the documentation should at least mention/warn about this, especially since the whole point of using PRNGs (rather than the default truly random / unseedable system RNG) is about reproducibility and/or efficiency.


More generally, but still in the context of PRNGs, reproducibility, and efficiency: I think it's problematic that the design of the Random API mandates a single "one true way" of mapping the (pseudo or truly) random UInt64 values into, e.g., Float values, rather than exposing these mappings as customization points.

Also, the Random API has been designed in a way that spreads these non-customizable methods all over the standard library, making them hard to avoid and easy to confuse with the alternatives one might actually need.


These concerns are rooted in a real-world example, one of our current projects: a game with procedurally generated content/levels, where everything can be described and reproduced solely by a seed, and where efficiency and reproducibility are crucial. Our conclusion is that we have to use our own little pseudo-random library on the side, making sure to design it in a way that avoids any potential confusion with the Random API (which is all over the standard library).
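
For illustration, such a side library can be quite small. Here's roughly the shape it can take (SplitMix64 is used purely as an example of a seedable generator; the float mapping is deliberately simple and owned by the project, so it can never change underneath saved seeds):

struct SplitMix64: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }

    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }

    // Our own documented, frozen mapping: top 24 bits scaled into [0, 1).
    mutating func unitFloat() -> Float {
        Float(next() >> 40) * 0x1p-24
    }
}

var rng = SplitMix64(seed: 0xDEAD_BEEF)
let height = rng.unitFloat() * 100   // reproducible for a given seed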

IMO it would have been better to let the Random API mature as a stand-alone library. It would have been small enough for anyone to easily try out and evaluate, simply by adding it to projects as source (rather than as a module, to avoid the issue of missing cross-module optimization).

2 Likes

It should be expected that libraries fix bugs and improve performance, and that does not need to be documented for every routine individually. Rather, routines whose behavior opts out of that norm should explicitly be documented as such.

Other languages do the same thing: C++'s uniform random float API can return the upper endpoint of the half-open range. This is a bug that will be (or has been?) fixed, changing the observable behavior of that API. Python's random float API has a similar bug, and I expect they'll get around to fixing it eventually. (We have avoided this particular bug in Swift.)
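
To make that hazard concrete (written in Swift syntax purely for illustration; this is the naive scale-and-shift arithmetic, not Swift's actual code path, which guards against it):

let u = Float(0xFF_FFFF) * 0x1p-24   // 1 - 2^-24, the largest "unit" value below 1
let lower: Float = 1.0
let upper: Float = 2.0
let scaled = u * (upper - lower) + lower
print(u < 1.0)          // true
print(scaled == upper)  // true: 2 - 2^-24 rounds up to 2.0, the excluded endpoint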

Other APIs also do the same thing. The behavior of the sorting APIs will likely change to default to a stable sort at some future point, which will result in similar user-observable behavior changes. Platforms will change the implementation of math functions like exp to make them faster or more accurate, or to fix bugs.

A reproducible sequence of random values that is stable across standard library versions is a niche use case (a large niche, but it's decidedly not the most-needed thing from a random API that isn't trying to serve all possible uses), which I would expect to find in a specialized game/simulation library, alongside things like "math functions whose behavior will never change, bugs/inaccuracies/inefficiencies and all."

Which isn't to say that the behavior of some of these APIs can't eventually be frozen (once they are, we could provide Swift5.random() or whatever); I just don't think we're there yet. But they are good enough to be useful to most folks, and thus meet the standard for inclusion in the stdlib.

2 Likes

I agree; the fault is mainly mine for thinking that our use case could or should benefit from the standard library's Random API, and for not seeing that it is a niche use case (although, I assume, a very large one) that requires rolling your own, not just the PRNG but the entire API, relying only on very basic primitives in order to ensure as much control as possible, while also keeping an eye out for unexpected changes to the standard library, the OS, etc. that might still affect the sensitive code.

What is "the most-needed thing[s] from a random API that are trying to serve all possible uses" though, and why?

And what are some concrete examples of the most common use cases for a random API such as SE-0202?

I have tried looking through all the threads related to SE-0202, but couldn't find what I would consider properly motivated answers to these (sincere) questions.

All the real-world needs for random number generation that I can remember ever having have been in scenarios where reproducibility and efficiency are important: audio/graphics processing, games, simulation, etc.

3 Likes

All the real-world needs for random number generation that I can remember ever having have been in scenarios where reproducibility and efficiency are important: audio/graphics processing, games, simulation, etc.

Even constrained to these domains, reproducibility across standard library versions is frequently not the primary concern. For example, you likely do really need a map tile generator to be stable, but for lots of other random behavior in a game (maybe even most), having bugs fixed is more valuable than fixed behavior.

This is also true for audio and graphics processing: an out-of-range value can cause horrible ringing artifacts, so if Swift suffered from C++'s out-of-range-value bug, most people doing audio work would absolutely want that bug to be fixed, even if it resulted in different (but now correct) random numbers being generated. The fact that C++, one of the most it's-always-worked-that-way-and-we're-stuck-with-it languages, decided to fix the bug is illustrative.

I don't want to introduce unneeded churn, but I think we need to accept that faster algorithms will be found for some operations and bugs will be fixed, and that this type of change benefits more users than it harms. I understand the desire for completely reproducible behavior, but that comes at the cost of both performance and correctness, and it is rarely the right tradeoff for the most general APIs in a standard library.


I do think that it will eventually make sense to start locking down explicitly-stable variants of some of the stdlib API surface, where the exact bitwise behavior is fully specified by the implementation. We would want to namespace these somehow to mark them as such, though I don't know what that would look like exactly.
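
Purely as a hypothetical sketch of what such a namespaced, frozen variant could look like (nothing like this exists today):

enum StableRandom {
    /// Promised to produce identical output for identical generator state in
    /// every future release: top 24 bits of one 64-bit draw, scaled into [0, 1).
    static func unitFloat<G: RandomNumberGenerator>(using generator: inout G) -> Float {
        Float(generator.next() >> 40) * 0x1p-24
    }
}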

2 Likes

Well, saving a 64-bit seed instead of 6,000 tiles is definitely a big advantage. And you really don't want to load a save game and find out your city is underwater because the random number generator changed.

You might be interested in the PRNG engines for TensorFlow.

1 Like

To defend against this kind of implicit dependency, maybe we should make it so that the random number transformations that don't explicitly guarantee a stable mapping from input randomness to output sequence salt their output with the per-process seed, to prevent people from making brittle assumptions that they'll produce the same sequence in perpetuity.
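
A rough sketch of how that idea could be prototyped today (the names are made up; this is not a concrete API proposal):

// One salt per process, drawn from the system generator.
let processSalt = UInt64.random(in: .min ... .max)

struct SaltedGenerator<Base: RandomNumberGenerator>: RandomNumberGenerator {
    private var base: Base
    init(_ base: Base) { self.base = base }

    mutating func next() -> UInt64 {
        // XOR with a uniform constant keeps the output uniformly distributed
        // while breaking the assumption that a seed yields one sequence forever.
        base.next() ^ processSalt
    }
}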

2 Likes

Sounds like a good idea, but it should be done before 4.2 comes out; otherwise it will be too late.

2 Likes

In my experience most of the time you want to save the 6,000 tiles anyway, because otherwise any updates (bug fixes, content changes, etc.) in your own code have to be made very carefully to avoid breaking saved games. Or you have to maintain duplicates of every version of the procedural generation code that ever existed and hope you don't need to edit them (e.g. to fix a security issue). If the amount of generated content is enormous (and you can't just save the part that the player has actually seen) then I agree that it might be necessary to use the seed as a fragile form of compression.

Maybe, but this cuts down the number of use cases even further by excluding cases where you want short-term reproducibility but don't care about reproducing it in the long term. For example, you might be debugging a rare issue during development and using the seed to reproduce it. Or writing research code where you want to provide the same measurements to two different methods in order to compare them.

5 Likes