SE-0202: Random Unification

nnnnnnnn · March 24, 2018, 6:50pm

I don't think RNG needs to extend IteratorProtocol - in my experimentation I haven't used them as sequences, just as sources for different amounts of random data. Would you mind saying a little more or showing what you mean by composing RNGs?

soroush · March 24, 2018, 6:59pm

I was going to say that if you want 10 random numbers, Array(rng.prefix(10)) is preferable to (0..<10).map({ Int.random(in: range) }), but then realized that 10 random Int64 numbers isn't all that valuable, since you'd get modulo bias if you tried to mod them. Consider this retracted!

The distributions in the playground better address the type of thing I'm after, and they do conform to Sequence.

Michael_Ilseman · March 24, 2018, 7:40pm

+1 from me.

Definitely. I have written a half-dozen ad-hoc wrappers around arc4random_buf, each time requiring me to carefully pour over man pages, avoid biases, and attempt to ensure uniformity. I'm not confident I got it right every time.

I appreciate the author providing reasoning and justification for decisions in the "Alternatives Considered" section.

Yes. It provides a convenient safe default and supports extensibility (e.g. @nnnnnnnn's playground).

This feels in the top tier of convenient language support.

For serious usage, I would want to use a package similar to STL's <random>. However, that feels beyond the current scope of the stdlib and this proposal handles convenient and common usage well and is extensible.

A quick read of prior threads, @nnnnnnnn's Playground, the proposal, and a history of accumulated annoyances from not having this.

However, I'm unsure and unversed regarding what a default notion of a random floating point value should be. I would like this proposal to clarify the semantics, e.g. uniform in real numbers vs representable values. The proposal should ideally have rationale for the decision. (I recall this topic coming up in one of the mega-threads, but have since forgotten the details).

kvn · March 24, 2018, 7:40pm

+1 for this. Great work.

The one thing that got me was Collection.random; I wasn't immediately assuming Element. The property name doesn't feel as obvious as it does for FixedWidthInteger.

Also, while potentially out of scope, I was also expecting a convenience initializer for Array similar to init(repeating:count:), but which takes a RandomNumberGenerator.

EDIT

I chose .random() over .randomElement() here because .randomElement() doesn't match with things like .first, .last, .min(), or .max().

~~.random~~ .any?

MattSeaman · March 24, 2018, 10:45pm

+1 overall, with modifications

Some elements of this proposal seem out of scope. In particular, .shuffle and similar shuffling functions for sequences and collections seem like they belong in a separate proposal. Certainly, the addition of this random API would allow for other features such as shuffling, but historically these would be delegated to other proposals.
On the other hand, I am okay with Bool.random() being included because it is directly tied to random number generation. However, I would not be opposed to breaking this out to a separate proposal in the future.
As for Collection.random(), I feel this fits nicely with this proposal and I agree with returning an Optional because this is the established precedent for similar methods.
Conversely, -1 for returning an Optional for ranges. I agree that the situation of an empty range is much less likely in practice. Half-open ranges ought to be supported as well, though.
I agree with the proposal author that the static methods on integer and floating point types should be provided. A global version would only be slightly easier to call (due to . prefixing) in the case where you want to use an Int or Double. For other types, nasty casts of numeric literals to specific numeric types would become more common. Yes, Int and Double are the types Swift developers are encouraged to use, but the verbosity cost is relatively small to use the static variants instead.
I agree that the exact implementations of these functions should not be restricted by the proposal. These should be free to change in the future, as long as they do not compromise the stated goals (e.g. cryptographically secure).
I fully support making the default generator cryptographically secure, provided that this is not a performance problem. Developers could easily be generating thousands of random numbers in short periods of time for non–security-critical purposes.
I would support (and be excited about) a more generic, extensible, solution like C++, as others have suggested. If this proposal is accepted, it will become part of the ABI upon Swift 5's release. As a result, it should support any future customization that we wish to do. That said, I agree that the stdlib is not and should not be a statistics library, and the current proposal does a good job at being relatively lightweight yet extensible.

Yes. This has been a hole in the stdlib for awhile now. In addition, this is a feature commonly needed by both experienced developers and those first learning Swift. In the latter case of those new to the language, learning arc4random or similar existing solutions is un-Swifty and presents a common source of confusion.

Yes. As stated above, current methods of random generation such as arc4random feel un-Swifty and are easy to use incorrectly. The proposed solution does a good job at making these methods Swifty. (However, see suggestions above.)

This proposal seems more extensible than similar random functionality such as Java's random APIs, but I do see some benefit of enabling higher levels of extensibility (like others pointing to the C++ implementation).

I read the proposal in full, reviewed the implementation line-by-line, and read the existing discussion thread.

Zhu_Shengqi · March 25, 2018, 12:03am

If it's common to cause crash if return a non-optional, it's also common to cause crash when accessing an array element by index. So what's the difference? I think we shouldn't sacrifice the usability of an API just because improper use may cause crash. Besides, the random() is consist of two steps: generate a random index and return the element by index. Since a crash always happens when using an index out of bounds to access an element, why return an optional in this API?

Another strange use case is [1, 2, 3].random()!, which also support my point: user knows how to handle the empty collection case, and we shouldn't handle this special case for them and return an optional value. In some projects I have worked on it forbids using ! to force-unwrap an optional, it would be disaster to see random() ?? defaultValue in every use case (assume the user knows the collection is not empty).

hlovatt · March 25, 2018, 1:47am

Really like the proposal, would suggest range.random() and static .random(in:) so that the developer can choose which they prefer. They both read fine and if the documentation for both notes that the other is identical it won't cause confusion.

Alejandro · March 25, 2018, 4:20am

The rationale behind this is that computed properties like .first and .last are guaranteed to return the same value if the collection is never mutated in between. .random() on the other hand will return different outputs.

Nevin · March 25, 2018, 4:40am

Given that the proposal prioritizes quality over speed, I think it makes sense to have the floating-point implementation behave as though a real number were chosen uniformly from the appropriate interval and then rounded.

When someone asks for, say, a random Double in 0..<1, I think the natural interpretation is that they want every representable value in that range to be possible, with probability proportional to its unit in the last place.

It is non-trivial to write an algorithm which accomplishes this, especially for arbitrary intervals which do not begin and end on binades. Therefore we should consider implementing it in the standard library as the default, so everyone gets the maximally-correct behavior.

hlovatt · March 25, 2018, 4:58am

I prefer the spelling .random to .random() since random doesn't mutate the collection. IE I don't see the argument that .random returns a different value as significant; I can't see that the naming guidelines favour the brackets. However the Swift naming is all over the place on this point so either could be justified :).

maven · March 25, 2018, 9:14am

I'm worried about the shared global state (in the form of Random.default) and thread-safety. Shouldn't that at least be thread local?

I could also imagine the ability to replace the default with a seed-able custom generator to be quite useful (e.g. for reproducible sequences for testing).

If the empty range traps the generator, then I think Collection.random() should trap on an empty collection.

benrimmington · March 25, 2018, 12:53pm

@maven, the default generator doesn't have any state. It uses the internal _stdlib_random function, which is a cross-platform wrapper for arc4random_buf, etc. See apple/swift#12772 for details.

One possible issue is that if _stdlib_random opens "/dev/urandom", is it always safe to use the same file descriptor in different threads, or after a fork() or execve() call? There's a comment in CommonRandomSPI.h which suggests that "exception processing" is necessary.

benrimmington · March 25, 2018, 2:10pm

@Alejandro, should an AnyObject constraint be added to the RandomNumberGenerator protocol? Or should the inout keyword be added to T: RandomNumberGenerator parameter types?

Alternatively, the Random type could be an open class instead of a struct. Then the RandomNumberGenerator protocol wouldn't be needed.

Paul_Cantrell · March 25, 2018, 3:34pm

Agreed. I would consider any implementation that does not have this behavior (within an ulp or two) to be incorrect.

It would be silly to reason so hard about random’s cryptographic strength while ignoring uniformity. (I realize cryptography generally doesn’t use floats, but … the principle of the thing.)

jarod · March 25, 2018, 4:52pm

Zhu_Shengqi:

If it’s common to cause crash if return a non-optional, it’s also common to cause crash when accessing an array element by index. So what’s the difference? I think we shouldn’t sacrifice the usability of an API just because improper use may cause crash. Besides, the random() is consist of two steps: generate a random index and return the element by index. Since a crash always happens when using an index out of bounds to access an element, why return an optional in this API?

Another strange use case is [1, 2, 3].random()!, which also support my point: user knows how to handle the empty collection case, and we shouldn’t handle this special case for them and return an optional value. In some projects I have worked on it forbids using ! to force-unwrap an optional, it would be disaster to see random() ?? defaultValue in every use case (assume the user knows the collection is not empty).

The case for trapping on access by index is well-summarized here. The index you're using was calculated in some way, and if it goes out of bounds, there's a serious bug in your logic that shouldn't be handled at run time through a failed optional unwrap.

Most of the time that you're working with a collection, you don't have perfect knowledge about its contents. A hard-coded [1, 2, 3] is the exception, not the rule, so if random traps, then you have to do an emptiness check before using it most of the time. It's safer /and/ more convenient in the common case for the API to do the emptiness check for you.

That's not to say that optional element-at-index or non-optional random element APIs would not be useful sometimes, but they should not be the default. For what it's worth, I think the [1, 2, 3].random() case would ideally be handled by some kind of type-level support for non-empty collections. But failing that, adding a ! is a minimal burden on the developer.

I've worked on projects that forbid ! as well, but I think that's a cultural issue in the Swift community that's slowly dying out. The feature exists in the language precisely for cases like this. If allowing ! is out of your control, it's simple to get the same behavior while bypassing your linter by adding a Collection extension that unwraps the random element and fatalErrors when that fails.

hlovatt · March 25, 2018, 7:00pm

Would it be possible to also supply a generator that can be seeded that is fast and for the same seed produces a known sequence. I am not proposing this as the default but as part of the library for people that want to do Monte Carlo type simulations etc.

Alejandro · March 25, 2018, 9:12pm

It is safe to read the same descriptor from different threads, however we should probably use a mutex to prevent reading at the same time. We can move this discussion to the implementation pr for specifics.

I came into this with the idea that stateful generators are generally classes and non-stateful ones are structures (Random).

Michael_Ilseman · March 26, 2018, 12:48am

@Alejandro, is this the semantics you're going for? Could you clarify the proposal?

Paul_Cantrell · March 26, 2018, 4:03am

I’ll let Alejandro speak for his own intent, but the implementation in the proposal does have this semantics (possibly up to rounding error of a few ulps, perhaps magnified by the issue Jens flagged).

This line of discussion started with this remark:

“Uniform in representable values” would make floating point randomness unusable for almost every numerical application I can think of: generating other distributions, Monte Carlo methods, game environment seeding, generating audio noise, etc. I’ve never heard of a random number library working that way. (Does such a library exist?) I don’t think the proposal’s choice of “numerically uniform” really requires further clarification or justification; this is pretty standard.

GarthSnyder · March 26, 2018, 5:49am

A couple of folks have referred to PRNG selection as if it were much too complex a problem to be tackled by mere mortals. This is largely FUD. Any one of several common PRNGs would make a fine default and could easily stand as the sole PRNG included in the stdlib.

PRNGs do not in fact differ all that much in their performance. The modern ones deliver vastly better randomness quality than the old C library standards but are still pretty darn quick. The Wikipedia page for Xorshift, which is in the same family as (and runs at about the same speed as) the Mersenne Twister, claims that it can generate a random 32-bit value in 10 clock cycles on x86 architectures. That is fast enough that performance simply isn't a relevant concern when building a general-purpose library that wants to default to generating high-quality output. Anyone that needs a random number in less than 10 clock cycles is welcome to port their own favorite PRNG hack.

Just to put some actual numbers on this, the hoary old C library rand, a linear congruential generator, generates a billion random ints in 6.75 seconds on my desktop. Xorshift compiled from C code does it in 9.07 seconds and a C implementation of the Mersenne Twister does it in 10.36. So were talking about roughly a 35% performance penalty for modern generators that leave essentially nothing to be desired in terms of period and randomness quality. (Interestingly, the random library routine, which is a "nonlinear additive feedback" generator that is supposedly significantly better than the traditional linear congruential generators, is even faster than rand. I'm not sure why GameplayKit opted to include an LCG instead of this.)

Similarly, memory consumption is also a complete nonissue. I couldn't even identify a modern algorithm that uses more than a handful of bytes in its state. You could instantiate hundreds of thousands of RNG instances and no phone is going to bat an eyelash. (But in practical terms, the number of separate generators that you'd want to actually use is most likely less than 10.)

Quality and algorithm selection are exquisitely difficult problems for cryptographically secure RNGs because the stakes are standards are both very high. But there's no reason to let this free-floating anxiety slop over into PRNG selection. You don't have to select an absolutely perfect PRNG for all applications. You just have to do better than the crap that's included in the C library and follow generally agreed-upon industry standards.

The world has already selected a standard PRNG, namely, the Mersenne Twister. It's the default for SPSS, SAS, Microsoft Excel, GAUSS, GLib, GNU Multiple Precision Arithmetic Library, GNU Octave, GNU Scientific Library, gretl, IDL, Julia, CMU Common Lisp, Embeddable Common Lisp, Steel Bank Common Lisp, Maple, MATLAB, Free Pascal, PHP, Python, R, Ruby, SageMath, Scilab, and Stata. Just in case they didn't jump out at you: SPSS. SAS. OCTAVE. R. MATLAB. STATA.

Personally, I think there's some value in investigating a few of the newer generators such as Xorshift, but if the prime concern is that people are going to point and make faces at the Swift stdlib because it uses an unusual PRNG, by all means just take the Mersenne Twister.