Well, it's easy enough to inline the nested function manually; it just ends up needing either several nested ifs or a label on the outer if:
Manually inlined
static func uniformRandomInSection<R: RandomNumberGenerator>(_ section: Int64, maxExponent eMax: RawExponent, using generator: inout R) -> Self {
    let n = UInt64(bitPattern: (section < 0) ? ~section : section)
    let b = lowerBound(ofSection: n &+ 1, maxExponent: eMax)
    let x: Self

    label: if (n == 0) && (b.exponentBitPattern != 0) {
        // Section 0 starts at 0 and may span multiple binades.
        x = randomUpToExponent(b.exponentBitPattern, using: &generator)
    } else {
        // Every other section fits within a single binade.
        let a = lowerBound(ofSection: n, maxExponent: eMax)
        if a == b { x = a; break label }

        let low = a.significandBitPattern
        let high = b.nextDown.significandBitPattern
        if (low == high) { x = a; break label }

        let s = RawSignificand.random(in: low...high, using: &generator)
        let e = a.exponentBitPattern
        x = Self(sign: .plus, exponentBitPattern: e, significandBitPattern: s)
    }
    return (section < 0) ? (-x).nextDown : x
}
But eventually I think everything in this implementation is going to need inlining annotations, so I might as well start here. I've added @inline(__always) to the nested function.
Thanks for spotting this and finding the solution!
// Get a random number in a single section (as defined above).
static func uniformRandomInSection<R: RandomNumberGenerator>(_ section: Int64, maxExponent eMax: RawExponent, using generator: inout R) -> Self {
    let n = UInt64(bitPattern: (section < 0) ? ~section : section)
    let b = lowerBound(ofSection: n &+ 1, maxExponent: eMax)

    // This must only be called when the section fits in a single raw binade.
    @inline(__always)
    func singleBinadeRandom() -> Self {
        let a = lowerBound(ofSection: n, maxExponent: eMax)
        if a == b { return a }

        let low = a.significandBitPattern
        let high = b.nextDown.significandBitPattern
        if (low == high) { return a }

        let s = RawSignificand.random(in: low...high, using: &generator)
        let e = a.exponentBitPattern
        return Self(sign: .plus, exponentBitPattern: e, significandBitPattern: s)
    }

    let x: Self

    if (n == 0) && (b.exponentBitPattern != 0) {
        // Section 0 starts at 0 and may span multiple binades.
        x = randomUpToExponent(b.exponentBitPattern, using: &generator)
    } else {
        // Every other section fits within a single binade.
        x = singleBinadeRandom()
    }
    return (section < 0) ? (-x).nextDown : x
}
The first thing I noticed was that uniformRandomInSection made 2 calls to lowerBound(ofSection:) (to get the upper and lower bounds), and I wondered if that was strictly necessary.
About 2am last night I realized that the call for section n+1 could be eliminated, and I jotted it down in my bedside notebook.
…then I couldn't get back to sleep, so I sat up and reworked the code immediately.
I ended up eliminating lowerBound(ofSection:) entirely, moving its logic into uniformRandomInSection and appending random bits directly to get a value uniformly distributed within the section. This also eliminated the nested function.
In the process, I noticed and fixed a latent bug that would only affect types with no spare bits (ie. with significandBitCount == RawSignificand.bitWidth) a tiny portion of the time (when the section spans exactly one full raw binade).
The implementation got a few lines shorter, it should run a little faster since it only needs to calculate one section bound instead of two, and it has one less bug (well, unless my sleep-addled brain introduced more…)
Any idea why the third range (-greatestFiniteMagnitude ... 0) is so slow for Float80?
(I didn't notice this before but that range is slow for both the latest and the previous version.)
Another possible optimization (which I guess you might've already thought about): I think it could be about 4 times faster for ranges that span whole "raw exponent binades" exactly, by using the method described in the paper, which we tried upthread (i.e. choosing the raw exponent by counting consecutive trailing/leading zeroes and then filling the raw significand with random bits).
Interesting. I already have the randomUpToExponent function which uses that approach, but currently it only gets called for section 0. It would be pretty easy to insert a check at the very beginning to use that method when applicable to the whole range.
If we think ranges with a lower bound of 0 and an upper bound with 0 significand (eg. 0..<1) will be far more common than other ranges, we could optimize for it.
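As a minimal sketch, an applicability check for such a fast path could look like this (the helper name is mine, not from the actual implementation): a range 0 ..< b qualifies when b is a positive, finite power of two, i.e. its significand bit pattern is all zeros.

```swift
// Hypothetical helper: does 0 ..< b span whole raw binades exactly?
// True when b is a positive, finite power of two (all significand bits zero).
func isRawBinadeBound<F: BinaryFloatingPoint>(_ b: F) -> Bool {
    b > 0 && b.isFinite && b.significandBitPattern == 0
}

print(isRawBinadeBound(1024.0))  // a power of two, so the fast path would apply
print(isRawBinadeBound(1000.0))  // not a power of two
```

Note this deliberately excludes zero, infinity, and NaN; subnormal bounds would also need separate consideration in the real code.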
• • •
As for general-purpose optimization, when I run the Instruments time profiler I see about 40% of the time being in "Self Weight" of uniformRandomRoundedDown, which makes no sense to me. That function does almost no work itself, mostly just calling uniformRandomInSection, smallRangeUniformRandom, and sectionNumber to perform the actual calculations.
I would've expected those last 3 functions to take up the bulk of the running time, but Instruments says their total combined time is less than the "Self Weight" of uniformRandomRoundedDown.
For example, the last run I tested looked like this:
Weight     Percent   Self Weight   Symbol Name
12.26 s    100.0%    5.38 s        uniformRandomRoundedDown(in:using:)
 2.18 s     17.8%    63.00 ms      uniformRandomInSection(_:maxExponent:using:)
 1.44 s     11.7%    1.06 s        smallRangeUniformRandom(in:using:)
 1.24 s     10.0%    914.00 ms    sectionNumber(maxExponent:)
⋮
Now perhaps I'm misunderstanding what this all means, but it strikes me as quite odd that uniformRandomRoundedDown would have such a high "Self Weight".
So, in my case, the check (which always fails) for the small range path takes a lot more time than for you. Commenting out the small range path takes it down from 9.7 seconds to 7.1 seconds on my MBP:
// if let x = smallRangeUniformRandom(in: range, using: &generator) {
//     return x
// }
I have no idea why that should take so much time, why the rest of the code (i.e. with that check commented away) should still be so slow, or why our time profiles look so different.
BTW: I'm using (the default toolchain of) Xcode 11.4 (11E146) on macOS 10.15.4 (19E266)
I cannot see any clear difference in the timing profile of those two ranges (except that one takes more time than the other). Maybe it has something to do with the implementation of Float80 and/or the builtins that it is calling, like the thing with some floating-point operations being extremely slow for subnormals, for example.
Hmm… thinking about it more, I suspect the small-range path could be simplified a lot without much trouble, and very nearly eliminated.
Its purpose is two-fold: to guarantee that only a single call to the generator is made when the range is small, and to minimize the chances of the while true loop needing more than one pass when the range is large.
The "worst-case" for a large range would be something like 1.nextDown..<1.nextUp, where the endpoints are just barely in different sections (here the sections would be (1<<62)-1* and 1<<62). One of those sections is selected, and then a random value in that section is chosen.
But if there are many possible values in those sections, then we have a problem because only 1 value from each section is actually in the original range, so the while true loop would usually spin multiple times.
However, types with small significands won't have that problem. In particular, 61 or fewer significand bits is sufficient to guarantee those sections are singletons. Thus Float and Double don't need the small-range path.
With 62 significand bits, the upper section is a singleton but the lower one has 2 values, so thereâs a 3/4 chance to land in the range. And with 63 bits, such as Float80, the upper section has 2 values and the lower section has 4, giving a 3/8 chance to land in the range.
However, larger types would rarely land in the range. Float128 has a 112-bit significand, so it would only succeed about 3 times in 2^52, and the while loop would spin a quadrillion times on average.
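Spelling out the arithmetic behind those odds (an illustrative helper, not part of the implementation): with p significand bits and p ≥ 62, the upper section holds 2^(p-62) values and the lower 2^(p-61), only one of each in the range; each pass picks one of the two sections with probability 1/2, so P(success) = 3 · 2^-(p-60).

```swift
// Illustrative only: per-pass success probability for 1.nextDown..<1.nextUp
// on a type with p significand bits (p >= 62):
//   P = (1/2) * 2^-(p-62) + (1/2) * 2^-(p-61) = 3 * 2^-(p-60)
func successProbability(significandBits p: Int) -> Double {
    precondition(p >= 62)
    return 3 * Double(sign: .plus, exponent: -(p - 60), significand: 1)
}
// p = 62 gives 3/4, p = 63 gives 3/8, and p = 112 (Float128) gives 3/2^52,
// matching the figures above.
```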
So we canât remove the small-range path wholesale, but we can be much more selective when it gets used. This should speed things up for Float and Double:
if significandBitCount > Int64.bitWidth - 3 {
    if let x = smallRangeUniformRandom(in: range, using: &generator) {
        return x
    }
}
And for Float80 and other wide types… I'll simplify the small-range code so it only cares about adjacent binades. Then it should be able to fail early much faster.
* Technically, for small types there are many sections between 1.nextDown and 1, so it would be (1<<62) - k for some k, but all those sections would be singletons so it doesn't matter. In the large-significand case k would be 1 and the rest works.
I should maybe have added/restated that the "usual" time for most other Float80 ranges was around 1.6 seconds, so 9.7 s (with the (failing) small path check) and 7.1 s (without the small path check) are both way slower than 1.6 s, and I guess we still don't know why that is.
Speaking only for myself and the projects I'm working on, we have a couple of use cases where getting a 4 x speedup for eg 0 ..< 1024 rather than 0 ..< 1000 would be very nice. And I guess it's also a bit similar to things like x % 1024 being faster than x % 1000.
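The modulo analogy is easy to demonstrate (a tiny illustration, unrelated to the floating-point code itself): for nonnegative integers, remainder by a power of two reduces to a bit-mask, while remainder by 1000 needs an actual division.

```swift
// For nonnegative x, x % 1024 is just a mask of the low 10 bits;
// that's why compilers make the power-of-two case so much cheaper.
let x = 123_456_789
assert(x % 1024 == x & 1023)
print(x % 1024)
```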
Okay, I added fast-paths for full-raw-binade ranges either with 0 as one of the bounds, or with equal-and-opposite bounds (eg. -1..<1).
I also ripped out the vast majority of the small-range handling and made it return early when possible. Instead of trying to be an optimization, it is now exclusively intended to ensure the large-range codepath won't spin the while true loop.
With the current implementation, the worst-case range for wide types (such as Float80) is something like 1.nextDown...1.5, which has less than a 1 in 2^60 chance of needing a second pass through the while loop.
(Narrow types like Float and Double will always succeed on ranges like that: their worst-case is 0x1p-62.nextDown...1, with less than a 1 in 2^62 chance of a second pass.)
Simplifying the small-range handling meant I could entirely eliminate the multi-binade helper functions that care about spare bits, so the file got a lot shorter. I didn't touch the large-range code, so it still has the same performance (except for the aforementioned fast-paths that skip it entirely).
The second and last rows (the fastest) are for full-raw-binade ranges.
That extreme slow-down for the third Float80 range (-greatestFiniteMagnitude ... 0, which might as well be eg -1 ... 0) is still there, and I still have no idea what causes it. Can you reproduce it?
Hmm, I wonder why the checksum in the 3rd row changed for Float and Double but not Float80. I would expect it to stay the same for all three types, since the range -greatestFiniteMagnitude ... 0 uses the standard codepath in both versions, which has not changed.
I'm noticing that the "fast path" is actually slower when using the SystemRandomNumberGenerator.
With both Float and Double, my implementation is about 2x slower than the stdlib's for 0..<1, but only 1.5x slower for 0..<1.5.
This makes sense, since the "fast path" requires 2 calls to the generator whereas the standard path only needs 1 call. When the generator is slow, that extra call has a major effect.
In contrast, with Xoroshiro128Plus, my implementation is about 4x slower for 0..<1, and about 12x slower for 0..<1.5.
When the generator is fast, a second call to it doesn't matter, and the simpler logic of the fast path does indeed make it faster.
Interestingly, with Float80 and the system generator, my implementation is only about 1.2x slower than the stdlib's on 0..<1, and 1.6x slower on 0..<1.5.
• • •
Edit:
I just added a check so the "fast path" is not used with SystemRandomNumberGenerator. And… it significantly sped up Float and Double on 0..<1, so they're only 1.2x slower than the stdlib with the system random number generator now.
It did bump Float80 down to 1.3x slower than the stdlib, but I'm much more interested in optimizing for Float and Double.
This is getting close!
• • •
Edit 2:
For many subnormal ranges like 0 ..< .leastNormalMagnitude/4, using the system RNG my implementation is about 0–5% faster than the stdlib for Float and Double, and about 35–40% faster for Float80.
And using xoroshiro, my implementation is 25–30% faster for Float and Double, and a whopping 45–50% faster for Float80.
I'm starting to think this might actually be viable.
Yes, but the fast path could make more efficient use of the random bits and thereby require 1 rather than 2 calls to generator.next(), at least for Float and Double (for Float80 it would require 2 calls), eg:
extension BinaryFloatingPoint where
    RawExponent: FixedWidthInteger,
    RawSignificand: FixedWidthInteger
{
    static func randomThroughRawExponent<R: RandomNumberGenerator>(
        _ maxExponentBitPattern: RawExponent,
        using generator: inout R
    ) -> Self {
        var bits: UInt64
        var tzSum = 0
        again: while true {
            bits = generator.next()
            let maxZS = Int(truncatingIfNeeded: maxExponentBitPattern) &- tzSum
            let tzs = min(bits.trailingZeroBitCount, maxZS)
            tzSum &+= tzs
            if tzs == 64 && tzSum < maxExponentBitPattern { continue again }
            let consumedBits = tzs &+ 1
            bits &>>= consumedBits
            if (64 &- consumedBits) < Self.significandBitCount {
                bits = generator.next()
            }
            break again
        }
        let e = maxExponentBitPattern &- RawExponent(truncatingIfNeeded: tzSum)
        let shifts = RawSignificand.bitWidth &- Self.significandBitCount
        let mask = ~RawSignificand(0) &>> shifts
        let s = RawSignificand(truncatingIfNeeded: bits) & mask
        return Self(sign: .plus, exponentBitPattern: e, significandBitPattern: s)
    }
}
Using SystemRandomNumberGenerator(), this implementation will (on average) be as fast as the current stdlib F.random(in: 0 ... powerOfTwo) when F is Float or Double, and it will actually be faster than the current stdlib implementation when F is Float80.
To support -powerOfTwo ..< powerOfTwo ranges, one more bit would be needed, but that shouldn't change the performance.
Using a fast pseudo random number generator, this implementation of the fast path seems to be about twice as fast as that of the current implementation (per this commit).
But for some reason my code is (significantly) slower for Float than for Double which seems counterintuitive (I'd expect it to be equally fast or maybe a tiny bit faster for Float).
EDIT: Ah, it turned out that the cause of this was the call to min (somehow) and an < operator between different integer types, as well as the optimizer happening to inline the method for Double but not for Float. So the above implementation was giving me this (counterintuitive) result:
Looking at the performance of the fast path in your current implementation, I see that it too is slightly slower for Float than for Double, although not as much as was the case for my code. Adding @inline(__always) to the relevant methods fixed that. With this change I see this (with WyRand):
Your fast path:
0.33 seconds for Float and Double, 2.11 seconds for Float80
My fast path (the improved version):
0.14 seconds for Float and Double, 1.01 seconds for Float80
Thus there should be no need to special case away (an improved implementation of) the fast path for the system generator.
Good point. I can rework the randomUpToExponent function to use the leftover bits for the significand when possible.
I can even introduce an allowNegative argument, so it will use a leftover bit to handle the symmetric-bounds case (eg. -1..<1), and thus avoid calling Bool.random().
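A rough sketch of that idea, with illustrative names of my own (the real code would draw within a section and handle the bounds via nextDown, rather than building a [0, 1) magnitude like this):

```swift
// Sketch: one generator word yields both a sign and a magnitude, so a
// symmetric range like -1..<1 needs no separate Bool.random() call.
func signedUniform<R: RandomNumberGenerator>(using generator: inout R) -> Double {
    var bits = generator.next()
    let negative = (bits & 1) != 0  // one leftover bit decides the sign
    bits &>>= 1                     // 63 random bits remain
    let magnitude = Double(bits &>> 11) * 0x1p-52  // 52 of those bits -> [0, 1)
    return negative ? -magnitude : magnitude
}

var rng = SystemRandomNumberGenerator()
let sample = signedUniform(using: &rng)  // always strictly between -1 and 1
print(sample)
```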
My first crack at it (using the system RNG, and compared to the stdlib's random(in:)) is 10% slower for Float and Double, and 15% faster for Float80, on both 0..<1 and -1..<1.
And on 0 ..< .leastNormalMagnitude, itâs 10% faster for Float and Double, and a remarkable 3x faster for Float80.
I did not yet eliminate the heterogeneous comparison, but I did locate and mark all of them in the file with a comment (// Heterogeneous compare). All of them are between RawExponent and either Int or UInt64.
Rather than special-case this one function, if those comparisons are really eating the performance then I'd like to take care of all of them together.
• • •
In your implementation, you unconditionally convert from RawExponent to Int. I'm not really comfortable with that, because (a) the standard library types use UInt, which could exceed Int.max (even though it can't actually for floating-point values), and (b) an arbitrary floating-point type could use a larger integer type.
I'm going to try to special-case by checking whether exponentBitCount < Int.bitWidth, and if not, then whether maxExp itself fits in an Int (eg. maxExp._binaryLogarithm() < Int.bitWidth &- 1).
Your bits variable is UInt64, so this can only put at most 64 bits into the significand. That's enough for the standard library types (including Float80), but for the general case it's not sufficient.
(The fix is pretty easy: first check if the number of available bits is at least significandBitCount, and if not then generate a random RawSignificand.)
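That fix might look something like the following sketch (the helper name and the availableBitCount parameter are assumptions of mine; in context it would be 64 minus the bits already consumed for the exponent):

```swift
extension BinaryFloatingPoint where RawSignificand: FixedWidthInteger {
    // Sketch: reuse leftover random bits for the significand when there are
    // enough of them; otherwise draw a whole fresh RawSignificand, which is
    // needed when significandBitCount exceeds what one 64-bit word covers.
    static func significandBits<R: RandomNumberGenerator>(
        fromLeftover bits: UInt64,
        availableBitCount: Int,
        using generator: inout R
    ) -> RawSignificand {
        let shifts = RawSignificand.bitWidth - significandBitCount
        let mask = ~RawSignificand(0) &>> shifts
        if availableBitCount >= significandBitCount {
            return RawSignificand(truncatingIfNeeded: bits) & mask
        }
        return RawSignificand.random(in: 0 ... mask, using: &generator)
    }
}
```

For Double the first branch always suffices (52 bits fit in any leftover word worth keeping); a hypothetical 128-bit type would take the fallback branch.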
• • •
Edit:
I just tested my work-in-progress implementation with xoroshiro on 0 ..< .leastNormalMagnitude, and compared to the stdlib mine is 5x faster for Float, 6x faster for Double, and a blazing 14x faster for Float80.
Cool, not using floating point * and + probably makes a lot of difference for subnormals, especially for Float80 which on top of those usually slow subnormal arithmetic operations also seems to be subject to a lot of missed optimization opportunities.
However, note that the stdlib implementation is around 15x (Float and Double) and 72x (Float80) slower for 0 ..< leastNormalMagnitude than it is for eg 0 ..< 1 or 0 ..< 1024.
The implementation in my previous post is about 14x faster than the stdlib for the subnormal range for Float and Double and 120x faster for Float80. But it doesn't handle the sign bit etc of course.
EDIT:
BTW, I have a very simple sketch for a PseudorandomNumberGenerator protocol, something which I think should be added to the Random API of the stdlib, since it's needed for eg initializing or restarting a PRNG with some specific state or seed in a generic context. I've needed it for writing tests for this stuff.
Here it is, together with the WyRand generator:
protocol PseudorandomNumberGenerator: RandomNumberGenerator {
    associatedtype State
    init(seed: UInt64)
    init?(state: State)
}

extension PseudorandomNumberGenerator {
    init(seed: UInt64? = nil) {
        self.init(seed: seed ?? UInt64.random(in: .min ... .max))
    }
}
/// The wyrand generator, translated from:
/// https://github.com/wangyi-fudan/wyhash
struct WyRand: PseudorandomNumberGenerator {
    typealias State = UInt64
    private(set) var state: State

    init(seed: UInt64) {
        self.state = seed
    }

    init(state: State) {
        // Every UInt64 value is a valid WyRand state, so no need for this
        // initializer to be failable; it satisfies the requirement anyway.
        self.state = state
    }

    mutating func next() -> UInt64 {
        state &+= 0xa0761d6478bd642f
        let mul = state.multipliedFullWidth(by: state ^ 0xe7037ed1a0b428db)
        return mul.high ^ mul.low
    }
}
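A quick note on why seedability matters for tests: the same seed must reproduce the same stream. Here is a self-contained demo, restating the wyrand core from above under a different name so the snippet stands alone:

```swift
// Minimal restatement of the wyrand step, just for this demo.
struct DemoWyRand: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0xa0761d6478bd642f
        let mul = state.multipliedFullWidth(by: state ^ 0xe7037ed1a0b428db)
        return mul.high ^ mul.low
    }
}

// Identically seeded generators produce identical streams, so a test that
// exercises, say, Double.random(in:using:) is fully reproducible.
var g1 = DemoWyRand(seed: 42)
var g2 = DemoWyRand(seed: 42)
let streamsMatch = (0..<8).allSatisfy { _ in g1.next() == g2.next() }
print(streamsMatch)
```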
This depends a lot on the platform. On Apple's recent ARM cores, there's zero penalty for operations with subnormal numbers. On x86, it can easily be 100x.
On platforms with slow subnormal operations, everything is going to be slow, so it's not really a concern if random number generation is also slow. This is an extremely niche corner case, that is broadly not worth worrying about (and certainly not worth designing around--if fast subnormals fall out naturally from other optimizations, great, but it doesn't make sense to spend software engineering effort on them).
Without any constraints on State, init(state:) is almost useless as a protocol requirement (it remains useful on concrete types, of course). What's the reason to put it on the protocol?