Finding a decent 48-bit LCG

148 views
Skip to first unread message

ratm...@gmail.com

unread,
Apr 6, 2022, 7:24:45 AM4/6/22
to prng
Hello.
I hope to find some general PRNG expertise here...

I have just implemented two PRNG:s for Erlang/OTP's `rand` module.  The user request was for performance (low generation time) over quality.  So I looked in Pierre L'Ecuyer's classical (1999) paper "Tables of Linear Congruential Generators of Different Sizes and Good Lattice Structure" and selected X1 = (185852*X0) % (2^35 - 31) as  `mcg35`, and X1 = (15319397*X0 + 1) &  (2^35 - 1) as `lcg35`.

These generators produce a number in Erlang in 5 ns or 3.5 ns, respectively, compared to 17 ns for our fastest in the Xorshift family - Xorshift116+; so 3 and 5 times faster, which is appreciated for some applications.

Now I realize that there might be room for longer period generators from the same concept, and hope for some help / pointers on how to search for parameters.

48 bit generators could be possible. Such a generator would have 8000 times longer period and generate 6 bytes per iteration instead of 4.  But the multiplication constants would have to be small since, in Erlang, there is a limit where numbers go into bignum representation and becomes very slow to handle.

An intermediate result can have 59 bits.  60 bits might be possible but then I will have to use negative numbers and shift the modulus range from [0, 2^48) to [-2^47, 2^47), separate the code path in one for negative state and one for positive, etc...  So 59 bits would work but  60 might be possible.

So I envision a generator X1 = (a * X0) % (2^48 + 21) as `mcg48`, and another X1 = (b * X0 + 1) & (2^48 - 1) as `lcg48`.   I think using a prime number just above 2^48 for the MCG would make rejection sampling to 48 bits efficient since it is simple to detect the skewed high range and a low probability to reject and retry.

The problem is now to decide on parameters `a` and `b` - they have to be small to not break the bignum limit; =< 2047 to stay within 59 bits or =< 4095 to stay within 60 bits.

So, I would like help / pointers on how to find out such parameters in this narrow range, or to realize that there are no such creatures.  The next step would then be to investigate 40 bit generators, but that is not a big improvement from the current 35, so it would maybe not be worth the effort.

In another paper by Pierre L'Ecuyer and Raymod Couture (An Implementation of the Lattice and Spectral Tests for Multiple Reursive Linear Random Number Generators)(1996) there is mention of a software package LatMRG written in Modula-2.  Is that the way to go, and if so how do I aquire it?

Hoping for advice!
/ Raimo Niskanen

Sebastiano Vigna

unread,
Apr 6, 2022, 7:42:28 AM4/6/22
to pr...@googlegroups.com, ratm...@gmail.com


> On 6 Apr 2022, at 13:24, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> Now I realize that there might be room for longer period generators from the same concept, and hope for some help / pointers on how to search for parameters.
>
> 48 bit generators could be possible. Such a generator would have 8000 times longer period and generate 6 bytes per iteration instead of 4. But the multiplication constants would have to be small since, in Erlang, there is a limit where numbers go into bignum representation and becomes very slow to handle.

The problem is that if you use a multiplier much smaller than the square root of the modulus, havoc will happen. You will miserably fail collision tests in 2 dimensions. That comes from Proposition 1 in https://doi.org/10.1002/spe.3030.

This severely restrict what you can do. Essentially, if you have 60 bits you can use 40 for the modulus and 20 for the multipliers. Or 40 for the modulus and 19 for the multiplier to stay within that.

Did you consider the possibility of an MWC? That's the standard way to obtaining a reasonable generator with prime modulus with restricted resources. You could get period 2^96 probably with the same timings. 3-dimensional spectral scores are not that good, but you can fix that with a xorshift as in https://www.agner.org/random/theory/randomvector.pdf.

> In another paper by Pierre L'Ecuyer and Raymod Couture (An Implementation of the Lattice and Spectral Tests for Multiple Reursive Linear Random Number Generators)(1996) there is mention of a software package LatMRG written in Modula-2. Is that the way to go, and if so how do I aquire it?

No, but you can use LatMRG or even our much simpler code at https://github.com/vigna/CPRNG that we wrote exactly for the same purpose. It works with any full-period generator (power-of-two modulus with odd constant, or prime modulus).

Ciao,

seba

ratm...@gmail.com

unread,
Apr 6, 2022, 8:57:31 AM4/6/22
to prng
Thank you - very informative.

I suspected there had to be a reason why there were no so small multipliers in L'Ecuyer's paper.

Now I will definitely investigate MWC:s, and CPRNG will be useful!
I am afraid that the computations of an xorshift will be too expensive in Erlang.  I need to beat our Xorshift116+ with more than a factor 2, I'd say.

Cheers
/ Raimo

Sebastiano Vigna

unread,
Apr 6, 2022, 9:08:44 AM4/6/22
to pr...@googlegroups.com


> On 6 Apr 2022, at 14:57, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> Thank you - very informative.
>
> I suspected there had to be a reason why there were no so small multipliers in L'Ecuyer's paper.
>
> Now I will definitely investigate MWC:s, and CPRNG will be useful!
> I am afraid that the computations of an xorshift will be too expensive in Erlang. I need to beat our Xorshift116+ with more than a factor 2, I'd say.
>

If you wanna play with a two-29-bit-words MWC, these are a few reasonable multipliers:

0.038423 0.672866 0x3ffad057fffffff 0x1ffd682c 0.930458 0.038423 0.903231 0.664068 0.627773 0.677559 0.625802
0.038448 0.673705 0x3fe73d49fffffff 0x1ff39ea5 0.929901 0.038448 0.867438 0.759072 0.668899 0.709593 0.467147
0.038962 0.676747 0x3ff9e3c3fffffff 0x1ffcf1e2 0.930431 0.038962 0.901819 0.776230 0.621205 0.537397 0.674278
0.038932 0.677717 0x3fea7abdfffffff 0x1ff53d5f 0.929993 0.038932 0.930956 0.604254 0.752233 0.612216 0.657305
0.038905 0.679613 0x3ff60e51fffffff 0x1ffb0729 0.930322 0.038905 0.904867 0.732026 0.685286 0.602840 0.631438
0.038141 0.687616 0x3ff23d2ffffffff 0x1ff91e98 0.930214 0.038141 0.866165 0.843685 0.661405 0.674267 0.625119

The relevant parameter is the fourth one, A1. You can use it as in here:

https://prng.di.unimi.it/MWC128.c

The larger number is the prime modulus. You are simulating in all cases a multiplier 2^29.

Of course the >> 64 should become an >> 29. And you would output t in its entirety (58 bits), not just the lower 29 bits. And x should be updated to just the lower 29 bits of t. So there's some masking to do that could severly slow down the computation.

Ciao,

seba

ratm...@gmail.com

unread,
Apr 8, 2022, 5:59:52 PM4/8/22
to prng
Neat!  Did you use a program in CPRNG to do that search?

I only found only the `spect` program to be described as handling non-power-of-2 generators, and it produces (almost, for lag 1, maxdim 8) the output above, but it doesn't search.

Since I have headroom for 59 bits (60 bits signed), I was inclined to look for 30 bit multipliers, if that could be beneficial for the generator (and if it works ;-). Such as:
0.038012    0.688028    0x2008a34c    0x40114697fffffff    0.930115    0.038012    0.873878    0.775406    0.644750    0.721614    0.703320
0.038682    0.688558    0x20157a59    0x402af4b1fffffff    0.929387    0.038682    0.910351    0.706369    0.697220    0.675080    0.732223
0.038846    0.690399    0x2012cf46    0x40259e8bfffffff    0.929538    0.038846    0.912017    0.784270    0.668854    0.739141    0.588764
0.039786    0.691205    0x202559ca    0x404ab393fffffff    0.928491    0.039786    0.941210    0.744050    0.762871    0.589899    0.652213
0.038838    0.694740    0x20075dc0    0x400ebb7ffffffff    0.930187    0.038838    0.929042    0.797123    0.721577    0.647758    0.633639
0.038691    0.696980    0x202b7320    0x4056e63ffffffff    0.928147    0.038691    0.926260    0.766536    0.643122    0.792836    0.689686

I found these by doing a linear scan of the interval 0x20000000..0x205fffff for all safe primes and these are the multipliers with the highest harmonic score and a lowest score of at least 0.038, according to `spect`.

When I benchmark the MWC generator  `T1 = 16#1ffb0729 * (T0 band ((1 bsl 29)-1)) + (T0 bsr 29)` it is as fast as the LCG `X1 = (15319397 * X0 + 1) band ((1 bsl 35)-1)`, even when masking T1 to 29 bits to get an output value.  It seems in Erlang the masking is not what slows the generator down, but it is to create a composite state or term such as returning a tuple of value and state.  The term composition creates garbage that has to be collected.  This MWC generator executes in 4 ns, but when modified to return a composite term such as {T1 band ((1 bsl 29)-1), T1}, it needs 10 ns, in my benchmark.

One should not regard all bits of T1 as output bits; only the lowest 29, right...?  Or dare one use more bits?

But, since the use case at hand was to generate small numbers fast,  I think a 29 bit generator with a period of about 2^58 should often be preferred over a 35 bit generator with a period of 2^35.

Cheers
/ Raimo

Sebastiano Vigna

unread,
Apr 9, 2022, 4:11:29 AM4/9/22
to pr...@googlegroups.com

> On 8 Apr 2022, at 23:59, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> Neat! Did you use a program in CPRNG to do that search?

Argh, no, you're right, I got confused--the MWC search program is not there. I have a program looking for safe primes that I can send you, and then I pass the output to LatticeTester. I can send you the program but looks like you already developed your own :(.

> Since I have headroom for 59 bits (60 bits signed), I was inclined to look for 30 bit multipliers, if that could be beneficial for the generator (and if it works ;-). Such as:
> 0.038012 0.688028 0x2008a34c 0x40114697fffffff 0.930115 0.038012 0.873878 0.775406 0.644750 0.721614 0.703320
> 0.038682 0.688558 0x20157a59 0x402af4b1fffffff 0.929387 0.038682 0.910351 0.706369 0.697220 0.675080 0.732223
> 0.038846 0.690399 0x2012cf46 0x40259e8bfffffff 0.929538 0.038846 0.912017 0.784270 0.668854 0.739141 0.588764
> 0.039786 0.691205 0x202559ca 0x404ab393fffffff 0.928491 0.039786 0.941210 0.744050 0.762871 0.589899 0.652213
> 0.038838 0.694740 0x20075dc0 0x400ebb7ffffffff 0.930187 0.038838 0.929042 0.797123 0.721577 0.647758 0.633639
> 0.038691 0.696980 0x202b7320 0x4056e63ffffffff 0.928147 0.038691 0.926260 0.766536 0.643122 0.792836 0.689686

But there is only one multiplier--2^k. What you can do is to play with the size of the multiplier and the coefficient of the recursion A1. That is, say, 2^30 as multiplier and 2^29 as coefficient of the recursion. I've never seen that done but I guess one can work out the math (shifts etc.). You can always try the idea in low dimension (say, multiplier 2^8 and coefficient of recursion <= 2^7) and see whether it works by exhaustive testing.

> I found these by doing a linear scan of the interval 0x20000000..0x205fffff for all safe primes and these are the multipliers with the highest harmonic score and a lowest score of at least 0.038, according to `spect`.

Again, I think you're talking about A1 with multiplier 2^29 but please correct me.

> One should not regard all bits of T1 as output bits; only the lowest 29, right...? Or dare one use more bits?

No no, T1 _is_ the state. It is exactly the value of the associated LCG (multiplicative, I mean).

> But, since the use case at hand was to generate small numbers fast, I think a 29 bit generator with a period of about 2^58 should often be preferred over a 35 bit generator with a period of 2^35.

I'm trying to imagine how much you can unbalance the thing. I don't think there are any specific limits, but of course if you make A1 very small you will get a bad recursion because you'll get a kinda degenerate prime with a long trail of ones in the representation.

But I'm always interested in PRNGs with performance constraints. :)

Ciao,

seba

Sebastiano Vigna

unread,
Apr 9, 2022, 6:03:44 AM4/9/22
to pr...@googlegroups.com


> On 9 Apr 2022, at 10:11, Sebastiano Vigna <sebastia...@unimi.it> wrote:
>
> But I'm always interested in PRNGs with performance constraints. :)
>

Getting back on that idea. Suppose you want to generate 32-bit integers. You could use a multiplier 2^32 and an A1 in range 0 < A1 < 2^27. All computations would fit 59 bits.

0.073333 0.381030 0x7f17554ffffffff 0x7f17555 0.163924 0.073333 0.740324 0.804943 0.607501 0.662911 0.751590

This is an example that works but of course besides a shitty f₃ we have also a shitty f₂. Which brings back the idea: can't you really xorshift the output? All you have to do is to take the 32 outputs bits from the MWC and return (x ^ x << 16) & (2^32-1). You can do this even with the whole state t because you'll mask after.

If this is fast you can correct the problem with spectral scores. I did a quick check with PractRand and there are no obvious problems up to 16GB. I'll let it run for a while and let you know.

PractRand doesn't check for collisions, but I have my own tests for large-scale sampling and even with 10^10 samples collision tests are fine.

That would be a fine generator.

Ciao,

seba

ratm...@gmail.com

unread,
Apr 13, 2022, 11:46:22 AM4/13/22
to prng
First, yes; I did get the terminology wrong. A1 is used in a multiplication in the algorithm but it is not the multiplier; it is as you pointed out the "digit shift" that is the multiplier.

So I meant a multiplier 2^29 and an A1 in range <2^30  instead of <2^29, if that could achieve better quality.

I had an amateur look at a generator as you suggest (multiplier 2^32 and A1 < 2^27), but without xorshift, and it appears to be impossible to achieve decent scores from `spect`.
It is amazing to see how well a "shitty" generator behaves with xorshift scrambling...

Regarding scrambling and how it affects an Erlang implementation:
The fastest generator I have in the bouquet is a 35 bit LCG that generates 35 "LCG quality" bits in 3.8 ns (in my benchmark).
The MWC you suggest (or any lag-1 that stays within 59 bits) generating a new state and letting the caller mask out the low bits as the value is all done in 3.3 ns.
Generating a state and forcing the caller to inline the xorshift: X = T band ((1 bsl 32) - 1), V = X bxor ((X band ((1 bsl 16) - 1)) bsl 16), ends up in 4.1 ns.                    
Generating a state and calling a new function that does that xorshift ends up in 6.7 ns.
Letting the MWC return both the xorshifted value and the new state ends up in 16 ns.
Our Xorshift116+ generator doing the same (returning value and state) executes in about the same time: 17 ns.

So, letting the user mask the low bits from an MWC (or an LCG, for that matter) is the fastest, and reasonably easy to use.
Forcing the user to do the xorshift is too awkward.
Offering two functions, one for updating the state and one that gets the value from the state by xorshift is as convenient or more than letting the user mask the state, but takes about twice the time.
Returning both an xorshifted value and a new state takes another twice the time, i.e as much time as for Xorshift116+, so it would not be worth implementing.

The 3.8 ns LCG at hand is this one:
./spect 1 8 15319397 0x800000000
0.674717    0.775734    15319397    0xe9c165    1    0.849147    0.736953    0.779583    0.705774    0.739156    0.679319    0.674717

And I also have a 5.4 ns MCG:
./spect 1 8 185852 0x7FFFFFFE1
0.650639    0.805885    185852    0x2d5fc    1    0.933056    0.650639    0.801776    0.815666    0.681698    0.772522    0.664310

I have learned that an LCG has e.g got the statistical peculiarity that the lowest bit alternates and then the period doubles for every higher bit, which sounds like an annoying flaw, and that the MCG does not have that flaw, but on the other hand has just less than 35 bits so it is a bit inconvenient to use, and slower.

Therefore I am looking at adding an MWC  generator to get longer period, and better quality (if possible).  (I have also been pondering on a just-above-35 bit MCG...)
The MWC generators can do a state update and value mask in 3.3 ns, but since they emulate an LCG I guess they have the same bad low bits peculiarity as any LCG, so all I get with that is a longer period but smaller value range.

Suggested ones are a 29-bit:
./spect 1 8 0x202b7320 0x4056e63ffffffff
0.038691    0.696980    539718432    0x202b7320    1    0.928147    0.038691    0.926260    0.766536    0.643122    0.792836    0.689686

and a 32-bit:
./spect 1 8 0x7f17555 0x7f17554ffffffff
0.073333    0.381030    133264725    0x7f17555    1    0.163924    0.073333    0.740324    0.804943    0.607501    0.662911    0.751590

Can the 29-bit MWC be said to have good enough quality to be used with just masking the value bits, compared to the 35-bit LCG?
This option is actually a bit faster than the 35-bit LCG.

Or is the better option the 32-bit MWC to be used with a separate xorshift output function, which when used together takes 75% longer than the 35-bit LCG and 25% longer than the almost-35-bit MCG , and I assume has got much better quality and period...

BTW. xorshift up would preserve the peculiar properties of the low bits, right?  If so, is xorshift down a tried option?

Cheers
/ Raimo

Sebastiano Vigna

unread,
Apr 13, 2022, 5:08:07 PM4/13/22
to pr...@googlegroups.com


> On 13 Apr 2022, at 17:46, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> I had an amateur look at a generator as you suggest (multiplier 2^32 and A1 < 2^27), but without xorshift, and it appears to be impossible to achieve decent scores from `spect`.
> It is amazing to see how well a "shitty" generator behaves with xorshift scrambling...

No, wait--I said shitty f₂: I never said "shitty generator". That score identifies a very specific problem of the generator--for the rest, the generator might pass all possible tests. In fact, without the xorshift the generator fails PractRand at 256GB--worse than without xorshift, but definitely _not_ shitty. It would fail with much less data if PractRand had collision tests, but it doesn't.

> So, letting the user mask the low bits from an MWC (or an LCG, for that matter) is the fastest, and reasonably easy to use.
> Forcing the user to do the xorshift is too awkward.

> The 3.8 ns LCG at hand is this one:
> ./spect 1 8 15319397 0x800000000
> 0.674717 0.775734 15319397 0xe9c165 1 0.849147 0.736953 0.779583 0.705774 0.739156 0.679319 0.674717
>
> And I also have a 5.4 ns MCG:
> ./spect 1 8 185852 0x7FFFFFFE1
> 0.650639 0.805885 185852 0x2d5fc 1 0.933056 0.650639 0.801776 0.815666 0.681698 0.772522 0.664310
>
> I have learned that an LCG has e.g got the statistical peculiarity that the lowest bit alternates and then the period doubles for every higher bit, which sounds like an annoying flaw, and that the MCG does not have that flaw, but on the other hand has just less than 35 bits so it is a bit inconvenient to use, and slower.

OK, let's fix the terminology. I might have been taken in previous messages by the fact that x <- mx is a linear map, whereas x <- mx + a is an affine map. What people call LCG are really ACG. What people call MCG (sometimes) are really LCG (but again, some people use LCG for x <- mx).

So let's say that LCG is x <- mx and ACG is x <- mx + a :).

An MWC just simulates an LCG. So mathematically there is no difference between those two generators, and the lower bit of the first will NOT alternate.

> BTW. xorshift up would preserve the peculiar properties of the low bits, right? If so, is xorshift down a tried option?

The shitty spectral score mean that collision tests fail, and collision tests depend on the high bits. xorshifting down doesn't help.

Ciao,

seba

ratm...@gmail.com

unread,
Apr 14, 2022, 3:18:09 AM4/14/22
to prng
onsdag 13 april 2022 kl. 23:08:07 UTC+2 skrev Sebastiano Vigna:


> On 13 Apr 2022, at 17:46, Raimo Niskanen wrote:
>
> I had an amateur look at a generator as you suggest (multiplier 2^32 and A1 < 2^27), but without xorshift, and it appears to be impossible to achieve decent scores from `spect`.
> It is amazing to see how well a "shitty" generator behaves with xorshift scrambling...

No, wait--I said shitty f₂: I never said "shitty generator". That score identifies a very specific problem of the generator--for the rest, the generator might pass all possible tests. In fact, without the xorshift the generator fails PractRand at 256GB--worse than without xorshift, but definitely _not_ shitty. It would fail with much less data if PractRand had collision tests, but it doesn't.

Sorry about that misunderstanding.  Still learning...
 
> So, letting the user mask the low bits from an MWC (or an LCG, for that matter) is the fastest, and reasonably easy to use.
> Forcing the user to do the xorshift is too awkward.

> The 3.8 ns LCG at hand is this one:
 
35 bit ACG. X1 = (15319397 * X0 + 1) mod 2^35:
> ./spect 1 8 15319397 0x800000000
> 0.674717 0.775734 15319397 0xe9c165 1 0.849147 0.736953 0.779583 0.705774 0.739156 0.679319 0.674717
>
> And I also have a 5.4 ns MCG:
 
 Just below 35 bit LCG. X1 = (185852 * X0) mod (2^35 - 31)
> ./spect 1 8 185852 0x7FFFFFFE1
> 0.650639 0.805885 185852 0x2d5fc 1 0.933056 0.650639 0.801776 0.815666 0.681698 0.772522 0.664310
>
> I have learned that an LCG has e.g got the statistical peculiarity that the lowest bit alternates and then the period doubles for every higher bit, which sounds like an annoying flaw, and that the MCG does not have that flaw, but on the other hand has just less than 35 bits so it is a bit inconvenient to use, and slower.

OK, let's fix the terminology. I might have been taken in previous messages by the fact that x <- mx is a linear map, whereas x <- mx + a is an affine map. What people call LCG are really ACG. What people call MCG (sometimes) are really LCG (but again, some people use LCG for x <- mx).

Yes.  That is precisely my misunderstanding.  I learn from Wikipedia :-( ...
I called x <- mx an MCG and x <- mx + a an LCG.  I will avoid that from now.


So let's say that LCG is x <- mx and ACG is x <- mx + a :).

An MWC just simulates an LCG. So mathematically there is no difference between those two generators, and the lower bit of the first will NOT alternate.

Great!  That came with my misunderstanding the terminology.

I think the message is slowly sinking in...

The MWC generator T1 = 0x202b7320 * T0 + T0 >> 29 simulates the LCG T1 = (2^29 * T0) mod 0x4056e63ffffffff.
The MWC generator T1 = 0x7f17555 * T0 + T0 >> 32 simulates the LCG T1 = (2^32 * T0) mod 0x7f17554ffffffff.

Comparing these two; the first has a modulus a bit above 58 bits, and the second one a bit below 59 bits.
The first has a more "balanced" A1 compared to the shift. The second is more "unbalanced" (which makes its f2 score much worse).
Is there anything that actually says that the second is more 32-bit than the first?  Both have a non-power-of-2 modulus above 58 bits so grabbing 32 bits from either would give not noticeable bias (buried in 26..27 bits margin).
Would doing a 32 bit xorshift to get the output value work just as well for the first as for the second?

A low f3 score; does that mean that if you plot 3-tuples from the state i.e {Xn, Xn+1, Xn+2}for many n, as vectors, they are not evenly distributed in a 3 dimensions, but tend to gather along planes?

How come these MWC generators (all ?) has got bad f3 scores, while the short period 35-bit ACG and LCG from L'Ecuyer does not?  Is it because of the power-of-two multiplier?


> BTW. xorshift up would preserve the peculiar properties of the low bits, right? If so, is xorshift down a tried option?

The shitty spectral score mean that collision tests fail, and collision tests depend on the high bits. xorshifting down doesn't help.

Ok. Great!

Cheers
/ Raimo
 

Sebastiano Vigna

unread,
Apr 14, 2022, 3:47:28 AM4/14/22
to pr...@googlegroups.com

> On 14 Apr 2022, at 09:18, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> onsdag 13 april 2022 kl. 23:08:07 UTC+2 skrev Sebastiano Vigna:
>
> Sorry about that misunderstanding. Still learning...

I've been doing this for 10 years, read hundreds of papers, and I'm still learning, too, so welcome to the club :).

Just to give you an idea: that generator, without xorshift, will pass without a warning PractRand up to 32GB, but it will die with a horrible p-value in a birthday-spacings test after 20MB--more than a *thousand* times less data. PractRand is very biased in its choice of tests compared to TestU01.

> Yes. That is precisely my misunderstanding. I learn from Wikipedia :-( ...
> I called x <- mx an MCG and x <- mx + a an LCG. I will avoid that from now.

No but that's totally standard, too. The terminology is just a mess. See the footnote in the first page of https://onlinelibrary.wiley.com/doi/10.1002/spe.3030 (just before the definition of an ACG).

> The MWC generator T1 = 0x202b7320 * T0 + T0 >> 29 simulates the LCG T1 = (2^29 * T0) mod 0x4056e63ffffffff.
> The MWC generator T1 = 0x7f17555 * T0 + T0 >> 32 simulates the LCG T1 = (2^32 * T0) mod 0x7f17554ffffffff.

Yes, exactly.

> Comparing these two; the first has a modulus a bit above 58 bits, and the second one a bit below 59 bits.
> The first has a more "balanced" A1 compared to the shift. The second is more "unbalanced" (which makes its f2 score much worse).

Exactly. Note that we can look for something that will make the modulus of the first closer to 2^59-1, which will give you a ≈50% longer period.

> Is there anything that actually says that the second is more 32-bit than the first? Both have a non-power-of-2 modulus above 58 bits so grabbing 32 bits from either would give not noticeable bias (buried in 26..27 bits margin).

No. The difference is that in the second case we know something about the spectral scores because the "digits" are 32 bits. You could use the first and grab the lower 32 bits. I'd run some testing. We're going into somewhat uncharted territory. These things are subtle because there's a theorem that tells you that the "digits" of an MWC form a lattice that's very close to the lattice of the associated LCG, but I have no idea what happens if you take more bits. Pierre L'Ecuyer is the person that could have an answer (he proved that theorem).

> Would doing a 32 bit xorshift to get the output value work just as well for the first as for the second?

That's a good question. I don't know, because I don't know whether there's something to fix about collisions in the first one if you take the lower 32 bits. I can do some testing if you want.

> A low f3 score; does that mean that if you plot 3-tuples from the state i.e {Xn, Xn+1, Xn+2}for many n, as vectors, they are not evenly distributed in a 3 dimensions, but tend to gather along planes?

Yes, collisions and birthday-spacing tests will fail in dimension 3 (they actually do).

> How come these MWC generators (all ?) has got bad f3 scores, while the short period 35-bit ACG and LCG from L'Ecuyer does not? Is it because of the power-of-two multiplier?

MWC has a very specific multiplier--a power of 2. It can be proved that in general that will give you bad f₃. How bad this will be in practice depends on the state size and output size. For example, the 128-bit-state, 64-bit-output MWC on my PRNG page fails 3-dimensional birthday-spacing tests but you need a terabyte of data for that to happen. It's very unlikely to affect applications, but it happens.

> The shitty spectral score mean that collision tests fail, and collision tests depend on the high bits. xorshifting down doesn't help.

More precisely: collision tests and birthday-spacing tests see the output as points in the unit cube. So their position in the cube depends mostly on the upper bits. If you ran the same test on the output reversed, you'd get a test on the lower bits. But usually lower bits in these generators (with prime modulus) are very good, so nobody does that.

Ciao,

seba

ratm...@gmail.com

unread,
Apr 14, 2022, 6:26:01 AM4/14/22
to prng
When saying that a MWC generator "simulates" an LCG, that means that the spectral scores of the "digits" from the MWC generator  are very close to the spectral scores of the state of the LCG, but we do not know much about the rest of the state of the MWC generator. Right?  It also has the same period and state range.  But one can not say that the MWC generator is the same as the associated LCG.

Therefore, if one wants 32 bits it would be kosher to choose a 32-bit digit MWC generator and xorshift up the 16 low bits.
If it is ok with 29 bits then it would be kosher to use a 29 bit MWC generator and xorshift up e.g the 14 low bits. (t & (1<<29 - 1) xor ((t & (1<<14 - 1)) << 15).

It feels good to have mathematical guarantees...

I was under the delusion that an MWC generator was "the same" as an LCG and therefore searched for a prime modulus above and close to 2^58 so it would be cheap to rejection sample down to 58 bits.  But since it is only the digit size that has got mathematical guarantees that idea went down like a lead balloon.  And I think I have searched, for a 2^29 multiplier, for a prime modulus near below 2^59 but failed to find any with good spectral scores (disclaimer - my search tool is slow).  The best I could find were just above 2^58 that were maybe only slightly better than your suggested just below 2^58.  Maybe it gets unbalanced at 30 bits A1 with multiplier 2^29.  I think I will try 29 bits A1 with multiplier 2^30...

Cheers
/ Raimo

ratm...@gmail.com

unread,
Apr 26, 2022, 3:36:25 AM4/26/22
to prng
I have been tinkering with these MWC generators, and have realized a few things that I hope are correct:

The lag-1 MWC generator t1 = (a * (t0 & (2^b -1)) + (t0 >> b) corresponds to a prime modulus LCG (Lehmer generator) t0 = (t1 << b) % p, where p = (a << b) - 1.

In fact, Wikipedia claims, and I have tried to do the math and written test programs; as far as I can tell the corresponding LCG produces the *same* sequence as the MWC generator but in reverse order.

Since a and 2^b are multiplicative inverses mod p; p = (a << b) - 1 = a * 2^b - 1 => a * 2^b = p + 1 = 1 mod p, then the LCG t1 = (t0 * a) % p is equivalent to the MWC generator - it produces the same sequence.

So there is not really any "digit size" in an MWC generator - it is simply an alternative implementation of an LCG generator with particular parameters. (Prime modulus and one of the a constants is a power of 2) Its low spectral score in 3 dimensions stems from the power of 2 multiplier and is not particular to any "MWC digit" size.

I have tried (a = 0x7f17555, b = 32), (a = 0x20075dc0, b = 29), and (a = 0x1ffb0729, b = 29), and it is the last one, one you early suggested as a good candidate that seems to perform the best.  Trying to get better quality from the low 32 bits by using a "MWC digit" of 32 bits seems to be a dead end since there is no "MWC digit" because it is simply an LCG.

The MWC generator a = 0x1ffb0729, b = 29 with an xorshift 16 scrambler passes TestU01 Crush, but fails consistently Rabbit - Matrix rank 320 x 320 and the problem is in the low bits - the high 32 bits (of 58) passes both test batteries.  Therefore I tried using xor with rotl58 16 to improve the low bits.  I also played with the shift constant and it seems 18 is slightly better.  That generator passes TestU01 Crush and Rabbit 2^32 for the low, the middle (>> 16) and the high bits.

So here is the generator:
    t1 = 0x1ffb0729 * (t0 & ((1<<29)-1)) + (t0 >> 29);
    v = t1 ^ ((t1 << 18) | (t1 >> (58-18)));
    v &= (1<<58) - 1;
I am now trying the low 32 bits in TestU01 BigCrush, just to see what happens.

Cheers
/ Raimo

Sebastiano Vigna

unread,
Apr 26, 2022, 1:30:02 PM4/26/22
to pr...@googlegroups.com


> On 26 Apr 2022, at 09:36, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> The lag-1 MWC generator t1 = (a * (t0 & (2^b -1)) + (t0 >> b) corresponds to a prime modulus LCG (Lehmer generator) t0 = (t1 << b) % p, where p = (a << b) - 1.
>
> In fact, Wikipedia claims, and I have tried to do the math and written test programs; as far as I can tell the corresponding LCG produces the *same* sequence as the MWC generator but in reverse order.

Yes. I've been repeating "multiplier 2^29" but that was a shortcut for "multiplier 2^-29 mod p". The multiplier is actually the modular inverse of 2^29 modulo p.

> So there is not really any "digit size" in an MWC generator - it is simply an alternative implementation of an LCG generator with particular parameters. (Prime modulus and one of the a constants is a power of 2) Its low spectral score in 3 dimensions stems from the power of 2 multiplier and is not particular to any "MWC digit" size.

That I don't know. The theorem about spectral scores is for the digits (that is, the theorem saying that you can use the LCG spectral score for the digits).

> I have tried (a = 0x7f17555, b = 32), (a = 0x20075dc0, b = 29), and (a = 0x1ffb0729, b = 29), and it is the last one, one you early suggested as a good candidate that seems to perform the best. Trying to get better quality from the low 32 bits by using a "MWC digit" of 32 bits seems to be a dead end since there is no "MWC digit" because it is simply an LCG.

Yes, but, again, the spectral scores are for the LCG, not for the lower bits in general. It works for the lower bits if they are the same bits of the multiplier. Above or below that, no idea.

> The MWC generator a = 0x1ffb0729, b = 29 with an xorshift 16 scrambler passes TestU01 Crush, but fails consistently Rabbit - Matrix rank 320 x 320 and the problem is in the low bits - the high 32 bits (of 58) passes both test batteries. Therefore I tried using xor with rotl58 16 to improve the low bits. I also played with the shift constant and it seems 18 is slightly better. That generator passes TestU01 Crush and Rabbit 2^32 for the low, the middle (>> 16) and the high bits.

Check the high bits with PractRand because my experience is that the high bits of MWC have the spectral score you expect, but fail everything else. They are a real disaster.

>
> So here is the generator:
> t1 = 0x1ffb0729 * (t0 & ((1<<29)-1)) + (t0 >> 29);
> v = t1 ^ ((t1 << 18) | (t1 >> (58-18)));
> v &= (1<<58) - 1;
> I am now trying the low 32 bits in TestU01 BigCrush, just to see what happens.

I guess there's a t0 = t1 at the end. Or you can do everything with t0.

That seems reasonable, as you should reduce the defects of the generator. In particular, the bad high bits.

Ciao,

seba


PS: the lower 32 bits of the generator above do not go well with PractRand:

rng=MWC-29-58-0x1ffb0729-ROTL18, seed=0x506866ff
length= 2 gigabytes (2^31 bytes), time= 136 seconds
Test Name Raw Processed Evaluation
FPF-14+6/16:(2,14-0) R= +13.1 p = 9.9e-12 VERY SUSPICIOUS
FPF-14+6/16:(3,14-0) R= +19.0 p = 3.7e-17 FAIL
FPF-14+6/16:(4,14-0) R= +10.6 p = 1.8e-9 very suspicious
FPF-14+6/16:all R= +19.4 p = 1.0e-17 FAIL !
FPF-14+6/4:(0,14-0) R= +29.0 p = 1.9e-26 FAIL !!
FPF-14+6/4:(1,14-0) R= +14.0 p = 1.3e-12 VERY SUSPICIOUS
FPF-14+6/4:(2,14-0) R= +8.3 p = 2.6e-7 mildly suspicious
FPF-14+6/4:(3,14-0) R= +9.7 p = 1.3e-8 suspicious
FPF-14+6/4:all R= +23.9 p = 7.2e-22 FAIL !!


ratm...@gmail.com

unread,
Apr 26, 2022, 3:39:59 PM4/26/22
to prng
So I will have to figure out how to run PractRand, it seems...  Is it a better tool than TestU01?

Btw.  The low 32 bits passed TestU01 BigCrush.  Now running high and mid bits.

And, yes, my confusion of t0 and t1 is thanks to Erlang being functional, so there the variables must be different ones, and the mathematical notation also uses t0 and t1 for before and after.  In C, of course, only t is needed.

Regarding MWC vs. the corresponding LCG: my hypothesis is that since they produce the same sequence, they should have the same spectral score for the generators as such,  not only for the low MWC "digit".  Not important, I guess, since they can be bad even with a good spectral score.

Cheers
/ Raimo

Sebastiano Vigna

unread,
Apr 26, 2022, 3:47:05 PM4/26/22
to pr...@googlegroups.com


> On 26 Apr 2022, at 21:39, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> So I will have to figure out how to run PractRand, it seems... Is it a better tool than TestU01?

It is complementary. It has a very limited set of sets, but it kills TestU01 in searching for bit patterns. Things that die easily on PractRand pass with flying grades on TestU01.

> Btw. The low 32 bits passed TestU01 BigCrush. Now running high and mid bits.

Yep, but, as I told you, they die horribly on PractRand.

> Regarding MWC vs. the corresponding LCG: my hypothesis is that since they produce the same sequence, they should have the same spectral score for the generators as such, not only for the low MWC "digit". Not important, I guess, since they can be bad even with a good spectral score.

OK, but we're talking about two different things: t, the global state of the generator, is the state of the LCG. We compute spectral scores *on that LCG*. So it's exactly those.

The surprising thing (that happens just because the multiplier is 2^k) is that if the multiplier is 2^-k the k lowest bits of t describe a lattice that is very close to that of the LCG. This can be proved only for the lowest k bits when the multiplier is 2^k. It is not true of any other bit suffix.

I'm running PractRand on the upper 32 bits of your generator and on the "natural" 32 bits of a generator with 58 bits of state and 32-bit digits, xorshifted by << 16. I'm curious to see which method extract more randomness (as measured by PractRand; and I'm already taking as a fact that the lower bit of yours are bad).

Ciao,

seba

ratm...@gmail.com

unread,
Apr 26, 2022, 5:58:05 PM4/26/22
to prng
tisdag 26 april 2022 kl. 21:47:05 UTC+2 skrev Sebastiano Vigna:


> On 26 Apr 2022, at 21:39, Raimo Niskanen wrote:
>
> So I will have to figure out how to run PractRand, it seems... Is it a better tool than TestU01?

It is complementary. It has a very limited set of sets, but it kills TestU01 in searching for bit patterns. Things that die easily on PractRand pass with flying grades on TestU01.


Ok. That figures...

 
> Btw. The low 32 bits passed TestU01 BigCrush. Now running high and mid bits.

Yep, but, as I told you, they die horribly on PractRand.

Yes.  They passed too.


> Regarding MWC vs. the corresponding LCG: my hypothesis is that since they produce the same sequence, they should have the same spectral score for the generators as such, not only for the low MWC "digit". Not important, I guess, since they can be bad even with a good spectral score.

OK, but we're talking about two different things: t, the global state of the generator, is the state of the LCG. We compute spectral scores *on that LCG*. So it's exactly those.

The surprising thing (that happens just because the multiplier is 2^k) is that if the multiplier is 2^-k the k lowest bits of t describe a lattice that is very close to that of the LCG. This can be proved only for the lowest k bits when the multiplier is 2^k. It is not true of any other bit suffix.

I am missing a truckload of math, it seems...
 
I'm running PractRand on the upper 32 bits of your generator and on the "natural" 32 bits of a generator with 58 bits of state and 32-bit digits, xorshifted by << 16. I'm curious to see which method extract more randomness (as measured by PractRand; and I'm already taking as a fact that the lower bit of yours are bad).

Ah.  Your proposed generator with a1 = 0x7f17555, I presume...

Interesting. I was running PractRand's "RNG_test stdin32" on the low 32 bits of my generator, and it just passed the 16 GB mark with only 4 "unusual"s so far.  Then I realized that I had gotten accidentally creative since there was no description on byte order for the stdin32 generator, so I used network byte order (htonl).  That must have introduced additional very successful scrambling.  When I use host byte order it fails almost immediately on 2^31 bytes just as you said.

By changing the rot58  back to 16 it works much better in PractRand.  The high 32 bits (of 58) failed on 256 GB for DC6-9x1Bytes-1.  I'll try other bits and  again later how rot58 16 works in Test U01...

Cheers.
/ Raimo 

Sebastiano Vigna

unread,
Apr 26, 2022, 6:03:00 PM4/26/22
to pr...@googlegroups.com


> On 26 Apr 2022, at 23:58, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> I'm running PractRand on the upper 32 bits of your generator and on the "natural" 32 bits of a generator with 58 bits of state and 32-bit digits, xorshifted by << 16. I'm curious to see which method extract more randomness (as measured by PractRand; and I'm already taking as a fact that the lower bit of yours are bad).
>
> Ah. Your proposed generator with a1 = 0x7f17555, I presume...

Nope, this is 58 bits as you were doing the same. 0x3f35301.

> Interesting. I was running PractRand's "RNG_test stdin32" on the low 32 bits of my generator, and it just passed the 16 GB mark with only 4 "unusual"s so far. Then I realized that I had gotten accidentally creative since there was no description on byte order for the stdin32 generator, so I used network byte order (htonl). That must have introduced additional very successful scrambling. When I use host byte order it fails almost immediately on 2^31 bytes just as you said.

Wow.

> By changing the rot58 back to 16 it works much better in PractRand. The high 32 bits (of 58) failed on 256 GB for DC6-9x1Bytes-1. I'll try other bits and again later how rot58 16 works in Test U01...

Interesting!

Ciao,

seba

ratm...@gmail.com

unread,
Apr 27, 2022, 12:36:23 AM4/27/22
to prng


onsdag 27 april 2022 kl. 00:03:00 UTC+2 skrev Sebastiano Vigna:


> On 26 Apr 2022, at 23:58, Raimo Niskanen wrote:

> By changing the rot58 back to 16 it works much better in PractRand. The high 32 bits (of 58) failed on 256 GB for DC6-9x1Bytes-1. I'll try other bits and again later how rot58 16 works in Test U01...

Interesting!

The low 32 bits failed on 512 GB for  FPF-14+6/16:all and friends.
The mid 32 bits (>>16) are still running 2 TB.

Cheers.
/ Raimo

Sebastiano Vigna

unread,
Apr 27, 2022, 2:22:24 AM4/27/22
to pr...@googlegroups.com


> On 27 Apr 2022, at 06:36, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
>
> The low 32 bits failed on 512 GB for FPF-14+6/16:all and friends.
> The mid 32 bits (>>16) are still running 2 TB.

I'm seeing the differences I'm expecting:

rng=MWC-32-58-0x3f35301-XS16, seed=0x63ecfce4
length= 512 gigabytes (2^39 bytes), time= 30062 seconds
no anomalies in 2058 test result(s)


rng=MWC-29-58-0x1ffb0729-XR18-UPPER32, seed=0x9780d5bf
length= 512 gigabytes (2^39 bytes), time= 30361 seconds
Test Name Raw Processed Evaluation
BCFN(0+0,13-0,T) R= +16.4 p = 2.5e-8 very suspicious
BCFN(0+1,13-0,T) R= +11.9 p = 6.6e-6 mildly suspicious
BCFN(0+2,13-0,T) R= +13.2 p = 1.2e-6 suspicious
...and 2055 test result(s) without anomalies

We'll see how this develops.

Ciao,

seba

ratm...@gmail.com

unread,
Apr 27, 2022, 6:00:27 AM4/27/22
to prng
The MWC with 29-bit digit and rotl58 16 has now passed 4 TB in PractRand with the mid bits (16..47).
It has also passed TestU01 BigCrush with the low bits (0..31) and the high bits (26..57).

Cheers!
/ Raimo

Sebastiano Vigna

unread,
Apr 27, 2022, 7:35:02 AM4/27/22
to pr...@googlegroups.com


> On 27 Apr 2022, at 12:00, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> The MWC with 29-bit digit and rotl58 16 has now passed 4 TB in PractRand with the mid bits (16..47).
> It has also passed TestU01 BigCrush with the low bits (0..31) and the high bits (26..57).
>

rng=MWC-32-58-0x3f35301-XS16, seed=0x63ecfce4
length= 1 terabyte (2^40 bytes), time= 60001 seconds
no anomalies in 2132 test result(s)

rng=MWC-29-58-0x1ffb0729-ROTL18, seed=0x9780d5bf
length= 1 terabyte (2^40 bytes), time= 60604 seconds
Test Name Raw Processed Evaluation
BCFN(0+0,13-0,T) R= +29.2 p = 3.8e-15 FAIL
BCFN(0+1,13-0,T) R= +25.2 p = 5.1e-13 FAIL
BCFN(0+2,13-0,T) R= +21.8 p = 3.5e-11 VERY SUSPICIOUS
DC6-9x1Bytes-1 R= +10.1 p = 2.3e-4 unusual
...and 2128 test result(s) without anomalies

The gap is widening. I'll try the ROTL16 variant, too.

Ciao,

seba

Sebastiano Vigna

unread,
Apr 27, 2022, 12:43:40 PM4/27/22
to pr...@googlegroups.com


> On 27 Apr 2022, at 12:00, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> The MWC with 29-bit digit and rotl58 16 has now passed 4 TB in PractRand with the mid bits (16..47).
> It has also passed TestU01 BigCrush with the low bits (0..31) and the high bits (26..57).

The low bits die on PractRand tho:

rng=MWC-29-58-0x1ffb0729-ROTL16-LOWER32, seed=0x8041d99d
length= 256 gigabytes (2^38 bytes), time= 15824 seconds
Test Name Raw Processed Evaluation
FPF-14+6/16:(2,14-0) R= +9.3 p = 3.4e-8 mildly suspicious
FPF-14+6/16:(3,14-0) R= +9.8 p = 1.1e-8 suspicious
FPF-14+6/16:(4,14-0) R= +10.3 p = 4.0e-9 suspicious
FPF-14+6/16:all R= +11.7 p = 1.8e-10 VERY SUSPICIOUS
...and 1979 test result(s) without anomalies

Ciao,

seba

Sebastiano Vigna

unread,
Apr 27, 2022, 12:44:42 PM4/27/22
to pr...@googlegroups.com


> On 27 Apr 2022, at 12:00, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> The MWC with 29-bit digit and rotl58 16 has now passed 4 TB in PractRand with the mid bits (16..47).
> It has also passed TestU01 BigCrush with the low bits (0..31) and the high bits (26..57).
>

The high, too:

rng=MWC-29-58-0x1ffb0729-ROTL16, seed=0xc24043b0
length= 256 gigabytes (2^38 bytes), time= 15858 seconds
Test Name Raw Processed Evaluation
BCFN(0+1,13-0,T) R= +14.5 p = 2.8e-7 suspicious
BCFN(0+2,13-0,T) R= +12.8 p = 2.3e-6 mildly suspicious
DC6-9x1Bytes-1 R= +39.6 p = 1.3e-14 FAIL
DC6-6x2Bytes-1 R= +24.8 p = 1.5e-11 FAIL
DC6-5x4Bytes-1 R= +9.7 p = 3.8e-6 mildly suspicious
...and 1978 test result(s) without anomalies

Are we using the same code?

#define stringify(x) #x
#define MWC_A1 0x1ffb0729
#define NAME(A1) "MWC-29-58-" stringify(A1) "-ROTL16-UPPER32" REVERSE_INFIX

uint32_t x;
uint32_t c = 1;

Uint32 inline raw32() {
const uint64_t t = MWC_A1 * (uint64_t)x + c;
x = t & (1 << 29) - 1;
c = t >> 29;
return REV((t ^ ((t << 16) | (t >> (58 - 16)))) >> 26);
}

Ciao,

seba

ratm...@gmail.com

unread,
Apr 27, 2022, 4:24:45 PM4/27/22
to prng
The code looks equivalent.  I have not figured out how to write a linked in generator in PractRand so I pipe the data on stdin32.

As I wrote earlier: The high 32 bits (of 58) failed on 256 GB for DC6-9x1Bytes-1. The low 32 bits failed on 512 GB for  FPF-14+6/16:all and friends.

I did not know how good/bad that was.  Compared to ROTL-18 it was a success, but I note now that it should be seen as only less of a failure...

The mid bits (>> 16) have passed 4 TB.  I guess that is a very good selection of bits.

This seems to be a dead end, then...

I will start to test your early suggestion MWC-32-59-0x7f17555UL-XS16 or ROTL16 depending on how the low 32 bits behave in BigCrush.
It could be less unbalanced before scrambling than an MWC-32-58.

Cheers,
/ Raimo

Sebastiano Vigna

unread,
Apr 27, 2022, 5:00:55 PM4/27/22
to pr...@googlegroups.com


> On 27 Apr 2022, at 22:24, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> The code looks equivalent. I have not figured out how to write a linked in generator in PractRand so I pipe the data on stdin32.
>
> As I wrote earlier: The high 32 bits (of 58) failed on 256 GB for DC6-9x1Bytes-1. The low 32 bits failed on 512 GB for FPF-14+6/16:all and friends.

Oops, I'm sorry, I got lost in our long exchange.

> I will start to test your early suggestion MWC-32-59-0x7f17555UL-XS16 or ROTL16 depending on how the low 32 bits behave in BigCrush.
> It could be less unbalanced before scrambling than an MWC-32-58.

Well my (possibly wrong) idea is that the lower bits are the good ones in these LCGs (i.e., the "lower digit").

Would you be OK with a 32-bit generator with a 2^58 state space if it had very good statistical properties?

Ciao,

seba

PS: For completeness, this is what I'm testing:

#define stringify(x) #x
#define MWC_A1 0x3f35301
#define NAME(A1) "MWC-32-58-" stringify(A1) "-XS16" REVERSE_INFIX

uint32_t x;
uint32_t c = 1;

Uint32 inline raw32() {
const uint64_t t = MWC_A1 * (uint64_t)x + c;
x = t;
c = t >> 32;
return REV(x ^ (x << 16));
}


ratm...@gmail.com

unread,
Apr 28, 2022, 2:56:32 AM4/28/22
to prng
MWC-32-59-0x7f17555UL-XS16

For bits 0..31:
========= Summary results of BigCrush =========
 Version:          TestU01 1.2.3
 Generator:        mwc58
 Number of statistics:  160
 Total CPU time:   04:34:11.99
       Test                          p-value
 ----------------------------------------------
 15  BirthdaySpacings, t = 4        2.4e-15
 19  BirthdaySpacings, t = 8        6.0e-31
 ----------------------------------------------
 All other tests were passed

PractRand RNG_test:
=================
rng=RNG_stdin32, seed=unknown
length= 2 terabytes (2^41 bytes), time= 30958 seconds

  Test Name                         Raw       Processed     Evaluation
  DC6-9x1Bytes-1                    R= +50.0  p =  3.4e-18    FAIL !        
  ...and 312 test result(s) without anomalies


Bits 16..47
=========
Passed BigCrush and PractRand 2 TB, still running...


For bits 26..57:
========= Summary results of BigCrush =========
 Version:          TestU01 1.2.3
 Generator:        mwc58
 Number of statistics:  160
 Total CPU time:   04:33:25.81
       Test                          p-value
 ----------------------------------------------
 44  CollisionPermut, r = 0          5.6e-4
 53  SampleMean, r = 0               1.5e-4
 ----------------------------------------------
 All other tests were passed

These bits have passed PractRand 2 TB, still running...


I am now trying ROTL16 to see if the low bits improve, and trying a different seed for the high bits in BigCrush to see if that was bad luck...

> Well my (possibly wrong) idea is that the lower bits are the good ones in these LCGs (i.e., the "lower digit").

To me it seems that only the low 16 are really good ones since bits 16..47 passes both BigCrush and PractRand.

> Would you be OK with a 32-bit generator with a 2^58 state space if it had very good statistical properties?

Yes I would.
I am trying a 2^59 state space in the hope that it would give a slightly less imbalanced base generator.
The initial goal was to be better than a 35-bit ACG and an LCG that I have implemented just from Pierre L'Ecuyer's paper on lattice structure, and when I test them with TestU01 they are much worse than any of these MWC:s we have been experimenting with, so that goal is passed with margin.
Now it is more about releasing a generator that could not be obviously better.
Also, should I say that the generator is 32-bit and only return 32 bits, or should I return all 58 bits and say what, exactly about the high 32 bits?  Now it seems that they are almost as good or better than the low 32 bits, so it would be nice return 58 bits with some confidence.
Since the mid bits appears to be the best I would get the best 32 bits by xorshift down 16 bits and then return the 32 low, if I would call this a 32-bit generator.

Cheers
/ Raimo

ratm...@gmail.com

unread,
Apr 28, 2022, 6:05:30 AM4/28/22
to prng
MWC-32-59-0x7f17555UL-ROTL16 seems to be slightly better than MWC-32-59-0x7f17555UL-XS16 for the low 32 bits, since it passed BigCrush without warning for BirthdaySpacings, t = 4, t = 8.
Still waiting to see about PractRand 2 TB. On my laptop that level takes 8.5 hours + the time leading up to that (4..8 hours).
Cheers
/ Raimo

Sebastiano Vigna

unread,
Apr 28, 2022, 6:41:02 AM4/28/22
to pr...@googlegroups.com


> On 28 Apr 2022, at 12:05, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> MWC-32-59-0x7f17555UL-ROTL16 seems to be slightly better than MWC-32-59-0x7f17555UL-XS16 for the low 32 bits, since it passed BigCrush without warning for BirthdaySpacings, t = 4, t = 8.
> Still waiting to see about PractRand 2 TB. On my laptop that level takes 8.5 hours + the time leading up to that (4..8 hours).

Well, yes, twice the state space will help with collision-based tests.

Ciao,

seba

Sebastiano Vigna

unread,
Apr 28, 2022, 6:51:16 AM4/28/22
to pr...@googlegroups.com


> On 28 Apr 2022, at 12:40, Sebastiano Vigna <sebastia...@unimi.it> wrote:
>
>
>
> Well, yes, twice the state space will help with collision-based tests.
>

BTW, can you send me the complete output? There are several tests with t=4 or 8. Some have more cells than the orbit size. That's weird.

Ciao,

seba

Sebastiano Vigna

unread,
Apr 28, 2022, 8:43:50 AM4/28/22
to pr...@googlegroups.com

> On 28 Apr 2022, at 12:05, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> MWC-32-59-0x7f17555UL-ROTL16 seems to be slightly better than MWC-32-59-0x7f17555UL-XS16 for the low 32 bits, since it passed BigCrush without warning for BirthdaySpacings, t = 4, t = 8.
> Still waiting to see about PractRand 2 TB. On my laptop that level takes 8.5 hours + the time leading up to that (4..8 hours).

OK, now I have exact data and things are clearer.

Both generator die horribly on birthday spacings, d=2^16, t=4.

58:
Running a birthday-spacings test on the full 64-bit output using 84546572 points (0.63 GiB)
d: 65536 t: 4 cells: 2^64 expected birthday-spacings collisions: 8190.46
58513 p=0

59:
Running a birthday-spacings test on the full 64-bit output using 84546572 points (0.63 GiB)
d: 65536 t: 4 cells: 2^64 expected birthday-spacings collisions: 8190.46
31063 p=0

The second generator dies "a little bit less horribly" because it has a bit more space.

In fact, BigCrush runs this test with half the points, with a right shift of 14 bits. But also in that case I get similar results:

58:
Running a birthday-spacings test on the lowest 50 bit(s) using 30000000 points (0.22 GiB)
d: 65536 t: 4 cells: 2^64 expected birthday-spacings collisions: 365.918
2679 p=0

59:
Running a birthday-spacings test on the lowest 50 bit(s) using 30000000 points (0.22 GiB)
d: 65536 t: 4 cells: 2^64 expected birthday-spacings collisions: 365.918
1325 p=0

The collisions are almost exactly twice (the orbits is two times smaller).

These tests will chronometrically start to fail when you have more cells than points: these are the results for 58 bits, having 2^56 or 2^60 sells:

Running a birthday-spacings test on the full 64-bit output using 8388082 points (0.06 GiB)
d: 16384 t: 4 cells: 72057594037927936 expected birthday-spacings collisions: 2047.61
2047 p=0.5

Running a birthday-spacings test on the full 64-bit output using 26630501 points (0.20 GiB)
d: 32768 t: 4 cells: 1152921504606846976 expected birthday-spacings collisions: 4095.23
4312 p=0.000399687

If you add a bit of state, you resist a little bit more (this is with 59 bits):

Running a birthday-spacings test on the full 64-bit output using 26630501 points (0.20 GiB)
d: 32768 t: 4 cells: 1152921504606846976 expected birthday-spacings collisions: 4095.23
4141 p=0.239251

Running a birthday-spacings test on the full 64-bit output using 84546572 points (0.63 GiB)
d: 65536 t: 4 cells: 2^64 expected birthday-spacings collisions: 8190.46
31063 p=0

So with such a short period these tests don't say much. You'll just die depending on your state size. But this has no relationship with the quality of the generator.

Ciao,

seba

Sebastiano Vigna

unread,
Apr 28, 2022, 8:53:44 AM4/28/22
to pr...@googlegroups.com


> On 28 Apr 2022, at 14:43, Sebastiano Vigna <sebastia...@unimi.it> wrote:
> OK, now I have exact data and things are clearer.

OK. My previous post was technically correct, but I was not applying a xorshift. I realized that that's why my results were different from TestU01.

So, getting back to our case: the test that fail/do not fail for 58/59 are tests with a left shift of 14, d=^16 t=4, n=30000000. And indeed:

59:
Running a birthday-spacings test on the lowest 50 bit(s) using 30000000 points (0.22 GiB)
d: 65536 t: 4 cells: 2^64 expected birthday-spacings collisions: 365.918
410 p=0.0124245

58:
Running a birthday-spacings test on the lowest 50 bit(s) using 30000000 points (0.22 GiB)
460 p=1.22943e-06

But this is just an artifact. For these sizes, you can run the birthday-spacing test on more points:

59:
Running a birthday-spacings test on the lowest 50 bit(s) using 84546572 points (0.63 GiB)
d: 65536 t: 4 cells: 2^64 expected birthday-spacings collisions: 8190.46
9002 p=5.65431e-19

58:
Running a birthday-spacings test on the lowest 50 bit(s) using 84546572 points (0.63 GiB)
d: 65536 t: 4 cells: 2^64 expected birthday-spacings collisions: 8190.46
10185 p=3.04596e-100

So, even with xorshift, it's just an artifact. Both fail. The state is just too small. Well, 58 "fails more", but it's obvious with half the state.

The problem is that BigCrush is just a bunch of arbitrary and sometimes really bizarre parameter choices (left shift of 14? why 14?). So it often happens that it hits or miss something depending on the parameters. A bit like it happens for linearity in lower bits (https://xoshiro.di.unimi.it/lowcomp.php).

Ciao,

seba

ratm...@gmail.com

unread,
Apr 28, 2022, 9:38:12 AM4/28/22
to prng
That was a lot of intricate information and conclusions...

So, which one would you recommend?
32-bit digit, I guess:  0x3f3530 or 0x7f17555?
Xorshift 16 or rotl 16?
Dare one say that after scrambling 58 bits are useful or only 32?

Cheers
/ Raimo

Sebastiano Vigna

unread,
Apr 28, 2022, 9:42:06 AM4/28/22
to pr...@googlegroups.com


> On 28 Apr 2022, at 15:38, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> That was a lot of intricate information and conclusions...

I'm sorry, we're really doing super in tuning.
>
> So, which one would you recommend?
> 32-bit digit, I guess: 0x3f3530 or 0x7f17555?

The second is better, but slower: only you can decide.

> Xorshift 16 or rotl 16?

I would xorshift because rotl16 is not bijective. Any word with the same upper and lower half will give you zero.

If you have enough time, you might consider double XORROT, like

t ^ (t << 16 | t >> 16) ^ (t << 17 | t >> 13)

which is bijective.

> Dare one say that after scrambling 58 bits are useful or only 32?

I think you should use only the lower 32, using a multiplier 2^-32.

Ciao,

seba

Sebastiano Vigna

unread,
Apr 28, 2022, 9:45:36 AM4/28/22
to pr...@googlegroups.com


On April 28, 2022 3:41:55 PM GMT+02:00, Sebastiano Vigna <sebastia...@unimi.it> wrote:
.
>
>If you have enough time, you might consider double XORROT, like
>
>t ^ (t << 16 | t >> 16) ^ (t << 17 | t >> 13)
>
>which is bijective.

Oops, the second rotation is wrong!

BTW, it might be the case that with a double XORROT all bits are fine...

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

ratm...@gmail.com

unread,
Apr 29, 2022, 5:25:01 AM4/29/22
to prng
So, XORROT is not bijective, unintuitive, for me that is...  Devils in every corner!

Regarding 0x3f3530 vs. 0x7f17555; the larger and better is not slower in my case so I will choose that one.

I guess you were describing XORROT over 32 bits, i.e t ^ (t rotl32 16) ^ (t rotl32 17)...
That all bits might be fine, I guess that would be when rotating over 58 bits as in t ^ (t rotl58 16) (t rotl58 17)?
Could XORROT 16, 32 over 58 bits as in t ^ (t rotl58 16) ^ (t rotl58 32) spread the bits better and also be ok or better?
I measured something similar (16, 27) and that passed TestU01 and PractRand 2 TB for the bits I tested before (0..31, 16..47, 26..57)

I am planning to offer two value functions; one XORSHIFT 16 that returns 32 bits, and one 2*XORROT that returns 58 bits.
My benchmark measures that my Erlang implementation updates the state in 3.6 ns from which one can use the 16 lowest bits.  State update + value32 takes 7.0 ns (95% longer).  State update + value58 takes 9.1 ns (150% longer).

I have been dabbling with 2*XORSHIFT as in t ^= t << 16; t ^= t << 32 which I suppose also would be bijective, and a bit faster...

Cheers
/ Raimo

Sebastiano Vigna

unread,
Apr 29, 2022, 7:52:31 AM4/29/22
to pr...@googlegroups.com


> On 29 Apr 2022, at 11:25, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> So, XORROT is not bijective, unintuitive, for me that is... Devils in every corner!

Ha ha don't tell me, I discover new ones by the day!

But in some way it's fun--it's like an Easter hunt for eggs.

> Regarding 0x3f3530 vs. 0x7f17555; the larger and better is not slower in my case so I will choose that one.

Perfect. So let's fix our attention on 59 bits of state.

> I guess you were describing XORROT over 32 bits, i.e t ^ (t rotl32 16) ^ (t rotl32 17)...

Yes.

> That all bits might be fine, I guess that would be when rotating over 58 bits as in t ^ (t rotl58 16) (t rotl58 17)?

There's a missing ^ there--is it intensional?

If that's not too slow, it could improve the 59 (or 58 bits) to the point that they're good.

> Could XORROT 16, 32 over 58 bits as in t ^ (t rotl58 16) ^ (t rotl58 32) spread the bits better and also be ok or better?

Yes, that would be very good. As for the amount of shift, I'd take two primes like 11 and 23. But that's just a heuristics, might be uninfluential. The only way to know is to test.

> I measured something similar (16, 27) and that passed TestU01 and PractRand 2 TB for the bits I tested before (0..31, 16..47, 26..57)
>
> I am planning to offer two value functions; one XORSHIFT 16 that returns 32 bits, and one 2*XORROT that returns 58 bits.

That's a good plan. I'm not 100% sure it's the best to use the same underlying MWC in both cases, but let's strive for simplificity.

> I have been dabbling with 2*XORSHIFT as in t ^= t << 16; t ^= t << 32 which I suppose also would be bijective, and a bit faster...

If that's faster, let's consider it. Yes, each iteration is bijective and so is the composition. I'd just use lower values (e.g., 8 and 16) so you correct more the lower bits.

Ciao,

seba

ratm...@gmail.com

unread,
Apr 29, 2022, 10:30:33 AM4/29/22
to prng


fredag 29 april 2022 kl. 13:52:31 UTC+2 skrev Sebastiano Vigna:


> On 29 Apr 2022, at 11:25, Raimo Niskanen wrote:
>
> So, XORROT is not bijective, unintuitive, for me that is... Devils in every corner!

Ha ha don't tell me, I discover new ones by the day!

But in some way it's fun--it's like an Easter hunt for eggs.

Yes, it is!
 

> Regarding 0x3f3530 vs. 0x7f17555; the larger and better is not slower in my case so I will choose that one.

Perfect. So let's fix our attention on 59 bits of state.

 
> That all bits might be fine, I guess that would be when rotating over 58 bits as in t ^ (t rotl58 16) (t rotl58 17)?

There's a missing ^ there--is it intensional?

Nope. That was a mistake.
 

> Could XORROT 16, 32 over 58 bits as in t ^ (t rotl58 16) ^ (t rotl58 32) spread the bits better and also be ok or better?

Yes, that would be very good. As for the amount of shift, I'd take two primes like 11 and 23. But that's just a heuristics, might be uninfluential. The only way to know is to test.

Ok.   16,32 seems to work fine, as did 16,27.  Does not seem to be very critical in the tests I run...

Prime numbers sounds nice!
 

> I have been dabbling with 2*XORSHIFT as in t ^= t << 16; t ^= t << 32 which I suppose also would be bijective, and a bit faster...

If that's faster, let's consider it. Yes, each iteration is bijective and so is the composition. I'd just use lower values (e.g., 8 and 16) so you correct more the lower bits.

I will measure how much faster it is.  It should be faster since two down shifts with masking are not done.

I am aiming for all 58 bits decent, that is why I suggested <<16 <<<32.  That would expand to t ^ (t<<16) ^ (t<<32) ^ (t<<(16+32)), right?  It is probably a waste to shift up the low 10 bits to the top...
8,16 would expand to 8,16,24 which puts the highest of the low 16 at 39 leaving the top 18 untouched by the low 16.
Maybe 14,28, or 12,24...

Cheers
/ Raimo

ratm...@gmail.com

unread,
May 2, 2022, 4:36:52 AM5/2/22
to prng
I have some, for me, surprising results...

First, using larger xorshifts did not improve the quality of the high bits.  Even 10,20 performs worse than 8,16.  In fact 6,12 worked better than 10,20 and about as good as 8,16.  So I have no better candidate than 8,16, which you suggested from the start...

The base generator, in my implementation, takes 4 ns.  Adding a second call for single xorshift sums up to 6.5 ns.  With double xorshift 8 ns.  Double xorrot 10 ns.
So I'd say that the extra 20..25% execution time for double xorrot is not worth it especially since double xorshift passes the tests although double xorrot maybe feels more solid.

With single xorshift 16 the low 32 bits do not pass PractRand.  So I tried a smaller xorshift.  PractRand seems to prefer 12, while TestU01 prefers 8.  Now trying 10.

The low 16 bits passes PractRand stdin16 without scrambling! :-)  This I can recommend for applications that needs fast numbers on a small range (the reason this kind of generator has been requested).

Cheers!
/ Raimo

Yann Droneaud

unread,
May 2, 2022, 6:07:01 AM5/2/22
to pr...@googlegroups.com, wel...@nullprogram.com
Hi,

Le 02/05/2022 à 10:36, ratm...@gmail.com a écrit :
I have some, for me, surprising results...

First, using larger xorshifts did not improve the quality of the high bits.  Even 10,20 performs worse than 8,16.  In fact 6,12 worked better than 10,20 and about as good as 8,16.  So I have no better candidate than 8,16, which you suggested from the start...


I'm late in the thread, and probably not understanding it, but this echoes with this article

"Prospecting for Hash Functions" https://nullprogram.com/blog/2018/07/31/

where Chris created a tool to brute force a better hash function

https://github.com/skeeto/hash-prospector

The following best parameters were found recently

https://github.com/skeeto/hash-prospector#discovered-hash-functions

May be the same process can be used for you to find the best xorshifts parameters for your needs.


--
You received this message because you are subscribed to the Google Groups "prng" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prng+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prng/00b9aaa9-e3b5-4d0e-807c-f5253ac550c2n%40googlegroups.com.

--

Yann Droneaud

OPTEYA


ratm...@gmail.com

unread,
May 2, 2022, 5:20:48 PM5/2/22
to prng
Hello Yann!

There are some similarities, and an interesting article, indeed.  But I do not think it is useful in this case.

I cannot use multiplication to diffuse upwards because in Erlang a multiplication does not wrap to the word size but instead creates a big integer instead of overflowing, which is the main problem that I use an MWC generator to avoid.

The integer is already "random" but needs just a little more diffusion into the high bits, so there is no well defined optimum to search for, except to try if it passes an 8 hour test in PractRand and TestU01.  How to diffuse depends on how "random" it was and how TestU01 and PractRand tests for "randomness"

Cheers!
/ Raimo

Sebastiano Vigna

unread,
May 2, 2022, 5:21:57 PM5/2/22
to pr...@googlegroups.com


> On 2 May 2022, at 10:36, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> I have some, for me, surprising results...
>
> First, using larger xorshifts did not improve the quality of the high bits. Even 10,20 performs worse than 8,16. In fact 6,12 worked better than 10,20 and about as good as 8,16. So I have no better candidate than 8,16, which you suggested from the start...

That's why I suggested it--there's a lot of overlap and you cover a lot of bits.

> With single xorshift 16 the low 32 bits do not pass PractRand. So I tried a smaller xorshift. PractRand seems to prefer 12, while TestU01 prefers 8. Now trying 10.
>

OK. You mean PractRand at some particular data length? The other multiplier will die at 8TB.

Ciao,

seba

ratm...@gmail.com

unread,
May 3, 2022, 1:52:56 AM5/3/22
to prng
Sry, I ment to pass PractRand 2 TB, which is what I have time to wait for on my laptop...
/ Raimo

Sebastiano Vigna

unread,
May 3, 2022, 1:58:17 AM5/3/22
to pr...@googlegroups.com


> On 3 May 2022, at 07:52, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> Sry, I ment to pass PractRand 2 TB, which is what I have time to wait for on my laptop...
> / Raimo

Well, with the other multiplier they do (see my last message).

Ciao,

seba

ratm...@gmail.com

unread,
May 11, 2022, 11:18:15 AM5/11/22
to prng
For the archives.

The project landed on an MWC generator with a 59-bit state, in Erlang:
    C = T0 bsr 32,
    X = T0 band ((1 bsl 32)-1),
    T1 = 16#7fa6502 * X + C.

A 32-bit scrambler (Xorshift 8):
    V0 = T band ((1 bsl 32)-1),
    V1 = V0 band ((1 bsl (32-8))-1),
    V = V0 bxor (V1 bsl 8).

And a 59-bit scrambler (Xorshift 4, Xorshift 27):
    V0 = T band ((1 bsl (59-4))),
    V1 = T bxor (V0 bsl 4),
    V2 = V1 band ((1 bsl (59-27))),
    V = V1 bxor (V2 bsl 27).

These will be in Erlang/OTP 25.0 soon to be released.
See on https://www.erlang.org/  /news and /blog in a few days...

Cheers,
/ Raimo Niskanen

ratm...@gmail.com

unread,
May 13, 2022, 5:56:14 AM5/13/22
to prng

ratm...@gmail.com

unread,
May 18, 2022, 10:34:22 AM5/18/22
to prng

Sebastiano Vigna

unread,
May 18, 2022, 10:57:52 AM5/18/22
to pr...@googlegroups.com


> On 18 May 2022, at 16:34, ratm...@gmail.com <ratm...@gmail.com> wrote:
>
> Now released: https://www.erlang.org/news/157

🥳

Ciao,

seba

Aleksey Vaneev

unread,
Oct 7, 2022, 6:00:55 AM10/7/22
to prng
Check the `komirand` PRNG here: it's the simplest high-quality PRNG possible I think, and very fast on modern x64 platforms. https://github.com/avaneev/komihash
Reply all
Reply to author
Forward
0 new messages