xoroshiro+ improvement using parity

364 views
Skip to first unread message

TonyB

unread,
Oct 28, 2017, 6:25:50 PM10/28/17
to prng
Hello everyone,

A couple of days ago I found the following post from 2016 by David Roberts
https://forum.powerbasic.com/forum/user-to-user-discussions/source-code/749403-xorshift-prng?p=751037#post751037

In it he says that binary rank tests failures can be avoided by replacing bit 0 of the xoroshiro+ sum with the parity of the sum. During the past day or so, we have been testing this idea on the Parallax P2 forum, the discussion starting at
http://forums.parallax.com/discussion/comment/1423789/#Comment_1423789

The P2 is a new microcontroller with one xoroshiro128+ and eight xoroshiro32+ algorithms implemented in hardware. The full-period xoroshiro32+ triplets were found by brute force and we used these for our tests, however the results should apply to arbitrary bit widths.

For xoroshiro2N+ (where N = 16 for xoroshiro32+), replacing bit 0 with the parity of the N bit sum makes a significant difference, enough to avoid b-rank failures when using PractRand. The results on the whole word are not as good as those without bit 0 but are certainly better than the before.

The very best results are achieved by using all of the sum bits including the carry from bit N-1 for the parity. In other words, replace bit 0 of the sum with the parity of the full N+1 bit sum.

Using this latter "full parity" bit 0 is changed from the least to probably the most random bit. It seems to be as random as the msb for the best triplets and more so for some of the less good triplets, according to our PractRand tests.

The above parity method is easy to implement in hardware and assembler, but perhaps less so in higher-level languages.

Sebastiano Vigna

unread,
Oct 29, 2017, 6:04:48 AM10/29/17
to prng

> On 29 Oct 2017, at 00:25, TonyB <tny...@gmail.com> wrote:
>
> Hello everyone,
>
> In it he says that binary rank tests failures can be avoided by replacing bit 0 of the xoroshiro+ sum with the parity of the sum. During the past day or so, we have been testing this idea on the Parallax P2 forum, the discussion starting at
> http://forums.parallax.com/discussion/comment/1423789/#Comment_1423789

No, that's not going to happen--also bit 1 fails binary rank tests. Probably he didn't test long enough.

> The P2 is a new microcontroller with one xoroshiro128+ and eight xoroshiro32+ algorithms implemented in hardware. The full-period xoroshiro32+ triplets were found by brute force and we used these for our tests, however the results should apply to arbitrary bit widths.

What do you mean by xoroshiro32+? Generators with 16-bit shifts and 32 bits of state? When shift width needs to be explicit I put it in brackets, as in xoroshiro[16]32+.

> The very best results are achieved by using all of the sum bits including the carry from bit N-1 for the parity. In other words, replace bit 0 of the sum with the parity of the full N+1 bit sum.
>
> The above parity method is easy to implement in hardware and assembler, but perhaps less so in higher-level languages.

Yes, I guess so. But frankly binary-rank failures are interesting from a technical viewpoint, but quite irrelevant in practice (although it depends--if your application consists in filling matrices on Z/2Z with the bits output by the generator and computing their rank than you might have a problem). People has been using binary-rank biased bits for decades in scientific computation, and both the gcc standard C library and python will give you bits failing binary-rank tests.

From a practical viewpoint seems to be a resonable solution if you can compute parity quickly. It kills almost all theoretical properties of the PRNG though (no more equidistribution, guaranteed period in every bit, etc.), if, as I understand, you are *replacing* the lower bit with the parity. If you *xor* the lower bit with the parity of the upper 63 bits, instead, you'll keep everything fine, as it is a bijective transformation of 64-bit words (in fact, an involution, i.e., a self-inverting transform).

Also, the author of the post seems concerned about creating floating-point values in "double precision". Nowhere in the discussion is specified what that means, but if he means "a double", you need just 53 bits, so just pick the highest bits and you're done--you won't even see the two lowest bits.

Ciao,

seba

TonyB

unread,
Oct 29, 2017, 2:50:47 PM10/29/17
to prng
Hello Sebastiano,

Firstly thank you for developing the xoroshiro+ algorithm.

What we call xoroshiro32+ works in the same way as your xoroshiro128+, except s0 and s1 are both 16-bit, not 64-bit. The complete state in one 32-bit register is more important than a longer period.

We tested only xoroshiro32+ but to avoid confusion I will refer to xoroshiro128+ from now on. As you correctly predict, replacing bit 0 with the parity of the upper 63 bits gives extremely bad results, far worse than the original algorithm.

Xor-ing bit 0 with the upper 63 bits is exactly the same as changing bit 0 to the parity of the 64 bits. This "64-bit parity" method produces better scores in PractRand than the original.

Including the carry output from the addition in the parity, a "65-bit parity" method, produces very much better scores than the original. In our tests bit 0 on its own scored as well or better than the msb.

Improving bit 0 by using the parity probably works with other algorithms, as there is nothing really specific to xoroshiro+. It is simple and "free" so why not try it?

Sebastiano Vigna

unread,
Oct 29, 2017, 2:56:48 PM10/29/17
to prng

> On 29 Oct 2017, at 19:50, TonyB <tny...@gmail.com> wrote:
>
> Hello Sebastiano,
>
> Xor-ing bit 0 with the upper 63 bits is exactly the same as changing bit 0 to the parity of the 64 bits. This "64-bit parity" method produces better scores in PractRand than the original.

Oh sorry, I was writing in a hurry and didn't realize that. Remember that PractRand has few, very specific tests. You must run it through BigCrush at least.

> Including the carry output from the addition in the parity, a "65-bit parity" method, produces very much better scores than the original. In our tests bit 0 on its own scored as well or better than the msb.

I see. But at that point I think you're losing equidistribution again.

> Improving bit 0 by using the parity probably works with other algorithms, as there is nothing really specific to xoroshiro+. It is simple and "free" so why not try it?

Well, it's slower. Also, if you really care about the linear artifacts you have to fix the lowest two bits.

Ciao,

seba

tommy.e...@gmail.com

unread,
Nov 4, 2017, 2:28:00 AM11/4/17
to prng
The need to keep state in one 32-bit value got me thinking, and I tested SplitMix32, which is the 32-bit variant on SplitMix64, with gjrand's testunif tools. It did not do well, at all... It had 4 failures, two minor at what gjrand calls rank 1, but one at rank 20, and another at rank 21; generally any failure at rank 3 or worse means the generator is considered unusable by gjrand. None of these failures were on binary rank; these were all tests that should be relevant in general. So I got to thinking about ways I could do better, and I came up with this little PRNG: https://gist.github.com/tommyettinger/46a874533244883189143505d203312c . I do not believe Mulberry32 is especially fast, because to ensure quality, it needs to do several more operations (including two multiplications, neither by constants) than what 64-bit-math-based PRNGs can get by with. However, it does have fairly good quality, even when tested across the full 2-to-the-32 values it can generate as its period. It hasn't yet failed on one of gjrand's binary rank tests (it minorly fails diff12 and rda when tested on 16GB of results, and fails nothing on 4GB). I can't ensure it is equidistributed, but if I'm not mistaken, a xoroshiro[16]32+ generator adds the two 16-bit halves of state to get its result, which can result in at most a 17-bit integer if the same algorithm is used as for the xoroshiro128+ generator. If you split up the results from Mulberry32 into 16-bit halves (assuming that's all you need at a time), its speed may improve enough to be useful. If you're willing to post the C++ code you're using to test your variant of xoroshiro with PractRand, I'd be happy to try it out over here. I still need to test Mulberry32 against PractRand, since sometimes gjrand fails to detect an issue that PractRand does, and sometimes the opposite happens. Like Sebastiano is saying, and I think like George Marsaglia said when he devised the binary rank test, I don't think the binary rank tests are at all critical to pass, unless you know your specific use case has issues related to that exact property.

blarg blarg

unread,
Nov 12, 2017, 1:57:42 AM11/12/17
to prng
On Saturday, 4 November 2017 19:28:00 UTC+13, tommy.e...@gmail.com wrote:
If you're willing to post the C++ code you're using to test your variant of xoroshiro with PractRand, I'd be happy to try it out over here.

Sebastiano Vigna

unread,
Nov 12, 2017, 3:47:00 AM11/12/17
to blarg blarg, prng
Frankly, why are you people doing all this guesswork? Just ask! :)

The lowest *two* bits of xoroshiro128+ will fail binary rank tests. Fixing the lower bit only will work only as long as you don't test much with PractRand. If you go long enough, the second bit it will fail. It'll take a while because PractRand tests more heavily lowest bits in bytes, words, etc., but if, say, you rotate the output of your parity-fixed xoroshiro by one on the right it will fail very quickly.

Linear dependencies are kind of a poison: even if one single bit has linear dependencies, in time all the output will show linear dependencies. If you take just bits without linear dependencies, you'll never fail binary-rank tests.

For example, taking just the upper 32 bits of the sum will work, as there are no detectable linear dependencies there. The carry reduces the linear dependencies of the second lowest bit, and makes them practically undetectable from the third bit on (you could catch the third lowest bit with a linear compression test though).

If you want for some reason to delete absolutely linear dependencies, there are a number of simple nonlinear backends different from just + that we can suggest.

BTW, if you're interested in generators with 16-bit shifts I can provide you a 128-bit xoroshiro map using 16-bits shifts.

Ciao,

seba

Message has been deleted

blarg blarg

unread,
Nov 12, 2017, 5:34:06 AM11/12/17
to prng
Why thank you Seba, the offer is appreciated.  Yes, we've been slowly learning much of what you've listed there - even if we don't understand the why's.

We have a couple of requirements that limit us:
 #1 - We are trying to fit the PRNG in an instruction that uses a single general processor register for it's state store.  These registers are 32-bit in size.
 #2 - Is why so much guesswork.  We're also attempting to achieve a nice round 16 bits of output. :)  For now, with bit 0 sorted, we're content to use bit 1 as is.

Here's what we could additionally use from you:
Latest thought is it's possible to step up to 64 bits for the state store using a second instruction to stash the second 32 bits of the state in an orderly manner.  So, it would be handy to know some good triplets for a Xoroshiro64+.

On that note, is it possible to easily find all the Xoroshiro64 triplets that produce full-periods?  We quickly brute-forced this for Xoroshiro32 but a 64-bit state is a much bigger deal.  Excuse the pun.


BTW:  Xoroshiro is really good for doing in hardware.  The logic size is compact and it fits in a single clock cycle real easy.  We were provided with a somewhat better 32-bit state PRNG that had two consecutive 32-bit adders, by the author of PractRand no less, but that was too much to fit in one clock cycle.

PS:  Oh, man, I'm not liking this Google groups comment entry at all.  Javascript in webpages sucks!  It's trying to be spiffy but ends up glitching all over the place.  I can't see what I'm typing most of the time!  I've resorted to using an ordinary text editor and copy'n'pasted.  On the other hand, I've not proof read my typing this much on any forums before.


Sebastiano Vigna

unread,
Nov 12, 2017, 5:58:06 AM11/12/17
to blarg blarg, prng


> On 12 Nov 2017, at 11:34, blarg blarg <blarg...@fastmail.fm> wrote:
>
> We have a couple of requirements that limit us:
> #1 - We are trying to fit the PRNG in an instruction that uses a single general processor register for it's state store. These registers are 32-bit in size.
> #2 - Is why so much guesswork. We're also attempting to achieve a nice round 16 bits of output. :) For now, with bit 0 sorted, we're content to use bit 1 as is.

OK, so a 64-bit xoroshiro with 32-bit shifts would be fine?

>
> Here's what we could additionally use from you:
> Latest thought is it's possible to step up to 64 bits for the state store using a second instruction to stash the second 32 bits of the state in an orderly manner. So, it would be handy to know some good triplets for a Xoroshiro64+.

I'm attaching the list of full-period xoroshiro[32]64 generators. First column is the three parameters (rotation, shift, rotation--see the comments in the code on my site). Second column is the degree of the associated LFSR. Then the LFSR (no need to look at that).

So, state is 64 bits, in two registers. Shifts and rotations are 32 bits. The algorithm is *exactly* like the xoroshiro128+ algorithm you find on my website, but with 32-bit registers and shifts.

I'd suggest to do a quick round of testing of the triplets with degree 31 or 33, and a relatively small middle value. These are just heuristics--at this size, the more to you do test to do the selection, the better. Try to stay close to 32 anyway (the LFSR are better).

If you need 16 bits of outputs, you can add the two 32-bit status words and take the upper 16 bits. Or we can discuss more complex backends, but you seem to have very hard constraints.

> BTW: Xoroshiro is really good for doing in hardware. The logic size is compact and it fits in a single clock cycle real easy. We were provided with a somewhat better 32-bit state PRNG that had two consecutive 32-bit adders, by the author of PractRand no less, but that was too much to fit in one clock cycle.

Full-period? Watch out ;).

Ciao,

seba
prim.txt

blarg blarg

unread,
Nov 12, 2017, 7:51:49 AM11/12/17
to prng


On Sunday, 12 November 2017 23:58:06 UTC+13, Sebastiano Vigna wrote:

I'm attaching the list of full-period xoroshiro[32]64 generators. First column is the three parameters (rotation, shift, rotation--see the comments in the code on my site). Second column is the degree of the associated LFSR. Then the LFSR (no need to look at that).

So, state is 64 bits, in two registers. Shifts and rotations are 32 bits. The algorithm is *exactly* like the xoroshiro128+ algorithm you find on my website, but with 32-bit registers and shifts.

 Perfect!  It blows my mind that the whole set can be magic'd up that quickly.  Feeling all bouncy ... No early night now.

Full-period? Watch out ;).

Lol, I'd got quite used to expecting some full-period scores in PractRand.  Won't be an more of those, for sure!

Chris Rutz

unread,
May 31, 2018, 10:19:52 PM5/31/18
to prng
TonyB,

I've recently been studying one of your versions of xoroshiro[16]32+. I've modified it slightly to pass Small Crush both Fwd/Rev and fail Practrand at 512MB both Fwd/Rev (using all 32 state bits... though Small Crush only tests the 30 MSBs, not 31 as I had originally thought).

In any case, I read your comment in the Parallax forum:
"The hardware xoroshiro generator will create the next value in its sequence every system clock and the tests you have done show that such sequences are close to random but it is unlikely and perhaps impossible for a cog to read every consecutive value. A cog that needs a regular flow of random numbers without using its own code will be sampling the xoroshiro output at interval n, where n > 1 probably. How do the tests perform for n > 1, for example 8, 16 or 32? Or 7, 15 or 31? Will some of the 63 bits be less random as a result and if so which ones?"

Is that the final implementation, free-running with a high probability of sampling (n > 1)?

If so, that is roughly equivalent to periodically reseeding the generator when short-periods are involved, which is proven to have consequences which negate the quality of the base xoroshiro (or any other short-period) PRNG algorithm.

Chris

Sebastiano Vigna

unread,
May 31, 2018, 10:51:00 PM5/31/18
to pr...@googlegroups.com


On June 1, 2018 4:19:52 AM GMT+02:00, Chris Rutz <rut...@gmail.com> wrote:
>TonyB,
>negate the quality of the base xoroshiro (or any other short-period)

Define 'short'.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Chris Rutz

unread,
Jun 1, 2018, 12:06:22 AM6/1/18
to prng
On Thursday, May 31, 2018 at 10:51:00 PM UTC-4, Sebastiano Vigna wrote:

Define 'short'.

In this specific usage-case, 'short' might equal less than 60 seconds for full period (if I've calculated properly). This might not be much of a problem for small values of sampling interval that are close to 1, but randomness measurably deteriorates as n increases, especially when the sampling interval might be a near constant (as is likely in any clocked design).

TonyB

unread,
Jun 2, 2018, 9:12:47 PM6/2/18
to prng
Chris,

That comment on the Parallax forum was the first one I made in April 2017 and a lot has happened since then. The most important thing was that my post here about parity led to Sebastiano contacting me with details of a new scrambler (++), which we (the Parallax P2 PRNG development team) kept secret until the recent paper was published.

The parity trick was very short-lived, it was easy to do in hardware but less so in software and it improved xoroshiro+ so that it didn't fail PractRand almost immediately. We dropped parity as soon as we knew about and tested xoroshiro[16]32++.

Back in April 2017, the free-running generator in the P2 was xoroshiro128+ using the original constants but in April 2018 we switched to xoroshiro128** as described in the paper. A xoroshiro128++ would have used more logic than xoroshiro128** and was not considered worthwhile.

Each processor core or "cog" in the P2 has a xoroshiro[16]32++ and there will be eight cogs in the first version with sample chips due in late September. There is a special cog instruction XORO32 that does a double-iteration of xoroshiro[16]32++ to produce a 32-bit output using any 32-bit register as the state.

There is only one free-running generator and that is xoroshiro128**, not xoroshiro[16]32+. Each cog receives a different scrambled 32-bit subset of the 64-bit xoroshiro12** output. Most instructions takes two clock cycles (the minimum possible) therefore if the sampling interval is n then n >= 2. It's hard to say how random the 32-bit subsets will be and some users might want just a single random bit, not 32.

TonyB

unread,
Jun 2, 2018, 9:18:18 PM6/2/18
to prng
I meant to write "64-bit xoroshiro128** output."

TonyB

unread,
Jun 2, 2018, 9:38:30 PM6/2/18
to prng
One other correction: we made the ++ scrambler public on the Parallax forum in March, with David's & Sebastiano's permission and before their paper was published, as they had decided that ** would be their preferred scrambler.

Chris Rutz

unread,
Jun 2, 2018, 10:40:49 PM6/2/18
to prng
On Saturday, June 2, 2018 at 9:38:30 PM UTC-4, TonyB wrote:
One other correction: we made the ++ scrambler public on the Parallax forum in March, with David's & Sebastiano's permission and before their paper was published, as they had decided that ** would be their preferred scrambler.
TonyB,

Thanks for the detailed replies. My original thought (pertaining to 32-bit state xoroshiro) was to have a way to 'stop/start/set interval' the free-running xoroshiro, and if it is stopped, then an option (in case multiple cogs need current value) for the state to advance by one automatically immediately after the current output value is read.

I think I have solved the original problem, but differently than your parity method, in a way that handles all LFSR LSBs simultaneously, and uses perhaps 3 additional ASM instructions on top of a basic + generator. Not the same as the new ++ method (which is likely slightly faster), so I'll have to see which is actually better. If the results are promising, I will have to test on larger than 32-bit state. I am hoping to pass Big Crush in well under 40-bits, but need to brute force shift constants for 36/38/40 bit xoroshiro.

Chris

TonyB

unread,
Jun 3, 2018, 10:23:34 AM6/3/18
to prng
Please see
http://forums.parallax.com/discussion/comment/1439843/#Comment_1439843
for a zip file with full-period constants for xoroshiro34/36/38/40.

Numbers in the filenames are output widths, always half the state for our tests. xoroshiro[20]40 is the largest generator we tried and I'm not sure that only 66 full-period candidates is correct.

TonyB

unread,
Jun 3, 2018, 10:26:23 AM6/3/18
to prng
xoroshiro[20]40 is the largest generator we tried by brute-forcing the constants.

Chris Rutz

unread,
Jun 3, 2018, 10:31:38 PM6/3/18
to prng
Thanks for digging those shift constants up!

I already finished comparing ++ with my method and ended up merging both methods together and found that:
1. The hybrid method uses 3 ASM instructions more than ++ (added 4, removed 1, tmp and prn cannot be the same register).
2. I had to add S1 at the end instead of S0 for best statistical performance, but doing so is of questionable value for ++.
2. Neither Practrand nor Small Crush are of much help distinguishing between the two methods, so I used gjrand as a tie-breaker (but seldom use it otherwise).
3. My method easily passes gjrand small, and ++ fails at ~1E-10 (fwd and rev average).
4. My method fails gjrand standard at ~1E-10, and ++ fails at ~1E-40.
5. I can see why d=6 was temping for ++, but rejected, as it nearly allows a fwd pass on gjrand small, but falls apart on gjrand standard at ~1E-70,

I wonder what the best 32-bit state score on record for gjrand standard is...

Chris

Chris Rutz

unread,
Jun 4, 2018, 7:16:02 AM6/4/18
to prng
On a whim, I ran Fwd/Rev Crush on both ++ and my hybrid:
++ Fwd:
  1  SerialOver, t = 2               0.9992
  2  SerialOver, t = 4              9.8e-04
 41  MaxOft, t = 5                  1 -  4.7e-07
 42  MaxOft, t = 10                 1 -  5.4e-06
 43  MaxOft, t = 20                 1 -  6.4e-07
 70  RandomWalk1 R (L = 10000)       0.9993
++ Rev:
  1  SerialOver, t = 2              1 -  2.0e-10
  9  CollisionOver, t = 20          3.1e-04
 42  MaxOft, t = 10                 1 -  8.0e-06
Hybrid Fwd:
  1  SerialOver, t = 2              1 -  8.2e-09
 41  MaxOft, t = 5                   0.9999
Hybrid Rev:
  1  SerialOver, t = 2              1 -  8.3e-14
 42  MaxOft, t = 10                  0.9994

Chris Rutz

unread,
Jun 24, 2018, 5:45:21 PM6/24/18
to prng
I've noticed odd things afoot, but in any case, I'm moving forward on this (and appreciate all the help so far):
1. Demonstrated that 1-dimensional equidistribution is maintained in my variant (with the expected missing zero).
2. Horribly under-estimated additional ASM instructions... 6 additional, not 3. Sorry about that. I've been programming in ASM for almost 40 years, but only C++ recently and just discovered Compiler Explorer (https://gcc.godbolt.org).
3. Demonstrated a deficiency in GCC emitting an odd instruction sequence which seems to cause a pipeline stall on my Xeon processor. This does not happen with MSVC, so I used primitives to nudge the compilers so the emitted code works equally well (though less readable) on both GCC and MSVC.
4. Randomness at least equal to, and likely superior to, PCG 'setseq_xsh_rr_32_16'... without any full-state operations or multiplication.
5. Demonstrated the importance of running both Practrand and gjrand on both forward and reverse streams, though less important than doing so with TestU01.
6. Excellent escape-from and minimize-falling-to (near) zero bits behavior.

My naive implementation of an xoroshiro128++ variant has these additional characteristics:
1. ~60% of the speed of xoroshiro128+ (not ++), using Vigna's benchmark.
2. Same speed as xororoshiro128+, using Lemire's benchmark. This is implausible, but does demonstrate the insensitivity to PRNG speed of many end-user (perhaps flawed) implementations.
3. Passing Vigna's new Hamming Weight test so far at 1.5E+14, but showing definite signs of weakness... wishing for a MT version to expedite the testing process (though I could probably load all threads with other naive candidate implementations to move this along).

Currently recommended testing methodologies (using at least a 16-core/32-thread processor):
1. TestU01 BigCrush, forward + reverse on low-32, high-32, and all bits (if > 32); 100+ runs each, preferably with full statistical meta-analysis using p-val corrections I provided elsewhere here.
2. Practrand forward (+ reverse, though seldom critical), preferably following-up using parameters 'stdin -tf 2 -te 1 -tlmin 1KB -multithreaded'.
3. Gjrand forward (+ reverse, because sometimes critical), at each step size.
4. New Hamming Weight test to 1E+15, as Vigna recommended.
5. Optional: RaBiGeTe, but only if you want to explore (usually small-scale) structure in depth.
6. Hopefully a new purported statistical analyzer will obsolete 1-5.

I will eventually release code if/when all my issues get sorted

Chris Rutz

unread,
Aug 14, 2018, 9:47:30 PM8/14/18
to prng
Mostly done... faster, smaller and better than stated above:
Reply all
Reply to author
Forward
0 new messages