A couple of days ago I found the following post from 2016 by David Roberts
https://forum.powerbasic.com/forum/user-to-user-discussions/source-code/749403-xorshift-prng?p=751037#post751037
In it he says that binary rank test failures can be avoided by replacing bit 0 of the xoroshiro+ sum with the parity of the sum. Over the past day or so we have been testing this idea on the Parallax P2 forum; the discussion starts at
http://forums.parallax.com/discussion/comment/1423789/#Comment_1423789
The P2 is a new microcontroller with one xoroshiro128+ and eight xoroshiro32+ algorithms implemented in hardware. The full-period xoroshiro32+ triplets were found by brute force and we used these for our tests; however, the results should apply to arbitrary bit widths.
For xoroshiro2N+ (where N = 16 for xoroshiro32+), replacing bit 0 with the parity of the N-bit sum makes a significant difference, enough to avoid b-rank failures in PractRand. The results on the whole word are not as good as those without bit 0, but they are certainly better than before.
The very best results are achieved by using all of the sum bits including the carry from bit N-1 for the parity. In other words, replace bit 0 of the sum with the parity of the full N+1 bit sum.
With this latter "full parity", bit 0 changes from the least random bit to probably the most random one. It seems to be as random as the msb for the best triplets, and more so for some of the less good triplets, according to our PractRand tests.
The above parity method is easy to implement in hardware and assembler, but perhaps less so in higher-level languages.
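For anyone who wants to try the full-parity output in C, here is a minimal sketch for the N = 16 case. The function names are my own, only the output stage is shown (the state update is unchanged), and the parity is computed by XOR-folding, which is just one of several ways to do it.

```c
#include <stdint.h>

/* XOR-fold a word down to one bit: 1 if an odd number of bits are set. */
static unsigned parity32(uint32_t x)
{
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return (unsigned)(x & 1u);
}

/* xoroshiro32+ output with bit 0 replaced by the parity of the full
   17-bit sum (16 sum bits plus the carry out of bit 15).
   Hypothetical helper name, not taken from any existing code. */
static uint16_t full_parity_output16(uint16_t s0, uint16_t s1)
{
    uint32_t sum17 = (uint32_t)s0 + (uint32_t)s1;  /* bit 16 holds the carry */
    uint16_t sum   = (uint16_t)sum17;              /* the ordinary 16-bit sum */
    return (uint16_t)((sum & 0xFFFEu) | parity32(sum17));
}
```

Doing the addition in a wider type keeps the carry as an ordinary bit, so it falls into the parity with no extra work.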
Firstly, thank you for developing the xoroshiro+ algorithm.
What we call xoroshiro32+ works in the same way as your xoroshiro128+, except that s0 and s1 are both 16-bit, not 64-bit. Having the complete state fit in one 32-bit register is more important to us than a longer period.
We tested only xoroshiro32+ but to avoid confusion I will refer to xoroshiro128+ from now on. As you correctly predict, replacing bit 0 with the parity of the upper 63 bits gives extremely bad results, far worse than the original algorithm.
Xor-ing bit 0 with the upper 63 bits is exactly the same as changing bit 0 to the parity of the 64 bits. This "64-bit parity" method produces better scores in PractRand than the original.
Including the carry output from the addition in the parity, a "65-bit parity" method, produces very much better scores than the original. In our tests bit 0 on its own scored as well as or better than the msb.
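To make the difference between the two variants concrete, here is a rough C sketch of both output stages for 64-bit s0 and s1. The function names are mine, and the carry is recovered with the usual unsigned-overflow check because C has no 65-bit type; treat it as an illustration, not as the code we actually tested.

```c
#include <stdint.h>

/* XOR-fold a 64-bit word down to one bit: 1 if an odd number of bits are set. */
static unsigned parity64(uint64_t x)
{
    x ^= x >> 32;
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return (unsigned)(x & 1u);
}

/* "64-bit parity": bit 0 becomes the parity of all 64 sum bits. */
static uint64_t output_parity64(uint64_t s0, uint64_t s1)
{
    uint64_t sum = s0 + s1;
    return (sum & ~UINT64_C(1)) | parity64(sum);
}

/* "65-bit parity": as above, but the carry out of bit 63 is folded in too. */
static uint64_t output_parity65(uint64_t s0, uint64_t s1)
{
    uint64_t sum   = s0 + s1;
    unsigned carry = sum < s0;   /* unsigned wrap-around means a carry occurred */
    return (sum & ~UINT64_C(1)) | (parity64(sum) ^ carry);
}
```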
Improving bit 0 by using the parity probably works with other algorithms, as there is nothing really specific to xoroshiro+. It is simple and "free" so why not try it?
If you're willing to post the C++ code you're using to test your variant of xoroshiro with PractRand, I'd be happy to try it out over here.
I'm attaching the list of full-period xoroshiro[32]64 generators. First column is the three parameters (rotation, shift, rotation--see the comments in the code on my site). Second column is the degree of the associated LFSR. Then the LFSR (no need to look at that).
So, state is 64 bits, in two registers. Shifts and rotations are 32 bits. The algorithm is *exactly* like the xoroshiro128+ algorithm you find on my website, but with 32-bit registers and shifts.
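As a sketch only, this is what that looks like in C with the three parameters passed in as arguments. The struct and function names are made up for illustration, and a, b, c stand for one row of the list (rotation, shift, rotation); no particular triplet is assumed here.

```c
#include <stdint.h>

/* Two 32-bit registers holding the 64-bit state (hypothetical type name). */
typedef struct { uint32_t s0, s1; } xoro64_state;

/* Rotate left; the masks keep k = 0 well defined. */
static uint32_t rotl32(uint32_t x, unsigned k)
{
    k &= 31u;
    return (x << k) | (x >> ((32u - k) & 31u));
}

/* One step of the + generator with 32-bit registers.
   a and c are rotations, b is a shift; b must be below 32. */
static uint32_t xoro64_plus_next(xoro64_state *st, unsigned a, unsigned b, unsigned c)
{
    uint32_t s0 = st->s0;
    uint32_t s1 = st->s1;
    uint32_t result = s0 + s1;            /* the "+" output */

    s1 ^= s0;
    st->s0 = rotl32(s0, a) ^ s1 ^ (s1 << b);
    st->s1 = rotl32(s1, c);
    return result;
}
```

The state must not be all zeros, as with the 128-bit version.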
Full-period? Watch out ;).
Lol, I'd got quite used to expecting some full-period scores in PractRand. Won't be any more of those, for sure!
Define 'short'.
That comment on the Parallax forum was the first one I made in April 2017 and a lot has happened since then. The most important thing was that my post here about parity led to Sebastiano contacting me with details of a new scrambler (++), which we (the Parallax P2 PRNG development team) kept secret until the recent paper was published.
The parity trick was very short-lived. It was easy to do in hardware but less so in software, and it improved xoroshiro+ enough that it no longer failed PractRand almost immediately. We dropped parity as soon as we knew about and had tested xoroshiro[16]32++.
Back in April 2017, the free-running generator in the P2 was xoroshiro128+ using the original constants, but in April 2018 we switched to xoroshiro128** as described in the paper. A xoroshiro128++ would have used more logic than xoroshiro128** and was not considered worthwhile.
Each processor core or "cog" in the P2 has a xoroshiro[16]32++ and there will be eight cogs in the first version with sample chips due in late September. There is a special cog instruction XORO32 that does a double-iteration of xoroshiro[16]32++ to produce a 32-bit output using any 32-bit register as the state.
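Roughly, a double iteration on a packed 32-bit state could be modelled in C as below. This is a sketch under several assumptions that are not stated in this thread: the ++ output is taken as rotl(s0 + s1, d) + s0 as in the published paper, s0 is assumed to sit in the low half of the register, the first iteration's output is assumed to go in the low half of the result, and the constants a, b, c, d are passed in, not fixed. The real XORO32 may differ in any of these details.

```c
#include <stdint.h>

/* 16-bit rotate left; the masks keep k = 0 well defined. */
static uint16_t rotl16(uint16_t x, unsigned k)
{
    k &= 15u;
    return (uint16_t)(((uint32_t)x << k) | ((uint32_t)x >> ((16u - k) & 15u)));
}

/* One xoroshiro[16]32++ step; a, b, c are the state constants and d is
   the rotation used by the ++ scrambler. b must be below 16. */
static uint16_t xoro32pp_step(uint16_t *s0, uint16_t *s1,
                              unsigned a, unsigned b, unsigned c, unsigned d)
{
    uint16_t x = *s0;
    uint16_t y = *s1;
    uint16_t result = (uint16_t)(rotl16((uint16_t)(x + y), d) + x);

    y ^= x;
    *s0 = (uint16_t)(rotl16(x, a) ^ y ^ (uint16_t)((uint32_t)y << b));
    *s1 = rotl16(y, c);
    return result;
}

/* Two steps on a packed 32-bit state, giving one 32-bit output.
   The packing order is an assumption made for this sketch. */
static uint32_t xoro32_double(uint32_t *state,
                              unsigned a, unsigned b, unsigned c, unsigned d)
{
    uint16_t s0 = (uint16_t)(*state & 0xFFFFu);   /* assumed: s0 in low half  */
    uint16_t s1 = (uint16_t)(*state >> 16);       /* assumed: s1 in high half */
    uint32_t lo = xoro32pp_step(&s0, &s1, a, b, c, d);
    uint32_t hi = xoro32pp_step(&s0, &s1, a, b, c, d);

    *state = ((uint32_t)s1 << 16) | s0;
    return (hi << 16) | lo;
}
```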
There is only one free-running generator and that is xoroshiro128**, not xoroshiro[16]32+. Each cog receives a different scrambled 32-bit subset of the 64-bit xoroshiro128** output. Most instructions take two clock cycles (the minimum possible); therefore, if the sampling interval is n, then n >= 2. It's hard to say how random the 32-bit subsets will be, and some users might want just a single random bit, not 32.
One other correction: we made the ++ scrambler public on the Parallax forum in March, with David's & Sebastiano's permission and before their paper was published, as they had decided that ** would be their preferred scrambler.
Numbers in the filenames are output widths, always half the state for our tests. xoroshiro[20]40 is the largest generator we tried and I'm not sure that only 66 full-period candidates is correct.