Fast and small Multiplicative Random number generator

hopcode

unread,

Dec 19, 2011, 8:08:24 AM12/19/11

to

Hi everybody,

some days ago i started thinking upon a
fast and simple random number generator of few
lines of codes, wondering about the use of
MUL and RDTSC. after some "deviations" from the theory :-)
i stumbled upon this following idea of mine:

.randomB:
rdtsc
mov rcx,866AA7'3F0B'7F'0B'03h
movd mm0,eax
movq mm1,rcx
pshufw mm0,mm0,00'10'01'00b
pmaddwd mm1,mm0
pshufb mm1,mm0
pmullw mm1,mm0
movq rax,mm0
movq rcx,mm1
mul rdx
sub rax,rcx
ret

then using ENT to test some of its features
here what i have found:

ent result.bin
Entropy = 7.841899 bits per byte.

Optimum compression would reduce the size
of this 130800 byte file by 1 percent.

Chi square distribution for 130800 samples is 29972.95, and randomly
would exceed this value less than 0.01 percent of the times.

Arithmetic mean value of data bytes is 129.7706 (127.5 = random).
Monte Carlo value for Pi is 3.139266055 (error 0.07 percent).
Serial correlation coefficient is -0.001012
(totally uncorrelated =0.0).

i dont know how to read it to evaluate my algo.
could you please help me ?

ent link is at
http://www.fourmilab.ch/random/

and the following 2 screenshots using gnuplot

total distribution of 65536 items ANDED to 65535
http://sites.google.com/site/x64lab/randomB_ganz.gif?attredirects=0

and a detail of it
http://sites.google.com/site/x64lab/randomB_detail.gif?attredirects=0

i have drawn some blu lines to denote visible
constant "holes" inside it.

Hints, general guidelines about the subject,
improvements, your opinion on the results well appreciated

Thanx in advance,
Cheers,

--
.:mrk[hopcode]
.:x64lab:.
group http://groups.google.com/group/x64lab
site http://sites.google.com/site/x64lab

Markus Wichmann

unread,

Dec 19, 2011, 3:10:58 PM12/19/11

to

This means a pretty good score (8 would be truly random, but that is a
theoretical goal).

> Optimum compression would reduce the size
> of this 130800 byte file by 1 percent.
>

Also a good sign and related to the above: The file is near
uncompressible. If it was, it wouldn't have much entropy, thus your
algorithm would be bad.

> Chi square distribution for 130800 samples is 29972.95, and randomly
> would exceed this value less than 0.01 percent of the times.
>

I have no idea. Chi square is a stochastic distribution, but I don't
know what they are doing with it in this context.

> Arithmetic mean value of data bytes is 129.7706 (127.5 = random).

Well, should be self-explaining.

> Monte Carlo value for Pi is 3.139266055 (error 0.07 percent).

The Monte Carlo algorithm for calculating Pi is:

1. Initialize counters n, m with 0
2. Randomly draw a point in the unit square (take two consecutive bytes
as coordinates and divide them both by 255)
3. Increase n
4. If the point is within the unit circle, increase m
5. If we have not yet run out of data, go to 2.
6. Pi should be m/n * 4

Using you file as a random number source, this algorithm yields an error
as little as 0.07 per cent. Pretty good. (With a true random source,
this would _probably_ generate a perfect Pi)

> Serial correlation coefficient is -0.001012
> (totally uncorrelated =0.0).
>

Correlation is another stochastic topic. Yet this comment is pretty self
explaining.

> i dont know how to read it to evaluate my algo.
> could you please help me ?
>

Hope that helped.

> ent link is at
> http://www.fourmilab.ch/random/
>
> and the following 2 screenshots using gnuplot
>
> total distribution of 65536 items ANDED to 65535
> http://sites.google.com/site/x64lab/randomB_ganz.gif?attredirects=0
>
> and a detail of it
> http://sites.google.com/site/x64lab/randomB_detail.gif?attredirects=0
>
> i have drawn some blu lines to denote visible
> constant "holes" inside it.
>
> Hints, general guidelines about the subject,
> improvements, your opinion on the results well appreciated
>

One nice test: Have gnuplot or something similar plot points with the
coordinates being consecutive values from your generator. You shouldn't
see any lines now! (Linear congruency generators _will_ fail this test.)

If you don't know how, use this perl script (untested, but should work):

#!/usr/bin/perl -w

use strict;

my ($x, $y);

binmode(STDIN);
die "read error" if (read(STDIN, $x, 1));
while (read(STDIN, $y, 1)) {
print $x . " " . $y . "\n";
$x = $y;
}

Usage: perl tmp.pl <result.bin >tmp.gnuplot
gnuplot
plot "tmp.gnuplot"

> Thanx in advance,
> Cheers,
>

HTH,
Markus

hopcode

unread,

Dec 20, 2011, 1:33:10 AM12/20/11

to

Hi Markus,
Il 19.12.2011 21:10, Markus Wichmann ha scritto:
> Hope that helped.

thank You, that helped a lot to find a type-error by packing data
using PHP and the complex and weird function pack();

then after running it against the DIEHARD test, i have found
encouraging and good results in most tests, ~90%.
here, a fragment of an apparently negative result:

Chi-square with 5^5-5^4=2500 d.of f. for sample size: 256000
chisquare equiv normal p value
Results for COUNT-THE-1's in specified bytes:
bits 1 to 8 2712.06 2.999 .998645
bits 2 to 9 2525.89 .366 .642844
bits 3 to 10 2701.12 2.844 .997774
bits 4 to 11 4739.64 31.673 1.000000
bits 5 to 12 18639.68 228.250 1.000000
bits 6 to 13 87082.48 1196.177 1.000000
bits 7 to 14506249.30 7124.091 1.000000
bits 8 to 15********* 42758.170 1.000000
bits 9 to 16********* 213986.600 1.000000
bits 10 to 17********* 137730.000 1.000000
bits 11 to 18********* 170499.900 1.000000
bits 12 to 19********* 80776.410 1.000000
bits 13 to 20********* 16785.790 1.000000
bits 14 to 21224561.30 3140.421 1.000000
bits 15 to 22 51165.06 688.228 1.000000
bits 16 to 23 11874.50 132.575 1.000000
bits 17 to 24 2598.76 1.397 .918749
bits 18 to 25 2745.13 3.467 .999736
bits 19 to 26 2876.84 5.329 1.000000
bits 20 to 27 4787.94 32.356 1.000000
bits 21 to 28 19647.96 242.509 1.000000
bits 22 to 29 92230.13 1268.976 1.000000
bits 23 to 30540332.70 7606.103 1.000000
bits 24 to 31********* 43608.360 1.000000
bits 25 to 32********* 218090.700 1.000000

i say "apparently", because it has a possible explanation in the
algorithm.
i think it is caused from the wrap-around of the pmaddwd. but i need
it/like it, because it divides the period in several parts. an SHR/L
could be applied, but i dont know where at the moment.

hopcode

unread,

Dec 20, 2011, 8:59:35 AM12/20/11

to

some little improvements,

.randomB:
rdtsc
rol dx,4
rol ax,3

mov rcx,866AA7'3F0B'7F'0B'03h
movd mm0,eax
movq mm1,rcx

pshufw mm0,mm0,01'00'01'00b
pmaddwd mm1,mm0
pshufb mm1,mm0

movq rax,mm0
movq rcx,mm1
mul rdx

ret 0

with following unstable results yet on the x2.
they are yet around 30% and 70%

Entropy = 7.999309 bits per byte.

Optimum compression would reduce the size

of this 262144 byte file by 0 percent.

Chi square distribution for 262144 samples is 250.54, and randomly
would exceed this value 56.72 percent of the times.

Arithmetic mean value of data bytes is 127.7175 (127.5 = random).
Monte Carlo value for Pi is 3.140306706 (error 0.04 percent).
Serial correlation coefficient is -0.000857 (totally uncorrelated = 0.0).

cut away the pmullw, because too much aggressive
the pshufb is aggressive too, but i dont know how to resolve it at the
moment.

i know it's quite mpossible to have a good rng without using a
seed+state. but that is the challenge, or ?

Rod Pemberton

unread,

Dec 20, 2011, 11:06:45 AM12/20/11

to

"hopcode" <hop...@nospicedham.nulnulul.de> wrote in message
news:jcpa78$aoq$1...@online.de...

> then after running it against the DIEHARD test,

Bob Jenkins also has some tests for randomness, and other related stuff like
cryptography and hashing:
http://burtleburtle.net/bob/

Sorry, you'll have to ask Markus to explain them too ... ;-)

Rod Pemberton

Bernhard Schornak

unread,

Dec 20, 2011, 6:28:47 PM12/20/11

to

hopcode wrote:

> some little improvements,
>
> .randomB:
> rdtsc
> rol dx,4
> rol ax,3
> mov rcx,866AA7'3F0B'7F'0B'03h
> movd mm0,eax
> movq mm1,rcx
> pshufw mm0,mm0,01'00'01'00b
> pmaddwd mm1,mm0
> pshufb mm1,mm0
> movq rax,mm0
> movq rcx,mm1
> mul rdx
> ret 0

PSHUFB is a SSSE 3 extension, not available on AMD machines
prior to FX-8xxx (don't know which iNTEL processors support
it). A similar (64 bit) RNG from my libraries

http://code.google.com/p/st-open/source/browse/LIB/SOURCES/core/pmf.S

rdtsc
movl %eax, %ecx
movl %eax, %r8d
movl %edx, %r9d
andl $0x3F, %ecx # remove for non-Athlon machines!
shlq $0x20, %r8
shlq $0x20, %r9
shlq $0x20, %rdx
addq %rdx, %r8
addq %rax, %r9
addq %rdx, %rax
rorq %cl, %r8
rolq %cl, %r9
rorq %cl, %rax
addq %r8, %r9
addq %r9, %rax
ret

using GPRs (only). The lowest 7 bits returned in EAX always
are different. RDTSC needs about 69...90 clocks, processing
adds another 5 clocks (AMD Athlon). Using the lowest 6 bits
as count for the rotates should throw sufficiently 'random'
numbers. Quality greatly depends on the machine's run time,
output should become more 'random' with growing EDX. EAX is
filled once per second, EDX in 2e32 seconds @ 4.2 GHz (more
than 136 years!).

Greetings from Augsburg

Bernhard Schornak

Hugh Aguilar

unread,

Dec 20, 2011, 10:35:39 PM12/20/11

to

On Dec 19, 6:08 am, hopcode <hopc...@nospicedham.nulnulul.de> wrote:
> Hi everybody,
>
> some days ago i started thinking upon a
> fast and simple random number generator of few
> lines of codes, wondering about the use of
> MUL and RDTSC.

I had to look up RDTSC as I had never heard of it. When looking it up,
I came across RDRAND that provides a prng in hardware. The
documentation didn't describe what the algorithm was, or tell if it
could be seeded so as to provide a repeatable stream. Is it any good?
Does anybody know?

Your algorithm isn't repeatable because it uses RDTSC, and I think
that would eliminate it from consideration in most cases. During
testing, people want to get the same stream every time. Also, your use
of RDTSC might result in a not-very-random stream if your prng is
called at a constant frequency, which is very likely to happen if it
is called within a loop. All in all, I would prefer to stick with a
traditional prng that gets seeded initially and then provides a stream
determined solely by that seed.

In my novice package I provided a prng that I call LC53. Here it is in
Forth:

4294967291 constant rnd-unity \ 2^32 - 5
3961633963 constant rnd-mult \ 2^32 - 333333333

macro: <prng> ( rnd -- new-rnd )
rnd-mult um* rnd-unity um/mod drop ;

It would be trivial to port to assembly language; I'll do that if
anybody is interested. The only test that I applied was to make sure
that it provides a full cycle through all of the 32-bit numbers other
than zero, which it does. I don't know how to test it for randomness,
although I should do that considering that I'm providing it as my prng
in the novice package.

Mine is definitely "a few lines of code" too; I just do a
multiplication and a division. Most prngs don't do a division. This is
because the old processors such as the 65c02 were horribly slow at
division. Is this an issue nowadays? On a modern x86, wouldn't my
algorithm with its multiplication and division be faster than an
algorithm that uses simpler instructions (XOR etc.) but has more of
them?

In my novice package I also provide a Forth version of the Java prng.
This is 48-bit rather than 32-bit, which is better. It has only
multiplication and addition, but no division. I would assume that it
is slower than mine, but I haven't tested them for speed.

In my novice package I also provided a program that cracks an
encryption system based on LC53 or any other linear-congruential prng.
It didn't work very well because I had a bad evaluation function. The
program (actually "programs" as I had several versions) were bug free,
but I just didn't know how to design a good evaluation function. This
was the only code in the novice package that I was unsatisfied with
--- everything else was good.

hopcode

unread,

Dec 20, 2011, 11:49:42 PM12/20/11

to

Hi Bernhard,

Il 21.12.2011 00:28, Bernhard Schornak ha scritto:
> PSHUFB is a SSSE 3 extension, not available on AMD machines
> prior to FX-8xxx (don't know which iNTEL processors support
> it). A similar (64 bit) RNG from my libraries
>
> http://code.google.com/p/st-open/source/browse/LIB/SOURCES/core/pmf.S

> <snip>

> using GPRs (only). The lowest 7 bits returned in EAX always
> are different. RDTSC needs about 69...90 clocks, processing
> adds another 5 clocks (AMD Athlon). Using the lowest 6 bits
> as count for the rotates should throw sufficiently 'random'
> numbers. Quality greatly depends on the machine's run time,
> output should become more 'random' with growing EDX. EAX is
> filled once per second, EDX in 2e32 seconds @ 4.2 GHz (more
> than 136 years!).

thanks for outlining it.i am using PSHUFB at the moment because
it simplify a lot the code, but i will remove it later favouring
a manual hardcoding of the shuffle.
btw,testing your algo to 1-byte-ANDing give not so bad results,
even if chi-square is 0.01 very low and distant from the acceptable of
randomness. the distribution on the contrary looks fine

http://sites.google.com/site/x64lab/bernhard_random.gif?attredirects=0

then curious of a 2-byte-ANDing i did a test on it, n shau ma hiahea des

http://sites.google.com/site/x64lab/bernhard_random_2byte.gif?attredirects=0

Harold Aptroot

unread,

Dec 21, 2011, 6:21:38 AM12/21/11

to

"Hugh Aguilar" <hughag...@nospicedham.yahoo.com> wrote in message
news:7c72de8e-a320-40dd...@32g2000yqp.googlegroups.com...

> I had to look up RDTSC as I had never heard of it. When looking it up,
> I came across RDRAND that provides a prng in hardware. The
> documentation didn't describe what the algorithm was, or tell if it
> could be seeded so as to provide a repeatable stream. Is it any good?
> Does anybody know?

Yes, it's good, it uses heat noise, fancy statistics to filter it, and AES
in cypher block chaining mode to make more bits. It's fine for cyptographic
uses.
But it's not supported by any actual processors yet, so that immediately
disqualifies it.

hopcode

unread,

Dec 21, 2011, 6:29:41 AM12/21/11

to

Hi Hugh,

Il 21.12.2011 04:35, Hugh Aguilar ha scritto:
> On Dec 19, 6:08 am, hopcode<hopc...@nospicedham.nulnulul.de> wrote:
>> Hi everybody,
>>
>> some days ago i started thinking upon a
>> fast and simple random number generator of few
>> lines of codes, wondering about the use of
>> MUL and RDTSC.
>

> Your algorithm isn't repeatable because it uses RDTSC, and I think
> that would eliminate it from consideration in most cases. During
> testing, people want to get the same stream every time. Also, your use
> of RDTSC might result in a not-very-random stream if your prng is
> called at a constant frequency, which is very likely to happen if it
> is called within a loop. All in all, I would prefer to stick with a
> traditional prng that gets seeded initially and then provides a stream
> determined solely by that seed.

ehm... good point, but difficoult for me to give an answer.
i'll try: assuming _T_ ime is cyclic, and _S_ omething in it changes
every seconds, we want correlate the two things.
permutations are not enough because even if,
for the fundamental rule, T x S are the number of resulting ways,
one can find in it _again_ the S or T, call it _fixed_ seed or
cyclical seed as from after issuing an RDTSC.

> Forth:
>
> 4294967291 constant rnd-unity \ 2^32 - 5
> 3961633963 constant rnd-mult \ 2^32 - 333333333
>
> macro:<prng> ( rnd -- new-rnd )
> rnd-mult um* rnd-unity um/mod drop ;
>

someone told me that approaches like this have very bad distribution.
i cannot say it until i test it. i know from this last days that
operations only (correct me if i am wrong) like above quoted,
or operation+masking are not enough. we need to scramble all (bytes)
values, because they all should be/act as semantically equals.
this is what i am trying to do just now, to let the algo behave more
_stable_ . at the moment, i can say after some few modifications

- distribution is ok aftern ANDing results to 8/16/32 bits
- chi-square average between 40% and 70%

and the challenge is to have both stable, using T as seed. whenever
T is cyclic, the result should be random (so called "pseudo-random")
in that instant.

> It would be trivial to port to assembly language; I'll do that if
> anybody is interested. The only test that I applied was to make sure
> that it provides a full cycle through all of the 32-bit numbers other
> than zero, which it does. I don't know how to test it for randomness,
> although I should do that considering that I'm providing it as my prng
> in the novice package.

if You port it to assembly, i can execute ent and, eventually if the
algo satisfies both conditions after ent, i.e
- chi square
- distribution

execute diehard on it and share results. coding conditions are simple:
preserve eventually RBP/RBX/RDI/RSI, result in RAX/EAX/AX/AL.

Hugh Aguilar

unread,

Dec 21, 2011, 8:45:11 AM12/21/11

to

On Dec 21, 4:29 am, hopcode <hopc...@nospicedham.nulnulul.de> wrote:
> Hi Hugh,
>
> Il 21.12.2011 04:35, Hugh Aguilar ha scritto:
> > Forth:
>
> > 4294967291 constant rnd-unity \ 2^32 - 5
> > 3961633963 constant rnd-mult \ 2^32 - 333333333
>
> > macro:<prng> ( rnd -- new-rnd )
> > rnd-mult um* rnd-unity um/mod drop ;
>
> someone told me that approaches like this have very bad distribution.
> i cannot say it until i test it. i know from this last days that
> operations only (correct me if i am wrong) like above quoted,
> or operation+masking are not enough. we need to scramble all (bytes)
> values, because they all should be/act as semantically equals.

It is true that in the linear-congruential, the high bits are more
random than the low bits. This is why everybody multiplies a 32-bit
number by the limit and then returns the high word of the product as a
number in the range [0,limit) --- your result is composed of the high
bits. Alternatively, you could divide by the limit and return the
remainder as a number in the range [0,limit) --- but then you would be
using the low bits.

> if You port it to assembly, i can execute ent and, eventually if the
> algo satisfies both conditions after ent, i.e
> - chi square
> - distribution
>
> execute diehard on it and share results. coding conditions are simple:
> preserve eventually RBP/RBX/RDI/RSI, result in RAX/EAX/AX/AL.

Here it is in HLA:

program lc53;
#include( "stdlib.hhf" )

const

rnd_unity := 4294967291; // 2^32 - 5
rnd_mult := 3961633963; // 2^32 - 333333333

proc

prng : procedure { @noframe }; // needs EAX, sets EAX
begin prng;
mul( mov( rnd_mult, ebx ) );
div( mov( rnd_unity, ebx ) );
mov( edx, eax ); // return the remainder
ret( );
end prng;

begin lc53;

stdout.put( "Here is a pseudo-random number:", nl );
mov( 1, eax );
prng( );
stdout.putu32( eax );

end lc53;

hopcode

unread,

Dec 21, 2011, 12:40:53 PM12/21/11

to

Hi Hugh,
i feeded the procedure with the RDTSC,i.e
seed is low timestamp (EAX only). then i tested it
again using the EDX:EAX too, but it results QED in both cases,
i.e a bad distribution.
in other cases it is obvious that working on the seed requires
a theory, and the algo should be extended.

here the proc

.lc53:
rdtsc
rnd_unity equ 4294967291 ; // 2^32 - 5
rnd_mult equ 3961633963 ; // 2^32 - 333333333
;--- in EAX seed
mov ecx,rnd_mult
xor edx,edx
mul rcx
mov ecx,rnd_unity
div ecx
xchg eax,edx
ret 0

ANDing the result in EAX to 0FFFFh and reading it,
the screenshot,
http://sites.google.com/site/x64lab/hugh_lc53_store16_read16.gif?attredirects=0

then i tested it again ANDing EAX to -1 (no AND)
but reading 16bits, and it show good performances, as follows from ent

Entropy = 7.998624 bits per byte.

Optimum compression would reduce the size

of this 131072 byte file by 0 percent.

Chi square distribution for 131072 samples is 250.29, and randomly
would exceed this value 57.16 percent of the times.

Arithmetic mean value of data bytes is 127.5518 (127.5 = random).
Monte Carlo value for Pi is 3.136461433 (error 0.16 percent).
Serial correlation coefficient is 0.003030 (totally uncorrelated = 0.0).

a detail of the distribution

http://sites.google.com/site/x64lab/hugh_lc53_details_store32_read16.gif?attredirects=0

even if some points still overlap.

i hope to learn plotting the uniform and normal distribution as soon as
i can, because i find gnuplot somewhat complex for my raw basic habit.
i am working on my algo providing some scrambling of the values, i hope
tomorrow late in the night i can post a draft of it.

Robert Redelmeier

unread,

Dec 21, 2011, 2:34:30 PM12/21/11

to

In alt.lang.asm hopcode <hop...@nospicedham.nulnulul.de> wrote in part [Fup set]:

> i feeded the procedure with the RDTSC,i.e seed is low
> timestamp (EAX only). then i tested it again using the
> EDX:EAX too, but it results QED in both cases, i.e a bad
> distribution. in other cases it is obvious that working on
> the seed requires a theory, and the algo should be extended.

RDTSC is obviously unsuitable for seeding a PRNG during boot.

At a later, user determined time (like deciding to run a pgm),
RDTSC may have useful amount of entropy, _BUT_ only in the
right places.

Highword EDX increments every second or two on modern CPUs
-- not many useful bits there. Lowword EAX runs faster,
but will be "clumpy", at least around CPU multipliers
and probably even slower PCI bus speed. You will lose at
least 4 bits off the bottom, and most likely more. Test!

My guess, there are at least 10 bits and maybe as many as
24 bits of entropy in EAX.

-- Robert

Rod Pemberton

unread,

Dec 21, 2011, 2:48:17 PM12/21/11

to

"Hugh Aguilar" <hughag...@nospicedham.yahoo.com> wrote in message
news:7c72de8e-a320-40dd...@32g2000yqp.googlegroups.com...

> On Dec 19, 6:08 am, hopcode <hopc...@nospicedham.nulnulul.de> wrote:

> > [...]

> > MUL and RDTSC.
>
> I had to look up RDTSC as I had never heard of it.

You may want to look up RDTSCP too.

Wikipedia's page provides info on both.
http://en.wikipedia.org/wiki/Rdtsc

But, you should also read the Intel or AMD
manuals.

Sandpile.org's x86 technical reference
http://www.sandpile.org/

Rod Pemberton

Bernhard Schornak

unread,

Dec 22, 2011, 7:19:14 AM12/22/11

to

hopcode wrote:

> Hi Bernhard,

Hi!

> Il 21.12.2011 00:28, Bernhard Schornak ha scritto:
>> PSHUFB is a SSSE 3 extension, not available on AMD machines
>> prior to FX-8xxx (don't know which iNTEL processors support
>> it). A similar (64 bit) RNG from my libraries
>>
>> http://code.google.com/p/st-open/source/browse/LIB/SOURCES/core/pmf.S
>> <snip>
>> using GPRs (only). The lowest 7 bits returned in EAX always
>> are different. RDTSC needs about 69...90 clocks, processing
>> adds another 5 clocks (AMD Athlon). Using the lowest 6 bits
>> as count for the rotates should throw sufficiently 'random'
>> numbers. Quality greatly depends on the machine's run time,
>> output should become more 'random' with growing EDX. EAX is
>> filled once per second, EDX in 2e32 seconds @ 4.2 GHz (more
>> than 136 years!).
>
> thanks for outlining it.i am using PSHUFB at the moment because
> it simplify a lot the code, but i will remove it later favouring
> a manual hardcoding of the shuffle.

My Phenom II T1100 crashed if it stumbled upon PSHUFB. Today, I
exchanged it with the FX-8150 I purchased last month, so I will
give it a try ASAP.

> btw,testing your algo to 1-byte-ANDing give not so bad results,
> even if chi-square is 0.01 very low and distant from the acceptable of randomness. the
> distribution on the contrary looks fine
>
> http://sites.google.com/site/x64lab/bernhard_random.gif?attredirects=0
>
> then curious of a 2-byte-ANDing i did a test on it, n shau ma hiahea des
>
> http://sites.google.com/site/x64lab/bernhard_random_2byte.gif?attredirects=0

Wia in Hungergaria - do hot si oana durch d Midd'n gfuadat...

This RNG throws 64 bit numbers. Looking at the second diagram,
it seems it has two preferred states, bouncing between minimum
and maximum. Did you feed the testing tool with qwords emitted
by the RNG? I never tested the RNG for real, even if the basic
function exists since 2006 (the good old OS/2 days). Cries for
improvement... ;)

If you're in testing mood, this one might throw better results

movq %rcx, %r8 # RCX = min
movq %rdx, %r9 # RDX = max
movq %rcx, %r10
movq %rdx, %r11
rdtsc
cmpq %r10, %r11
cmova %r9, %r10
cmova %r8, %r11
movl %eax, %ecx
movl %edx, %r9d
movl %eax, %r8d
xorl %edx, %edx
incq %r11
shlq $0x20, %r9
andl $0x3F, %ecx
addq %r9, %rax
shrl $0x1A, %r8d
rorq %cl, %rax
movl %r8d, %ecx
subq %r10, %r11
rolq %cl, %rax
divq %r11
addq %rdx, %r10
movq %r10, %rax
ret

This function returns a number between a minimum passed in RCX
and a maximum passed in RDX. The division is time consuming, I
know, but it is quite handy, because it returns the difference
between min and max, so we just have to add the minimum to get
a proper number between min and max.

hopcode

unread,

Dec 22, 2011, 4:30:35 PM12/22/11

to

Hi Bernhard,

Il 22.12.2011 13:19, Bernhard Schornak ha scritto:
> Wia in Hungergaria - do hot si oana durch d Midd'n gfuadat...

un guad hods geschmeckd :-)

> This RNG throws 64 bit numbers. Looking at the second diagram,
> it seems it has two preferred states, bouncing between minimum
> and maximum. Did you feed the testing tool with qwords emitted
> by the RNG?

no, because i wanted to observe the uniformity of it against the
chi-square, and that is probably the reason of such feldstreifen.
i can test it egain in two days, giving the PRNG the seed once and
let it run on its results.

yes, i am in testing mood. :-)
two days, i test both the algos and share results.

Hugh Aguilar

unread,

Dec 22, 2011, 6:42:12 PM12/22/11

to

On Dec 21, 10:40 am, hopcode <hopc...@nospicedham.nulnulul.de> wrote:
> Hi Hugh,
> i feeded the procedure with the RDTSC,i.e
> seed is low timestamp (EAX only). then i tested it
> again using the EDX:EAX too, but it results QED in both cases,
> i.e a bad distribution.
> in other cases it is obvious that working on the seed requires
> a theory, and the algo should be extended.
>
> here the proc
>
> .lc53:
> rdtsc
> rnd_unity equ 4294967291 ; // 2^32 - 5
> rnd_mult equ 3961633963 ; // 2^32 - 333333333
> ;--- in EAX seed
> mov ecx,rnd_mult
> xor edx,edx
> mul rcx
> mov ecx,rnd_unity
> div ecx
> xchg eax,edx
> ret 0

That looks like the algorithm. I have a couple questions though:
1.) Why did you zero out EDX prior to the multiply?
2.) What is RCX? Why didn't you use ECX?

> ANDing the result in EAX to 0FFFFh and reading it,

> the screenshot,http://sites.google.com/site/x64lab/hugh_lc53_store16_read16.gif?attr...

>
> then i tested it again ANDing EAX to -1 (no AND)
> but reading 16bits, and it show good performances, as follows from ent

I told you that linear-congruential prngs are more random in the high
bits than the low bits. You used the 16 low bits --- that is not going
to be very random.

Also, your two tests both use the 16 low bits --- they seem to be the
same test --- why are you describing them as different?

The LC53 cycles through all 2^32-1 possible values (everything except
0). It doesn't really matter what it is seeded with (so long as the
seed isn't 0), as it is eventually going to hit every possible seed
value if it is run long enough. Statistics such as arithmetic mean,
which cover an entire block of numbers, aren't meaningful. If you make
the block big enough (all of the possible numbers) these kinds of
statistics would be known a-priori.

Statistics that test the randomness between *adjacent* numbers in the
stream would be more meaningful. These would (hopefully) show the LC53
to be more "random" than a simple sequence (such as: 1,2,3,4...). I
don't really know how to test this though. I know almost nothing about
statistics. :-(

hopcode

unread,

Dec 22, 2011, 8:24:42 PM12/22/11

to

yes, right, xor edx,edx not needed. that's a residual. because i
converted them to 64 bit first, and the xor edx,edx was under the mul
rcx. then i realized in my counsciounsness :-) the unit and the
multiplier were 32 bit, but the instruction is still there.

> I told you that linear-congruential prngs are more random in the high
> bits than the low bits. You used the 16 low bits --- that is not going
> to be very random.
>
> Also, your two tests both use the 16 low bits --- they seem to be the
> same test --- why are you describing them as different?

because when outputting the binary file,
in the store32_read16 case, i write all 32bit of EAX but read 16 two
times: 1 call -> 2 reads;

in the 16-16 case, i issue 1 call, AND EAX to 0FFFFh (low bits),
then 1 read. essentially 1 call -> 1 read.

also, in the case store32_read16 high bits of EAX are included.
i can repeat the tests in a couple of days and share results for the
high bits of EAX only.

> The LC53 cycles through all 2^32-1 possible values (everything except
> 0). It doesn't really matter what it is seeded with (so long as the
> seed isn't 0), as it is eventually going to hit every possible seed
> value if it is run long enough

yes, in fact i observed that in the case of the timestamp as seed,
such method looks like scrambling/shuffling it, mantaining the entropy
of the seed constant and in the same range of the result.
i am working on this feature, because i like the fact that the timestamp
might be seen as a "natural" entropy-vector.

> Statistics such as arithmetic mean,
> which cover an entire block of numbers, aren't meaningful. If you make
> the block big enough (all of the possible numbers) these kinds of
> statistics would be known a-priori.

i am new to such branch of mathe too. but if i am not wrong
mean and "meaning" depend on the distributive function (i am just
learning it, after the chi-square)

>
> Statistics that test the randomness between *adjacent* numbers in the
> stream would be more meaningful. These would (hopefully) show the LC53
> to be more "random" than a simple sequence (such as: 1,2,3,4...). I
> don't really know how to test this though. I know almost nothing about
> statistics. :-(

how do You mean "ajacent" ? adjacent seeds ? say from 23 to 56 range,
from 126 to 251 range ... or perhaps randomness of the resulting stream
by ranges ? because in this last case case, iirc, diehard does it
already to a certain extent.

Bernhard Schornak

unread,

Dec 23, 2011, 4:59:08 PM12/23/11

to

hopcode wrote:

> Hi Bernhard,
>
> Il 22.12.2011 13:19, Bernhard Schornak ha scritto:
>> Wia in Hungergaria - do hot si oana durch d Midd'n gfuadat...
>
> un guad hods geschmeckd :-)

Derwei's koa Maeck-Doener-Woid war... ;)

>> This RNG throws 64 bit numbers. Looking at the second diagram,
>> it seems it has two preferred states, bouncing between minimum
>> and maximum. Did you feed the testing tool with qwords emitted
>> by the RNG?
>
> no, because i wanted to observe the uniformity of it against the
> chi-square, and that is probably the reason of such feldstreifen.
> i can test it egain in two days, giving the PRNG the seed once and
> let it run on its results.

I see. If you use smaller parts of the entire number, evaluation
might not reveal the whole truth.

>> If you're in testing mood, this one might throw better results
>>

<snip>

>>
> yes, i am in testing mood. :-)
> two days, i test both the algos and share results.

I'm busy with a benchmark, but it should be finished in a couple
of days. Battling WinAPI since Monday, I finally won, and menues
now almost work as they did on OS/2. Some translation work still
is left. My conversion functions need a major upgrade, so I'm in
testing mood, as well. Now that I can test all SSE instructions,
it might be worth a thought to provide a second library for most
recent machines. Even if they are quite sparse at the moment, it
is foreseeable they will dominate the market in a few years.

hopcode

unread,

Dec 27, 2011, 8:19:38 AM12/27/11

to

Hi,
my Park-Miller-Carta LCG, avoiding the final if-correction.
checked OK, against the Park's "minimum-standard" requirements.
i post it here in case R.Whittle needs to discuss it publicly.

.pmc:
mov ecx,eax
and eax,0FFFFh
imul eax,16807 ;--- lo
shr ecx,16
imul ecx,ecx,16807 ;--- hi
mov edx,ecx
and ecx,7FFFh
shl ecx,16
add eax,ecx
shr edx,15
add eax,edx
add eax,eax
sbb ecx,ecx
shr eax,1
sub eax,ecx
ret 0

links and further descriptions at
http://sites.google.com/site/x64lab/home/reloaded-algorithms/park-miller-carta-random-lcg

hopcode

unread,

Dec 27, 2011, 8:25:04 PM12/27/11

to

Hi Bernhard,

Il 22.12.2011 13:19, Bernhard Schornak ha scritto:
>

chi-square originally too low (0.001) for min = 2, max = 0FFFFh
in all cases. then applying the simple ROR before
updating min, chi-square will become very fine and stable
on 1-byte and 2-byte too. also, at a first look,
it seems eligible for diehard. surely better than ANSI/Unix rand().
a little flaw on min: it cannot be 1 or 0 (issues div by 0)

i called the function this way

mov rcx,[min]
mov rdx,[max]
call .nebrezn
add rax,[min]
ror rax,5 ;<-- here, to improve chi-sqare to +/-20% the expected
mov [min],rax
@store 2 ;--- i.e uses only the lower 2 bytes of eax

Fritz Wuehler

unread,

Dec 28, 2011, 4:59:35 AM12/28/11

to

multiplicative is not a word at least not in English...

hopcode

unread,

Dec 28, 2011, 5:38:12 AM12/28/11

to

Il 28.12.2011 10:59, Fritz Wuehler ha scritto:
> multiplicative is not a word at least not in English...
>

well, what would You claim now from Your own english ?
http://en.wikipedia.org/wiki/Multiplicative_function#Properties

Nomen Nescio

unread,

Dec 28, 2011, 8:15:24 AM12/28/11

to

hopcode <hop...@nulnulul.de> wrote:

> Il 28.12.2011 10:59, Fritz Wuehler ha scritto:
> > multiplicative is not a word at least not in English...
> >
> well, what would You claim now from Your own english ?
> http://en.wikipedia.org/wiki/Multiplicative_function#Properties

wikipedia is no proof of anything except it is a good source for those who
know nothing

James Harris

unread,

Dec 28, 2011, 11:00:36 AM12/28/11

to

On Dec 28, 9:59 am, Fritz Wuehler

<fr...@spamexpire-201112.rodent.frell.theremailer.net> wrote:
> multiplicative is not a word at least not in English...

Of course it is

http://www.onelook.com/?w=multiplicative

James

Phil Carmody

unread,

Dec 28, 2011, 4:59:57 PM12/28/11

to

Thus spake someone so ashamed of his own ignorance he posts
via an anonymiser.

http://www.cambridge.org/gb/knowledge/isbn/item6432943/?site_locale=en_GB

Phil
--
Unix is simple. It just takes a genius to understand its simplicity
-- Dennis Ritchie (1941-2011), Unix Co-Creator

sfuerst

unread,

Dec 28, 2011, 5:04:46 PM12/28/11

to

On Dec 27, 5:19 am, hopcode <hopc...@nospicedham.nulnulul.de> wrote:
> Hi,
> my Park-Miller-Carta LCG, avoiding the final if-correction.
> checked OK, against the Park's "minimum-standard" requirements.
> i post it here in case R.Whittle needs to discuss it publicly.
>
> .pmc:
> mov ecx,eax
> and eax,0FFFFh
> imul eax,16807 ;--- lo
> shr ecx,16
> imul ecx,ecx,16807 ;--- hi
> mov edx,ecx
> and ecx,7FFFh
> shl ecx,16
> add eax,ecx
> shr edx,15
> add eax,edx
> add eax,eax
> sbb ecx,ecx
> shr eax,1
> sub eax,ecx
> ret 0
>

> links and further descriptions athttp://sites.google.com/site/x64lab/home/reloaded-algorithms/park-mil...

>
> Cheers,
>
> --
> .:mrk[hopcode]
> .:x64lab:.
> grouphttp://groups.google.com/group/x64lab
> sitehttp://sites.google.com/site/x64lab

How about this for 64 bit:

lcng_rnd_64:
imul $16807, %rdi, %rdi
lea (%edi, %edi), %eax
shr $30, %rdi
add %edi, %eax
sbb %edi, %edi
shr $1, %eax
sub %edi, %eax
retq

And this for 32 bit:

lcng_rnd_32:
mov $16807, %eax
mul %ecx
add %eax, %eax
adc %edx, %edx
add %edx, %eax
sbb %edx, %edx
shr $1, %eax
sub %edx, %eax
retq

Of course... that random number generator is pretty poor. It is
possible to do much better.

This one passes TestU01. It takes a pointer to a 128 bit seed in
%rdi, and returns 64 bits of randomness in %rax. It has a period of
2^128. However, it uses the address within the pointer as part of the
calculation. This means that different threads, using different
seeds, will effectively get different slices of a 2^160+ period.

mov $0x6595a395a1ec531b, %rsi
mov 0xc(%rdi), %eax
add %rsi, (%rdi)
adc %rsi, 0x8(%rdi)
xor 0x8(%rdi), %rax
xor %rdi, %rax
imul %rsi, %rax
mov %rax, %rdx
shr $0x20, %rax
xor %rdx, %rax
imul %rsi, %rax
add (%rdi), %rax

(This was the fastest routine I found that would pass the TestU01
battery of tests. As such it just barely passes...)

Steven

Bernhard Schornak

unread,

Dec 28, 2011, 6:34:30 PM12/28/11

to

hopcode wrote:

> Hi Bernhard,

Hi!

The function uses this formula:

/ random number \
RAX = MIN + REMAINDER | ----------------- |
\ ((MAX + 1) - MIN) /

Assuming MIN = 0 and MAX = 255, the 64 bit random number
is divided by 256, leaving a remainder of 0...255, which
is added to MIN. Hence, RND -should- be a number between
zero and 255. Keep in mind that RDX (remainder), not RAX
(integer result) is added to MIN!

With 256 as divisor, the division's result is equal to

and rax, $0xFF

Surely not the best way to reduce a 64 bit number to one
byte used in a pseudo-RNG test... ;)

"Divide by zero" only is possible if MIN is zero and MAX
is FFFFFFFFFFFFFFFF. I will add a few lines to make this
function foolproof.

> i called the function this way
>
> mov rcx,[min]
> mov rdx,[max]
> call .nebrezn

Brez'n san imma guad! ;)

> add rax,[min]
> ror rax,5 ;<-- here, to improve chi-sqare to +/-20% the expected
> mov [min],rax
> @store 2 ;--- i.e uses only the lower 2 bytes of eax

With MIN = 0 and MAX = 255, RAX can't hold anything else
than 0...255.

hopcode

unread,

Dec 28, 2011, 8:13:37 PM12/28/11

to

Hi Steven,

Il 28.12.2011 23:04, sfuerst ha scritto:
>
> And this for 32 bit:
>
> lcng_rnd_32:
> mov $16807, %eax
> mul %ecx
> add %eax, %eax
> adc %edx, %edx
> add %edx, %eax
> sbb %edx, %edx
> shr $1, %eax
> sub %edx, %eax
> retq
>

sorry, tested but it doesnt pass the Park's minimum requirements:
starting from seed=1, after 10000 loops it should not return
28E97B81h, and it doesnt wrap after 7FFFFFFEh loops to 1.
see my website for links/details.

>
> This one passes TestU01. It takes a pointer to a 128 bit seed in
> %rdi, and returns 64 bits of randomness in %rax. It has a period of
> 2^128. However, it uses the address within the pointer as part of the
> calculation. This means that different threads, using different
> seeds, will effectively get different slices of a 2^160+ period.
>
>
> mov $0x6595a395a1ec531b, %rsi
> mov 0xc(%rdi), %eax
> add %rsi, (%rdi)
> adc %rsi, 0x8(%rdi)
> xor 0x8(%rdi), %rax
> xor %rdi, %rax
> imul %rsi, %rax
> mov %rax, %rdx
> shr $0x20, %rax
> xor %rdx, %rax
> imul %rsi, %rax
> add (%rdi), %rax
>
> (This was the fastest routine I found that would pass the TestU01
> battery of tests. As such it just barely passes...)
>

very nice, thanks for sharing.
i have tested it using a vector generated from a radioactive decay
source. this is one of the best uniform distributions i have seen till
now, among ~30 different LCG RNG.
i will share some result and itel-ligible ;-) code as i have
some more time because i am busy with the theory at the moment.

btw: what is the basic randomness-concept in it ?

Cheers,

--
.:mrk[hopcode]
.:x64lab:.

sfuerst

unread,

Dec 28, 2011, 9:43:10 PM12/28/11

to

On Dec 28, 5:13 pm, hopcode <hopc...@nospicedham.nulnulul.de> wrote:
> Hi Steven,
>
> Il 28.12.2011 23:04, sfuerst ha scritto:
>
>
>
> > And this for 32 bit:
>
> > lcng_rnd_32:
> > mov $16807, %eax
> > mul %ecx
> > add %eax, %eax
> > adc %edx, %edx
> > add %edx, %eax
> > sbb %edx, %edx
> > shr $1, %eax
> > sub %edx, %eax
> > retq
>
> sorry, tested but it doesnt pass the Park's minimum requirements:
> starting from seed=1, after 10000 loops it should not return
> 28E97B81h, and it doesnt wrap after 7FFFFFFEh loops to 1.
> see my website for links/details.

Oops, you are right. I missed a shift in the conversion from 64 bit
to 32 bit code. :-/ The result needs an extra instruction:

mov $16807, %eax
mul %ecx
add %eax, %eax
adc %edx, %edx

add %edx, %edx

add %edx, %eax
sbb %edx, %edx
shr $1, %eax
sub %edx, %eax
retq
>
>

It is based on the fact that the group under xor+shifts, and the group
under multiplications by odd numbers are "incompatible". The
interference between the two group operations gives good randomness.
The large period is easy to derive from the result of repeatedly
adding an odd 128 bit number to itself.

It is not a LCG. It uses two multiplications, so is non-linear. This
means that it will automatically avoid many of the issues that
statistical tests for randomness look for.

Steven

orz

unread,

Dec 29, 2011, 8:33:46 PM12/29/11

to

BTW, to anyone testing RNGs: Do not use Diehard as anything other than
a very preliminary test - it is badly out of date.

I thought I'd run your last RNG (the one that passes TestU01) through
my tests and benchmarks (which are in C/C++). When I attempted to
translate that in to C I got this:
Uint64 rand(Uint64 *state) {
const Uint64 K = 0x6595a395a1ec531b;
Uint64 tmp;
tmp = (tmp & 0xFFFFffff00000000) | (state[1] >> 32);
state[0] += K;
state[1] += K + (([state[0] < K) ? 1 : 0);
tmp ^= state[1] ^ (Uint64)state;
tmp *= K;
tmp ^= tmp >> 32;
return tmp + state[0];
}
In particular, the upper 32 bits of rax appear to be used while
uninitialized. Am I missing something there? For testing I used tmp
= state[1] >> 32; instead of that line, as that was my best guess of
what it should be. It's doing well so far, though it needs a few more
hours for me to be confident. Though I'll probably want to rerun the
tests without xoring by the address... since the address, while
nominally constant in a single run, is basically a nondeterministic
function of the compiler, OS, and system libraries, I'd rather treat
it pessimistically as possibly very low entropy.

The fastest RNG I have on hand at the moment that passes my testing
criteria is:
Uint64 rand(Uint64 *state) {
Uint64 old = state[0] + state[1];
state[0] = state[1] ^ (state[1] >> 7);
state[1] = state[2] + (state[2] << 3);
state[2] = old + ((state[2] << 29) | (state[2] >> 35));
return old;
}
On a good day that can produce as much as 8 gigabytes per second on a
single core of my Core i5 2500. On a bad day it can drop to as low as
5 gigabytes per second. Not sure exactly what determines whether it's
a good day or a bad day. I haven't tried an asm implementation... I
stopped using asm for these things half a decade or so ago when the
compilers finally started figuring out how to optimize (or at least
avoid pessimizing) multi-word integer math, but I think an asm
implementation would look something like:
mov rax, [rsi]
mov rbx, [rsi+8]
mov rcx, [rsi+16]
lea rax, [rax+rbx]
mov rdx, rbx
shr rdx, 7
xor rbx, dbx
lea rdx, [rcx + rcx*8]
rol rcx, 29
mov [rsi], rbx
mov [rsi+8], rdx
add rcx, rax
mov [rsi+16], rcx
ret

Also note that while TestU01 Crush and BigCrush are some of the best
batteries of empirical RNG tests around, they will pass some low
quality RNGs - a few small simple RNGs slip through the cracks
somehow, and a wide variety of large lagged-fibonacci style RNGs slip
through as well.

sfuerst

unread,

Dec 29, 2011, 11:52:12 PM12/29/11

to

<snip>

Nope. Remember that in 64 bit mode, instructions that act on a 32 bit
quantity will clear the upper 32 bits of a 64 bit register.

mov 0xc(%rdi), %eax <--- This instruction loads the lower half,
and clears the upper half of %rax.

The above C code also seems to be missing the second multiplication by
'K'.

Steven

hopcode

unread,

Dec 29, 2011, 11:50:38 PM12/29/11

to

Il 29.12.2011 03:43, sfuerst ha scritto:
> Oops, you are right. I missed a shift in the conversion from 64 bit
> to 32 bit code. :-/ The result needs an extra instruction:
>
> mov $16807, %eax
> mul %ecx
> add %eax, %eax
> adc %edx, %edx
> add %edx, %edx
> add %edx, %eax
> sbb %edx, %edx
> shr $1, %eax
> sub %edx, %eax
> retq

perfekt! ;-) it deserves an entry in the gems' list
of coding. here my intel-ligible translation

.pm_sfuerst:
mov ecx,16807
mul ecx
add eax,eax
adc edx,edx
add edx,edx
add eax,edx
sbb edx,edx
shr eax,1
sub eax,edx
ret 0

>>> > > mov $0x6595a395a1ec531b, %rsi
>>> > > mov 0xc(%rdi), %eax
>>> > > add %rsi, (%rdi)
>>> > > adc %rsi, 0x8(%rdi)
>>> > > xor 0x8(%rdi), %rax
>>> > > xor %rdi, %rax
>>> > > imul %rsi, %rax
>>> > > mov %rax, %rdx
>>> > > shr $0x20, %rax
>>> > > xor %rdx, %rax
>>> > > imul %rsi, %rax
>>> > > add (%rdi), %rax
>> >

>> > btw: what is the basic randomness-concept in it ?
> It is based on the fact that the group under xor+shifts, and the group
> under multiplications by odd numbers are "incompatible". The
> interference between the two group operations gives good randomness.
> The large period is easy to derive from the result of repeatedly
> adding an odd 128 bit number to itself.

interesting. inspirative,perhaps working with PI as vector.

sfuerst

unread,

Dec 30, 2011, 12:31:00 AM12/30/11

to

On Dec 29, 8:50 pm, hopcode <hopc...@nospicedham.nulnulul.de> wrote:
> Il 29.12.2011 03:43, sfuerst ha scritto:
>
> > Oops, you are right. I missed a shift in the conversion from 64 bit
> > to 32 bit code. :-/ The result needs an extra instruction:
>
> > mov $16807, %eax
> > mul %ecx
> > add %eax, %eax
> > adc %edx, %edx
> > add %edx, %edx
> > add %edx, %eax
> > sbb %edx, %edx
> > shr $1, %eax
> > sub %edx, %eax
> > retq
>
> perfekt! ;-) it deserves an entry in the gems' list
> of coding. here my intel-ligible translation
>
> .pm_sfuerst:
> mov ecx,16807

^ This probably should be a "mov eax", right?

> >>> > > mov $0x6595a395a1ec531b, %rsi
> >>> > > mov 0xc(%rdi), %eax
> >>> > > add %rsi, (%rdi)
> >>> > > adc %rsi, 0x8(%rdi)
> >>> > > xor 0x8(%rdi), %rax
> >>> > > xor %rdi, %rax
> >>> > > imul %rsi, %rax
> >>> > > mov %rax, %rdx
> >>> > > shr $0x20, %rax
> >>> > > xor %rdx, %rax
> >>> > > imul %rsi, %rax
> >>> > > add (%rdi), %rax
>
> >> > btw: what is the basic randomness-concept in it ?
> > It is based on the fact that the group under xor+shifts, and the group
> > under multiplications by odd numbers are "incompatible". The
> > interference between the two group operations gives good randomness.
> > The large period is easy to derive from the result of repeatedly
> > adding an odd 128 bit number to itself.
>
> interesting. inspirative,perhaps working with PI as vector.

Right... the choice of odd number doesn't matter all that much as long
as roughly half the bits are set. The theory is that three xorshifts
+ multiplies should be enough for a good random hash. The multiply
mixes entropy upwards. The xorshift mixes it downwards. However, the
above only does two though.

It manages to get away with doing less work by making the input (the
upper 64 bits of the 128 bit number) vary relatively rapidly. The
final addition also adds enough entropy to the lower bits to make the
tests barely pass. The xor by the seed pointer value to get a rng
that works well in parallel situations is just gravy. That feature
can be removed if speed is more important. (Or you could easily
change it to xor with the MPI process rank for multi-machine safety,
for example...)

Steven

Robert Wessel

unread,

Dec 30, 2011, 2:18:54 AM12/30/11

to

Which is not much, if any, faster than Mersenne Twister.

hopcode

unread,

Dec 30, 2011, 2:20:34 AM12/30/11

to

Il 30.12.2011 06:31, sfuerst ha scritto:
>> .pm_sfuerst:
>> > mov ecx,16807
> ^ This probably should be a "mov eax", right?

yes, merely for a clearness/explanation purpouse.
but for _fastcall and used as generator, is ok Your way
(i have added a note and the originyl code on my website).

>>>>>> > >>> > > mov $0x6595a395a1ec531b, %rsi
>>>>>> > >>> > > mov 0xc(%rdi), %eax
>>>>>> > >>> > > add %rsi, (%rdi)
>>>>>> > >>> > > adc %rsi, 0x8(%rdi)
>>>>>> > >>> > > xor 0x8(%rdi), %rax
>>>>>> > >>> > > xor %rdi, %rax
>>>>>> > >>> > > imul %rsi, %rax
>>>>>> > >>> > > mov %rax, %rdx
>>>>>> > >>> > > shr $0x20, %rax
>>>>>> > >>> > > xor %rdx, %rax
>>>>>> > >>> > > imul %rsi, %rax
>>>>>> > >>> > > add (%rdi), %rax
>> >
>>>>> > >> > btw: what is the basic randomness-concept in it ?
>>> > > It is based on the fact that the group under xor+shifts, and the group
>>> > > under multiplications by odd numbers are "incompatible". The
>>> > > interference between the two group operations gives good randomness.
>>> > > The large period is easy to derive from the result of repeatedly
>>> > > adding an odd 128 bit number to itself.
>> >
>> > interesting. inspirative,perhaps working with PI as vector.
> Right... the choice of odd number doesn't matter all that much as long
> as roughly half the bits are set. The theory is that three xorshifts
> + multiplies should be enough for a good random hash. The multiply
> mixes entropy upwards. The xorshift mixes it downwards. However, the
> above only does two though.

and imagine that when i say "good distribution", i mean a very
restrictive requirement to obtain it. because i tested the algo in two
ways: the way You coded it, using

mov $0x6595a395a1ec531b, %rsi

and my way :-) using RDTSC !! shifted all on RSI, at only 46-bits
information !! also, for my little experience of few days, i can
"scent" something in it, and distinguish from now 2 types of
multiplicative effekts (whenever as from the standard terminology
"multiplicative" is that LCG in which c, the constant, is = 0):

1) multiply to obtain an entropy "expansion"
this has a side-effect on the distribution and builds
local ranges, that should correspond to hyperplanes)

2) multiply to scramble the entropy
as for example, contained in the value read from the RDTSC

and this is the case i like to focus, because i believe
entropy is _already_ in the timestamp.
and for this last case too, i am finding a way to pack
a convenient decimal part of PI for use as 128 bit vector in order
to improve the distribution.

orz

unread,

Dec 30, 2011, 3:19:39 AM12/30/11

to

On Dec 29, 8:52 pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> <snip>

> Nope. Remember that in 64 bit mode, instructions that act on a 32 bit
> quantity will clear the upper 32 bits of a 64 bit register.
>
> mov 0xc(%rdi), %eax <--- This instruction loads the lower half,
> and clears the upper half of %rax.
>
> The above C code also seems to be missing the second multiplication by
> 'K'.
>
> Steven

Ah, I had forgotten that. And missed the second multiply when posting
- it's in the C code I'm testing atm, just missing from the post (I
tried to edit out extraneous details relating to how I stuck it in to
the testing framework, accidentally removed an extra line).

Anyway, if you take the bottom 16 bits of the output of that RNG and
apply a 16 bit gap test to them, it fails after 128 to 256 GB. If you
do only that test, it takes about 5 minutes. If you do my latest
standard mix of tests it takes half an hour to a full hour for it to
find that failure.

On Dec 29, 11:18 pm, Robert Wessel

<robertwess...@nospicedham.yahoo.com> wrote:
> Which is not much, if any, faster than Mersenne Twister.

It's about 8 times faster than my implementation of mt19937 (0.97 GB/s
on a good day, 0.70 GB/s on a bad day), about 12 times faster than the
implementations of mt19937 in TestU01 or GSL. Of course, if you
wanted speed you could use SFMT instead of mt19937, I'm not sure how
fast SFMT is but it's supposed to be faster than mt19937. But mt19937
fails TestU01 Crush and BigCrush and occupies a lot more memory, so it
would not be a good comparison even if it were comparable in speed.

orz

unread,

Dec 30, 2011, 3:53:02 AM12/30/11

to

Just to clarify, "that RNG" in the above paragraph refers to the RNG
posted by sfuerst with the comment "This one passes TestU01."

> On Dec 29, 11:18 pm, Robert Wessel
>
> <robertwess...@nospicedham.yahoo.com> wrote:
> > Which is not much, if any, faster than Mersenne Twister.
>
> It's about 8 times faster than my implementation of mt19937 (0.97 GB/s
> on a good day, 0.70 GB/s on a bad day), about 12 times faster than the
> implementations of mt19937 in TestU01 or GSL. Of course, if you
> wanted speed you could use SFMT instead of mt19937, I'm not sure how
> fast SFMT is but it's supposed to be faster than mt19937. But mt19937
> fails TestU01 Crush and BigCrush and occupies a lot more memory, so it
> would not be a good comparison even if it were comparable in speed.

...whereas "It" refers to the RNG I posted with the comment "The

fastest RNG I have on hand at the moment that passes my testing

criteria is".

It just occurred to me that I was referring to two different
algorithms in the same brief post so I thought I'd clarify just in
case.

sfuerst

unread,

Dec 30, 2011, 11:30:24 AM12/30/11

to

<snip>

> Ah, I had forgotten that. And missed the second multiply when posting
> - it's in the C code I'm testing atm, just missing from the post (I
> tried to edit out extraneous details relating to how I stuck it in to
> the testing framework, accidentally removed an extra line).
>
> Anyway, if you take the bottom 16 bits of the output of that RNG and
> apply a 16 bit gap test to them, it fails after 128 to 256 GB. If you
> do only that test, it takes about 5 minutes. If you do my latest
> standard mix of tests it takes half an hour to a full hour for it to
> find that failure.

I'm confused. The Bigcrush test (and the other lesser tests) don't
work like that. It does Knuth's gap test with 5*10^8 numbers for gaps
of 1/16, 3*10^8 numbers for gaps of 1/32, 10^8 numbers for gaps of
1/128, and 10^7 numbers for gaps of 1/1024. The failure you are
describing occurs with 3*10^10 output numbers... well above the level
tested.

Basically, I started with a "bulletproof" rng, and then pared it down,
removing as much as possible. It now only _just_ passes the tests
tried in Bigcrush. If you go beyond them, it will fail. This is
deliberate. The idea isn't to make the perfect rng, it is to try to
make something as fast as possible for the given testing regime.
Requiring a stricter set of tests will mean requiring something not
quite as fast...

Also note that the TestU01 tests work on 32bit quantities. The
standard way of converting a 64 bit generator to a 32 bit one is to
interleave the upper and lower halves live so:

static unsigned rng_interleaved(void)
{
static unsigned ctr = 0;
static unsigned long long dat;

if (++ctr & 1)
{
dat = rng(rseed);
return dat;
}

return dat >> 32;
}

This interleaved form is what was tested.

Steven

sfuerst

unread,

Dec 30, 2011, 11:34:19 AM12/30/11

to

> > Right... the choice of odd number doesn't matter all that much as long
> > as roughly half the bits are set. The theory is that three xorshifts
> > + multiplies should be enough for a good random hash. The multiply
> > mixes entropy upwards. The xorshift mixes it downwards. However, the
> > above only does two though.
>
> and imagine that when i say "good distribution", i mean a very
> restrictive requirement to obtain it. because i tested the algo in two
> ways: the way You coded it, using
>
> mov $0x6595a395a1ec531b, %rsi
>
> and my way :-) using RDTSC !! shifted all on RSI, at only 46-bits
> information !!

Be careful. The output from RDTSC doesn't have all that much entropy
in it. What you want to do is collect TSC output until you know you
have at least 128 bits of entropy, before reseeding. (This isn't
exactly trivial to do.) Reseeding early will be bad.

Steven

orz

unread,

Dec 30, 2011, 8:16:06 PM12/30/11

to

On Dec 30, 8:30 am, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> <snip>
>
> > Ah, I had forgotten that. And missed the second multiply when posting
> > - it's in the C code I'm testing atm, just missing from the post (I
> > tried to edit out extraneous details relating to how I stuck it in to
> > the testing framework, accidentally removed an extra line).
>
> > Anyway, if you take the bottom 16 bits of the output of that RNG and
> > apply a 16 bit gap test to them, it fails after 128 to 256 GB. If you
> > do only that test, it takes about 5 minutes. If you do my latest
> > standard mix of tests it takes half an hour to a full hour for it to
> > find that failure.
>
> I'm confused. The Bigcrush test (and the other lesser tests) don't
> work like that. It does Knuth's gap test with 5*10^8 numbers for gaps
> of 1/16, 3*10^8 numbers for gaps of 1/32, 10^8 numbers for gaps of
> 1/128, and 10^7 numbers for gaps of 1/1024. The failure you are
> describing occurs with 3*10^10 output numbers... well above the level
> tested.

Yes. If you define SmallCrush/Crush/BigCrush as your sole testing
regime then this failure is irrelevant. You said that it had passed
BigCrush so I applied my testing regime, which is a bit different than
TestU01. The lower 16 bits out of each 64 were used because that is
one of the things that my testing regime tries on 64 bit RNGs. The
fact that it is a different testing regime does not strictly mean that
it is more stringent than BigCrush - note that it used 30 minutes of
CPU time to find bias in the RNG the first time I ran it, 60 minutes
the 2nd time I ran it, whereas BigCrush uses 4 hours of CPU time on
this computer, a bit more if you run Crush and SmallCrush first.

> Basically, I started with a "bulletproof" rng, and then pared it down,
> removing as much as possible. It now only _just_ passes the tests
> tried in Bigcrush. If you go beyond them, it will fail. This is
> deliberate. The idea isn't to make the perfect rng, it is to try to
> make something as fast as possible for the given testing regime.
> Requiring a stricter set of tests will mean requiring something not
> quite as fast...

It may be the farthest that that particular algorithm can be scaled
down, but if the idea is to make the fastest / smallest / simplest RNG
that passes BigCrush I'm pretty sure it's possible to go lower than
that. Also, that RNG has all the heavy lifting done in the output
function, with almost no work done in the state transition function -
generally I expect an RNG that is tuned for maximum performance at a
given level of empirical quality to be the opposite in that regard, to
allow pseudo-entropy (for lack of a better term) to be recycled from
one call to the next. With a bit of work I expect you could get
something down to 2 or 3 opcodes per call (assuming many calls in
rapid succession, all inlined by a smart optimizer... otherwise I'd
think it might be more like 5 or 10 opcodes when you have to handle
loading registers, saving registers, and returning from function
calls) and still pass BigCrush. For instance, this passes BigCrush
just fine:
struct State {Uint64 a, b;};
Uint64 random( State *state ) {
Uint64 tmp = state->a ^ state->b;
state->a = state->b * 0x6595a395a1ec531b;
state->b = (tmp << 27) | (tmp >> 37);
return state->a;
}
I think that if that was called repeatedly by a smart optimizer it
might be as little as an xor an imul and a rol per call, plus like 5
ops total across all calls for loading everything in to registers and
saving everything back afterwards.

sfuerst

unread,

Dec 31, 2011, 1:52:13 AM12/31/11

to

<snip>

> For instance, this passes BigCrush
> just fine:
> struct State {Uint64 a, b;};
> Uint64 random( State *state ) {
> Uint64 tmp = state->a ^ state->b;
> state->a = state->b * 0x6595a395a1ec531b;
> state->b = (tmp << 27) | (tmp >> 37);
> return state->a;}
>
> I think that if that was called repeatedly by a smart optimizer it
> might be as little as an xor an imul and a rol per call, plus like 5
> ops total across all calls for loading everything in to registers and
> saving everything back afterwards.

Be careful. Even though that passes Bigcrush, I probably wouldn't
trust it. The problem is that a rng of that form has no guaranteed
period. Statistically, the period is about the square root of the
state-space. (i.e. 2^64, for that function.) However, there is
nothing preventing you from accidentally choosing a small cycle with
your initial seed... For example, let the initial seed be state->a =
0, and state->b = 0. (Not a particularly uncommon choice.)
Unfortunately, that is the absolutely worst possible case, with a
cycle length of 1. :-/

The generator I described is guaranteed to have a period of 2^128, and
it doesn't care which seed you pick. It is also guaranteed to be
uniform, with every 2^64 possible output occurring 2^64 times within
that period. Uniformity can be a good or a bad thing... usually it is
good for most applications though.

However, you are right. The low-order bits aren't quite as good as
they could be. To make them better, just replace the final addition
by another xorshift operation. You are also right in that having a
more complex seed mutation function might be better. It should also
be possible to get OOO parallelism via overlapping its calculation
with that of the hash function. However, due to contention with the
carry flag, I couldn't find a particularly nice way to do it.

Steven

orz

unread,

Dec 31, 2011, 6:23:40 AM12/31/11

to

On Dec 30, 10:52 pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> <snip>
>
> > For instance, this passes BigCrush
> > just fine:
> > struct State {Uint64 a, b;};
> > Uint64 random( State *state ) {
> > Uint64 tmp = state->a ^ state->b;
> > state->a = state->b * 0x6595a395a1ec531b;
> > state->b = (tmp << 27) | (tmp >> 37);
> > return state->a;}
>
> > I think that if that was called repeatedly by a smart optimizer it
> > might be as little as an xor an imul and a rol per call, plus like 5
> > ops total across all calls for loading everything in to registers and
> > saving everything back afterwards.
>
> Be careful. Even though that passes Bigcrush, I probably wouldn't
> trust it. The problem is that a rng of that form has no guaranteed
> period. Statistically, the period is about the square root of the
> state-space. (i.e. 2^64, for that function.) However, there is
> nothing preventing you from accidentally choosing a small cycle with
> your initial seed... For example, let the initial seed be state->a =
> 0, and state->b = 0. (Not a particularly uncommon choice.)
> Unfortunately, that is the absolutely worst possible case, with a
> cycle length of 1. :-/

The square-root-of-statespace rule is for irreversible RNGs - the
example function I gave is reversible, so it has an expected average
period closer to half the statespace size than to the square root of
the statespace size. The bad cycles issue is quite real, but my
analysis finds it to not be very significant on most RNGs of that
general character with statespaces of 2**128 (ie the chance of hitting
a cycle of length less then 2**50 on a random seeding appears to be
substantially less than one in a trillion). Though of course the all
zeroes state does need to be precluded by the seeding function. That
said, I do add a little non-chaotic structure to some of my otherwise
chaotic reversible RNGs to force them to have a provable minimum
period. My suspicion is that the combination of 4 ops (when
repeatedly inlined by a smart optimizer), a provable minimum period,
and passing BigCrush ought to be perfectly practical. Though I can't
find one looking through my RNGs at the moment... some at 5 ops
though.

Anyway, that function wasn't intended to be good RNG, just a
demonstration that extremely simple and fast RNGs of just a few
operations can pass BigCrush with little effort. I guess what I'm
saying is, I'm not quite sure what your design goals are. Since this
is in the asm group I'm assuming speed is important. From what you
said earlier I ended up with the impression that the primary goals
were passing TestU01 and being fast. Probably I was confused. If you
want a good general RNG, including statespace and cycle length...
well, in that case my biggest concern on your RNG would be inter-state
correlation. I think that the low word of your state has too little
impact upon the output to really count towards the statespace and
cycle length. If I wanted to improve that issue while making minimal
changes to the rest of the algorithm, I would probably convert that
128 bit counter in to a 128 bit LCG of modulus 2**128 and multiplier
of 2**64+1. Something like:
Uint64 random(State *state) {

const Uint64 K = 0x6595a395a1ec531b;

state->low += K;
Uint64 return_value = (state->high ^ (Uint64)state) * K;
state->high += ((state->low < K) ? 1 : 0) + state->low;
return_value ^= return_value >> 32;
return return_value * K + state->low;

}

> The generator I described is guaranteed to have a period of 2^128, and
> it doesn't care which seed you pick. It is also guaranteed to be
> uniform, with every 2^64 possible output occurring 2^64 times within
> that period. Uniformity can be a good or a bad thing... usually it is
> good for most applications though.

It's true that, being single-cycle, it has no short cycles. That sort
of uniformity is nice but nothing to write home about since it
generally requires active effort to *not* be uniform in that sense.

I would consider it to care which seed you pick. In fact, I would
consider it to be far worse on the seeding front than the example
short RNG I posted, which only needs to avoid 0 (easily handled by a
seeding function) and the in 1 in 2**60 or so fraction of seeds that
happen to lead to short cycles - it has no long term correlation
between similar initial states, and very little short term
correlation. Many common entropy sources will have much higher
quality in their low bits than their high bits. The low-word-
inadequately-reflected-in-the-output issue I mentioned above is liable
to interact with that in very bad ways in real world applications.
Consider, for instance, a program that does some basic monte carlo
type stuff using that algorithm. Suppose it decides to seed the RNG
using a 128 bit number of cycles since the computer was booted, maybe
tossing in few bits of CPU core id and boot # or somesuch to guarantee
uniqueness across cores and reboots. Now suppose this program gets
called repeatedly, and needs uncorrelated runs to produce useful
results. And suppose that your RNG often ends up instantiated at the
same address on sequential runs. All of this sounds like a pretty
normal scenario to me. And if you don't do complex hashing during
your seeding process, which you kinda implied with that zero comment,
then it will result in severe correlation between runs, as a large
fraction of the RNG output will differ by only a constant factor
between consecutive runs. Even with complex hashing in your seeding
process your effective statespace for purposes of numbers of seeds
with that can produce relatively uncorrelated output ends up being
closer to 2**64, not 2**128, which will make a noticable difference if
a lot of runs are done.

sfuerst

unread,

Dec 31, 2011, 11:45:39 AM12/31/11

to

<snip>

> The square-root-of-statespace rule is for irreversible RNGs - the
> example function I gave is reversible, so it has an expected average
> period closer to half the statespace size than to the square root of
> the statespace size.

Oops, you are right. The shift+or line is really a rotate. For some
reason, the constants 27 and 37 didn't immediately spring to mind as
adding to 64. :-/ Note: try using a bswap instruction. It moves
entropy downwards much better, and is a much nicer complement to a
multiply.

> The bad cycles issue is quite real, but my
> analysis finds it to not be very significant on most RNGs of that
> general character with statespaces of 2**128 (ie the chance of hitting
> a cycle of length less then 2**50 on a random seeding appears to be
> substantially less than one in a trillion). Though of course the all
> zeroes state does need to be precluded by the seeding function.

Unfortunately, one in a trillion isn't quite good enough. I've had to
debug issues that happen on that frequency. It isn't particularly
fun. You really need some kind of mathematical guarantee of a certain
period length. That proof is the major distinguisher between "hobby"
and "professional" examples of random number generators.

> That
> said, I do add a little non-chaotic structure to some of my otherwise
> chaotic reversible RNGs to force them to have a provable minimum
> period. My suspicion is that the combination of 4 ops (when
> repeatedly inlined by a smart optimizer), a provable minimum period,
> and passing BigCrush ought to be perfectly practical. Though I can't
> find one looking through my RNGs at the moment... some at 5 ops
> though.

Stop counting ops. Count (or rather, measure), cycles instead.
Modern computers do not do ops in the order you write them. They
merge and split ops... so the resulting instruction sequence can look
quite different. For example, I can almost swear that this machine
does something special with an add followed by an adc. If I shift a
(non-flag affecting) instruction in between, the result is nearly
always slower. The combining of compares with branches also seemed to
be implemented quite a bit before it was documented...

> Anyway, that function wasn't intended to be good RNG, just a
> demonstration that extremely simple and fast RNGs of just a few
> operations can pass BigCrush with little effort. I guess what I'm
> saying is, I'm not quite sure what your design goals are.

The design goals were to create a fast yet conceptually simple rng
that passed bigcrush, with a suitably large period. That was it. It
was an optimization exercise. Originally, I was using gcc. However,
the code output was absolutely horrible when using 128 bit types.
That lead to experimental assembly usage. Of course, the resulting
rng now doesn't use 128 bit operations other than an add... but half
the fun is the journey.

> Since this
> is in the asm group I'm assuming speed is important. From what you
> said earlier I ended up with the impression that the primary goals
> were passing TestU01 and being fast. Probably I was confused. If you
> want a good general RNG, including statespace and cycle length...
> well, in that case my biggest concern on your RNG would be inter-state
> correlation. I think that the low word of your state has too little
> impact upon the output to really count towards the statespace and
> cycle length.

Perhaps. However, the idea is that the hash function has enough
upward avalanche that that doesn't matter. The occasional carry into
the upper 64bit word is enough. The real problem seems to be the
reverse. The upper bits in the state don't affect the lower bits in
the output as much as they could.

> If I wanted to improve that issue while making minimal
> changes to the rest of the algorithm, I would probably convert that
> 128 bit counter in to a 128 bit LCG of modulus 2**128 and multiplier
> of 2**64+1.

I tried something like that, but with a generic multiplier instead.
Unfortunately, if you use a better multiplier, the result was slower
on this machine due to the 64x64->128 operation. I didn't try a
special-case multiplier though.

> Something like:
> Uint64 random(State *state) {
> const Uint64 K = 0x6595a395a1ec531b;
> state->low += K;
> Uint64 return_value = (state->high ^ (Uint64)state) * K;
> state->high += ((state->low < K) ? 1 : 0) + state->low;
> return_value ^= return_value >> 32;
> return return_value * K + state->low;
> }

That looks fairly good, except that the final addition may not be
needed. There are other options as well, like multiplying by 2^64-1,
or perhaps other values via the magic of the lea instruction. If the
effective multiplier is nice enough, a 256bit state may look
attractive. The fun part is finding the best option out of all of
them.

Steven

orz

unread,

Jan 2, 2012, 6:21:17 PM1/2/12

to

Apologies in advance if this shows up as a double or triple post, I'm
having a hard time posting at the moment.
On Dec 31 2011, 8:45 am, sfuerst <svfue...@nospicedham.gmail.com>
wrote:

> <snip>
>
> > The square-root-of-statespace rule is for irreversible RNGs - the
> > example function I gave is reversible, so it has an expected average
> > period closer to half the statespace size than to the square root of
> > the statespace size.
>
> Oops, you are right. The shift+or line is really a rotate. For some
> reason, the constants 27 and 37 didn't immediately spring to mind as
> adding to 64. :-/ Note: try using a bswap instruction. It moves
> entropy downwards much better, and is a much nicer complement to a
> multiply.

Eeenteresting... I had not considered an endianness reversal as a
possibility for downward entropy movement. On first glance it looks
good, slightly better for rightward entropy movement than the barrel
shifts I've been using in my RNGs for that purpose, at least on 32 and
64 bit words. How are they in terms of portability? Most of my RNGs
are intended for anything that does general purpose 16, 32, or 64 bit
integer math. Barrel shifts are supported on a lot of ISAs and
somewhat efficient to emulate using regular shifts when they aren't
supported... I have no clue how widely supported endianness reversal
is or isn't on non-x86 hardware, but it looks harder to emulate on 32
and 64 bit CPUs.

> > The bad cycles issue is quite real, but my
> > analysis finds it to not be very significant on most RNGs of that
> > general character with statespaces of 2**128 (ie the chance of hitting
> > a cycle of length less then 2**50 on a random seeding appears to be
> > substantially less than one in a trillion). Though of course the all
> > zeroes state does need to be precluded by the seeding function.
>
> Unfortunately, one in a trillion isn't quite good enough. I've had to
> debug issues that happen on that frequency. It isn't particularly
> fun. You really need some kind of mathematical guarantee of a certain
> period length. That proof is the major distinguisher between "hobby"
> and "professional" examples of random number generators.

The chance is believed to be far far below one in a trillion,
unfortunately it's very hard to tell how well the cycle length
distribution conforms to the ideal distribution on a purely chaotic
RNG so I try to be conservative. With no way to prove that some
people would hesitate to use it, which is one of the reasons that I
structure some of my chaotic RNGs to have provable minimum cycle
lengths or at least a high likelyhood of closely approximating the
cycle length distribution of an ideal chaotic reversible RNG.

> > That
> > said, I do add a little non-chaotic structure to some of my otherwise
> > chaotic reversible RNGs to force them to have a provable minimum
> > period. My suspicion is that the combination of 4 ops (when
> > repeatedly inlined by a smart optimizer), a provable minimum period,
> > and passing BigCrush ought to be perfectly practical. Though I can't
> > find one looking through my RNGs at the moment... some at 5 ops
> > though.
>
> Stop counting ops. Count (or rather, measure), cycles instead.
> Modern computers do not do ops in the order you write them. They
> merge and split ops... so the resulting instruction sequence can look
> quite different. For example, I can almost swear that this machine
> does something special with an add followed by an adc. If I shift a
> (non-flag affecting) instruction in between, the result is nearly
> always slower. The combining of compares with branches also seemed to
> be implemented quite a bit before it was documented...

I agree that counting ops is not a good idea. I was trying to speak
simultaneously of speed and complexity, but ops is not really a good
measure of either. The first RNG I listed I described as 5 to 8
gigabytes per second, on a single core of my Core i5 2500. The higher
end of which would imply between 3 and 4 cycles per call (this thing
supposedly runs a little over 3 GHrz, but it reclocks itself based
upon temp and workload or something). Unfortunately the results my
current main benchmark produces aren't prefectly reproducible, have
been falling steadily for the last week or two without any code
changes (maybe I should turn down the heater? dunno what else effects
it), and show no difference at all between some of the really fast
RNGs. Once they get fast enough to be hard to measure I usually worry
more about other issues (quality, seeding quality, seeding speed,
provable properties, random access, crypto security, exotic seeding,
etc).

> > Anyway, that function wasn't intended to be good RNG, just a
> > demonstration that extremely simple and fast RNGs of just a few
> > operations can pass BigCrush with little effort. I guess what I'm
> > saying is, I'm not quite sure what your design goals are.
>
> The design goals were to create a fast yet conceptually simple rng
> that passed bigcrush, with a suitably large period. That was it. It
> was an optimization exercise. Originally, I was using gcc. However,
> the code output was absolutely horrible when using 128 bit types.
> That lead to experimental assembly usage. Of course, the resulting
> rng now doesn't use 128 bit operations other than an add... but half
> the fun is the journey.

Well, you're objectively passing BigCrush, and even my pessimistic
interpretations of your cycle length exceed the maximum a modern
program could possibly. When you say a suitably large period, are you
also talking about statespace? Of course it's impossible for the
statespace to be smaller than the cycle length, but the reverse is not
true, and while the maximum desirable period for modern applications
is around 2**60, the maximum desirable statespace is much higher,
maybe 2**180 or so, higher if you assume that the effective statespace
is slightly smaller than the nominal statespace.

Speedwise... I'm a bit confused at the moment. Currently it's running
around 1.5 to 2.5 GB/s in my benchmark depending upon circumstances,
but that's in C++ not asm. Reaching 2.5 in C required substituting a
regular add for the add with carry, which might be the compilers fault
or the hardwares, I have not looked at the dissassembly. It's also
behaving somewhat quirkily, some versions running significantly faster
when called through a vtable than when called directly in an inlinable
manner. Somehow I have the vague impression that when I benchmarked
it a few days ago I saw faster performance, but I can't reproduce that
or remember what kind of performance it was.

The LCG-based variant from my last post only with the last addition
removed as you suggested is running around 4.1 GB/s. A bswap on the
input to the first multiply might help quality, haven't tried that
yet.

Note that I'd expect chaotic RNGs to outperform on both quality and
speed, even if a guaranteed minimum cycle length is required. A
variant of one of the ones I posted earlier achieves 5.6 GB/s on my
computer at the moment, has a minimum cycle length of 2**64, and does
very well on statistical tests even when "Word" is 16 bits instead of
32 or 64.
template<typename Word, int SHIFT1, int SHIFT2, int SHIFT3>
class _sfc_alternative {
public:
Word a, b, c, counter;
static Word rotate(Word value, int bits) {return (value << bits) |
(value >> (8*sizeof(value)-bits));}
Word _raw_output() {
Word old = a + b + counter++;
a = b ^ (b >> SHIFT2);
b = c + (c << SHIFT3);
c = old + rotate(c,SHIFT1);
return old;
}
};
The shifts are set to 29, 9, and 3 (I'd prefer higher, but LEA is too
nice to resist) at 64 bit at the moment. If the counter is removed
performance rises to 7.0 GB/s at the moment, but of course the minimum
cycle length guarantee is lost.

> > Since this
> > is in the asm group I'm assuming speed is important. From what you
> > said earlier I ended up with the impression that the primary goals
> > were passing TestU01 and being fast. Probably I was confused. If you
> > want a good general RNG, including statespace and cycle length...
> > well, in that case my biggest concern on your RNG would be inter-state
> > correlation. I think that the low word of your state has too little
> > impact upon the output to really count towards the statespace and
> > cycle length.
>
> Perhaps. However, the idea is that the hash function has enough
> upward avalanche that that doesn't matter. The occasional carry into
> the upper 64bit word is enough. The real problem seems to be the
> reverse. The upper bits in the state don't affect the lower bits in
> the output as much as they could.

It doesn't matter in that a 64 bit counter is actually good enough for
many purposes. It does matter though in that your 128 bit statespace
is effectively a 64 bit statespace, you might as well replace "a" with
a constant and get more speed. The occasional carry is not enough
because it does not accumulate, and it does get cancelled out. It
will hide the correlation... for a single output. Maybe several
outputs. But then the correlation will come right back full strength
when the other state caries too. It's just a matter of what percent
of the time they mismatch on caries, and it will probably average 50%
in that kind of scenario depending upon the exact values of "a".

> > If I wanted to improve that issue while making minimal
> > changes to the rest of the algorithm, I would probably convert that
> > 128 bit counter in to a 128 bit LCG of modulus 2**128 and multiplier
> > of 2**64+1.
>
> I tried something like that, but with a generic multiplier instead.
> Unfortunately, if you use a better multiplier, the result was slower
> on this machine due to the 64x64->128 operation. I didn't try a
> special-case multiplier though.
>
> > Something like:
> > Uint64 random(State *state) {
> > const Uint64 K = 0x6595a395a1ec531b;
> > state->low += K;
> > Uint64 return_value = (state->high ^ (Uint64)state) * K;
> > state->high += ((state->low < K) ? 1 : 0) + state->low;
> > return_value ^= return_value >> 32;
> > return return_value * K + state->low;
> > }
>
> That looks fairly good, except that the final addition may not be
> needed. There are other options as well, like multiplying by 2^64-1,
> or perhaps other values via the magic of the lea instruction. If the
> effective multiplier is nice enough, a 256bit state may look
> attractive. The fun part is finding the best option out of all of
> them.
>
> Steven

Yeah, I agree the last addition isn't helping there. I don't really
see other fast multipliers as being any better, but it might be worth
trying. Maybe a bswap on the input to the first multiply? Haven't
tried that yet, but the higher bits there are better mixed so that
might improve quality enough to allow the the second multiply to be
removed.

Is there a recommended way to express bswap / endianness reversals in
C?

pe...@nospam.demon.co.uk

unread,

Jan 3, 2012, 4:01:11 AM1/3/12

to

In article <e5a11bb8-146c-480a...@t14g2000prh.googlegroups.com>
cdh...@nospicedham.gmail.com "orz" writes:
[..]

> Is there a recommended way to express bswap / endianness reversals in
> C?

Probably, but here's one way (off the top of my head...) for 32 bit,
easily expanded to 64 bit (and excluding the obvious _asm{} way!):

#include <stdio.h>

union Reg32
{
unsigned Uint32 r;
unsigned char b[4];
};

void bswap (union Reg32 *pr)
{
pr->b[0] ^= pr->b[3];
pr->b[3] ^= pr->b[0];
pr->b[0] ^= pr->b[3];
pr->b[1] ^= pr->b[2];
pr->b[2] ^= pr->b[1];
pr->b[1] ^= pr->b[2];
}

int main()
{
union Reg32 r32;

r32.r = 0x11223344;
printf ("Before: 0x%08x\n", r32.r);
bswap (&r32);
printf ("After : 0x%08x\n", r32.r);

return 0;
}

Alternatively, and possibly tidier as it hides the detail:

Uint32 bswap (Uint32 ui)
{
union Reg32 r32;

r32.r = ui;
r32.b[0] ^= r32.b[3];
r32.b[3] ^= r32.b[0];
r32.b[0] ^= r32.b[3];
r32.b[1] ^= r32.b[2];
r32.b[2] ^= r32.b[1];
r32.b[1] ^= r32.b[2];

return r32.r;
}

int main()
{
Uint32 n = 0x11223344;

printf ("Before: 0x%08x\n", n);
printf ("After : 0x%08x\n", bswap(n));

return 0;
}

I have a feeling that this could be improved upon!
Pete
--
Believe those who are seeking the truth.
Doubt those who find it. - André Gide

Terje Mathisen

unread,

Jan 3, 2012, 6:57:32 AM1/3/12

to

Ouch!

It will be much faster to use transpose-style code:

First swap high/low word:

n = (n >> 16) | (n << 16);

Next, swap high/low bytes in each word:

n = ((n & 0x00ff00ff) << 8) | ((n >> 8) & 0x00ff00ff);

This can of course be easily extended to 64-bit as well.

n = (n >> 32) | (n << 32);
n = ((n & 0x0000ffff0000ffff) << 16) |
((n >> 16) & 0x0000ffff0000ffff);
n = ((n & 0x00ff00ff00ff00ff) << 8) |
((n >> 8) & 0x00ff00ff00ff00ff);

The code above could compile into something like this:

mov rbx, rax
shr rax,32

shl rbx,32

or rax,rbx
mov rcx,0x0000ffff0000ffff

mov rbx,rcx
and rcx,rax
shr rax,16

shl rcx,16
and rax,rbx

or rax,rcx
mov rcx,0x00ff00ff00ff00ff

mov rbx,rcx
and rcx,rax
shr rax,8

shl rcx,8
and rax,rbx

or rax,rcx

The C code should have a latency of ~3 cycles/stage, so about 9 cycles
for the 64-bit version, vs a single cycle for BSWAP or PSHUFB.

>
> int main()
> {
> Uint32 n = 0x11223344;
>
> printf ("Before: 0x%08x\n", n);
> printf ("After : 0x%08x\n", bswap(n));
>
> return 0;
> }
>
> I have a feeling that this could be improved upon!

Yes indeed!
:-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

orz

unread,

Jan 3, 2012, 1:29:10 PM1/3/12

to

On Jan 3, 1:01 am, p...@nospam.demon.co.uk wrote:
> In article <e5a11bb8-146c-480a-b0d0-bcd796c84...@t14g2000prh.googlegroups.com>

> int main()
> {
> Uint32 n = 0x11223344;
>
> printf ("Before: 0x%08x\n", n);
> printf ("After : 0x%08x\n", bswap(n));
>
> return 0;
>
> }
>
> I have a feeling that this could be improved upon!

> Pete
> --
> Believe those who are seeking the truth.
> Doubt those who find it. - André Gide

On Jan 3, 3:57 am, Terje Mathisen <"terje.mathisen at
tmsw.no"@giganews.com> wrote:
> p...@nospam.demon.co.uk wrote:
> > In article<e5a11bb8-146c-480a-b0d0-bcd796c84...@t14g2000prh.googlegroups.com>

If I write ((a << 11) | (a >> 21)) both MSVC and gcc reliably figure
out that that is an barrel shift, even if I obfuscate it slightly, and
the barrel shift will be fast. But when I take the xor-multiply-
barrel-shift RNG and replace the barrel shift with a bswap... all
possibilities end up slower than the barrel shift and MSVC won't
perform decently with anything other than an intrinsic.

MSVC, 32 bit code, 32 bit algorithm:
3.2 GB/s with barrel shift (no endianness reversal)
2.2 GB/s with assign (no endianness reversal)
2.1 GB/s with intrinsic (_byteswap_ulong)
1.3 GB/s with transpose
0.9 GB/s with inline asm
0.5 GB/s with union swap
0.4 GB/s with union xorswap
For some reason a barrel shift there is even faster than a simple
assignment, but all forms of 32 bit endianness reversal I tried were
slow.

gcc, 64 bit code, 64 bit algorithm:
6.7 GB/s with assign (no endianness reversal)
5.4 GB/s with rotate (no endianness reversal)
4.6 GB/s with smart-inline-asm
4.3 GB/s with intrinsic (__builtin_bswap64)
4.3 GB/s with transpose
0.8 GB/s with union swap
0.6 GB/s with union xorswap
On gcc the simple assign was properly faster than the barrel shift.
But the inline asm for some reason was faster than the intrinsic? And
the intrinsic was the same speed as the transpose? That might mean
that gcc recognized the transpose for what it intended.
Note that there is inherently a factor of 2 between the 64 bit and 32
bit versions since I'm listing them in bytes per second and 64 bit
produces twice as many bytes at a time.

Anyway, barring additional information it looks like I won't be
switching my barrel shifts for endianness reversals. Which is too bad
as I think endianness reversal is likely to be better for random
number generation purposes than barrel shifting if it could be done at
the same speed. A full bit order reversal would probably be better
still, but so far as I know can't be done efficiently.

orz

unread,

Jan 3, 2012, 2:03:24 PM1/3/12

to

On Jan 3, 10:29 am, orz <cdh...@nospicedham.gmail.com> wrote:
(snip)
> 4.6 GB/s with smart-inline-asm
oops, I was accidentally truncating the result to 32 bit on that code
path
With that corrected the performance changes from 4.6 to 4.3 GB/s.

Meaning that smart-inline-asm performed identically to the intrinsic
and the transpose code.

Pedro Pereira

unread,

Jan 3, 2012, 2:34:42 PM1/3/12

to

I haven't read all the thread but what about xorshift?
http://en.wikipedia.org/wiki/Xorshift

get_pseudo_random:
mov eax, DWORD PTR state
mov edx, DWORD PTR state+12
mov ecx, eax
sal ecx, 11
xor ecx, eax
mov eax, DWORD PTR state+4
mov DWORD PTR state, eax
mov eax, DWORD PTR state+8
mov DWORD PTR state+8, edx
mov DWORD PTR state+4, eax
mov eax, edx
shr eax, 19
xor eax, edx
xor eax, ecx
shr ecx, 8
xor eax, ecx
mov DWORD PTR state+12, eax
ret

state:
.long 123456789
.long 362436069
.long 521288629
.long 88675123

Pedro Pereira

orz

unread,

Jan 3, 2012, 3:59:02 PM1/3/12

to

On Jan 3, 11:34 am, Pedro Pereira <ppere...@nospicedham.grupopie.com>
wrote:
> I haven't read all the thread but what about xorshift?http://en.wikipedia.org/wiki/Xorshift

Flunks Crush and BigCrush. And most other tests. Well, I tested the
very similar algorithm at the top of Marsaglias paper on the subject,
not that specific algorithm, but I believe they produce virtually
identical results on tests. I do use a variant of that with output
hashing when I want a random access RNG though, or a single cycle
RNG.

Speedwise, I get 2 GB/s off of the the C version of that, 4.1 GB/s off
of 64 bit versions of the C version of that.

Terje Mathisen

unread,

Jan 3, 2012, 4:18:08 PM1/3/12

to

orz wrote:
> Anyway, barring additional information it looks like I won't be
> switching my barrel shifts for endianness reversals. Which is too bad
> as I think endianness reversal is likely to be better for random
> number generation purposes than barrel shifting if it could be done at
> the same speed. A full bit order reversal would probably be better
> still, but so far as I know can't be done efficiently.

A full bit order reversal is easily handled with the same algorithm I
showed for byte swapping, just add three more stages.

I.e. we're talking close to 20 cycles for a 64-bit number, comparable to
looking each byte up in a byte reversal table. :-(

orz

unread,

Jan 3, 2012, 5:22:18 PM1/3/12

to

Also note On Jan 3, 12:59 pm, orz <cdh...@nospicedham.gmail.com>
wrote:

Also, speedwise, note that the 2 GB/s / 4.1 GB/s numbers were for C
code that was a straight translation of the asm. Applying minor
optimizations to make a more pipelinable algorithm that produces the
same output improves performance to 2.5 GB/s / 5.0 GB/s.

On Jan 3, 1:18 pm, Terje Mathisen <"terje.mathisen at

Ouch! It would be nice, and pretty simple in hardware, but I guess
there's not enough demand for performance such operations.

Rod Pemberton

unread,

Jan 3, 2012, 9:36:34 PM1/3/12

to

<pe...@nospam.demon.co.uk> wrote in message
news:132558...@nospam.demon.co.uk...
...

> pr->b[1] ^= pr->b[2];
>

FYI, the C compilers I've seen will use at least one more instruction for
C's [] subscript operator than if pointers were used and dereferenced, i.e.,
for adding the index to the pointer. I went through twelve different
methods of breaking down a 32-bit word into bytes for a simple bytecode
interpreter and that's my take on it. Also, some compilers, e.g., GCC, are
horrible at generating x86 byte instructions, i.e., a char, while others,
e.g., OpenWatcom, are superb. You can get GCC to generate x86 byte
instructions by rewriting your C code, but it's difficult.

Rod Pemberton

Rod Pemberton

unread,

Jan 3, 2012, 9:40:13 PM1/3/12

to

"orz" <cdh...@nospicedham.gmail.com> wrote in message
news:e5a11bb8-146c-480a...@t14g2000prh.googlegroups.com...

>
> I had not considered an endianness reversal as a
> possibility for downward entropy movement.

...

Since you are discussing "endianness reversal" to modify entropy, the
following technique for increasing the randomness of hash functions might
possibly be put to work here too.

I'm not sure where I ran across the technique, perhaps it was by Bob
Jenkins. Sometimes the input data set for a hash function is
insufficiently random. I.e., no matter what you do to increase the
randomness of the hash function to reduce hash collisions, the limited range
of the input set will always generate a large number of collisions. E.g.,
ASCII text strings have a narrow range of input values per character versus
bytes of binary values, e.g., in an executable. In order to increase
randomness for text strings, a table of bytes is created with random values.
The input, as bytes, is passed through the table, i.e., byte B0=table[byte
B0] etc. Since data from the byte table is far more random than the byte
char's in the original strings, the hash collisions decrease. I'd guess
that "entropy" could be increased or decreased also depending on whether the
table was non-random or random. You could break down your 32-bit word into
bytes and then lookup a random (or non-random) value for that byte to
attempt to change entropy.

Rod Pemberton

Pedro Pereira

unread,

Jan 4, 2012, 1:57:01 PM1/4/12

to

>> I haven't read all the thread but what about xorshift?

[...]

> Flunks Crush and BigCrush. And most other tests. Well, I tested the
> very similar algorithm at the top of Marsaglias paper on the subject,
> not that specific algorithm, but I believe they produce virtually
> identical results on tests. I do use a variant of that with output
> hashing when I want a random access RNG though, or a single cycle
> RNG.

The xorwow present in the xorshift paper does much better with BigCrush:

The following tests gave p-values outside [0.001, 0.9990]:
(eps means a value < 1.0e-300):
(eps1 means a value < 1.0e-15):

Test p-value
----------------------------------------------
7 CollisionOver, t = 7 1.5e-4
8 CollisionOver, t = 7 7.8e-4
21 BirthdaySpacings, t = 16 4.6e-4
27 SimpPoker, r = 27 eps
81 LinearComp, r = 29 1 - eps1
----------------------------------------------
All other tests were passed

Corresponding assembly:

get_pseudo_random:
mov eax, DWORD PTR state

mov ecx, eax
shr ecx, 2

xor ecx, eax
mov eax, DWORD PTR state+4
mov DWORD PTR state, eax
mov eax, DWORD PTR state+8

mov DWORD PTR state+4, eax
mov eax, DWORD PTR state+12
mov DWORD PTR state+8, eax
mov eax, DWORD PTR state+16
mov edx, eax
sal edx, 4
xor edx, eax

mov DWORD PTR state+12, eax

mov eax, DWORD PTR state+20
xor edx, ecx
add ecx, ecx
xor edx, ecx
mov DWORD PTR state+16, edx
add eax, 362437
mov DWORD PTR state+20, eax
add eax, edx

ret

state:
.long 123456789
.long 362436069
.long 521288629
.long 88675123

.long 5783321
.long 6615241

The state should be aligned to a cache line boundary.

Pedro Pereira

Phil Carmody

unread,

Jan 4, 2012, 2:23:20 PM1/4/12

to

sfuerst <svfu...@nospicedham.gmail.com> writes:
<snip>

> Also note that the TestU01 tests work on 32bit quantities. The
> standard way of converting a 64 bit generator to a 32 bit one is to
> interleave the upper and lower halves live so:
>
> static unsigned rng_interleaved(void)
> {
> static unsigned ctr = 0;
> static unsigned long long dat;
>
> if (++ctr & 1)
> {
> dat = rng(rseed);
> return dat;
> }
>
> return dat >> 32;
> }
>
> This interleaved form is what was tested.

There's no "standard" behind such a technique, and for many
of the simplest types of PRNG, that would be a positively
dangerous technique to use. Throwing the bottom bits away
is far more commonly done in code bases that I've seen.

Phil
--
Unix is simple. It just takes a genius to understand its simplicity
-- Dennis Ritchie (1941-2011), Unix Co-Creator

orz

unread,

Jan 4, 2012, 4:15:28 PM1/4/12

to

On Jan 4, 10:57 am, Pedro Pereira <ppere...@nospicedham.grupopie.com>
wrote:

> >> I haven't read all the thread but what about xorshift?
>
> [...]
>
> > Flunks Crush and BigCrush. And most other tests. Well, I tested the
> > very similar algorithm at the top of Marsaglias paper on the subject,
> > not that specific algorithm, but I believe they produce virtually
> > identical results on tests. I do use a variant of that with output
> > hashing when I want a random access RNG though, or a single cycle
> > RNG.
>
> The xorwow present in the xorshift paper does much better with BigCrush:
>
> The following tests gave p-values outside [0.001, 0.9990]:
> (eps means a value < 1.0e-300):
> (eps1 means a value < 1.0e-15):
>
> Test p-value
> ----------------------------------------------
> 7 CollisionOver, t = 7 1.5e-4
> 8 CollisionOver, t = 7 7.8e-4
> 21 BirthdaySpacings, t = 16 4.6e-4
> 27 SimpPoker, r = 27 eps
> 81 LinearComp, r = 29 1 - eps1
> ----------------------------------------------
> All other tests were passed

That may be better, but it's still pretty far from good. And while
TestU01 make take a while to flunk it, my normal test regime flunks it
in a single minute. Also, compound RNGs with small sub-components
like that have interseed/interstate correlation problems which do not
show up in conventional testing but still makes them problematic for
some uses.

orz

unread,

Jan 4, 2012, 5:35:40 PM1/4/12

to

On Jan 4, 11:23 am, Phil Carmody
<thefatphil_demun...@nospicedham.yahoo.co.uk> wrote:

He was talking about RNGs that specifically are intended to output 64
random bits, not RNGs that happen to produce 64 or more bits
internally of which only a fraction are considered usably random.
What he describes is the standard method as it's what naturally
happens when people dump their random bits to a file and then run the
file through some test.

hopcode

unread,

Jan 4, 2012, 9:53:07 PM1/4/12

to

Hi Pedro,

Il 04.01.2012 19:57, Pedro Pereira ha scritto:
>>> I haven't read all the thread but what about xorshift?
>

>> Flunks Crush and BigCrush. And most other tests. Well, I tested the
>> very similar algorithm at the top of Marsaglias paper on the subject,
>> not that specific algorithm, but I believe they produce virtually
>> identical results on tests. I do use a variant of that with output
>> hashing when I want a random access RNG though, or a single cycle
>> RNG.
>
> The xorwow present in the xorshift paper does much better with BigCrush:
>
> The following tests gave p-values outside [0.001, 0.9990]:
> (eps means a value< 1.0e-300):
> (eps1 means a value< 1.0e-15):
>
> Test p-value
> ----------------------------------------------
> 7 CollisionOver, t = 7 1.5e-4
> 8 CollisionOver, t = 7 7.8e-4
> 21 BirthdaySpacings, t = 16 4.6e-4
> 27 SimpPoker, r = 27 eps
> 81 LinearComp, r = 29 1 - eps1
> ----------------------------------------------
> All other tests were passed
>

perhaps it maybe almost ok with Crush and BigCrush, as you say.
i have not yet rewritten some procs i have found for collisions
and correlations.but i set strict requirements, one being that
the algo must feed itself at evey call, issuing a RDTSC.
then, the fact that algos should hold a vector for their states
it sounds to me a luxury; i have not so much bytes for the state.
(whenever in these last days i am changing opinion in some way,
but not radically.) now, look at the picture here,

http://sites.google.com/site/x64lab/ppereira2_1C1S_32K.gif

a limited capability; distribution is not that bad afterall.
the problem is it cannot scramble the entropy it expands.
look at the black lines outlining the concept.

i repeat: that doesent mean necessairly that the algo is bad.
in fact i tweaked the original with a RDTSC,
because i am convinced :-) there must be a way to
scramble the entropy in the timestamp.

iirc, the previous You posted was better than this one.
i tested it, but i dont remember exactly at the moment. i will
post results as i have some more times, because i am busy with the
theory... and then 4 day new.online.de down.

hopcode

unread,

Jan 4, 2012, 10:13:25 PM1/4/12

to

Hi Rod,

Il 04.01.2012 03:40, Rod Pemberton ha scritto:
> I'm not sure where I ran across the technique, perhaps it was by Bob
> Jenkins. Sometimes the input data set for a hash function is
> insufficiently random. I.e., no matter what you do to increase the
> randomness of the hash function to reduce hash collisions, the limited range
> of the input set will always generate a large number of collisions.

exactly. that is the way i am moving is by using a 128bit vector subset
of decimals from PI or atmospheric noise. results using PI are
encouraging at the moment. i have a theory about choosing the opportune
range.

> The input, as bytes, is passed through the table, i.e., byte B0=table[byte
> B0] etc. Since data from the byte table is far more random than the byte
> char's in the original strings, the hash collisions decrease. I'd guess

exactly, or something like that, but in a predetermined way and bytes
opportunely taken.

> that "entropy" could be increased or decreased also depending on whether the
> table was non-random or random. You could break down your 32-bit word into
> bytes and then lookup a random (or non-random) value for that byte to
> attempt to change entropy.

i wouldnt say "increase" or "decrease". i would say "expand" or
"shrink" in the same way 484/49 is an expansion of 22/7 to a power of
2. multipications,division, subtraction,adding, are all "linear"
operation, i.e the _method_ we use to do those operations could be
plotted in the same way of a linear function like those for the LCG.
ergo we have in the pre-calculated PI something that is already linear.
understanding this will avoid (in my hope) the need for
holding a vector for an internal state of the algo.

orz

unread,

Jan 5, 2012, 4:28:36 PM1/5/12

to

I think I mentioned earlier in this thread that I was using a modified
xorshift with ouput hashing when I wanted an RNG that supported random
access - the ability to skip around arbitrary distances within its
cycle backwards or forwards efficiently. Unfortunately, it would have
been more accurate to say that I was in the process of moving from a
combined LCG (mediocre speed, not flexible, severe interseed
correlation, etc) to the modified xorshift. The modified xorshift I
was going for went like this:
template<typename Word, int shift1, int shift2, int shift3, int
shift4, int shift5, int shift6>
class RandomAccessRNG {
Word xs1, xs2, xs3, history;
public:
Word _raw_output() {
Word tmp, old;
tmp = xs1 + xs2 + xs3;
old = xs1;
xs1 = xs2 ^ (xs2 >> shift1);
xs2 = xs3 ^ (xs3 << shift2);
xs3 = xs3 ^ (xs3 >> shift3) ^ old;
tmp ^= tmp >> shift4;
tmp += tmp << shift5;
old = history;
history = tmp;
return (old ^ tmp) + ((old << shift6) | (old >> (8*sizeof(Word)-
shift6)));
}
...
};
On unsigned 16 bit integers with appropriate shift values it had a
cycle length of 2**48-1, on 32 bit it had a cycle length of 2**96-1,
and on 64 bit it had a cycle length of 2**192-1. Even on 16 bit it
passed a ton of tests, and it had very low interseed correlation.
Unfortunately, I just wrote the code to let it skip forward /
backwards and found that it's too slow at seeking - skipping forward /
backwards in its cycle takes too long, and while I might be able to
reduce that by a factor of 1000 or more with a lot of optimization it
might still be too slow. The 64 bit version literally took 2 seconds
to skip forward an arbitrary amount in its cycle with unoptimized
code.

So I'm looking for new a random access RNG again. Ideally one that is
at least semi-fast, has no sustained interseed correlations, easily
passes all statistical tests, does not use multiplication (or any
other complex operations), outputs a number of bits equal to the size
of integers it does math on internally, can be scaled up or down for
16 bit CPUs to 64 bit integers readily, can efficiently skip around
within its cycle or cycles, has a long guaranteed minimum cycle
length, uses little or no flow control, and can be expressed in
portable C/C++ easily and efficiently.

If there was an efficient way of using add-with-carry from C++ then
I'd be interested in a 3 integer LCG with very simple multiplier with
some output hashing. Unfortunately, the standard ways to do that in C
that I know of are not all that efficient, and mostly only work
reasonably on 2 integers not 3. I may give up and go to something
that is only viable at 64 bit and/or uses multiplication... it only
strictly has be across the board better than the CLCG it's replacing I
guess.

sfuerst

unread,

Jan 5, 2012, 10:42:19 PM1/5/12

to

This is the rng I'm investigating now. (Still running BigCrush tests
on various 32bit functions of the 64 bit output.) It doesn't fit all
of your criteria, but it fits most, and is very very fast.

It is a hash of the output of a 128 bit LCG. The LCG uses a
multiplier of 2^64 + 1, so is easy to step. Fast forwarding, and
reversing are trivial. The hash is the most complex thing I've
managed to stuff into the cycles spent accessing L1 in the LCG update.

My constraints are a little different than yours, so more is effort
has been "spent" making sure parallel users can be assured of
different sequences. (Hence the imul as the fourth instruction.)

.align 16
.globl rng_new
.type rng_new,@function
rng_new:
movabs $0x6595a395a1ec531b, %rcx

mov 8(%rdi), %rax

# Or you could have xor %rsi, %rax if process number is the second
parm
xor %rdi, %rax
imul %rcx, %rax

mov (%rdi), %rdx # The LCG update
add %rcx, (%rdi)
adc %rdx, 8(%rdi)

xor %rax, %rdx
ror $29, %rdx # Can use a rotate here due to the extra entropy
from %rdx
xor %rdx, %rax
mov %rax, %rdx
shl $31, %rdx
xor %rdx, %rax
mov %rax, %rdx
shr $23, %rdx
xor %rdx, %rax
retq
.size rng_new, .-rng_new

I've also looked at 192 and 256 bit variants, but getting a good fast
hash, or good fast LCG with that many bits is seemingly futile. I'm
willing to be proved wrong though. :-)

Steven

orz

unread,

Jan 6, 2012, 6:28:50 AM1/6/12

to

It looks, by eye, like a reasonable quality RNG. The way you use that
rotate-xor pair would be very dangerous on 16 bit math but only mildly
on 32 bit and not at all really on 64 bit I think. I get 2.5 GB/s on
that in C, 3.4 GB/s if I switch the adc out for an add (my guess is
that that's the compilers fault). Random access is readily available,
much easier than an xorshift based RNG.

Hm... on my C translation output quality varies enormously depending
upon the address it is located at. At the NULL address, it does
spectacularly badly, at power of 2 addresses I've tried it does badly,
but random values picked off of the heap it tends to do quite well.
I'll leave some tests running overnight on it using an address picked
off of the heap, see what they say tomorrow.

The interstate correlation seems bad at all addresses, including
sustained interstate correlation. With a statesize of only 2**128
that will likely cause problems in a number of parallel usage
scenarios. Though in some scenarios that can be made up for by the
use of the address. Which is a practice that I disapprove of - it
makes results non-reproducible, breaks some uses (including some of my
RNG tests...), and makes testing harder (I could, for instance, have
missed the problems that occur at some addresses). In short, it makes
everything messier by violating the abstraction layer I expect RNGs to
obey. The benefits it provides, while potentially comforting for some
multithreaded users, are of very little utility when paired with high
quality RNGs that are seeded correctly - I attempt to provide similar
but more flexible benefits to my (non-existant) users through a
combination of good RNG general properties (for that kind of purpose
statespace size and interseed correlations in particular) and
intelligent seeding interfaces.

In case the issues I'm seeing are due to a mis-translation, here is my
C code... a 128 bit LCG with weak constants, with both the high and
low 64 bit words of state put through a chaotic (and slightly lossy on
the high word) hashing step:
static Uint64 barrel_shift(Uint64 v, int shift) {return (v << shift) |
(v >> (64-shift));}
...

const Uint64 K = 0x6595a395a1ec531b;

Uint64 tmp = high ^ (Uint64)this;
tmp *= K;
tmp ^= barrel_shift(tmp^low, 64-29);
tmp ^= tmp << 31;
tmp ^= tmp >> 23;
Uint64 old = low;
low += K;
high += old + ((low < K) ? 1 : 0);
return tmp;

---
BTW, side note, is that the correct usage of movabs? In your code it
is used to move a constant in to a register, but my quick googling of
it suggested that it was only for reading from memory at absolute
addresses? Really I don't know much about x86-64 asm so I could
easily have just misunderstood what I read.

orz

unread,

Jan 6, 2012, 1:15:33 PM1/6/12

to

On Jan 6, 3:28 am, orz <cdh...@nospicedham.gmail.com> wrote:
> Hm... on my C translation output quality varies enormously depending
> upon the address it is located at. At the NULL address, it does
> spectacularly badly, at power of 2 addresses I've tried it does badly,
> but random values picked off of the heap it tends to do quite well.
> I'll leave some tests running overnight on it using an address picked
> off of the heap, see what they say tomorrow.

The overnight run reported it as very good output quality when located
at some random address off of the heap. Weird that it's so dependent
upon that xor.

sfuerst

unread,

Jan 6, 2012, 2:32:48 PM1/6/12

to

That's what I'm seeing as well. I'll see if I can sneak another
instruction in somewhere to fix this, yet without altering the speed.

Steven

sfuerst

unread,

Jan 6, 2012, 2:31:28 PM1/6/12

to

<snip>

That's a problem. I'll see if I can fix the hash to be a bit more
random with regard to that seed. I had hoped it wasn't necessary. :-(

> The interstate correlation seems bad at all addresses, including
> sustained interstate correlation. With a statesize of only 2**128
> that will likely cause problems in a number of parallel usage
> scenarios. Though in some scenarios that can be made up for by the
> use of the address. Which is a practice that I disapprove of - it
> makes results non-reproducible, breaks some uses (including some of my
> RNG tests...), and makes testing harder (I could, for instance, have
> missed the problems that occur at some addresses). In short, it makes
> everything messier by violating the abstraction layer I expect RNGs to
> obey. The benefits it provides, while potentially comforting for some
> multithreaded users, are of very little utility when paired with high
> quality RNGs that are seeded correctly - I attempt to provide similar
> but more flexible benefits to my (non-existant) users through a
> combination of good RNG general properties (for that kind of purpose
> statespace size and interseed correlations in particular) and
> intelligent seeding interfaces.

The advantage of using the address is that no communication is
required at all between threads to guarantee that they get different
sequences. Sequence-dividing schemes run into problems when you have
threads that fork off other threads, and the resulting number of users
is initially unknown. (Partitioning the seed-space becomes
impossible.) The other issue is multi-machine codes using something
like MPI. There, you have a number for each "rank", and need it to
make the rng give a different sequence. (That is the commented out
rsi as second parameter option.)

Of course, since it seems that the quality of the resulting stream is
strongly affected by the initial xor value, something needs to be
fixed first.

> In case the issues I'm seeing are due to a mis-translation, here is my
> C code... a 128 bit LCG with weak constants, with both the high and
> low 64 bit words of state put through a chaotic (and slightly lossy on
> the high word) hashing step:
> static Uint64 barrel_shift(Uint64 v, int shift) {return (v << shift) |
> (v >> (64-shift));}
> ...
> const Uint64 K = 0x6595a395a1ec531b;
> Uint64 tmp = high ^ (Uint64)this;
> tmp *= K;
> tmp ^= barrel_shift(tmp^low, 64-29);
> tmp ^= tmp << 31;
> tmp ^= tmp >> 23;
> Uint64 old = low;
> low += K;
> high += old + ((low < K) ? 1 : 0);
> return tmp;

Here is the result of compiling that with gcc-4.7 with -O3 :

Dump of assembler code for function rng2:
0x0000000000401640 <+0>: mov 0x8(%rdi),%r8
0x0000000000401644 <+4>: mov %rdi,%rdx
0x0000000000401647 <+7>: movabs $0x6595a395a1ec531b,%rsi
0x0000000000401651 <+17>: mov (%rdi),%rcx
0x0000000000401654 <+20>: xor %r8,%rdx
0x0000000000401657 <+23>: imul %rsi,%rdx
0x000000000040165b <+27>: add %rcx,%rsi
0x000000000040165e <+30>: mov %rsi,(%rdi)
0x0000000000401661 <+33>: mov %rdx,%rax
0x0000000000401664 <+36>: xor %rcx,%rax
0x0000000000401667 <+39>: add %r8,%rcx
0x000000000040166a <+42>: movabs $0x6595a395a1ec531a,%r8
0x0000000000401674 <+52>: ror $0x1d,%rax
0x0000000000401678 <+56>: xor %rdx,%rax
0x000000000040167b <+59>: mov %rax,%rdx
0x000000000040167e <+62>: shl $0x1f,%rdx
0x0000000000401682 <+66>: xor %rax,%rdx
0x0000000000401685 <+69>: mov %rdx,%rax
0x0000000000401688 <+72>: shr $0x17,%rax
0x000000000040168c <+76>: cmp %r8,%rsi
0x000000000040168f <+79>: setbe %sil
0x0000000000401693 <+83>: xor %rdx,%rax
0x0000000000401696 <+86>: movzbl %sil,%esi
0x000000000040169a <+90>: add %rsi,%rcx
0x000000000040169d <+93>: mov %rcx,0x8(%rdi)
0x00000000004016a1 <+97>: retq

Notice how it isn't using the adc instruction at all. It also seems
to be rather poor code. :-( The above benchmarks as taking 5.8s to do
a billion random numbers. The asm version I gave takes 4.2s Note
that the fastest of the other rng's you gave takes 4.6s on this
machine. I don't begrudge the compile much though... even simple
rearrangements of the asm instructions cause the time taken to vary by
up to 30%. It is very hard to come up with the right ordering to
overlap memory accesses with computation.

Once I find a version with slightly better properties, I'll think
about finding something the C compiler can convert into good asm.
Previously, the trick was to use a union with an intrinsic 128 bit
unsigned integer type.

> ---
> BTW, side note, is that the correct usage of movabs? In your code it
> is used to move a constant in to a register, but my quick googling of
> it suggested that it was only for reading from memory at absolute
> addresses? Really I don't know much about x86-64 asm so I could
> easily have just misunderstood what I read.

"movabs "is the instruction gcc uses to move a 64 bit immediate into a
64 bit register, as you can see from the above. Don't ask me why they
call it that rather than just plain old "mov". :-/

Steven

orz

unread,

Jan 6, 2012, 6:31:46 PM1/6/12

to

I'm aware of the benefits. Fundamentally, the issue of trying to get
each thread an uncorrelated sequence is identical to the issue of
trying to get each run an uncorrelated sequence. You can solve it in
an automated fashion for the specific case of per-thread RNGs in
shared address space TLS by violating the abstraction layer, but that
is no help at all on sequential runs, distributed runs, parallel runs,
sequential threads, etc. Users who need decent random number
generators may be doing monte carlo simulations or whatever that want
a lack of correlation over any of those domains. Users who really
know what they're doing will be hurt more than helped by your method
(they might appreciate simplified thread seeding, but they don't want
to have to remember the details of how the abstraction layer is
violated), and those who don't will only be helped in a limited set of
scenarios.

You effectively get an extra 4-8 bytes of static state, without having
to store it in memory or even read it from memory, and those extra
bits of state are guaranteed to never share the same value as any
other RNG that is instantiated at the same time in the same address
space. My approach shares some similarities but is pretty different:
1. I have a large enough statespace so that initializing them to a
random state is probabilistically good enough to eliminate
correlations over a reasonable domain, regardless of the nature of the
domain. That generally means at least 3 variables worth of state for
16 to 64 bit RNGs, preferably 4 or 5. And definitely more for 8 bit
RNGs.
2a. I supply an automatic seeding mechanism to users. They don't have
to use it, but I encourage its use for cases where neither crypto
security nor reproducible behavior is required. They can still
reproduce the RNG output by copying or serializing the RNG state, just
not by saving the seed since there is no visible seed in the
autoseeding case. The automatic seeding mechanism initializes an
internal RNG with a function of the program start time, the current
time, the current value of the stack pointer (which is unique per-
thread for the life of the thread), the address of the RNG it was
called on (the same basic thing you were using!, but without the
problems), the number of RNGs initialized within the current thread
(if TLS variables are supported on the platform), a few other things,
and on windows some platform specific stuff (I ought to add platform
specific stuff for unix sometime...). The internal RNG then
initializes the users RNG to a random state, in a fashion that the
users RNG algorithm can customize if it needs to (to preclude some
states, for instance). Some of those are precomputed at application
launch time, others are done at autoseeding time. The internal RNG
algorithm is chosen to be semi-fast (on 64 bit anyway... slower on
other CPUs), very high quality, guaranteed to have no short cycles,
have high quality seeding, be quick to seed, and be capable of seeding
from oddly structured data.
2b. I also supply some more complex tools for high quality seeding for
cryptographic users and power users, who hopefully know what they're
doing. And I also supply a very simple seed-from-a-64-bit-integer
interface, which sometimes maps to an RNGs reference seeding algorithm
(if it has one that uses integers in the range supplied) and other
times maps to something similar to the autoseeder except
reproducible. All RNGs get all of those seeding mechanisms on an
identical interface. Additionally many RNGs also offer other custom
seeding mechanisms, and a few have default states that they go to
deterministically if not seeded. Those without default states require
a seed in their constructor to prevent accidents unless the user
explicitly requests otherwise (by passing a magic token to the
constructor) or is using the "raw" version of the RNG without the
normal interface. If an RNG that does not have a default state is
used without being seeded then its behavior is not defined.

No interthread or interprocess or intermachine communication is
required. It mostly works equally well for sequential RNG
instantiations in the same thread, in different threads, on sequential
runs, on simultaneous runs, on distributed machines, etc. The weakest
case at the moment is if a thread is created, then destroyed, then
very quickly recreated with the same stack address and TLS address.
Simultaneous launches on non-windows platforms are also weak. Every
other case that I can think of is strong, though not up to
cryptographic standards. The platform-specific stuff is done just
once per application launch. The requirements are:
1. Either the library initialization function or the first automatic
seeding is performed before additional threads are launched. That
will tend to naturally be true for programs that don't go around doing
crazy things with the launch and linking processes.
2. the degree of lack of correlation desired across the domain not
exceed the square root of the statespace size. That is almost
identical to a fundamental requirement for high quality RNGs that
would apply regardless, so it's no big deal.

The typical multithreaded user writes something like:

typedef PractRand::RNGs::Polymorphic::jsf64 RNG_TYPE;
__thread RNG_TYPE rng(PractRand::SEED_AUTO);

...and is done, with the same basic benefits as your approach. The
more sophisticated user can do more sophisticated things though, no
one will be fooled in to thinking that a nondeterministic case should
behave deterministically, and serializing the RNG state for later
reproduction will work properly (which will let some of my more exotic
RNG tests behave correctly on it).

/me barfs
I suppose it could be worse. It could branch to implement the adc.

> Notice how it isn't using the adc instruction at all. It also seems
> to be rather poor code. :-( The above benchmarks as taking 5.8s to do
> a billion random numbers. The asm version I gave takes 4.2s Note
> that the fastest of the other rng's you gave takes 4.6s on this
> machine. I don't begrudge the compile much though... even simple
> rearrangements of the asm instructions cause the time taken to vary by
> up to 30%. It is very hard to come up with the right ordering to
> overlap memory accesses with computation.

4.6 seconds, ow! I know performance at the high end gets weird, but
still I had hoped that at least one of the RNGs I posted would do
better. Several of them reliably take less then 2 seconds for that on
both my Core i5 2500 and my old Core2 3 GHrz using both 64 bit gcc and
less than twice that using 32 bit MSVC. All w/o asm.

> Once I find a version with slightly better properties, I'll think
> about finding something the C compiler can convert into good asm.
> Previously, the trick was to use a union with an intrinsic 128 bit
> unsigned integer type.
>
> > ---
> > BTW, side note, is that the correct usage of movabs? In your code it
> > is used to move a constant in to a register, but my quick googling of
> > it suggested that it was only for reading from memory at absolute
> > addresses? Really I don't know much about x86-64 asm so I could
> > easily have just misunderstood what I read.
>
> "movabs "is the instruction gcc uses to move a 64 bit immediate into a
> 64 bit register, as you can see from the above. Don't ask me why they
> call it that rather than just plain old "mov". :-/
>
> Steven

The following variant of that:
Word tmp = high;
tmp ^= tmp >> (OUTPUT_BITS / 2);
tmp *= K;
tmp ^= rotate(tmp ^ low, OUTPUT_BITS/3);
Word old = low;

low += K;
high += old + ((low < K) ? 1 : 0);
return tmp;

...does much better on statistical tests when used on 32 bit integers
instead of 64 bit integers. For the full 64 bit versions there is no
way to distinguish quality as both that and the original pass all
tests I've tried (when the xored value wasn't low-entropy), but
probably the 32 bit version is representative. It still has the
interstate correlation issues, but those are very hard to get rid of
on fast random access RNGs, and no worse than the ones in the CLCG I'm
looking to replace. Adding a second multiply and another xorshift
gets the interstate correlation issue under control without hurting
performance intolerably - users who want performance faster than that
can damn well use a non-random-access RNG. I might also try combining
successive elements like my hashed LFSR did to see if I can eliminate
the need for a 2nd multiply that way. I'm thinking of adding a third
word of state doing something dumb to get the statespace a little
bigger, but at this point really it's looking good enough to be worth
replacing the CLCG.

Hey, do you have any problems with people using any of this stuff,
modifying it, etc, potentially in closed source or commercial apps? I
try to keep all of my recommended RNGs public domain or very close to
public domain.

Frank Kotler

unread,

Jan 8, 2012, 5:51:24 PM1/8/12

to

orz wrote:

> Hey, do you have any problems with people using any of this stuff,
> modifying it, etc, potentially in closed source or commercial apps? I
> try to keep all of my recommended RNGs public domain or very close to
> public domain.

Okay...

I guess you guys have quit using rdtsc... something's been nagging at my
mind... on my machine (P4), rdtsc always returns an even number...

Okay, that's just an attempt to stay on-topic. This is really just a
test to see if there's a possible glitch in our moderation software. If
you should have trouble posting, try dropping clax and posting just to
ala for a couple days. I hope it's not really a problem...

Best,
Frank

Reposting just to ala: Okay, it *is* a problem. We'll get it fixed ASAP.
Meanwhile, post just to ala. Sorry 'bout that.

Best,
Frank
[clax moderation team]

Nathan

unread,

Jan 8, 2012, 7:46:08 PM1/8/12

to

On Jan 8, 5:51 pm, Frank Kotler <fbkot...@myfairpoint.net> wrote:
>
> Reposting just to ala: Okay, it *is* a problem. We'll get it fixed ASAP.
> Meanwhile, post just to ala. Sorry 'bout that.
>

Ooops! Sorry. My bad. I believe that I've got things fixed... it
should be working normally soon....

Nathan.

Frank Kotler

unread,

Jan 10, 2012, 4:13:37 AM1/10/12

to

Nathan wrote:
> Ooops! Sorry. My bad. I believe that I've got things fixed... it
> should be working normally soon....

All fixed. Resume cross-posting... Thanks, Nathan!

Best,
Frank

Jim Leonard

unread,

Jan 10, 2012, 12:54:11 PM1/10/12

to

On Jan 10, 3:13 am, Frank Kotler

Unfortunately, whatever occurred seems to have eaten my post from last
night around 4:45pm CST.

sfuerst

unread,

Jan 10, 2012, 7:38:54 PM1/10/12

to

I don't think things are that bad. Often the simple case will "just
work". Leave the complex interface for those who really need it, and
have sensible defaults that work for most users.

> You effectively get an extra 4-8 bytes of static state, without having
> to store it in memory or even read it from memory, and those extra
> bits of state are guaranteed to never share the same value as any
> other RNG that is instantiated at the same time in the same address
> space.

I know. It feels like cheating... :-) It is the only way I know to
get a random number generator to have a larger state-space than that
bounded by the size of its seed. Breaking fundamental limits is
always entertaining.

This is the "standard" technique for parallel generators... so there
is nothing much wrong with it. That doesn't mean that other (perhaps
simpler) methods don't exist. Sometimes the new methods do end up
being better... That's how progress happens. :-)

> No interthread or interprocess or intermachine communication is
> required. It mostly works equally well for sequential RNG
> instantiations in the same thread, in different threads, on sequential
> runs, on simultaneous runs, on distributed machines, etc. The weakest
> case at the moment is if a thread is created, then destroyed, then
> very quickly recreated with the same stack address and TLS address.
> Simultaneous launches on non-windows platforms are also weak. Every
> other case that I can think of is strong, though not up to
> cryptographic standards. The platform-specific stuff is done just
> once per application launch. The requirements are:
> 1. Either the library initialization function or the first automatic
> seeding is performed before additional threads are launched. That
> will tend to naturally be true for programs that don't go around doing
> crazy things with the launch and linking processes.
> 2. the degree of lack of correlation desired across the domain not
> exceed the square root of the statespace size. That is almost
> identical to a fundamental requirement for high quality RNGs that
> would apply regardless, so it's no big deal.

You can fix the rapid recreate problem by using a guid. Provided that
the guid updates rapidly enough, there is no way other users can have
the same initial seed state. The trick is to make the creation stall
long enough to fulfill this speed limitation. Since thread creation
is (meant to be) slow and rare, this isn't much of a constraint in
practice.

> The typical multithreaded user writes something like:
>
> typedef PractRand::RNGs::Polymorphic::jsf64 RNG_TYPE;
> __thread RNG_TYPE rng(PractRand::SEED_AUTO);
>
> ...and is done, with the same basic benefits as your approach. The
> more sophisticated user can do more sophisticated things though, no
> one will be fooled in to thinking that a nondeterministic case should
> behave deterministically, and serializing the RNG state for later
> reproduction will work properly (which will let some of my more exotic
> RNG tests behave correctly on it).

I'll admit that the interface (if you can call it that) for the form
of rng I've proposed isn't well-suited for C++. However, since I
don't like C++ very much, I consider that a feature rather than a
bug. :-P

> > Notice how it isn't using the adc instruction at all. It also seems
> > to be rather poor code. :-( The above benchmarks as taking 5.8s to do
> > a billion random numbers. The asm version I gave takes 4.2s Note
> > that the fastest of the other rng's you gave takes 4.6s on this
> > machine. I don't begrudge the compile much though... even simple
> > rearrangements of the asm instructions cause the time taken to vary by
> > up to 30%. It is very hard to come up with the right ordering to
> > overlap memory accesses with computation.
>
> 4.6 seconds, ow! I know performance at the high end gets weird, but
> still I had hoped that at least one of the RNGs I posted would do
> better. Several of them reliably take less then 2 seconds for that on
> both my Core i5 2500 and my old Core2 3 GHrz using both 64 bit gcc and
> less than twice that using 32 bit MSVC. All w/o asm.

My computer is old and slow. Performance is relative, remember.

I tried something much like the above, and it wasn't very good. It
either had poor randomness, or failed to be fast enough to bother
with.

>
> Hey, do you have any problems with people using any of this stuff,
> modifying it, etc, potentially in closed source or commercial apps? I
> try to keep all of my recommended RNGs public domain or very close to
> public domain.

No problems. It is just math, after all.

Steven

sfuerst

unread,

Jan 10, 2012, 8:21:31 PM1/10/12

to

On Jan 8, 2:51 pm, Frank Kotler <fbkot...@myfairpoint.net> wrote:
> orz wrote:
> > Hey, do you have any problems with people using any of this stuff,
> > modifying it, etc, potentially in closed source or commercial apps? I
> > try to keep all of my recommended RNGs public domain or very close to
> > public domain.
>
> Okay...
>
> I guess you guys have quit using rdtsc...

rdtsc doesn't have enough entropy to be usable in a rapidly called
rng. Instead, what you should do is gather the few bits of entropy
that rdtsc has, and reseed the entire state of the rng once you have
enough.

Unfortunately, rdtsc is a rather slow instruction, so calling it every
rng invocation is painful.

> something's been nagging at my
> mind... on my machine (P4), rdtsc always returns an even number...

That seems really annoying. The most obvious assumption you could
make is that the lowest bit of the cycle count is the most random.

Steven

sfuerst

unread,

Jan 10, 2012, 8:19:55 PM1/10/12

to

On Jan 5, 7:42 pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:

Okay... after quite a bit of effort, I've found something much, much
better.

This is three different rng's in one:

.align 16
.globl rng_hash_128
.type rng_hash_128,@function
rng_hash_128:
mov 8(%rdi), %rax

movabs $0x6595a395a1ec531b, %rcx

# Option 1)
# Non-threaded, fastest. No xor instruction used.

# Option 2)
# Threaded, use the address of the seed as a nonce.
# xor %rdi, %rax

# Option 3)
# Threaded, pass a nonce as a second parameter.
# xor %rsi, %rax
# or have it stored in a larger seed:
# xor 16(%rdi), %rax

mov (%rdi), %rsi
mul %rcx

add %rcx, (%rdi)
adc %rsi, 8(%rdi)

xor %rsi, %rax
xor %rdx, %rax
mul %rcx
add %rsi, %rax
add %rdx, %rax

retq
.size rng_hash_128, .-rng_hash_128

For non-threaded use, pick the one without the extra xor instruction.
This is fastest, taking 3.77s to output 10^9 64bit random numbers on
this machine.

If you have multiple threads, then pick option two. The rng will use
the different addresses of the seeds to make sure that each thread
gets a different output sequence. This takes 4.05s to output 10^9
64bit random numbers on this machine.

If you are using multiple computers, or require multithreaded
repeatability, then pick option three. Pass a variable describing the
thread/machine-number to the rng. (Another option is to store this
extra state into a slightly expanded seed, and use "xor 16(%rdi),
%rax" instead.)

In effect, options two and three choose a 2^128 subsequence from a
2^192 element sequence.

This rng passes TestU01 BigCrush with the top 32 bits, bottom 32 bits,
middle 32 bits, and alternating bottom and top 32 bit bitstream, with
a total of zero failures.

It works by hashing the output of a 128 bit LCG, with a simple
multiplier of 2^64 + 1 and addend of a "random" 64 bit odd constant.
Feel free to try some other constant. As long as it has roughly half
the bits set, and is odd, it should work well. Since the base is a
LCG, fast-forwarding or reversing the rng is trivial.

The hash is based on merging the upper and lower 64 bit halves of two
64x64->128 bit multiplies. This loses entropy... however since we
also merge in another part of the seed, the extra entropy from that
compensates. The result is an irreversible hash that seems to have
very good statistical properties.

Why is this rng so fast?

Most normal random number generators use a mapping x_new = f(x), where
the function f defines how the seed state x is altered.

This rng instead of outputting (part of) the seed, outputs a hash g(x)
of it instead. Thus f(x), and g(x) need to be calculated in every
call. This looks obviously slower than the other method. However,
two things save us. The first is that modern processors are out-of-
order, and can execute many instructions in parallel. Thus the
calculation of f() and g() can overlap. The second is that the
constraints of f() and g() are different.

f() needs to map to a large cycle of states, and have good statistical
randomness. g() just needs to have good statistical randomness.

So by picking a very fast f(), with poor randomness, and a good-enough
g() we can obtain a very fast rng that still passes stringent
statistical tests. The problem is finding a good fast hash.
Fortunately, the extended-precision multiply used above works for g().
(A fast yet poor quality LCG easily works for f() )

Note: when re-implementing the above, be careful with the instruction
ordering. Even small rearrangements can alter the speed by tens of
percent.

Steven

orz

unread,

Jan 11, 2012, 6:39:18 AM1/11/12

to

On Jan 10, 5:19 pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> Note: when re-implementing the above, be careful with the instruction
> ordering. Even small rearrangements can alter the speed by tens of
> percent.

forgot to ask: what CPU are you using?

orz

unread,

Jan 11, 2012, 6:37:52 AM1/11/12

to

On Jan 10, 4:38 pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:

Actually, on second thought I want to qualify that slightly - you have
no control over the value of that state, so you have to
pessimistically assume it will be near-zero entropy, effectively far
less than than a pointer size worth of entropy. Really, the only
thing you can count on is uniqueness (for the life the RNG instance).
Which is, admittedly, pretty sweet. But I shouldn't have described it
in terms of the pointer size, size the pointer size is only a measure
of the maximum entropy involved, and has very little to do, in
practice, with the actual entropy involved.

Is it? I've seen several of the individual elements of my scheme in
other packages (TestU01, GSL, Boost, etc), but never more than one or
two of the elements in a single package, and a few elements I've never
seen elsewhere. I've seen your xor-with-address scheme several times
on the web, always combined with xorshift+multiply pairs, but never in
a library that I can recall.

I think that interface-wise, for the simple case, my scheme is as
simple as anything that allows multiple RNG states (ie simpler than
the libraries aimed at research, but not as simple as libc rand/
srand). Implementation-wise, my scheme is not simple... there ends up
being a call to the system RNG at program start, and some additional
hashing on a per-autoseeding basis and a tiny bit on a per-word-of-RNG-
state-autoseeded basis. For most uses it's overkill, but since many
users have a hard time figuring out just how broad a scope they need
lack of correlation over, and since the resources used are cheap, I
shoot for a little overkill.

>
>
>
>
>
>
>
> > No interthread or interprocess or intermachine communication is
> > required. It mostly works equally well for sequential RNG
> > instantiations in the same thread, in different threads, on sequential
> > runs, on simultaneous runs, on distributed machines, etc. The weakest
> > case at the moment is if a thread is created, then destroyed, then
> > very quickly recreated with the same stack address and TLS address.
> > Simultaneous launches on non-windows platforms are also weak. Every
> > other case that I can think of is strong, though not up to
> > cryptographic standards. The platform-specific stuff is done just
> > once per application launch. The requirements are:
> > 1. Either the library initialization function or the first automatic
> > seeding is performed before additional threads are launched. That
> > will tend to naturally be true for programs that don't go around doing
> > crazy things with the launch and linking processes.
> > 2. the degree of lack of correlation desired across the domain not
> > exceed the square root of the statespace size. That is almost
> > identical to a fundamental requirement for high quality RNGs that
> > would apply regardless, so it's no big deal.
>
> You can fix the rapid recreate problem by using a guid. Provided that
> the guid updates rapidly enough, there is no way other users can have
> the same initial seed state. The trick is to make the creation stall
> long enough to fulfill this speed limitation. Since thread creation
> is (meant to be) slow and rare, this isn't much of a constraint in
> practice.

If fixed the non-windows issues since my last post. I'm still not
sure what to do about the thread recreation issue. One solution would
be to use some sort of locking mechanism or atomic math, maybe
EnterCriticalSection or InterlockedIncrement on Win32/MSVC. But I'd
like to stay very portable... I currently use no libraries aside from
the C++ standard library (and, on windows, kernel32.lib), and nothing
compiler or OS or ISA specific without ifdefs and fallback paths on
unrecognized systems (admittedly the fallbacks are not very good for
OSes that are neither windows nor *nix...). Another solution would
be a malloc(1) on a per-thread basis, with no corresponding free. I'm
not familiar with "guid", but it appears to refer to a variety of
different algorithms for generating 128 bit numbers intended to be
unique across process boundaries and often network boundaries.
Apparently I would call UuidCreateSequential on win32, I'm not sure
what I would do on *nix? But that isn't really looking any better to
me in terms of portability or speed than the other options. Though it
could replace the RtlGenRandom and/or /dev/random usage on program
start.

> > The typical multithreaded user writes something like:
>
> > typedef PractRand::RNGs::Polymorphic::jsf64 RNG_TYPE;
> > __thread RNG_TYPE rng(PractRand::SEED_AUTO);
>
> > ...and is done, with the same basic benefits as your approach. The
> > more sophisticated user can do more sophisticated things though, no
> > one will be fooled in to thinking that a nondeterministic case should
> > behave deterministically, and serializing the RNG state for later
> > reproduction will work properly (which will let some of my more exotic
> > RNG tests behave correctly on it).
>
> I'll admit that the interface (if you can call it that) for the form
> of rng I've proposed isn't well-suited for C++. However, since I
> don't like C++ very much, I consider that a feature rather than a
> bug. :-P
>

I don't think it has issues with C++ really. Well, I suppose arguably
with object orientation in general since its state can't (or
shouldn't) really be treated like a normal object. But C++ would
handle it fine a simple function instead of a class. What I would say
is that it lacks reproducibility (via either manual seeding or state
serialization). RNG users sometimes want reproducibility for
debugging purposes, network games and apps, hashing purposes, probably
a few other uses I can't think of atm.

>
>
>
>
>
> > > Notice how it isn't using the adc instruction at all. It also seems
> > > to be rather poor code. :-( The above benchmarks as taking 5.8s to do
> > > a billion random numbers. The asm version I gave takes 4.2s Note
> > > that the fastest of the other rng's you gave takes 4.6s on this
> > > machine. I don't begrudge the compile much though... even simple
> > > rearrangements of the asm instructions cause the time taken to vary by
> > > up to 30%. It is very hard to come up with the right ordering to
> > > overlap memory accesses with computation.
>
> > 4.6 seconds, ow! I know performance at the high end gets weird, but
> > still I had hoped that at least one of the RNGs I posted would do
> > better. Several of them reliably take less then 2 seconds for that on
> > both my Core i5 2500 and my old Core2 3 GHrz using both 64 bit gcc and
> > less than twice that using 32 bit MSVC. All w/o asm.
>
> My computer is old and slow. Performance is relative, remember.

Yeah. Sigh. Even just looking at it as relative performance... Is
your optimizer allowed to inline your stuff and/or remap register
names, or does it have to stick to calling conventions and/or the code
as you wrote it? On my benchmarks, if I force it to not inline the
calls then they hit a performance maximum around 300 million calls per
second (though it varies quite a bit, from a low around 100 to a high
around 500, it's usually between 250 and 325), regardless of how fast
the RNG itself is. Even return 0 will get capped at that rate. If I
let it inline, even then it will still hit a cap around 1.1 billion
calls per second even for very simple RNGs like "int random() {int tmp
= a; a += b; return tmp;}". A simple "return 0;" is faster still, but
by so much that I suspect it optimizes away almost the entire calling
loop in that instance. And a few of the RNGs I've posted in this
thread get 0.7 to 1.0 billion calls per second here - just under what
looks like the maximum possible performance on this CPU / compiler
pair. MSVC results are very similar, though it will inline even
across compilation unit boundaries, unlike gcc (4.5.4) - I only see
the lower cap on MSVC if the call goes through a vtable.
In my attempts to achieve more meaningful performance measures I've
tried to look for aggregate test results across multiple compilers,
multiple CPU generations, multiple structures of calling loops, etc.
I had thought it helped a little, but it's hard to say really, and
often that gets tiresome and I'll skip most of it for extended periods
of time.

That change had no performance impact on my computer (it's becoming
increasing apparent that we see very different performance impact from
changes), and did well on statistical tests when used on 32 bit
integers where the original (when changed to use 32 bit integers) did
poorly (IIRC failing after 5 minutes or so). A few other tweaks got
it to the point of passing an 8 hour test on 32 bit, I didn't try any
longer tests.

>
> > Hey, do you have any problems with people using any of this stuff,
> > modifying it, etc, potentially in closed source or commercial apps? I
> > try to keep all of my recommended RNGs public domain or very close to
> > public domain.
>

> No problems. It is just math, after all.
>
> Steven

Excellent, I'll probably be using something in the general ballpark of
this stuff for random access RNGs for the time being, until I can
figure out a way to get good random access without multiplication.

On Jan 10, 5:19 pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:

I get about 2.5 GB/s when I stick that in as inline asm on gcc. About
300 million calls per second, about 10 cycles per call. I used this
code:
Uint64 rv;
asm volatile (
"mov 8(%%rdi), %%rax \n"
"movabs $0x6595a395a1ec531b, %%rcx \n"
"mov (%%rdi), %%rsi \n"
"mul %%rcx \n"
"add %%rcx, (%%rdi) \n"
"adc %%rsi, 8(%%rdi) \n"
"xor %%rsi, %%rax \n"
"xor %%rdx, %%rax \n"
"mul %%rcx \n"
"add %%rsi, %%rax \n"
"add %%rdx, %%rax \n"
:"=a"(rv)//outputs
:"D"(this)//inputs
:"memory", "%rcx", "%rsi", "%rdx"//clobbers
);
return rv;
If I left off the "volatile" keyword then the optimizer seemed to go
berserk on non-vtable calls to it for some reason, though vtable calls
remained the same. Probably I screwed up the constraints somehow.

I also tried implementing it in C, but neither gcc nor MSVC will let
me use a 128 bit integer. I did implement it in C by scaling it down
to 32 bit instead of 64 bit and compiling on a 32 bit compiler. It
wasn't particularly fast that way relative to other 32 bit RNGs.

Count it as one more bit of speed-for-you != speed-for-me, not even
relative speed. Either that or my inline asm sucked, and my compiler
sucked too.

> If you have multiple threads, then pick option two. The rng will use
> the different addresses of the seeds to make sure that each thread
> gets a different output sequence. This takes 4.05s to output 10^9
> 64bit random numbers on this machine.
>
> If you are using multiple computers, or require multithreaded
> repeatability, then pick option three. Pass a variable describing the
> thread/machine-number to the rng. (Another option is to store this
> extra state into a slightly expanded seed, and use "xor 16(%rdi),
> %rax" instead.)
>
> In effect, options two and three choose a 2^128 subsequence from a
> 2^192 element sequence.
>
> This rng passes TestU01 BigCrush with the top 32 bits, bottom 32 bits,
> middle 32 bits, and alternating bottom and top 32 bit bitstream, with
> a total of zero failures.

My testing is reporting good quality for it as well. I'm testing both
that algorithm in asm, plus that algorithm in C scaled down to operate
on 32 bit integers instead of 64.

On the complex-output-functions = better-parallelism argument I'm
pretty meh. Yeah, the output function can be done in parallel with
the state transition function. Yeah, the output function can be done
in parallel with the next output function if the state transition is
fast enough and the calling code is nice enough and the scheduler is
psychic enough. But there are severe limits to how much can be done
at once, both due to limits of execution resources (large integer
multiplication in particular takes a huge transistor & power budget)
and due to scheduling difficulty. My understanding is that most CPUs
can't maintain a IPC more than 4 on any remotely real world code no
matter how parallel it is, far lower if lots of complex operations
like large integer multiplication are involved. Notice how recent CPU
generations haven't really been getting significantly more
superscalar? These days heat is the fundamental limiting factor, and
executing all those operations will produce the same amount of heat
regardless of how how smart the scheduler is... in fact a smarter
scheduler will just add overhead to the thermal budget, so the next
few CPU generations are not likely to emphasize OOO any more than the
last few have (in the desktop world anyway...).

Also, most fast RNGs with simple output functions have highly parallel
state transition functions, typically 2 to 4 simple operations. Your
code there is a critical path of 2 simple operations on the state
transition function, and 6 operations on the output function, two of
which are high latency instructions. mul 64x64->128 is reported to be
5 cycle latency on recent AMD, Intel's most recent optimization guide
seems to claim a latency of 14-18 cycles on that but probably I'm
misreading that somehow as it seems improbably high... a quick
googling suggests it might be closer to 3 cycles. Since you have two
of them in your critical path I end up with a best-case performance on
the order of 6-10 cycles). If the output never actually gets used,
that's a different story... the simple transition function might
dominate then.

orz

unread,

Jan 11, 2012, 7:59:37 PM1/11/12

to

On Jan 11, 3:37 am, orz <cdh...@nospicedham.gmail.com> wrote:
> Also, most fast RNGs with simple output functions have highly parallel
> state transition functions, typically 2 to 4 simple operations. Your
> code there is a critical path of 2 simple operations on the state
> transition function, and 6 operations on the output function, two of
> which are high latency instructions. mul 64x64->128 is reported to be
> 5 cycle latency on recent AMD, Intel's most recent optimization guide
> seems to claim a latency of 14-18 cycles on that but probably I'm
> misreading that somehow as it seems improbably high... a quick
> googling suggests it might be closer to 3 cycles. Since you have two
> of them in your critical path I end up with a best-case performance on
> the order of 6-10 cycles). If the output never actually gets used,
> that's a different story... the simple transition function might
> dominate then.

I was a bit tired when I wrote that, lots of typos and mistakes though
I stand by the general concept. The talk of latencies should have
mentioned throughputs and ratios of throughputs to latencies. Intel
claims that mul 64x64->128 should have a latency of 4 to 10 (usually
10) cycles on core 7 or 7 cycles on core 2, and a throughput of 1
cycle on core 7 or 3 cycles on core 2. The ratio of those latencies
to throughputs ranges from 2.3 to 10.0. Which implies that executing
sequential output functions in parallel with each other is a
significant possibility for that RNG. BTW, add-with-carry is listed
as having significantly worse latency and throughput than regular
add... add at 1 latency and 0.33 throughput on all core7s and core2s,
adc at 2 latency and 1 to 1.5 throughput, cmp+cmov+add at 4 latency,
1.16 to 1.66 throughput (adding the throughputs of the individual
instructions, though technically that's not correct).

If we assume that the scheduling is perfect and there's no extra
complications from the calling code or anything, then what matters is
the larger of the overall functions throughput and the state
transition functions critical path latency. For that RNG, that looks
like 3 cycles latency for the state transition function on most of
Intel's desktop CPUs, and an overall throughput in the neighborhood of
~5 cycles on core7 or ~8 cycles on core2. Compare that to, say, the
RNG on Robert Jenkins small fast RNG design page, with a critical path
on the state transition function of 1 rotate and 2 additions, and a
grand total of 4 additions, 1 xor, and 2-3 rotates - a critical path
latency of 3 cycles and an overall throughput of a little less than 3
cycles.

Of course, that's all based upon massively simplified models of the
CPUs performance - reality rarely conforms to it. But since our main
alternative seems to be benchmarks that yield very different results
on different computers...

sfuerst

unread,

Jan 11, 2012, 9:28:07 PM1/11/12

to

An Opteron 280. It obviously has quite a different behaviour than the
Intel machine you are testing on.

Steven

sfuerst

unread,

Jan 11, 2012, 9:56:02 PM1/11/12

to

On Jan 11, 3:37 am, orz <cdh...@nospicedham.gmail.com> wrote:

> > > You effectively get an extra 4-8 bytes of static state, without having
> > > to store it in memory or even read it from memory, and those extra
> > > bits of state are guaranteed to never share the same value as any
> > > other RNG that is instantiated at the same time in the same address
> > > space.
>
> > I know. It feels like cheating... :-) It is the only way I know to
> > get a random number generator to have a larger state-space than that
> > bounded by the size of its seed. Breaking fundamental limits is
> > always entertaining.
>
> Actually, on second thought I want to qualify that slightly - you have
> no control over the value of that state, so you have to
> pessimistically assume it will be near-zero entropy, effectively far
> less than than a pointer size worth of entropy. Really, the only
> thing you can count on is uniqueness (for the life the RNG instance).
> Which is, admittedly, pretty sweet. But I shouldn't have described it
> in terms of the pointer size, size the pointer size is only a measure
> of the maximum entropy involved, and has very little to do, in
> practice, with the actual entropy involved.

Yep. All the testing I have done has assumed that number was zero.
If it is anything else, that's entropic gravy. Unfortunately, I
started discussing the other rather poor rng before testing had
completed, which confused things. The preliminary results did look
good at the time... (which is a rather bad excuse, but anyway...)

The "standard" probably is the SPRNG library, which seems to use the
state-space splitting technique. The other thing that pops up when
searching for parallel rng is my code, which uses the xor followed by
xorshift + multiply.

A "guid" is a globally unique ID. Basically, it is some algorithm or
another that allows you to create a number (128 bits or higher,
usually) that is guaranteed to be unique. They use things like cycle
count and network MAC address to do it. If you use the 128 bits to
initialize part of the seed, that leaves many other bits for state-
space for the rng to work.

i.e. With the 128 bits, store into the upper 64 bit quantity in the
LCG, and into the 64 bits of "nonce", in rng_hash_128. Set the lower
64 bits of the LCG to be zero. This initialization gives you 2^64
random numbers before the rng can get into a state that another
instance could possibly be in.

>
> > > > Notice how it isn't using the adc instruction at all. It also seems
> > > > to be rather poor code. :-( The above benchmarks as taking 5.8s to do
> > > > a billion random numbers. The asm version I gave takes 4.2s Note
> > > > that the fastest of the other rng's you gave takes 4.6s on this
> > > > machine. I don't begrudge the compile much though... even simple
> > > > rearrangements of the asm instructions cause the time taken to vary by
> > > > up to 30%. It is very hard to come up with the right ordering to
> > > > overlap memory accesses with computation.
>
> > > 4.6 seconds, ow! I know performance at the high end gets weird, but
> > > still I had hoped that at least one of the RNGs I posted would do
> > > better. Several of them reliably take less then 2 seconds for that on
> > > both my Core i5 2500 and my old Core2 3 GHrz using both 64 bit gcc and
> > > less than twice that using 32 bit MSVC. All w/o asm.
>
> > My computer is old and slow. Performance is relative, remember.
>
> Yeah. Sigh. Even just looking at it as relative performance... Is
> your optimizer allowed to inline your stuff and/or remap register
> names, or does it have to stick to calling conventions and/or the code
> as you wrote it?

I deliberately made it so it couldn't. The asm is in a separate file
in my tests. The C equivalent is to use the noinline and noclone gcc
attributes. If you allow inlining, then benchmarking becomes
meaningless. The optimizer can do all sorts of things, including
eliding parts of the rng entirely. :-/

Since "typical" use will be via a function call into a library, this
isn't unrealistic.

> On my benchmarks, if I force it to not inline the
> calls then they hit a performance maximum around 300 million calls per
> second (though it varies quite a bit, from a low around 100 to a high
> around 500, it's usually between 250 and 325), regardless of how fast
> the RNG itself is.

I don't see this. The difference between the slow and fast rng is
quite pronounced here. There also is very little variability in the
timings from one run to another. What does vary greatly is the effect
of instruction ordering. If I rearrange a couple of instructions, the
results can change by 20% or more. This makes optimizing a chore, as
many combinations need to be tested in order to find the best.

> Even return 0 will get capped at that rate. If I
> let it inline, even then it will still hit a cap around 1.1 billion
> calls per second even for very simple RNGs like "int random() {int tmp
> = a; a += b; return tmp;}". A simple "return 0;" is faster still, but
> by so much that I suspect it optimizes away almost the entire calling
> loop in that instance.

That's the problem. You need to avoid letting the optimizer see the
rng code when it makes the code for the benchmark loop. Otherwise the
results are contaminated.

As I've said... even small changes to the orderings of instructions
can give large relative changes in timing here. It is really
annoying.

Why on earth is a vtable required? Can't you just use a normal
function call? That's what I do. The asm is compiled in a
separate .s file, and then linked in later.

> I also tried implementing it in C, but neither gcc nor MSVC will let
> me use a 128 bit integer. I did implement it in C by scaling it down
> to 32 bit instead of 64 bit and compiling on a 32 bit compiler. It
> wasn't particularly fast that way relative to other 32 bit RNGs.

Are you in 32 bit mode? You can use 128 bit integers in 64 bit mode
with gcc. If you are using 32 bits, you are out of luck. MSVC has
intrinsics which allow the use of 64x64->128 multiplies in 64 bit
mode.

Be careful... I'm stating what I'm measuring, not what is
theoretically happening. For example, if I remove the final multiply
+ xor sequence, the time taken doesn't change. It appears that the
time taken to access L1 in the add, adc chain is much longer than
naive cycle-counting would predict.

Steven

sfuerst

unread,

Jan 11, 2012, 10:22:01 PM1/11/12

to

Measurement is easier:

This machine is 2.4GHz Opteron. It takes 3.773s to complete 10^9
calls to the single-threaded version of the rng.

2.4 * 3.773 = 9.0552

Thus it takes 9 cycles per loop iteration. If you remove the loop
overhead it'll be faster...

Don't forget the L1 access latency in your cycle count prediction.
Registers are much faster than L1 on modern machines. The L1 latency
on this machine is 3 cycles. On I7, I'm fairly sure it is 4 cycles.

Steven

orz

unread,

Jan 11, 2012, 11:58:19 PM1/11/12

to

Another area where we've come from similar goals to rather different
approaches.

1. My view is that inlining is in fact representative. MSVC will
inline static library calls (if the information needed is present...
which bloats the size of the .lib files by quite a bit), having them
as separate files is insufficient. I believe Intel C can do that as
well. gcc will not so far as I know, at the moment anyway, but that
could change after a few more gcc releases, and even now it will still
inline functions defined in the headers of libraries. I usually don't
put the algorithms in the headers, but when the algorithm is
implemented in a buffered fashion I do put the buffer reading portion
in to the header with only the buffer refill logic hidden in the .cpp
file. And for the simplest of algorithms, those that have very small
code size, I've been considering moving their implementation in to the
header.

2. To prevent the calls from being optimized away, I actually perform
math and IO dependent upon the result of the calls. Admittedly very
very simple math to prevent it from reducing the performance of the
loop, and IO that is absurdly unlikely to happen to prevent it from
cluttering up the output. Unfortunately, if the RNG is really REALLY
simple (ie "return 0;" or somesuch - anything with a state transition
function that doesn't do any transitioning) then the compiler can
actually figure out the results of the math without calling the RNG
repeatedly because all calls have the same result anyway. This is
intended to be representative of an RNG user that demands maximum
performance from an RNG - they need to do something with the result or
they wouldn't be calling the RNG in the first place, but if they want
the fastest possible RNG then they must be using the results for
something with low performance impact. Unfortunately, gcc seems to
have trouble with my loop when combined with inline asm, apparently
doing the math incorrectly... probably I defined the constraints on
the inline asm incorrectly somehow.

3. Like many RNG packages aimed at researchers, I support run-time RNG
selection via polymorphic interfaces. Unlike them I also offer the
same RNG algorithms by a non-polymorphic but otherwise identical
interface for maximum performance and minimum size. I benchmark all
RNG algorithms through both interfaces.

If I limit myself to non-inline RNG calls then my best performance on
this CPU drops down to about 3 seconds per billion calls. And your
latest RNG comes in first out of the set of RNGs I currently have
active in my benchmark program, which includes many of my fastest RNGs
(though I've been optimizing more for inline performance than non-
line, since the differences are pretty small in non-inline). Which
ones are fastest in non-inline contexts is pretty weird in my testing,
even weirder than inline RNG performance I think, and has very small
performance differences.

So basically, it looks like much of our performance difference is
simply that you are measuring and optimizing for non-inline
performance (representative for library calls on gcc), while I am
mostly optimizing for inline performance (representative on some
compilers, or if the core RNG algorithm is defined in a header) though
I measure both.

> > I also tried implementing it in C, but neither gcc nor MSVC will let
> > me use a 128 bit integer. I did implement it in C by scaling it down
> > to 32 bit instead of 64 bit and compiling on a 32 bit compiler. It
> > wasn't particularly fast that way relative to other 32 bit RNGs.
>
> Are you in 32 bit mode? You can use 128 bit integers in 64 bit mode
> with gcc. If you are using 32 bits, you are out of luck. MSVC has
> intrinsics which allow the use of 64x64->128 multiplies in 64 bit
> mode.

I tried both 64 and 32 bit gcc. And I tried consulting the docs. If
I'm reading the docs correctly, the issue is that gcc does not support
128 bit integers in C++, only in C. WTF gcc? I suppose I could try
creating a separate .c file just for sticking in RNGs with 128 bit
integers and adding some glue logic to allow it to interface with my C+
+ testing framework but... really, that doesn't sound that
entertaining. Probably if I just wait a while someone will add 128
bit integer support to gcc.

Yeah, reality trumps theory. But, real world RNG using code might use
the output in some way that made the next call to the RNG dependent
upon the previous ones results, from the schedulers perspective
anyway. Or do something else to make latency more important relative
to throughput. And, are you actually using the output for anything?
The CPUs scheduler acts a little bit like the compilers optimizer... I
don't *think* it can completely discard scheduled instructions that
become meaningless due to later instructions, but maybe it's smarter
than I think? And, as mentioned above, I consider inline performance
at least as representative for small code size RNGs.

hopcode

unread,

Jan 11, 2012, 9:46:17 PM1/11/12

to

something gone wrong by posting with google...
ok, i post it again

On 12.01.2012 01:59, orz wrote:
> Of course, that's all based upon massively simplified models of the
> CPUs performance - reality rarely conforms to it. But since our main
> alternative seems to be benchmarks that yield very different results
> on different computers...

i think struggling on single CPU instructions is not
so worth the while nowadays. SIMD SSE may give
some advantages, when using only the CPU.
btw, an exaustive paper benchmarking the Park-Miller using CUDA.

http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/papers/langdon_2009_CIGPU.pdf

and B.Langdon's publications, interesting imho.

https://iris.ucl.ac.uk/research/publication?upi=WBLAN93&page=1

Cheers,

Pedro Pereira

unread,

Jan 12, 2012, 1:25:50 PM1/12/12

to

I've modified the xorshift with a period of 2^128-1,
adding two LFSR's with period 2^32-1 and 2^31-1 and
combined them with a J-K style flip-flop.

It passed the crush/bigcrush tests and is not too slow.

Feedback is welcome!

Pedro Pereira

static uint32_t state_0 = 0xEA2CBD78;
static uint32_t state_1 = 0x07BF10D5;
static uint32_t state_2 = 0x5FA0B00A;
static uint32_t state_3 = 0x06D98C3F;
static uint32_t lfsr_0 = 0x29EB68FE;
static uint32_t lfsr_1 = 0x4CD98AA6;
static uint32_t lfsr_01 = 0x22C3407B;
static uint32_t cycle = 0x5D8291F3;

static inline uint32_t lfsr(uint32_t n, uint32_t m)
{
return (n >> 1) ^ (-(n & 1) & m);
}

uint32_t get_pseudo_random(void)
{
uint32_t save;

save = state_0 ^ (state_0 << 13);
state_0 = state_1;
state_1 = state_2;
state_2 = state_3;
state_3 = state_3 ^ (state_3 >> 17) ^ (save ^ (save >> 7));

lfsr_0 = lfsr(lfsr_0, 0xE28F75ED);
lfsr_1 = lfsr(lfsr_1, 0x400005B6);
lfsr_01 = (lfsr_0 & ~lfsr_01) | (lfsr_1 & lfsr_01);

cycle += 0x5D31995A;
return (state_3 + lfsr_01) ^ cycle;
}

orz

unread,

Jan 12, 2012, 3:04:29 PM1/12/12

to

On Jan 12, 10:25 am, Pedro Pereira <ppere...@nospicedham.grupopie.com>
wrote:

The first LFSR is pretty good for an LFSR. Very good for an LFSR of
its state size. Not great in any absolute sense though - it's still
an LFSR.

The two small LFSRs are extremely low quality. The way they are
combined is rather unusual, but doesn't help on my tests.

The overall function is pretty slow. I could run a standard
cryptographic RNG instead and have faster output.

Broken down as return values vs data and/or CPU time to fail my
testing regime
state_3: 32 GB (8 minutes)
lfsr_0: 4 KB (< 1 second)
lfsr_1: 1 KB (the smallest unit my tests will even think of operating
on)
lfsr_01: 8 KB (< 1 second)
cycle: 1 KB (the smallest unit my tests will even think of operating
on)
lfsr_0 ^ cycle: 16 MB (< 1 second)
lfsr_1 ^ cycle: 2 MB (< 1 second)
lfsr_01 ^ cycle: 8 MB (< 1 second)
state_3 ^ cycle: passed 128 GB so far (half an hour)
state_3 + lfsr0: not yet tested
state_3 + lfsr1: not yet tested
state_3 + lfsr01: not yet tested
(state_3 + lfsr01) ^ cycle: passed 256 GB so far (an hour)

I've got 2 cores dedicated to testing variants atm, I'll post again
when they've filled out the rest of that table.

orz

unread,

Jan 14, 2012, 5:33:01 AM1/14/12

to

It's looking like state_3 combined with anything else passes all my
tests.

I am not fan of combining multiple RNGs in that way. I would rather
have every bit of state either effect or be effected by every other
bit of state, or both. When the RNG can be divided in to independent
component RNGs that tends to mean both decreased overall quality and
increased inter-state/seed correlation relative to the quality.

I want to back away from my earlier statement "I could run a standard
cryptographic RNG instead and have faster output". Really, the common
cryptographic RNGs are pretty slow in software - to find one faster
than this I have to basically cheat, picking ones that use larger
integer sizes and are not well accepted as secure. And while your RNG
is kinda slow in my book, it's still faster than mt19937 on gcc
(though slower on MSVC), so there's plenty of people who wouldn't
consider it slow - I have pretty high standards in that regard. Also,
leaving out lfsr_1 and lfsr_01 has, I think, fairly little cost in
output quality or statespace, and yields a significant improvement in
speed. Though it would substantially worsen interstate correlation
issues I think.

BTW, how did you choose the constants for the state_* LFSR? I'm
familiar with methods of finding maximal period constants and have a
few rules of thumb / common sense to avoid bad constants, but after
that I mostly stick to either empirical testing or avalanche-style
testing to distinguish good constants from bad constants. If I had a
better feel for where the constants came from I might scale it down to
working on 16 or 8 bit integers for testing purposes, as it's a lot
easier to find flaws when corner cases are more frequent and the state
size is closer to the log of the number of bins in my chi squared
tests.

Pedro Pereira

unread,

Jan 23, 2012, 2:23:25 PM1/23/12

to

orz wrote:

[...]

> It's looking like state_3 combined with anything else passes all my
> tests.
>
> I am not fan of combining multiple RNGs in that way. I would rather
> have every bit of state either effect or be effected by every other
> bit of state, or both. When the RNG can be divided in to independent
> component RNGs that tends to mean both decreased overall quality and
> increased inter-state/seed correlation relative to the quality.
>
> I want to back away from my earlier statement "I could run a standard
> cryptographic RNG instead and have faster output". Really, the common
> cryptographic RNGs are pretty slow in software - to find one faster
> than this I have to basically cheat, picking ones that use larger
> integer sizes and are not well accepted as secure. And while your RNG
> is kinda slow in my book, it's still faster than mt19937 on gcc
> (though slower on MSVC), so there's plenty of people who wouldn't
> consider it slow - I have pretty high standards in that regard. Also,
> leaving out lfsr_1 and lfsr_01 has, I think, fairly little cost in
> output quality or statespace, and yields a significant improvement in
> speed. Though it would substantially worsen interstate correlation
> issues I think.

Using a block cipher was the advantage of being easy to create
a random access pseudo random number generator (just use it in counter mode).
I tried to pick one, reduce the number of rounds and overall operations,
but I wasn't able to make it fast enough.

>
> BTW, how did you choose the constants for the state_* LFSR? I'm
> familiar with methods of finding maximal period constants and have a
> few rules of thumb / common sense to avoid bad constants, but after
> that I mostly stick to either empirical testing or avalanche-style
> testing to distinguish good constants from bad constants. If I had a
> better feel for where the constants came from I might scale it down to
> working on 16 or 8 bit integers for testing purposes, as it's a lot
> easier to find flaws when corner cases are more frequent and the state
> size is closer to the log of the number of bins in my chi squared
> tests.

I'm using the same method.

Pedro Pereira

Rivka Miller

unread,

Jan 23, 2012, 8:37:31 PM1/23/12

to Rivka Miller

Hello Everyone !

Does any kind soul know the answer to a few simple questions or a way
to discover them from the program ?

Who is the author of Kodak or Wang Imaging [Image Editor] and which
language was it written in ?

I mean, the name of the software engineer, and not the company.

I heard that it was originally subcontracted out to UK company and not
a United States company or a group of programmers.

Rivka

Rivka Miller

unread,

Jan 23, 2012, 11:17:27 PM1/23/12

to Rivka Miller

A little history from some forum for reminder to anyone reading this
post and interested.

Mozzy
Guest
Posts: n/a

02-11-2005
A recent discussion here mentioned XP and Kodak Imaging. I have XP
so I went and followed some instructions on where to get Kodak
Imaging.

Unfortunately the help text didn't come with it, so I am only able to
assess it slowly.

No one mentioned if Kodak Imaging is considered good, bad or
indifferent. Can someone say if it is really worth exploring.

Reply With Quote

Mozzy

Wayne Fulton
Guest
Posts: n/a

02-11-2005
In article <95FA86D6...@130.133.1.4>, says...
>
>
>A recent discussion here mentioned XP and Kodak Imaging. I have XP
>so I went and followed some instructions on where to get Kodak
>Imaging.
>
>Unfortunately the help text didn't come with it, so I am only able to
>assess it slowly.
>
>No one mentioned if Kodak Imaging is considered good, bad or
>indifferent. Can someone say if it is really worth exploring.

It probably depends on what you got, and what you will use it for.

Kodak Imaging was included as part of Win98 and WinME, at menu Start -
Programs - Accessories - Imaging. It was in all Windows until XP, but
it is NOT in WinXP - XP has its own Picture and Fax Viewer instead,
which only shows and prints. Imaging was a document program, for
documents, more so than for photos... multipage TIF, etc (even reads
multipage XIF files). It would read and write photos in TIF or JPG
format, but it had no photo editing powers. In Windows, this free
version was a minimal version, not the full version, which sells for
about $170.

It was Wang Imaging in Win95, before Kodak bought it from Wang. Then
it
changed to Eastman Imaging called eiStream, but now its home is at
http://www.global360.com

The full version was considered pretty strong for documents. It is
more
for business document applications. I have to think if you want a
photo
editor, then something like Elements or Paint Shop Pro would be a much
better buy. In turn however, these dont do documents well.

--
Wayne
http://www.scantips.com "A few scanning tips"

Reply With Quote

Wayne Fulton

Mozzy
Guest
Posts: n/a

02-11-2005
On 11 Feb 2005, Wayne Fulton wrote:

> It was Wang Imaging in Win95, before Kodak bought it from Wang.
> Then it changed to Eastman Imaging called eiStream, but now its
> home is at http://www.global360.com

I see that Kodak Imaging gets installed into this folder:
C:\Program Files\Windows NT\Accessories\ImageVue

Also I notice that INF file to install Kodak Imaging (in Win2000) is
called IMAGEVUE.INF.

I guess this means that ImageVue is yet another name for "Kodak
Imaging for Windows".

So these seem to be equivalent:

"Wang Imaging"
"Eastman Software Imaging"
"Kodak Imaging"
"ImageVue"
"eiStream"
"eiStream Global 360"

Phew!

Reply With Quote

Mozzy

Wayne Fulton
Guest
Posts: n/a

02-11-2005
In article <95FA85B66...@130.133.1.4>, says...

>So these seem to be equivalent:
>
> "Wang Imaging"
> "Eastman Software Imaging"
> "Kodak Imaging"
> "ImageVue"
> "eiStream"
> "eiStream Global 360"

Except I think not the same version. The free Windows versions were a
very light version of the purchased full version.

--
Wayne
http://www.scantips.com "A few scanning tips"

Reply With Quote

Wayne Fulton

Newron
Guest
Posts: n/a

02-11-2005
Greetings Mozzy,

Actually, the Kodak Imaging program, which is included with the MS
Operating
Systems from 95 to Me and I believe Win2000, was created by Kodak and
given
to MS for use in their OS. Wang Labs was the originator and Kodak
bought
part of the company that created that software. It has been in use for
a
long time now.

XP does not include Kodak IMG but uses MS own viewer etc.

If you are talking about some other feature or software program, let
me know
and I will track it down for you.

Talk to you soon,

Ron Baird
Eastman Kodak Company

"Mozzy" <> wrote in message
news:95FA86D6...@130.133.1.4...
>A recent discussion here mentioned XP and Kodak Imaging. I have XP
> so I went and followed some instructions on where to get Kodak
> Imaging.
>
> Unfortunately the help text didn't come with it, so I am only able to
> assess it slowly.
>
> No one mentioned if Kodak Imaging is considered good, bad or
> indifferent. Can someone say if it is really worth exploring.
>

hopcode

unread,

Apr 5, 2012, 8:24:18 PM4/5/12

to

Hi,
playing with a custom MWC algo

http://en.wikipedia.org/wiki/Multiply-with-carry

i have found something better (most times) than that
from Steven, using a reduced vector
of 32x8 bytes. the main difference
with that of Steven is that here you can
optimize or customize it without variating
to much from the main performances it offers.
it is called /pi_rand/ because i use a sample
from pi as vector

usage:

mov rcx,362436
xor edx,edx
call .pi_randA

.pi_randA:
;--- RSI point to a vector minimum EDX*8
;--- in ECX c
;--- in EDX i
;--- out EAX,ECX,EDX
;--- preserve ECX,EDX
;--------------------------

;--- i = (i + 1) & 4095;
;--- customize: RSI points to a vector
;--- large at least EDX*8
inc edx
and edx,01Fh ;--- 4095 original

mov rax,[rsi+rdx] ;--- t = a * Q[i] + c;
imul rax,18782

ror rax,11 ;<--- customize here or comment
add rax,rcx

add rcx,rax ;--- c = (t >> 32);
shr rcx,32

add eax,ecx ;--- x = t + c;

cmp eax,ecx ;--- if (x < c) {
sbb r8,r8
neg r8

adc ecx,0 ;--- c++;
add eax,r8d ;--- x++;

;--- (Q[i] = r - x);
sub eax,-2 ;--- sub x,r
neg eax ;--- neg x

mov [rsi+rdx],eax ;--- return (Q[i] = r - x);
ret

i cannot attach a plot because gnuplot runs out of memory
after 8 MiB binary data (sic!), when i need at least 16.

gnuplot> load "E:/gnuplot/demo/mydemoB.dem"
"E:/gnuplot/demo/mydemoB.dem", line 6: out of memory f
or expanding curve points

and,as far as it seems, the supporters too refuse categorically
to face that serious problem again; they discussed about it
in the remote 2009 A.D.

http://groups.google.com/group/comp.graphics.apps.gnuplot/browse_frm/thread/7db1b5ba5dd47136/c6c0388d96bddbb5?tvc=1#c6c0388d96bddbb5

and their bugged philosopy seems to be:
"gnuplot is not a data analysis tool, it is a plotting program"
oh,...
...really ? :-)

i wonder how could they start/continue coding a plotting
program without considering data,and how huge the bulk of datas may be.

Cheers,

.:mrk[hopcode]
.:x64lab:.
group http://groups.google.com/group/x64lab
site http://sites.google.com/site/x64lab

orz

unread,

Apr 6, 2012, 4:58:06 AM4/6/12

to

> http://groups.google.com/group/comp.graphics.apps.gnuplot/browse_frm/...

>
> and their bugged philosopy seems to be:
> "gnuplot is not a data analysis tool, it is a plotting program"
> oh,...
> ...really ? :-)
>
> i wonder how could they start/continue coding a plotting
> program without considering data,and how huge the bulk of datas may be.
>
> Cheers,
>
> .:mrk[hopcode]
> .:x64lab:.
> grouphttp://groups.google.com/group/x64lab
> sitehttp://sites.google.com/site/x64lab

Some of the comments in your asm code appear to refer to the example
cmwc implementation near the bottom of the wikipedia page you linked.
The code on the wiki appears to be a cmwc or something similar with
non-power-of-2 modulus 2**32-1, lag 4096, and multiplier 18782, but
there is no description or comments there to make that explicit.

Unless I'm misinterpreting the asm syntax used, you're treating Q (the
array pointed at by RSI) as an array of 8 bit values. The code on
wikipedia treats it as an array of 32 bit values. Some of the
comments in your code suggest that it might be an array of 64 bit
values. Could you clarify that?

Judging from the comments in your asm you are experimenting with
adding a barrel shift to a cmwc? Adding barrel shifts to RNGs that
lack rightward / downward flow of information between the bits can
help a lot, but the cmwc algorithm listed uses a non-power-of-2
modulus so it may not need that very much, and it inherently has some
rightward flow of information anyway due to the rightshift used in
figuring out the carry.

I tried plugging the cmwc code off of wikipedia in to my tests...
simply plugging it in directly passed the tests, but skipping 4095 out
of every 4096 outputs caused it to fail the tests. Variants of the
algorithm with barrel shifts added produced very similar results when
tested.

hopcode

unread,

Apr 6, 2012, 7:55:17 AM4/6/12

to

Hi orz,
i need some time, because i am new to Scipy language and
i am right now writing an explanation on the benefits
of studying this algo for the newbie to "randomness" (like i am too)

Il 06.04.2012 10:58, orz ha scritto:
> Some of the comments in your asm code appear to refer to the example
> cmwc implementation near the bottom of the wikipedia page you linked.
> The code on the wiki appears to be a cmwc or something similar with
> non-power-of-2 modulus 2**32-1, lag 4096, and multiplier 18782, but
> there is no description or comments there to make that explicit.

it is essentially it. i found it indirectly from a py-recipe here

http://code.activestate.com/recipes/576707-long-period-random-number-generator/

additionally, at the bottom of the wikipedia page there is a link to a
2005 Marsaglia paper about pi where the algo/explanation can be found
there too.

> Unless I'm misinterpreting the asm syntax used, you're treating Q (the
> array pointed at by RSI) as an array of 8 bit values. The code on

yes, right, indexed by 7bit. but i am sorry,because i published
an other test from the dozens versions, instead of the right one :-)
EDX should be shifted << by 3 too,

correct would be
AND EDX,1Fh
MOV RAX,[RSI+RDX*8]

without that typo anyway, the array should be reduced
to a 32*8 = 256byte vector. RSI points to the vector.
in RAX it loads the state, a QWORD (8bytes) from RSI+RDX

> wikipedia treats it as an array of 32 bit values. Some of the
> comments in your code suggest that it might be an array of 64 bit
> values. Could you clarify that?

yes as i said, 64-bit only by loading the state in t
then i do exceptionally a ROR 11 on the whole of it,
and finally i store only the lower 32 bit as from the original

> Judging from the comments in your asm you are experimenting with
> adding a barrel shift to a cmwc? Adding barrel shifts to RNGs that
> lack rightward / downward flow of information between the bits can
> help a lot, but the cmwc algorithm listed uses a non-power-of-2
> modulus so it may not need that very much, and it inherently has some
> rightward flow of information anyway due to the rightshift used in
> figuring out the carry.

just investigating the matter, because my tests are pi-related.
i dont use that init_rand function in the code nor the PHI constant

> I tried plugging the cmwc code off of wikipedia in to my tests...
> simply plugging it in directly passed the tests, but skipping 4095 out
> of every 4096 outputs caused it to fail the tests. Variants of the
> algorithm with barrel shifts added produced very similar results when
> tested.

You should not let the algo skip outputs, because the concept is yet
of a sequencer and of consecutive load/store the state into a vector of
16kb. my Marsaglia tests are sufficient/discrete
even on the 32-bytes-typoed-vector i posted above.
but i need some time to extract/test the PDF carefully again.
anyway i think those basic conditions are very promising.

i let You imagine the cases i tried in these last hours:
the /input/ vector as overlapped and offseted into the /output/ vector.
i.e.: both the input+output vectors re-generated on the fly and changed
everytime during each production...then acquiring the initial seed from
HW noise/temerature etc...
expert would like to call it from now
the rng-bingo!
:-) ...funny

Cheers,

--

.:mrk[hopcode]
.:x64lab:.