PractRand 0.86 is ready for release; I'm just packaging things up
now. It includes fixes for the trivium RNG and the jsf16 RNG, and a
tweak to the seeding of the efiix RNGs. It also adds a multithreaded
test program to improve performance on multicore CPUs. The
multithreaded test program supports only pthreads and WIN32; platforms
that offer neither can't build that one.
On Jan 28, 3:34 pm, Karl-Uwe Frank <
karl.fr...@freecx.co.uk> wrote:
> On 28.01.12 06:40, orz wrote:
>
> [...snip...]
>
> > PractRand doesn't need to get recompiled all the time. test.cpp (or
> > anything else in the test/ folder) can be recompiled independently of
> > everything else. I just don't have a makefile to do that properly
> > with gcc, as I have difficulty dealing with gnu makefiles.
>
> Don't worry, I use a small shell script now for calling the compile process
>
> ---------------------------------------
> #! /bin/bash
> #
>
> HOME=$(echo ~)
> THIS=$(pwd)
>
> cp tt32_practrand.cpp $HOME/usr/PractRand_0.84/test/
>
> cd $HOME/usr/PractRand_0.84/
>
> g++ -Iinclude -O3 -o $THIS/tt32_practrand test/tt32_practrand.cpp
> src/*.cpp src/RNGs/*.cpp src/RNGs/entropy_pools/*.cpp src/RNGs/other/*.cpp
>
> cd $THIS
>
> exit 0
> ---------------------------------------
I think that script will recompile everything every time, while a
proper makefile would recompile only what actually needs to be
recompiled. Of course, a modified script could be made to emulate the
behavior of a makefile.
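For what it's worth, a minimal GNU makefile along those lines might look roughly like this. The source list and include path are guessed from the compile script below, so they may not match the real tree:

```make
# Hypothetical sketch; source layout guessed from the compile script
# in this thread, so adjust SRCS and include paths to the real tree.
SRCS := $(wildcard src/*.cpp src/RNGs/*.cpp \
                   src/RNGs/entropy_pools/*.cpp src/RNGs/other/*.cpp) \
        test/tt32_practrand.cpp
OBJS := $(SRCS:.cpp=.o)

tt32_practrand: $(OBJS)
	g++ -O3 -o $@ $(OBJS)

# Only .cpp files newer than their .o get recompiled.
%.o: %.cpp
	g++ -Iinclude -O3 -c $< -o $@

clean:
	rm -f $(OBJS) tt32_practrand
```

The per-file object rule is what gives the incremental rebuild behavior the shell script lacks.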
> > want, you can make Raw_DummyRng::raw32() just call any function you
> > want (though you'll have to figure out seeding on your own if you do
> > that, as PractRand won't be able to find any state that walk_state
> > doesn't tell it about). For instance, you could have it say:
> > Uint32 raw32() {
> >     Uint32 return_value;
> >     if (1 == std::fread(&return_value, sizeof(return_value), 1, stdin))
> >         return return_value;
> >     std::fprintf(stderr, "error reading input\n");
> >     std::exit(1);
> > }
> > And it would take piped-in input. Well, on unix... when I try that
> > on windows I get strange results.
>
> Thanks so far for that clever code snippet. I will give feedback soon
> on how it behaves on Mac OS X.
>
> > About what I'd expect. Except the speed... that looks like
> > optimization is disabled (ie the "-O3" part of the command line was
> > omitted). Note that for most of the tests in the expanded test set it
> > won't even guesstimate p-values, just classify results as one of
> > "suspicious?", "very suspicious?", or "FAIL?".
>
> Now I never forget parameter "-O3" because it's in the compile script :-)
On my computer, the expanded standard test set takes about 70 seconds
per GB with optimization enabled (compared to about 13 seconds per GB
on the standard test set), using a single core. Your numbers there
were on the order of 400 seconds per gigabyte for the same set of
tests. Two basic possibilities come to mind:
1. Your build was somehow much lower performance, perhaps due to
optimization not being enabled or some other difference in compilers
or compiler settings. But my performance varies very little across
the compilers I can easily test on, so disabled optimization seems
more likely than any other compiler difference.
2. Your hardware is much worse for the purposes of these tests than
mine. On the surface this seems unlikely, since I get similar (maybe
10% slower) results on my old Core2, and not that much worse on my
older-still Athlon 64 X2 (which I can no longer test on, as it
literally exploded two years ago). However... these test speeds can
be extremely sensitive to cache size in some cases; if your CPU has a
smaller cache than the CPUs I've tested on, that could drastically
reduce performance. I've tried to tune the parameters of the standard
test set to minimize sensitivity to cache sizes where possible, but I
have made much less effort on the expanded test set, and I haven't
touched any of that up lately, so it could have been detuned a bit
somehow.
> >>> If you want a not-exactly-barrel-shift
> >>> that accomplishes the same basic thing, instead try replacing the
> >>> bitwise-or with a bitwise-xor: (q>> 16) ^ (q<< 11). That's not as
> >>> fast as a barrel shift on many architectures, but it's slightly better
> >>> than a barrel shift in terms of what it accomplishes for a chaotic
> >>> RNG. In that form of not-barrel-shift the two shift amounts need to
> >>> add up to less than the size of the integer.
>
> >> I will give that a try, let's see how it inflects the results.
>
> I'll leave it as tz = (q >> 16) | (q << 16);
>
> >>> or my current candidate for version 4 of that:
> >>> Uint16 a, b, c, counter;
> >>> Uint16 raw16() {
> >>>     Uint16 old = a + b + counter++;
> >>>     a = b ^ (b >> 3);
> >>>     b = c + (c << 2);
> >>>     c = ((c << 7) | (c >> (16-7))) + tmp;
> >>>     return old;
> >>> }
>
> By the way, shouldn't it read
>
> c = ((c << 7) | (c >> (16-7))) + old;
Yes. My mishmash of retyped code and cut-and-pasted code sometimes
ends up with silly things like that in it.
> instead of
>
> c = ((c << 7) | (c >> (16-7))) + tmp;
>
> But nonetheless it is quite compact and elegant.
>
> >>> Mostly the same stuff, but with 3 state variables besides the counter
> >>> now. Note that "c + (c<< 2)" is equivalent to "c * 5", but is faster
> >>> than multiplication on most architectures.
>
> I see; from that I have now learned to avoid "normal"
> multiplication on variables for speed; in fact I haven't used it
> since pqRNGr2.
Note that smart compilers will usually automatically pick the
implementation best suited to the hardware they are targeting, which
may or may not be the hardware you actually run on. But for things
intended to potentially be used on platforms without fast
multiplication, choosing a constant with a simple implementation can
be very important. Though do realize that constants with simple
implementations do a lot less mixing than those that require fast
multiplication. In this case it doesn't matter much, as even a bad
constant is good enough.
> >> Still unbelievable how a 16-bit PRNG can pass BigCrush? Do you
> >> concatenate two 16-bit outputs into one 32-bit value before testing?
>
> > Yes. I concatenate 2 16 bit values or 4 8 bit values to make each 32
> > bit value for TestU01. For 64 bit values I've changed back and forth a
> > bit, sometimes returning first the lower 32 bits then the upper 32
> > bits, sometimes returning only the lower 32 bits and discarding the
> > rest, but really, I often don't bother testing the 64 bit version on
> > TestU01 if the 16 and 32 bit versions both pass. RNGs that produce
> > anything other than 8, 16, 32, or 64 bits at once I generally refuse
> > to look at, but sometimes I'll scale something down to 4 or 2 or 1 bit
> > math and of course pack them together to produce 32 bits for TestU01
> > purposes or 8 bits for PractRand purposes.
>
> Hopefully the piping of values into PractRand works, because as I
> mentioned earlier it finds problems much faster than TestU01.
>
> > Note that there is no fundamental need for large integer sizes to
> > produce good quality (though there is a need for at least a minimal
> > state size), though it usually helps. For instance, my efiix RNG
> > algorithm intended to be cryptographically secure (though I'm not 100%
> > sure on that, and the crypto community certainly wouldn't accept it),
>
> Now I remember reading about efiix as one of the PRNGs listed
> first in the speed comparison. Some of them are your work; I hadn't
> really realised that before.
HC-256 and Trivium are standard crypto RNGs, and mt19937 is a standard
RNG; jsf and ISAAC are from Bob Jenkins (though he never named jsf, so
I named it that, for Jenkins' Small Fast RNG). I wrote sfc, efiix,
xsm, sha2_based_pool (though that's mostly just SHA2-512, a standard
crypto hash function), and arbee (though that's mostly just jsf).
Efiix is not particularly fast in any absolute sense (i.e. it's
significantly slower than jsf or sfc), though it is faster than
similar algorithms like ISAAC, HC-256, or RC4 (which are all dynamic
s-box crypto RNG algorithms... generally the fastest category of
crypto RNG algorithms that I've seen in software) and common
algorithms like mt19937. It's the fastest not-obviously-broken
intended-for-crypto algorithm that I know of.
> Any reason why the crypto community refuse to take a closer look at it?
1. I don't actually know how to find or talk to the crypto
community; this usenet group is about all I know. Though I did
recently find this forum (I haven't tried posting there yet, as I
think it's specifically for ecrypt algorithms):
http://www.ecrypt.eu.org/stream/phorum/list.php?1
2. The crypto community has very high standards. Which I like, for
the most part - I wish the non-crypto PRNG community had standards as
high. But in the context of their desire for security and
provability, this ends up meaning that modern crypto RNGs must be
designed concurrently with proofs about them, a practice I am not
particularly fond of and would have some trouble following myself.
3. My impression is that the crypto community is not terribly fond of
algorithms by outsiders. Possibly for good reason (see the thread in
this group titled "BulletProof Random Number Generator"), but still.
Look at the response to Jenkins' ISAAC, for instance: two papers that
I can find by people who appear to be members of the crypto community,
one of which treats ISAAC seriously but doesn't manage to accomplish
much, the other of which is rather rude to ISAAC and seems to make
some significant conceptual flaws in its analysis, rendering the whole
paper meaningless. And a third paper in some Asian publication that I
haven't managed to find, apparently based upon a mis-scan or typo,
attacking an RNG that isn't ISAAC but that they mistook for ISAAC.
That's over like a decade and a half. I doubt I could get any better
response than he got without a ton of effort, plus I'm still revising
efiix a bit.
> > should easily pass any and all statistical tests even when scaled down
> > to operate on 4 bit integers with a total state size of just 112 bits,
> > despite having a normal state size of 24832 bits when operating on 64
> > bit integers.
>
> Sounds quite astonishing. Did you ever publish it for review?
Um... sort of? I've posted earlier versions of it in this usenet
group, and its source code is included in PractRand. Note that the
technical merits I described there don't actually mean much on their
own, though in the context of the algorithm's other details it comes
out looking pretty good for its speed.
> >> Regarding the counter, I have changed one line from
>
> >> s = (s + tt + (s>> 14));
>
> >> into
>
> >> s = (s + tt + ((s<< 13) ^ (s>> 19)));
>
> >> because when I visualise s (which is the "s"tepper) it became clear
> >> that it was increasing too slowly but then falling back too fast.
>
> > Note that that is equivalent to:
> > s += (s<< 13) ^ (s>> 19);
> > s += tt;
>
> Yes, that's true but does it really matter, for speed for example?
That wasn't supposed to be an independent point, just a poorly worded
lead in to the part about reversibility.
Note:
1. FLEA is actually very similar to earlier versions of my efiix
algorithm, though it predates mine IIRC. Note that Jenkins does not
think well of its security. It does not meet the current standards of
the crypto community.
2. RC4 is not secure, though it functionally is if you skip the first
few hundred outputs. Even then though, it does not meet (or even come
close to) the current standards of the crypto community.
3. The current standards of the crypto community are very high,
particularly with regard to seeding functions. The only RNG I have
examined that meets those standards is HC-256. Assuming I'm
interpreting what they say correctly.