Actually, on second thought I want to qualify that slightly - you have
no control over the value of that state, so you have to
pessimistically assume it will be near-zero entropy, effectively far
less than than a pointer size worth of entropy. Really, the only
thing you can count on is uniqueness (for the life the RNG instance).
Which is, admittedly, pretty sweet. But I shouldn't have described it
in terms of the pointer size, size the pointer size is only a measure
of the maximum entropy involved, and has very little to do, in
practice, with the actual entropy involved.
Is it? I've seen several of the individual elements of my scheme in
other packages (TestU01, GSL, Boost, etc), but never more than one or
two of the elements in a single package, and a few elements I've never
seen elsewhere. I've seen your xor-with-address scheme several times
on the web, always combined with xorshift+multiply pairs, but never in
a library that I can recall.
I think that interface-wise, for the simple case, my scheme is as
simple as anything that allows multiple RNG states (ie simpler than
the libraries aimed at research, but not as simple as libc rand/
srand). Implementation-wise, my scheme is not simple... there ends up
being a call to the system RNG at program start, and some additional
hashing on a per-autoseeding basis and a tiny bit on a per-word-of-RNG-
state-autoseeded basis. For most uses it's overkill, but since many
users have a hard time figuring out just how broad a scope they need
lack of correlation over, and since the resources used are cheap, I
shoot for a little overkill.
>
>
>
>
>
>
>
> > No interthread or interprocess or intermachine communication is
> > required. It mostly works equally well for sequential RNG
> > instantiations in the same thread, in different threads, on sequential
> > runs, on simultaneous runs, on distributed machines, etc. The weakest
> > case at the moment is if a thread is created, then destroyed, then
> > very quickly recreated with the same stack address and TLS address.
> > Simultaneous launches on non-windows platforms are also weak. Every
> > other case that I can think of is strong, though not up to
> > cryptographic standards. The platform-specific stuff is done just
> > once per application launch. The requirements are:
> > 1. Either the library initialization function or the first automatic
> > seeding is performed before additional threads are launched. That
> > will tend to naturally be true for programs that don't go around doing
> > crazy things with the launch and linking processes.
> > 2. the degree of lack of correlation desired across the domain not
> > exceed the square root of the statespace size. That is almost
> > identical to a fundamental requirement for high quality RNGs that
> > would apply regardless, so it's no big deal.
>
> You can fix the rapid recreate problem by using a guid. Provided that
> the guid updates rapidly enough, there is no way other users can have
> the same initial seed state. The trick is to make the creation stall
> long enough to fulfill this speed limitation. Since thread creation
> is (meant to be) slow and rare, this isn't much of a constraint in
> practice.
If fixed the non-windows issues since my last post. I'm still not
sure what to do about the thread recreation issue. One solution would
be to use some sort of locking mechanism or atomic math, maybe
EnterCriticalSection or InterlockedIncrement on Win32/MSVC. But I'd
like to stay very portable... I currently use no libraries aside from
the C++ standard library (and, on windows, kernel32.lib), and nothing
compiler or OS or ISA specific without ifdefs and fallback paths on
unrecognized systems (admittedly the fallbacks are not very good for
OSes that are neither windows nor *nix...). Another solution would
be a malloc(1) on a per-thread basis, with no corresponding free. I'm
not familiar with "guid", but it appears to refer to a variety of
different algorithms for generating 128 bit numbers intended to be
unique across process boundaries and often network boundaries.
Apparently I would call UuidCreateSequential on win32, I'm not sure
what I would do on *nix? But that isn't really looking any better to
me in terms of portability or speed than the other options. Though it
could replace the RtlGenRandom and/or /dev/random usage on program
start.
> > The typical multithreaded user writes something like:
>
> > typedef PractRand::RNGs::Polymorphic::jsf64 RNG_TYPE;
> > __thread RNG_TYPE rng(PractRand::SEED_AUTO);
>
> > ...and is done, with the same basic benefits as your approach. The
> > more sophisticated user can do more sophisticated things though, no
> > one will be fooled in to thinking that a nondeterministic case should
> > behave deterministically, and serializing the RNG state for later
> > reproduction will work properly (which will let some of my more exotic
> > RNG tests behave correctly on it).
>
> I'll admit that the interface (if you can call it that) for the form
> of rng I've proposed isn't well-suited for C++. However, since I
> don't like C++ very much, I consider that a feature rather than a
> bug. :-P
>
I don't think it has issues with C++ really. Well, I suppose arguably
with object orientation in general since its state can't (or
shouldn't) really be treated like a normal object. But C++ would
handle it fine a simple function instead of a class. What I would say
is that it lacks reproducibility (via either manual seeding or state
serialization). RNG users sometimes want reproducibility for
debugging purposes, network games and apps, hashing purposes, probably
a few other uses I can't think of atm.
>
>
>
>
>
> > > Notice how it isn't using the adc instruction at all. It also seems
> > > to be rather poor code. :-( The above benchmarks as taking 5.8s to do
> > > a billion random numbers. The asm version I gave takes 4.2s Note
> > > that the fastest of the other rng's you gave takes 4.6s on this
> > > machine. I don't begrudge the compile much though... even simple
> > > rearrangements of the asm instructions cause the time taken to vary by
> > > up to 30%. It is very hard to come up with the right ordering to
> > > overlap memory accesses with computation.
>
> > 4.6 seconds, ow! I know performance at the high end gets weird, but
> > still I had hoped that at least one of the RNGs I posted would do
> > better. Several of them reliably take less then 2 seconds for that on
> > both my Core i5 2500 and my old Core2 3 GHrz using both 64 bit gcc and
> > less than twice that using 32 bit MSVC. All w/o asm.
>
> My computer is old and slow. Performance is relative, remember.
Yeah. Sigh. Even just looking at it as relative performance... Is
your optimizer allowed to inline your stuff and/or remap register
names, or does it have to stick to calling conventions and/or the code
as you wrote it? On my benchmarks, if I force it to not inline the
calls then they hit a performance maximum around 300 million calls per
second (though it varies quite a bit, from a low around 100 to a high
around 500, it's usually between 250 and 325), regardless of how fast
the RNG itself is. Even return 0 will get capped at that rate. If I
let it inline, even then it will still hit a cap around 1.1 billion
calls per second even for very simple RNGs like "int random() {int tmp
= a; a += b; return tmp;}". A simple "return 0;" is faster still, but
by so much that I suspect it optimizes away almost the entire calling
loop in that instance. And a few of the RNGs I've posted in this
thread get 0.7 to 1.0 billion calls per second here - just under what
looks like the maximum possible performance on this CPU / compiler
pair. MSVC results are very similar, though it will inline even
across compilation unit boundaries, unlike gcc (4.5.4) - I only see
the lower cap on MSVC if the call goes through a vtable.
In my attempts to achieve more meaningful performance measures I've
tried to look for aggregate test results across multiple compilers,
multiple CPU generations, multiple structures of calling loops, etc.
I had thought it helped a little, but it's hard to say really, and
often that gets tiresome and I'll skip most of it for extended periods
of time.
That change had no performance impact on my computer (it's becoming
increasing apparent that we see very different performance impact from
changes), and did well on statistical tests when used on 32 bit
integers where the original (when changed to use 32 bit integers) did
poorly (IIRC failing after 5 minutes or so). A few other tweaks got
it to the point of passing an 8 hour test on 32 bit, I didn't try any
longer tests.
>
> > Hey, do you have any problems with people using any of this stuff,
> > modifying it, etc, potentially in closed source or commercial apps? I
> > try to keep all of my recommended RNGs public domain or very close to
> > public domain.
>
> No problems. It is just math, after all.
>
> Steven
Excellent, I'll probably be using something in the general ballpark of
this stuff for random access RNGs for the time being, until I can
figure out a way to get good random access without multiplication.
I get about 2.5 GB/s when I stick that in as inline asm on gcc. About
300 million calls per second, about 10 cycles per call. I used this
code:
Uint64 rv;
asm volatile (
"mov 8(%%rdi), %%rax \n"
"movabs $0x6595a395a1ec531b, %%rcx \n"
"mov (%%rdi), %%rsi \n"
"mul %%rcx \n"
"add %%rcx, (%%rdi) \n"
"adc %%rsi, 8(%%rdi) \n"
"xor %%rsi, %%rax \n"
"xor %%rdx, %%rax \n"
"mul %%rcx \n"
"add %%rsi, %%rax \n"
"add %%rdx, %%rax \n"
:"=a"(rv)//outputs
:"D"(this)//inputs
:"memory", "%rcx", "%rsi", "%rdx"//clobbers
);
return rv;
If I left off the "volatile" keyword then the optimizer seemed to go
berserk on non-vtable calls to it for some reason, though vtable calls
remained the same. Probably I screwed up the constraints somehow.
I also tried implementing it in C, but neither gcc nor MSVC will let
me use a 128 bit integer. I did implement it in C by scaling it down
to 32 bit instead of 64 bit and compiling on a 32 bit compiler. It
wasn't particularly fast that way relative to other 32 bit RNGs.
Count it as one more bit of speed-for-you != speed-for-me, not even
relative speed. Either that or my inline asm sucked, and my compiler
sucked too.
> If you have multiple threads, then pick option two. The rng will use
> the different addresses of the seeds to make sure that each thread
> gets a different output sequence. This takes 4.05s to output 10^9
> 64bit random numbers on this machine.
>
> If you are using multiple computers, or require multithreaded
> repeatability, then pick option three. Pass a variable describing the
> thread/machine-number to the rng. (Another option is to store this
> extra state into a slightly expanded seed, and use "xor 16(%rdi),
> %rax" instead.)
>
> In effect, options two and three choose a 2^128 subsequence from a
> 2^192 element sequence.
>
> This rng passes TestU01 BigCrush with the top 32 bits, bottom 32 bits,
> middle 32 bits, and alternating bottom and top 32 bit bitstream, with
> a total of zero failures.
My testing is reporting good quality for it as well. I'm testing both
that algorithm in asm, plus that algorithm in C scaled down to operate
on 32 bit integers instead of 64.
On the complex-output-functions = better-parallelism argument I'm
pretty meh. Yeah, the output function can be done in parallel with
the state transition function. Yeah, the output function can be done
in parallel with the next output function if the state transition is
fast enough and the calling code is nice enough and the scheduler is
psychic enough. But there are severe limits to how much can be done
at once, both due to limits of execution resources (large integer
multiplication in particular takes a huge transistor & power budget)
and due to scheduling difficulty. My understanding is that most CPUs
can't maintain a IPC more than 4 on any remotely real world code no
matter how parallel it is, far lower if lots of complex operations
like large integer multiplication are involved. Notice how recent CPU
generations haven't really been getting significantly more
superscalar? These days heat is the fundamental limiting factor, and
executing all those operations will produce the same amount of heat
regardless of how how smart the scheduler is... in fact a smarter
scheduler will just add overhead to the thermal budget, so the next
few CPU generations are not likely to emphasize OOO any more than the
last few have (in the desktop world anyway...).
Also, most fast RNGs with simple output functions have highly parallel
state transition functions, typically 2 to 4 simple operations. Your
code there is a critical path of 2 simple operations on the state
transition function, and 6 operations on the output function, two of
which are high latency instructions. mul 64x64->128 is reported to be
5 cycle latency on recent AMD, Intel's most recent optimization guide
seems to claim a latency of 14-18 cycles on that but probably I'm
misreading that somehow as it seems improbably high... a quick
googling suggests it might be closer to 3 cycles. Since you have two
of them in your critical path I end up with a best-case performance on
the order of 6-10 cycles). If the output never actually gets used,
that's a different story... the simple transition function might
dominate then.