I accidentally discovered that of all the quality PRNGs I have looked at, Splitmix is very special...
It is the only one that I have seen which will fully use AVX2 instructions automatically, without specialized coding, under gcc (8+ for Intel and 9+ for AMD) on a modern x86/64 CPU when passing '-march=Native' to the compiler.
No specialized coding is required. This is oddly even more effective for floating point conversion.
Here are some benchmarks of a naive Pi estimator*:
gcc-8 (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
Intel(R) Core(TM) i5-8400H CPU
Generic Benchmarks:
./splitmix-speed: INT PI= 3.14157 in 7.27 s FP PI= 3.14158 in 6.50 s TOT= 13.77 s
./xoroshiro128plus-speed: INT PI= 3.14156 in 5.88 s FP PI= 3.14159 in 5.55 s TOT= 11.42 s
Native Benchmarks:skylake
./splitmix-speed: INT PI= 3.14157 in 5.03 s FP PI= 3.14158 in 3.14 s TOT= 8.17 s
./xoroshiro128plus-speed: INT PI= 3.14156 in 5.88 s FP PI= 3.14159 in 5.53 s TOT= 11.41 s
gcc (Ubuntu 9.2.1-17ubuntu1~18.04.1) 9.2.1 20191102
AMD Ryzen 7 3800X 8-Core Processor
Generic Benchmarks:
./splitmix-speed: INT PI= 3.14156 in 6.58 s FP PI= 3.14155 in 6.00 s TOT= 12.58 s
./xoroshiro128plus-speed: INT PI= 3.14156 in 5.44 s FP PI= 3.14159 in 4.61 s TOT= 10.05 s
Native Benchmarks:znver2
./splitmix-speed: INT PI= 3.14156 in 5.12 s FP PI= 3.14155 in 3.16 s TOT= 8.28 s
./xoroshiro128plus-speed: INT PI= 3.14156 in 5.55 s FP PI= 3.14159 in 4.52 s TOT= 10.06 s
* Twice as many calls to the Integer estimator were made during the benchmark, since a single call to the PRNG was split into two 31-bit parts.
Here is the code I looped (all compiled using -Wall -Wno-unused-variable -O3 -fno-move-loop-invariants -fno-unroll-loops -std=c99):
#include <stdint.h>
uint64_t x; /* The state can be seeded with any value. */
uint64_t next() {
uint64_t z = (x += 0x9e3779b97f4a7c15);
z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
return z ^ (z >> 31);
}
uint64_t picalcint(uint64_t n) {
uint64_t rand_x, rand_y;
uint64_t circle_points = 0;
for (uint64_t i = n; i-- != 0;) {
uint64_t tmp = next();
rand_x = tmp >> 33; rand_x *= rand_x;
rand_y = (uint32_t)tmp >> 1; rand_y *= rand_y;
if (rand_x + rand_y <= 0x4000000000000000) circle_points++;
}
return circle_points;
}
// Fastest FP converter mod I have tested
uint64_t to_double_mask = (0x3FFULL << 52);
union { double d; uint64_t i; } to_double_u;
double to_double(uint64_t x) {
to_double_u.i = (x >> 12) | to_double_mask;
return to_double_u.d - 1.0;
}
uint64_t picalcfp(uint64_t n) {
double rand_x, rand_y;
uint64_t circle_points = 0;
for (uint64_t i = n; i-- != 0;) {
rand_x = to_double(next()); rand_x *= rand_x;
rand_y = to_double(next()); rand_y *= rand_y;
if (rand_x + rand_y <= 1.0) { circle_points++; }
}
return circle_points;
}
In summary, using Splitmix for floating point, where its exceptional speed and 52-bit equidistribution properties rule, should breathe new life into a PRNG which was historically only used for seeding. If you can tolerate the relatively short 2^64 period, and figure out how to seed it properly, it will provide enough randomness for basic research, etc.