Splitmix Vectorization

51 views

Skip to first unread message

Chris Rutz

unread,

Apr 9, 2020, 2:15:20 AM4/9/20

to prng

I accidentally discovered that of all the quality PRNGs I have looked at, Splitmix is very special...

It is the only one that I have seen which will fully use AVX2 instructions automatically, without specialized coding, under gcc (8+ for Intel and 9+ for AMD) on a modern x86/64 CPU when passing '-march=Native' to the compiler.

No specialized coding is required. This is oddly even more effective for floating point conversion.

Here are some benchmarks of a naive Pi estimator*:

gcc-8 (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
Intel(R) Core(TM) i5-8400H CPU
Generic Benchmarks:
              ./splitmix-speed: INT PI= 3.14157 in 7.27 s FP PI= 3.14158 in 6.50 s TOT= 13.77 s
      ./xoroshiro128plus-speed: INT PI= 3.14156 in 5.88 s FP PI= 3.14159 in 5.55 s TOT= 11.42 s
Native Benchmarks:skylake
              ./splitmix-speed: INT PI= 3.14157 in 5.03 s FP PI= 3.14158 in 3.14 s TOT= 8.17 s
      ./xoroshiro128plus-speed: INT PI= 3.14156 in 5.88 s FP PI= 3.14159 in 5.53 s TOT= 11.41 s

gcc (Ubuntu 9.2.1-17ubuntu1~18.04.1) 9.2.1 20191102

AMD Ryzen 7 3800X 8-Core Processor

Generic Benchmarks:
              ./splitmix-speed: INT PI= 3.14156 in 6.58 s FP PI= 3.14155 in 6.00 s TOT= 12.58 s
      ./xoroshiro128plus-speed: INT PI= 3.14156 in 5.44 s FP PI= 3.14159 in 4.61 s TOT= 10.05 s
Native Benchmarks:znver2
              ./splitmix-speed: INT PI= 3.14156 in 5.12 s FP PI= 3.14155 in 3.16 s TOT= 8.28 s

./xoroshiro128plus-speed: INT PI= 3.14156 in 5.55 s FP PI= 3.14159 in 4.52 s TOT= 10.06 s

* Twice as many calls to the Integer estimator were made during the benchmark, since a single call to the PRNG was split into two 31-bit parts.

Here is the code I looped (all compiled using -Wall -Wno-unused-variable -O3 -fno-move-loop-invariants -fno-unroll-loops -std=c99):

#include <stdint.h>

uint64_t x; /* The state can be seeded with any value. */

uint64_t next() {

uint64_t z = (x += 0x9e3779b97f4a7c15);

z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;

z = (z ^ (z >> 27)) * 0x94d049bb133111eb;

return z ^ (z >> 31);

}

uint64_t picalcint(uint64_t n) {

uint64_t rand_x, rand_y;

uint64_t circle_points = 0;

for (uint64_t i = n; i-- != 0;) {

uint64_t tmp = next();

rand_x = tmp >> 33; rand_x *= rand_x;

rand_y = (uint32_t)tmp >> 1; rand_y *= rand_y;

if (rand_x + rand_y <= 0x4000000000000000) circle_points++;

}

return circle_points;

}

// Fastest FP converter mod I have tested

uint64_t to_double_mask = (0x3FFULL << 52);

union { double d; uint64_t i; } to_double_u;

double to_double(uint64_t x) {

to_double_u.i = (x >> 12) | to_double_mask;

return to_double_u.d - 1.0;

}

uint64_t picalcfp(uint64_t n) {

double rand_x, rand_y;

uint64_t circle_points = 0;

for (uint64_t i = n; i-- != 0;) {

rand_x = to_double(next()); rand_x *= rand_x;

rand_y = to_double(next()); rand_y *= rand_y;

if (rand_x + rand_y <= 1.0) { circle_points++; }

}

return circle_points;

}

In summary, using Splitmix for floating point, where its exceptional speed and 52-bit equidistribution properties rule, should breathe new life into a PRNG which was historically only used for seeding.

If you can tolerate the relatively short 2^64 period, and figure out how to seed it properly, it will provide enough randomness for basic research, etc.

Reply all

Reply to author

Forward

0 new messages