Splitmix Vectorization

51 views
Skip to first unread message

Chris Rutz

unread,
Apr 9, 2020, 2:15:20 AM4/9/20
to prng
I accidentally discovered that of all the quality PRNGs I have looked at, Splitmix is very special...
It is the only one that I have seen which will fully use AVX2 instructions automatically, without specialized coding, under gcc (8+ for Intel and 9+ for AMD) on a modern x86/64 CPU when passing '-march=Native' to the compiler.
No specialized coding is required. This is oddly even more effective for floating point conversion.

Here are some benchmarks of a naive Pi estimator*:
gcc-8 (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
Intel(R) Core(TM) i5-8400H CPU
Generic Benchmarks:
              ./splitmix-speed: INT PI= 3.14157 in  7.27 s  FP PI= 3.14158 in  6.50 s  TOT= 13.77 s
      ./xoroshiro128plus-speed: INT PI= 3.14156 in  5.88 s  FP PI= 3.14159 in  5.55 s  TOT= 11.42 s
Native Benchmarks:skylake
              ./splitmix-speed: INT PI= 3.14157 in  5.03 s  FP PI= 3.14158 in  3.14 s  TOT=  8.17 s
      ./xoroshiro128plus-speed: INT PI= 3.14156 in  5.88 s  FP PI= 3.14159 in  5.53 s  TOT= 11.41 s

gcc (Ubuntu 9.2.1-17ubuntu1~18.04.1) 9.2.1 20191102
AMD Ryzen 7 3800X 8-Core Processor
Generic Benchmarks:
              ./splitmix-speed: INT PI= 3.14156 in  6.58 s  FP PI= 3.14155 in  6.00 s  TOT= 12.58 s
      ./xoroshiro128plus-speed: INT PI= 3.14156 in  5.44 s  FP PI= 3.14159 in  4.61 s  TOT= 10.05 s
Native Benchmarks:znver2
              ./splitmix-speed: INT PI= 3.14156 in  5.12 s  FP PI= 3.14155 in  3.16 s  TOT=  8.28 s
      ./xoroshiro128plus-speed: INT PI= 3.14156 in  5.55 s  FP PI= 3.14159 in  4.52 s  TOT= 10.06 s
* Twice as many calls to the Integer estimator were made during the benchmark, since a single call to the PRNG was split into two 31-bit parts.

Here is the code I looped (all compiled using -Wall -Wno-unused-variable -O3 -fno-move-loop-invariants -fno-unroll-loops -std=c99):

#include <stdint.h>

uint64_t x; /* The state can be seeded with any value. */

uint64_t next() {
    uint64_t z = (x += 0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
    z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
    return z ^ (z >> 31);
}

uint64_t picalcint(uint64_t n) {
   uint64_t rand_x, rand_y;
   uint64_t circle_points = 0;
   for (uint64_t i = n; i-- != 0;) { 
        uint64_t tmp = next();
        rand_x = tmp >> 33; rand_x *= rand_x;
        rand_y = (uint32_t)tmp >> 1; rand_y *= rand_y;     
        if (rand_x + rand_y <= 0x4000000000000000) circle_points++;
    } 
    return circle_points;
}

// Fastest FP converter mod I have tested
uint64_t to_double_mask = (0x3FFULL << 52);
union { double d; uint64_t i; } to_double_u;
double to_double(uint64_t x) {
    to_double_u.i = (x >> 12) | to_double_mask;
    return to_double_u.d - 1.0;
}

uint64_t picalcfp(uint64_t n) {
    double rand_x, rand_y; 
    uint64_t circle_points = 0
    for (uint64_t i = n; i-- != 0;) { 
        rand_x = to_double(next()); rand_x *= rand_x;
        rand_y = to_double(next()); rand_y *= rand_y;
        if (rand_x + rand_y <= 1.0) { circle_points++; }
    } 
    return circle_points;
}

In summary, using Splitmix for floating point, where its exceptional speed and 52-bit equidistribution properties rule, should breathe new life into a PRNG which was historically only used for seeding.
If you can tolerate the relatively short 2^64 period, and figure out how to seed it properly, it will provide enough randomness for basic research, etc.
Reply all
Reply to author
Forward
0 new messages