RNG Benchmarks and rework of RDRAND/RDSEED

Jeffrey Walton

Mar 7, 2017, 1:26:36 AM
to Crypto++ Users
Hi Everyone,

We are getting ready to add benchmarks for a selection of RNGs provided by the library. The results are interesting.

The RDRAND and RDSEED benchmark results did not pass the sniff test: both generators had been under-performing, though it had gone unnoticed. RDRAND and RDSEED are being reworked, which should increase throughput by about 50%. Here are some of the details:

* RDRAND and RDSEED throughput varies wildly depending on processor family, CPU sub-architecture and processor manufacturer. On Intel hardware, RDSEED runs at anywhere from 1/5 to 1/2 of the rate of RDRAND.

* The reworked generators always fulfill a request if the function parameters are correct and the hardware is present. If the instruction fails to deliver a value, they retry immediately and automatically, with no user intervention needed.

* The retry parameters were removed. RDRAND never fails, and it was hard to predict when there were "enough" retries for RDSEED. Now that the generators retry immediately, there is no need for an external retry count.

* RDRAND and RDSEED's GenerateBlock was crippled by C++ exceptions. Setting up the C++ exceptions dominated the function. I estimated we needed about 5 to 7 instructions for generation and book-keeping in a tight loop: generate, test for failure, write value, increment pointer, decrement size. We were getting 50 to 70 additional instructions as the exception frame was set up and torn down. Keep in mind the library often "chunks" a request, so a 10 KB request might be broken into multiple 512-byte or 1024-byte requests, and the overhead is paid on each one.

* RDRAND and RDSEED's GenerateBlock had two sources of C++ exceptions. First was hardware and parameter validation, and second was retry counts. Now that the exceptions are removed, GenerateBlock is an x86_64 leaf function without a red zone.

* Internal representations, like NASM_RDRAND_GenerateBlock and MASM_RDRAND_GenerateBlock, now return void (i.e., no return value). The return values are no longer needed since GenerateBlock will fulfill the request or crash if the hardware is not present. It no longer throws an exception.

* We tested a parallelized implementation using OpenMP with a parallel for-loop. The generator lost about 7 MiB/s of total throughput for each additional OMP thread on a particular test machine using GCC 4 and 5. That is, the baseline was 70 MiB/s with a single thread, and the totals were 63 MiB/s using 2 OMP threads, 56 MiB/s using 3 OMP threads, and 47 MiB/s using 4 OMP threads.
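
The tight loop described in the bullets above (generate, test for failure, write value, increment pointer, decrement size, with an immediate retry and no exceptions) can be sketched roughly as follows. This is only an illustration, not the library's code: FillWithRetry and its step parameter are hypothetical names, and step stands in for a single RDRAND or RDSEED instruction.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not the library's implementation. `step` stands in
// for one RDRAND/RDSEED instruction: it returns true and stores a 64-bit
// value on success, or returns false when no value was delivered.
template <typename Step>
void FillWithRetry(Step step, unsigned char* output, std::size_t size)
{
    while (size > 0)
    {
        std::uint64_t value;

        // Retry immediately on failure: no retry count, no exception.
        while (!step(&value)) { }

        // Write a full word, or only the remaining tail bytes.
        const std::size_t n = size < sizeof(value) ? size : sizeof(value);
        std::memcpy(output, &value, n);

        output += n;  // increment pointer
        size -= n;    // decrement size
    }
}
```

On real hardware, step could be a small wrapper around the compiler intrinsic _rdrand64_step from <immintrin.h> (GCC and Clang need -mrdrnd). That wiring is an assumption for illustration; the library uses its own inline assembly, NASM and MASM routines.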

Jeff

Jeffrey Walton

Mar 9, 2017, 4:56:33 AM
to Crypto++ Users


On Tuesday, March 7, 2017 at 1:26:36 AM UTC-5, Jeffrey Walton wrote:
Hi Everyone,

We are getting ready to add benchmarks for a selection of RNGs provided by the library. The results are interesting.

These changes are in. There were a few commits for RDRAND and RDSEED. They can be found at https://github.com/weidai11/cryptopp/issues/387.

The changes for the benchmark code can be found at https://github.com/weidai11/cryptopp/issues/386.

The benchmark program no longer depends on the makefile echoing elements like HTML, HEAD and BODY. Everything is self-contained. If you run:

    ./cryptest.exe b <duration> <cpu freq>

then the program produces a well-formed HTML5 page. You can save it to a file with:

    ./cryptest.exe b <duration> <cpu freq>   > benchmarks.html

or something like:

    CRYPTOPP_CPU_FREQ=2.4 make bench | tee benchmarks.html

The <duration> is the run time of each test in seconds, and <cpu freq> is the CPU frequency in GHz. You can also run subsets of the benchmark code.

"b1" runs the unkeyed algorithms. Here each test is run for 3 seconds:

    ./cryptest.exe b1 3 2.4

"b2" runs the shared-key algorithms. Here each test is run for 5 seconds:

    ./cryptest.exe b2 5 2.4

"b3" runs the public-key algorithms. Here each test is run for 2.5 seconds:

    ./cryptest.exe b3 2.5 2.4
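
As a side note on the <cpu freq> argument: presumably it lets the benchmark express throughput in cycles per byte as well as MiB/s. That is an assumption on my part rather than something spelled out above, and CyclesPerByte below is a hypothetical helper, not a function from the benchmark code. The conversion itself is simple arithmetic:

```cpp
// Hypothetical helper: convert a measured throughput into cycles per byte.
// cpuFreqGHz corresponds to the <cpu freq> argument (for example 2.4);
// mibPerSecond is a measured rate in MiB/s.
inline double CyclesPerByte(double cpuFreqGHz, double mibPerSecond)
{
    const double cyclesPerSecond = cpuFreqGHz * 1000.0 * 1000.0 * 1000.0;
    const double bytesPerSecond = mibPerSecond * 1024.0 * 1024.0;
    return cyclesPerSecond / bytesPerSecond;
}
```

For example, a generator running at 70 MiB/s on a 2.4 GHz machine works out to a little under 33 cycles per byte.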

Jeff