CLZERO


Bonita Montero

May 16, 2022, 8:00:33 AM
x86 on AMD CPUs since Zen 1 has an instruction called CLZERO.
According to WikiChip it exists to recover from certain memory errors,
but that is pure nonsense. A posting on the LKML reveals the actual
purpose: zeroing memory quickly without polluting the cache, i.e.
clzero is non-temporal.
I thought it would be nice to have a comparison between a clzero
loop and a plain memset(), which is usually very well optimized
by today's compilers. So I wrote a little benchmark in C++20 to
compare both:

#include <iostream>
#include <chrono>
#include <vector>
#include <memory>
#include <cstring>
#include <cstdint>
#include <cstddef>
#include <type_traits>
#if defined(_MSC_VER)
    #include <intrin.h>
#elif defined(__GNUC__) || defined(__clang__)
    #include <x86intrin.h> // g++ / clang++: build with -mclzero
#endif

using namespace std;
using namespace chrono;

template<bool MemSet = false>
size_t clZeroRange( void *p, size_t n );

int main()
{
    constexpr size_t
        N = 0x4000000, // 64 MiB buffer
        ROUNDS = 1'000;
    vector<char> vc( N, 0 );
    auto bench = [&]<bool MemSet>( bool_constant<MemSet> )
    {
        auto start = high_resolution_clock::now();
        size_t n = 0; // total number of bytes zeroed
        for( size_t r = ROUNDS; r--; )
            n += clZeroRange<MemSet>( to_address( vc.begin() ), N );
        double GBS = (double)(ptrdiff_t)n / 0x1.0p30;
        // prints the throughput in GiB/s
        cout << GBS / ((double)(int64_t)duration_cast<nanoseconds>(
            high_resolution_clock::now() - start ).count() / 1.0e9) << endl;
    };
    bench( false_type() ); // clzero loop
    bench( true_type() );  // memset
}

// zeroes the 64-byte-aligned part of [p, p + n) and returns its size
template<bool MemSet>
size_t clZeroRange( void *p, size_t n )
{
    // round the start up and the length down to a multiple of 64
    char *pAlign = (char *)(((size_t)p + 63) & (ptrdiff_t)-64);
    n -= pAlign - (char *)p;
    n &= (ptrdiff_t)-64;
    if constexpr( !MemSet )
        for( char *end = pAlign + n; pAlign != end; pAlign += 64 )
            _mm_clzero( pAlign );
    else
        memset( pAlign, 0, n ); // was memset( p, ... ): same range as clzero now
    return n;
}
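
The pointer arithmetic in clZeroRange() can be checked on its own. The following is only a minimal sketch and not part of the benchmark; AlignedRange, alignedSubrange, alignedSubrangeOf and LINE are names invented here for illustration. It rounds the start up to the next 64-byte boundary and the length down to a multiple of 64, and it additionally returns an empty range when the input is too short to hold a full line, a case the benchmark never runs into with its 64 MiB buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper types and names, for illustration only.
struct AlignedRange
{
    std::uintptr_t begin; // first 64-byte-aligned address inside the range
    std::size_t size;     // usable bytes, always a multiple of 64
};

constexpr std::size_t LINE = 64; // line size the benchmark steps by

constexpr AlignedRange alignedSubrange( std::uintptr_t addr, std::size_t n )
{
    // round the start address up to the next multiple of LINE
    std::uintptr_t begin = (addr + (LINE - 1)) & ~(std::uintptr_t)(LINE - 1);
    std::size_t pad = (std::size_t)(begin - addr); // bytes skipped at the front
    if( pad >= n )
        return { begin, 0 }; // too short to contain a full line
    // round the remaining length down to a multiple of LINE
    return { begin, (n - pad) & ~(LINE - 1) };
}

// Same computation for a real pointer, with a debug-build sanity check.
inline AlignedRange alignedSubrangeOf( const void *p, std::size_t n )
{
    AlignedRange r = alignedSubrange( (std::uintptr_t)p, n );
    assert( r.begin % LINE == 0 && r.size % LINE == 0 );
    return r;
}

// Compile-time spot checks of the arithmetic.
static_assert( alignedSubrange( 0x1000, 256 ).size == 256 );
static_assert( alignedSubrange( 0x1010, 256 ).begin == 0x1040 );
static_assert( alignedSubrange( 0x1010, 256 ).size == 192 );
static_assert( alignedSubrange( 0x1010, 40 ).size == 0 );
```

With a start address of 0x1010 and 256 bytes, 48 bytes are skipped at the front and 16 at the back, so only 192 bytes (three lines) would be handed to clzero.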

Interestingly, I get the same performance for both variants with
MSVC++ 2022. With g++ / glibc, memset() reaches only about one
third of the throughput of the clzero() loop. I think the memset()
of glibc is simply not optimized as well for this case. The memset()
of Visual C++ uses non-temporal SSE stores, which explains the good
performance.

Would someone here be so kind as to post their values?
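
For anyone trying this on a machine that may not have the instruction: executing clzero on a CPU without it raises an invalid-opcode fault, so a check before the first run can help. The following is a hedged sketch, assuming GCC/Clang's <cpuid.h> or MSVC's <intrin.h>; hasClzero, assertClzero and HAVE_X86_CPUID are names invented here, not part of the benchmark. It reads the CLZERO feature flag, which AMD documents as bit 0 of EBX in CPUID leaf 0x80000008.

```cpp
#include <cassert>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    #include <intrin.h>
    #define HAVE_X86_CPUID 1 // hypothetical macro name
#elif (defined(__GNUC__) || defined(__clang__)) \
    && (defined(__x86_64__) || defined(__i386__))
    #include <cpuid.h>
    #define HAVE_X86_CPUID 1
#endif

// Returns true if CPUID reports CLZERO (leaf 0x80000008, EBX bit 0).
// On non-x86 builds this sketch simply reports false.
inline bool hasClzero()
{
#if defined(HAVE_X86_CPUID) && defined(_MSC_VER)
    int regs[4]; // EAX, EBX, ECX, EDX
    __cpuid( regs, (int)0x80000000u ); // EAX = highest extended leaf
    if( (unsigned)regs[0] < 0x80000008u )
        return false;
    __cpuid( regs, (int)0x80000008u );
    return (regs[1] & 1) != 0;
#elif defined(HAVE_X86_CPUID)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid returns 0 if the leaf is not supported
    if( !__get_cpuid( 0x80000008u, &eax, &ebx, &ecx, &edx ) )
        return false;
    return (ebx & 1u) != 0;
#else
    return false;
#endif
}

// Debug-build guard; compiled out when NDEBUG is defined.
inline void assertClzero()
{
    assert( hasClzero() && "CPU does not report CLZERO" );
}
```

One way to use it would be to call hasClzero() at the top of main() and skip the clzero run, keeping only the memset() run, when it returns false.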
