Bonita Montero
May 16, 2022, 8:00:33 AM
x86 on AMD CPUs since Zen 1 has an instruction called CLZERO.
According to WikiChip it is meant for recovering from some memory errors,
but this is pure nonsense. There was a posting on the LKML that
reveals the correct purpose: it is for quickly zeroing memory without
polluting the cache, i.e. clzero is non-temporal.
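Since CLZERO only exists on AMD CPUs, executing it elsewhere raises an invalid-opcode fault, so it can be worth checking for it at runtime first. The following is a minimal sketch, assuming the feature flag is reported in CPUID leaf 0x80000008, EBX bit 0, as AMD documents it. The helper name cpuHasClzero is hypothetical and not part of any library.

```cpp
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    #include <intrin.h>
    #define HAVE_MSVC_CPUID 1
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    #include <cpuid.h>
    #define HAVE_GNU_CPUID 1
#endif

// Hypothetical helper: returns true if the CPU reports CLZERO support.
// Assumption: CLZERO is flagged in CPUID leaf 0x80000008, EBX bit 0.
bool cpuHasClzero()
{
#if defined(HAVE_MSVC_CPUID)
    int regs[4];
    __cpuid( regs, (int)0x80000000 );      // EAX = highest extended leaf
    if( (unsigned)regs[0] < 0x80000008u )
        return false;
    __cpuid( regs, (int)0x80000008 );
    return (regs[1] & 1) != 0;             // regs[1] is EBX
#elif defined(HAVE_GNU_CPUID)
    unsigned eax, ebx, ecx, edx;
    // __get_cpuid returns 0 if the requested leaf is not supported
    if( !__get_cpuid( 0x80000008u, &eax, &ebx, &ecx, &edx ) )
        return false;
    return (ebx & 1u) != 0;
#else
    return false;                          // not an x86 build
#endif
}
```

A benchmark could call this once at startup and skip the clzero variant if it returns false.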
I thought it would be nice to have a comparison between a looped
clzero and a plain memset, which itself is usually very well
optimized by today's compilers and runtime libraries. So I wrote
a little benchmark in C++20 to compare both:
#include <iostream>
#include <chrono>
#include <vector>
#include <memory>
#include <cstring>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#if defined(_MSC_VER)
    #include <intrin.h>
#elif defined(__GNUC__) || defined(__clang__)
    #include <x86intrin.h> // _mm_clzero, needs -mclzero with g++ / clang++
#endif

using namespace std;
using namespace chrono;

template<bool MemSet = false>
size_t clZeroRange( void *p, size_t n );

int main()
{
    constexpr size_t
        N = 0x4000000, // 64 MiB buffer
        ROUNDS = 1'000;
    vector<char> vc( N, 0 );
    // prints the throughput in GiB/s; MemSet selects memset() instead of clzero
    auto bench = [&]<bool MemSet>( bool_constant<MemSet> )
    {
        auto start = high_resolution_clock::now();
        size_t n = 0;
        for( size_t r = ROUNDS; r--; )
            n += clZeroRange<MemSet>( to_address( vc.begin() ), N );
        double GBS = (double)(ptrdiff_t)n / 0x1.0p30;
        cout << GBS / ((double)(int64_t)duration_cast<nanoseconds>(
            high_resolution_clock::now() - start ).count() / 1.0e9) << endl;
    };
    bench( false_type() ); // clzero loop
    bench( true_type() );  // memset
}

template<bool MemSet>
size_t clZeroRange( void *p, size_t n )
{
    // round the start up to the next 64-byte cache line ...
    char *pAlign = (char *)(((size_t)p + 63) & (ptrdiff_t)-64);
    // ... and trim the length to whole cache lines
    n -= pAlign - (char *)p;
    n &= (ptrdiff_t)-64;
    if constexpr( !MemSet )
        for( char *end = pAlign + n; pAlign != end; pAlign += 64 )
            _mm_clzero( pAlign );
    else
        memset( pAlign, 0, n ); // same aligned range as the clzero loop
    return n;
}
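The alignment arithmetic in clZeroRange can be checked in isolation with plain integers. The sketch below is only an illustration: the names AlignedRange and alignRange are hypothetical, and unlike the benchmark it also guards against a range shorter than the bytes skipped for alignment.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical illustration type: start address and usable byte count.
struct AlignedRange
{
    std::uintptr_t begin;
    std::size_t bytes;
};

// Rounds addr up to the next 64-byte boundary and trims n to whole cache lines.
constexpr AlignedRange alignRange( std::uintptr_t addr, std::size_t n )
{
    constexpr std::uintptr_t LINE = 64;
    std::uintptr_t begin = (addr + LINE - 1) & ~(LINE - 1);
    std::size_t skipped = begin - addr;
    if( skipped >= n )              // nothing left after aligning
        return { begin, 0 };
    return { begin, (n - skipped) & ~(std::size_t)(LINE - 1) };
}

// 0x1010 rounds up to 0x1040 (48 bytes skipped); 200 - 48 = 152 -> 128 bytes
static_assert( alignRange( 0x1010, 200 ).begin == 0x1040 );
static_assert( alignRange( 0x1010, 200 ).bytes == 128 );
```

For the benchmark's 64 MiB buffer the trimmed part is at most 126 bytes, so it does not affect the measured throughput.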
Interestingly, I get the same performance for both variants with
MSVC++ 2022. With g++ / glibc, memset() reaches only about one
third of the throughput of the clzero() loop. I think the memset()
of glibc is just not as well optimized for this case. The memset()
of Visual C++ uses non-temporal SSE stores, which explains the good
performance.
Would someone here be so kind as to post their results?