I evaluated the copy using the different copy functions provided by
Intel IPP, and then I tried the non-temporal store instructions, as I
don't expect the memory I write to to still be in the cache when I use
it again.
What I effectively wanted to do was to write a single copy function
like
void Copy(double* pSrc, int srcStride0, int srcStride1, int srcStride2,
          double* pDst, int dstStride0, int dstStride1, int dstStride2);
which decides at runtime, based on some parameters, which copy is best
(either the IPP copy or non-temporal store instructions).
But what I see is that the IPP copy and the copy using non-temporal
store instructions each perform better for different sizes and
different strides (I mean different orientations of the copy, along or
against the cache line). However, I'm unable to generalise this to a
list of parameters. Can someone point me to the parameters that would
affect the implementation?
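
To make the question concrete, here is a minimal sketch in C of the kind
of runtime dispatch described above, reduced to two dimensions. It is
only an illustration under stated assumptions: plain memcpy stands in
for the IPP copy, and the names copy_row_nt, copy_2d, the explicit
rows/cols extents, and the NT_THRESHOLD_BYTES cutoff are all
hypothetical and would have to be tuned by measurement. Strides are
taken to be element counts between row starts. The streaming path uses
the SSE2 intrinsic _mm_stream_pd and falls back to memcpy when the
compiler does not define __SSE2__.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_pd, _mm_stream_pd, _mm_sfence */
#endif

/* Hypothetical cutoff: copies at least this large bypass the cache.
 * The right value depends on cache sizes and must be measured. */
#define NT_THRESHOLD_BYTES ((size_t)256 * 1024)

/* Copy n contiguous doubles using non-temporal (streaming) stores. */
static void copy_row_nt(const double *src, double *dst, size_t n)
{
#if defined(__SSE2__)
    size_t i = 0;
    /* _mm_stream_pd needs a 16-byte aligned destination, so copy
     * scalar elements until dst + i reaches that alignment. */
    while (i < n && ((uintptr_t)(dst + i) & 15u) != 0) {
        dst[i] = src[i];
        ++i;
    }
    /* Main loop: unaligned load, streaming store, two doubles at a time. */
    for (; i + 2 <= n; i += 2)
        _mm_stream_pd(dst + i, _mm_loadu_pd(src + i));
    /* Tail element, if any. */
    for (; i < n; ++i)
        dst[i] = src[i];
    /* Order the streaming stores before any later stores. */
    _mm_sfence();
#else
    memcpy(dst, src, n * sizeof(double));
#endif
}

/* Copy rows x cols doubles between two strided 2-D layouts, choosing
 * the row copy at runtime from the total size (an assumed heuristic). */
static void copy_2d(const double *src, size_t src_stride,
                    double *dst, size_t dst_stride,
                    size_t rows, size_t cols)
{
    assert(src_stride >= cols && dst_stride >= cols);
    size_t total_bytes = rows * cols * sizeof(double);
    int use_nt = total_bytes >= NT_THRESHOLD_BYTES;

    for (size_t r = 0; r < rows; ++r) {
        const double *s = src + r * src_stride;
        double *d = dst + r * dst_stride;
        if (use_nt)
            copy_row_nt(s, d, cols);
        else
            memcpy(d, s, cols * sizeof(double));  /* stand-in for the IPP copy */
    }
}
```

The total size is only one candidate parameter. Row length relative to
the cache line, destination alignment, and whether the destination is
read again soon are others that a real dispatcher would probably have
to weigh.
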
Hi Sankar,
I suspect that IPP also uses non-temporal stores where appropriate (+
possibly parallelization).
In general I would expect IPP to be the fastest (at least on Intel
platforms). If there is some optimization, well, why do you think that
IPP does not use it already?
--
Dmitriy V'jukov