Christian Gollwitzer <auri...@gmx.de> wrote:
> I don't really understand what kind of copy this performs. But if you
> can reformulate this operation to copy (large) contiguous blocks of
> memory, you could use memcpy, which is probably optimized better. As
> others have suggested, this problem is probably memory bandwidth
> limited. This means you can gain something by vectorizing as large
> blocks as possible at once.
When dealing with multidimensional arrays, if you need to handle
(read, copy, modify...) large portions of them at adjacent indices,
it's usually much more efficient to perform the operations on elements
that are in contiguous memory locations rather than jumping around.
This makes it important which index of a multidimensional array
varies fastest, i.e. which one is incremented in the innermost loop.
The reason for this is caching. When you access an element of the
array, the CPU loads the whole cache line containing it, and thus the
neighbouring data, into the cache. If you then access an adjacent
element, it will most probably already be in the cache (the memory
closest to the CPU), which makes the access really fast. If, however,
you jump around the RAM by incrementing the wrong index in the inner
loop, you will get lots of cache misses.
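To make the "adjacent" part concrete, here is a small sketch of how far apart neighbouring elements of a built-in 2D array are in memory. The dimensions kRows and kCols and the helper byteDistance() are made up for illustration; the only thing it relies on is that C and C++ store such arrays row by row.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical dimensions, chosen only for illustration.
constexpr std::size_t kRows = 4;
constexpr std::size_t kCols = 1024;

// Number of bytes between two addresses, assuming 'to' is not below 'from'.
std::size_t byteDistance(const void* from, const void* to)
{
    return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(to) -
                                    reinterpret_cast<std::uintptr_t>(from));
}

// For 'int a[kRows][kCols]':
//   byteDistance(&a[0][0], &a[0][1]) is sizeof(int): the next element
//     along the last index is the very next object in memory.
//   byteDistance(&a[0][0], &a[1][0]) is kCols * sizeof(int): a step along
//     the first index skips a whole row.
```

With a row of 1024 ints, stepping the first index moves several kilobytes at a time, which is far more than a cache line on common hardware (often 64 bytes, but that figure is an assumption about your machine).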
For that reason, e.g. this:
    for(unsigned i1 = 0; i1 < size1; ++i1)
        for(unsigned i2 = 0; i2 < size2; ++i2)
            largeArray[i1][i2] = anotherLargeArray[i1][i2];
will usually be much faster than:
    for(unsigned i2 = 0; i2 < size2; ++i2)
        for(unsigned i1 = 0; i1 < size1; ++i1)
            largeArray[i1][i2] = anotherLargeArray[i1][i2];
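If you want to measure the difference yourself, something along these lines should do. It is only a rough sketch: the arrays are assumed to be flat std::vector<int> buffers in row-major order (element (i1, i2) at index i1 * size2 + i2), and the names copyLastIndexInner(), copyFirstIndexInner() and secondsTaken() are invented for this example.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Inner loop walks the last index: consecutive memory locations.
// Both vectors are assumed to hold at least size1 * size2 elements.
void copyLastIndexInner(std::vector<int>& dst, const std::vector<int>& src,
                        std::size_t size1, std::size_t size2)
{
    for (std::size_t i1 = 0; i1 < size1; ++i1)
        for (std::size_t i2 = 0; i2 < size2; ++i2)
            dst[i1 * size2 + i2] = src[i1 * size2 + i2];
}

// Inner loop walks the first index: each step jumps size2 elements ahead.
void copyFirstIndexInner(std::vector<int>& dst, const std::vector<int>& src,
                         std::size_t size1, std::size_t size2)
{
    for (std::size_t i2 = 0; i2 < size2; ++i2)
        for (std::size_t i1 = 0; i1 < size1; ++i1)
            dst[i1 * size2 + i2] = src[i1 * size2 + i2];
}

// Wall-clock time of a single call to f, in seconds.
template <typename F>
double secondsTaken(F&& f)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Call it as secondsTaken([&] { copyLastIndexInner(dst, src, size1, size2); }) and likewise for the other function, with arrays much larger than your cache. Both functions produce the same result; only the access order differs. Note that an optimizing compiler may interchange the loops on its own, so the gap you measure depends on compiler and flags.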
memcpy() is often very fast not only because it has been optimized to
death during the last 30 years, but also because it accesses consecutive
memory locations, which is cache-friendly.
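When the whole array sits in one contiguous block, the copy can be handed to memcpy in a single call. A minimal sketch, again assuming flat std::vector<int> storage (copyContiguous() is a made-up name, and this only works for trivially copyable element types):

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Copies all of src into dst with one memcpy call.
void copyContiguous(std::vector<int>& dst, const std::vector<int>& src)
{
    static_assert(std::is_trivially_copyable<int>::value,
                  "memcpy is only valid for trivially copyable types");
    dst.resize(src.size());
    if (!src.empty())  // avoid calling memcpy with pointers from empty vectors
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(int));
}
```

For plain vectors, dst = src or std::copy will typically end up doing much the same thing; the explicit memcpy is mostly of interest when you are copying a contiguous sub-block, such as one full row of a larger array.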