loop optimization

Jayden Shui

unread,

Feb 2, 2012, 9:55:46 AM2/2/12

to

Dear All,

I have a high dimensional array and want to assign one block of data
in the array to another block in the array. The two blocks are not
overlapped. The loops are very slow. Code is attached. Any ways to
optimize it? Thanks a lot.

#include <time.h>
#include <iostream>
using namespace std;

int main()
{
float a[100][100][8];

int ib = 1, ie = 99;
int jb = 1, je = 99;
int kb0 = 0, ke0 = 1;
int kb1 = 3, ke1 = 2;

clock_t t0 = clock();

// Loop to count CPU time.
for (int m = 0; m < 100000; ++m) {

// a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]
slow
for (int i = ib; i <= ie; ++i)
{
for (int j = jb; j <= je; ++j)
{
for (int k0 = kb0, k1 = kb1; k0 <= ke0; ++k0, --k1)
{
a[i][j][k0] = a[i][j][k1];
}
}
}

}

clock_t t1 = clock();
cout << "CPU Time = " << double(t1 - t0) / CLOCKS_PER_SEC << "s
\n";

return 0;
}

Paavo Helde

unread,

Feb 2, 2012, 10:09:03 AM2/2/12

to

Jayden Shui <jayde...@gmail.com> wrote in news:db0c071d-a79a-49fd-
a510-fa2...@y10g2000vbn.googlegroups.com:

> Dear All,
>
> I have a high dimensional array and want to assign one block of data
> in the array to another block in the array. The two blocks are not
> overlapped. The loops are very slow.

Compared to what? It appears that you had to repeat this 100000 times
just in order to get some meaningful timings - does not look slow to me.

Other than that, try to increase your compiler optimization level. The
inner loop is 2 iterations and should be rolled out, some compilers may
require special flags for doing that (-funroll-loops for gcc, for
example).

hth
Paavo

Jayden Shui

unread,

Feb 2, 2012, 10:43:01 AM2/2/12

to

On Feb 2, 10:09 am, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
> Jayden Shui <jayden.s...@gmail.com> wrote in news:db0c071d-a79a-49fd-
> a510-fa2df3e26...@y10g2000vbn.googlegroups.com:

This is testing code for CPU time of the operations. My code has lots
of such similar operations. I hope using some loop optimization
techniques in http://en.wikipedia.org/wiki/Loop_optimization to speed
it up. I think the testing code should be run in less than 0.5 s in i7
after loop optimization (-O2 or O3 is really not enough) according to
my experience.

Victor Bazarov

unread,

Feb 2, 2012, 10:43:45 AM2/2/12

to

Some compilers I've seen (don't recall which, though) could generate
better code if you helped them reduce the indexing by extracting the
"common subexpression" outside of the inner loop. Something like

for (int i = ...
{
float (&ai)[100][8] = a[i];
for (int j = ...
{
float (&aij)[8] = ai[j];
for (int k0 ...
{
aij[k0] = aij[k1];
}
}
}

Better compilers do that on their own. Try it and see if you get any
improvement. If not, you could (a) parallelize the outer loop (since
there is no overlap) but keep in mind it's only worth if the overhead of
starting worker threads is much smaller than the actual work each has to
perform, or (b) change your algorithm and the data structure.

Barring those choices, you're pretty much stuck with what you have.

V
--
I do not respond to top-posted replies, please don't ask

copx

unread,

Feb 2, 2012, 11:12:40 AM2/2/12

to

On 02.02.2012 15:55, Jayden Shui wrote:
> Dear All,
>
> I have a high dimensional array and want to assign one block of data
> in the array to another block in the array. The two blocks are not
> overlapped. The loops are very slow. Code is attached. Any ways to
> optimize it? Thanks a lot.

[snip code]

I compiled this and ran it on my desktop computer which is not very
fast:

Unoptimized build ("g++ loop.cpp"): 15.762s
Optimized build ("g++ -O2 loop.cpp"): 0s

Yes, that's a zero. It runs so fast you cannot even measure it using
your method. Did you forget to enable optimization?

Jayden Shui

unread,

Feb 2, 2012, 11:21:34 AM2/2/12

to

This is due to there is no output and the code in fact is not run in -
O2. Please add

int i, j, k;
cin >> i >> j >> k;
cin << a[i][j][k];

before return 0;

Paavo Helde

unread,

Feb 2, 2012, 12:03:13 PM2/2/12

to

Jayden Shui <jayde...@gmail.com> wrote in news:a143fa2d-fcd7-4c66-
8e83-e80...@n6g2000vbz.googlegroups.com:

Ok, using 'const int' for all the above constants helped MSVC2010 to
optimize the code about 30%.

>>
>> > clock_t t0 = clock();
>>
>> > // Loop to count CPU time.
>> > for (int m = 0; m < 100000; ++m) {
>>
>> > // a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-
> 1]
>> > slow
>> > for (int i = ib; i <= ie; ++i)
>> > {
>> > for (int j = jb; j <= je; ++j)
>> > {
>> > for (int k0 = kb0, k1 = kb1; k0 <= ke0; +
> +k0, --k1)
>> > {
>> > a[i][j][k0] = a[i][j][k1];
>> > }

And manual unrolling of this the 2-iteration loop gave another 10%.

>> > }
>> > }
>>
>> > }
>>
>> > clock_t t1 = clock();
>> > cout << "CPU Time = " << double(t1 - t0) / CLOCKS_PER_SEC <<
> "s
>> > \n";
>>
>> > return 0;
>> > }
>
> This is testing code for CPU time of the operations. My code has lots
> of such similar operations. I hope using some loop optimization
> techniques in http://en.wikipedia.org/wiki/Loop_optimization to speed
> it up. I think the testing code should be run in less than 0.5 s in i7
> after loop optimization (-O2 or O3 is really not enough) according to
> my experience.

I believe this code is much more dependent on the main memory speed and
processor cache sizes than the actual processor speed. If your real
algorithm has many passes over the arrays you might gain speed by working
more locally instead of pumping arrays in and out of the processor
caches.

hth
Paavo

Christian Gollwitzer

unread,

Feb 2, 2012, 1:51:43 PM2/2/12

to

Am 02.02.12 15:55, schrieb Jayden Shui:
> Dear All,

>
>
> // a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]

I don't really understand what kind of copy this performs. But if you
can reformulate this operation to copy (large) contiguous blocks of
memory, you could use memcpy, which is probably optimized better. As
others have suggested, this problem is probably memory bandwidth
limited. This means you can gain something by vectorizing as large
blocks as possible at once.

Christian

Victor Bazarov

unread,

Feb 2, 2012, 3:18:32 PM2/2/12

to

The problem is that the order of elements in the destination block is
inverted (see the "-1"?) That prevents memcpy (or memmove) from being used.

Christian Gollwitzer

unread,

Feb 2, 2012, 3:57:41 PM2/2/12

to

Am 02.02.12 21:18, schrieb Victor Bazarov:

> On 2/2/2012 1:51 PM, Christian Gollwitzer wrote:
> The problem is that the order of elements in the destination block is
> inverted (see the "-1"?) That prevents memcpy (or memmove) from being used.
>

I see. But now I've tried it, and I can't see why it should be faster.
On my machine (dual-core macbook), with -O3, it takes 1.3 seconds to
execute the loop. We are moving 100000*99*99*2*4 bytes, which gives a
speed of
~6 Gbytes/s.

I think this is pretty much the limit of the main memory transfer speed
and I can't see how this can be improved upon. If I try more threads
using OpenMP, then it gets significantly slower. It could run faster on
a fast graphics card because here you can have bulk memory transfer of
160 Gbytes/s (Radeon HD 6950) and more.

Christian

Juha Nieminen

unread,

Feb 2, 2012, 5:11:34 PM2/2/12

to

Christian Gollwitzer <auri...@gmx.de> wrote:
> I don't really understand what kind of copy this performs. But if you
> can reformulate this operation to copy (large) contiguous blocks of
> memory, you could use memcpy, which is probably optimized better. As
> others have suggested, this problem is probably memory bandwidth
> limited. This means you can gain something by vectorizing as large
> blocks as possible at once.

When dealing with multidimensional arrays, if you need to handle
(read, copy, modify...) large portions of it at adjacent indices,
it's always much more efficient to perform the operations on elements
that are in contiguous memory locations rather than jumping around.
This makes it important which index of a multidimensional array is
modified most often.

The reason for this is, obviously, caching. When you access an element
in the array, the CPU will load adjacent memory data to the cache. If
you then access an adjacent element of the array, it will most probably
be already loaded into the cache (closest to the CPU), which will make
it really fast. If, however, you jumped around the RAM by incrementing
the wrong index, you will get lots of cache misses.

For that reason eg. this:

for(unsigned i1 = 0; i1 < size1; ++i1)
for(unsigned i2 = 0; i2 < size2; ++i2)
largeArray[i1][i2] = anotherLargeArray[i1][i2];

will usually be much faster than:

for(unsigned i2 = 0; i2 < size2; ++i2)
for(unsigned i1 = 0; i1 < size1; ++i1)
largeArray[i1][i2] = anotherLargeArray[i1][i2];

memcpy() is often very fast not only because it has been optimized to
death during the last 30 years, but also because it accesses consecutive
memory locations, which is cache-friendly.

Scott Lurndal

unread,

Feb 2, 2012, 6:06:32 PM2/2/12

to

Unless your array sizes total larger than the L3 cache (2 to 24MB depending
on the processor), you're not testing main memory bandwidth at all, just
the cache-to-processor bandwidth (QPI for Intel, HT3 for AMD).

scott

Scott Lurndal

unread,

Feb 2, 2012, 6:20:59 PM2/2/12

to

The array in the OP's post is only 640kbytes in size. This will fit in most
L2 caches. The cost of an L1 cache miss/L2 cache hit is damn close to the
cost of an L1 cache hit, so it is unlikely that cache miss effects play a
role in this particular test on any relatively modern processor. Even an
L2 miss/L3 hit is cheap compared to a L3 miss (anywhere from 60 to 200 nanoseconds
for a L3 miss, depending on memory subsystem configuration e.g. NUMA distance and
snoop/probe behaviour. An L2 hit is on the order of 10 cycles on an i7 (L1 is 4 cycles).

An L3 hit varies depending on the state of the line vis-a-vis other processors
(e.g. a line modified on a remote processor takes 75 cycles to migrate the line
to the L3 cache on the target processor).

There are 3 cycles per nanosecond on a 3Ghz processor.

So, for a 3Ghz i7 (and these ratios apply to older gens and AMD processors)

L1 Hit: 1.333 nanoseconds (64kb I/64kb D)
L2 Hit: 3.333 nanoseconds (512k - 2mb per core)
L3 Hit: 13.333 nanoseconds (OE), 21.66 nanoseconds (S), 25 nanoseconds (M) (2-24mb shared)
Local Mem: 60-80 nanoseconds (mem is attached to requesting processor)
Remote Mem: 100-200 nanoseconds (mem is not attached to requesting processor)

A poorly written OS scheduler can perturb things by moving threads from one core
to another, which will cause additional misses as the new core cache warms up.

scott

Asger Joergensen

unread,

Feb 2, 2012, 7:13:24 PM2/2/12

to

Hi Jayden

Jayden Shui wrote:

> int ib = 1, ie = 99;
> int jb = 1, je = 99;
> int kb0 = 0, ke0 = 1;
> int kb1 = 3, ke1 = 2;

....

> for (int j = jb; j <= je; ++j)
> {
> for (int k0 = kb0, k1 = kb1; k0 <= ke0; ++k0, --k1)

putting in the numbers:
for (int k0 = 0, k1 = 3; k0 <= ke0; ++k0, --k1)

so You are essentially only swapping the first four floats, right ?

well to my solution:

no need for all these [][][] they are just to make things slower
and in some cases simpler to understand.

// first set some sizes
const int lineSize = 8;
const int squareSize = 800;
const int cubeSize = 80000;

// get one chuck of memory of the right size
float cube[ cubeSize ];

float *square = cube;
float* squaresEnd = cube + cubeSize;

while( square < squaresEnd )
{
float* line = square;
float* linesEnd = square + squareSize;

while( line < linesEnd )
{
float* lineBegin = line;
float* lineEnd = line + 3; // lineSize ****

for(; lineBegin < lineEnd; ++lineBegin, --lineEnd )//**
*lineBegin = *lineEnd;

line += lineSize;
}

square += squareSize;
}

this is close to double speed on my machine, but I only have an old
Borland compiler, so it might not be the same for You.

****
I don't know if the 3 was what You intended, if not replace with
lineSize.
**
I removed the = from <= if they are pointing to the same float there
is really no need to swap. ;-)

Best regards
Asger-P

Asger Joergensen

unread,

Feb 2, 2012, 7:47:59 PM2/2/12

to

Hi again Jayden

Asger Joergensen wrote:

> so You are essentially only swapping the first four floats, right ?

Should of course have been moving instead of swapping

> while( line < linesEnd )
> {
> float* lineBegin = line;
> float* lineEnd = line + 3; // lineSize ****
>
> for(; lineBegin < lineEnd; ++lineBegin, --lineEnd )//**
> *lineBegin = *lineEnd;
>
> line += lineSize;
> }

if You really only want to move the first few floats
then this is faster.

while( line < linesEnd )
{
line[0] = line[3];
line[1] = line[2];
line += lineSize;
}

Best regards
Asger-P