The function given below, accesses elements of a matrix by column. When
the size of the matrix goes above 62000, the performance drops by a
factor of 2 to 4. By size, I mean the #rows x #columns (e.g. .
600*103(=61800), 720*86(=61920), 864*72(=62208)). Strangely, if the
width of the matrix is smaller than 60, then this drop does not occur
for matrices of any height.
< Point at which the performance breaks for diferent matrix size >
-------------------------------------------------------------------
length =400,
width = 151, performance = 13.990495
width = 152, performance = 10.697315
length =480,
width = 125, performance = 14.286978
width = 126, performance = 7.834738
length =576,
width = 104, performance = 14.393398
width = 105, performance = 10.371141
length =691,
width = 87, performance = 15.641437
width = 88, performance = 8.756676
length =829,
width = 72, performance = 16.138418
width = 73, performance = 9.426427
length =994,
width = 60, performance = 17.361404
width = 61, performance = 8.407117
length =1000,
width = 61, performance = 22.943909
width = 62, performance = 15.063729
length =1200,
width = 60, performance = 23.197235
width = 61, performance = 11.290311
length =1440,
width = 60, performance = 23.232620
width = 61, performance = 11.229105
length =1728,
width = 60, performance = 23.269831
width = 61, performance = 11.351677
length = 2073,
width = 60, performance = 26.639214
width = 61, performance = 14.486841
length =2487,
width = 60, performance = 26.508232
width = 61, performance = 14.336942
length =2984,
width = 60, performance = 22.978968
width = 61, performance = 11.209981
length =3580,
width = 60, performance = 23.641298
width = 61, performance = 11.982352
Yes, in the code I am making sure that the cache is completely fushed
between timing calls.
double f(int rows, int copies, int columns )
{
struct timeval ru, ru2;
static int count = 0, j, i, k;
static double *addyC1;
static unsigned long time2;
Size = columns * rows;
count = 0;
if(gettimeofday(&ru, (struct timezone *)0)) perror("ERROR:
gettimeoday() returned 0");
while (count++ < 10){
addyC1 = e->mem;
k = copies;
while(k--){ /* Multiply on all copies */
for(j=0;j<columns;j++)
for(i=0;i<rows;i++)
addyC1[i*columns+j] = 3.1416 ; /* access the matrix elements by
column */
addyC1 += Size; /* move to the nextt copy */
}
}
if(gettimeofday(&ru2, (struct timezone *)0)) perror("ERROR:
gettimeoday() returned 0");
time2 = (long)(ru2.tv_sec - ru.tv_sec)*1e6 + (long)(ru2.tv_usec -
ru.tv_usec);
return ((double)(((double )(rows * columns * copies *
count))/(double)(time2)));
}
> The function given below, accesses elements of a matrix by column.
..which is really bad for performance.
Are you porting code written on Cray ?
> When the size of the matrix goes above 62000, the performance drops by a
> factor of 2 to 4. By size, I mean the #rows x #columns (e.g. .
> 600*103(=61800), 720*86(=61920), 864*72(=62208)). Strangely, if the
> width of the matrix is smaller than 60, then this drop does not occur
> for matrices of any height.
I'm not sure what you mean by "width". Do you mean the column size
or the row size (or "rows" or "columns" in the code) ?
In C style code, "width" would be the row size
(hence the variable columns in the code)
but I'm not so sure what you mean since you seem to use
column dominant addressing.
Also, it's not clear what is the "length" printed below.
Is that the matrix size ?
Without looking at the assembly, it's hard to tell whether
the code is really optimal or not. And that could change lots of things.
A couple of facts that may be involved:
US2 has only 64 entry DTLB. Assuming you're using 8KB page,
TLB can cover only 8KB * 64 = 512KiB.
With 62000 doubles (496KB), it's nearly the threshold of the working set.
Depending on how much bigger you make this,
you may have exceeded TLB coverage.
Another factor is the E$ size.
Some US2 systems shipped with 512KB direct mapped E$.
If that were the case, you would be seeing extra memory accesses
above 512KB matrix size.
Depending on what you mean by width,
it may affect L1 D$ hit rate. Since it's direct mapped cache,
some stride could adversely affect the cache utilization.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
My matrix is stored in row major format, but in the above loop it is
being *accessed by column*. By length and width I mean row size and
column size respectively. (I should have used height instead of width).
For each row size, I vary the column size and try to access an entire
column in a single iteration of the inner loop. Certainly, I was
expecting some kind of cache or TLB miss. What I did not expect is that
the column size, at which performance drops, stops to vary after 1000
or so. For example with a matrix of length 2048, one would expect the
TLB conflict to occur while accessing a column of length 32, rather
than at 60-61.
Furthermore I see the exact same behaviour on a US4 machine.
Any light on the matter will be really appreciated.
The best way to get a real answer is to measure what's going on.
Using the hardware performance counter profiling of Performance Analyser,
you can figure out which instructions take how many cycles and why.
If you don't have Performance Analyser,
you can use cputrack (see man cputrack(1)) to get aggregate numbers
to see which cache miss/tlb/etc occurs most often.