Thanks
NM
W.C.
Citing NM <n...@nm.com> from his previous post:
# I need to transpose a matrix in C. The problem is that due to cache behavior
# the performance is very poor. If anyone knows efficient matrix transpose
# algorithm or has some code can you please help me?
--
Thanks
NM
<w...@uiuc.edu> wrote in message
news:UWaH9.1486$Vf3....@vixen.cso.uiuc.edu...
Do you need in-place transpose?
I think an efficient as well a simple matrix transpose algorithm for
in-place transpose is to make element-wise copy of a matrix and then
memory-copy back to the original matrix. An obvious alternative, that is
swaping matrix elements in-place, is much slower.
This works nicely if the size of a matrix is, say, an order
of magnitude smaller than the size of the physical memory in a
computer. Otherwise, you should review your algorithm to avoid
explicit matrix transpose from the first place.
Pearu
That's certainly true if he is starting from scratch, but who would want to do that? If he already has both C code and Fortran code that he needs to use, he may be stuck with actually moving data around.
I would suggest giving MTL (C++) a look. I'm a bit turned off by the fact that it uses so-called "shallow copy" semantics, but it should prove educational in any case.
Jive
http://www.netlib.org/toms/380
The code is in Fortran but can be easily ported to another language.
matrix33 &matrix33::transpose()
{
float t;
for (unsigned int c = 0; c < 3; c++)
{
for (unsigned int r = c + 1; r < 3; r++)
{
t = col[c][r];
col[c][r] = col[r][c];
col[r][c] = t;
}
}
return *this;
}
hope that helps
"NM" <n...@nm.com> wrote in message news:asj91t$r71$1...@news.cs.utexas.edu...
How often ? What are the relative sizes of the matrix, cache and overall
memory ? Is the matrix square or rectangular ? Are the dimensions fixed,
or dependent on the run ?
Michel
loop over indices by column in matrix 1
loop over indices by row in matrix 1
take value from (column,row) in matrix 1 and put it in (row,column) in
matrix 2
end loop row
end loop column
using c matrices as an example, the last index is the one that changes
most rapidly, thus it's the one close in cache. so to optimize the
use of the cache you'd want to loop so that the row loop is the inner
loop. however during the transpose, the rows will become the columns
in matrix 2 so that'll screw up the caching. if you go the other way
and have the rows in matrix 2 be the inner loop, then matrix 1 will
screw up your caching. i know of no way around this.
if you have to do the transpose inplace then you'll be doing a bunch
of swapping of elements that'll be far apart in memory and there's no
hope of being friendly to the cache.
mike
"NM" <n...@nm.com> wrote in message news:<asj91t$r71$1...@news.cs.utexas.edu>...
>in matrix 2 so that'll screw up the caching. if you go the other way
>and have the rows in matrix 2 be the inner loop, then matrix 1 will
>screw up your caching. i know of no way around this.
>
>if you have to do the transpose inplace then you'll be doing a bunch
>of swapping of elements that'll be far apart in memory and there's no
>hope of being friendly to the cache.
If you're really paranoid about efficiency, you could try to use
prefetch instructions to lessen cache impact. These instructions hint
the CPU that certain pieces of data will be needed in the future. The
CPU can respond by loading them into cache in advance.
Newer versions of GCC support this instruction on some targets. I did
some experimenting with prefetching on a PIII (gcc 3.2), but I was
never able to gain significant performance increase for vectorized
operations, though.
--
Han-Wen Nienhuys | han...@cs.uu.nl | http://www.cs.uu.nl/~hanwen/
> if you have to do the transpose inplace then you'll be doing a bunch
> of swapping of elements that'll be far apart in memory and there's no
> hope of being friendly to the cache.
Transposing in-place by sub-blocks will eliminate some of the cache
misses.
The simple approach only works in-place for square matrices, of
course; the in-place rectangular case is more complicated (that is
what the TOMS algorithm does).
$.02 -Ron Shepard
Have a look at
Michael R. Portnoff, "An Efficient Parallel-Processing Method for
Transposing Large Matrices in Place," IEEE Trans. Image Proc. vol. 8
(1999 September) 1265-1275.
I haven't used the algorithm, but it looks appropriate for what you're doing.
Also, some FFT algorithms involve transposing arrays, so you may wish to survey
that literature. In particular, there is a detailed discussion of
high-performance transposing in the following book:
http://www.ec-securehost.com/SIAM/FR10.html
Hope this helps.
-Eric Rudd
Mike Deskevich wrote:
> if the cache is your only issue, then i can see no way to speed up the
> algorithm. if you don't need to do the transpose in place, then
> you're going to be doing something like this
>
> loop over indices by column in matrix 1
> loop over indices by row in matrix 1
> take value from (column,row) in matrix 1 and put it in (row,column) in
> matrix 2
> end loop row
> end loop column
>
> using c matrices as an example, the last index is the one that changes
> most rapidly, thus it's the one close in cache. so to optimize the
> use of the cache you'd want to loop so that the row loop is the inner
> loop. however during the transpose, the rows will become the columns
> in matrix 2 so that'll screw up the caching. if you go the other way
> and have the rows in matrix 2 be the inner loop, then matrix 1 will
> screw up your caching. i know of no way around this.
There are a whole bunch of algorithms that aim to minimise the number of cache
misses by tranposing carefully chosen sizes of local sub blocks that will just fit
into cache and then shuffling the blocks to their final destination. Most of the
old large scale 2D FFT codes used something similar in the '80 s to get acceptable
performance with large arrays under very tight physical memory constraints.
Naive array transpose of a large 2D image array used to sometimes grind demand
paged machines of that era to a complete standstill. Modern kit has a lot more
memory to play with and often 2 levels of cache.
Regards,
Martin Brown
There was a C code posted for this in 1998 by
Jason Stratos Papadopoulos in this newsgroup.
I remember it being quite fast.
Jentje Goslinga
NM
"Michel OLAGNON" <mola...@ifremer.fr> wrote in message
news:askd4t$o90$1...@iletudy.ifremer.fr...
> Transpose is taking too much time. For example a 4096 x 4096 matrix
> containing complex numbers ( 2 double) is taking about 10 second whereas
> copying takes about 1.6 sec. It is currently a bottleneck of what i am
> doing. I don't know of the cache size, overall memory is 512MB. The matrix
> can be rectangular. dimensions are dependent in the run.
Indeed, transposing large matrices in-place by swaping its elements takes
more time (but less memory) when compared to a transpose algorithm
consisting of element-wise copying. Here are few test results (on PII
400MHz, 196MB RAM):
transposing 2048x2048 matrix
----------------------------
$ time ./main 1 2048 10
size=2048, repeat=10 transpose=transpose_inplace_swap
real 0m6.277s
user 0m5.860s
sys 0m0.410s
$ time ./main 2 2048 10
size=2048, repeat=10 transpose=transpose_inplace_copy
real 0m3.578s
user 0m2.470s
sys 0m1.100s
$ time ./main 3 2048 10
size=2048, repeat=10 transpose=transpose_inplace_copy_cache
real 0m2.838s
user 0m2.460s
sys 0m0.360s
transposing 512x512 matrix
--------------------------
$ time ./main 1 512 50
size=512, repeat=50 transpose=transpose_inplace_swap
real 0m1.874s
user 0m1.820s
sys 0m0.040s
$ time ./main 2 512 50
size=512, repeat=50 transpose=transpose_inplace_copy
real 0m2.026s
user 0m1.170s
sys 0m0.840s
$ time ./main 3 512 50
size=512, repeat=50 transpose=transpose_inplace_copy_cache
real 0m1.180s
user 0m1.140s
sys 0m0.040s
Note that for smaller matrices the difference between above methods are
smaller. transpose_inplace_swap becomes more efficient than
transpose_inplace_copy_cache if the size of a matrix is less that 200-250.
When increasing the size of a matrix, transpose_inplace_copy_cache becomes
more and more efficent than transpose_inplace_swap until physical memory
limit is hit.
Source code for ./main can be found in
http://cens.ioc.ee/~pearu/misc/transpose/
Regards,
Pearu
As others have pointed out there is no fast way
to do a transpose in place for you have to individuslly
access A[i][j] for all ij. You need to scan the matrix A
across the columns in the row major direction ( I assume
you have stored it this way) and then store each row
in the columns of a new matrix B.
This is easy if you are using a modern C++ linear
algebra package with row and column iterators
and have the standard library. Its
just a few lines
Matrix transpose(const Matrix& m) {
Matrix result(m.cols(),m.rows());
for (int i=0; i<m.rows(); i++)
std::copy(m.row_begin(i),m.row_end(i),result.column_begin(i));
return result;
}
If you have to use the older C++ libraries with pointers it
looks something like this snippet
typedef float T; // could also be double, complex etc
Matrix transpose(const Matrix& m)
{
Matrix result(m.cols(),m.rows()); // the new matrix
T * mp = m.Data() ; // m element pointer
T * rep = result.Data() ; // result element pointer
T * rowp = m.Data() ; // row end pointer
for (int i=0; i<m.rows(); i++)
{
rep = result.Data() +i ; // set column pointer to top of column
rowp = += m.rows(); // set pointer to end of row
while(mp < rowp){
*rep = *mp;
mp++;
rep += m.cols();
}
return result;
}
regards... Bill Shortall
"NM" <n...@nm.com> wrote in message news:asj91t$r71$1...@news.cs.utexas.edu...
> I need to transpose a matrix in C.
The C computer programming language
does not have have a matrix type.
You probably meant a two dimensional array of numbers.
> The problem is that,
> due to cache behavior, the performance is very poor.
> If anyone knows efficient matrix transpose algorithm
> or has some code can you please help me?
>
> I only need the transpose operation.
> What the next steps will do is not important to me right now.
> I was wondering if there is any block algorithm
> or something like that which will make use of the cache
> when transposing the matrix.
> If there is a pointer to such algorithm I will be really grateful.
Block algorithms are only useful when they "reuse"
elements of a two dimensional array that have already been cache'd.
The only thing that you can do that might enhance performance of
B = A^T
is to avoid striding across the rows of the source array A
const int m = 4096;
const int n = 4096;
double A[m][n], B[n][m];
for (int i = 0; i < m; ++i)
for (int j = 0; j < n; ++j)
B[j][i] = A[i][j];
Because n is small, B[j][i+1] = A[i+1][j]
will still be in the cache if B[j][i] = A[i][j]
caused the processor to cache both
B[j][i] and B[j][i+1] in the same cache block.
> Transpose is taking too much time.
> For example a 4096 x 4096 matrix containing complex numbers
> ( 2 double) is taking about 10 second
> whereas copying takes about 1.6 sec.
> It is currently a bottleneck of what I am doing.
> I don't know of the cache size, overall memory is 512MB.
> The matrix can be rectangular. dimensions are dependent in the run.
The most probable explanation for this time difference
is that one or both of your arrays are not aligned
on quad word boundaries so that retrieving an element of A
or storing an element of B requires two memory references
instead of just one (assuming that your cache block size is 128 bits).
Try something like this
double alignment_padding;
doubleComplex A[m][n], B[n][m];
and, if you're lucky, you will get the complex elements
of both A and B aligned on quad word boundaries.
I'm not familiar with these things in C so the following may not apply.
In Fortran,
it makes a big difference if matrix dimensions are the power of two or
not. i.e. changing size to 4097 x 4097 will markedly reduce the
execution of the algorithm.
--
Dr.B.Voh
------------------------------------------------------
Applied Algorithms http://sdynamix.com
I'm not familiar with these things in C so the following may not apply.
In Fortran, it makes a big difference if matrix dimensions are the power
of two or not. i.e. changing size to 4097 x 4097 will significantly
NM
"Pearu Peterson" <pe...@cens.ioc.ee> wrote in message
news:Pine.LNX.4.21.02120...@cens.kybi...
> I haven't looked many days for any response, as i have implemented a
> blocking algorithm. Then I found your result here and I was amazed with your
> result, as it was very good and so simple. BUT hey your code is wrong.
Oops. You are right and the fix reversed my conclusions. Sorry for the
confusion.
However, after fixing the code, I find the following result interesting:
$ time ./main 1 512 50
size=512, repeat=50 transpose=transpose_inplace_swap
real 0m2.210s
user 0m2.160s
sys 0m0.040s
$ time ./main 1 513 50
size=513, repeat=50 transpose=transpose_inplace_swap
real 0m1.148s
user 0m1.100s
sys 0m0.030s
That is, by changing the size of a matrix from 512 to 513, almost doubles
the efficiency.
Pearu