I need to transpose a matrix in C. The problem is that due to cache behavior the performance is very poor. If anyone knows efficient matrix transpose algorithm or has some code can you please help me?
You don't really need to store your transpose matrix. If you need to mulitply your transpose matrix with a vector, you can simply write a short routine for doing that. It's all a matter of indexing.
W.C.
Citing NM <n...@nm.com> from his previous post: # I need to transpose a matrix in C. The problem is that due to cache behavior # the performance is very poor. If anyone knows efficient matrix transpose # algorithm or has some code can you please help me?
I only need the transpose operation. What the next steps will do is not important to me right now. I was wondering if there is any block algorithm or something like that which will make use of the cache when transposing the matrix. If there is a pointer to such algorithm I will be really grateful.
> You don't really need to store your transpose matrix. If you need to > mulitply your transpose matrix with a vector, you can simply write a short > routine for doing that. It's all a matter of indexing.
> W.C.
> Citing NM <n...@nm.com> from his previous post: > # I need to transpose a matrix in C. The problem is that due to cache behavior > # the performance is very poor. If anyone knows efficient matrix transpose > # algorithm or has some code can you please help me?
On Tue, 3 Dec 2002, NM wrote: > I need to transpose a matrix in C. The problem is that due to cache behavior > the performance is very poor. If anyone knows efficient matrix transpose > algorithm or has some code can you please help me?
Do you need in-place transpose?
I think an efficient as well a simple matrix transpose algorithm for in-place transpose is to make element-wise copy of a matrix and then memory-copy back to the original matrix. An obvious alternative, that is swaping matrix elements in-place, is much slower.
This works nicely if the size of a matrix is, say, an order of magnitude smaller than the size of the physical memory in a computer. Otherwise, you should review your algorithm to avoid explicit matrix transpose from the first place.
> You don't really need to store your transpose matrix.
That's certainly true if he is starting from scratch, but who would want to do that? If he already has both C code and Fortran code that he needs to use, he may be stuck with actually moving data around.
I would suggest giving MTL (C++) a look. I'm a bit turned off by the fact that it uses so-called "shallow copy" semantics, but it should prove educational in any case.
In article <asj91t$r7...@news.cs.utexas.edu>, "NM" <n...@nm.com> wrote: >I need to transpose a matrix in C. The problem is that due to cache behavior >the performance is very poor. If anyone knows efficient matrix transpose >algorithm or has some code can you please help me?
"NM" <n...@nm.com> wrote in message news:asj91t$r71$1@news.cs.utexas.edu... > I need to transpose a matrix in C. The problem is that due to cache behavior > the performance is very poor. If anyone knows efficient matrix transpose > algorithm or has some code can you please help me?
In article <asj91t$r7...@news.cs.utexas.edu>, "NM" <n...@nm.com> writes: >I need to transpose a matrix in C. The problem is that due to cache behavior >the performance is very poor. If anyone knows efficient matrix transpose >algorithm or has some code can you please help me?
How often ? What are the relative sizes of the matrix, cache and overall memory ? Is the matrix square or rectangular ? Are the dimensions fixed, or dependent on the run ?
if the cache is your only issue, then i can see no way to speed up the algorithm. if you don't need to do the transpose in place, then you're going to be doing something like this
loop over indices by column in matrix 1 loop over indices by row in matrix 1 take value from (column,row) in matrix 1 and put it in (row,column) in matrix 2 end loop row end loop column
using c matrices as an example, the last index is the one that changes most rapidly, thus it's the one close in cache. so to optimize the use of the cache you'd want to loop so that the row loop is the inner loop. however during the transpose, the rows will become the columns in matrix 2 so that'll screw up the caching. if you go the other way and have the rows in matrix 2 be the inner loop, then matrix 1 will screw up your caching. i know of no way around this.
if you have to do the transpose inplace then you'll be doing a bunch of swapping of elements that'll be far apart in memory and there's no hope of being friendly to the cache.
"NM" <n...@nm.com> wrote in message <news:asj91t$r71$1@news.cs.utexas.edu>... > I need to transpose a matrix in C. The problem is that due to cache behavior > the performance is very poor. If anyone knows efficient matrix transpose > algorithm or has some code can you please help me?
In article <71734b21.0212040806.d758...@posting.google.com>,
Mike Deskevich <mikedeskev...@yahoo.com> wrote: >if the cache is your only issue, then i can see no way to speed up the >algorithm. if you don't need to do the transpose in place, then >you're going to be doing something like this >in matrix 2 so that'll screw up the caching. if you go the other way >and have the rows in matrix 2 be the inner loop, then matrix 1 will >screw up your caching. i know of no way around this.
>if you have to do the transpose inplace then you'll be doing a bunch >of swapping of elements that'll be far apart in memory and there's no >hope of being friendly to the cache.
If you're really paranoid about efficiency, you could try to use prefetch instructions to lessen cache impact. These instructions hint the CPU that certain pieces of data will be needed in the future. The CPU can respond by loading them into cache in advance.
Newer versions of GCC support this instruction on some targets. I did some experimenting with prefetching on a PIII (gcc 3.2), but I was never able to gain significant performance increase for vectorized operations, though.
In article <71734b21.0212040806.d758...@posting.google.com>, mikedeskev...@yahoo.com (Mike Deskevich) wrote:
> if you have to do the transpose inplace then you'll be doing a bunch > of swapping of elements that'll be far apart in memory and there's no > hope of being friendly to the cache.
Transposing in-place by sub-blocks will eliminate some of the cache misses.
The simple approach only works in-place for square matrices, of course; the in-place rectangular case is more complicated (that is what the TOMS algorithm does).
NM wrote: > I need to transpose a matrix in C. The problem is that due to cache behavior > the performance is very poor. If anyone knows efficient matrix transpose > algorithm or has some code can you please help me?
Have a look at
Michael R. Portnoff, "An Efficient Parallel-Processing Method for Transposing Large Matrices in Place," IEEE Trans. Image Proc. vol. 8 (1999 September) 1265-1275.
I haven't used the algorithm, but it looks appropriate for what you're doing. Also, some FFT algorithms involve transposing arrays, so you may wish to survey that literature. In particular, there is a detailed discussion of high-performance transposing in the following book:
Mike Deskevich wrote: > if the cache is your only issue, then i can see no way to speed up the > algorithm. if you don't need to do the transpose in place, then > you're going to be doing something like this
> loop over indices by column in matrix 1 > loop over indices by row in matrix 1 > take value from (column,row) in matrix 1 and put it in (row,column) in > matrix 2 > end loop row > end loop column
> using c matrices as an example, the last index is the one that changes > most rapidly, thus it's the one close in cache. so to optimize the > use of the cache you'd want to loop so that the row loop is the inner > loop. however during the transpose, the rows will become the columns > in matrix 2 so that'll screw up the caching. if you go the other way > and have the rows in matrix 2 be the inner loop, then matrix 1 will > screw up your caching. i know of no way around this.
There are a whole bunch of algorithms that aim to minimise the number of cache misses by tranposing carefully chosen sizes of local sub blocks that will just fit into cache and then shuffling the blocks to their final destination. Most of the old large scale 2D FFT codes used something similar in the '80 s to get acceptable performance with large arrays under very tight physical memory constraints.
Naive array transpose of a large 2D image array used to sometimes grind demand paged machines of that era to a complete standstill. Modern kit has a lot more memory to play with and often 2 levels of cache.
Transpose is taking too much time. For example a 4096 x 4096 matrix containing complex numbers ( 2 double) is taking about 10 second whereas copying takes about 1.6 sec. It is currently a bottleneck of what i am doing. I don't know of the cache size, overall memory is 512MB. The matrix can be rectangular. dimensions are dependent in the run.
NM
"Michel OLAGNON" <molag...@ifremer.fr> wrote in message
> In article <asj91t$r7...@news.cs.utexas.edu>, "NM" <n...@nm.com> writes: > >I need to transpose a matrix in C. The problem is that due to cache behavior > >the performance is very poor. If anyone knows efficient matrix transpose > >algorithm or has some code can you please help me?
> How often ? What are the relative sizes of the matrix, cache and overall > memory ? Is the matrix square or rectangular ? Are the dimensions fixed, > or dependent on the run ?
On Thu, 5 Dec 2002, NM wrote: > Transpose is taking too much time. For example a 4096 x 4096 matrix > containing complex numbers ( 2 double) is taking about 10 second whereas > copying takes about 1.6 sec. It is currently a bottleneck of what i am > doing. I don't know of the cache size, overall memory is 512MB. The matrix > can be rectangular. dimensions are dependent in the run.
Indeed, transposing large matrices in-place by swaping its elements takes more time (but less memory) when compared to a transpose algorithm consisting of element-wise copying. Here are few test results (on PII 400MHz, 196MB RAM):
$ time ./main 2 512 50 size=512, repeat=50 transpose=transpose_inplace_copy
real 0m2.026s user 0m1.170s sys 0m0.840s
$ time ./main 3 512 50 size=512, repeat=50 transpose=transpose_inplace_copy_cache
real 0m1.180s user 0m1.140s sys 0m0.040s
Note that for smaller matrices the difference between above methods are smaller. transpose_inplace_swap becomes more efficient than transpose_inplace_copy_cache if the size of a matrix is less that 200-250.
When increasing the size of a matrix, transpose_inplace_copy_cache becomes more and more efficent than transpose_inplace_swap until physical memory limit is hit.
As others have pointed out there is no fast way to do a transpose in place for you have to individuslly access A[i][j] for all ij. You need to scan the matrix A across the columns in the row major direction ( I assume you have stored it this way) and then store each row in the columns of a new matrix B. This is easy if you are using a modern C++ linear algebra package with row and column iterators and have the standard library. Its just a few lines
Matrix transpose(const Matrix& m) { Matrix result(m.cols(),m.rows()); for (int i=0; i<m.rows(); i++)
"NM" <n...@nm.com> wrote in message news:asj91t$r71$1@news.cs.utexas.edu... > I need to transpose a matrix in C. The problem is that due to cache behavior > the performance is very poor. If anyone knows efficient matrix transpose > algorithm or has some code can you please help me?
The C computer programming language does not have have a matrix type. You probably meant a two dimensional array of numbers.
> The problem is that, > due to cache behavior, the performance is very poor. > If anyone knows efficient matrix transpose algorithm > or has some code can you please help me?
> I only need the transpose operation. > What the next steps will do is not important to me right now. > I was wondering if there is any block algorithm > or something like that which will make use of the cache > when transposing the matrix. > If there is a pointer to such algorithm I will be really grateful.
Block algorithms are only useful when they "reuse" elements of a two dimensional array that have already been cache'd. The only thing that you can do that might enhance performance of
B = A^T
is to avoid striding across the rows of the source array A
const int m = 4096; const int n = 4096; double A[m][n], B[n][m]; for (int i = 0; i < m; ++i) for (int j = 0; j < n; ++j) B[j][i] = A[i][j];
Because n is small, B[j][i+1] = A[i+1][j] will still be in the cache if B[j][i] = A[i][j] caused the processor to cache both B[j][i] and B[j][i+1] in the same cache block.
> Transpose is taking too much time. > For example a 4096 x 4096 matrix containing complex numbers > ( 2 double) is taking about 10 second > whereas copying takes about 1.6 sec. > It is currently a bottleneck of what I am doing. > I don't know of the cache size, overall memory is 512MB. > The matrix can be rectangular. dimensions are dependent in the run.
The most probable explanation for this time difference is that one or both of your arrays are not aligned on quad word boundaries so that retrieving an element of A or storing an element of B requires two memory references instead of just one (assuming that your cache block size is 128 bits). Try something like this
> Transpose is taking too much time. For example a 4096 x 4096 matrix > containing complex numbers ( 2 double) is taking about 10 second whereas > copying takes about 1.6 sec. It is currently a bottleneck of what i am > doing. I don't know of the cache size, overall memory is 512MB. The matrix > can be rectangular. dimensions are dependent in the run.
I'm not familiar with these things in C so the following may not apply. In Fortran, it makes a big difference if matrix dimensions are the power of two or not. i.e. changing size to 4097 x 4097 will markedly reduce the execution of the algorithm.
> Transpose is taking too much time. For example a 4096 x 4096 matrix > containing complex numbers ( 2 double) is taking about 10 second whereas > copying takes about 1.6 sec. It is currently a bottleneck of what i am > doing. I don't know of the cache size, overall memory is 512MB. The matrix > can be rectangular. dimensions are dependent in the run.
I'm not familiar with these things in C so the following may not apply. In Fortran, it makes a big difference if matrix dimensions are the power of two or not. i.e. changing size to 4097 x 4097 will significantly reduce the execution of the algorithm.
I haven't looked many days for any response, as i have implemented a blocking algorithm. Then I found your result here and I was amazed with your result, as it was very good and so simple. BUT hey your code is wrong. In the last 2 transpose algorithm it seems from indentation that there are 2 nested loop. But in fact there is none. The loop over j runs only for the last value of i which is already graeter than n. So you haven't done any transposition.
NM
"Pearu Peterson" <pe...@cens.ioc.ee> wrote in message
> > Transpose is taking too much time. For example a 4096 x 4096 matrix > > containing complex numbers ( 2 double) is taking about 10 second whereas > > copying takes about 1.6 sec. It is currently a bottleneck of what i am > > doing. I don't know of the cache size, overall memory is 512MB. The matrix > > can be rectangular. dimensions are dependent in the run.
> Indeed, transposing large matrices in-place by swaping its elements takes > more time (but less memory) when compared to a transpose algorithm > consisting of element-wise copying. Here are few test results (on PII > 400MHz, 196MB RAM):
> Note that for smaller matrices the difference between above methods are > smaller. transpose_inplace_swap becomes more efficient than > transpose_inplace_copy_cache if the size of a matrix is less that 200-250.
> When increasing the size of a matrix, transpose_inplace_copy_cache becomes > more and more efficent than transpose_inplace_swap until physical memory > limit is hit.
On Mon, 16 Dec 2002, NM wrote: > I haven't looked many days for any response, as i have implemented a > blocking algorithm. Then I found your result here and I was amazed with your > result, as it was very good and so simple. BUT hey your code is wrong.
Oops. You are right and the fix reversed my conclusions. Sorry for the confusion.
However, after fixing the code, I find the following result interesting:
$ time ./main 1 512 50 size=512, repeat=50 transpose=transpose_inplace_swap
real 0m2.210s user 0m2.160s sys 0m0.040s $ time ./main 1 513 50 size=513, repeat=50 transpose=transpose_inplace_swap
real 0m1.148s user 0m1.100s sys 0m0.030s
That is, by changing the size of a matrix from 512 to 513, almost doubles the efficiency.