Efficient matrix transpose in c

NM

unread,

Dec 3, 2002, 4:52:12 PM12/3/02

to

I need to transpose a matrix in C. The problem is that due to cache behavior
the performance is very poor. If anyone knows efficient matrix transpose
algorithm or has some code can you please help me?

Thanks

NM

w...@uiuc.edu

unread,

Dec 3, 2002, 6:16:04 PM12/3/02

to

You don't really need to store your transpose matrix. If you need to
mulitply your transpose matrix with a vector, you can simply write a short
routine for doing that. It's all a matter of indexing.

W.C.

Citing NM <n...@nm.com> from his previous post:
# I need to transpose a matrix in C. The problem is that due to cache behavior
# the performance is very poor. If anyone knows efficient matrix transpose
# algorithm or has some code can you please help me?

--

NM

unread,

Dec 3, 2002, 6:39:20 PM12/3/02

to

I only need the transpose operation. What the next steps will do is not
important to me right now. I was wondering if there is any block algorithm
or something like that which will make use of the cache when transposing the
matrix. If there is a pointer to such algorithm I will be really grateful.

Thanks
NM

<w...@uiuc.edu> wrote in message
news:UWaH9.1486$Vf3....@vixen.cso.uiuc.edu...

Pearu Peterson

unread,

Dec 3, 2002, 6:55:01 PM12/3/02

to

Do you need in-place transpose?

I think an efficient as well a simple matrix transpose algorithm for
in-place transpose is to make element-wise copy of a matrix and then
memory-copy back to the original matrix. An obvious alternative, that is
swaping matrix elements in-place, is much slower.

This works nicely if the size of a matrix is, say, an order
of magnitude smaller than the size of the physical memory in a
computer. Otherwise, you should review your algorithm to avoid
explicit matrix transpose from the first place.

Pearu

Jive Dadson

unread,

Dec 3, 2002, 9:24:04 PM12/3/02

to

w...@uiuc.edu wrote:
>
> You don't really need to store your transpose matrix.

That's certainly true if he is starting from scratch, but who would want to do that? If he already has both C code and Fortran code that he needs to use, he may be stuck with actually moving data around.

I would suggest giving MTL (C++) a look. I'm a bit turned off by the fact that it uses so-called "shallow copy" semantics, but it should prove educational in any case.

Jive

George W. Bush

unread,

Dec 3, 2002, 10:34:09 PM12/3/02

to

You might look at TOMS 380

http://www.netlib.org/toms/380

The code is in Fortran but can be easily ported to another language.

Frank

unread,

Dec 4, 2002, 2:51:14 AM12/4/02

to

Here is how I would transpose a 3x3 matrix in C++:

matrix33 &matrix33::transpose()
{
float t;

for (unsigned int c = 0; c < 3; c++)
{
for (unsigned int r = c + 1; r < 3; r++)
{
t = col[c][r];
col[c][r] = col[r][c];
col[r][c] = t;
}
}

return *this;
}

hope that helps

"NM" <n...@nm.com> wrote in message news:asj91t$r71$1...@news.cs.utexas.edu...

Michel OLAGNON

unread,

Dec 4, 2002, 3:07:57 AM12/4/02

to

How often ? What are the relative sizes of the matrix, cache and overall
memory ? Is the matrix square or rectangular ? Are the dimensions fixed,
or dependent on the run ?

Michel

Mike Deskevich

unread,

Dec 4, 2002, 11:06:49 AM12/4/02

to

if the cache is your only issue, then i can see no way to speed up the
algorithm. if you don't need to do the transpose in place, then
you're going to be doing something like this

loop over indices by column in matrix 1
loop over indices by row in matrix 1
take value from (column,row) in matrix 1 and put it in (row,column) in
matrix 2
end loop row
end loop column

using c matrices as an example, the last index is the one that changes
most rapidly, thus it's the one close in cache. so to optimize the
use of the cache you'd want to loop so that the row loop is the inner
loop. however during the transpose, the rows will become the columns
in matrix 2 so that'll screw up the caching. if you go the other way
and have the rows in matrix 2 be the inner loop, then matrix 1 will
screw up your caching. i know of no way around this.

if you have to do the transpose inplace then you'll be doing a bunch
of swapping of elements that'll be far apart in memory and there's no
hope of being friendly to the cache.

mike

"NM" <n...@nm.com> wrote in message news:<asj91t$r71$1...@news.cs.utexas.edu>...

Han-Wen Nienhuys

unread,

Dec 4, 2002, 12:07:51 PM12/4/02

to

In article <71734b21.0212...@posting.google.com>,

Mike Deskevich <mikede...@yahoo.com> wrote:
>if the cache is your only issue, then i can see no way to speed up the
>algorithm. if you don't need to do the transpose in place, then
>you're going to be doing something like this

>in matrix 2 so that'll screw up the caching. if you go the other way

>and have the rows in matrix 2 be the inner loop, then matrix 1 will
>screw up your caching. i know of no way around this.
>
>if you have to do the transpose inplace then you'll be doing a bunch
>of swapping of elements that'll be far apart in memory and there's no
>hope of being friendly to the cache.

If you're really paranoid about efficiency, you could try to use
prefetch instructions to lessen cache impact. These instructions hint
the CPU that certain pieces of data will be needed in the future. The
CPU can respond by loading them into cache in advance.

Newer versions of GCC support this instruction on some targets. I did
some experimenting with prefetching on a PIII (gcc 3.2), but I was
never able to gain significant performance increase for vectorized
operations, though.

--

Han-Wen Nienhuys | han...@cs.uu.nl | http://www.cs.uu.nl/~hanwen/

Ron Shepard

unread,

Dec 4, 2002, 1:02:13 PM12/4/02

to

In article <71734b21.0212...@posting.google.com>,
mikede...@yahoo.com (Mike Deskevich) wrote:

> if you have to do the transpose inplace then you'll be doing a bunch
> of swapping of elements that'll be far apart in memory and there's no
> hope of being friendly to the cache.

Transposing in-place by sub-blocks will eliminate some of the cache
misses.

The simple approach only works in-place for square matrices, of
course; the in-place rectangular case is more complicated (that is
what the TOMS algorithm does).

$.02 -Ron Shepard

Eric Rudd

unread,

Dec 4, 2002, 1:57:09 PM12/4/02

to

NM wrote:

Have a look at

Michael R. Portnoff, "An Efficient Parallel-Processing Method for
Transposing Large Matrices in Place," IEEE Trans. Image Proc. vol. 8
(1999 September) 1265-1275.

I haven't used the algorithm, but it looks appropriate for what you're doing.
Also, some FFT algorithms involve transposing arrays, so you may wish to survey
that literature. In particular, there is a detailed discussion of
high-performance transposing in the following book:

http://www.ec-securehost.com/SIAM/FR10.html

Hope this helps.

-Eric Rudd

Martin Brown

unread,

Dec 5, 2002, 4:44:57 AM12/5/02

to

Mike Deskevich wrote:

> if the cache is your only issue, then i can see no way to speed up the
> algorithm. if you don't need to do the transpose in place, then
> you're going to be doing something like this
>
> loop over indices by column in matrix 1
> loop over indices by row in matrix 1
> take value from (column,row) in matrix 1 and put it in (row,column) in
> matrix 2
> end loop row
> end loop column
>
> using c matrices as an example, the last index is the one that changes
> most rapidly, thus it's the one close in cache. so to optimize the
> use of the cache you'd want to loop so that the row loop is the inner
> loop. however during the transpose, the rows will become the columns
> in matrix 2 so that'll screw up the caching. if you go the other way
> and have the rows in matrix 2 be the inner loop, then matrix 1 will
> screw up your caching. i know of no way around this.

There are a whole bunch of algorithms that aim to minimise the number of cache
misses by tranposing carefully chosen sizes of local sub blocks that will just fit
into cache and then shuffling the blocks to their final destination. Most of the
old large scale 2D FFT codes used something similar in the '80 s to get acceptable
performance with large arrays under very tight physical memory constraints.

Naive array transpose of a large 2D image array used to sometimes grind demand
paged machines of that era to a complete standstill. Modern kit has a lot more
memory to play with and often 2 levels of cache.

Regards,
Martin Brown

Jentje Goslinga

unread,

Dec 5, 2002, 7:58:18 AM12/5/02

to

Hello NM,

There was a C code posted for this in 1998 by
Jason Stratos Papadopoulos in this newsgroup.
I remember it being quite fast.

Jentje Goslinga

NM

unread,

Dec 5, 2002, 10:51:16 AM12/5/02

to

Transpose is taking too much time. For example a 4096 x 4096 matrix
containing complex numbers ( 2 double) is taking about 10 second whereas
copying takes about 1.6 sec. It is currently a bottleneck of what i am
doing. I don't know of the cache size, overall memory is 512MB. The matrix
can be rectangular. dimensions are dependent in the run.

NM

"Michel OLAGNON" <mola...@ifremer.fr> wrote in message
news:askd4t$o90$1...@iletudy.ifremer.fr...

NM

unread,

Dec 5, 2002, 11:02:40 AM12/5/02

to

Thanks

NM

"Jentje Goslinga" <gosl...@shaw.ca> wrote in message
news:3DEF4D75...@shaw.ca...

Pearu Peterson

unread,

Dec 5, 2002, 12:53:51 PM12/5/02

to

On Thu, 5 Dec 2002, NM wrote:

> Transpose is taking too much time. For example a 4096 x 4096 matrix
> containing complex numbers ( 2 double) is taking about 10 second whereas
> copying takes about 1.6 sec. It is currently a bottleneck of what i am
> doing. I don't know of the cache size, overall memory is 512MB. The matrix
> can be rectangular. dimensions are dependent in the run.

Indeed, transposing large matrices in-place by swaping its elements takes
more time (but less memory) when compared to a transpose algorithm
consisting of element-wise copying. Here are few test results (on PII
400MHz, 196MB RAM):

transposing 2048x2048 matrix
----------------------------
$ time ./main 1 2048 10
size=2048, repeat=10 transpose=transpose_inplace_swap

real 0m6.277s
user 0m5.860s
sys 0m0.410s

$ time ./main 2 2048 10
size=2048, repeat=10 transpose=transpose_inplace_copy

real 0m3.578s
user 0m2.470s
sys 0m1.100s

$ time ./main 3 2048 10
size=2048, repeat=10 transpose=transpose_inplace_copy_cache

real 0m2.838s
user 0m2.460s
sys 0m0.360s

transposing 512x512 matrix
--------------------------
$ time ./main 1 512 50
size=512, repeat=50 transpose=transpose_inplace_swap

real 0m1.874s
user 0m1.820s
sys 0m0.040s

$ time ./main 2 512 50
size=512, repeat=50 transpose=transpose_inplace_copy

real 0m2.026s
user 0m1.170s
sys 0m0.840s

$ time ./main 3 512 50
size=512, repeat=50 transpose=transpose_inplace_copy_cache

real 0m1.180s
user 0m1.140s
sys 0m0.040s

Note that for smaller matrices the difference between above methods are
smaller. transpose_inplace_swap becomes more efficient than
transpose_inplace_copy_cache if the size of a matrix is less that 200-250.

When increasing the size of a matrix, transpose_inplace_copy_cache becomes
more and more efficent than transpose_inplace_swap until physical memory
limit is hit.

Source code for ./main can be found in
http://cens.ioc.ee/~pearu/misc/transpose/

Regards,
Pearu

Bill

unread,

Dec 5, 2002, 10:54:39 AM12/5/02

to

Hi NM

As others have pointed out there is no fast way
to do a transpose in place for you have to individuslly
access A[i][j] for all ij. You need to scan the matrix A
across the columns in the row major direction ( I assume
you have stored it this way) and then store each row
in the columns of a new matrix B.
This is easy if you are using a modern C++ linear
algebra package with row and column iterators
and have the standard library. Its
just a few lines

Matrix transpose(const Matrix& m) {
Matrix result(m.cols(),m.rows());
for (int i=0; i<m.rows(); i++)

std::copy(m.row_begin(i),m.row_end(i),result.column_begin(i));
return result;
}

If you have to use the older C++ libraries with pointers it
looks something like this snippet

typedef float T; // could also be double, complex etc

Matrix transpose(const Matrix& m)
{
Matrix result(m.cols(),m.rows()); // the new matrix

T * mp = m.Data() ; // m element pointer
T * rep = result.Data() ; // result element pointer
T * rowp = m.Data() ; // row end pointer

for (int i=0; i<m.rows(); i++)
{
rep = result.Data() +i ; // set column pointer to top of column
rowp = += m.rows(); // set pointer to end of row

while(mp < rowp){
*rep = *mp;
mp++;
rep += m.cols();
}

return result;
}

regards... Bill Shortall

"NM" <n...@nm.com> wrote in message news:asj91t$r71$1...@news.cs.utexas.edu...

E. Robert Tisdale

unread,

Dec 5, 2002, 1:35:27 PM12/5/02

to

Something that calls itself NM wrote:

> I need to transpose a matrix in C.

The C computer programming language
does not have have a matrix type.
You probably meant a two dimensional array of numbers.

> The problem is that,
> due to cache behavior, the performance is very poor.

> If anyone knows efficient matrix transpose algorithm
> or has some code can you please help me?
>

> I only need the transpose operation.
> What the next steps will do is not important to me right now.
> I was wondering if there is any block algorithm
> or something like that which will make use of the cache
> when transposing the matrix.
> If there is a pointer to such algorithm I will be really grateful.

Block algorithms are only useful when they "reuse"
elements of a two dimensional array that have already been cache'd.
The only thing that you can do that might enhance performance of

B = A^T

is to avoid striding across the rows of the source array A

const int m = 4096;
const int n = 4096;
double A[m][n], B[n][m];
for (int i = 0; i < m; ++i)
for (int j = 0; j < n; ++j)
B[j][i] = A[i][j];

Because n is small, B[j][i+1] = A[i+1][j]
will still be in the cache if B[j][i] = A[i][j]
caused the processor to cache both
B[j][i] and B[j][i+1] in the same cache block.

> Transpose is taking too much time.
> For example a 4096 x 4096 matrix containing complex numbers
> ( 2 double) is taking about 10 second
> whereas copying takes about 1.6 sec.

> It is currently a bottleneck of what I am doing.

> I don't know of the cache size, overall memory is 512MB.
> The matrix can be rectangular. dimensions are dependent in the run.

The most probable explanation for this time difference
is that one or both of your arrays are not aligned
on quad word boundaries so that retrieving an element of A
or storing an element of B requires two memory references
instead of just one (assuming that your cache block size is 128 bits).
Try something like this

double alignment_padding;
doubleComplex A[m][n], B[n][m];

and, if you're lucky, you will get the complex elements
of both A and B aligned on quad word boundaries.

bv

unread,

Dec 5, 2002, 7:50:05 PM12/5/02

to

NM wrote:
>
> Transpose is taking too much time. For example a 4096 x 4096 matrix
> containing complex numbers ( 2 double) is taking about 10 second whereas
> copying takes about 1.6 sec. It is currently a bottleneck of what i am
> doing. I don't know of the cache size, overall memory is 512MB. The matrix
> can be rectangular. dimensions are dependent in the run.

I'm not familiar with these things in C so the following may not apply.
In Fortran,
it makes a big difference if matrix dimensions are the power of two or
not. i.e. changing size to 4097 x 4097 will markedly reduce the
execution of the algorithm.

--
Dr.B.Voh
------------------------------------------------------
Applied Algorithms http://sdynamix.com

bv

unread,

Dec 5, 2002, 7:56:34 PM12/5/02

to

NM wrote:
>
> Transpose is taking too much time. For example a 4096 x 4096 matrix
> containing complex numbers ( 2 double) is taking about 10 second whereas
> copying takes about 1.6 sec. It is currently a bottleneck of what i am
> doing. I don't know of the cache size, overall memory is 512MB. The matrix
> can be rectangular. dimensions are dependent in the run.

I'm not familiar with these things in C so the following may not apply.

In Fortran, it makes a big difference if matrix dimensions are the power

of two or not. i.e. changing size to 4097 x 4097 will significantly

NM

unread,

Dec 16, 2002, 4:41:40 PM12/16/02

to

I haven't looked many days for any response, as i have implemented a
blocking algorithm. Then I found your result here and I was amazed with your
result, as it was very good and so simple. BUT hey your code is wrong. In
the last 2 transpose algorithm it seems from indentation that there are 2
nested loop. But in fact there is none. The loop over j runs only for the
last value of i which is already graeter than n. So you haven't done any
transposition.

NM

"Pearu Peterson" <pe...@cens.ioc.ee> wrote in message
news:Pine.LNX.4.21.02120...@cens.kybi...

Pearu Peterson

unread,

Dec 18, 2002, 6:32:07 PM12/18/02

to

On Mon, 16 Dec 2002, NM wrote:

> I haven't looked many days for any response, as i have implemented a
> blocking algorithm. Then I found your result here and I was amazed with your
> result, as it was very good and so simple. BUT hey your code is wrong.

Oops. You are right and the fix reversed my conclusions. Sorry for the
confusion.

However, after fixing the code, I find the following result interesting:

$ time ./main 1 512 50
size=512, repeat=50 transpose=transpose_inplace_swap

real 0m2.210s
user 0m2.160s
sys 0m0.040s
$ time ./main 1 513 50
size=513, repeat=50 transpose=transpose_inplace_swap

real 0m1.148s
user 0m1.100s
sys 0m0.030s

That is, by changing the size of a matrix from 512 to 513, almost doubles
the efficiency.

Pearu