Is there in-place transposition on Julia ? [ transpose!() ]

1,022 views
Skip to first unread message

Diego Javier Zea

unread,
Apr 9, 2013, 11:46:42 AM4/9/13
to julia...@googlegroups.com

Steven G. Johnson

unread,
Apr 9, 2013, 12:55:32 PM4/9/13
to julia...@googlegroups.com
FFTW supports in-place transposition (of any datatype whose size is an integer multiplier of 4 bytes) for square and non-square matrices, but I don't think this is exposed right now.

If you look in base/fftw.jl, it would be easy to make a variant of the transpose(X) function that operates in-place.  Just rename it to transpose!(X), set P=X in the body, and remove the PRESERVE_INPUT flag.

On Tuesday, April 9, 2013 11:46:42 AM UTC-4, Diego Javier Zea wrote:
http://en.wikipedia.org/wiki/In-place_matrix_transposition

Best!

Steven G. Johnson

unread,
Apr 9, 2013, 12:58:08 PM4/9/13
to julia...@googlegroups.com


On Tuesday, April 9, 2013 12:55:32 PM UTC-4, Steven G. Johnson wrote:
FFTW supports in-place transposition (of any datatype whose size is an integer multiplier of 4 bytes) for square and non-square matrices, but I don't think this is exposed right now.

If you look in base/fftw.jl, it would be easy to make a variant of the transpose(X) function that operates in-place.  Just rename it to transpose!(X), set P=X in the body, and remove the PRESERVE_INPUT flag.

Note also that you should call "return reshape(P, n2, n1)" instead of "return P".  This looks like a bug in the transpose routine even for the out-of-place case, at first glance.

Steven G. Johnson

unread,
Apr 9, 2013, 1:26:22 PM4/9/13
to julia...@googlegroups.com

Note also that you should call "return reshape(P, n2, n1)" instead of "return P".  This looks like a bug in the transpose routine even for the out-of-place case, at first glance.

(This is not a bug, because array.jl calls FFTW.transpose with an argument that has already been reshaped.  It is confusing, though.)

Diego Javier Zea

unread,
Apr 9, 2013, 2:12:01 PM4/9/13
to julia...@googlegroups.com
Thanks Steven! I was looking into fftw.jl file and I see in-place transpose is in the Julia's TODO list.
But looking to array.jl to, looks like FFTW is only for large matrices of Complex and Floats and integer matrices are using Julia's list comprehension:

function transpose{T<:Union(Float64,Float32,Complex128,Complex64)}(A::Matrix{T})
   
if length(A) > 50000
       
return FFTW.transpose(reshape(A, size(A, 2), size(A, 1)))
   
else
       
return [ A[j,i] for i=1:size(A,2), j=1:size(A,1) ]
   
end
end


transpose
(x::StridedVector) = [ x[j] for i=1, j=1:size(x,1) ]
transpose
(x::StridedMatrix) = [ x[j,i] for i=1:size(x,2), j=1:size(x,1) ]

Tim Holy

unread,
Apr 9, 2013, 2:20:57 PM4/9/13
to julia...@googlegroups.com
On Tuesday, April 09, 2013 08:46:42 AM Diego Javier Zea wrote:
> http://en.wikipedia.org/wiki/In-place_matrix_transposition

I wrote just such an algorithm in Julia once, got it working, but never
submitted it and now I can't find it. I think the reason was that I wasn't
convinced it was useful: even in a nice language like Julia, it's a lot slower
(I seem to remember typical factors like 7x) than transpose-with-copy.

FFTW's may use SSE, multiple threads, etc, so that might be a lot faster
anyway.

--Tim

Steven G. Johnson

unread,
Apr 9, 2013, 2:29:58 PM4/9/13
to julia...@googlegroups.com

FFTW's doesn't use SSE or multiple threads.  However, it doesn't use the traditional follow-the-cycles approach (unless the element size is very large), as following the permutation cycles jumps all around the array and kills the cache. Instead, it uses a 2-pass or 3-pass approach which improves memory locality at the cost of more moves.  This is still typically slower than an out-of-place transposition, but not by such a big factor.

For example, I just pushed a branch to my Julia repository that implements transpose! (as well as using the FFTW transpose routines for a wider range of types):
       https://github.com/stevengj/julia/tree/transpose
On my machine, for transposing a 3000x4000 Array{Float64}, the in-place version is slower than out-of-place by a factor of about 1.74.  For a square matrix, on the other hand, the in-place version is faster by a factor of about 1.9.

--SGJ

Tim Holy

unread,
Apr 9, 2013, 4:48:38 PM4/9/13
to julia...@googlegroups.com
Very cool. Thanks for the explanation!

--Tim

Diego Javier Zea

unread,
Apr 9, 2013, 7:04:06 PM4/9/13
to julia...@googlegroups.com
Great Steven!!! :D

The transpose!() from your branch is in my case make in my function only 1.035x slower the transpose function on master.
My test matrix have 3553x1410 values (Integer).

Can be out-of-place a little more slower on your branch than in master?
I did in my master:

julia
> import Base.transpose

julia
> for (libname, fname, elty) in ((:libfftw ,"fftw_plan_guru64_r2r",:Float64),
                                     
(:libfftwf,"fftwf_plan_guru64_r2r",:Float32))
       
           
@eval begin
               
function transpose(X::Ptr{$elty}, Y::Ptr{$elty},
                                  n1
::Integer, n2::Integer, n::Int,
                                  s1
::Integer, s2::Integer)
                   plan
= ccall(($fname,$libname), Ptr{Void},
                               
(Int32, Ptr{Int}, Int32, Ptr{Int}, Ptr{$elty}, Ptr{$elty}, Ptr{Int32}, Uint32),
                               
0, C_NULL, 3, Int[n1,s1*n,n2*n, n2,s2*n,n, n,1,1],
                                X
, Y, C_NULL, ESTIMATE | PRESERVE_INPUT)
                   
if plan == C_NULL
                       error
("FFTW could not create plan") # shouldn't normally happen
                   
end
                   execute
($elty, plan)
                   destroy_plan
($elty, plan)
                   
return Y
               
end
           
end
       
end

julia
> function transpose{T}(A::StridedMatrix{T})
           
(n1, n2) = size(A)
           
# sizeof(T) does not work because A elements may be jl_value_t*
           
# FIXME: there should be a way to get elsize as a compile-time constant
           elsize
= div(sizeof(A), n1*n2)
           
if n1*n2 > 50000 && rem(elsize, 4) == 0
               
(s1, s2) = strides(A)
               P
= Array(T, n2, n1)
               
if rem(elsize, 8) == 0
                   FFTW
.transpose(convert(Ptr{Float64}, A),
                                  convert
(Ptr{Float64}, P),
                                  n1
, n2, div(elsize, 8), s1, s2)
               
else
                   FFTW
.transpose(convert(Ptr{Float32}, A),
                                  convert
(Ptr{Float32}, P),
                                  n1
, n2, div(elsize, 4), s1, s2)
               
end
               
return P
           
else
               
return [ A[j,i] for i=1:n2, j=1:n1 ]
           
end
       
end


Before this I get:
                               Category                       Benchmark Iterations TotalWall AverageWall   MaxWall   MinWall             Timestamp    JuliaHash CodeHash      OS CPUCores
[1,]    "transpose_out-of-place_master" "transpose_out-of-place_master"       1000   18.9995   0.0189995 0.0296562 0.0147168 "2013-04-09 19:44:16" "1601355dad"       NA "Linux"        8
julia> benchmark(f,"transpose_in-place_master",1000)

After this:
                               Category                       Benchmark Iterations TotalWall AverageWall   MaxWall   MinWall             Timestamp    JuliaHash CodeHash      OS CPUCores
[1,]    "transpose_out-of-place_master" "transpose_out-of-place_master"       1000   19.0864   0.0190864 0.0277134 0.0148714 "2013-04-09 19:59:43" "1601355dad"       NA "Linux"        8




Steven G. Johnson

unread,
Apr 10, 2013, 2:26:27 PM4/10/13
to julia...@googlegroups.com


On Tuesday, April 9, 2013 7:04:06 PM UTC-4, Diego Javier Zea wrote:
Can be out-of-place a little more slower on your branch than in master?
I did in my master:
 
A difference of < 1% seems like it is probably in the noise...
Reply all
Reply to author
Forward
0 new messages