Mismatch between geqp3 and geqp3_gpu pivots

15 views
Skip to first unread message

Soham Ghosh

unread,
Mar 8, 2022, 1:48:09 AMMar 8
to MAGMA User
Hello! 

I am using MAGMA's qr factorization with pivoting (geqp3) in a F90 code. My pivots from the CPU version of the code (magmaf_dgeqp3) differs from the gpu+cpu version (magmaf_dgeqp3_gpu). The pivots I get  from magmaf_dgeqp3 and Cray ibpsi dgeqp3 are identical. Can you help me with this error? I will paste my code and working environment below: 

qrtest.F90
--------------------------------------------------------------------------------------------------------------------------------------------

module fill_matrix

implicit none


contains


  subroutine fill_matrix_cpu(m,n,A)

    integer, intent(IN) :: m,n

    real(8), intent(OUT) :: A(:,:)


    integer :: i,j

    real(8) :: re


    do j = 1,n

        do i = 1,m

            call random_number(re)

            A(i,j) = re

        end do

    end do


  end subroutine fill_matrix_cpu


end module


program qrtest

    use magma

    use fill_matrix

    implicit none

    

    integer :: m,n,ldda,info,lwork

    integer :: i,j

    real(8), allocatable :: A(:,:), Acopy(:,:)

    real(8), allocatable :: work(:), workcopy(:)

    real(8), allocatable :: tau(:)

    integer, allocatable :: jpvt(:)

    real(8) :: counter

    

    magma_devptr_t :: A_d, work_d ! pointer for device arrays 

    magma_devptr_t :: queue

    

    call magmaf_init()

    

    m = 8

    n = 50

    lwork = 4*n + 1

    ldda = ceiling(real(m)/32)*32

    info = -1

    counter = 0.0D0

    allocate(A(ldda,n)) 

    allocate(Acopy(ldda,n)) 

    allocate(tau(m)) 

    allocate(jpvt(n)) 

    allocate(work(lwork)) 

    

    info = magmaf_dmalloc(A_d, ldda*n)

    if ((info .ne. 0) .OR. (A_d == 0)) then

        print*, "malloc failed for A_d, info =", info

    end if

    ! fill mxn of the matrix with random numbers

    call fill_matrix_cpu(m,n,A)

    ! copy A to A_d 

    call magmaf_queue_create(0,queue)

    call magmaf_dsetmatrix(m,n,A,ldda,A_d,ldda,queue)

    ! bring A_d back and compare with A to check for CPU-GPU copy correctness

    call magmaf_dgetmatrix(m,n,A_d,ldda,Acopy,ldda,queue)

    call magmaf_queue_destroy(queue)

    do j = 1,n

        do i = 1,m

            counter = counter + ABS(A(i,j) - Acopy(i,j))

        end do

    end do 

    if (counter/size(A) .GT. 1.0D-10) then

        print*, "matrix A didnt not get copied to GPU correctly"

    endif

    counter = 0.0D0

    tau = 0.0D0

    jpvt = 0

    work = 0.0D0

    !Get optimum lwork

    !Run QR factorization with pivoting on CPU with lwork = -1

    call magmaf_dgeqp3(m,n,A,ldda,jpvt,tau,work,-1,info)

    print*, "workspace query info", info

    lwork = ceiling(work(1))

    print*, "lwork", lwork

    ! once you have optimum lwork reallocate work and allocate work_d

    deallocate(work)

    allocate(work(lwork))

    allocate(workcopy(lwork))

    work = 0.0D0

    info = magmaf_dmalloc(work_d, lwork)

    if ((info .ne. 0) .OR. (work_d == 0)) then

        print*, "malloc failed for work_d, info =", info

    end if

    info = -1    

    ! Run QR factorization with pivoting on CPU

    call magmaf_dgeqp3(m,n,A,ldda,jpvt,tau,work,lwork,info)

    print *, 'qrcpu-info:',info

    print *, 'pivots from magma QR CPU printed in fort.100'

    write(100,'(I5)') jpvt(:)

    tau = 0.0D0

    jpvt = 0

    work = 0.0D0

    ! fill work_d vector on device

    call magmaf_queue_create(0,queue)

    call magmaf_dsetvector(lwork,work,1,work_d,1,queue)

    ! bring back the vector to CPU to check for correctness

    call magmaf_dgetvector(lwork,work_d,1,workcopy,1,queue)

    call magmaf_queue_destroy(queue)

    do i = 1,lwork

        counter = counter + ABS(work(i) - workcopy(i))

    end do

    if (counter/size(work) .GT. 1.0D-10) then

        print*, "vector work didnt not get copied to GPU correctly"

    endif

    info = -1

    ! Run QR factorization with pivoting on device

    call magmaf_dgeqp3_gpu(m,n,A_d,ldda,jpvt,tau,work_d,lwork,info)

   

    print *, 'qrgpu-info:',info

    print *, 'pivots from magma QR GPU printed in fort.200'

    write(200,'(I5)') jpvt(:)


    deallocate(A)

    deallocate(Acopy)

    deallocate(tau)

    deallocate(jpvt)

    deallocate(work)

    deallocate(workcopy)


    info = magmaf_free(A_d)  

    info = magmaf_free(work_d) 


    call magmaf_finalize()


end program qrtest

----------------------------------------------------------------------------------------------------------------------------------------------

My MAGMA is compiled with gfortran, CUDA 11.4 and Cray libsci 21.08. I compile the above code as follows (ftn is gfortran):

-------------------------------------------------------------------------------------------------------------------------------------------

ftn -Wall -Wno-tabs -cpp -Dmagma_devptr_t="integer(kind=8)"  -Dmagma_devptr_t="integer(kind=8)" -I/global/homes/s/soham/pm/magma/include -cpp -c -o qrtest.o qrtest.F90

ftn -Wall -g3 -fbacktrace -fcheck=all -fopenmp -fopenacc -o qrtest qrtest.o -L/global/homes/s/soham/pm/magma/lib -lmagma_sparse -lmagma -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/cuda/11.4/lib64 -lcublas -lcudart -lcusparse -Lopt/cray/pe/libsci/21.08.1.2/GNU/9.1/x86_64/lib

-----------------------------------------------------------------------------------------------------------------------------------------

First 10 lines of fort.100, the output from magmaf_dgeqp3:

---------------------------------------

   43

   36

    1

   14

   33

   38

   31

   24

    9

   10

-------------------------------------------------------------------------------------------------

First 10 lines of fort.200, the output of magmaf_dgeqp3_gpu:

-----------------------------------------------------------------

   2

    3

    4

    5

    6

    7

    8

   43

    9

   10


Thank you for looking into my issue, 

Soham.

Mark Gates

unread,
Mar 8, 2022, 9:00:26 AMMar 8
to Soham Ghosh, MAGMA User
On Tue, Mar 8, 2022 at 1:48 AM Soham Ghosh <soha...@lbl.gov> wrote:
I am using MAGMA's qr factorization with pivoting (geqp3) in a F90 code. My pivots from the CPU version of the code (magmaf_dgeqp3) differs from the gpu+cpu version (magmaf_dgeqp3_gpu).

I don't know what would cause this difference, hopefully Stan can look into that, but just to clarify:
magmaf_dgeqp3 is the CPU + GPU version — it takes the matrix in CPU memory, and does some computation on the CPU, some on the GPU,
magmaf_dgeqp3_gpu is the GPU-only version — it takes the matrix in GPU memory, and for this routine, essentially all computation is on the GPU.

Mark

Soham Ghosh

unread,
Mar 15, 2022, 10:32:09 PMMar 15
to MAGMA User, mga...@icl.utk.edu, MAGMA User, Soham Ghosh
Hi Mark, 
Thanks for the response. I made a mistake in describing the routines. I should not call magmaf_dgeqp3 a CPU-only code. However, I gather from the documentation that the pivot (array jpvt) never leaves the CPU, even in magmaf_dgeqp3_gpu. Therefore, magmaf_dgeqp3_gpu is also a hybrid version, right? Then the main difference between the _gpu and the routine without the _gpu is that in the latter the data array and the workspace array are on the device, while in the former they are on the CPU. 

I will like to get magmaf_dgeqp3_gpu working correctly, otherwise I pay the penalty of repeatedly moving data to and from device every time i call magmaf_dgeqp3. 

Thanks, 
Soham.
Reply all
Reply to author
Forward
0 new messages