Core dump with magma_dsyevdx_m for large matrix


Tom Carroll

Sep 22, 2021, 10:36:08 AM9/22/21
to MAGMA User
Hi,

I posted earlier for advice on memory usage. I was using MAGMA's testing code, specifically testing_dsyevd. I just figured out that the testing code also allocates a big chunk of memory to store another copy of the matrix for diagonalization in LAPACK.

I decided to write my own small test (well, to copy/paste the MAGMA test and modify it :) ). I removed the LAPACK memory allocation, which allowed me to test some slightly larger matrices. The code is attached and also pasted below.

In my code, I run magma_dsyevdx_m to get both the eigenvalues and eigenvectors of a large matrix on four Nvidia A100 GPUs. I'm getting a core dump for matrices larger than about 92,000.

When I run my code for a matrix of size 92k, everything works fine:

$ ./magma-test 92000
% MAGMA 2.6.1  64-bit magma_int_t, 64-bit pointer.
Compiled with CUDA support for 8.0
% CUDA runtime 11030, driver 11030. OpenMP threads 32. 
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 2: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 3: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% Wed Sep 22 08:33:49 2021
% ngpu = 4
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%===========================================================================
Workspace query complete. info = 0, lwork = 16928552001, liwork = 460003
Finished allocating memory.
Finished filling matrix.
92000      ---            826.4779           ---           ---         ---      ok

However, when I increase to 93k:

$./magma-test 93000
% MAGMA 2.6.1  64-bit magma_int_t, 64-bit pointer.
Compiled with CUDA support for 8.0
% CUDA runtime 11030, driver 11030. OpenMP threads 32. 
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 2: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 3: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% Wed Sep 22 09:06:34 2021
% ngpu = 4
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%===========================================================================
Workspace query complete. info = 0, lwork = 17298558001, liwork = 465003
Finished allocating memory.
Finished filling matrix.
Aborted (core dumped)

The resource usage is obviously not very different in these two cases, so I haven't been able to suss out a reasonable explanation yet. In both cases, I have around 45-50 GB of free system memory and am using less than 20 GB of the 40 GB per GPU.

Any advice would be great! Thanks!

Cheers,
tom

magma-test.cpp:

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <sys/time.h>

// includes, project
#include "magma_v2.h"
#include "magma_lapack.h"
#include "magma_operators.h"
#include <gsl/gsl_math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_sf.h>


const gsl_rng_type * T;
gsl_rng * r;


int main( int argc, char** argv)
{
        if (argc < 2) { fprintf(stderr, "Usage: %s N\n", argv[0]); return 1; }
        magma_int_t N = atoi(argv[1]);

        magma_init();
        magma_print_environment();

        real_Double_t gpu_time;
        double *h_R,*h_work, aux_work[1];
        double *w1;
        magma_int_t *iwork, aux_iwork[1];
        magma_int_t Nfound, info, lwork, liwork, lda;
        int status = 0;

        struct timeval t1;
        gettimeofday(&t1, NULL);

        gsl_rng_env_setup();

        T = gsl_rng_default;
        gsl_rng_default_seed = t1.tv_usec * t1.tv_usec;
        r = gsl_rng_alloc (T);

        // pass ngpu = -1 to test multi-GPU code using 1 gpu
        magma_int_t abs_ngpu = 4;
        printf("%% ngpu = %lld\n", (long long) abs_ngpu);
        printf("%%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|\n");
        printf("%%============================================================================\n");

        lda = N;
        Nfound = N;
        magma_int_t il = 0, iu = N;
        double vl = 0, vu = N;

        // query for workspace sizes

        magma_dsyevdx_m( abs_ngpu, MagmaVec, MagmaRangeAll, MagmaLower,
                         N, NULL, lda,
                         vl, vu, il, iu,
                         &Nfound, w1,
                         aux_work,  -1,
                         aux_iwork, -1,
                         &info );

        lwork  = (magma_int_t) MAGMA_D_REAL( aux_work[0] );
        liwork = aux_iwork[0];

        printf("Workspace query complete. info = %lld, lwork = %lld, liwork = %lld\n", info, lwork, liwork);

        /* Allocate host memory for the matrix */
        magma_dmalloc_cpu( &w1,     N      );
        magma_imalloc_cpu( &iwork,  liwork );

        magma_dmalloc_pinned( &h_R,    N*lda  );
        magma_dmalloc_pinned( &h_work, lwork  );

        printf("Finished allocating memory.\n");

        for(int j = 0; j < N; j++)
                for(int i = j; i < N; i++)
                        h_R[i + j*N] = 2.0 - 4.0 * gsl_rng_uniform(r);

        printf("Finished filling matrix.\n");
        /* ====================================================================
           Performs operation using MAGMA
           =================================================================== */
        gpu_time = magma_wtime();

        magma_dsyevdx_m( abs_ngpu, MagmaVec, MagmaRangeAll, MagmaLower,
                         N, h_R, lda,
                         vl, vu, il, iu,
                         &Nfound, w1,
                         h_work, lwork,
                         iwork, liwork,
                         &info );

        gpu_time = magma_wtime() - gpu_time;
        if (info != 0) {
                printf("magma_dsyevd returned error %lld: %s.\n",
                       (long long) info, magma_strerror( info ));
        }

        bool okay = true;



        printf("%5lld      ---           %9.4f           ---     ",
               (long long) N, gpu_time);


        // print error checks

        printf("      ---         ---   ");
        printf("   %s\n", (okay ? "ok" : "failed"));
        status += !okay;

        if(N < 100)
                for(int i = 0; i < N; i++)
                        printf("%f\n",w1[i]);

        magma_free_cpu( w1    );
        magma_free_cpu( iwork );
        magma_free_pinned( h_R    );
        magma_free_pinned( h_work );
        fflush( stdout );
        magma_finalize();
        return status;
}


Tom Carroll

Sep 23, 2021, 1:52:13 PM9/23/21
to MAGMA User, Tom Carroll
I have an update. I have tried testing both the single and double precision routines (magma_dsyevd_m and magma_ssyevd_m). Both work when the matrix size is 92672. Both fail with the same error when the matrix size is 92673. The error is:

magma_dsyevd returned error -113: cannot allocate memory on GPU device.

In both cases, I am using four Nvidia A100s with 40 GB of memory each. In the single precision case, only about 10 GB is used per GPU, and in the double precision case only about 20 GB. Sample output below:

$ ./magma-test 92673
% MAGMA 2.6.1  64-bit magma_int_t, 64-bit pointer.
Compiled with CUDA support for 8.0
% CUDA runtime 11030, driver 11030. OpenMP threads 32. 
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 2: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 3: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% Thu Sep 23 12:15:00 2021
% ngpu = 4
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
Workspace query complete. info = 0, lwork = 17177125897, liwork = 463368
Finished allocating memory.
Finished filling matrix.
magma_dsyevd returned error -113: cannot allocate memory on GPU device.
92673      ---            570.7015           ---           ---         ---      ok

Thanks for any advice!

Cheers,
tom

Stanimire Tomov

Sep 23, 2021, 1:58:38 PM9/23/21
to Tom Carroll, MAGMA User
Tom,
Thanks for the report and the update.
This is interesting - it sounds like it may not just be running out of memory if it fails for single precision
at this size. I will run it later on our systems to see if I can reproduce it, and follow up.

Regarding your question on timing: if you add -DENABLE_TIMER to CFLAGS in your make.inc,
MAGMA is instrumented for some routines to print timings for the various components of
the algorithms, including the eigensolvers. That may be useful in this case.
Thanks,
Stan


Tom Carroll

Sep 23, 2021, 2:22:51 PM9/23/21
to MAGMA User, to...@icl.utk.edu, MAGMA User, Tom Carroll
Thanks, Stan! I think the timing question was from another user, though....

Tom Carroll

Sep 28, 2021, 3:33:50 PM9/28/21
to MAGMA User, to...@icl.utk.edu, MAGMA User, Tom Carroll
I have another update here. I was concerned that perhaps I had screwed something up in my build of OpenBLAS, so I wanted to try the same test using a different BLAS/LAPACK combo. I installed Intel MKL (which wasn't my first choice, since I'm on an AMD system).

MAGMA built successfully and I am now getting the exact same issue. When I increase the size of the matrix from 92672 to 92673 or larger, I get a core dump from dsyevd/ssyevd. The reported error is

magma_dsyevd returned error -113: cannot allocate memory on GPU device.

Again, there seems to be plenty of memory both in the system and on the four GPUs. In the dsyevd case, nvidia-smi reports about 20 GB per GPU; in the ssyevd case, about 10 GB. (In both cases, that's out of 40 GB available.)

Would it be more helpful if I filed a bug report at this point?

Thanks for the help!

Output follows:

testing/testing_dsyevd -N 92672 -JV --ngpu 4
% MAGMA 2.6.1  64-bit magma_int_t, 64-bit pointer.
Compiled with CUDA support for 8.0
% CUDA runtime 11030, driver 11030. OpenMP threads 32. MKL 2021.0.3, MKL threads 32. 
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 2: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 3: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% Tue Sep 28 13:26:45 2021
% Usage: testing/testing_dsyevd [options] [-h|--help]

% jobz = Vectors needed, uplo = Lower, ngpu = 4
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
92672      ---            607.5866           ---           ---         ---      ok

testing/testing_dsyevd -N 92673 -JV --ngpu 4
% MAGMA 2.6.1  64-bit magma_int_t, 64-bit pointer.
Compiled with CUDA support for 8.0
% CUDA runtime 11030, driver 11030. OpenMP threads 32. MKL 2021.0.3, MKL threads 32. 
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 2: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 3: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% Tue Sep 28 13:39:23 2021
% Usage: testing/testing_dsyevd [options] [-h|--help]

% jobz = Vectors needed, uplo = Lower, ngpu = 4
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
magma_dsyevd returned error -113: cannot allocate memory on GPU device.
92673      ---            406.1887           ---           ---         ---      ok


$ testing/testing_ssyevd -N 92673 -JV --ngpu 4
% MAGMA 2.6.1  64-bit magma_int_t, 64-bit pointer.
Compiled with CUDA support for 8.0
% CUDA runtime 11030, driver 11030. OpenMP threads 32. MKL 2021.0.3, MKL threads 32. 
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 2: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 3: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% Tue Sep 28 13:49:21 2021
% Usage: testing/testing_ssyevd [options] [-h|--help]

% jobz = Vectors needed, uplo = Lower, ngpu = 4
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
magma_ssyevd returned error -113: cannot allocate memory on GPU device.
92673      ---            210.2290           ---           ---         ---      ok



Tom Carroll

Sep 29, 2021, 1:49:43 PM9/29/21
to MAGMA User, Tom Carroll, to...@icl.utk.edu, MAGMA User
Sorry for yet another update ...

I got a private reply (Thanks, Ed) suggesting this could still be related to a 32-bit integer overflow issue. I had not been thinking along those lines, since that overflow should happen for dimensions > 46336 (based on this bug report), whereas I'm seeing a problem for dimensions > 92672 ... However, 92672 = 46336 x 2.

Could there be an issue all the way down in cudaMalloc? I've put some debugging print-outs in the MAGMA code to try to trace the "magma_dsyevd returned error -113: cannot allocate memory on GPU device" error I'm getting. I've gotten all the way down to a magma_malloc call, which seems to pass exactly what I expect to cudaMalloc. 

Any advice would be appreciated!

Cheers,
tom

Stanimire Tomov

Sep 29, 2021, 3:08:06 PM9/29/21
to MAGMA User, tjca...@gmail.com, Stanimire Tomov, MAGMA User
Tracking a bug like this is difficult because it happens only at these very large sizes.
I see something similar on a system where the GPUs actually have twice as much memory,
and there our system actually locks up, printing messages like this:

% jobz = Vectors needed, uplo = Lower, ngpu = 4
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
time ssytrd = 291.78
time sstedc =   2.31
time sormtr + copy =   2.04

magma_ssyevd returned error -113: cannot allocate memory on GPU device.
92673      ---            296.9140           ---           ---         ---      ok

Message from syslogd@guyot at Sep 29 14:18:25 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#49 stuck for 22s! [testing_ssyevd:222233]

Message from syslogd@guyot at Sep 29 14:18:53 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#49 stuck for 22s! [testing_ssyevd:222233]
...

We filed a bug report with Nvidia. This lockup makes it even more difficult to debug.
Judging by when the message appears, it looks like it comes from dlaex0_m.cpp, line 151.
I will put some checks around that to see.
Did you trace it to somewhere else? gdb can probably also help to trace it.
It is possible that, although there is memory available, a large allocation fails because the memory
is fragmented: we do some allocating and freeing, so there may indeed be no contiguous chunk
of the requested size. If that's the reason, one could probably check with a standalone test
that allocates and frees large buffers.
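
(For what it's worth, a standalone test along those lines could look roughly like the sketch below. It assumes the CUDA runtime API; the chunk sizes and the allocate/free pattern are made up purely for illustration.)

// Sketch: churn the GPU heap with large allocations, then try one big request.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t free_mem, total_mem;
    cudaMemGetInfo( &free_mem, &total_mem );
    printf( "start: %zu MiB free of %zu MiB\n", free_mem >> 20, total_mem >> 20 );

    // Allocate and free a few 4 GiB buffers to mimic repeated allocate/free churn.
    const size_t chunk = (size_t) 4 << 30;
    for (int i = 0; i < 4; ++i) {
        void *p = nullptr;
        if (cudaMalloc( &p, chunk ) != cudaSuccess) {
            printf( "cudaMalloc of chunk %d failed\n", i );
            return 1;
        }
        cudaFree( p );
    }

    // Now ask for one large contiguous buffer (16 GiB) and report what happens.
    void *big = nullptr;
    cudaError_t err = cudaMalloc( &big, (size_t) 16 << 30 );
    cudaMemGetInfo( &free_mem, &total_mem );
    printf( "big alloc: %s, %zu MiB free\n", cudaGetErrorString( err ), free_mem >> 20 );
    cudaFree( big );
    return 0;
}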

You can also file an issue on the MAGMA site summarizing the experiment
(I have already reproduced it), so that the problem gets documented and we keep it in
mind until it is resolved.

Stan

Tom Carroll

Sep 30, 2021, 9:59:40 PM9/30/21
to MAGMA User, to...@icl.utk.edu, Tom Carroll, MAGMA User
Thanks, Stan, for replicating the issue. That's very helpful for my sanity :).

I initially tracked the issue to an analogous place in slaex0_m.cpp (I'm debugging with single precision just because the testing cycle is faster). But I've done some more digging and I'm not sure that's where the error originates. My method was to check the GPU memory using calls to cuMemGetInfo. This required recompiling MAGMA with -lcuda, but it seems to work perfectly (the output of cuMemGetInfo agrees with the command-line output of nvidia-smi and with my calculations of how much memory is being allocated).

When I do this, I see that the free memory on each GPU is exactly what I expect (about 31 GB) until it suddenly drops to zero. cuMemGetInfo reports zero free memory before the call to magma_slaex0_m; I think magma_slaex0_m is just where the next memory allocation happens (which fails).
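
For reference, a probe like this can be as simple as the following sketch (it assumes the CUDA driver API from cuda.h and linking with -lcuda; the helper name and labels are just illustrative):

// Sketch of a free-memory probe to sprinkle around the suspect calls.
#include <cuda.h>
#include <stdio.h>

static void print_free_mem( const char *label )
{
    size_t free_b = 0, total_b = 0;
    // Assumes a current CUDA context already exists (true inside a running MAGMA program).
    if (cuMemGetInfo( &free_b, &total_b ) == CUDA_SUCCESS) {
        printf( "%-30s free %.2f GiB of %.2f GiB\n",
                label, free_b / 1073741824.0, total_b / 1073741824.0 );
    }
}

// e.g.
//     print_free_mem( "before magma_slaex0_m" );
//     /* call under investigation */
//     print_free_mem( "after magma_slaex0_m" );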

Tracking from there, I followed this sequence of function calls: 

magma_ssyevd_m (line 280) --> magma_ssytrd_mgpu (line 389) --> magma_slatrd_mgpu (line 439) --> magmablas_ssymv_mgpu_sync (line 877) --> magma_queue_sync (line 1240) --> cudaStreamSynchronize

After that call to cudaStreamSynchronize, the free memory drops from about 31 GB to zero. Of course, there are many prior calls to cudaStreamSynchronize that cause no issues ... so it seems likely that something earlier is setting this up.

Anyway, I don't know if any of that is helpful but I hope so!

Cheers,
tom

Tom Carroll

Oct 4, 2021, 4:21:58 PM10/4/21
to MAGMA User, Tom Carroll, to...@icl.utk.edu, MAGMA User
Okay -- I think I have pinned down this bug and potentially fixed it. Further testing on my end revealed that the CUDA errors appear immediately after the call to ssymv_kernel_L_mgpu from magmablas_ssymv_mgpu (at line 774 in ssymv_mgpu.cu).

I noticed that in this function, n and lda (and everything else) are declared as int. I poked around for a bit with middling success before finally just changing every int in that function to magma_int_t. And ... that actually worked! Running testing_ssyevd now works for matrices larger than 92672. I'm sure not everything in that function needs to be magma_int_t, but this was my crude fix.
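
To illustrate the failure mode (this is just a hypothetical kernel with made-up names, not the actual MAGMA code): an offset like j*lda computed in 32-bit int arithmetic wraps around once the product exceeds 2^31, while declaring the arguments and indices as magma_int_t (64-bit in this build) keeps the offset arithmetic in 64 bits.

#include "magma_v2.h"

// Broken pattern: 32-bit offset arithmetic.
__global__ void scale_columns_bad( float *A, int n, int lda, float s )
{
    int j = blockIdx.x;                       // one block per column
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        A[ i + j*lda ] *= s;                  // j*lda wraps around for n beyond ~46k
}

// Fixed pattern: 64-bit (magma_int_t) offset arithmetic.
__global__ void scale_columns_fixed( float *A, magma_int_t n, magma_int_t lda, float s )
{
    magma_int_t j = blockIdx.x;
    for (magma_int_t i = threadIdx.x; i < n; i += blockDim.x)
        A[ i + j*lda ] *= s;                  // offset computed without wraparound
}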

I haven't yet tried this for double precision, nor have I done anything to check for correctness of results.

Stan, perhaps you can verify that this works on your end and also makes sense?

Thanks!

Cheers,
tom

ps I'll also file this in a bug report, in case it's more visible to the developers there.

Stanimire Tomov

Oct 13, 2021, 8:40:05 PM10/13/21
to MAGMA User, tjca...@gmail.com, Stanimire Tomov, MAGMA User
Tom,
Thanks! I verified it in the different precisions, and it indeed fixes the issue.
64-bit integer arithmetic is not natively supported on NVIDIA GPUs, so we have been avoiding it even when compiling MAGMA for 64-bit ints. We didn't guess right away that this was the problem here because it only shows up for larger-than-usual matrices: the overflow happens when something like N*N/128 exceeds the 32-bit int range, rather than N*N itself, where 128 is the number of threads in the thread block (I'm not sure it is exactly 128, but it is something like that).
Because 64-bit integer arithmetic has to be emulated in software, I changed int to magma_int_t only in the few places where it was needed.
Filing a bug report sounds good - I can tag you there to give you credit when committing the fix. You could also do a pull request if you prefer (note that the fix only needs to go into zhemv_mgpu.cu; the other precisions are not in the repo; they are generated from the 'z' version).
Stan

Tom Carroll

Oct 18, 2021, 12:38:27 PM10/18/21
to MAGMA User, to...@icl.utk.edu, Tom Carroll, MAGMA User
Wonderful! Thanks, Stan. The bug report is already filed. And I suspected as much once I learned that the GPUs don't natively support 64-bit integers.

As an aside, the testing code that MAGMA provides is really helpful as a set of examples. I've just written a little parallel version of dgemv, since my matrix won't fit on one GPU. It was fairly trivial to break the matrix into four pieces, distribute those to my four GPUs, and then collect the results. I'll get around to posting that code here as a potentially helpful example (and maybe someone will point out improvements :) ).

(I'm doing ~1000 multiplications with the same matrix, so I only need to distribute the matrix to the GPUs once, and the savings are thus pretty huge!)
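
Roughly, the structure is the sketch below (heavily simplified: no error checking or cleanup, a hard cap of eight GPUs, and the function name is made up for illustration). The row blocks of A are copied to the GPUs once up front; each multiply then only moves x down and the pieces of y back.

#include "magma_v2.h"

// y = A*x with A (N x N, column-major, lda = N) split into row blocks across GPUs.
void mgpu_dgemv( magma_int_t ngpu, magma_int_t N,
                 const double *A, const double *x, double *y )
{
    magma_queue_t   queue[ 8 ];
    magmaDouble_ptr dA[ 8 ], dx[ 8 ], dy[ 8 ];
    magma_int_t     row0[ 8 ], nrows[ 8 ];

    // One-time setup: copy one block of rows of A to each GPU.
    for (magma_int_t g = 0; g < ngpu; ++g) {
        row0[g]  = g * (N / ngpu);
        nrows[g] = (g == ngpu-1) ? N - row0[g] : N / ngpu;
        magma_setdevice( g );
        magma_queue_create( g, &queue[g] );
        magma_dmalloc( &dA[g], nrows[g] * N );
        magma_dmalloc( &dx[g], N );
        magma_dmalloc( &dy[g], nrows[g] );
        magma_dsetmatrix( nrows[g], N, A + row0[g], N, dA[g], nrows[g], queue[g] );
    }

    // Per multiplication: ship x, run dgemv on each row block, gather the pieces of y.
    for (magma_int_t g = 0; g < ngpu; ++g) {
        magma_setdevice( g );
        magma_dsetvector( N, x, 1, dx[g], 1, queue[g] );
        magma_dgemv( MagmaNoTrans, nrows[g], N, 1.0, dA[g], nrows[g],
                     dx[g], 1, 0.0, dy[g], 1, queue[g] );
        magma_dgetvector( nrows[g], dy[g], 1, y + row0[g], 1, queue[g] );
    }
}

In the real code the setup loop of course runs only once, and just the second loop is repeated for each new x; using the _async set/get variants with a sync at the end would also let the GPUs overlap instead of being visited one at a time.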

Cheers,
tom