Out of memory on Tesla T4

EKIN TIRE

Apr 11, 2021, 11:21:10 AM
to Comp542 Class
Hello,

While running training on the T4, I get an out-of-memory error after some time (sometimes 10 seconds, sometimes 5 minutes). All the tests pass up to that point. Is there anything that could cause this? The error below appeared after about 5 minutes of training.

Out of GPU memory trying to allocate 9.208 MiB
Effective GPU memory usage: 99.94% (14.746 GiB/14.756 GiB)
CUDA allocator usage: 14.204 GiB
binned usage: 14.204 GiB (14.204 GiB allocated, 0 bytes cached)

Thank you,

Ekin Tire

GURKAN SOYKAN

Apr 11, 2021, 1:53:25 PM
to EKIN TIRE, Comp542 Class
Dear Ekin,
I would humbly suggest clearing memory at regular intervals, for example whenever training outputs losses.
I am also running the regular garbage collector explicitly via GC.gc().
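A minimal sketch of this pattern in plain Julia (`train_step!` and `report_every` are hypothetical placeholders, not names from the assignment):

```julia
# Sketch: run the garbage collector whenever the loop reports losses.
# `train_step!` stands in for one forward+backward+update call.
function train!(train_step!, niters; report_every = 100)
    losses = Float64[]
    for i in 1:niters
        loss = train_step!(i)      # one training step
        if i % report_every == 0
            push!(losses, loss)    # report / log the loss here
            GC.gc()                # free unreferenced (GPU) buffers
        end
    end
    return losses
end
```

Calling GC.gc() on every iteration would be slow, so tying it to the reporting interval is a reasonable compromise.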
Hope this helps.

--
For information about this course, please see http://courses.ku.edu.tr/comp542
For questions about the course, please email com...@ku.edu.tr
---
You received this message because you are subscribed to the Google Groups "Comp542 Class" group.
To unsubscribe from this group and stop receiving emails from it, send an email to Comp542+u...@ku.edu.tr.
To view this discussion on the web visit https://groups.google.com/a/ku.edu.tr/d/msgid/Comp542/CAC5Q6kVkuJEBPvgMB5Vf6Os%2BOk18UCp2NucQXAqpkvCk2mUZCw%40mail.gmail.com.


--
Gürkan Soykan
M.Sc. Computer Science Student
Koç University

OYKU ZEYNEP BAYRAMOGLU

Apr 11, 2021, 1:58:32 PM
to GURKAN SOYKAN, EKIN TIRE, Comp542 Class
Hi,

I get the following outputs from the loss tests:

Evaluated: (14097.643f0, 1432) == (14096.252f0, 1432)

Evaluated: (1.04277675f6, 105937) == (1.0427646f6, 105937)

Is this difference normal? Also, it is not clear to which other layers we should apply dropout; should we apply it after the decoder as well?

Best,
Öykü Zeynep



GURKAN SOYKAN

Apr 11, 2021, 2:35:22 PM
to OYKU ZEYNEP BAYRAMOGLU, Comp542 Class
Hi Oyku,
Almost none of my losses exactly match the values in the tests, but they are close. So I am trying to judge whether my model works by looking at the training loss and the translations.
But I believe the TAs will clarify whether this is expected.
Secondly, I also had a hard time understanding how to use dropout; in the end I applied it only to the decoder projection.
Hope this helps.
Best.

EKIN TIRE

Apr 11, 2021, 2:40:31 PM
to OYKU ZEYNEP BAYRAMOGLU, GURKAN SOYKAN, Comp542 Class
Hello,

@Oyku I have similar loss differences as well.

@Gurkan I now run the garbage collector with GC.gc() every time training outputs losses. As far as I understand, we cannot use the CUDA.unsafe_free! function on KnetArrays. Is there any way besides GC.gc() to free memory? I also print the current memory status with CUDA.memory_status().

The time difference between the last two memory-usage messages is under a millisecond; I don't think there is any computation between them. Even though CUDA.memory_status() reports only 5.455 GiB allocated, immediately afterwards the GPU apparently tries to use all of the allocated plus cached memory.

┌ Info: Training S2S_v1
└ @ Main In[111]:1
Effective GPU memory usage: 92.67% (13.674 GiB/14.756 GiB)
CUDA allocator usage: 13.157 GiB
binned usage: 13.157 GiB (5.033 GiB allocated, 8.123 GiB cached)
nothing
┣                    ┫ [0.01%, 1/14330, 00:14/57:29:23, 14.44s/i] (9.7849655f0, 9.775839f0)
Effective GPU memory usage: 94.36% (13.924 GiB/14.756 GiB)
CUDA allocator usage: 13.405 GiB
binned usage: 13.405 GiB (6.673 GiB allocated, 6.733 GiB cached)
nothing
┣▏                   ┫ [0.70%, 100/14330, 01:18/03:07:07, 1.55i/s] (6.034509f0, 6.064538f0)
Effective GPU memory usage: 88.23% (13.020 GiB/14.756 GiB)
CUDA allocator usage: 12.500 GiB
binned usage: 12.500 GiB (5.455 GiB allocated, 7.045 GiB cached)
nothing
┣▎                   ┫ [1.40%, 200/14330, 02:20/02:47:07, 1.62i/s] (5.945044f0, 5.9759536f0)
Out of GPU memory trying to allocate 9.208 MiB
Effective GPU memory usage: 99.92% (14.744 GiB/14.756 GiB)
CUDA allocator usage: 14.204 GiB
binned usage: 14.204 GiB (14.204 GiB allocated, 0 bytes cached)

Thank you,

Ekin Tire

GURKAN SOYKAN

Apr 11, 2021, 2:44:26 PM
to EKIN TIRE, Comp542 Class
Dear @EKIN TIRE ,
I am also using CUDA.reclaim()
Best.

EKIN TIRE

Apr 11, 2021, 3:02:35 PM
to GURKAN SOYKAN, Comp542 Class
CUDA.reclaim() does not seem to work either. I suspect the model object grows too large over time, since when I recreate the S2S instance and assign it to model, memory usage drops immediately. I have not figured out how to solve this problem, though.

Thank you.

Ekin Tire

OYKU ZEYNEP BAYRAMOGLU

Apr 11, 2021, 3:28:50 PM
to GURKAN SOYKAN, Comp542 Class
Hi Gürkan,

I also tried using dropout on the decoder projection, but I could only get a dev loss of 3.69 after 10 epochs. Thanks for your answer.

Best,
Öykü

IBRAHIM SHOER

Apr 11, 2021, 4:26:08 PM
to EKIN TIRE, GURKAN SOYKAN, Comp542 Class
I have the same issue as Ekin; I have tried everything but couldn't solve it.

Ibrahim Shoer


ILKER KESEN

Apr 12, 2021, 9:55:20 AM
to IBRAHIM SHOER, EKIN TIRE, GURKAN SOYKAN, Comp542 Class
Hi all,

Ekin, could you please reduce the batch size by half and try again? Memory usage also depends on sequence length: the longer your sequences, the more space your model will need. Alternatively, you can reduce the maximum sentence length. I'm not sure which GPUs we tested this assignment on; maybe the Julia CUDA developers changed something and CUDA arrays now take more space. I'd do this: (i) find the maximum sequence length, (ii) run a forward+backward+update call with a dummy batch at that maximum sequence length, (iii) find a batch size that fits.
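Steps (i)-(iii) could be sketched like this in plain Julia. Here `step!` is a hypothetical closure standing in for one forw+back+update call on a dummy batch; on a real GPU the exception would be CUDA's own out-of-memory error rather than Base's `OutOfMemoryError`:

```julia
# Probe candidate batch sizes with a worst-case (maximum-length) batch
# and return the first one that survives a full training step.
function find_batch_size(step!, maxlen; candidates = (64, 32, 16, 8))
    for bs in candidates
        try
            step!(bs, maxlen)   # dummy batch at maximum sequence length
            return bs           # first size that fits is a safe choice
        catch e
            e isa OutOfMemoryError || rethrow()
        end
    end
    error("no candidate batch size fits in memory")
end
```

Because the dummy batch uses the maximum sequence length, any batch size that passes here should also fit for every real batch.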

Oyku, in your case, floating-point rounding error might be the issue. I'm not 100% sure, but it is a very likely one.

For the dropout question, here's what the assignment points out:

"RNNs take a dropout value at construction and apply dropout to the input of every layer if it is non-zero. You need to handle dropout for other layers in the loss function or in layer definitions as necessary."

So you need to apply dropout to both the encoder and decoder word embeddings (read the documentation via `@doc RNN`), and you can additionally apply dropout to the input of the linear prediction layer. Dropout regularization does not work well when applied to the recurrent connections, so please avoid that.
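For intuition, inverted dropout on a layer's input looks roughly like this. This is a generic pure-Julia sketch, not the Knet implementation; in the assignment you would use Knet's own `dropout(x, p)`:

```julia
# Generic inverted-dropout sketch (Knet provides its own dropout(x, p)).
# At train time, zero each element with probability p and rescale the
# survivors by 1/(1-p) so the expected value is unchanged; at test time,
# return x untouched.
function my_dropout(x, p; training = true)
    (training && p > 0) || return x
    mask = rand(size(x)...) .> p
    return x .* mask ./ (1 - p)
end
```

In the loss function you would apply this to the embedding outputs and to the input of the final projection, but not between the time steps of the RNN.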

Best.
- ilker



Osman Mutlu

Apr 12, 2021, 10:17:26 AM
to ILKER KESEN, IBRAHIM SHOER, EKIN TIRE, GURKAN SOYKAN, Comp542 Class
Hi Oyku,

This loss difference is expected, since vocabulary implementations from the previous assignment may differ; it is not a problem. To quote the assignment: "Your loss can be slightly different due to different ordering of words in the vocabulary.".

Best,
Osman


Batuhan Ozyurt

May 1, 2021, 4:58:42 PM
to Osman Mutlu, ILKER KESEN, IBRAHIM SHOER, EKIN TIRE, GURKAN SOYKAN, Comp542 Class
Dear all,

I am having trouble with the out-of-memory issue in the ATTN assignment, too. The default batch size in the assignment is 64; I tried reducing it to 16 but still get the error during training. I also tried GC.gc(true) and CUDA.reclaim(). I get this error on a Tesla T4 on the HPC cluster. Did any of you figure out how to fix this? Any help would be appreciated. Thank you.

Best regards,
Batuhan

GURKAN SOYKAN

May 1, 2021, 5:10:34 PM
to Batuhan Ozyurt, Comp542 Class
Dear Batuhan, 
I had a somewhat similar issue with both memory and time constraints.
I used BenchmarkTools to profile my code, and realized I was doing some heavy concatenations and memory allocations.
Most of my choke points were in the batching part and the loss calculation.
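Even without BenchmarkTools, Base's `@allocated` can expose this kind of choke point. A small generic illustration of why repeated concatenation is expensive (nothing assignment-specific):

```julia
# Growing an array with cat/vcat copies everything on every iteration
# (quadratic total work), while push! amortizes the growth.
function grow_with_cat(n)
    a = Int[]
    for i in 1:n
        a = vcat(a, i)      # allocates a fresh array each time
    end
    return a
end

function grow_with_push(n)
    a = Int[]
    for i in 1:n
        push!(a, i)         # amortized O(1) append
    end
    return a
end
```

Comparing `@allocated grow_with_cat(10_000)` against `@allocated grow_with_push(10_000)` makes the difference obvious; BenchmarkTools' `@btime` gives more stable timings.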
Hope this helps.
Best.

Deniz Yuret

May 3, 2021, 6:50:22 AM
to GURKAN SOYKAN, bozy...@ku.edu.tr, com...@ku.edu.tr
I just tested the reference solution on a T4 and I am attaching the training log to this email. The 15GB on the T4 seems to be sufficient to complete the training of 10 epochs in under 3 hours.

I also wanted to point out that the default amount of CPU memory was reduced on the cluster recently. If your problem is with CPU memory I suggest starting your job with more memory (e.g. 16/32 GB) using the --mem=16GB option.
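For reference, requesting more CPU memory in a Slurm batch script would look something like this (the gres name and the `train.jl` entry point below are hypothetical; check the cluster's own documentation for the exact partition and GPU names):

```shell
#!/bin/bash
#SBATCH --mem=16GB            # request 16 GB of CPU RAM (or 32GB if needed)
#SBATCH --gres=gpu:1          # one GPU; gres names vary per cluster
#SBATCH --time=03:00:00       # reference run finished in under 3 hours

julia train.jl                # hypothetical entry point for the assignment
```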

t4-timing.log

Batuhan Ozyurt

May 3, 2021, 4:07:44 PM
to Deniz Yuret, GURKAN SOYKAN, Comp542 Class
Thank you Gürkan, I found the mistake with your help: in my loss function I called "cat" inside the for loop, and that was the problem. I changed it to "push!", which I believe reduced the complexity. Also thank you Deniz Hocam for your support; the problem was indeed with GPU memory.
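The fix described here (collect pieces with push! and concatenate once after the loop, instead of calling cat on every iteration) looks roughly like this. The per-step output below is a hypothetical stand-in for the real decoder/loss code:

```julia
# Instead of `out = cat(out, step_output; dims=2)` inside the loop,
# collect the pieces and do a single concatenation at the end.
function collect_then_cat(steps)
    pieces = Vector{Matrix{Float64}}()
    for t in 1:steps
        step_output = fill(Float64(t), 3, 1)  # stand-in for one decoder step
        push!(pieces, step_output)            # O(1) append, no copying
    end
    return hcat(pieces...)                    # one concatenation at the end
end
```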

Best,
Batuhan