CUDA error: 'out of memory' during the training of rnnlm

aymsagul ablimit

Aug 23, 2019, 3:25:49 AM
to kaldi-help
I am training an RNNLM. The training should run for 200 iterations in total, but at iteration 173 it breaks with an "out of memory" error. I am using just one GPU, and in cmd.sh I adjusted the CUDA memory parameter:

export cuda_cmd="run.pl --gpu 1 --mem 8G"

The error message is the following:

nnet3-copy --learning-rate=0.000548086893011532 /share/temp/exp_kaldi_RNNLM/exp/rnnlm_lstm_tdnn_7gb_60k_with_different_Weight_ilse900_cc1/263.raw -
LOG (rnnlm-train[5.5.433~1453-7637d]:PrintMemoryUsage():cu-allocator.cc:368) Memory usage: 0/0 bytes currently allocated/total-held; 0/0 blocks currently allocated/free; largest free/allocated block sizes are 0/0; time taken total/cudaMalloc is 0/0.013484, synchronized the GPU 0 times out of 0 frees; device memory info: free:5132M, used:5856M, total:10989M, free/total:0.467051 maximum allocated: 0 current allocated: 0
ERROR (rnnlm-train[5.5.433~1453-7637d]:AllocateNewRegion():cu-allocator.cc:519) Failed to allocate a memory region of 5384437760 bytes.  Possibly this is due to sharing the GPU.  Try switching the GPUs to exclusive mode (nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like steps/nnet3/chain/train.py.  Memory info: free:10268M, used:720M, total:10989M, free/total:0.934409 CUDA error: 'out of memory'

[ Stack-Trace: ]
/share/documents/abulimit/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0xb42) [0x7f34d68eb6a2]
rnnlm-train(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x559a73a47e89]
/share/documents/abulimit/kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuMemoryAllocator::AllocateNewRegion(unsigned long)+0x46f) [0x7f34d6dd69ef]
/share/documents/abulimit/kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuMemoryAllocator::MallocPitch(unsigned long, unsigned long, unsigned long*)+0x4a6) [0x7f34d6dd72de]
/share/documents/abulimit/kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuMatrix<float>::Resize(int, int, kaldi::MatrixResizeType, kaldi::MatrixStrideType)+0x296) [0x7f34d6d97fa8]
/share/documents/abulimit/kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuMatrix<float>::Swap(kaldi::Matrix<float>*)+0x6e) [0x7f34d6d99282]
/share/documents/abulimit/kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuMatrix<float>::Read(std::istream&, bool)+0x5b) [0x7f34d6d99427]
/share/documents/abulimit/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NaturalGradientAffineComponent::Read(std::istream&, bool)+0x73) [0x7f34d8339b03]
/share/documents/abulimit/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::Component::ReadNew(std::istream&, bool)+0xc4) [0x7f34d832a874]
/share/documents/abulimit/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::Nnet::Read(std::istream&, bool)+0xc7f) [0x7f34d83b5b81]
rnnlm-train(main+0x8ee) [0x559a73a46ae8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f34d5d55b97]
rnnlm-train(_start+0x2a) [0x559a73a4611a]

WARNING (rnnlm-train[5.5.433~1453-7637d]:Close():kaldi-io.cc:515) Pipe nnet3-copy --learning-rate=0.000548086893011532 /share/temp/exp_kaldi_RNNLM/exp/rnnlm_lstm_tdnn_7gb_60k_with_different_Weight_ilse900_cc1/263.raw -| had nonzero return status 36096
kaldi::KaldiFatalError
# Accounting: time=2 threads=1
# Ended (code 255) at Fri Aug 23 01:59:12 CEST 2019, elapsed time 2 seconds


During the training, only one process was running on the GPU.

How can I solve this problem? Does anybody have suggestions? Thank you in advance.
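
For reference, the exclusive-mode change that the error message suggests would look roughly like this (just a sketch: nvidia-smi -c 3 normally needs root, and the GPU index 0 below is only an example):

# Show what is running on the GPU and its current compute mode.
nvidia-smi
nvidia-smi --query-gpu=index,compute_mode --format=csv

# Put GPU 0 into exclusive-process mode (-c 3 = EXCLUSIVE_PROCESS),
# as suggested by the Kaldi error message above.
sudo nvidia-smi -i 0 -c 3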

Daniel Povey

Aug 23, 2019, 12:55:24 PM
to kaldi-help
The `--mem 8g` option won't make any difference if you are using run.pl, and anyway it refers to CPU memory.

If I were you I'd just start from that iteration using the --stage option.  Probably it's a non-repeatable error or driver issue.
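
A sketch of that kind of restart, assuming the run was launched with the standard rnnlm/train_rnnlm.sh wrapper and that its --stage option corresponds to the iteration to resume from (check your own script; the directory is the one from the log, and 173 is the iteration reported in the original post):

# Hypothetical resume from the failed iteration; keep all other options
# identical to the original training command.
rnnlm/train_rnnlm.sh --stage 173 \
  /share/temp/exp_kaldi_RNNLM/exp/rnnlm_lstm_tdnn_7gb_60k_with_different_Weight_ilse900_cc1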



Pema Galey

Jul 7, 2020, 11:54:15 PM
to kaldi-help
Hi,
How did you solve this issue?
I am getting a similar error while training a chain model.

aymsagul ablimit

Jul 8, 2020, 5:17:01 AM
to kaldi-help
That was probably a driver issue. I just ran the training again (from the iteration where the training broke) and it worked.