GPU out of memory error


Jaskaran Singh Puri

Apr 30, 2019, 12:19:25 AM
to kaldi-help
I'm training an nnet3 model, but I'm getting a "GPU out of memory" error. The Nvidia GPU I have has 16 GB of memory, yet Kaldi still fails to allocate 3 GB when it needs it.

It also says to run the GPU in exclusive mode, which I cannot do because I don't have root permissions. Is there another way around this? I've already reduced the minibatch size from 128 to 32.
Should I reduce it further?

Please guide me.

Daniel Povey

Apr 30, 2019, 12:21:04 AM
to kaldi-help
You may be able to get around it by reducing the number of jobs (e.g. --num-jobs-initial and --num-jobs-final) to no more than the number of GPUs you have (e.g. 1).  The results may change though.
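For illustration, a minimal sketch of how that might be passed when calling the chain training script directly (option names here are as I recall them from steps/nnet3/chain/train.py and may differ in your version; most run_tdnn.sh-style recipes expose them as shell variables instead):

steps/nnet3/chain/train.py \
  --trainer.optimization.num-jobs-initial 1 \
  --trainer.optimization.num-jobs-final 1 \
  ...   # the rest of your usual training options, unchanged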



Jaskaran Singh Puri

Apr 30, 2019, 12:23:55 AM
to kaldi-help
Thanks, but are you saying it may have a significant impact on the WER?



Daniel Povey

Apr 30, 2019, 12:27:37 AM
to kaldi-help
It may have some impact as it could affect the tuning.  You should probably reduce the number of epochs by about 25% or so.
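For example (option name as I recall it; recipes usually set this through a num_epochs variable that ends up as --trainer.num-epochs): if you were training with 4 epochs, try 3.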



Jaskaran Singh Puri

Jun 22, 2019, 11:55:26 PM
to kaldi-help
So I'm running into this issue again. Kaldi is trying to allocate around 16 GB on the GPU, whereas I can see 32 GB of free memory in the GPU logs. I have both num-jobs parameters set to 1, and the batch size is 128.

I still can't run the GPU in exclusive mode. What could be a possible workaround here? Do I have to increase the number of GPUs, or reduce the batch size?

Daniel Povey

Jun 23, 2019, 11:20:14 AM
to kaldi-help
Sounds to me like you are not accurately describing the problem.  You should always show a screen paste.



Jaskaran Singh Puri

Jun 24, 2019, 2:51:29 AM
to kaldi-help
nnet3-chain-train --use-gpu=yes --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/cache.2722 --write-cache=/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/cache.2723 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=2722 'nnet3-am-copy --raw=true --learning-rate=0.0002972304523027908 --scale=1.0 /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/2722.mdl - |' /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/den.fst 'ark,bg:nnet3-chain-copy-egs                          --frame-shift=2                         ark:/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/egs/cegs.427.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=2722 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=64 ark:- ark:- |' /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/2723.1.raw
WARNING (nnet3-chain-train[5.5]:SelectGpuId():cu-device.cc:211) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:331) Selecting from 1 GPUs
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:346) cudaSetDevice(0): Tesla V100-SXM2-32GB free:32162M, used:318M, total:32480M, free/total:0.990198
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:393) Trying to select device: 0 (automatically), mem_ratio: 0.990198
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:412) Success selecting device 0 free mem ratio: 0.990198
LOG (nnet3-chain-train[5.5]:FinalizeActiveGpu():cu-device.cc:266) The active GPU is [0]: Tesla V100-SXM2-32GB free:31968M, used:512M, total:32480M, free/total:0.984225 version 7.0
nnet3-am-copy --raw=true --learning-rate=0.0002972304523027908 --scale=1.0 /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/2722.mdl -
LOG (nnet3-chain-train[5.5]:PrintMemoryUsage():cu-allocator.cc:368) Memory usage: 0/0 bytes currently allocated/total-held; 0/0 blocks currently allocated/free; largest free/allocated block sizes are 0/0; time taken total/cudaMalloc is 0/0.503543, synchronized the GPU 0 times out of 0 frees; device memory info: free:31968M, used:512M, total:32480M, free/total:0.984225maximum allocated: 0current allocated: 0
ERROR (nnet3-chain-train[5.5]:AllocateNewRegion():cu-allocator.cc:519) Failed to allocate a memory region of 16761487360 bytes.  Possibly this is due to sharing the GPU.  Try switching the GPUs to exclusive mode (nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like steps/nnet3/chain/train.py.  Memory info: free:31968M, used:512M, total:32480M, free/total:0.984225


So I'm running this training with a batch size of 64, reduced from 128, and have the final-jobs parameter set to 1, i.e. the same as the number of GPUs.
I can't run this in exclusive mode due to the lack of root permissions. Is there any workaround for this? Should I keep reducing the batch size further?

It stopped at the 2700th iteration with a batch size of 64, and at the 700th iteration with a batch size of 128.

Please guide me.

Daniel Povey

Jun 24, 2019, 11:25:10 AM
to kaldi-help
It says it's trying to allocate 16G on a GPU with 32G of memory (which is a lot!), and essentially no memory on the GPU is currently being used (only 512M, which is likely reserved for system usage).  This doesn't really make sense.  I suspect a driver bug.
You could probably make it work just by starting again at that same iteration with the --stage option.  Likely the error is not repeatable.

Dan
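For illustration, resuming might look roughly like this; the script name and the --stage value are placeholders, and 2722 is just the iteration from the log above:

local/chain/run_tdnn.sh --stage 12 --train-stage 2722

In most recipes --train-stage is forwarded to train.py as its --stage option, so training picks up again at that iteration.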



Justin Luitjens

Jun 24, 2019, 11:28:06 AM
to kaldi...@googlegroups.com
Maybe the allocation is too large?  Can you try lowering the GPU memory proportion?

For example:
 --cuda-memory-proportion=0.1
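For instance, appended to the nnet3-chain-train command from the log above (assuming your Kaldi build registers the allocator options on that binary):

nnet3-chain-train --cuda-memory-proportion=0.1 --use-gpu=yes ...   # rest of the options as before

The failed request in your log (16761487360 bytes out of roughly 32 GB free) is about half of the free memory, so a smaller proportion should shrink the region the allocator asks for up front.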

Jaskaran Singh Puri

Jun 24, 2019, 12:50:50 PM
to kaldi-help
But this happened twice, as mentioned above.

Daniel Povey

Jun 24, 2019, 12:52:19 PM
to kaldi-help
Yes but not repeatably.  Likely driver or hardware issue.  Not Kaldi related, most likely.


Justin Luitjens

Jun 24, 2019, 1:40:06 PM
to kaldi...@googlegroups.com
Can you modify the error output to also output the error string?

i.e. in cu-allocator.cc add the line below:

 if (e != cudaSuccess) {
    PrintMemoryUsage();
    KALDI_ERR << "Failed to allocate memory.  CUDA error is " << cudaGetErrorString(e);  // ADD THIS LINE
    if (!CuDevice::Instantiate().IsComputeExclusive()) {
      KALDI_ERR << "Failed to allocate a memory region of " << region_size

                << " bytes.  Possibly this is due to sharing the GPU.  Try "
                << "switching the GPUs to exclusive mode (nvidia-smi -c 3) and using "
                << "the option --use-gpu=wait to scripts like "
                << "steps/nnet3/chain/train.py.  Memory info: "
                << mem_info;



Justin Luitjens

Jun 24, 2019, 10:51:17 PM
to kaldi...@googlegroups.com
A change to add the error string output was just merged in.  Please update and reproduce and provide the error message.

Jaskaran Singh Puri

Jun 30, 2019, 8:46:01 AM
to kaldi-help
The image has to be compiled again, right? I don't see the error message that I added to the .cc file anywhere in the train.xx.log files.



Justin Luitjens

Jun 30, 2019, 9:00:54 AM
to kaldi...@googlegroups.com
Yes, get the latest source and recompile.
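A rough sketch of what that usually involves (the path and -j value are placeholders; if you build inside a container image, the image itself has to be rebuilt or the source recompiled inside it):

cd /path/to/kaldi
git pull
cd src
make depend -j 8
make -j 8          # or run 'make clean' first if you want a fully clean build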


Jaskaran Singh Puri

Jul 14, 2019, 4:13:20 AM
to kaldi-help
I'm still getting the same error, and the line at https://github.com/kaldi-asr/kaldi/blob/master/src/cudamatrix/cu-allocator.cc, line 525:

<< " CUDA error: '" << cudaGetErrorString(e) << "'";

was not printed in my log file.

Justin Luitjens

Jul 14, 2019, 8:17:27 AM
to kaldi...@googlegroups.com
Are you sure you have the latest source?  If so, make a clean build.  If that doesn’t work, please include the full output.

Sent from my iPhone

Jaskaran Singh Puri

Jul 14, 2019, 1:12:28 PM
to kaldi-help
Yes, I got it compiled a couple of days ago.

The following is the log:


# nnet3-chain-train --use-gpu=yes --verbose=1 --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/cache.780 --write-cache=/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/cache.781 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=780 "nnet3-am-copy --raw=true --learning-rate=0.0007063383326394145 --scale=1.0 /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/780.mdl - |" /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/den.fst "ark,bg:nnet3-chain-copy-egs                          --frame-shift=1                         ark:/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/egs/cegs.207.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=780 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=128 ark:- ark:- |" /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/781.1.raw
# Started at Wed Jul 10 20:23:34 UTC 2019
#
nnet3-chain-train --use-gpu=yes --verbose=1 --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/cache.780 --write-cache=/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/cache.781 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=780 'nnet3-am-copy --raw=true --learning-rate=0.0007063383326394145 --scale=1.0 /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/780.mdl - |' /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/den.fst 'ark,bg:nnet3-chain-copy-egs --frame-shift=1 ark:/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/egs/cegs.207.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=780 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=128 ark:- ark:- |' /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/781.1.raw
WARNING (nnet3-chain-train[5.5]:SelectGpuId():cu-device.cc:221) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:349) Selecting from 1 GPUs
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:364) cudaSetDevice(0): Tesla V100-SXM2-16GB free:15812M, used:318M, total:16130M, free/total:0.980263
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:411) Trying to select device: 0 (automatically), mem_ratio: 0.980263
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:430) Success selecting device 0 free mem ratio: 0.980263
LOG (nnet3-chain-train[5.5]:FinalizeActiveGpu():cu-device.cc:284) The active GPU is [0]: Tesla V100-SXM2-16GB free:15646M, used:484M, total:16130M, free/total:0.969971 version 7.0
nnet3-am-copy --raw=true --learning-rate=0.0007063383326394145 --scale=1.0 /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/780.mdl -
LOG (nnet3-chain-train[5.5]:PrintMemoryUsage():cu-allocator.cc:368) Memory usage: 0/0 bytes currently allocated/total-held; 0/0 blocks currently allocated/free; largest free/allocated block sizes are 0/0; time taken total/cudaMalloc is 0/0.283798, synchronized the GPU 0 times out of 0 frees; device memory info: free:15646M, used:484M, total:16130M, free/total:0.969971maximum allocated: 0current allocated: 0
ERROR (nnet3-chain-train[5.5]:AllocateNewRegion():cu-allocator.cc:519) Failed to allocate a memory region of 8204058624 bytes.  Possibly this is due to sharing the GPU.  Try switching the GPUs to exclusive mode (nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like steps/nnet3/chain/train.py.  Memory info: free:15646M, used:484M, total:16130M, free/total:0.969971


[ Stack-Trace: ]
kaldi::MessageLogger::LogMessage() const
kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)
kaldi::CuMemoryAllocator::AllocateNewRegion(unsigned long)
kaldi::CuMemoryAllocator::MallocPitch(unsigned long, unsigned long, unsigned long*)
kaldi::CuMatrix<float>::Resize(int, int, kaldi::MatrixResizeType, kaldi::MatrixStrideType)
kaldi::CuMatrix<float>::Swap(kaldi::Matrix<float>*)
kaldi::CuMatrix<float>::Read(std::istream&, bool)
kaldi::nnet3::FixedAffineComponent::Read(std::istream&, bool)
kaldi::nnet3::Component::ReadNew(std::istream&, bool)
kaldi::nnet3::Nnet::Read(std::istream&, bool)
main
__libc_start_main
_start


WARNING (nnet3-chain-train[5.5]:Close():kaldi-io.cc:515) Pipe nnet3-am-copy --raw=true --learning-rate=0.0007063383326394145 --scale=1.0 /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/780.mdl - | had nonzero return status 13
kaldi::KaldiFatalError
# Accounting: time=3 threads=1
# Ended (code 255) at Wed Jul 10 20:23:37 UTC 2019, elapsed time 3 seconds



Daniel Povey

Jul 14, 2019, 1:53:42 PM
to kaldi-help
If it happens only occasionally, it could be that two jobs are simultaneously trying to allocate memory, like a race condition.  Setting the GPU to exclusive mode and running train.py with --use-gpu=wait would fix it; reducing --cuda-memory-proportion to, say, 0.25 might help too.  In any case you can restart from where it failed, using the --stage and --train-stage options to the run_xxx.sh script (or the --stage option to train.py).
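For concreteness, a hedged sketch of the exclusive-mode route (the nvidia-smi step needs root, which is exactly the constraint here, so it would have to be done by whoever administers the machine; pass --use-gpu=wait wherever your recipe forwards options to train.py):

nvidia-smi -c 3                                           # once, as root: put the GPU in compute-exclusive mode
steps/nnet3/chain/train.py --use-gpu=wait --stage 780 ... # resume at the failed iteration, waiting for a free GPU

The 780 here is just the iteration from the last log; use whatever iteration your run actually failed at.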

