You may be able to get around it by reducing the number of jobs (e.g. --num-jobs-initial and --num-jobs-final) to no more than the number of GPUs you have (e.g. 1). The results may change, though.
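(In steps/nnet3/chain/train.py those job counts are usually passed as the nested options sketched below; the exact flag spellings are an assumption here, so confirm against your local copy with train.py --help:)

    steps/nnet3/chain/train.py \
      --trainer.optimization.num-jobs-initial 1 \
      --trainer.optimization.num-jobs-final 1 \
      ...    # rest of the options as in your recipe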
On Tue, Apr 30, 2019 at 12:19 AM Jaskaran Singh Puri <jaskar...@gmail.com> wrote:
I'm training an nnet3 model. However, I'm getting a "GPU out of memory" error: the NVIDIA GPU I have has 16 GB, yet Kaldi still fails to allocate 3 GB when required. It also says to run the GPU in exclusive mode, which I cannot do as I do not have root permissions. Is there another way around this? I've already reduced the minibatch size from 128 to 32. Should I reduce it more? Please guide.
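(For context: the minibatch size mentioned above is what surfaces as nnet3-chain-merge-egs --minibatch-size=... in the logs below. At the script level it is typically set along these lines; the flag spelling is an assumption, so check your recipe:)

    steps/nnet3/chain/train.py \
      --trainer.num-chunk-per-minibatch 32 \
      ...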
nnet3-chain-train --use-gpu=yes --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/cache.2722 --write-cache=/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/cache.2723 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=2722 'nnet3-am-copy --raw=true --learning-rate=0.0002972304523027908 --scale=1.0 /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/2722.mdl - |' /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/den.fst 'ark,bg:nnet3-chain-copy-egs --frame-shift=2 ark:/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/egs/cegs.427.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=2722 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=64 ark:- ark:- |' /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/2723.1.raw
WARNING (nnet3-chain-train[5.5]:SelectGpuId():cu-device.cc:211) Not in compute-exclusive mode. Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:331) Selecting from 1 GPUs
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:346) cudaSetDevice(0): Tesla V100-SXM2-32GB free:32162M, used:318M, total:32480M, free/total:0.990198
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:393) Trying to select device: 0 (automatically), mem_ratio: 0.990198
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:412) Success selecting device 0 free mem ratio: 0.990198
LOG (nnet3-chain-train[5.5]:FinalizeActiveGpu():cu-device.cc:266) The active GPU is [0]: Tesla V100-SXM2-32GB free:31968M, used:512M, total:32480M, free/total:0.984225 version 7.0
nnet3-am-copy --raw=true --learning-rate=0.0002972304523027908 --scale=1.0 /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/2722.mdl -
LOG (nnet3-chain-train[5.5]:PrintMemoryUsage():cu-allocator.cc:368) Memory usage: 0/0 bytes currently allocated/total-held; 0/0 blocks currently allocated/free; largest free/allocated block sizes are 0/0; time taken total/cudaMalloc is 0/0.503543, synchronized the GPU 0 times out of 0 frees; device memory info: free:31968M, used:512M, total:32480M, free/total:0.984225
maximum allocated: 0
current allocated: 0
ERROR (nnet3-chain-train[5.5]:AllocateNewRegion():cu-allocator.cc:519) Failed to allocate a memory region of 16761487360 bytes. Possibly this is due to sharing the GPU. Try switching the GPUs to exclusive mode (nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like steps/nnet3/chain/train.py. Memory info: free:31968M, used:512M, total:32480M, free/total:0.984225
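(The two remedies that error message names look roughly like this; the first needs root, which is exactly what the poster lacks:)

    sudo nvidia-smi -c 3                             # set EXCLUSIVE_PROCESS compute mode (needs root)
    steps/nnet3/chain/train.py --use-gpu=wait ...    # wait for a free GPU instead of aborting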
A change to add the error string output was just merged in. Please update, reproduce, and provide the error message.
On Mon, Jun 24, 2019 at 11:39 AM Justin Luitjens <luit...@gmail.com> wrote:
Can you modify the error output to also output the error string? I.e., in cu-allocator.cc, add the line marked below:

    if (e != cudaSuccess) {
      PrintMemoryUsage();
      // ADD THIS LINE: report the CUDA error string as well.
      KALDI_ERR << "Failed to allocate memory. CUDA error is "
                << cudaGetErrorString(e);
      if (!CuDevice::Instantiate().IsComputeExclusive()) {
        KALDI_ERR << "Failed to allocate a memory region of " << region_size
                  << " bytes. Possibly this is due to sharing the GPU. Try "
                  << "switching the GPUs to exclusive mode (nvidia-smi -c 3) and using "
                  << "the option --use-gpu=wait to scripts like "
                  << "steps/nnet3/chain/train.py. Memory info: "
                  << mem_info;
      }
    }
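(For anyone unfamiliar with cudaGetErrorString, here is a minimal standalone sketch of the same pattern, independent of Kaldi; the oversized allocation is deliberate, just to force the error path:)

    // sketch.cu -- compile with: nvcc sketch.cu -o sketch
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
      void *p = nullptr;
      size_t bytes = 1ULL << 40;  // 1 TiB: deliberately too large for any current GPU
      cudaError_t e = cudaMalloc(&p, bytes);
      if (e != cudaSuccess) {
        // cudaGetErrorString() turns the numeric error code into readable
        // text, e.g. "out of memory".
        std::fprintf(stderr, "cudaMalloc of %zu bytes failed: %s\n",
                     bytes, cudaGetErrorString(e));
        return 1;
      }
      cudaFree(p);
      return 0;
    }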
On Mon, Jun 24, 2019 at 10:52 AM Daniel Povey <dpo...@gmail.com> wrote:
Yes, but not repeatably. Likely a driver or hardware issue. Not Kaldi-related, most likely.
On Mon, Jun 24, 2019 at 12:50 PM Jaskaran Singh Puri <jaskar...@gmail.com> wrote:
But this happened twice, as mentioned above.
# nnet3-chain-train --use-gpu=yes --verbose=1 --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/cache.780 --write-cache=/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/cache.781 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=780 "nnet3-am-copy --raw=true --learning-rate=0.0007063383326394145 --scale=1.0 /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/780.mdl - |" /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/den.fst "ark,bg:nnet3-chain-copy-egs --frame-shift=1 ark:/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/egs/cegs.207.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=780 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=128 ark:- ark:- |" /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/781.1.raw
# Started at Wed Jul 10 20:23:34 UTC 2019
#
nnet3-chain-train --use-gpu=yes --verbose=1 --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/cache.780 --write-cache=/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/cache.781 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=780 'nnet3-am-copy --raw=true --learning-rate=0.0007063383326394145 --scale=1.0 /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/780.mdl - |' /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/den.fst 'ark,bg:nnet3-chain-copy-egs --frame-shift=1 ark:/notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/egs/cegs.207.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=780 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=128 ark:- ark:- |' /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/781.1.raw
WARNING (nnet3-chain-train[5.5]:SelectGpuId():cu-device.cc:221) Not in compute-exclusive mode. Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:349) Selecting from 1 GPUs
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:364) cudaSetDevice(0): Tesla V100-SXM2-16GB free:15812M, used:318M, total:16130M, free/total:0.980263
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:411) Trying to select device: 0 (automatically), mem_ratio: 0.980263
LOG (nnet3-chain-train[5.5]:SelectGpuIdAuto():cu-device.cc:430) Success selecting device 0 free mem ratio: 0.980263
LOG (nnet3-chain-train[5.5]:FinalizeActiveGpu():cu-device.cc:284) The active GPU is [0]: Tesla V100-SXM2-16GB free:15646M, used:484M, total:16130M, free/total:0.969971 version 7.0
nnet3-am-copy --raw=true --learning-rate=0.0007063383326394145 --scale=1.0 /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/780.mdl -
LOG (nnet3-chain-train[5.5]:PrintMemoryUsage():cu-allocator.cc:368) Memory usage: 0/0 bytes currently allocated/total-held; 0/0 blocks currently allocated/free; largest free/allocated block sizes are 0/0; time taken total/cudaMalloc is 0/0.283798, synchronized the GPU 0 times out of 0 frees; device memory info: free:15646M, used:484M, total:16130M, free/total:0.969971
maximum allocated: 0
current allocated: 0
ERROR (nnet3-chain-train[5.5]:AllocateNewRegion():cu-allocator.cc:519) Failed to allocate a memory region of 8204058624 bytes. Possibly this is due to sharing the GPU. Try switching the GPUs to exclusive mode (nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like steps/nnet3/chain/train.py. Memory info: free:15646M, used:484M, total:16130M, free/total:0.969971
[ Stack-Trace: ]
kaldi::MessageLogger::LogMessage() const
kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)
kaldi::CuMemoryAllocator::AllocateNewRegion(unsigned long)
kaldi::CuMemoryAllocator::MallocPitch(unsigned long, unsigned long, unsigned long*)
kaldi::CuMatrix<float>::Resize(int, int, kaldi::MatrixResizeType, kaldi::MatrixStrideType)
kaldi::CuMatrix<float>::Swap(kaldi::Matrix<float>*)
kaldi::CuMatrix<float>::Read(std::istream&, bool)
kaldi::nnet3::FixedAffineComponent::Read(std::istream&, bool)
kaldi::nnet3::Component::ReadNew(std::istream&, bool)
kaldi::nnet3::Nnet::Read(std::istream&, bool)
main
__libc_start_main
_start
WARNING (nnet3-chain-train[5.5]:Close():kaldi-io.cc:515) Pipe nnet3-am-copy --raw=true --learning-rate=0.0007063383326394145 --scale=1.0 /notebooks/jpuri/training_v3/chain_300k/exp/chain/tdnn_7b/780.mdl - | had nonzero return status 13
kaldi::KaldiFatalError
# Accounting: time=3 threads=1
# Ended (code 255) at Wed Jul 10 20:23:37 UTC 2019, elapsed time 3 seconds
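(A back-of-the-envelope check on the numbers in that log, which supports the driver/hardware diagnosis rather than a true out-of-memory condition:

    requested region: 8204058624 bytes / 2^20 = 7824 MiB
    fraction of card: 7824 MiB / 16130 MiB total ~= 0.49
    reported free:    15646 MiB, i.e. about twice the request

Kaldi's caching allocator requesting a large chunk of total GPU memory up front, around half by default if memory serves, is expected behavior, and here the request was well under the reported free memory, so the cudaMalloc failure should not have happened on a healthy, unshared card. The earlier 32 GB log shows the same pattern: 16761487360 bytes = 15985 MiB ~= 0.49 x 32480 MiB.)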