Cannot allocate new memory while combining models


Xuerui Yang
Apr 8, 2019, 4:19:59 AM
to kaldi-help


Hi,

I've finished training a chain model and it is now in the model combination stage. However, after some computation, log/combine.log reports:

ERROR (nnet3-chain-combine[5.5]:AllocateNewRegion():cu-allocator.cc:505) Failed to allocate a memory region of 173015040 bytes.  Possibly smaller minibatch size would help.  Memory info: free:3M, used:11175M, total:11178M, free/total:0.000313101

I tried reducing the minibatch size from 64 to 32 and so on down to 1, but the result stays the same. However, if I use only a single iteration's model, for example 1000.mdl, for mkgraph or decoding, it works. It seems this process uses only one GPU for the combination; is there any way to make it use multiple GPUs?

The full log is as follows:

LOG (nnet3-chain-combine[5.5]:AllocateNewRegion():cu-allocator.cc:485) About to allocate new memory region of 173015040 bytes; current memory info is: free:169M, used:11009M, total:11178M, free/total:0.015163
LOG (nnet3-chain-combine[5.5]:AllocateNewRegion():cu-allocator.cc:485) About to allocate new memory region of 173015040 bytes; current memory info is: free:3M, used:11175M, total:11178M, free/total:0.000313101
LOG (nnet3-chain-combine[5.5]:PrintMemoryUsage():cu-allocator.cc:352) Memory usage: 9141741824/11465129984 bytes currently allocated/total-held; 990/49 blocks currently allocated/free; largest free/allocated block sizes are 173015040/173015040; time taken total/cudaMalloc is 0.0364385/0.0122967, synchronized the GPU 0 times out of 16028 frees; device memory info: free:3M, used:11175M, total:11178M, free/total:0.000313101
ERROR (nnet3-chain-combine[5.5]:AllocateNewRegion():cu-allocator.cc:505) Failed to allocate a memory region of 173015040 bytes.  Possibly smaller minibatch size would help.  Memory info: free:3M, used:11175M, total:11178M, free/total:0.000313101

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::FatalMessageLogger::~FatalMessageLogger()
kaldi::CuMemoryAllocator::AllocateNewRegion(unsigned long)
kaldi::CuMemoryAllocator::Malloc(unsigned long)
kaldi::CuMatrix<float>::Resize(int, int, kaldi::MatrixResizeType, kaldi::MatrixStrideType)
kaldi::nnet3::time_height_convolution::ConvolveForward(kaldi::nnet3::time_height_convolution::ConvolutionComputation const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float>*)
kaldi::nnet3::TimeHeightConvolutionComponent::Propagate(kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float>*) const
kaldi::nnet3::NnetComputer::ExecuteCommand()
kaldi::nnet3::NnetComputer::Run()
kaldi::nnet3::NnetChainComputeProb::Compute(kaldi::nnet3::NnetChainExample const&)
kaldi::nnet3::ComputeObjf(bool, bool, std::vector<kaldi::nnet3::NnetChainExample, std::allocator<kaldi::nnet3::NnetChainExample> > const&, kaldi::nnet3::Nnet const&, kaldi::chain::ChainTrainingOptions const&, fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > > const&, kaldi::nnet3::NnetChainComputeProb*)
kaldi::nnet3::ComputeObjf(bool, bool, std::vector<kaldi::nnet3::NnetChainExample, std::allocator<kaldi::nnet3::NnetChainExample> > const&, kaldi::nnet3::Nnet const&, kaldi::chain::ChainTrainingOptions const&, fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > > const&, kaldi::nnet3::NnetChainComputeProb*)
main
__libc_start_main
nnet3-chain-combine() [0x5bc789]


Thank you!

joseph.an...@gmail.com
Apr 8, 2019, 5:59:50 AM
to kaldi-help
Are you using the GPUs in exclusive process/thread mode? 
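For reference, the compute mode can be checked and, if needed, changed with nvidia-smi. This is a generic sketch, not commands taken from the poster's machine:

# show the current compute mode of each GPU (Default, Exclusive_Process, Prohibited, ...)
nvidia-smi --query-gpu=index,name,compute_mode --format=csv
# switch GPU 0 to exclusive-process mode (requires root); Kaldi's GPU jobs are normally
# run with each process having a GPU to itself in this mode
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS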

Daniel Povey
Apr 8, 2019, 12:53:35 PM
to kaldi-help
Possibly it is the model files themselves that are using up all the memory. I have never seen that happen before.
There are options/mechanisms that are supposed to keep it from using too many models, though. You could just remove some of the models from the command line manually; that's one way.
Without seeing the command line / screen output, I wouldn't know.


Dan
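
To illustrate the manual route suggested above (a sketch only; the experiment directory exp/chain/tdnn is a placeholder, not a path from this thread):

# print the combine command that failed
grep -m 1 'nnet3-chain-combine' exp/chain/tdnn/log/combine.log
# copy the command it prints and re-run it by hand, deleting some of the intermediate
# <iter>.mdl model arguments (e.g. keep only every third one); fewer models on the
# command line means less memory is needed during the combination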


Xuerui Yang
Apr 8, 2019, 10:28:10 PM
to kaldi-help
I modified the code based on this PR: https://github.com/hhadian/kaldi/pull/18. Changes are made only in chain-training.cc and chain-denominator.cc. The numerator computation is completed before the denominator computation, and nnet_output_deriv is passed into the denominator computation so that the boost coefficient can be taken into account.
 
The screen output is attached as 2019.4.9.PNG.



Xuerui Yang
Apr 8, 2019, 11:16:50 PM
to kaldi-help
I reduced the number of combined models from 23 to 8, and now it just fits in GPU memory. But I'm not sure whether there will be a performance reduction.
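
In case it helps others: many versions of steps/nnet3/chain/train.py expose an option that caps how many models the combination stage loads, commonly --trainer.max-models-combine; the option name here is an assumption, not confirmed in this thread, so check your script's --help. With it, the same reduction can be done without hand-editing the combine command, roughly like this:

# sketch only: pass the lower cap when invoking the recipe's train.py;
# [existing options] stands for whatever options the recipe already passes
steps/nnet3/chain/train.py [existing options] --trainer.max-models-combine 8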


Daniel Povey
Apr 8, 2019, 11:32:28 PM
to kaldi-help
It might not matter much.
You should have said at the outset that you had changed the code.
Maybe Hossein can help.

