Cannot allocate new memory while combining models

Xuerui Yang

Apr 8, 2019, 4:19:59 AM

to kaldi-help

Hi,

I've finished training a chain model, and it is now in the model combination stage. However, after some computation, the log/combine.log file reports:

ERROR (nnet3-chain-combine[5.5]:AllocateNewRegion():cu-allocator.cc:505) Failed to allocate a memory region of 173015040 bytes. Possibly smaller minibatch size would help. Memory info: free:3M, used:11175M, total:11178M, free/total:0.000313101

I tried reducing the minibatch size from 64 to 32 and so on, down to 1, but the result stays the same. However, if I use only one iteration's model, for example 1000.mdl, for mkgraph or decoding, it works. It seems this stage uses only one GPU for the combination. Is there any way to make it use multiple GPUs?

The full log is as follows:

LOG (nnet3-chain-combine[5.5]:AllocateNewRegion():cu-allocator.cc:485) About to allocate new memory region of 173015040 bytes; current memory info is: free:169M, used:11009M, total:11178M, free/total:0.015163

LOG (nnet3-chain-combine[5.5]:AllocateNewRegion():cu-allocator.cc:485) About to allocate new memory region of 173015040 bytes; current memory info is: free:3M, used:11175M, total:11178M, free/total:0.000313101

LOG (nnet3-chain-combine[5.5]:PrintMemoryUsage():cu-allocator.cc:352) Memory usage: 9141741824/11465129984 bytes currently allocated/total-held; 990/49 blocks currently allocated/free; largest free/allocated block sizes are 173015040/173015040; time taken total/cudaMalloc is 0.0364385/0.0122967, synchronized the GPU 0 times out of 16028 frees; device memory info: free:3M, used:11175M, total:11178M, free/total:0.000313101

ERROR (nnet3-chain-combine[5.5]:AllocateNewRegion():cu-allocator.cc:505) Failed to allocate a memory region of 173015040 bytes. Possibly smaller minibatch size would help. Memory info: free:3M, used:11175M, total:11178M, free/total:0.000313101

[ Stack-Trace: ]

kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)

kaldi::FatalMessageLogger::~FatalMessageLogger()

kaldi::CuMemoryAllocator::AllocateNewRegion(unsigned long)

kaldi::CuMemoryAllocator::Malloc(unsigned long)

kaldi::CuMatrix<float>::Resize(int, int, kaldi::MatrixResizeType, kaldi::MatrixStrideType)

kaldi::nnet3::time_height_convolution::ConvolveForward(kaldi::nnet3::time_height_convolution::ConvolutionComputation const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float>*)

kaldi::nnet3::TimeHeightConvolutionComponent::Propagate(kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float>*) const

kaldi::nnet3::NnetComputer::ExecuteCommand()

kaldi::nnet3::NnetComputer::Run()

kaldi::nnet3::NnetChainComputeProb::Compute(kaldi::nnet3::NnetChainExample const&)

kaldi::nnet3::ComputeObjf(bool, bool, std::vector<kaldi::nnet3::NnetChainExample, std::allocator<kaldi::nnet3::NnetChainExample> > const&, kaldi::nnet3::Nnet const&, kaldi::chain::ChainTrainingOptions const&, fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > > const&, kaldi::nnet3::NnetChainComputeProb*)

main

__libc_start_main

nnet3-chain-combine() [0x5bc789]

Thank you!
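The numbers in the error line can be checked with a short script. This is only a sketch: the helper names `parse_memory_info` and `shortfall_mib` are made up for illustration, and it assumes the `M` suffix in the log means MiB (1024 * 1024 bytes). Under that assumption the failed request is 165 MiB against 3 MiB free, on a GPU whose memory is already almost entirely in use.

```python
import re

# Fragment of the ERROR line from log/combine.log, copied from the post above.
LOG_LINE = (
    "Failed to allocate a memory region of 173015040 bytes. "
    "Possibly smaller minibatch size would help. "
    "Memory info: free:3M, used:11175M, total:11178M, free/total:0.000313101"
)

MIB = 2 ** 20  # assumption: "M" in the log means MiB


def parse_memory_info(line):
    """Return (requested_bytes, free_mb, used_mb, total_mb) parsed from the line."""
    requested = int(re.search(r"memory region of (\d+) bytes", line).group(1))
    free_mb, used_mb, total_mb = (
        int(re.search(rf"{key}:(\d+)M", line).group(1))
        for key in ("free", "used", "total")
    )
    return requested, free_mb, used_mb, total_mb


def shortfall_mib(line):
    """How many MiB the request exceeds the free GPU memory by."""
    requested, free_mb, _, _ = parse_memory_info(line)
    return requested / MIB - free_mb


if __name__ == "__main__":
    requested, free_mb, used_mb, total_mb = parse_memory_info(LOG_LINE)
    print(f"requested: {requested / MIB:.0f} MiB, free: {free_mb} MiB")
    print(f"short by:  {shortfall_mib(LOG_LINE):.0f} MiB")
```

Because the shortfall is much larger than one region and the card is already full before the request, shrinking the minibatch alone is unlikely to free enough memory, which matches what is described above.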

joseph.an...@gmail.com

Apr 8, 2019, 5:59:50 AM

to kaldi-help

Are you using the GPUs in exclusive process/thread mode?

Daniel Povey

Apr 8, 2019, 12:53:35 PM

to kaldi-help

Possibly it is the model files themselves that are using up all the memory. I have never seen that happen before.

There are options/mechanisms that are supposed to avoid it using too many models though. You could just remove some of the models from the command line manually, that's one way.

Without seeing the command line / screen output, I wouldn't know.

Dan
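One way to remove some of the models from the command line by hand is to keep an evenly spaced subset of the iterations. The sketch below is illustrative only: `pick_models_to_combine` is a hypothetical helper, not part of Kaldi, and the iteration range 978 to 1000 is an assumption (the thread mentions only 1000.mdl and 23 combined models). It does not reproduce whatever selection Kaldi's own scripts perform.

```python
def pick_models_to_combine(first_iter, last_iter, max_models):
    """Pick at most max_models iteration numbers, evenly spaced.

    The first and last iterations are always kept when max_models >= 2.
    """
    if max_models < 1:
        raise ValueError("max_models must be at least 1")
    candidates = list(range(first_iter, last_iter + 1))
    if len(candidates) <= max_models:
        return candidates
    if max_models == 1:
        return [last_iter]
    # Spread max_models indices evenly over the candidate list.
    step = (len(candidates) - 1) / (max_models - 1)
    picked = {candidates[round(i * step)] for i in range(max_models)}
    return sorted(picked)


if __name__ == "__main__":
    # Assumed range: 23 consecutive iterations ending at 1000.
    iters = pick_models_to_combine(978, 1000, max_models=8)
    print(" ".join(f"{i}.mdl" for i in iters))
```

The printed file names can then be substituted for the full model list in the combine command; the rest of that command is not shown in this thread, so it is left out here.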

--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/2ed8aca8-7c54-4231-b37f-e06744b9aa5a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Xuerui Yang

Apr 8, 2019, 10:28:10 PM

to kaldi-help

I modified the code based on this PR: https://github.com/hhadian/kaldi/pull/18. Changes were made only in chain-training.cc and chain-denominator.cc. The numerator computation is completed before the denominator, and nnet_output_deriv is passed to the denominator computation so that the boost coefficient can be taken into account.

The screen output is like:


Xuerui Yang

Apr 8, 2019, 11:16:50 PM

to kaldi-help

I reduced the number of combined models from 23 to 8, and it now just fits in GPU memory. But I'm not sure whether this will reduce performance.


Daniel Povey

Apr 8, 2019, 11:32:28 PM

to kaldi-help

It might not matter much.

You should have said at the outset that you had changed the code.

Maybe Hossein can help.

