I've finished training a chain model and it is now in the model combination stage. However, after some computation, the combination fails with an out-of-memory error on the GPU. I tried reducing the minibatch size from 64 to 32 and so on down to 1, but the result stays the same. If I use only a single iteration's model, for example 1000.mdl, for mkgraph or decoding, it works. It seems this stage uses only one GPU for the combination. Is there any way to make it use multiple GPUs?
The log/combine.log file says:
LOG (nnet3-chain-combine[5.5]:AllocateNewRegion():cu-allocator.cc:485) About to allocate new memory region of 173015040 bytes; current memory info is: free:169M, used:11009M, total:11178M, free/total:0.015163
LOG (nnet3-chain-combine[5.5]:AllocateNewRegion():cu-allocator.cc:485) About to allocate new memory region of 173015040 bytes; current memory info is: free:3M, used:11175M, total:11178M, free/total:0.000313101
LOG (nnet3-chain-combine[5.5]:PrintMemoryUsage():cu-allocator.cc:352) Memory usage: 9141741824/11465129984 bytes currently allocated/total-held; 990/49 blocks currently allocated/free; largest free/allocated block sizes are 173015040/173015040; time taken total/cudaMalloc is 0.0364385/0.0122967, synchronized the GPU 0 times out of 16028 frees; device memory info: free:3M, used:11175M, total:11178M, free/total:0.000313101
ERROR (nnet3-chain-combine[5.5]:AllocateNewRegion():cu-allocator.cc:505) Failed to allocate a memory region of 173015040 bytes. Possibly smaller minibatch size would help. Memory info: free:3M, used:11175M, total:11178M, free/total:0.000313101
[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::FatalMessageLogger::~FatalMessageLogger()
kaldi::CuMemoryAllocator::AllocateNewRegion(unsigned long)
kaldi::CuMemoryAllocator::Malloc(unsigned long)
kaldi::CuMatrix<float>::Resize(int, int, kaldi::MatrixResizeType, kaldi::MatrixStrideType)
kaldi::nnet3::time_height_convolution::ConvolveForward(kaldi::nnet3::time_height_convolution::ConvolutionComputation const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float>*)
kaldi::nnet3::TimeHeightConvolutionComponent::Propagate(kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float>*) const
kaldi::nnet3::NnetComputer::ExecuteCommand()
kaldi::nnet3::NnetComputer::Run()
kaldi::nnet3::NnetChainComputeProb::Compute(kaldi::nnet3::NnetChainExample const&)
kaldi::nnet3::ComputeObjf(bool, bool, std::vector<kaldi::nnet3::NnetChainExample, std::allocator<kaldi::nnet3::NnetChainExample> > const&, kaldi::nnet3::Nnet const&, kaldi::chain::ChainTrainingOptions const&, fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > > const&, kaldi::nnet3::NnetChainComputeProb*)
kaldi::nnet3::ComputeObjf(bool, bool, std::vector<kaldi::nnet3::NnetChainExample, std::allocator<kaldi::nnet3::NnetChainExample> > const&, kaldi::nnet3::Nnet const&, kaldi::chain::ChainTrainingOptions const&, fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > > const&, kaldi::nnet3::NnetChainComputeProb*)
main
__libc_start_main
nnet3-chain-combine() [0x5bc789]