My question is essentially: how much CUDA memory do the nnet1 programs typically consume? And are the messages in the log below indicative of how much memory is actually used, or is some memory usage hidden?
For context, the computer runs OS X 10.10.5 with a 1 GB GT 650M and CUDA driver 7.5.25. I'm trying to train an RBM with 704 visible units and 1024 hidden units, with the randomiser size set to 1000 and the minibatch size set to 100. By my calculations this should use about 21 MB of memory, and if the log messages are anything to go by, the actual use is about 48 MB; in either case far less than 1 GB. With hidden layers of size 512, the first RBM trains successfully but the second layer fails; only with hidden layers of size 128 can I successfully pretrain and fine-tune the network. So in addition to the questions above, I wonder whether I might be using the binaries incorrectly, or whether some other factor is at play (such as compute exclusive mode, which I cannot find a way to activate on my platform)?
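To make the kind of estimate I mean concrete, here is a rough back-of-the-envelope sketch in Python. The breakdown into buffers (parameters, momentum buffers, minibatch activations, randomiser buffer) and the function name are my own assumptions and are not taken from the Kaldi source, so the total depends on what is counted. Under any such accounting the figure stays far below 1 GB.

```python
# Rough GPU memory estimate for one RBM layer.
# ASSUMPTION: the list of buffers below is a hypothetical accounting,
# not a description of what rbm-train-cd1-frmshuff actually allocates.

BYTES_PER_FLOAT = 4  # assumes single-precision storage on the GPU


def rbm_memory_estimate(n_vis, n_hid, minibatch, randomizer):
    """Return a dict mapping buffer name to its rough size in bytes."""
    # Weight matrix plus visible and hidden bias vectors.
    parameters = n_vis * n_hid + n_vis + n_hid
    float_counts = {
        "parameters": parameters,
        # One correction/momentum buffer per parameter (assumed).
        "momentum_buffers": parameters,
        # Positive and negative phase activations for one minibatch.
        "minibatch_activations": 2 * minibatch * (n_vis + n_hid),
        # Feature rows held for shuffling (assumed randomizer_size rows).
        "randomizer_buffer": randomizer * n_vis,
    }
    return {name: n * BYTES_PER_FLOAT for name, n in float_counts.items()}


estimate = rbm_memory_estimate(n_vis=704, n_hid=1024, minibatch=100, randomizer=1000)
total_bytes = sum(estimate.values())

for name, size in estimate.items():
    print(f"{name:>22}: {size / 2**20:6.2f} MiB")
print(f"{'total':>22}: {total_bytes / 2**20:6.2f} MiB")
```

Counting additional temporaries, or more than one copy of the randomiser buffer, raises the total, but not by anything close to the two orders of magnitude needed to fill the card.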
nnet-initialize /Users/myname/tmp/master/wsj/exprmnt_predict_mask_debug/2_1024/1_rbm_proto /Users/myname/tmp/master/wsj/exprmnt_predict_mask_debug/2_1024/1_rbm_init
VLOG[1] (nnet-initialize:Init():nnet-nnet.cc:381) <NnetProto>
VLOG[1] (nnet-initialize:Init():nnet-nnet.cc:381) <Rbm> <InputDim> 704 <OutputDim> 1024 <VisibleType> gauss <HiddenType> bern <ParamStddev> 1
VLOG[1] (nnet-initialize:Init():nnet-nnet.cc:381) </NnetProto>
VLOG[1] (nnet-initialize:Init():nnet-nnet.cc:381)
LOG (nnet-initialize:main():nnet-initialize.cc:64) Written initialized model to /Users/myname/tmp/master/wsj/exprmnt_predict_mask_debug/2_1024/1_rbm_init
rbm-train-cd1-frmshuff --use-gpu=yes --verbose=1 --minibatch_size=100 --randomizer_size=1000 --num-iters=1 --learn-rate=0.01 --l2-penalty=0.0002 /Users/myname/tmp/master/wsj/exprmnt_predict_mask_debug/2_1024/1_rbm_init 'ark:copy-feats ark:/Users/myname/tmp/master/wsj/exprmnt_predict_mask_debug/2_1024/e_trn_mix.ark ark:- | normalise-feats-apply ark:/Users/myname/tmp/master/wsj/exprmnt_predict_mask_debug/2_1024/e_trn_mix_stats.ark ark:- ark:- | splice-feats --left-context=5 --right-context=5 ark:- ark:- |' /Users/myname/tmp/master/wsj/exprmnt_predict_mask_debug/2_1024/1_rbm
LOG (rbm-train-cd1-frmshuff:SelectGpuIdAuto():cu-device.cc:288) Selecting from 1 GPUs
WARNING (rbm-train-cd1-frmshuff:DeviceGetName():cu-device.cc:488) cannot open libcuda.so
WARNING (rbm-train-cd1-frmshuff:GetFreeMemory():cu-device.cc:443) cannot open libcuda.so
LOG (rbm-train-cd1-frmshuff:SelectGpuIdAuto():cu-device.cc:303) cudaSetDevice(0): Unknown GPU free:0M, used:0M, total:0M, free/total:1
LOG (rbm-train-cd1-frmshuff:SelectGpuIdAuto():cu-device.cc:352) Trying to select device: 0 (automatically), mem_ratio: 1
LOG (rbm-train-cd1-frmshuff:SelectGpuIdAuto():cu-device.cc:371) Success selecting device 0 free mem ratio: 1
WARNING (rbm-train-cd1-frmshuff:DeviceGetName():cu-device.cc:488) cannot open libcuda.so
WARNING (rbm-train-cd1-frmshuff:GetFreeMemory():cu-device.cc:443) cannot open libcuda.so
LOG (rbm-train-cd1-frmshuff:FinalizeActiveGpu():cu-device.cc:213) The active GPU is [0]: Unknown GPU free:0M, used:0M, total:0M, free/total:1 version 3.0
copy-feats ark:/Users/myname/tmp/master/wsj/exprmnt_predict_mask_debug/2_1024/e_trn_mix.ark ark:-
splice-feats --left-context=5 --right-context=5 ark:- ark:-
normalise-feats-apply ark:/Users/myname/tmp/master/wsj/exprmnt_predict_mask_debug/2_1024/e_trn_mix_stats.ark ark:- ark:-
LOG (rbm-train-cd1-frmshuff:Init():nnet-randomizer.cc:31) Seeding by srand with : 777
LOG (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:137) RBM TRAINING STARTED
LOG (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:140) Iteration 1/1
WARNING (rbm-train-cd1-frmshuff:RbmUpdate():nnet/nnet-rbm.h:319) Mismatch between pos_vis and neg_vis variances, danger of weight explosion. a) Reducing weights with scale 0.0337389 b) Lowering learning rate to 0.0045 [pos_vis_std:0.918402,neg_vis_std:27.2209]
VLOG[1] (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:234) Setting momentum 0.9 and learning rate 0.005 after processing 0.000277778h
LOG (rbm-train-cd1-frmshuff:PrintMemoryUsage():cu-allocator.cc:126)
Memory usage: 48272300 bytes currently allocated (max: 49652140); 30696336 currently in use by user (max: 36114320); 166/17492 calls to Malloc* resulted in CUDA calls.
LOG (rbm-train-cd1-frmshuff:PrintMemoryUsage():cu-allocator.cc:133) Time taken in cudaMallocPitch=0.0120091, in cudaMalloc=0.00150895, in cudaFree=0.0119398, in this->MallocPitch()=0.062773
WARNING (rbm-train-cd1-frmshuff:MallocPitchInternal():cu-allocator.cc:97) Allocation of 2816 x 1924 region failed: freeing some memory and trying again.
LOG (rbm-train-cd1-frmshuff:MallocPitchInternal():cu-allocator.cc:102) To avoid future problems like this, changing memory_factor from 1.5 to 1.1
ERROR (rbm-train-cd1-frmshuff:Randomize():cu-math.cc:113) cudaError_t 2 : "out of memory" returned from 'cudaGetLastError()'
WARNING (rbm-train-cd1-frmshuff:Close():kaldi-io.cc:465) Pipe copy-feats ark:/Users/myname/tmp/master/wsj/exprmnt_predict_mask_debug/2_1024/e_trn_mix.ark ark:- | normalise-feats-apply ark:/Users/myname/tmp/master/wsj/exprmnt_predict_mask_debug/2_1024/e_trn_mix_stats.ark ark:- ark:- | splice-feats --left-context=5 --right-context=5 ark:- ark:- | had nonzero return status 36096
ERROR (rbm-train-cd1-frmshuff:Randomize():cu-math.cc:113) cudaError_t 2 : "out of memory" returned from 'cudaGetLastError()'
[stack trace: ]
0 rbm-train-cd1-frmshuff 0x000000010966a0b7 _ZN5kaldi18KaldiGetStackTraceEv + 71
1 rbm-train-cd1-frmshuff 0x000000010966b33a _ZN5kaldi17KaldiErrorMessageD2Ev + 410
2 rbm-train-cd1-frmshuff 0x000000010966b4b5 _ZN5kaldi17KaldiErrorMessageD1Ev + 21
3 rbm-train-cd1-frmshuff 0x00000001094f1a93 _ZN5kaldi2cu9RandomizeIfEEvRKNS_12CuMatrixBaseIT_EERKNS_7CuArrayIiEEPS4_ + 1187
4 rbm-train-cd1-frmshuff 0x00000001094dab12 _ZN5kaldi5nnet116MatrixRandomizer9RandomizeERKSt6vectorIiSaIiEE + 322
5 rbm-train-cd1-frmshuff 0x000000010948377e main + 6126
6 libdyld.dylib 0x00007fff8d0b15c9 start + 1
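For reference, the size of the failed request in the log can be worked out as below. This is a sketch under assumptions: I read "2816 x 1924" as row width in bytes times number of rows (2816 is exactly 704 floats of 4 bytes), I ignore any pitch padding, and I take the card's 1 GB as 2**30 bytes. The "currently allocated" figure is copied from the PrintMemoryUsage line above.

```python
# Size of the allocation that failed, read from the MallocPitchInternal warning.
# ASSUMPTION: the two numbers are (row width in bytes) x (number of rows).
row_bytes, num_rows = 2816, 1924
failed_request = row_bytes * num_rows

# 2816 bytes per row corresponds to 704 single-precision values,
# which matches the RBM's visible dimension.
floats_per_row = row_bytes // 4

# "currently allocated" as reported by PrintMemoryUsage in the log.
already_allocated = 48272300

# ASSUMPTION: 1 GB card taken as 2**30 bytes; other processes and the
# display's own usage are not accounted for here.
device_total = 2**30

needed = already_allocated + failed_request
print(f"failed request : {failed_request} bytes ({failed_request / 2**20:.2f} MiB)")
print(f"floats per row : {floats_per_row}")
print(f"total if granted: {needed} bytes ({needed / 2**20:.2f} MiB)")
print(f"fraction of card: {needed / device_total:.3f}")
```

If this reading is right, the request that triggered the out-of-memory error would have brought the total to only about 5% of the card's nominal memory, which is why I suspect that something other than these allocations is occupying the GPU.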