Increased GPU memory usage of nnet-train-simple

András Balog

Oct 3, 2018, 8:54:50 AM
to kaldi-help
Dear Dan!

The issue I found is that after I reinstalled Kaldi, the GPU memory usage of the DNN training processes increased dramatically.
I've done some investigation to describe (and, if possible, solve or avoid) the problem.
Normally my DNN training process uses 200-300 MB of GPU memory, so nvidia-smi ( | grep nnet-train-simple) shows something like this:
|    0      6737      C   nnet-train-simple                            249MiB |

I've found that 8c0e3e3119a45d3cb8b4b60a62acecabf9154d91 ("Refactor CUDA allocator code based on large cached regions") is the commit after which it turns into:
|    0      6980      C   nnet-train-simple                           4129MiB |

I tried the latest NVIDIA driver and CUDA (410.48 / 10.0) as well as older versions (384.145 / 9.0), but it made no difference. The GPU is a GTX 1070.

I logged how the allocated memory grows by running nvidia-smi every 0.25 seconds. The logging started just before nnet-train-simple was launched from train_pnorm_fast.sh.
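(Roughly, the logging loop was something like the following; this is a sketch rather than the exact script I used, and the log file name is just an example:)

while true; do
  nvidia-smi | grep nnet-train-simple >> gpu_mem.log
  sleep 0.25
done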
This is what I got in the first (good) case:

steps/nnet2/train_pnorm_fast.sh --num-threads 1 --num-jobs-nnet 1 --num-hidden-layers 6 --p 4 --pnorm-input-dim 2000 --pnorm-output-dim 400 --stage 0 --cmd run.pl --cleanup false --minibatch-size 512 --initial-learning-rate 0.1 --final-learning-rate 0.01 data/train data/lang exp/tri3_ali exp/pnorm6x2000
steps/nnet2/train_pnorm_fast.sh: Will train for 15 + 5 epochs, equalling 
steps/nnet2/train_pnorm_fast.sh: 1350 + 450 = 1800 iterations, 
steps/nnet2/train_pnorm_fast.sh: (while reducing learning rate) + (with constant learning rate).
steps/nnet2/train_pnorm_fast.sh: Will not do mix up
Training neural net (pass 0)
|    0      6737      C   nnet-train-simple                              8MiB |
|    0      6737      C   nnet-train-simple                              8MiB |
|    0      6737      C   nnet-train-simple                            123MiB |
|    0      6737      C   nnet-train-simple                            147MiB |
|    0      6737      C   nnet-train-simple                            241MiB |
|    0      6737      C   nnet-train-simple                            249MiB |
|    0      6737      C   nnet-train-simple                            249MiB |
|    0      6737      C   nnet-train-simple                            249MiB |
|    0      6737      C   nnet-train-simple                            249MiB |
|    0      6737      C   nnet-train-simple                            249MiB |

And with the more recent Kaldi version, the memory usage jumps to ~4 GB at some point:

steps/nnet2/train_pnorm_fast.sh --num-threads 1 --num-jobs-nnet 1 --num-hidden-layers 6 --p 4 --pnorm-input-dim 2000 --pnorm-output-dim 400 --stage 0 --cmd run.pl --cleanup false --minibatch-size 512 --initial-learning-rate 0.1 --final-learning-rate 0.01 data/train data/lang exp/tri3_ali exp/pnorm6x2000
steps/nnet2/train_pnorm_fast.sh: Will train for 15 + 5 epochs, equalling 
steps/nnet2/train_pnorm_fast.sh: 1350 + 450 = 1800 iterations, 
steps/nnet2/train_pnorm_fast.sh: (while reducing learning rate) + (with constant learning rate).
steps/nnet2/train_pnorm_fast.sh: Will not do mix up
Training neural net (pass 0)
|    0      6980      C   nnet-train-simple                              8MiB |
|    0      6980      C   nnet-train-simple                             13MiB |
|    0      6980      C   nnet-train-simple                            125MiB |
|    0      6980      C   nnet-train-simple                            149MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |

I usually run 3 jobs per GPU (e.g. --num-jobs-nnet 18 on 6 GPUs) to minimize the overall run time, but with the new allocator only one process fits in the GPU's RAM. So for now I am using the version at the previous commit.

Is this a bug, or should I configure something differently for the modified CUDA allocation?

Thanks,

András

Daniel Povey

Oct 3, 2018, 11:44:29 AM
to kaldi-help

It's recommended not to share the GPU between multiple training processes like that.  Possibly you could give them the option --use-gpu=wait (I don't know what the script-level options for this are, as I haven't touched nnet2 in a long time).
However, to work around the problem you can change CuAllocatorOptions::memory_proportion from, say, 0.8 to 0.1.
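Roughly, that workaround looks like this (a sketch only; the field name and default come from this thread, so check your own checkout before editing):

# Sketch of the workaround, assuming a standard Kaldi source checkout.
# 1. Find the default of CuAllocatorOptions::memory_proportion in the
#    allocator header and lower it by hand (e.g. to 0.1, so each job
#    reserves only ~10% of the GPU's memory):
grep -n "memory_proportion" src/cudamatrix/cu-allocator.h
# 2. Rebuild so the binaries pick up the changed default:
cd src && make -j 4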


András Balog

Oct 3, 2018, 1:40:29 PM
to kaldi-help
Thank you! So this is indeed a feature, not a bug. I'll do some performance tests (my old 3-job setup with the old allocator, the old allocator with 1 job, and the new allocator with 1 or more jobs) and see which works faster, or which doesn't work at all.

Note: in cu-allocator.h the comment says that the default value is 0.8, but as far as I can see the code actually sets it to 0.5. That would also explain the ~4 GB I see: roughly half of the GTX 1070's 8 GB.

András