Kaldi "chain" training memory consumption


Edwin

Sep 2, 2020, 4:12:34 AM
to kaldi-help
Hi,

For a while now I have been running a variant of Kaldi's "chain" training to obtain acoustic models for Serbian, and I would like to know whether very high memory consumption is expected with this training. I am running the training on a VM with 4 x NVIDIA Tesla P100 GPUs, and I have basically maxed out the available memory (624 GB). I use a minibatch size of 16, num-jobs-initial is set to 3, and num-jobs-final to 16. Even with the memory maxed out, at some point during training run.pl exits with a memory error for one of the training examples, so I have to resume the training with a smaller minibatch (I first decreased it to 8, then to 4, and then the training was able to complete). I am not willing to lower the minibatch from the start, because the training already takes several weeks as it is.

I use a database of about 1000 hours, multiplied 6x (2x with artificially added noise, 3x with speed perturbation), so ~6000 hours in total. The neural network is a TDNN with 10 layers and 1024 neurons per layer (3-frame time contexts on each layer), and I train it for 5 epochs. The other parameters are at their defaults, I believe. The Kaldi version is from sometime in 2018 (in case that is important).

In short, I would like to know whether I am doing something wrong or such memory consumption is expected in the described setup (and if so, whether I can decrease it in any way without affecting the final models). Thanks in advance.
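
For reference, these settings correspond roughly to the following steps/nnet3/chain/train.py options (the directory paths below are placeholders, not my actual ones; other options are left at their defaults):

  # rough sketch of the train.py invocation described above
  steps/nnet3/chain/train.py \
    --cmd "run.pl" \
    --trainer.num-epochs 5 \
    --trainer.num-chunk-per-minibatch 16 \
    --trainer.optimization.num-jobs-initial 3 \
    --trainer.optimization.num-jobs-final 16 \
    --feat-dir data/train_hires \
    --tree-dir exp/chain/tree \
    --lat-dir exp/chain/lats \
    --dir exp/chain/tdnn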

Daniel Povey

Sep 2, 2020, 4:25:49 AM
to kaldi-help
Possibly the jobs that measure the validation and train probability/diagnostics are piling up in memory (nnet3-chain-compute-prob) because the validation set is too large.
Hard to know without more specific info such as what program is using up memory.  I suspect you are not accurately describing what is happening.
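
One quick way to check, while the jobs are running, is just to sort processes by resident memory (nothing Kaldi-specific, standard ps):

  # list the Kaldi nnet3 binaries using the most resident memory
  ps -eo rss,vsz,comm --sort=-rss | grep nnet3 | head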


Edwin

Sep 2, 2020, 7:00:18 AM
to kaldi-help
The validation-set parameters (i.e., num_utts_subset) are unchanged from their default values, and valid_uttlist has ~500 utterances. The probability/diagnostics jobs might well be piling up (as there are millions of training examples), but at this point (the training is completed) I am not sure which program was using the memory.

I tried to explain the issue as best I could: at some point during training, train.py exits with a memory-related error from run.pl (the error message was overwritten in the corresponding train.$stage.$job.log file when I resumed the training, so I cannot copy it right now), and I can resume the training only if I set the minibatch size to a smaller value (I usually halve it). I only wish to know whether this can happen with this amount of data and this neural-net configuration, with the minibatch/jobs parameters set as stated above, or whether something on my side should be changed/fixed. Ideally I would like to use VMs with much less memory (e.g., up to 64 GB). I am willing to provide any additional info you need (that I can find).
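
For what it's worth, this is roughly how I checked the subset size and the list (the egs directory path is specific to my setup, and the file location may differ between Kaldi versions):

  # default size of the validation/diagnostics subset
  grep -m1 "num_utts_subset=" steps/nnet3/chain/get_egs.sh
  # number of utterances in the validation list actually used
  wc -l exp/chain/tdnn/egs/valid_uttlist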

Daniel Povey

Sep 2, 2020, 7:10:33 AM
to kaldi-help
Mm, IDK, I would need to see more info, such as the specific error messages and the log files with the error in them.
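
Something along these lines will usually turn up the failing job's log (the experiment directory here is a placeholder):

  # find the training logs that actually contain an error message
  grep -l -iE "error|bad_alloc|killed" exp/chain/tdnn/log/train.*.log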


Edwin

Sep 2, 2020, 7:14:46 AM
to kaldi-help
Alright, I will try to recreate the error. Until then: is the "chain" training of TDNNs known to take up a lot of memory (when there are several thousand hours of training data)?

Daniel Povey

Sep 2, 2020, 7:15:44 AM
to kaldi-help

Edwin

Sep 3, 2020, 11:45:18 AM
to kaldi-help
Hi, here are the logs with the error: run.log is the main log file, and train.0.3.log is the one generated by train.py. You can see the training parameters in the logs as well.
The error seems to happen when the memory usage of one of the GPUs approaches the total memory available (16276 MiB); in this run it happened on GPU 3 (see CUDA.png, a screenshot taken just before the error). The actual percentage of RAM used does not change much, although the "VIRT" column in MEM.png shows some huge numbers; I am not sure whether that is an issue in itself (does it always grow as large as the available memory allows?).

The key lines in the log seem to be the following:
WARNING (nnet3-chain-train[5.4.240~1403-c60f2]:MallocPitchInternal():cu-allocator.cc:97) Allocation of 49836032 x 51 region failed: freeing some memory and trying again. 
LOG (nnet3-chain-train[5.4.240~1403-c60f2]:MallocPitchInternal():cu-allocator.cc:102) To avoid future problems like this, changing memory_factor from 1.3 to 1.1

After that, the allocation is retried a couple more times and then the error occurs. The problematic call seems to be in nnet3-chain-copy-egs.

For this run I set the minibatch_size parameter (num-chunk-per-minibatch in train.py) to 1024, so the error occurs within seconds. Usually minibatch_size is initially set to 16, then lowered to 8 and then 4 if (when) needed.

Any help is welcome.
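
In case it is useful, I watched the GPU memory with plain nvidia-smi polling, roughly like this:

  # print per-GPU memory usage every 5 seconds while the training jobs run
  nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5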

run.log.txt
train.0.3.log
CUDA.png
MEM.png

Daniel Povey

Sep 3, 2020, 12:10:15 PM
to kaldi-help
That is a problem of GPU memory, not CPU memory.  You should do what it says in the log file:
  WARNING (nnet3-chain-train[5.4.240~1403-c60f2]:SelectGpuId():cu-device.cc:196) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
but you need to make sure that the number of jobs is <= 4, or pass --use-gpu=wait to train.py.
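
A sketch of putting all four GPUs into compute-exclusive mode (whether you need sudo depends on the machine):

  # set compute mode 3 (EXCLUSIVE_PROCESS) on each GPU
  for i in 0 1 2 3; do sudo nvidia-smi -i $i -c 3; done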

The virtual memory usage of ~50G cannot be trusted; it is somehow an artifact of the GPU usage.


Edwin

Sep 4, 2020, 3:54:53 AM
to kaldi-help
Thanks, I will try the mentioned solutions and get back to you if I have further questions or issues.

Edwin

Sep 4, 2020, 9:21:45 AM
to kaldi-help
Hi, the following configuration seems to work for me now:
 - setting compute-exclusive mode with 'nvidia-smi -c 3' before the training
 - setting "--use-gpu wait" for train.py
 - setting both num-jobs-initial and num-jobs-final to 4 for train.py (as we have 4 GPUs; I do not believe num-jobs-initial needs to be any lower during the initial iterations to avoid training instability, since 4 is already quite low)

It works with 64 GB of (CPU) memory (I decreased it on the VM) and a minibatch size (num-chunk-per-minibatch in train.py) of up to 256, and if my calculations are correct the training should now take only about a week (unless something unexpected happens). It can also work with even less memory, with some minibatch adjustments.
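
For the record, the options I changed relative to the original run look roughly like this (everything else is as before):

  # changed train.py options only; a sketch, not the full command
  --use-gpu wait \
  --trainer.num-chunk-per-minibatch 256 \
  --trainer.optimization.num-jobs-initial 4 \
  --trainer.optimization.num-jobs-final 4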

Thanks a lot. I am still not sure about the virtual memory usage (still ~45 GB per program), but it does not seem to affect the training.