Problem while training a chain-based TDNN model on GPUs


Sougata Mukherjee

Mar 19, 2024, 6:37:06 AM
to kaldi-help
I am using this script to train a chain-based TDNN model on my dataset, but it fails with the following error:

Failed to allocate a memory region of 24902631424 bytes.  Possibly this is due to sharing the GPU.  Try switching the GPUs to exclusive mode (nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like steps/nnet3/chain/train.py.  Memory info: free:47496M, used:1088M, total:48585M, free/total:0.97759 CUDA error: 'out of memory'

So I switched the GPUs to exclusive mode with the command 'sudo nvidia-smi -c 3' and ran the script again. It now shows a different error:

'Failed to create CUDA context, no more unused GPUs?'

Please help me run this script on GPUs.

Sougata Mukherjee

Mar 20, 2024, 12:53:54 AM
to kaldi-help
@Daniel Povey, could you please answer this?

Astik Biswas

Mar 20, 2024, 1:33:41 AM
to kaldi...@googlegroups.com
Please check the number of initial and final GPUs. I guess there is a mismatch between the number of GPUs declared in the script and the number of physical GPUs available.
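
One way to see how many physical GPUs are present, and which compute mode each is in, is to query nvidia-smi. The following is only a minimal sketch: the helper names `parse_gpu_list` and `query_gpus` are made up for illustration, and it assumes nvidia-smi is on the PATH and reports compute modes with strings such as "Default" or "Exclusive_Process".

```python
import shutil
import subprocess


def parse_gpu_list(csv_text):
    """Parse 'index, compute_mode' CSV lines into a list of (index, mode) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, mode = [field.strip() for field in line.split(",", 1)]
        gpus.append((int(index), mode))
    return gpus


def query_gpus():
    """Return the GPUs reported by nvidia-smi, or [] if nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_list(result.stdout)
```

Calling `len(query_gpus())` gives the number of physical GPUs, which can then be compared with the job counts set in the training script.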


Sougata Mukherjee

Mar 20, 2024, 4:10:08 PM
to kaldi-help
How do I check the number of initial and final GPUs? I couldn't find any such setting in this script.

Astik Biswas

Mar 21, 2024, 12:38:07 AM
to kaldi...@googlegroups.com
Please check lines 232 and 233. The values are hard-coded.

Sougata Mukherjee

Mar 21, 2024, 1:18:48 AM
to kaldi-help
But here it is mentioned that the results will change. Is it possible to run the experiment in a way that leaves the results unchanged?

Astik Biswas

Mar 21, 2024, 1:21:14 AM
to kaldi...@googlegroups.com
In this case, you need 5 physical GPUs if you are using exclusive mode. However, I don't think there will be a major drop in performance if you use fewer GPUs. You can try it.
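
As a rough illustration of the point about exclusive mode: each GPU then accepts only one process, so the number of parallel training jobs cannot exceed the number of physical GPUs. The sketch below caps the two job counts accordingly. The function name `cap_num_jobs` and the initial value of 2 are assumptions for illustration, only the final value of 5 comes from this thread, and the capped values would still have to be written into the script by hand (in steps/nnet3/chain/train.py they correspond to --trainer.optimization.num-jobs-initial and --trainer.optimization.num-jobs-final).

```python
def cap_num_jobs(num_jobs_initial, num_jobs_final, num_gpus):
    """Limit the initial and final job counts to the number of physical GPUs."""
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")
    return min(num_jobs_initial, num_gpus), min(num_jobs_final, num_gpus)


# Hypothetical example: script asks for 2 initial and 5 final jobs, machine has 1 GPU.
initial, final = cap_num_jobs(2, 5, num_gpus=1)
print(f"--trainer.optimization.num-jobs-initial={initial}")
print(f"--trainer.optimization.num-jobs-final={final}")
```

Lowering these values changes the training setup, so the results may differ somewhat from the published ones.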
