to kaldi...@googlegroups.com
Dear All,

We have trained a Kaldi TDNN model on a mixed Hindi + English dataset of around 4700 hours; the training set is more or less clean. However, our test cases are short, single sentences, sometimes with background noise (IVR use cases) and sometimes with unintelligible speech due to noise, channel issues, fast speech, etc.

We also included a few noisy audio files in training (only around 25000 files). We used tag(silence) for the noisy data and mapped it to sil. However, in our dataset we don't have any tags, just plain sentences.

We are using vosk server for ASR deployment.

A few audio files are attached for the issues below. The issues we are facing are as follows:

1) In case of background noise/speech (which is difficult to understand), the ASR detects meaningful words. We want the ASR to return an empty string/sil.
2) Sometimes the audio is very unintelligible, and we want the ASR not to recognize anything (is this possible?).
3) In some cases, even though the audio is clearly heard as “can’t”, the ASR decodes “can’t” as “can”. This is just an example; the same happens for “yes/no” too.

These kinds of scenarios, where it decodes the exact opposite word, are a bit concerning for us.

Can you suggest a few things we can do, apart from training the model again (maybe with more tags related to noise)? Can we somehow use the confidence score to decide whether to accept the decoded output? Any other ideas?

Training took more than a week on an NVIDIA A6000 48GB GPU. Any suggestions on pre-processing, post-processing, or LM changes would be a great help to us.

Thanks in advance.
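
To make the confidence-score question concrete, here is a minimal sketch of the kind of post-processing we have in mind. It assumes the recognizer returns word-level results (in Vosk, a JSON object with a "result" list whose entries carry "word" and "conf", plus a "text" field, when word output is enabled). The function name and both thresholds are hypothetical and would need tuning on held-out noisy audio; we have not validated that this actually solves the issues above.

```python
import json


def filter_by_confidence(result_json, min_mean_conf=0.7, min_word_conf=0.3):
    """Return the decoded text, or "" if the hypothesis looks unreliable.

    result_json   -- JSON string in the word-level result format described above
    min_mean_conf -- hypothetical threshold on the average word confidence
    min_word_conf -- hypothetical threshold on the weakest single word
    """
    result = json.loads(result_json)
    words = result.get("result", [])
    if not words:
        # Nothing was decoded (or no word-level details are available).
        return ""

    confs = [w.get("conf", 0.0) for w in words]
    mean_conf = sum(confs) / len(confs)

    # Reject the whole utterance if it is weak on average, or if any single
    # word is very uncertain (relevant for short inputs like "yes"/"no").
    if mean_conf < min_mean_conf or min(confs) < min_word_conf:
        return ""
    return result.get("text", "")


if __name__ == "__main__":
    sample = json.dumps({
        "result": [{"conf": 0.41, "start": 0.0, "end": 0.4, "word": "can"}],
        "text": "can",
    })
    print(repr(filter_by_confidence(sample)))
```

For short, single-sentence IVR inputs, checking the weakest word as well as the average seems safer than the average alone, since one wrong word can flip the meaning.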