Robustness of ASR


xiaoy...@gmail.com

Sep 2, 2018, 9:20:03 PM
to kaldi-help
Hi all, 

I'd like to know which method you prefer for making a model more robust to noise: data augmentation (e.g., adding noise and then training a model on the augmented data) or signal preprocessing (denoising before recognition)?
So far I've tried both, and I noticed that my model performs poorly on the denoised data. I've listened to some denoised samples and found that some of the good-quality audio becomes unclear after being denoised...
Any response will be appreciated! 

Shin

Daniel Povey

Sep 2, 2018, 9:33:38 PM
to kaldi-help
Data augmentation is the better approach. Denoising (at least if it
involves nonlinear methods like noise subtraction) will tend to
degrade ASR performance even if it improves perceptual quality, and
even if you train on it. But beamforming approaches like MVDR can
help, assuming you have multiple microphones.
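To make the data-augmentation option concrete, here is a minimal numpy sketch of additive-noise augmentation at a target SNR. The helper name and SNR convention are my own for illustration; Kaldi implements this in its data-preparation scripts rather than like this.

```python
import numpy as np

def add_noise(speech, noise, snr_db, rng=None):
    """Mix `noise` into `speech` at the requested SNR (in dB).

    Both inputs are 1-D float arrays at the same sample rate; the noise
    is tiled to cover the utterance and a random segment is used.
    """
    rng = rng or np.random.default_rng()
    # Tile the noise so it is at least as long as the speech.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)
    start = rng.integers(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]
    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

In a real recipe you would draw the noise segment, noise type, and SNR randomly per utterance (and usually also apply reverberation), then train on the union of the original and augmented copies.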

Dan
> --
> Go to http://kaldi-asr.org/forums.html find out how to join
> ---
> You received this message because you are subscribed to the Google Groups "kaldi-help" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
> To post to this group, send email to kaldi...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/f1d77ff1-19e3-4f32-b67c-6d3fbb85c876%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Shin XXX

Sep 2, 2018, 9:50:41 PM
to kaldi...@googlegroups.com
Many thanks, Dan! I'll play around with beamforming on AMI later. Also, could you tell me why denoising isn't helping? Is it because, as a front-end component, it denoises both noisy and clean audio?
 
Shin


Daniel Povey

Sep 2, 2018, 10:03:58 PM
to kaldi-help
It will tend to lose information. Better to give the ASR model the
ambiguous input and let it decide.

aliiire...@gmail.com

Sep 7, 2018, 8:03:22 AM
to kaldi-help
Hi Dan,
In your opinion, which model is more robust:
SGMM, LSTM, TDNN, LSTM+TDNN, ...?
And which egs recipe in Kaldi?

Daniel Povey

Sep 7, 2018, 12:47:36 PM
to kaldi-help
Probably TDNN.
Re: which examples, it may depend on how much data you have, but you could start with mini_librispeech, as it's fast to run and up to date.



fei

Sep 11, 2018, 8:53:51 AM
to kaldi-help
Hi Dan
     To improve the robustness of ASR, I want to run an experiment that concatenates a speech-enhancement network and a speech-recognition deep neural network, jointly updating their parameters as if they were a single bigger network. The idea is from the paper "Batch-Normalized Joint Training for DNN-Based Distant Speech Recognition".
     I intend to modify the script egs/fisher_english/s5/local/semisup/chain/run_tdnn_50k_semisupervised.sh. The loss function of one output is LF-MMI (chain), and the other is quadratic (MSE). However, I don't know how to combine the two loss functions. Can you give me some suggestions?

On Saturday, September 8, 2018 at 12:47:36 AM UTC+8, Dan Povey wrote:

Daniel Povey

Sep 11, 2018, 12:50:22 PM
to kaldi-help
A lot of people have tried the kind of thing you are talking about, but it never seems to go anywhere.  I think data augmentation during training is the way to go.  Regarding how to do what you want: you'd have to give the neural network an auxiliary output, e.g. 'output-denoising', with the quadratic objective, and as the supervision for that output (in the egs) give the targets scaled by a small number (like 0.1), which will control the dynamic range of that term in the objective function.  Doing that would require changes in the code, e.g. in nnet3-chain-get-egs; and I can't guarantee that everything would work correctly without further code or script tweaks.
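As a rough numpy illustration of the combination being described (an auxiliary denoising output whose quadratic term is kept small relative to the chain term by scaling the targets): the function and variable names here are hypothetical, and this is not how Kaldi's C++ code is organized.

```python
import numpy as np

def joint_objective(chain_logprob, denoised, clean_targets, target_scale=0.1):
    """Combine the LF-MMI (chain) objective with an auxiliary quadratic
    (denoising) objective.

    The clean-feature targets are multiplied by `target_scale` and the
    auxiliary output is trained against the scaled targets; shrinking the
    targets shrinks the dynamic range of the quadratic term relative to
    the chain objective, so no explicit interpolation weight is needed.
    Both terms are to be maximized, so the squared error enters negated.
    """
    scaled_targets = target_scale * clean_targets
    quadratic = -np.mean((denoised - scaled_targets) ** 2)
    return chain_logprob + quadratic
```

When the denoising output exactly matches the scaled targets, the quadratic term vanishes and the objective reduces to the chain log-probability alone; any mismatch subtracts from it.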

Dan


fei

Sep 11, 2018, 10:16:34 PM
to kaldi-help
Hi Dan,
    Thank you for your reply. Joint training may be too difficult for me. For data augmentation, I always use the script ami/s5b/local/nnet3/multi_condition/run_ivector_common.sh. I have some problems when using it:
(1) Do I need to prepare clean data for data augmentation? My dataset was collected from mobile phones or telephones and contains a little noise and reverberation. Can I use it for data augmentation?
(2) About point-source noise: I find there is no point-source noise in the simulated data, and it depends on "--max-noise-per-minute". My wavs are always only 10 seconds long. Should I set max-noise-per-minute larger?

Thanks! 

On Wednesday, September 12, 2018 at 12:50:22 AM UTC+8, Dan Povey wrote:

Daniel Povey

Sep 11, 2018, 10:24:09 PM
to kaldi-help

> Thank you for your reply. Joint training may be too difficult for me. For data augmentation, I always use the script ami/s5b/local/nnet3/multi_condition/run_ivector_common.sh. I have some problems when using it:
> (1) Do I need to prepare clean data for data augmentation? My dataset was collected from mobile phones or telephones and contains a little noise and reverberation. Can I use it for data augmentation?

Yes, adding noise to noisy data is fine.

> (2) About point-source noise: I find there is no point-source noise in the simulated data, and it depends on "--max-noise-per-minute". My wavs are always only 10 seconds long. Should I set max-noise-per-minute larger?

It might be a bug in that script.  See if you can fix it and make a pull request.

Dan

Shin XXX

Sep 12, 2018, 1:17:53 AM
to kaldi...@googlegroups.com
> About point-source noise: I find there is no point-source noise in the simulated data, and it depends on "--max-noise-per-minute".

Is it that no noise list was found, or that no point-source noise was added to your augmented wav.scp?
If it's the latter, then yes, you should set that parameter larger, because the maximum number of point-source noises
 = floor(max-noise-per-minute * your-speech-duration (10 s) / 60), which is 0 in your situation.
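That arithmetic can be written as a one-liner; the function name here is mine, not the name used in Kaldi's reverberation scripts.

```python
import math

def max_point_source_noises(max_noise_per_minute, duration_secs):
    """Upper bound on point-source noises added to one utterance:
    floor(max-noise-per-minute * duration-in-seconds / 60)."""
    return math.floor(max_noise_per_minute * duration_secs / 60.0)
```

So with the default settings and 10-second utterances, any max-noise-per-minute below 6 yields zero point-source noises, which explains the observation above.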

Daniel Povey

Sep 12, 2018, 12:20:38 PM
to kaldi-help, David Snyder
David, could you please try to make sure this script interprets that option in a more sensible way, e.g. probabilistically?  I seem to remember we had a discussion about similar things in the past, with Nickolay involved... see if you can figure out a good way to fix it.

Dan

