AM training to obtain [Music,laugh,noise] in the audio transcriptions


spr...@vicomtech.org

unread,
Oct 15, 2015, 6:46:24 AM10/15/15
to kaldi-help
Hi,

I would like to train an acoustic model able to detect (music, laugh, noise, etc.). I have speech audio as well as music and noise audio, but I don't know how to train the acoustic model. I mean, how do I prepare the data/lang files so that the transcriptions contain [music], [laugh], [noise]? Is there any example in Kaldi?

thanks in advance.

Daniel Povey

unread,
Oct 15, 2015, 1:30:02 PM10/15/15
to kaldi-help, David Snyder
David, you might have something to say about this?
Dan


--
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

David Snyder

unread,
Oct 15, 2015, 1:49:59 PM10/15/15
to kaldi-help, spr...@vicomtech.org
I've had a similar problem, where I've needed to detect speech, music, and other noises at the frame level. I've found that training several GMMs (one each for music, speech, and noise) is simple and works pretty well. Frame-level classification can then be done by simply picking the GMM with the maximum probability (weighted by some priors) on each frame.
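As a toy illustration of that decision rule (this is a NumPy sketch, not Kaldi code; single diagonal Gaussians stand in for the trained GMMs, and the means, variances, and priors below are made up):

```python
import numpy as np

# Toy stand-ins for the three trained models: one diagonal Gaussian each.
# A real setup would use full GMMs trained on music/speech/noise data.
CLASSES = ["music", "speech", "noise"]
MEANS   = np.array([[0.0], [3.0], [6.0]])   # per-class mean (1-dim "feature")
VARS    = np.array([[1.0], [1.0], [1.0]])   # per-class variance
PRIORS  = np.array([0.3, 0.5, 0.2])         # assumed class priors

def classify_frames(feats):
    """Per frame, pick the class with maximum log-likelihood + log-prior."""
    # log N(x; mu, var) for every (frame, class) pair, summed over feature dims
    diff = feats[:, None, :] - MEANS[None, :, :]
    loglik = -0.5 * np.sum(diff**2 / VARS + np.log(2 * np.pi * VARS), axis=2)
    return np.argmax(loglik + np.log(PRIORS), axis=1)

frames = np.array([[0.1], [2.9], [6.2]])
print([CLASSES[i] for i in classify_frames(frames)])
# → ['music', 'speech', 'noise']
```

With real GMMs you would replace the single-Gaussian log-likelihood with the GMM's log-sum over mixture components, but the argmax-over-classes step is the same.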

If you want to try something like that, you could take a look at the UBMs used in egs/sre08/v1/run.sh. It might also be helpful to look at sid/gender_id.sh to see the binaries involved in classification.

David Snyder

unread,
Oct 15, 2015, 3:15:37 PM10/15/15
to kaldi-help, spr...@vicomtech.org
Also, we plan on releasing a corpus of music, speech, and noise to openslr.org in the next few days. Soon after that (I hope before the end of this month) we'll add a Kaldi example which demonstrates music/speech discrimination (and possibly a frame-level VAD for speaker ID). This stuff might be helpful for what you're doing. 

spr...@vicomtech.org

unread,
Oct 16, 2015, 5:06:23 AM10/16/15
to kaldi-help, spr...@vicomtech.org
Dear David,

Thank you so much for your answer. I was looking at those recipes, but I think they are prepared for one-to-one speaker verification, or for gender identification at the utterance level. That is, these recipes do not include a process that segments an audio signal and then decides what each segment is (a kind of diarization). Moreover, these scripts would have to be run before recognition, as a separate process; we would like to obtain this information during decoding.

Some weeks ago, we downloaded an acoustic model from http://kaldi-asr.org/downloads/all/egs/fisher_english/s5/exp/nnet2_online
and while testing it we realized that this model produced [noise], [laughter]... during the recognition process. Do you know who the main author of these English models trained on the Fisher corpus might be? Maybe that person has the key.

Anyway, many thanks for your support.

David Snyder

unread,
Oct 16, 2015, 9:51:16 AM10/16/15
to kaldi-help, spr...@vicomtech.org
The gender ID script would have to be modified, but it wouldn't be too hard. You could use the binary https://github.com/kaldi-asr/kaldi/blob/master/src/fgmmbin/fgmm-global-get-frame-likes.cc to get frame-level likelihoods for each of the GMMs, and then compare them (at the frame-level) to decide how to classify that frame. If you need to do this as a preprocessing step for ASR, you'd probably have to do a little more work to convert the frame-level classifications into contiguous segments of speech/music/noise, etc.
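That last step (merging per-frame decisions into contiguous segments) can be sketched as follows; `frames_to_segments` is a hypothetical helper, and the 10 ms frame shift matches Kaldi's default frame rate:

```python
def frames_to_segments(labels, frame_shift=0.01):
    """Collapse a per-frame label sequence into (start_sec, end_sec, label) runs."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at a label change or at the end of the sequence.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((round(start * frame_shift, 3),
                             round(i * frame_shift, 3),
                             labels[start]))
            start = i
    return segments

print(frames_to_segments(["speech"] * 3 + ["music"] * 2 + ["noise"]))
# → [(0.0, 0.03, 'speech'), (0.03, 0.05, 'music'), (0.05, 0.06, 'noise')]
```

In practice you would probably also smooth the frame labels (e.g. with a median filter or minimum-duration constraint) before segmenting, since raw frame-level decisions tend to flicker.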

AFAIK, Dan is the primary author for the online nnet stuff. 

Daniel Povey

unread,
Oct 16, 2015, 1:50:31 PM10/16/15
to kaldi-help, Santiago Prieto Calero
The Fisher recipe produces [noise] and [laughter] because those tokens were included in the training transcripts; the labeling is done by the ASR system itself.
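Concretely, that means the tokens appear verbatim in the training transcripts and the lexicon maps each of them to a special phone. A rough illustration (the file contents and phone names here are only indicative; the actual Fisher recipe's data preparation scripts are authoritative):

```
# data/train/text: tokens appear directly in the transcripts
utt001 yeah [laughter] that was [noise] really loud

# data/local/dict/lexicon.txt: each token maps to a special phone
[laughter] lau
[noise] nsn
[music] nsn
```

The special phones are then typically listed among the silence/noise phones when data/lang is built, so they get their own HMMs during acoustic model training.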
Eventually we would like to include diarization tools in Kaldi but we want to do it right; it may be a year or so before it's done.
Dan


Alaa Jaber

unread,
Oct 14, 2019, 6:13:36 AM10/14/19
to kaldi-help
Well, why couldn't this be built into the system without having to put it in the transcript, since these sounds are almost the same in all languages? Three years have passed since this question was posted. Has there been any change or improvement in Kaldi on this point?

Jan Trmal

unread,
Oct 14, 2019, 9:55:56 AM10/14/19
to kaldi-help
It's still not possible -- ask in three years again :)

y.

--
Go to http://kaldi-asr.org/forums.html find out how to join