Voice Activity Detection (VAD) and MFCC feature extraction


Shivam Verma

Mar 16, 2017, 1:03:05 PM
to librosa
Hello,
    I want to extract MFCC features from an audio sample only for frames where voice activity is detected. So, for each frame I want to run Voice Activity Detection (VAD); if the result is 1, compute MFCCs for that frame, and reject the frame otherwise.
Thank you in advance.

Justin Salamon

Mar 16, 2017, 1:10:41 PM
to Shivam Verma, librosa
Unless you want to use something fancier, you're likely to want to compute MFCCs precisely in order to determine whether voice is present or not. If you have strong computational constraints you could try computing even simpler features (e.g. RMS), though these are unlikely to yield a robust VAD system, unless you're working with recordings where the only sound source is voice, in which case a simple energy-based indicator might do the trick.

Best,
Justin

--
You received this message because you are subscribed to the Google Groups "librosa" group.
To unsubscribe from this group and stop receiving emails from it, send an email to librosa+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/librosa/d96d088d-690f-4b33-9a25-593a01ef07da%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Justin Salamon, PhD
Senior Research Scientist
Music and Audio Research Laboratory (MARL)
& Center for Urban Science and Progress (CUSP)
New York University, New York, NY

Brian McFee

Mar 16, 2017, 1:18:54 PM
to librosa, shivam...@gmail.com

If you have hard computational constraints, you could fashion a crude detector by running harmonic-percussive-residual separation with an aggressive margin, so that H and P discard anything with vibrato/scooping/etc, as done in the hpss example: https://librosa.github.io/librosa_gallery/auto_examples/plot_hprss.html (eg, margin=8 or above).  You could then treat the residual component S - (H + P) as a proxy for vocals.  It will be less precise as a source separator -- ie have high distortion -- but it might work okay for detection.

You could then threshold the frame-wise RMSE of the separated vocal signal to get a simple detector: rmse(vox) > threshold.  (Threshold can be a fixed constant or some statistic of the separated signal, eg, mean(rmse(vox))).


Shivam Verma

Mar 17, 2017, 9:39:34 AM
to librosa, shivam...@gmail.com
I want to train a neural network for speaker recognition, and for training I intend to use MFCCs as features. So all I want to do is: if voice is present in a frame, calculate MFCCs for that frame; otherwise discard the frame. If MFCCs are calculated for frames where no voice is present and used as training features, the model might not train well.




Rafael Valle

Mar 18, 2017, 1:37:28 AM
to librosa, shivam...@gmail.com
You might also be interested in Kaldi http://kaldi-asr.org/

Philippe Remy

Apr 10, 2017, 2:26:06 AM
to librosa
Hello Shivam,

Your approach does not make sense here.

Suppose you use a traditional algorithm to separate voice from silence with an accuracy of, say, 97%, and then train a neural network on top of its output. The accuracy of your neural network will then be at most 97%.
You have to build your own dataset by combining silences and voices yourself, or have a dataset annotated by humans.
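For the build-your-own-dataset route, a minimal sketch (the random noise here merely stands in for real voice and silence recordings): stitch known segments together and carry a frame-level ground-truth label for each.

```python
import numpy as np

sr = 16000
hop = 512
rng = np.random.default_rng(0)

# Stand-ins for real recordings: loud noise as "voice", faint noise as
# "silence". In practice these would be actual annotated audio clips.
voice = (0.3 * rng.standard_normal(sr)).astype(np.float32)
silence = (0.005 * rng.standard_normal(sr)).astype(np.float32)

# Alternate segments with known labels, so every training frame of the
# resulting signal has a ground-truth voice/silence label.
segments = [(silence, 0), (voice, 1), (silence, 0), (voice, 1)]
audio = np.concatenate([seg for seg, _ in segments])
labels = np.concatenate([np.full(len(seg) // hop, lab) for seg, lab in segments])
```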