Voice Activity Detection (VAD) and MFCC feature extraction


Shivam Verma

Mar 16, 2017, 1:03:05 PM
to librosa
Hello,
    I want to extract MFCC features from an audio sample only for frames where voice activity is detected. So, for each frame I want to run Voice Activity Detection (VAD); if the result is 1, compute MFCCs for that frame, and reject the frame otherwise.
Thank you in advance.

Justin Salamon

Mar 16, 2017, 1:10:41 PM
to Shivam Verma, librosa
Unless you want to use something fancier, you're likely to want to compute MFCCs precisely in order to determine whether voice is present or not. If you have strong computational constraints you could try computing even simpler features (e.g. RMS), though these are unlikely to yield a robust VAD system, unless you're working with recordings where the only sound source is voice, in which case a simple energy-based indicator might do the trick.

Best,
Justin

--
You received this message because you are subscribed to the Google Groups "librosa" group.
To unsubscribe from this group and stop receiving emails from it, send an email to librosa+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/librosa/d96d088d-690f-4b33-9a25-593a01ef07da%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Justin Salamon, PhD
Senior Research Scientist
Music and Audio Research Laboratory (MARL)
& Center for Urban Science and Progress (CUSP)
New York University, New York, NY

Brian McFee

Mar 16, 2017, 1:18:54 PM
to librosa, shivam...@gmail.com

If you have hard computational constraints, you could fashion a crude detector by running harmonic-percussive-residual separation with an aggressive margin, so that H and P discard anything with vibrato/scooping/etc, as done in the hpss example: https://librosa.github.io/librosa_gallery/auto_examples/plot_hprss.html (eg, margin=8 or above).  You could then treat the residual component S - (H + P) as a proxy for vocals.  It will be less precise as a source separator -- ie have high distortion -- but it might work okay for detection.

You could then threshold the frame-wise RMSE of the separated vocal signal to get a simple detector: rmse(vox) > threshold.  (Threshold can be a fixed constant or some statistic of the separated signal, eg, mean(rmse(vox))).


Shivam Verma

Mar 17, 2017, 9:39:34 AM
to librosa, shivam...@gmail.com
I want to train a neural network for speaker recognition, and for training I intend to use MFCCs as features. So all I want to do is: if voice is present in a frame, calculate MFCCs for that frame; otherwise discard the frame. If MFCCs are calculated for frames where no voice is present and used as training features, the model might not train well.




Rafael Valle

Mar 18, 2017, 1:37:28 AM
to librosa, shivam...@gmail.com
You might also be interested in Kaldi http://kaldi-asr.org/

Philippe Remy

Apr 10, 2017, 2:26:06 AM
to librosa
Hello Shivam,

Your approach does not make sense here.

Suppose you use a traditional algorithm to separate voice from silence with an accuracy of, say, 97%, and then train a neural network on top of its output. The accuracy of your neural network will then be at most 97%.
You have to build your own dataset by combining silences and voices yourself, or have a dataset annotated by humans.
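For the build-your-own-dataset route, a minimal sketch (the random noise here merely stands in for real voice and silence recordings): stitch known segments together and carry a frame-level ground-truth label for each.

```python
import numpy as np

sr = 16000
hop = 512
rng = np.random.default_rng(0)

# Stand-ins for real recordings: loud noise as "voice", faint noise as
# "silence". In practice these would be actual annotated audio clips.
voice = (0.3 * rng.standard_normal(sr)).astype(np.float32)
silence = (0.005 * rng.standard_normal(sr)).astype(np.float32)

# Alternate segments with known labels, so every training frame of the
# resulting signal has a ground-truth voice/silence label.
segments = [(silence, 0), (voice, 1), (silence, 0), (voice, 1)]
audio = np.concatenate([seg for seg, _ in segments])
labels = np.concatenate([np.full(len(seg) // hop, lab) for seg, lab in segments])
```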