I am trying to use the WebRTC audio processing implementation of voice detection to detect and distinguish human speech from any other sound.
When I enable voice detection, I observe the following call stack:
AudioProcessingImpl::ProcessCaptureStreamLocked()
VoiceDetectionImpl::ProcessCaptureAudio()
WebRtcVad_Process()
WebRtcVad_CalcVad16khz()
WebRtcVad_CalcVad8khz()
GmmProbability()
GmmProbability() appears to be the workhorse of the voice detection, and it is documented as:
// Calculates the probabilities for both speech and background noise using
// Gaussian Mixture Models (GMM). A hypothesis-test is performed to decide which
// type of signal is most probable.
// - returns : the VAD decision (0 - noise, 1 - speech).
Unfortunately, GmmProbability() does not seem able to distinguish true human speech from any other loud noise or sound.
For example, if I just scratch the mic, I get "1 - speech"; if I set my coffee cup down on the table, I again get "1 - speech". I would like to distinguish sounds like these from when I am actually speaking.
Is there any way to configure or change GmmProbability() to make such distinction?
Has anyone experimented with this?
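For completeness, the only tuning knob I could find is the aggressiveness mode. Below is a minimal sketch of how I am driving the VAD, assuming the public C API in common_audio/vad/include/webrtc_vad.h (the include path, sample rate, and frame size here are my assumptions, not part of the stack trace above):

```c
#include <stdint.h>
#include <stddef.h>
#include "webrtc_vad.h"  // assumed include path into the WebRTC source tree

// Runs one frame through the VAD. frame_length must correspond to a
// 10, 20, or 30 ms frame at the given sample rate (here 16 kHz).
int run_vad(const int16_t* frame, size_t frame_length) {
  VadInst* vad = WebRtcVad_Create();
  if (vad == NULL || WebRtcVad_Init(vad) != 0) {
    return -1;
  }
  // Mode 0..3: higher values are more aggressive, i.e. more frames are
  // classified as noise. As far as I can tell this only shifts the GMM
  // decision thresholds; it does not change the features, so a scratched
  // mic can still come back as "speech" even in mode 3.
  WebRtcVad_set_mode(vad, 3);
  int is_speech = WebRtcVad_Process(vad, 16000, frame, frame_length);
  WebRtcVad_Free(vad);
  return is_speech;  // 1 = speech, 0 = noise, -1 = error
}
```

Even at the most aggressive mode I still get "speech" for the non-speech sounds described above, which is why I am asking whether GmmProbability() itself can be changed.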
Thanks,
Danail Kirov