Nnet3 raw DNN online forward pass...

marc....@protonmail.com

Oct 7, 2022, 11:46:16 AM

to kaldi-developers

Hello,

Based on the online2bin/online2-wav-nnet3-latgen-incremental.cc examples, I am trying to do inference on a raw nnet3 VAD model, to select feature vectors to be fed to the online decoder. I am using a separate DecodableNnetLoopedOnline object with

// at initialization time...

// load nnet3 into vad_nnet

vad_decodable_opts.frame_subsampling_factor = 1 ;
vad_decodable_opts.acoustic_scale = 1.0 ;
vad_decodable_opts.frames_per_chunk = same_as_decoder_decodable_object ;
vad_decodable_info = new DecodableNnetSimpleLoopedInfo(vad_decodable_opts, vad_nnet) ;

// create decodable object, with FBANK features from feature_pipeline

vad_decodable = new DecodableNnetLoopedOnline(*vad_decodable_info, feature_pipeline->InputFeature(), NULL) ;

// compute feature vectors for current chunk

feature_pipeline->AcceptWaveform(samp_freq, audio_data) ;

.

BaseFloat *vad_output_data = vad_output.Data() ;
vad_output_data[0] = vad_decodable->LogLikelihood(i, 1) ;
vad_output_data[1] = vad_decodable->LogLikelihood(i, 2) ;

The three lines above never produce any output; NumFramesReady() always returns 0.

My guess is that this DNN has no transition model inside, so maybe I cannot use LogLikelihood() to access the pseudo-likelihoods or posteriors. This network is based on the egs/sad_rats recipe.

Either way, is there a simple way to compute the raw VAD scores for each of the feature_pipeline feature vectors?

Thank you for any ideas on this...

Best,

Marc

marc....@protonmail.com

Oct 7, 2022, 11:58:17 AM

to kaldi-developers

ERROR (online2-paudio-vad-nnet3-latgen-incremental[5.5.1041~2-7d91f]:AdvanceChunk():decodable-online-looped.cc:149) Attempt to access frame past the end of the available input

marc....@protonmail.com

Oct 7, 2022, 12:54:28 PM

to kaldi-developers

I got it working. The issue was that a single chunk of feature vectors does not provide enough context to compute a chunk of VAD outputs; extra context is required. I now collect additional audio chunks until enough features are available before calling LogLikelihood(). Reducing the chunk size (frames_per_chunk) for vad_decodable also helped, and it lowers latency. Hope this helps someone in the future...
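
For anyone hitting the same thing, here is a minimal standalone sketch of the readiness rule as I understand it. It does not link against Kaldi: the function name NumOutputFramesReady and all of its parameters are hypothetical, and the rule itself (output is produced in whole chunks, and each chunk needs right-context frames beyond its end unless the input has finished) is my assumption about the behaviour, not code copied from the library.

```cpp
#include <algorithm>

// Hypothetical, simplified model of when a looped online decodable can
// produce output. Assumptions (not taken verbatim from Kaldi):
//  - output is only produced in whole chunks of `frames_per_chunk` frames;
//  - each chunk needs `right_context` additional feature frames;
//  - once the input is finished, missing context is padded, so every
//    feature frame yields output.
int NumOutputFramesReady(int features_ready, int frames_per_chunk,
                         int right_context, int subsampling_factor,
                         bool input_finished) {
  if (features_ready == 0) return 0;
  if (input_finished) {
    // Round up so a trailing partial group of frames is still counted.
    return (features_ready + subsampling_factor - 1) / subsampling_factor;
  }
  // Frames for which the required right context is already available.
  int usable = std::max(0, features_ready - right_context);
  // Only complete chunks can be evaluated.
  int chunks_ready = usable / frames_per_chunk;
  return chunks_ready * frames_per_chunk / subsampling_factor;
}
```

With illustrative numbers (20 frames per chunk, a right context of 10 frames, no subsampling), feeding exactly one chunk of 20 feature frames gives 0 output frames, which matches what I was seeing. Output only appears once 30 feature frames are available. Dropping the chunk size to 5 under the same assumptions gives 10 output frames from those same 20 features, which is why a smaller chunk size reduced latency for me.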