Why does CollapseModel change the online MFCC features generated by feature_pipeline?


Jin


Jun 23, 2020, 12:56:52 PM

to kaldi-help

Hi, all

I want to implement an online decoding program, so I copied kaldi-master/src/online2bin/online2-wav-nnet3-latgen-faster.cc and modified it. However, I found something strange: calling CollapseModel changes the features computed by the feature pipeline.

The details are as follows:

Step 1.

At first I only wanted to get the MFCC features, so my code is as follows:

my_Code_ver1:

=================

// online2bin/online2-wav-nnet3-latgen-faster.cc

// 2016 Api.ai (Author: Ilya Platonov)

// See ../../COPYING for clarification regarding multiple authors
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.

#include "feat/wave-reader.h"
#include "online2/online-nnet3-decoding.h"
#include "online2/online-nnet2-feature-pipeline.h"
#include "online2/onlinebin-util.h"
#include "online2/online-timing.h"
#include "online2/online-endpoint.h"
#include "fstext/fstext-lib.h"
#include "lat/lattice-functions.h"
#include "util/kaldi-thread.h"
#include "nnet3/nnet-utils.h"

namespace kaldi {

void GetDiagnosticsAndPrintOutput(const std::string &utt,
                                  const fst::SymbolTable *word_syms,
                                  const CompactLattice &clat,
                                  int64 *tot_num_frames,
                                  double *tot_like) {
  if (clat.NumStates() == 0) {
    KALDI_WARN << "Empty lattice.";
    return;
  }
  CompactLattice best_path_clat;
  CompactLatticeShortestPath(clat, &best_path_clat);

  Lattice best_path_lat;
  ConvertLattice(best_path_clat, &best_path_lat);

  double likelihood;
  LatticeWeight weight;
  int32 num_frames;
  std::vector<int32> alignment;
  std::vector<int32> words;
  GetLinearSymbolSequence(best_path_lat, &alignment, &words, &weight);
  num_frames = alignment.size();
  likelihood = -(weight.Value1() + weight.Value2());
  *tot_num_frames += num_frames;
  *tot_like += likelihood;
  KALDI_VLOG(2) << "Likelihood per frame for utterance " << utt << " is "
                << (likelihood / num_frames) << " over " << num_frames
                << " frames.";

  if (word_syms != NULL) {
    std::cerr << utt << ' ';
    for (size_t i = 0; i < words.size(); i++) {
      std::string s = word_syms->Find(words[i]);
      if (s == "")
        KALDI_ERR << "Word-id " << words[i] << " not in symbol table.";
      std::cerr << s << ' ';
    }
    std::cerr << std::endl;
  }
}

}  // namespace kaldi

int main(int argc, char *argv[]) {
  try {
    using namespace kaldi;
    using namespace fst;

    typedef kaldi::int32 int32;
    typedef kaldi::int64 int64;

    const char *usage =
        "Reads in wav file(s) and simulates online decoding with neural nets\n"
        "(nnet3 setup), with optional iVector-based speaker adaptation and\n"
        "optional endpointing. Note: some configuration values and inputs are\n"
        "set via config files whose filenames are passed as options\n"
        "\n"
        "Usage: online2-wav-nnet3-latgen-faster [options] <nnet3-in> <fst-in> "
        "<spk2utt-rspecifier> <wav-rspecifier> <lattice-wspecifier>\n"
        "The spk2utt-rspecifier can just be <utterance-id> <utterance-id> if\n"
        "you want to decode utterance by utterance.\n";

    ParseOptions po(usage);

    std::string word_syms_rxfilename;

    // feature_opts includes configuration for the iVector adaptation,
    // as well as the basic features.
    OnlineNnet2FeaturePipelineConfig feature_opts;
    nnet3::NnetSimpleLoopedComputationOptions decodable_opts;
    LatticeFasterDecoderConfig decoder_opts;
    OnlineEndpointConfig endpoint_opts;

    BaseFloat chunk_length_secs = 0.18;
    bool do_endpointing = false;
    bool online = true;

    po.Register("chunk-length", &chunk_length_secs,
                "Length of chunk size in seconds, that we process. Set to <= 0 "
                "to use all input in one chunk.");
    po.Register("word-symbol-table", &word_syms_rxfilename,
                "Symbol table for words [for debug output]");
    po.Register("do-endpointing", &do_endpointing,
                "If true, apply endpoint detection");
    po.Register("online", &online,
                "You can set this to false to disable online iVector estimation "
                "and have all the data for each utterance used, even at "
                "utterance start. This is useful where you just want the best "
                "results and don't care about online operation. Setting this to "
                "false has the same effect as setting "
                "--use-most-recent-ivector=true and --greedy-ivector-extractor=true "
                "in the file given to --ivector-extraction-config, and "
                "--chunk-length=-1.");
    po.Register("num-threads-startup", &g_num_threads,
                "Number of threads used when initializing iVector extractor.");

    feature_opts.Register(&po);
    decodable_opts.Register(&po);
    decoder_opts.Register(&po);
    endpoint_opts.Register(&po);

    po.Read(argc, argv);

    if (po.NumArgs() != 5) {
      po.PrintUsage();
      return 1;
    }

    std::string nnet3_rxfilename = po.GetArg(1),
        fst_rxfilename = po.GetArg(2),
        spk2utt_rspecifier = po.GetArg(3),
        wav_rspecifier = po.GetArg(4),
        clat_wspecifier = po.GetArg(5);

    OnlineNnet2FeaturePipelineInfo feature_info(feature_opts);

    if (!online) {
      feature_info.ivector_extractor_info.use_most_recent_ivector = true;
      feature_info.ivector_extractor_info.greedy_ivector_extractor = true;
      chunk_length_secs = -1.0;
    }

    Matrix<double> global_cmvn_stats;
    if (feature_info.global_cmvn_stats_rxfilename != "")
      ReadKaldiObject(feature_info.global_cmvn_stats_rxfilename,
                      &global_cmvn_stats);

    // TransitionModel trans_model;
    // nnet3::AmNnetSimple am_nnet;
    // {
    //   bool binary;
    //   Input ki(nnet3_rxfilename, &binary);
    //   trans_model.Read(ki.Stream(), binary);
    //   am_nnet.Read(ki.Stream(), binary);
    //   SetBatchnormTestMode(true, &(am_nnet.GetNnet()));
    //   SetDropoutTestMode(true, &(am_nnet.GetNnet()));
    //   nnet3::CollapseModel(nnet3::CollapseModelConfig(), &(am_nnet.GetNnet()));
    // }

    /* // comment 20200623_173051
    // this object contains precomputed stuff that is used by all decodable
    // objects. It takes a pointer to am_nnet because if it has iVectors it has
    // to modify the nnet to accept iVectors at intervals.
    nnet3::DecodableNnetSimpleLoopedInfo decodable_info(decodable_opts,
                                                        &am_nnet);

    fst::Fst<fst::StdArc> *decode_fst = ReadFstKaldiGeneric(fst_rxfilename);

    fst::SymbolTable *word_syms = NULL;
    if (word_syms_rxfilename != "")
      if (!(word_syms = fst::SymbolTable::ReadText(word_syms_rxfilename)))
        KALDI_ERR << "Could not read symbol table from file "
                  << word_syms_rxfilename;
    */

    int32 num_done = 0, num_err = 0;
    double tot_like = 0.0;
    int64 num_frames = 0;

    SequentialTokenVectorReader spk2utt_reader(spk2utt_rspecifier);
    RandomAccessTableReader<WaveHolder> wav_reader(wav_rspecifier);
    CompactLatticeWriter clat_writer(clat_wspecifier);

    OnlineTimingStats timing_stats;

    for (; !spk2utt_reader.Done(); spk2utt_reader.Next()) {
      std::string spk = spk2utt_reader.Key();
      const std::vector<std::string> &uttlist = spk2utt_reader.Value();
      OnlineIvectorExtractorAdaptationState adaptation_state(
          feature_info.ivector_extractor_info);
      OnlineCmvnState cmvn_state(global_cmvn_stats);

      for (size_t i = 0; i < uttlist.size(); i++) {
        std::string utt = uttlist[i];
        if (!wav_reader.HasKey(utt)) {
          KALDI_WARN << "Did not find audio for utterance " << utt;
          num_err++;
          continue;
        }
        const WaveData &wave_data = wav_reader.Value(utt);
        // get the data for channel zero (if the signal is not mono, we only
        // take the first channel).
        SubVector<BaseFloat> data(wave_data.Data(), 0);

        OnlineNnet2FeaturePipeline feature_pipeline(feature_info);
        feature_pipeline.SetAdaptationState(adaptation_state);
        feature_pipeline.SetCmvnState(cmvn_state);

        /* // comment 20200623_173141
        OnlineSilenceWeighting silence_weighting(
            trans_model,
            feature_info.silence_weighting_config,
            decodable_opts.frame_subsampling_factor);

        SingleUtteranceNnet3Decoder decoder(decoder_opts, trans_model,
                                            decodable_info,
                                            *decode_fst, &feature_pipeline);
        */
        OnlineTimer decoding_timer(utt);

        BaseFloat samp_freq = wave_data.SampFreq();
        int32 chunk_length;
        if (chunk_length_secs > 0) {
          chunk_length = int32(samp_freq * chunk_length_secs);
          if (chunk_length == 0) chunk_length = 1;
        } else {
          chunk_length = std::numeric_limits<int32>::max();
        }

        int32 samp_offset = 0;
        std::vector<std::pair<int32, BaseFloat> > delta_weights;

        while (samp_offset < data.Dim()) {
          int32 samp_remaining = data.Dim() - samp_offset;
          int32 num_samp = chunk_length < samp_remaining ? chunk_length
                                                         : samp_remaining;

          SubVector<BaseFloat> wave_part(data, samp_offset, num_samp);
          feature_pipeline.AcceptWaveform(samp_freq, wave_part);

          samp_offset += num_samp;
          decoding_timer.WaitUntil(samp_offset / samp_freq);
          if (samp_offset == data.Dim()) {
            // no more input. flush out last frames
            feature_pipeline.InputFinished();

            // code to printf the value of the feature, 20200623_171127
            Vector<BaseFloat> feat01;
            feat01.Resize(feature_pipeline.Dim());
            std::cout << "utt = " << utt << "\n";
            for (int i1 = 0; i1 < feature_pipeline.NumFramesReady(); i1++) {
              feature_pipeline.GetFrame(i1, &feat01);
              std::cout << "feature = ";
              for (int i2 = 0; i2 < feat01.Dim(); i2++) {
                if (i2 < feat01.Dim() - 1) {
                  std::cout << feat01(i2) << " ";
                } else {
                  if (i1 < feature_pipeline.NumFramesReady() - 1) {
                    std::cout << feat01(i2) << " \n";
                  } else {
                    std::cout << feat01(i2) << " ]\n";
                  }
                }
              }
            }
          }

          /* // comment 20200623_172819
          if (silence_weighting.Active() &&
              feature_pipeline.IvectorFeature() != NULL) {
            silence_weighting.ComputeCurrentTraceback(decoder.Decoder());
            silence_weighting.GetDeltaWeights(feature_pipeline.NumFramesReady(),
                                              &delta_weights);
            feature_pipeline.IvectorFeature()->UpdateFrameWeights(delta_weights);
          }

          decoder.AdvanceDecoding();

          if (do_endpointing && decoder.EndpointDetected(endpoint_opts)) {
            break;
          }
          */
        }

        /* // comment 20200623_172907
        decoder.FinalizeDecoding();

        CompactLattice clat;
        bool end_of_utterance = true;
        decoder.GetLattice(end_of_utterance, &clat);

        GetDiagnosticsAndPrintOutput(utt, word_syms, clat,
                                     &num_frames, &tot_like);

        decoding_timer.OutputStats(&timing_stats);

        // In an application you might avoid updating the adaptation state if
        // you felt the utterance had low confidence. See lat/confidence.h
        feature_pipeline.GetAdaptationState(&adaptation_state);
        feature_pipeline.GetCmvnState(&cmvn_state);

        // we want to output the lattice with un-scaled acoustics.
        BaseFloat inv_acoustic_scale =
            1.0 / decodable_opts.acoustic_scale;
        ScaleLattice(AcousticLatticeScale(inv_acoustic_scale), &clat);

        clat_writer.Write(utt, clat);
        KALDI_LOG << "Decoded utterance " << utt;
        num_done++;
        */
      }
    }
    timing_stats.Print(online);

    KALDI_LOG << "Decoded " << num_done << " utterances, "
              << num_err << " with errors.";
    KALDI_LOG << "Overall likelihood per frame was " << (tot_like / num_frames)
              << " per frame over " << num_frames << " frames.";

    /* // comment 20200623_175541
    delete decode_fst;
    delete word_syms; // will delete if non-NULL.
    */
    return (num_done != 0 ? 0 : 1);
  } catch(const std::exception& e) {
    std::cerr << e.what();
    return -1;
  }
} // main()

=================

In this program I commented out most of the code and added a block (marked "// code to printf the value of the feature") that prints the value of the online MFCC features.
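For reference, the loop in main() feeds the waveform to the pipeline in chunks of int32(samp_freq * chunk_length_secs) samples, and a --chunk-length <= 0 means the whole file is one chunk. The sketch below reproduces only that arithmetic outside Kaldi, assuming the default single-precision BaseFloat (so plain float is used); ChunkSizes is a hypothetical helper, not a Kaldi function. It illustrates that chunking merely splits the input: the chunk sizes always add up to the number of samples.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not part of Kaldi): mirrors the chunking arithmetic
// of the decoding loop and returns the size of every chunk that would be
// passed to AcceptWaveform().
std::vector<int32_t> ChunkSizes(float samp_freq, float chunk_length_secs,
                                int32_t num_samples) {
  assert(num_samples >= 0);
  int32_t chunk_length;
  if (chunk_length_secs > 0) {
    chunk_length = static_cast<int32_t>(samp_freq * chunk_length_secs);
    if (chunk_length == 0) chunk_length = 1;
  } else {
    // <= 0 means: use all input in one chunk.
    chunk_length = std::numeric_limits<int32_t>::max();
  }
  std::vector<int32_t> sizes;
  int32_t samp_offset = 0;
  while (samp_offset < num_samples) {
    int32_t samp_remaining = num_samples - samp_offset;
    int32_t num_samp = chunk_length < samp_remaining ? chunk_length
                                                     : samp_remaining;
    sizes.push_back(num_samp);
    samp_offset += num_samp;
  }
  return sizes;
}
```

For a 10000-sample file at 16 kHz with the default 0.18 s chunks, ChunkSizes(16000.0f, 0.18f, 10000) returns {2880, 2880, 2880, 1360}; with a chunk length of -1 it returns {10000}.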

Step 2.

I compiled and ran it.

The mfcc.conf is as follows:

=================

--use-energy=false # use average of log energy, not energy.
--sample-frequency=16000 # audio is sampled at 16kHz
--num-mel-bins=40 # similar to Google's setup.
--num-ceps=40 # there is no dimensionality reduction.
--low-freq=40 # low cutoff frequency for mel bins
--high-freq=-200 # high cutoff frequency, relative to Nyquist of 8000 (=7800)

=================
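As I understand Kaldi's mel-filterbank options, a --high-freq value <= 0 is taken as an offset from the Nyquist frequency (half of --sample-frequency), while a positive value is used as an absolute frequency. A minimal sketch of that rule, with a hypothetical function name (EffectiveHighFreq is not a Kaldi API), shows what -200 means for 16 kHz audio:

```cpp
#include <cassert>

// Hypothetical helper (not a Kaldi function): effective upper cutoff of the
// mel filterbank in Hz. Values <= 0 are interpreted relative to Nyquist.
float EffectiveHighFreq(float sample_frequency, float high_freq) {
  assert(sample_frequency > 0.0f);
  float nyquist = 0.5f * sample_frequency;
  return high_freq > 0.0f ? high_freq : nyquist + high_freq;
}
```

EffectiveHighFreq(16000.0f, -200.0f) is 7800 Hz; the 3800 Hz figure corresponds to 8 kHz audio.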

Then, the online MFCC features are as follows:

=================

feature = 34.3405 -43.4712 -9.36623 -6.47962 -13.6996 -12.058 0.281473 4.01853 10.9514 -3.08243 -0.147551 -6.50696 7.32777 -2.50403 7.

feature = 34.0814 -42.8561 -4.71221 -10.1752 -1.70983 -2.09292 -5.9388 1.98959 5.00703 4.5387 6.19564 8.12872 2.55187 7.43772 -2.60018

feature = 33.4881 -47.1861 -17.4549 -14.3747 -9.82685 -12.9513 -12.1817 -0.622683 13.8116 7.53594 -1.22806 -18.0101 -26.4598 -7.95405

feature = 33.7378 -44.5546 -14.4415 -14.4512 -10.3326 -7.71062 -2.36232 -13.5167 -13.5539 -0.582236 8.88844 1.60055 8.47699 11.4204

feature = 33.7891 -41.2944 -4.9964 -7.96058 -7.84591 -9.15858 2.32942 -8.17799 2.40428 -6.9144 -1.8094 -10.4293 -12.8396 -14.3085

......

=================
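The online dump above is printed with std::cout's default of 6 significant digits, while the copy-feats text dump shows 7, so a plain text diff reports differences even where the values agree. A minimal sketch of printing a frame with an explicit precision; FormatFrame is a hypothetical helper, not a Kaldi function, and it takes a std::vector<float> instead of a Kaldi Vector only to stay self-contained:

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Kaldi): formats one feature frame like the
// debug print in my_Code_ver1 ("feature = v0 v1 ..."), but with an explicit
// number of significant digits instead of the stream default of 6.
std::string FormatFrame(const std::vector<float> &frame, int precision) {
  assert(precision > 0);
  std::ostringstream os;
  os << std::setprecision(precision) << "feature =";
  for (float v : frame) os << ' ' << v;
  return os.str();
}
```

With precision 7, FormatFrame({34.34048f, -43.4712f, -9.366229f}, 7) gives "feature = 34.34048 -43.4712 -9.366229", as in the offline dump; precision 6 gives "feature = 34.3405 -43.4712 -9.36623", as in the online dump.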

To check that the online MFCC features are correct, I also extracted MFCCs offline with "steps/make_mfcc.sh" and used "copy-feats" to convert the .ark file to a readable text file. The offline MFCC features are:

=================

feature = 34.34048 -43.4712 -9.366229 -6.479622 -13.69956 -12.05803 0.2814732 4.018528 10.95138 -3.082435 -0.1475509 -6.506959 7.327765 -2.504032

feature = 34.08136 -42.85607 -4.712212 -10.17521 -1.709832 -2.092924 -5.9388 1.989586 5.007028 4.538695 6.195642 8.128718 2.551874 7.437718 -2.600184

feature = 33.4881 -47.18608 -17.45494 -14.37472 -9.826847 -12.95129 -12.18169 -0.622683 13.81161 7.53594 -1.228062 -18.01012 -26.45985 -7.95405

feature = 33.73775 -44.55455 -14.44146 -14.45118 -10.33265 -7.710624 -2.362316 -13.51671 -13.55395 -0.5822357 8.888443 1.600546 8.476991 11.4204

feature = 33.78905 -41.2944 -4.996396 -7.960578 -7.845908 -9.158577 2.329417 -8.177985 2.404277 -6.914402 -1.809403 -10.42929 -12.83959 -14.30855

=================

They are the same (up to printing precision), so I think the MFCCs extracted by "my_Code_ver1.cc" are correct.
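"The same" here means equal up to the printed precision (e.g. 34.3405 vs 34.34048). A sketch of making that check explicit with a relative tolerance; FramesMatch and its 1e-5 default tolerance are illustrative assumptions, not part of Kaldi:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Kaldi): true if two frames agree within a
// relative tolerance. This absorbs differences in printed precision but not
// genuinely different feature values.
bool FramesMatch(const std::vector<float> &a, const std::vector<float> &b,
                 float rel_tol = 1e-5f) {
  assert(rel_tol >= 0.0f);
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); i++) {
    float scale = std::max(std::fabs(a[i]), std::fabs(b[i]));
    if (std::fabs(a[i] - b[i]) > rel_tol * scale) return false;
  }
  return true;
}
```

Taking the first five values of the first frame from each dump, the online and offline frames match, while the first frame printed later by my_Code_ver2 (33.7779 -41.9107 ...) does not.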

Step 3.

For decoding, the nnet3 model is needed, so I uncommented the model-loading block (from "TransitionModel trans_model;" to the "nnet3::CollapseModel(...)" call). The code is as follows:

my_Code_ver2:

=================


    TransitionModel trans_model;
    nnet3::AmNnetSimple am_nnet;
    {
      bool binary;
      Input ki(nnet3_rxfilename, &binary);
      trans_model.Read(ki.Stream(), binary);
      am_nnet.Read(ki.Stream(), binary);
      SetBatchnormTestMode(true, &(am_nnet.GetNnet()));
      SetDropoutTestMode(true, &(am_nnet.GetNnet()));
      nnet3::CollapseModel(nnet3::CollapseModelConfig(), &(am_nnet.GetNnet()));
    }


=================

The only difference between my_Code_ver1 and my_Code_ver2 is this model-loading block. I compiled and ran my_Code_ver2 with all parameters unchanged, and the online MFCC features are now:

=================

feature = 33.7779 -41.9107 -6.68789 -6.67634 -3.35078 -12.2021 -20.4014 -21.1953 -11.1807 -20.8798 -10.3126 -1.39876 9.60673 -12.1254

feature = 32.8806 -46.7642 -12.6515 -10.4324 0.06168 -7.38267 -12.6474 -17.6967 -11.5854 9.6172 21.7533 9.54143 -4.63272 -7.29442

feature = 34.3729 -45.1859 -14.8671 -17.9491 -5.87585 -13.5997 0.678618 2.79805 3.74683 2.14518 0.180282 -0.16978 9.16503 -0.16565

feature = 33.3199 -44.1305 -11.9354 -18.7365 -9.10925 -11.7803 -18.3868 -18.802 -19.8656 -4.3989 -1.17359 10.8771 5.59053 3.71423

feature = 33.8922 -43.6418 -6.71278 -7.78159 -7.28932 -7.14056 0.785026 8.52194 2.98069 -12.3241 -6.75987 7.04074 -10.224 -3.81524

=================

The online MFCC features extracted by "my_Code_ver1" differ from those extracted by "my_Code_ver2".

Then I commented out the lines of that block one at a time to find the cause, and found that "nnet3::CollapseModel(nnet3::CollapseModelConfig(), &(am_nnet.GetNnet()));" is the critical line.

Here are my questions:

1. Why does "nnet3::CollapseModel(nnet3::CollapseModelConfig(), &(am_nnet.GetNnet()));" change the online MFCC features extracted by feature_pipeline?

2. How can I keep the online features extracted by feature_pipeline identical to the offline features?

Best wishes,

Jin

Daniel Povey


Jun 24, 2020, 3:41:59 AM

to kaldi-help

It probably has to do with dithering, which calls rand(); a random number is probably used somewhere in CollapseModel().

You can set --dither=0.0 in feature extraction if it bothers you.
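The mechanism suggested above can be reproduced without Kaldi. The sketch assumes (as I understand Kaldi's dithering code) that the dither noise comes from a generator seeded from the process-wide rand() stream. ExtractWithDither is a toy stand-in for MFCC extraction, FakeCollapseModel is a hypothetical stand-in for any code that calls rand() before the features are computed (whether CollapseModel itself does so is not confirmed here), and uniform noise replaces the Gaussian noise Kaldi uses, to keep the sketch short.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Toy stand-in for feature extraction: adds dither noise to each sample.
// Assumption: as with Kaldi's dithering, the noise depends on the global
// rand() stream, i.e. on how many rand() calls happened earlier.
std::vector<double> ExtractWithDither(const std::vector<double> &wave,
                                      double dither) {
  assert(dither >= 0.0);
  std::vector<double> out(wave);
  if (dither == 0.0) return out;  // no randomness consumed at all
  for (double &x : out) {
    // Uniform noise in [-0.5, 0.5] (Kaldi's dither is Gaussian).
    double u = static_cast<double>(std::rand()) / RAND_MAX - 0.5;
    x += dither * u;
  }
  return out;
}

// Hypothetical stand-in for CollapseModel(): any call that consumes rand().
void FakeCollapseModel() { (void)std::rand(); }

// Simulates one program run: rand() starts from its default state (the C
// standard says this equals srand(1)), optionally "loads the model", then
// extracts features.
std::vector<double> Run(const std::vector<double> &wave, double dither,
                        bool load_model) {
  std::srand(1);
  if (load_model) FakeCollapseModel();
  return ExtractWithDither(wave, dither);
}
```

With dither 1.0, Run(wave, 1.0, false) and Run(wave, 1.0, true) differ even though the input is identical; with dither 0.0 both return the input unchanged. For the online and offline features to match, presumably the same --dither=0.0 setting has to be used both in steps/make_mfcc.sh and in the online feature config.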
