The length of the MFCC feature vector?

K.R

Sep 20, 2016, 5:24:46 AM

to kaldi-help

I am currently trying to implement a deep neural network that, given audio samples, can produce the MFCC feature vectors that Kaldi outputs.

I just opened the feature file and don't understand the length of the vectors. From what I have read, each one is supposed to have 39 components, but I only have 29?

Where are the last 10? How does Kaldi format the MFCC features, and what does each feature contain?

Danijel Korzinek

Sep 21, 2016, 1:46:25 AM

to kaldi-help

The features depend on which example you are using. Can you give more details on exactly which scripts you used to generate the features?

K.R

Sep 21, 2016, 6:19:46 AM

to kaldi-help

# Feature extraction: compute raw MFCCs, then per-speaker CMVN statistics
mfccdir=mfcc

for x in data/train; do
  steps/make_mfcc.sh --cmd "$train_cmd" --nj 42 $x exp/make_mfcc/$x $mfccdir
  utils/fix_data_dir.sh $x
  steps/compute_cmvn_stats.sh $x exp/make_mfcc/$x $mfccdir
  utils/fix_data_dir.sh $x
done

This is what I ran.

Danijel Korzinek

Sep 21, 2016, 6:44:14 AM

to kaldi-help

make_mfcc.sh uses the standard settings of the compute-mfcc-feats program. It also automatically reads the "conf/mfcc.conf" file if it is present (check whether you have that file and what it contains).

Here are the standard settings:

compute-mfcc-feats

Create MFCC feature files.

Usage: compute-mfcc-feats [options...] <wav-rspecifier> <feats-wspecifier>

Options:

--blackman-coeff : Constant coefficient for generalized Blackman window. (float, default = 0.42)

--cepstral-lifter : Constant that controls scaling of MFCCs (float, default = 22)

--channel : Channel to extract (-1 -> expect mono, 0 -> left, 1 -> right) (int, default = -1)

--debug-mel : Print out debugging information for mel bin computation (bool, default = false)

--dither : Dithering constant (0.0 means no dither) (float, default = 1)

--energy-floor : Floor on energy (absolute, not relative) in MFCC computation (float, default = 0)

--frame-length : Frame length in milliseconds (float, default = 25)

--frame-shift : Frame shift in milliseconds (float, default = 10)

--high-freq : High cutoff frequency for mel bins (if < 0, offset from Nyquist) (float, default = 0)

--htk-compat : If true, put energy or C0 last and use a factor of sqrt(2) on C0. Warning: not sufficient to get HTK compatible features (need to change other parameters). (bool, default = false)

--low-freq : Low cutoff frequency for mel bins (float, default = 20)

--min-duration : Minimum duration of segments to process (in seconds). (float, default = 0)

--num-ceps : Number of cepstra in MFCC computation (including C0) (int, default = 13)

--num-mel-bins : Number of triangular mel-frequency bins (int, default = 23)

--output-format : Format of the output files [kaldi, htk] (string, default = "kaldi")

--preemphasis-coefficient : Coefficient for use in signal preemphasis (float, default = 0.97)

--raw-energy : If true, compute energy before preemphasis and windowing (bool, default = true)

--remove-dc-offset : Subtract mean from waveform on each frame (bool, default = true)

--round-to-power-of-two : If true, round window size to power of two. (bool, default = true)

--sample-frequency : Waveform data sample frequency (must match the waveform file, if specified there) (float, default = 16000)

--snip-edges : If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame-length. If false, the number of frames depends only on the frame-shift, and we reflect the data at the ends. (bool, default = true)

--subtract-mean : Subtract mean of each feature file [CMS]; not recommended to do it this way. (bool, default = false)

--use-energy : Use energy (not C0) in MFCC computation (bool, default = true)

--utt2spk : Utterance to speaker-id map rspecifier (if doing VTLN and you have warps per speaker) (string, default = "")

--vtln-high : High inflection point in piecewise linear VTLN warping function (if negative, offset from high-mel-freq (float, default = -500)

--vtln-low : Low inflection point in piecewise linear VTLN warping function (float, default = 100)

--vtln-map : Map from utterance or speaker-id to vtln warp factor (rspecifier) (string, default = "")

--vtln-warp : Vtln warp factor (only applicable if vtln-map not specified) (float, default = 1)

--window-type : Type of window ("hamming"|"hanning"|"povey"|"rectangular"|"blackmann") (string, default = "povey")

Standard options:

--config : Configuration file to read (this option may be repeated) (string, default = "")

--help : Print out usage message (bool, default = false)

--print-args : Print the command line arguments (to stderr) (bool, default = true)

--verbose : Verbose level (higher->more logging) (int, default = 0)

So it generates a feature vector of length 13 (num-ceps option).

The deltas and accelerations (delta-deltas) are usually computed on the fly, and how depends on the setup. For example, train_deltas.sh passes the features through the add-deltas program, which has these default settings:

add-deltas

Add deltas (typically to raw mfcc or plp features

Usage: add-deltas [options] in-rspecifier out-wspecifier

Options:

--delta-order : Order of delta computation (int, default = 2)

--delta-window : Parameter controlling window for delta computation (actual window size for each delta order is 1 + 2*delta-window-size) (int, default = 2)

--truncate : If nonzero, first truncate features to this dimension. (int, default = 0)

That means that the final vector in that setup has a length of 39 (13 MFCCs + 13 deltas + 13 accelerations).
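
As a rough sketch of that arithmetic: add-deltas keeps the input features and appends one block of the same size per delta order, so the output dimension is the input dimension times (delta-order + 1). The helper name final_dim is made up for illustration. The optional pipeline at the end assumes the Kaldi binaries are on your PATH and that data/train/feats.scp exists, so adjust the paths to your own setup.

```shell
#!/usr/bin/env bash
# final_dim is a hypothetical helper (not part of Kaldi):
# output dimension of add-deltas = base_dim * (delta_order + 1)
final_dim() {
  local base_dim=$1 delta_order=$2
  echo $(( base_dim * (delta_order + 1) ))
}

final_dim 13 2   # prints 39: 13 MFCC + 13 delta + 13 acceleration
final_dim 13 1   # prints 26: 13 MFCC + 13 delta, no acceleration

# Optional check against real features (assumed path; skipped if Kaldi is absent).
if command -v add-deltas >/dev/null 2>&1 && [ -f data/train/feats.scp ]; then
  add-deltas scp:data/train/feats.scp ark:- | feat-to-dim ark:- -
fi
```

The stored features stay 13-dimensional in this scheme; the 39-dimensional vectors only exist inside the pipeline.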

train_lda_mllt.sh, on the other hand, takes the 13 MFCC features, splices each frame with 4 frames on the left and 4 on the right (so 9*13 = 117) and uses an LDA transform to reduce that to 40 features (as set in the config of the script).
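
The splicing arithmetic can be sketched the same way: the spliced dimension is the number of stacked frames (left + right + 1) times the per-frame dimension. The helper name spliced_dim is hypothetical, and the optional pipeline assumes Kaldi's splice-feats and feat-to-dim are on your PATH and that data/train/feats.scp exists.

```shell
#!/usr/bin/env bash
# spliced_dim is a hypothetical helper (not part of Kaldi):
# dimension after splicing = (left + right + 1) * base_dim
spliced_dim() {
  local base_dim=$1 left=$2 right=$3
  echo $(( (left + right + 1) * base_dim ))
}

spliced_dim 13 4 4   # prints 117: 9 frames of 13 MFCCs

# Optional check against real features (assumed path; skipped if Kaldi is absent).
if command -v splice-feats >/dev/null 2>&1 && [ -f data/train/feats.scp ]; then
  splice-feats --left-context=4 --right-context=4 scp:data/train/feats.scp ark:- \
    | feat-to-dim ark:- -
fi
```

The LDA step that follows is a learned projection, so its output size (40 here) is a configuration choice rather than something that falls out of this arithmetic.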

DNN setups like nnet3/run_tdnn.sh usually use the hires setup (conf/mfcc_hires.conf), which converts 40 mel filters into 40 MFCCs, splices the frames (9 frames, as above) and appends an iVector (usually of dimension 100). With those numbers the network input would have 9*40 + 100 = 460 features; the exact figure depends on the splicing the recipe uses.

I don't know of any setup that uses 29 features.

Some papers on TIMIT (e.g. Alex Graves' thesis) use 26 features, which is 13 MFCCs + deltas and no accelerations.

Armando

Sep 21, 2016, 6:52:15 AM

to kaldi-help

How did you open the feature archive to verify the number of entries for each frame?
Did you output the feature archive in text format?

Danijel Korzinek

Sep 21, 2016, 6:57:32 AM

to kaldi-help

In case it's not obvious, there is a utility that does this:

feat-to-dim

Reads an archive of features. If second argument is wxfilename, writes the feature dimension of the first feature file; if second argument is wspecifier, writes an archive of the feature dimension, indexed by utterance id.

Usage: feat-to-dim [options] <feat-rspecifier> (<dim-wspecifier>|<dim-wxfilename>)

e.g.: feat-to-dim scp:feats.scp -

K.R

Sep 21, 2016, 3:11:03 PM

to kaldi-help

I used copy-feats to view my MFCC features.

Danijel Korzinek

Sep 21, 2016, 3:22:35 PM

to kaldi-help

Try "feat-to-dim ark:your_feat_file.ark ark,t:-" instead.

K.R

Sep 24, 2016, 12:44:03 PM

to kaldi-help

I am not sure whether this output makes more sense:



k@k-ThinkPad-T420s:~/kaldi-trunk/egs/start/s5/mfcc$ ../src/featbin/feat-to-dim ark:/home/k/kaldi-trunk/egs/start/s5/mfcc/raw_mfcc_train.1.ark ark,t:-
../src/featbin/feat-to-dim ark:/home/k/kaldi-trunk/egs/start/s5/mfcc/raw_mfcc_train.1.ark ark,t:- 
fcaw-b-an406 13 
fcaw-b-an407 13 
fcaw-b-an408 13 
fcaw-b-an409 13 
fcaw-b-an410 13 
fcaw-b-cen1 13 
fcaw-b-cen2 13 
fcaw-b-cen3 13 
fcaw-b-cen4 13 
fcaw-b-cen5 13 
fcaw-b-cen6 13 
fcaw-b-cen7 13 
fcaw-b-cen8 13

Danijel Korzinek

Sep 25, 2016, 8:56:08 AM

to kaldi-help

Yes, it does. Every utterance in that archive has exactly 13 features per frame: 12 MFCCs + energy. Deltas are added later by a different program (add-deltas).
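
For a large archive it can be easier to collapse the per-utterance listing into the set of distinct dimensions. A minimal sketch, assuming the text output format shown in this thread (utterance id followed by the dimension); distinct_dims is a hypothetical helper name, and the sample lines are copied from the output above.

```shell
#!/usr/bin/env bash
# distinct_dims is a hypothetical helper: reads "utt-id dim" lines on stdin
# and prints each distinct dimension once.
distinct_dims() {
  awk '{print $2}' | sort -u
}

# Sample lines taken from the feat-to-dim output in this thread.
printf '%s\n' \
  'fcaw-b-an406 13' \
  'fcaw-b-an407 13' \
  'fcaw-b-cen1 13' | distinct_dims   # prints 13
```

With Kaldi installed, the same helper could be fed from "feat-to-dim ark:your_feat_file.ark ark,t:-"; a single line of output means all utterances share one dimension.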
