Why does CNN-TDNN require MFCCs to be converted to Mel Filterbanks?

Sage Khan (Sage Khan)

Oct 6, 2022, 1:27:56 AM

to kaldi-help

In the CNN-TDNN training stage I found that the MFCCs are converted to Mel filterbanks. The docs say the same. The CNN-TDNN diagram also shows the input features taking in 200 ivectors and 40 Mel filterbanks. Is there a particular reason why Mel filterbanks are used here, whereas the previous steps use MFCCs?

What I understand is that MFCCs are still used because, being decorrelated, they compress more easily: we dump them to disk compressed to 1 byte per coefficient. Since we dump all the coefficients, the MFCCs are equivalent to the filterbanks multiplied by a full-rank matrix, so no information is lost. They are also the most familiar features in the field, so they are preferred.

But is there a reason why we need to convert them for the CNN-TDNN?

Script for Reference:

https://github.com/anish9208/gramvaani_hindi_asr/blob/main/kaldi/asr/Run_cnn-tdnn.sh

Daniel Povey

Oct 6, 2022, 4:43:46 AM

to kaldi...@googlegroups.com

Because the CNN needs to have a meaningful non-time dimension to operate on, so we want to convert back into frequency space.

They are still dumped to disk as MFCCs; we just invert the cosine transformation.
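To make the inversion concrete, here is a minimal NumPy sketch, not Kaldi's actual code. It assumes 40 mel bins and 40 cepstral coefficients, an orthonormal DCT-II, and a cepstral lifter of 22; the function names (`dct_matrix`, `lifter_coeffs`, `fbank_to_mfcc`, `mfcc_to_fbank`) are made up for illustration. Because all coefficients are kept, the MFCCs are a full-rank linear transform of the log-mel filterbank energies and can be mapped back without loss.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (rows are cosine basis vectors)."""
    k = np.arange(n)[:, None]   # cepstral index
    m = np.arange(n)[None, :]   # mel-bin index
    d = np.sqrt(2.0 / n) * np.cos(np.pi / n * (m + 0.5) * k)
    d[0, :] = np.sqrt(1.0 / n)  # first row scaled so that d @ d.T == I
    return d

def lifter_coeffs(n, q=22.0):
    """Cepstral liftering weights; q=22 is an assumed setting."""
    return 1.0 + 0.5 * q * np.sin(np.pi * np.arange(n) / q)

def fbank_to_mfcc(log_fbank, q=22.0):
    """Log-mel filterbank frames (T, n) -> liftered MFCC frames (T, n)."""
    n = log_fbank.shape[-1]
    return (log_fbank @ dct_matrix(n).T) * lifter_coeffs(n, q)

def mfcc_to_fbank(mfcc, q=22.0):
    """Undo the liftering, then invert the DCT with its transpose."""
    n = mfcc.shape[-1]
    return (mfcc / lifter_coeffs(n, q)) @ dct_matrix(n)

# Random stand-in for 5 frames of 40-bin log-mel energies.
rng = np.random.default_rng(0)
log_fbank = rng.normal(size=(5, 40))
mfcc = fbank_to_mfcc(log_fbank)
recovered = mfcc_to_fbank(mfcc)
max_err = np.max(np.abs(recovered - log_fbank))
```

Here `max_err` should be at floating-point rounding level. The round trip is exact only because no coefficients were dropped; with a truncated MFCC (for example 13 of 40 coefficients) the transform is no longer invertible.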

--
Go to http://kaldi-asr.org/forums.html to find out how to join the kaldi-help group
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/4bf699f4-9d78-4f56-a634-3229394fddc6n%40googlegroups.com.

Sage Khan (Sage Khan)

Oct 6, 2022, 5:20:08 AM

to kaldi-help

So basically, in chain CNN-TDNN training, what the scripts do is take the features stored in the MFCC files and convert them to filterbanks, in effect reversing the DCT step.

TDNNs work in the temporal domain whereas CNNs do not. So introducing CNN layers before the TDNN layers allows the whole CNN-TDNN to learn frequency information as well as temporal information, right?

Daniel Povey

Oct 6, 2022, 6:06:07 AM

to kaldi...@googlegroups.com

Both have a time dimension (TDNN == 1-D CNN), but the CNN also has a frequency dimension.
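The distinction can be sketched in NumPy as follows. This is an illustration under assumed shapes (10 frames, 40 frequency bins, 8 output units), not Kaldi's layer implementation, and `tdnn_layer` and `conv2d_layer` are hypothetical names. The TDNN-style layer slides only along time and sees the whole feature vector at each offset, while the 2-D convolution also slides a small kernel along the frequency axis, which is why that axis needs to be meaningful.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def tdnn_layer(x, w):
    """1-D convolution over time. x: (T, F), w: (context, F, out_dim)."""
    context = w.shape[0]
    # windows: (T - context + 1, F, context)
    windows = sliding_window_view(x, context, axis=0)
    return np.einsum("tfc,cfo->to", windows, w)

def conv2d_layer(x, w):
    """2-D convolution over time and frequency. x: (T, F), w: (kt, kf, out_ch)."""
    kt, kf, _ = w.shape
    # windows: (T - kt + 1, F - kf + 1, kt, kf)
    windows = sliding_window_view(x, (kt, kf))
    return np.einsum("tfab,abo->tfo", windows, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 40))          # 10 frames, 40 frequency bins (assumed)

w_tdnn = rng.normal(size=(3, 40, 8))   # 3 frames of context, full feature width
w_cnn = rng.normal(size=(3, 3, 8))     # 3x3 time-frequency kernel

tdnn_out = tdnn_layer(x, w_tdnn)       # time axis only: (8, 8)
cnn_out = conv2d_layer(x, w_cnn)       # keeps a frequency axis: (8, 38, 8)

# A TDNN layer is the special case of a 2-D kernel spanning the whole feature axis.
same_as_tdnn = conv2d_layer(x, w_tdnn)[:, 0, :]
```

With a kernel as wide as the feature vector, the 2-D convolution collapses to the TDNN output, so nothing about the ordering of the feature dimensions matters. A small frequency kernel, by contrast, shares weights across neighbouring bins, which only makes sense when adjacent dimensions are adjacent frequencies, as in filterbanks and unlike MFCCs.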

Sage Khan (Sage Khan)

Oct 6, 2022, 6:46:37 AM

to kaldi-help

Thank you so much, sir. That clarifies it :)

Sage Khan (Sage Khan)

Oct 6, 2022, 6:47:27 AM

to kaldi-help

Since the CNN-TDNN takes both time and frequency into account, this could possibly make it good for paralinguistic speech processing scenarios like speech emotion recognition, right? (Theoretically)
