Question about Kaldi Spec-Augment implementation

mura...@gmail.com

Dec 26, 2020, 10:36:11 AM

to kaldi-help

Hi, I have 2 questions about spec-augment.

idct-layer name=idct input=input dim=40 cepstral-lifter=22 affine-transform-file=$dir/configs/idct.mat

batchnorm-component name=batchnorm0 input=idct

spec-augment-layer name=spec-augment freq-max-proportion=0.5 time-zeroed-proportion=0.2 time-mask-max-frames=20

About the idct-layer:

I understand that Kaldi performs an IDCT (Inverse Discrete Cosine Transform) to get the filterbanks back out of the MFCCs. But why does one apply spec-augment to the filterbanks rather than to the MFCCs? I also do not understand what the cepstral-lifter parameter does. Can somebody explain it to me, please?

About the spec-augment layer:

I understand from the SpecAugment paper (https://arxiv.org/pdf/1904.08779.pdf) that time-mask-max-frames is the parameter for the time-masking transformation, which makes the network robust to small losses of speech segments. But what is freq-max-proportion=0.5? If it were an integer I would assume it was the number of consecutive mel frequency channels, as in the paper, but since it is a decimal I have no clue. I also do not understand what time-zeroed-proportion=0.2 is.

Thanks,

Merry xmas and new year

Daniel Povey

Dec 26, 2020, 11:42:55 PM

to kaldi-help

> Hi, I have 2 questions about spec-augment.
>
> idct-layer name=idct input=input dim=40 cepstral-lifter=22 affine-transform-file=$dir/configs/idct.mat
> batchnorm-component name=batchnorm0 input=idct
> spec-augment-layer name=spec-augment freq-max-proportion=0.5 time-zeroed-proportion=0.2 time-mask-max-frames=20
>
> About the idct-layer:
>
> I understand that Kaldi performs an idct (Inverse Discrete Cosine Transform) to get the filterbanks out of the MFCCs. But why does one apply spec-augment on the filterbanks rather than on the MFCCs ?

It doesn't make sense to take out a band of MFCCs the way it does to take out a band of frequencies; masking a band of MFCCs corresponds to no physical process.
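A small numerical illustration of this point, not taken from Kaldi: the sketch below assumes an orthonormal DCT-II with as many cepstra as mel bins (40, matching dim=40 in the config above) and uses random numbers in place of real log-mel energies. The names `dct_matrix`, `log_mel` and `mfcc` are made up for the example. Zeroing five neighbouring mel bins removes a localized frequency band, whereas zeroing five neighbouring cepstral coefficients perturbs the mel spectrum across many bins at once.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D, so that mfcc = D @ log_mel and log_mel = D.T @ mfcc."""
    k = np.arange(n)[:, None]   # cepstral index (rows)
    m = np.arange(n)[None, :]   # mel-bin index (columns)
    d = np.sqrt(2.0 / n) * np.cos(np.pi * k * (m + 0.5) / n)
    d[0, :] = np.sqrt(1.0 / n)  # first row gets the smaller scale factor
    return d

n = 40
D = dct_matrix(n)
rng = np.random.default_rng(0)
log_mel = rng.normal(size=n)    # stand-in for one frame of log-mel energies
mfcc = D @ log_mel

# Case 1: mask mel bins 10..14 directly. Only those five bins change.
masked_mel = log_mel.copy()
masked_mel[10:15] = 0.0
mel_bins_changed_by_mel_mask = np.count_nonzero(~np.isclose(masked_mel, log_mel))

# Case 2: mask cepstral coefficients 10..14, then map back to the mel domain.
masked_mfcc = mfcc.copy()
masked_mfcc[10:15] = 0.0
mel_from_masked_mfcc = D.T @ masked_mfcc
mel_bins_changed_by_mfcc_mask = np.count_nonzero(~np.isclose(mel_from_masked_mfcc, log_mel))

print("mel bins changed by a mel-band mask: ", mel_bins_changed_by_mel_mask)
print("mel bins changed by an MFCC-band mask:", mel_bins_changed_by_mfcc_mask)
```

The second count is far larger than the first because each cepstral coefficient is a cosine that spans the whole frequency axis.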

> I also do not understand what the cepstral-lifter parameter does. Can somebody explain me please?

It's a weight on the cepstral coefficients; it's important that it be the same as the one we used when dumping the MFCCs.

> About the spec-augment layer:
>
> I understand that according to the SpecAugment paper: https://arxiv.org/pdf/1904.08779.pdf time-mask-max-frames is that parameter that applies the transformation to make the network robust to small losses of speech segments but what is freq-max-proportion=0.5(If it was an integer I would assume it would be the number of consecutive mel frequency channels according to the paper, but since this number is a decimal I have no clue).

I think it's the maximum proportion of the frequency space that can be zeroed at one time (by the frequency masking).

> I also do not understand what is time-zeroed-proportion=0.2?

Proportion of time axis that is to be zeroed out
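To make the three parameters concrete, here is a rough sketch of how they could be interpreted, based only on the descriptions above. It is not Kaldi's implementation: the function name `spec_augment_sketch`, the choice of a single frequency band, and the way time masks are drawn until the target is hit exactly are all assumptions made for the illustration.

```python
import numpy as np

def spec_augment_sketch(feats, freq_max_proportion=0.5,
                        time_zeroed_proportion=0.2,
                        time_mask_max_frames=20, rng=None):
    """Return a masked copy of feats, shaped (num_frames, num_bins).

    Illustrative only; the real Kaldi layer may differ in its details.
    """
    if rng is None:
        rng = np.random.default_rng()
    out = feats.copy()
    num_frames, num_bins = out.shape

    # Frequency mask: one contiguous band whose width is a random number of
    # bins, at most freq_max_proportion of the frequency axis.
    max_width = int(freq_max_proportion * num_bins)
    width = int(rng.integers(0, max_width + 1))
    start = int(rng.integers(0, num_bins - width + 1))
    out[:, start:start + width] = 0.0

    # Time masks: keep placing masks of at most time_mask_max_frames frames
    # until time_zeroed_proportion of the frames has been zeroed.
    target = int(time_zeroed_proportion * num_frames)
    zeroed = np.zeros(num_frames, dtype=bool)
    while zeroed.sum() < target:
        remaining = target - int(zeroed.sum())
        length = min(int(rng.integers(1, time_mask_max_frames + 1)), remaining)
        t0 = int(rng.integers(0, num_frames - length + 1))
        zeroed[t0:t0 + length] = True
    out[zeroed, :] = 0.0
    return out

# Toy input: 200 frames of 40-dimensional features, all ones.
masked = spec_augment_sketch(np.ones((200, 40)), rng=np.random.default_rng(0))
print("fraction of frames zeroed:", (masked == 0).all(axis=1).mean())
```

Expressing the frequency mask as a proportion rather than a channel count means the same setting carries over to features with a different number of mel bins.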

> Thanks,
> Merry xmas and new year

--
Go to http://kaldi-asr.org/forums.html to find out how to join the kaldi-help group
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/6896a28e-5464-4314-aa95-f830710130dbn%40googlegroups.com.

Ho Yin Chan

Dec 27, 2020, 11:39:58 PM

to kaldi-help

I believe RWTH applied SpecAugment to the log-mel bands and to the i-vector as well.

https://arxiv.org/pdf/2004.00960.pdf


mura...@gmail.com

Jan 3, 2021, 10:53:01 PM

to kaldi-help

Thanks for the answer, Dan, and for the paper, Ricky. They were very helpful and interesting, respectively.

I have just 3 last questions regarding Kaldi's implementation of SpecAugment.

1) Why do most recipes have a batchnorm-layer after the idct layer?

2) Other Kaldi augmentation techniques, like the default speed-perturb, triple the original data (because of the 0.9, 1.0, 1.1 factors). By what factor does SpecAugment increase the data? Does it just double it, keeping one version untouched and another with the frequency masks, time masks and temporal deformations? Or does it triple or quadruple it, adding variations of the three types of SpecAugment transformations?

3) When combining speed perturb + SpecAugment, most recipes typically add SpecAugment on top of the speed-perturbed data, right? (Not the other way around.)

Thanks a lot for your attention,

Daniel Povey

Jan 3, 2021, 11:00:37 PM

to kaldi-help

SpecAug doesn't change the amount of data; it's applied randomly each epoch. In principle you can use more epochs.

In practice we didn't find SpecAug helpful in Kaldi except for mini_librispeech (v. small data).

It could be that it works for reasons that are specific to model types that we don't use, such as transformers.
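The "applied randomly each epoch" point can be pictured with a toy loop. This is only a sketch under assumptions: `random_freq_mask` and the three all-ones "utterances" are invented for the example, and a real training loop would feed the masked views to the network instead of collecting them in a list.

```python
import numpy as np

def random_freq_mask(feats, max_proportion, rng):
    """Zero one random contiguous band of bins in a copy of feats."""
    num_bins = feats.shape[1]
    width = int(rng.integers(0, int(max_proportion * num_bins) + 1))
    start = int(rng.integers(0, num_bins - width + 1))
    out = feats.copy()
    out[:, start:start + width] = 0.0
    return out

rng = np.random.default_rng(0)
dataset = [np.ones((50, 40)) for _ in range(3)]  # 3 toy utterances

views_per_epoch = []
for epoch in range(4):
    # Masks are drawn afresh on every pass; nothing is written back to the
    # dataset, so the number of training examples per epoch never changes.
    views = [random_freq_mask(utt, 0.5, rng) for utt in dataset]
    views_per_epoch.append(views)

print("utterances per epoch:", [len(v) for v in views_per_epoch])
```

This contrasts with speed perturbation as described in the question, where the perturbed copies are separate data and the training set really does grow threefold.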


mura...@gmail.com

Jan 11, 2021, 2:04:24 PM

to kaldi-help

Thanks a lot for the reply.

I have been making an effort to understand cepstral liftering better, by reading: https://maxwell.ict.griffith.edu.au/spl/publications/papers/euro99_kkp_fbe.pdf (Decorrelated and Liftered Filter-bank Energies for Robust Speech Recognition)

and I understood (like you mentioned) that essentially this process is a way of reweighting the cepstral coefficients to give more importance to some coefficients.

There are various types of lifters: linear lifters, statistical lifters, sinusoidal lifters, exponential lifters...

However, in Kaldi SpecAugmentation, the parameter that we pass to the script is simply a constant: cepstral-lifter=22

(https://github.com/kaldi-asr/kaldi/blob/21c17d1defc55ebaf09fa4993c1e2b7c8d440f17/egs/multi_cn/s5/local/chain/tuning/run_cnn_tdnn_1b.sh)

Do you remember what type of liftering Kaldi does?

Thanks a lot Dan

Daniel Povey

Jan 12, 2021, 12:18:08 AM

to kaldi-help

It's based on whatever HTK does. I wouldn't spend too much time on it.
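For reference, the HTK-style lifter is the sinusoidal one: coefficient n is scaled by 1 + (L/2) * sin(pi * n / L), where L is the single constant passed as cepstral-lifter. A minimal sketch follows; the helper name `htk_lifter_coeffs` is made up, and the assumption that the idct-layer has to undo this scaling (by dividing it out) is inferred from the remark earlier in the thread that the value must match the one used when dumping the MFCCs.

```python
import numpy as np

def htk_lifter_coeffs(num_ceps, lifter=22.0):
    """HTK-style sinusoidal lifter weights: 1 + (L/2) * sin(pi * n / L)."""
    n = np.arange(num_ceps)
    return 1.0 + 0.5 * lifter * np.sin(np.pi * n / lifter)

weights = htk_lifter_coeffs(40, lifter=22.0)

rng = np.random.default_rng(0)
raw_cepstra = rng.normal(size=40)    # stand-in for unliftered cepstra
liftered = raw_cepstra * weights     # what feature extraction would store
recovered = liftered / weights       # undoing the lifter with the same L

print("weight for c0: ", weights[0])
print("weight for c11:", weights[11])
```

If a different L were used when undoing the lifter, `recovered` would no longer match `raw_cepstra`, which is why the constant has to agree with the feature extraction config.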

