Question about Kaldi Spec-Augment implementation

mura...@gmail.com

Dec 26, 2020, 10:36:11 AM
to kaldi-help


Hi,  I have 2 questions about spec-augment.

  idct-layer name=idct input=input dim=40 cepstral-lifter=22 affine-transform-file=$dir/configs/idct.mat
  batchnorm-component name=batchnorm0 input=idct
  spec-augment-layer name=spec-augment freq-max-proportion=0.5 time-zeroed-proportion=0.2 time-mask-max-frames=20

About the idct-layer:

I understand that Kaldi performs an IDCT (Inverse Discrete Cosine Transform) to get the filterbanks back out of the MFCCs. But why does one apply spec-augment on the filterbanks rather than on the MFCCs? I also do not understand what the cepstral-lifter parameter does. Can somebody explain it to me, please?

About the spec-augment layer:

I understand from the SpecAugment paper (https://arxiv.org/pdf/1904.08779.pdf) that time-mask-max-frames controls the time masking that makes the network robust to the loss of small speech segments. But what is freq-max-proportion=0.5? If it were an integer, I would assume it was the number of consecutive mel frequency channels, as in the paper, but since it is a decimal I have no clue. I also do not understand what time-zeroed-proportion=0.2 means.


Thanks,
Merry xmas and new year



Daniel Povey

Dec 26, 2020, 11:42:55 PM
to kaldi-help


> Hi, I have 2 questions about spec-augment.
>
>   idct-layer name=idct input=input dim=40 cepstral-lifter=22 affine-transform-file=$dir/configs/idct.mat
>   batchnorm-component name=batchnorm0 input=idct
>   spec-augment-layer name=spec-augment freq-max-proportion=0.5 time-zeroed-proportion=0.2 time-mask-max-frames=20
>
> About the idct-layer:
>
> I understand that Kaldi performs an IDCT (Inverse Discrete Cosine Transform) to get the filterbanks back out of the MFCCs. But why does one apply spec-augment on the filterbanks rather than on the MFCCs?

It doesn't make sense to take out a band of MFCCs the way it does to take out a band of frequencies; taking out a band of MFCCs corresponds to no physical process.
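
To make the point above concrete, here is a minimal sketch (not Kaldi code) comparing the two kinds of masking. It assumes a 40-dimensional frame, an orthonormal DCT-II between log-mel filterbanks and cepstra, no liftering, and random numbers as a stand-in for a real frame; the helper names are hypothetical. Zeroing a band of mel bins removes a contiguous frequency range, while zeroing a band of cepstral coefficients perturbs mel bins across the whole spectrum.

```python
import math
import random


def dct_matrix(n):
    """Orthonormal DCT-II matrix (n x n), returned as a list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi / n * (m + 0.5) * k)
                     for m in range(n)])
    return rows


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def count_changed(a, b, tol=1e-6):
    """Number of positions where two vectors differ by more than tol."""
    return sum(abs(x - y) > tol for x, y in zip(a, b))


num_bins = 40
dct = dct_matrix(num_bins)
idct = transpose(dct)  # the inverse of an orthonormal matrix is its transpose

random.seed(0)
fbank = [random.gauss(0.0, 1.0) for _ in range(num_bins)]  # stand-in log-mel frame
mfcc = matvec(dct, fbank)

# Frequency mask: zero mel bins 10..19. Only those 10 bins change in the
# filterbank domain, but the change spreads over the cepstral coefficients.
masked_fbank = [0.0 if 10 <= i < 20 else x for i, x in enumerate(fbank)]
ceps_changed = count_changed(mfcc, matvec(dct, masked_fbank))

# "Cepstral mask": zero coefficients 10..19, then go back to filterbanks.
# The damage is spread over the mel bins instead of forming a band.
masked_mfcc = [0.0 if 10 <= i < 20 else x for i, x in enumerate(mfcc)]
bins_changed = count_changed(fbank, matvec(idct, masked_mfcc))

print("cepstra changed by a 10-bin frequency mask:", ceps_changed)
print("mel bins changed by a 10-coefficient cepstral mask:", bins_changed)
```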
 
> I also do not understand what the cepstral-lifter parameter does. Can somebody explain it to me, please?

It's a weight on the cepstral coefficients. It is important that it be the same as the one we used when dumping the MFCCs.


About the spec-augment layer:

> I understand from the SpecAugment paper (https://arxiv.org/pdf/1904.08779.pdf) that time-mask-max-frames controls the time masking that makes the network robust to the loss of small speech segments. But what is freq-max-proportion=0.5? If it were an integer, I would assume it was the number of consecutive mel frequency channels, as in the paper, but since it is a decimal I have no clue.

I think it is the maximum proportion of the frequency axis that can be zeroed at one time by the frequency mask.
 
> I also do not understand what time-zeroed-proportion=0.2 means.

It is the proportion of the time axis that is to be zeroed out.
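
As a rough illustration of how the three options could interact, here is a simplified sketch. It is not the Kaldi implementation: the function name, the single frequency band, and the way the time budget is spent are all assumptions made for the example. Only the three parameter names come from the config line above.

```python
import random


def spec_augment_sketch(feats, freq_max_proportion=0.5,
                        time_zeroed_proportion=0.2,
                        time_mask_max_frames=20, rng=None):
    """Return a masked copy of feats (a list of frames, each a list of bins)."""
    rng = rng or random.Random()
    num_frames, num_bins = len(feats), len(feats[0])
    out = [row[:] for row in feats]

    # Frequency mask: one contiguous band whose width is at most
    # freq_max_proportion of the frequency axis.
    max_width = int(freq_max_proportion * num_bins)
    width = rng.randint(0, max_width)
    f_start = rng.randint(0, num_bins - width)
    for row in out:
        for b in range(f_start, f_start + width):
            row[b] = 0.0

    # Time masks: spend a budget of time_zeroed_proportion * num_frames
    # frames in runs of at most time_mask_max_frames. Runs may overlap,
    # so the fraction actually zeroed can be lower than the budget.
    budget = int(time_zeroed_proportion * num_frames)
    zeroed = [False] * num_frames
    while budget > 0:
        length = rng.randint(1, min(time_mask_max_frames, budget))
        t_start = rng.randint(0, num_frames - length)
        for t in range(t_start, t_start + length):
            zeroed[t] = True
        budget -= length
    for t, is_zeroed in enumerate(zeroed):
        if is_zeroed:
            out[t] = [0.0] * num_bins
    return out, width, zeroed


# Usage: 200 frames of 40-dimensional features, all ones.
feats = [[1.0] * 40 for _ in range(200)]
masked, freq_width, zeroed_frames = spec_augment_sketch(
    feats, rng=random.Random(0))
print("frequency bins masked:", freq_width, "of 40")
print("frames zeroed:", sum(zeroed_frames), "of 200")
```

With these settings at most 20 of the 40 frequency bins and at most 40 of the 200 frames can be zeroed in one pass.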



Ho Yin Chan

Dec 27, 2020, 11:39:58 PM
to kaldi-help
I believe RWTH applied SpecAugment to the log-mel bands and to the i-vector as well.

mura...@gmail.com

Jan 3, 2021, 10:53:01 PM
to kaldi-help
Thanks for the answer, Dan, and for the paper, Ricky; they were very helpful and interesting, respectively.

I have just 3 last questions regarding the Kaldi implementation of SpecAugment.

1) Why do most recipes have a batchnorm-layer after the idct layer?

2) Other Kaldi augmentation techniques, like the default speed perturbation, triple the original data (because of the 0.9, 1.0 and 1.1 factors). By what factor does SpecAugment increase the data? Does it just double it (keeping one version untouched and another with the frequency masks, time masks and temporal deformations)? Or does it triple or quadruple it, adding variations of the three types of SpecAugment transformation?

3) When combining speed perturbation and SpecAugment, most recipes apply SpecAugment on top of the speed-perturbed data, right? (Not the other way around.)

Thanks a lot for your attention,

Daniel Povey

Jan 3, 2021, 11:00:37 PM
to kaldi-help
SpecAug doesn't change the amount of data; it's applied randomly each epoch.  In principle you can use more epochs.
In practice we didn't find SpecAug helpful in Kaldi except for mini_librispeech (very small data).
It could be that it works for reasons that are specific to model types that we don't use, such as transformers.
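
A small sketch of what "applied randomly each epoch" means in contrast to offline augmentation such as speed perturbation. The utterance IDs, frame counts and helper name are hypothetical, and only a single time mask per utterance is drawn to keep it short. The set of utterances stays the same size; each epoch just draws fresh masks for it.

```python
import random


def random_time_mask(num_frames, max_frames, rng):
    """Pick one [start, end) frame range to zero, at most max_frames long."""
    length = rng.randint(1, max_frames)
    start = rng.randint(0, num_frames - length)
    return start, start + length


rng = random.Random(0)
utterances = {"utt1": 300, "utt2": 450}  # hypothetical utterance ID -> frame count

masks_per_epoch = []
for epoch in range(3):
    # No new copies of the data are written out; the same utterances
    # are seen again, each with a newly drawn mask.
    masks = {utt: random_time_mask(n, 20, rng) for utt, n in utterances.items()}
    masks_per_epoch.append(masks)
    print("epoch", epoch, "utterances:", len(masks), "masks:", masks)
```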

mura...@gmail.com

Jan 11, 2021, 2:04:24 PM
to kaldi-help
Thanks a lot for the reply.

I have been making an effort to understand cepstral liftering better by reading "Decorrelated and Liftered Filter-bank Energies for Robust Speech Recognition" (https://maxwell.ict.griffith.edu.au/spl/publications/papers/euro99_kkp_fbe.pdf), and I understood (like you mentioned) that essentially this process is a way of reweighting the cepstral coefficients to give more importance to some coefficients.

There are various types of lifters: linear lifters, statistical lifters, sinusoidal lifters, exponential lifters...

However, in the Kaldi SpecAugment recipes, the parameter that we pass to the script is simply a constant: cepstral-lifter=22

Do you remember what type of liftering Kaldi does?

Thanks a lot Dan

Daniel Povey

Jan 12, 2021, 12:18:08 AM
to kaldi-help
It's based on whatever HTK does.  I wouldn't spend too much time on it.
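
For reference, the lifter documented for HTK is sinusoidal: coefficient n is scaled by 1 + (L/2) * sin(pi * n / L), where L is the lifter parameter (22 here). The sketch below computes those weights and shows why the idct-layer needs the same value that was used when the MFCCs were dumped: the weighting can only be undone with matching weights. The function name is hypothetical, and how Kaldi folds the inverse weights into idct.mat is not shown here.

```python
import math


def lifter_coeffs(num_ceps, lifter=22.0):
    """HTK-style sinusoidal lifter weights for coefficients 0..num_ceps-1."""
    return [1.0 + 0.5 * lifter * math.sin(math.pi * n / lifter)
            for n in range(num_ceps)]


coeffs = lifter_coeffs(40, 22.0)

# Apply the lifter to a dummy cepstral vector, then undo it.
ceps = [0.5] * 40
liftered = [w * c for w, c in zip(coeffs, ceps)]
restored = [x / w for x, w in zip(liftered, coeffs)]

# Undoing it with weights for a different lifter value does not
# recover the original cepstra.
wrong = lifter_coeffs(40, 10.0)
badly_restored = [x / w for x, w in zip(liftered, wrong)]

print("first weights:", [round(w, 3) for w in coeffs[:5]])
print("max error, matching lifter:",
      max(abs(a - b) for a, b in zip(restored, ceps)))
print("max error, mismatched lifter:",
      max(abs(a - b) for a, b in zip(badly_restored, ceps)))
```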
