Question about TDNN network context, time-stride and input representation

karex trx

Oct 19, 2020, 4:44:06 PM

to kaldi-help

Hi,
How is the input context specified for a TDNN?
The example shown in Figure 1 of the paper "A time delay neural network architecture for efficient modeling of long temporal contexts" has a network context of [-13,9].

I was going through the Kaldi TDNN recipe run_tdnn_1g.sh of tedlium s5_r2,
but was unable to find the part of the neural network config that specifies
the network context.

Does the egs.chunk-width argument to train.py relate to the network context in the paper?

Also, I am a bit confused about the input representation of frames and about time-stride.
I am attaching the figure from the paper, on which I have marked chunk-width, input=Append(-2,0,+2) and time-stride=3. Could you tell me whether my understanding of all three is correct?

Thanks,

karex trx

Oct 20, 2020, 7:19:08 AM

to kaldi-help

Hi,

Can anyone please explain?

Daniel Povey

Oct 20, 2020, 8:23:19 AM

to kaldi-help

No, it's the left-context and right-context of the network, which are worked out from the output of `nnet3-info` (for nnet3) and passed to the script that dumps the egs.
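The total context can also be sanity-checked by hand: each layer's most negative splice offset adds to the left context, and its most positive offset adds to the right context. Below is a minimal sketch of that arithmetic. The per-layer offsets are assumed here from Figure 1 of the paper, and `network_context` is a made-up helper, not a Kaldi function.

```python
def network_context(layer_offsets):
    """Sum the per-layer extremes to get (left, right) context in frames."""
    left = sum(max(0, -min(offs)) for offs in layer_offsets)
    right = sum(max(0, max(offs)) for offs in layer_offsets)
    return left, right

# Assumption: splice offsets per layer as read off the paper's Figure 1.
paper_layers = [(-2, -1, 0, 1, 2), (-1, 2), (-3, 3), (-7, 2), (0,)]

left, right = network_context(paper_layers)
print(f"network context: [-{left}, {right}]")  # network context: [-13, 9]
```

Only the extremes of each layer matter, so writing the first layer as (-2, 0, 2) instead of the full range gives the same [-13, 9].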

Saying the input to a layer is Append(-3,0,3) is equivalent to using time-stride=3 for newer config scripts.
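To make the equivalence concrete, here is a small sketch. Both helpers are hypothetical (they are not part of Kaldi); they only assume that a layer with time-stride=s reads its input at t-s, t and t+s.

```python
def offsets_for_time_stride(time_stride):
    """Frame offsets a layer would read for a given time-stride (illustrative)."""
    if time_stride == 0:
        # Assumption: a stride of 0 means no splicing, only the current frame.
        return (0,)
    return (-time_stride, 0, time_stride)

def as_append(offsets):
    """Format offsets the way an Append(...) descriptor is written."""
    return "Append(" + ",".join(str(o) for o in offsets) + ")"

print(as_append(offsets_for_time_stride(3)))  # Append(-3,0,3)
```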


karex trx

Oct 20, 2020, 6:16:31 PM

to kaldi-help

Thank you Dan! I understand now that the left and right context are worked out internally by calling `nnet3-info`, which reports the context for the given network.

After reading a bit more, my understanding is that chunk-size is used to split the training samples into smaller chunks of equal size, and num-chunk-per-minibatch specifies the minibatch size for SGD.

If my understanding is correct so far, I have one more doubt, about frames-per-iter (which might sound dumb). In the run_tdnn_1g tedlium setup, frames-per-iter=500000.
Could you please explain why frames-per-iter is not specified as a multiple of the chunk size?

Kirill 'kkm' Katsnelson

Oct 21, 2020, 9:46:21 AM

to kaldi-help

500K is a good number. The larger it is, the better the GPU memory is utilized. You can bump it, but I've seen a bit of WER degradation going over 3M, and you do not want too few iterations with a small training set. Everything w.r.t. chunking is computed in get_egs.sh, so you do not need to worry about that. The basic idea is that the minibatch size varies to get the best fit, and the frames per iter would not be exactly 500K. Usually I recommend reading the scripts, but this one is messy and quite hard to grok.
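A rough illustration of why the per-iteration count does not land exactly on 500K: if the egs are divided into a whole number of equal archives, the frame count per archive is whatever the division gives. The sketch below is only an assumption about the arithmetic, not a transcription of get_egs.sh, and the total frame count is made up.

```python
def plan_archives(total_frames, frames_per_iter):
    """Split the data into a whole number of equal archives (illustrative)."""
    # Assumption: round the archive count up so no archive exceeds the target.
    num_archives = total_frames // frames_per_iter + 1
    frames_per_archive = total_frames / num_archives
    return num_archives, frames_per_archive

# Hypothetical training-set size; only frames_per_iter comes from the recipe.
n, per = plan_archives(total_frames=76_300_000, frames_per_iter=500_000)
print(n, round(per))
```

With these numbers the target of 500K is a ceiling rather than an exact value, so it does not need to be a multiple of the chunk size.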

-kkm

karex trx

Oct 21, 2020, 7:01:38 PM

to kaldi-help

Thank you, KKM for your valuable response!

karex trx

Oct 25, 2020, 11:41:51 PM

to kaldi-help

How is the sub-sampling of hidden activations, as described in "A time delay neural network architecture for efficient modeling of long temporal contexts", specified in chain TDNN training?

I am not able to determine where in tedlium/s5_r2/local/chain/run_tdnn_1g.sh sub-sampling is specified for the neural net.

mura...@gmail.com

Jan 3, 2021, 2:26:48 PM

to kaldi-help

I think my notions of "splicing context" and "time-stride" are shaky.

In this thread it is said that Append(-3,0,3) is equivalent to using time-stride=3.

If we consider the first layer in the image, which has Append(-2,0,2), the way I understand it is:

5 frames (2 left, 1 central, 2 right) of the initial context are spliced into 1 in the next layer; according to this thread, this would correspond to a time-stride of 2.

But in other posts I've seen online (https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks-Part-2/), time stride relates to the number of units the filter "jumps". However, from the image in "A time delay neural network architecture for efficient modeling of long temporal contexts" it seems the stride is actually 1: the first unit in the second layer splices frames 0-5 from the previous layer, the second unit splices 1-6, the third 2-7, the fourth 3-8. If the stride is 2, as said earlier in this thread, shouldn't the first unit splice frames 0-5 and the second unit splice 2-7?

Thanks a lot for the patience.

Nahuel

May 10, 2022, 10:53:23 AM

to kaldi-help

As I understand it, Append(-3, 0, 3) is syntactic sugar to get the feature vectors at time t-3, t and t+3 of the previous layers but not the whole range [t-3, t+3]. This is called sub-sampling in the paper (section 3.1) and it assumes that data is correlated so that you can "jump" or better "skip" certain vectors. If you want to get the whole range [t-3, t+3], you could do Append(-3, -2, -1, 0, 1, 2, 3), although there might be an easier way I don't know of. See https://groups.google.com/g/kaldi-help/c/3FAAi1oWthA/m/c_OZe5GqAQAJ and https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/libs/nnet3/xconfig/utils.py#L463 for more details.
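As a toy illustration of the difference, the sketch below splices plain Python lists. The `append` helper is hypothetical and only mimics what I understand Append to do; it assumes t has enough frames on both sides.

```python
def append(frames, t, offsets):
    """Concatenate the feature vectors at t+offset for each offset."""
    spliced = []
    for off in offsets:
        spliced.extend(frames[t + off])
    return spliced

# Ten made-up 2-dimensional feature vectors, one per frame.
frames = [[float(i), float(i) + 0.5] for i in range(10)]
t = 4

sparse = append(frames, t, (-3, 0, 3))     # frames 1, 4 and 7 only
dense = append(frames, t, range(-3, 4))    # every frame from 1 to 7

print(len(sparse), len(dense))  # 6 14
```

The sub-sampled version feeds the next layer 3 vectors instead of 7, which is where the saving in computation comes from.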

Also, "stride" is different from "time-stride": the "time-stride" parameter is used by Kaldi to know how many frames to look ahead and back within the same filter when computing an output; "stride" is usually used in convolutional neural networks to know by how much the filter is moved from calculating one output to calculating another output. So, in general, it's not incompatible to say that a TDNN layer has, for instance, time-stride=3 and stride=2: to compute the output at time t it would look at the feature vectors {t-3, t, t+3} (time-stride=3), and to compute the following output it would look at the feature vectors {t-1, t+2, t+5} (stride=2) instead of {t-2, t+1, t+4} (which would be stride=1).
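The two parameters can be separated in a few lines of code. This is just a sketch of my understanding with a made-up helper, `input_frames`, not anything from Kaldi: time-stride sets the spacing inside one filter, and stride sets how far the filter moves between outputs.

```python
def input_frames(output_index, time_stride, stride, t0=0):
    """Frames read to compute the n-th output, starting from time t0."""
    t = t0 + output_index * stride          # centre frame moves by `stride`
    return (t - time_stride, t, t + time_stride)

for stride in (1, 2):
    first_two = [input_frames(n, time_stride=3, stride=stride) for n in range(2)]
    print(f"stride={stride}: {first_two}")
# stride=1: [(-3, 0, 3), (-2, 1, 4)]
# stride=2: [(-3, 0, 3), (-1, 2, 5)]
```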

I hope it helps, and please correct me if I'm wrong.
