I think my notions of "splicing context" and "time-stride" are shaky.
Earlier in this thread it was said that Append(-3,0,3) is equivalent to using time-stride=3.
If we consider the first layer in the image, which has Append(-2,0,2), the way I understand it is:
5 frames (2 left, 1 central, 2 right) of the initial context are spliced into 1 in the next layer. According to this thread, this would correspond to a time-stride of 2.
But in other posts I've seen online (https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks-Part-2/), time stride relates to the number of units the filter "jumps".
However, from the image in "A time delay neural network architecture for efficient modeling of long temporal contexts", it seems the stride is actually 1: the first unit in the second layer splices frames 0-5 from the previous layer, the second unit splices 1-6, the third 2-7, and the fourth 3-8. If the stride is 2, as was said earlier in this thread, shouldn't the first unit splice frames 0-5 and the second unit splice 2-7?
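
To make the two readings concrete, here is a small Python sketch of what I mean by "stride" in the CNN sense. The function name `window_spans` and its parameters are just mine for illustration (nothing here comes from Kaldi or the paper), and the window width of 6 is only my reading of the image (frames 0-5), which may itself be wrong.

```python
def window_spans(width, stride, count):
    """Return the (first, last) input-frame indices read by each of `count` units.

    `width` is how many consecutive frames one unit splices and `stride` is
    how far the window moves between neighbouring units (the CNN-style
    meaning of stride, assumed here for illustration).
    """
    return [(i * stride, i * stride + width - 1) for i in range(count)]


# What I see in the image: neighbouring units shift by one frame.
print(window_spans(width=6, stride=1, count=4))  # [(0, 5), (1, 6), (2, 7), (3, 8)]

# What I would have expected if the stride were really 2.
print(window_spans(width=6, stride=2, count=2))  # [(0, 5), (2, 7)]
```

So my confusion is whether "time-stride" in this thread means the second case, or something else entirely, such as the spacing between the spliced frames.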
Thanks a lot for the patience.