OK, I have spent a lot of time on this.
I concatenate the input frames along the time axis as the columns of the input matrix (the rows are the features), and then apply several layers of the form b + W*I with a ReLU activation function.
After these layers I apply mean_dim() along the time dimension, to compute the average over time.
After that there are 2 layers whose input is the single column representing the full utterance, and a final softmax layer with a one-hot target over the N classes.
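Concretely, the whole thing looks roughly like this (a simplified sketch in DyNet-style Python; the sizes and parameter names are placeholders, not my real ones):

```python
import dynet as dy

# Placeholder sizes, just for illustration.
FEAT, HID, N = 40, 256, 10

pc = dy.ParameterCollection()
# Frame-level layers, applied to every time column.
W1, b1 = pc.add_parameters((HID, FEAT)), pc.add_parameters(HID)
W2, b2 = pc.add_parameters((HID, HID)), pc.add_parameters(HID)
# Utterance-level layers, applied after the mean pooling.
W3, b3 = pc.add_parameters((HID, HID)), pc.add_parameters(HID)
W4, b4 = pc.add_parameters((HID, HID)), pc.add_parameters(HID)
Wo, bo = pc.add_parameters((N, HID)), pc.add_parameters(N)

def utterance_loss(feats, label):
    """feats: a (FEAT x T) matrix; columns are time frames."""
    dy.renew_cg()
    p = dy.parameter
    I = dy.inputTensor(feats)
    h = dy.rectify(dy.colwise_add(p(W1) * I, p(b1)))  # b + W*I per column
    h = dy.rectify(dy.colwise_add(p(W2) * h, p(b2)))
    u = dy.mean_dim(h, [1], False)   # average over the time dim -> one column
    u = dy.rectify(p(b3) + p(W3) * u)
    u = dy.rectify(p(b4) + p(W4) * u)
    return dy.pickneglogsoftmax(p(bo) + p(Wo) * u, label)

trainer = dy.SimpleSGDTrainer(pc)
# Per-utterance training step:
#   loss = utterance_loss(feats, label); loss.value(); loss.backward(); trainer.update()
```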
The problem is that this does not converge: the loss stays around -log(1/N), which is consistent with a uniform random choice among the N outputs.
I am not sure how backpropagation works across the pooling layer (I mean the mean_dim() expression).
In particular, are the weights updated too strongly, given that they influence the output for every column of the input matrix, and those per-column contributions are accumulated when computing the mean?
Because of this doubt I also tried applying scale_gradient(1/NCols) to the input of the pooling layer, but without any improvement.
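To check my own understanding of what the pooling layer backpropagates, I wrote this tiny sanity check (plain NumPy, not framework code): if u = (1/T) * sum_t h_t, then dL/dh_t = (1/T) * dL/du, so the 1/T factor should already be part of the pooling gradient:

```python
import numpy as np

# h_t = W @ x_t for each time column, u = mean_t(h_t), L = c . u
rng = np.random.default_rng(0)
D, H, T = 3, 4, 5
X = rng.normal(size=(D, T))   # columns are time frames
W = rng.normal(size=(H, D))
c = rng.normal(size=H)

# If the 1/T is part of the pooling gradient, then
# dL/dW = (1/T) * sum_t c x_t^T = c (mean_t x_t)^T
gW = np.outer(c, X.mean(axis=1))

# Finite-difference check of the same gradient.
eps = 1e-6
gW_num = np.zeros_like(W)
for i in range(H):
    for j in range(D):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        Lp = c @ (Wp @ X).mean(axis=1)
        Lm = c @ (Wm @ X).mean(axis=1)
        gW_num[i, j] = (Lp - Lm) / (2 * eps)

print(np.max(np.abs(gW - gW_num)))  # ~1e-10: the 1/T is already in there
```

If the framework's mean_dim() does the same, the weights already see the average (not the raw sum) of the per-column gradients, and my extra scale_gradient(1/NCols) would just shrink the gradient a second time; but maybe I am misreading how it backpropagates.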
The following is maybe a foolish question, but I do not know how the framework works internally:
Given that the weights contribute to each of the columns in the input, are they updated multiple times, with a different contribution for each column? And is that done in a thread-safe way?
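For what it is worth, my mental model (please correct me if it is wrong) is that the per-column contributions are summed into a single gradient for W during one backward pass, so there is no concurrent per-column update to make thread-safe:

```python
import numpy as np

# Sketch of how I imagine W's gradient is accumulated: the per-column
# contributions are added up inside one matrix product during a single
# backward pass, not by separate concurrent updates.
rng = np.random.default_rng(0)
H, D, T = 4, 3, 5
X = rng.normal(size=(D, T))       # input columns (time frames)
delta = rng.normal(size=(H, T))   # dL/d(W @ X), one column per frame

gW_per_column = sum(np.outer(delta[:, t], X[:, t]) for t in range(T))
gW_one_shot = delta @ X.T         # the same sum, done as one operation

print(np.allclose(gW_per_column, gW_one_shot))  # True
```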
Using Kaldi (there is a recipe for this) the network converges, so I do not know what I am doing wrong here.
Thanks in advance for any advice!