Stats pooling layer


Emanuele Dalmasso

Sep 5, 2018, 11:28:24 AM
to DyNet Users
Hi,
I am trying to use DyNet to train a speaker recognition embedding extractor (something along the lines of http://danielpovey.com/files/2017_interspeech_embeddings.pdf ).
I do not know how to implement the statistics pooling layer.

Is there something of this kind already available in DyNet? Or does anyone have any advice on how to do it?

Thank you very much!


Graham Neubig

Sep 5, 2018, 11:33:47 AM
to Emanuele Dalmasso, DyNet Users
I wasn't able to tell from a quick browse of the paper, but if it's
just pooling together by taking the elementwise max or something, then
the "max_dim()" function should be able to do this for you.

Graham

Emanuele Dalmasso

Sep 6, 2018, 9:47:56 AM
to DyNet Users
Thank you very much for your answer,

Yes, it should compute the average of the frames (and possibly also the standard deviation); what I am not sure about is how to handle the fact that the average has to be taken along the time axis.
So the first layers work at the frame level, the pooling layer should accumulate a block of frames and estimate their average/std, and the following layers are trained one block at a time.

An alternative approach I can think of would be to process the frames not one by one, but aggregated together as one big input to the first layers (somewhat like a CNN, but where the weight matrix of every block is the same), and then have the pooling layer accumulate along the block dimension (which is really the time axis) and forward-propagate the statistics of the full utterance.
Anyway, I do not know whether there is a more direct approach, or whether this one is correct.
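To make the pooling step concrete, what I have in mind is roughly the following sketch in the Python API (dimensions are invented and I am not completely sure about the exact signature of mean_dim(), so take it only as an illustration):

    import dynet as dy
    import numpy as np

    dy.renew_cg()
    # frame-level outputs: feat_dim x num_frames, one column per frame
    h = dy.inputTensor(np.random.randn(512, 200))

    # mean over the time axis (dimension 1 = columns)
    mu = dy.mean_dim(h, [1], False)                        # -> 512-dim vector

    # standard deviation over the same axis, via E[x^2] - E[x]^2
    second_moment = dy.mean_dim(dy.cmult(h, h), [1], False)
    sigma = dy.sqrt(second_moment - dy.cmult(mu, mu) + 1e-8)

    # the utterance-level representation is the concatenation [mean; std]
    stats = dy.concatenate([mu, sigma])                    # -> 1024-dim vector

(I believe newer DyNet versions also have a std_dim() expression that would compute the second part directly, but I have not checked.)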

Emanuele Dalmasso

Oct 2, 2018, 12:22:07 PM
to DyNet Users
OK, I have spent a lot of time on this.
I concatenate the input frames along the time axis as the columns of the input matrix (the rows are the features), and to that matrix I apply multiple layers of the form b + W*I with a ReLU activation function.
After these layers I apply mean_dim() along the time dimension to compute the average over time.
Then follow two layers whose input is the single column representing the full utterance, and a final softmax layer with a one-hot target over the N speakers.
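In code, the network looks roughly like this (a heavily simplified sketch: layer sizes and names are invented, and it is only meant to show the structure, not my actual script):

    import dynet as dy
    import numpy as np

    FEAT_DIM, HID, EMB, N_SPK = 24, 512, 512, 1000   # made-up sizes

    m = dy.ParameterCollection()
    pW1 = m.add_parameters((HID, FEAT_DIM)); pb1 = m.add_parameters(HID)
    pW2 = m.add_parameters((HID, HID));      pb2 = m.add_parameters(HID)
    pW3 = m.add_parameters((EMB, HID));      pb3 = m.add_parameters(EMB)
    pW4 = m.add_parameters((N_SPK, EMB));    pb4 = m.add_parameters(N_SPK)
    trainer = dy.SimpleSGDTrainer(m)

    def utterance_loss(frames, speaker_id):
        # frames: numpy array (FEAT_DIM x num_frames); speaker_id: int in [0, N_SPK)
        dy.renew_cg()
        W1, b1 = dy.parameter(pW1), dy.parameter(pb1)
        W2, b2 = dy.parameter(pW2), dy.parameter(pb2)
        W3, b3 = dy.parameter(pW3), dy.parameter(pb3)
        W4, b4 = dy.parameter(pW4), dy.parameter(pb4)

        x = dy.inputTensor(frames)                      # one column per frame
        # frame-level layers: b + W*I applied to all frames at once
        h = dy.rectify(dy.colwise_add(W1 * x, b1))
        h = dy.rectify(dy.colwise_add(W2 * h, b2))
        # pooling layer: mean over the time axis (dimension 1 = columns)
        pooled = dy.mean_dim(h, [1], False)
        # utterance-level layers on the single pooled column
        emb = dy.rectify(W3 * pooled + b3)
        scores = W4 * emb + b4
        return dy.pickneglogsoftmax(scores, speaker_id)

    # one training step on one utterance
    loss = utterance_loss(np.random.randn(FEAT_DIM, 300), 0)
    loss.value()
    loss.backward()
    trainer.update()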

The problem is that this does not converge: the loss stays around -log(1/N) (i.e., it is consistent with a random choice among the N outputs).

I am not sure how backpropagation works across the pooling layer (I mean the mean_dim() expression).
In particular, are the weights updated too strongly, given that they influence the product for every column of the input matrix that gets accumulated into the mean?
With this doubt in mind I also tried applying a scale_gradient(1/NCols) before the input to the pooling layer, but without any improvement.
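Concretely, what I tried is just something like this right before the pooling expression (inside the same function as in the sketch above, assuming I am reading the dimensions correctly):

    # scale the gradient flowing back through the frame-level activations
    # by 1/num_frames before it reaches the pooling layer
    num_frames = h.dim()[0][1]
    h = dy.scale_gradient(h, 1.0 / num_frames)
    pooled = dy.mean_dim(h, [1], False)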

The following may be a foolish question, but I do not know how the framework works internally:
given that the weights are updated for their contribution to each of the columns of the input, are they updated multiple times, with a different contribution for each column? Is this done in a thread-safe way?

Using Kaldi (there is a recipe for this) the network converges, so I do not know what I am doing wrong here.

Thanks in advance for any advice!

Emanuele Dalmasso

Oct 4, 2018, 10:40:25 AM
to DyNet Users
I do not know whether anyone is interested, but in the end I found the problem.
The main issue appears to be the ReLU units exploding. On Windows/CPU this produces NaNs during the backpropagation phase, while on Linux/GPU no error is signaled but the network simply never converges.

So using some form of normalization (such as SELU instead of ReLU) solves the issue.
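In DyNet terms this just means swapping the activation, assuming dy.selu() is available in your version, e.g.:

    # before (explodes during training):
    h = dy.rectify(dy.colwise_add(W1 * x, b1))
    # after (self-normalizing, converges for me):
    h = dy.selu(dy.colwise_add(W1 * x, b1))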

Many thanks to gneubig, who at least gave me an answer to my first question!

Graham Neubig

Oct 4, 2018, 12:54:15 PM
to Emanuele Dalmasso, DyNet Users
Hmm, that's bizarre... Thanks for pointing it out. Could you open an
issue on the github site about the relu problem so a record remains of
it?

Graham