How to do attention over LSTM sequences with masking?

Edward Banner

Jun 17, 2016, 7:11:18 PM
to Keras-users
I am interested in a relatively simple operation - computing an attention mask over the activations produced by an LSTM after an Embedding layer, which crucially uses mask_zero=True.

I can get it working without masking. However, I cannot get it working with masking because I am using Flatten and Reshape layers, which don't support masking, so I am kind of stuck! Can anyone suggest how I can accomplish this? Can I get away with not using masking somehow? Should I write a custom layer? If so, could someone provide an example?

This example is the closest to what I am trying to achieve that I could find, but its author avoids Flatten and Reshape by using a separate set of weights to compute each index of the attention mask (which is too expensive for me).

Here is my code sample:

from keras.layers import Input, Embedding, Dense, LSTM, merge, Activation, Permute, Reshape
from keras.layers import Convolution1D, MaxPooling1D, Flatten, TimeDistributed, RepeatVector
from keras.layers.convolutional import AveragePooling1D
from keras.models import Model

max_doclen = 12
word_dim, vocab_size = 5, 10

nb_class = 2

doc_input = Input(shape=(max_doclen,), dtype='int32')

# embed and LSTM the document; mask_zero=True makes the Embedding emit a mask for the padded timesteps
embedded = Embedding(output_dim=word_dim, input_dim=vocab_size, input_length=max_doclen, weights=None, mask_zero=True)(doc_input)
activations = LSTM(16, return_sequences=True)(embedded)

# attention
mask = TimeDistributed(Dense(1))(activations) # compute the attention mask
mask = Flatten()(mask) # flatten the mask to get it ready to be used by RepeatVector - DOES NOT SUPPORT MASKING!
mask = Activation('softmax')(mask)
mask = RepeatVector(16)(mask)
mask = Permute([2, 1])(mask)

# apply mask
activations = merge([activations, mask], mode='mul')
activations = AveragePooling1D(pool_length=max_doclen)(activations)
activations = Flatten()(activations)

probas = Dense(nb_class, activation='softmax')(activations)

# compile
model = Model(input=doc_input, output=probas)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()
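
For reference, here is how I read the tensor shapes through the attention branch (batch dimension omitted, derived from the layer arguments above):

# activations (LSTM):          (max_doclen, 16)   one 16-dim vector per timestep
# TimeDistributed(Dense(1)):   (max_doclen, 1)    one attention score per timestep
# Flatten:                     (max_doclen,)      so the softmax normalizes over timesteps
# Activation('softmax'):       (max_doclen,)      attention weights summing to 1
# RepeatVector(16):            (16, max_doclen)
# Permute([2, 1]):             (max_doclen, 16)   aligned with the LSTM activations
# merge mode='mul':            (max_doclen, 16)   each timestep scaled by its weight
# AveragePooling1D + Flatten:  (16,)              pooled document representation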

Thanks in advance!

Edward Banner

Jun 17, 2016, 7:14:05 PM
to Keras-users
Here's the code using code highlighting if it's easier to read. Sorry for not using it off the bat!

Edward Banner

Jun 21, 2016, 4:15:42 PM
to Keras-users
Apologies for bumping, but I'd really like to get an answer to this. Surely someone has found out how to do attention over vectors from an LSTM layer? And presumably you used masking?

Alex Rubinsteyn

Aug 1, 2016, 3:01:41 PM
to Keras-users
I'm also curious about this, hoping someone replies. 

Shweta Garg

Dec 12, 2016, 5:47:09 AM
to Keras-users
@Edward and @Alex, I am also stuck with exactly the same problem. Did you find any solution to it? It would be nice if you could share it.

Christos Baziotis

Jan 7, 2017, 11:33:56 AM
to Keras-users
Any news on this? I am also stuck on this...

Alex Rubinsteyn

Jan 16, 2017, 2:00:00 PM
to Keras-users
From what I've seen in various examples, it seems like everyone just discards masking and instead uses 0 as an input. Presumably the LSTM learns to just propagate its output through a variable number of 0s to the right of the actual input. 
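
A minimal sketch of that approach, assuming the standard pad_sequences helper; the vocabulary, lengths, and layer sizes below are illustrative, not from anyone's actual model:

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, LSTM

max_doclen = 12
docs = [[3, 7, 2], [5, 1, 8, 9, 4]]                         # variable-length index sequences
x = pad_sequences(docs, maxlen=max_doclen, padding='post')  # zeros to the right of the real input

inp = Input(shape=(max_doclen,), dtype='int32')
embedded = Embedding(output_dim=5, input_dim=10, input_length=max_doclen,
                     mask_zero=False)(inp)                  # no mask, so Flatten/RepeatVector work downstream
activations = LSTM(16, return_sequences=True)(embedded)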

Christos Baziotis

Jan 16, 2017, 4:07:15 PM
to Keras-users
Two things. 

1) First, don't take a weighted average. Do a weighted sum instead, so that the padded (zero) inputs have no effect. Like this:

from keras import backend as K
from keras.layers import Lambda, merge   # Dense, Flatten, Activation, RepeatVector, Permute as imported above

activations = LSTM(64, return_sequences=True)(words)

# one attention score per timestep, softmax-normalized over the sequence
activations_weights = Dense(1, activation='tanh')(activations)
activations_weights = Flatten()(activations_weights)
activations_weights = Activation('softmax')(activations_weights)
activations_weights = RepeatVector(64)(activations_weights)
activations_weights = Permute([2, 1])(activations_weights)

# scale each timestep by its weight and *sum* over time (no division by the padded length)
activations_weighted = merge([activations, activations_weights], mode='mul')
sent_representation = Lambda(lambda x: K.sum(x, axis=-2), output_shape=(64,))(activations_weighted)

probabilities = Dense(classes)(sent_representation)
probabilities = Activation('softmax')(probabilities)

2) Here is where I need some help. Most papers I have read use the hidden state (h_i) of each timestep, not the outputs for each timestep as in the posted code.
Does anybody know if Keras offers a way to get the hidden states?


Shweta Garg

Jan 22, 2017, 11:15:48 AM
to Keras-users
@Alex, thank you for your reply.

Basically you are proposing to pad the sequences with 0 and train with mask_zero=False. I already do this and it trains fine, but I am not getting very satisfactory results, and my suspicion is that the cause is the large amount of padding on short sentences. Do you have any workaround for this problem?


Christos Baziotis

Jan 22, 2017, 4:24:11 PM
to Keras-users
I made this Layer and it works with masking.
It is as simple as it gets.
If you are interested in the discussion and how I ended up with this, read the discussion here.
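
A minimal sketch of what such a masking-aware attention layer can look like, written against the Keras 1.x API used in this thread. The call() body matches the version quoted later in the thread; the class name, weight shapes, and the surrounding boilerplate are assumptions on my part, not necessarily the original gist:

from keras import backend as K
from keras.engine.topology import Layer

class Attention(Layer):
    def __init__(self, bias=True, **kwargs):
        self.supports_masking = True   # accept the mask produced by Embedding(mask_zero=True)
        self.bias = bias
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        # input_shape: (batch, timesteps, features)
        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer='glorot_uniform',
                                 name='{}_W'.format(self.name))
        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name))
        super(Attention, self).build(input_shape)

    def compute_mask(self, input, input_mask=None):
        # the output is a single vector per sample, so do not pass the mask on
        return None

    def call(self, x, mask=None):
        eij = K.dot(x, self.W)
        if self.bias:
            eij += self.b
        eij = K.tanh(eij)
        a = K.exp(eij)
        # zero out the padded timesteps *after* the exp, then re-normalize
        if mask is not None:
            a *= K.cast(mask, K.floatx())
        # epsilon guards against an all-zero sum early in training (NaNs)
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)   # one vector per sample: (batch, features)

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[-1])

With something like this in place, the original model becomes roughly Embedding(mask_zero=True) -> LSTM(return_sequences=True) -> Attention() -> Dense(nb_class, activation='softmax'), with no Flatten or RepeatVector in the masked part of the graph.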

If anyone has anything to add to the discussion please do!

Christos

Yasser Hifny

Jan 30, 2017, 5:26:18 PM
to Keras-users
Hi Christos,
Thanks for your contribution. I have a question: when you multiply the input by the mask, some values become zero. Then you take exp(): the elements that were zero now become exp(0) = 1 and contribute to the output. Hence, the masking is not working correctly, right?

Thanks,
Yasser

Christos Baziotis

Jan 31, 2017, 7:38:12 AM
to Keras-users
Yasser, you are correct!

I had fixed it but forgot to post an update.
You can see in the comments above that I had it right; maybe when copy-pasting I mixed things up.

Anyway, you can see the updated gists to get the fix.

Yasser Hifny

Jan 31, 2017, 9:40:51 AM
to Keras-users
I see you added epsilon to avoid NaNs, but I do not see a fix in your code (attached).
Can you please point out where you made the fix? The issue I mentioned is that some elements will not be zero after the exp() and will contribute to the result, and we do not want this behavior.
    def call(self, x, mask=None):
        eij = K.dot(x, self.W)

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        # apply mask after the exp; will be re-normalized next
        if mask is not None:
            # cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # in some cases, especially in the early stages of training, the sum may be almost zero
        # and this results in NaNs. A workaround is to add a very small positive number ε to the sum.
        # a /= K.cast(K.sum(a, axis=1, keepdims=True), K.floatx())
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

Thanks,
Yasser

Christos Baziotis

Jan 31, 2017, 11:32:38 AM
to Keras-users
The fix is that I apply the mask *after* I have calculated `a = K.exp(eij)`. Before, I applied the mask first and then did the softmax.

Run this locally to verify.


import numpy as np

# input mask: the last 3 positions are padded words/values
mask = np.array([True, True, True, False, False, False])

# eij = tanh(dot(x, W) + b); assume eij = [0.1, 0.4, 0.2, 0, 0, 0]
eij = np.array([0.1, 0.4, 0.2, 0, 0, 0])

# step 1: exponentiate - note that the padded positions become exp(0) = 1
a = np.exp(eij)
print(a)
# [ 1.10517092  1.4918247   1.22140276  1.          1.          1.        ]

# step 2: apply the mask and zero out the padded values *after* the exp(eij)
if mask is not None:
    a *= mask
print(a)
# [ 1.10517092  1.4918247   1.22140276  0.          0.          0.        ]

# step 3: normalize the values
a /= sum(a)
print(a)
# [ 0.28943311  0.39069383  0.31987306  0.          0.          0.        ]

assert np.isclose(sum(a), 1.0)


Yasser Hifny

Jan 31, 2017, 11:42:04 AM
to Keras-users
It is clear now. Thank you 

Thanks,
Yasser

Yasser Hifny

Feb 11, 2017, 6:40:26 PM
to Keras-users
@Christos Baziotis, your code is not available now; any reason for that?

Christos Baziotis

Feb 12, 2017, 5:23:32 AM
to Keras-users

pr...@cogent.co.jp

Jun 6, 2017, 6:31:04 AM
to Keras-users
Have a look at this:


I hope it will help you!

deanhope...@gmail.com

Aug 6, 2019, 9:56:11 AM
to Keras-users
Hi Shweta,

Did you ever resolve this? I am experiencing a similar problem.