How to do attention over LSTM sequences with masking?

Edward Banner

Jun 17, 2016, 7:11:18 PM
to Keras-users
I am interested in a relatively simple operation - computing an attention mask over the activations produced by an LSTM after an Embedding layer, which crucially uses mask_zero=True.

I can get it working without masking. However, I cannot get it working with masking because I am using Flatten and Reshape layers, which don't support masking, so I am kind of stuck! Can anyone suggest how I can accomplish this? Can I get away with not using masking somehow? Should I write a custom layer? If so, could someone provide an example?

This example is the closest to what I am trying to achieve that I could find, but its author avoids Flatten and Reshape by using a separate set of weights to compute each index of the attention mask (which is too expensive for me).

Here is my code sample:

from keras.layers import Input, Embedding, Dense, LSTM, merge, Activation, Permute, Reshape
from keras.layers import Convolution1D, MaxPooling1D, Flatten, TimeDistributed, RepeatVector
from keras.layers.convolutional import AveragePooling1D
from keras.models import Model

max_doclen = 12
word_dim, vocab_size = 5, 10

nb_class = 2

doc_input = Input(shape=(max_doclen,), dtype='int32')

# embed and LSTM the document; mask_zero=True makes the Embedding emit a mask for the padded timesteps
embedded = Embedding(output_dim=word_dim, input_dim=vocab_size, input_length=max_doclen, weights=None, mask_zero=True)(doc_input)
activations = LSTM(16, return_sequences=True)(embedded)

# attention
mask = TimeDistributed(Dense(1))(activations) # compute the attention mask
mask = Flatten()(mask) # flatten the mask to get it ready to be used by RepeatVector - DOES NOT SUPPORT MASKING!
mask = Activation('softmax')(mask)
mask = RepeatVector(16)(mask)
mask = Permute([2, 1])(mask)

# apply mask
activations = merge([activations, mask], mode='mul')
activations = AveragePooling1D(pool_length=max_doclen)(activations)
activations = Flatten()(activations)

probas = Dense(nb_class, activation='softmax')(activations)

# compile
model = Model(input=doc_input, output=probas)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()
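
For reference, here is how I read the tensor shapes through the attention branch (batch dimension omitted, derived from the layer arguments above):

# activations (LSTM):          (max_doclen, 16)   one 16-dim vector per timestep
# TimeDistributed(Dense(1)):   (max_doclen, 1)    one attention score per timestep
# Flatten:                     (max_doclen,)      so the softmax normalizes over timesteps
# Activation('softmax'):       (max_doclen,)      attention weights summing to 1
# RepeatVector(16):            (16, max_doclen)
# Permute([2, 1]):             (max_doclen, 16)   aligned with the LSTM activations
# merge mode='mul':            (max_doclen, 16)   each timestep scaled by its weight
# AveragePooling1D + Flatten:  (16,)              pooled document representation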

Thanks in advance!

Edward Banner

Jun 17, 2016, 7:14:05 PM
to Keras-users
Here's the code using code highlighting if it's easier to read. Sorry for not using it off the bat!

Edward Banner

Jun 21, 2016, 4:15:42 PM
to Keras-users
Apologies for bumping, but I'd really like to get an answer to this. Surely someone has found out how to do attention over vectors from an LSTM layer? And presumably you used masking?

Alex Rubinsteyn

Aug 1, 2016, 3:01:41 PM
to Keras-users
I'm also curious about this, hoping someone replies. 

Shweta Garg

Dec 12, 2016, 5:47:09 AM
to Keras-users
@Edward and @Alex, I am also stuck with exactly the same problem. Did you find any solution to it? It would be nice if you could share it.

Christos Baziotis

Jan 7, 2017, 11:33:56 AM
to Keras-users
Any news on this? I am also stuck on this...

Alex Rubinsteyn

Jan 16, 2017, 2:00:00 PM
to Keras-users
From what I've seen in various examples, it seems like everyone just discards masking and instead uses 0 as an input. Presumably the LSTM learns to just propagate its output through a variable number of 0s to the right of the actual input. 
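
A minimal sketch of that approach, assuming the standard pad_sequences helper; the vocabulary, lengths, and layer sizes below are illustrative, not from anyone's actual model:

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, LSTM

max_doclen = 12
docs = [[3, 7, 2], [5, 1, 8, 9, 4]]                         # variable-length index sequences
x = pad_sequences(docs, maxlen=max_doclen, padding='post')  # zeros to the right of the real input

inp = Input(shape=(max_doclen,), dtype='int32')
embedded = Embedding(output_dim=5, input_dim=10, input_length=max_doclen,
                     mask_zero=False)(inp)                  # no mask, so Flatten/RepeatVector work downstream
activations = LSTM(16, return_sequences=True)(embedded)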

Christos Baziotis

Jan 16, 2017, 4:07:15 PM
to Keras-users
Two things. 

1) First, don't take a weighted average. Do a weighted sum instead, so that the padded (zero) inputs have no effect. Like this:

from keras import backend as K
from keras.layers import Lambda, merge   # Dense, Flatten, Activation, RepeatVector, Permute as imported above

activations = LSTM(64, return_sequences=True)(words)

# one attention score per timestep, softmax-normalized over the sequence
activations_weights = Dense(1, activation='tanh')(activations)
activations_weights = Flatten()(activations_weights)
activations_weights = Activation('softmax')(activations_weights)
activations_weights = RepeatVector(64)(activations_weights)
activations_weights = Permute([2, 1])(activations_weights)

# scale each timestep by its weight and *sum* over time (no division by the padded length)
activations_weighted = merge([activations, activations_weights], mode='mul')
sent_representation = Lambda(lambda x: K.sum(x, axis=-2), output_shape=(64,))(activations_weighted)

probabilities = Dense(classes)(sent_representation)
probabilities = Activation('softmax')(probabilities)

2) Here is where I need some help. Most papers I have read use the hidden state (h_i) of each timestep, not the outputs for each timestep as in the posted code.
Does anybody know if Keras offers a way to get the hidden states?


Shweta Garg

Jan 22, 2017, 11:15:48 AM
to Keras-users
@Alex, thank you for your reply.

Basically you are proposing to pad the sequences with 0 and train with mask_zero=False. I already do this and it trains fine, but I am not getting very satisfactory results, and my suspicion is that the cause is the large amount of padding on short sentences. Do you have any workaround for this problem?


Christos Baziotis

Jan 22, 2017, 4:24:11 PM
to Keras-users
I made this Layer and it works with masking.
It is as simple as it gets.
If you are interested in the discussion and how I ended up with this, read the discussion here.
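
A minimal sketch of what such a masking-aware attention layer can look like, written against the Keras 1.x API used in this thread. The call() body matches the version quoted later in the thread; the class name, weight shapes, and the surrounding boilerplate are assumptions on my part, not necessarily the original gist:

from keras import backend as K
from keras.engine.topology import Layer

class Attention(Layer):
    def __init__(self, bias=True, **kwargs):
        self.supports_masking = True   # accept the mask produced by Embedding(mask_zero=True)
        self.bias = bias
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        # input_shape: (batch, timesteps, features)
        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer='glorot_uniform',
                                 name='{}_W'.format(self.name))
        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name))
        super(Attention, self).build(input_shape)

    def compute_mask(self, input, input_mask=None):
        # the output is a single vector per sample, so do not pass the mask on
        return None

    def call(self, x, mask=None):
        eij = K.dot(x, self.W)
        if self.bias:
            eij += self.b
        eij = K.tanh(eij)
        a = K.exp(eij)
        # zero out the padded timesteps *after* the exp, then re-normalize
        if mask is not None:
            a *= K.cast(mask, K.floatx())
        # epsilon guards against an all-zero sum early in training (NaNs)
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)   # one vector per sample: (batch, features)

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[-1])

With something like this in place, the original model becomes roughly Embedding(mask_zero=True) -> LSTM(return_sequences=True) -> Attention() -> Dense(nb_class, activation='softmax'), with no Flatten or RepeatVector in the masked part of the graph.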

If anyone has anything to add to the discussion please do!

Christos

Yasser Hifny

Jan 30, 2017, 5:26:18 PM
to Keras-users
Hi Christos,
Thanks for your contribution. I have a question: when you multiply the input by the mask, some values become zero. Then you take exp(): the elements that were zero now become exp(0) = 1 and contribute to the output. Hence, the masking is not working correctly, right?

Thanks,
Yasser

Christos Baziotis

Jan 31, 2017, 7:38:12 AM
to Keras-users
Yasser, you are correct!

I had fixed it but forgot to post an update.
You can see in the comments above that I had it right; maybe when copy-pasting I mixed things up.

Anyway, you can see the updated gists to get the fix.

Yasser Hifny

Jan 31, 2017, 9:40:51 AM
to Keras-users
I see you added epsilon to avoid NaNs, but I do not see a fix in your code (attached).
Can you please point out where you made the fix? The issue I mentioned is that some elements will not be zero after the exp() and will contribute to the result, and we do not want this behavior.
    def call(self, x, mask=None):
        eij = K.dot(x, self.W)

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        # apply mask after the exp; will be re-normalized next
        if mask is not None:
            # cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # in some cases, especially in the early stages of training, the sum may be almost zero
        # and this results in NaNs. A workaround is to add a very small positive number ε to the sum.
        # a /= K.cast(K.sum(a, axis=1, keepdims=True), K.floatx())
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

Thanks,
Yasser

Christos Baziotis

Jan 31, 2017, 11:32:38 AM
to Keras-users
The fix is that I apply the mask *after* I have calculated `a = K.exp(eij)`. Before, I applied the mask first and then did the softmax.

Run this locally to verify.


import numpy as np

# input mask: the last 3 positions are padded words/values
mask = np.array([True, True, True, False, False, False])

# eij = tanh(dot(x, W) + b); assume eij = [0.1, 0.4, 0.2, 0, 0, 0]
eij = np.array([0.1, 0.4, 0.2, 0, 0, 0])

# step 1: exponentiate - note that the padded positions become exp(0) = 1
a = np.exp(eij)
print(a)
# [ 1.10517092  1.4918247   1.22140276  1.          1.          1.        ]

# step 2: apply the mask and zero out the padded values *after* the exp(eij)
if mask is not None:
    a *= mask
print(a)
# [ 1.10517092  1.4918247   1.22140276  0.          0.          0.        ]

# step 3: normalize the values
a /= sum(a)
print(a)
# [ 0.28943311  0.39069383  0.31987306  0.          0.          0.        ]

assert np.isclose(sum(a), 1.0)


Yasser Hifny

Jan 31, 2017, 11:42:04 AM
to Keras-users
It is clear now. Thank you 

Thanks,
Yasser

Yasser Hifny

Feb 11, 2017, 6:40:26 PM
to Keras-users
@Christos Baziotis, your code is not available now; any reason for that?

Christos Baziotis

Feb 12, 2017, 5:23:32 AM
to Keras-users

pr...@cogent.co.jp

Jun 6, 2017, 6:31:04 AM
to Keras-users
Have a look at this:


I hope it will help you!

deanhope...@gmail.com

Aug 6, 2019, 9:56:11 AM
to Keras-users
Hi Shweta,

Did you ever resolve this? I am experiencing a similar problem.