Help on implementing “Hierarchical Attention Networks for Document Classification”


zhyuya...@googlemail.com

unread,
Nov 14, 2016, 3:36:03 AM11/14/16
to Keras-users
Dear all,

Has anyone read this paper accepted at NAACL'16: "Hierarchical Attention Networks for Document Classification"?

The idea is very interesting, and the results are impressive. Could anyone please give me some hints on how to implement it in Keras?

Best regards,
Zhenyu Yang

bkj...@gmail.com

unread,
Dec 22, 2016, 11:25:32 AM12/22/16
to Keras-users, zhyuya...@googlemail.com
I haven't implemented it, but I was going to look into it.  I agree that the results are very impressive, though I'm actually a little confused about where the gains are coming from.  They show that HN-ATT outperforms HN-AVE and HN-MAX, but actually the two simpler variants outperform all of the other benchmarks.   Correct me if I'm wrong, but HN-AVE just:

  a) splits the document into a set of sentences
  b) averages word embeddings for each sentence to get a set of sentence representations
  c) averages sentence representations to get a document representation

I think this is really just equivalent to the model from this paper


plus some reweighting of words based on the length of the sentences they're found in.  The linked paper actually evaluates on some of the same datasets as the HN and consistently underperforms HN-AVE by 2-3%.  So all of this would suggest that those gains come from the "hierarchical" part of "hierarchical attention networks", which I think is pretty interesting.  I'm going to try to implement HN-AVE in the next couple of weeks, and will report back here if I find anything interesting.

~ Ben

bkj...@gmail.com

unread,
Dec 22, 2016, 11:36:35 AM12/22/16
to Keras-users, zhyuya...@googlemail.com, bkj...@gmail.com
Oh actually, apologies, it's been a while since I read the whole paper: they don't average the word or sentence embeddings, they encode them with a GRU.  So it's actually a little more involved than fastText + reweighting, but I'll try implementing both variants.

bkj...@gmail.com

unread,
Dec 27, 2016, 7:25:42 PM12/27/16
to Keras-users, zhyuya...@googlemail.com, bkj...@gmail.com

I think that the HN-MAX is (roughly)

from keras.models import Model
from keras.layers import Input, Dense, Embedding, GRU, Bidirectional, TimeDistributed, GlobalMaxPooling1D
from keras.optimizers import SGD

max_sents = ...     # maximum number of sentences per document
max_words = ...     # maximum number of words per sentence
max_features = ...  # vocabulary size

x = Input(shape=(max_sents, max_words,))

emb_words = TimeDistributed(Embedding(input_dim=max_features, output_dim=200, mask_zero=True))(x)

emb_sents = TimeDistributed(Bidirectional(GRU(50, return_sequences=True)))(emb_words)
emb_sents = TimeDistributed(GlobalMaxPooling1D())(emb_sents)

emb_docs = Bidirectional(GRU(50, return_sequences=True))(emb_sents)
emb_docs = GlobalMaxPooling1D()(emb_docs)

prediction = Dense(y_train.shape[1], activation='softmax')(emb_docs)
model = Model(input=x, output=prediction)
model.compile(loss='categorical_crossentropy', optimizer=SGD(momentum=0.9), metrics=['accuracy'])

Turning this into the HN-AVG variant is fairly straightforward, and for HN-ATT you'd have to write a little attention unit, but that shouldn't be particularly difficult.
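For reference, turning the block above into HN-AVG should just be a matter of swapping the pooling layers, something like:

from keras.layers import GlobalAveragePooling1D

# HN-AVG: identical structure, but average-pool instead of max-pool
emb_sents = TimeDistributed(Bidirectional(GRU(50, return_sequences=True)))(emb_words)
emb_sents = TimeDistributed(GlobalAveragePooling1D())(emb_sents)

emb_docs = Bidirectional(GRU(50, return_sequences=True))(emb_sents)
emb_docs = GlobalAveragePooling1D()(emb_docs)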


Using the Yelp 2013 data from http://ir.hit.edu.cn/~dytang/, I haven't been able to reproduce the results in the paper (I'm getting within 1-2% with HN-MAX and HN-AVG).  I'm not using pretrained word vectors, so perhaps that's the reason.  Would love for someone to be able to modify this code to reproduce the results more exactly, as I'm not 100% sure where I'm going wrong.


~ Ben

bkj...@gmail.com

unread,
Dec 27, 2016, 7:26:57 PM12/27/16
to Keras-users, zhyuya...@googlemail.com, bkj...@gmail.com
(And also, the code as posted above uses the SGD optimizer from the paper -- I've been using the `rmsprop` optimizer in my experiments, since in the past I've had better luck using it with RNNs.)
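That's just a change to the compile call, e.g.:

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])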

Yasser Hifny

unread,
Dec 28, 2016, 8:09:15 PM12/28/16
to Keras-users, zhyuya...@googlemail.com, bkj...@gmail.com
Hi,

when testing your code:

from keras.models import Sequential, Model
from keras.layers import Input, Dense, TimeDistributed
from keras.layers import GRU,GlobalMaxPooling1D,Bidirectional, Embedding, LSTM
from keras.optimizers import SGD

max_sents = 100
max_words = 50



x = Input(shape=(max_sents, max_words,))

emb_words = TimeDistributed(Embedding(input_dim=1000, output_dim=200, mask_zero=True))(x)

emb_sents = TimeDistributed(Bidirectional(GRU(50, return_sequences=True)))(emb_words)
emb_sents = TimeDistributed(GlobalMaxPooling1D())(emb_sents)

emb_docs = Bidirectional(GRU(50, return_sequences=True))(emb_sents)
emb_docs = GlobalMaxPooling1D()(emb_docs)

prediction = Dense(44, activation='softmax')(emb_docs)
model = Model(input=x, output=prediction)
model.compile(loss='categorical_crossentropy', optimizer=SGD(momentum=0.9), metrics=['accuracy'])
print model.summary()


I got this error:

$ python han.py
Using Theano backend.
/usr/lib/python2.7/site-packages/keras/engine/topology.py:368: UserWarning: The `regularizers` property of layers/models is deprecated. Regularization losses are now managed via the `losses` layer/model property.
  warnings.warn('The `regularizers` property of '
Traceback (most recent call last):
  File "han.py", line 15, in <module>
    emb_sents = TimeDistributed(Bidirectional(GRU(50, return_sequences=True)))(emb_words)
  File "/usr/lib/python2.7/site-packages/keras/engine/topology.py", line 569, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/usr/lib/python2.7/site-packages/keras/engine/topology.py", line 632, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/usr/lib/python2.7/site-packages/keras/engine/topology.py", line 164, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/usr/lib/python2.7/site-packages/keras/layers/wrappers.py", line 129, in call
    y = self.layer.call(X)  # (nb_samples * timesteps, ...)
  File "/usr/lib/python2.7/site-packages/keras/layers/wrappers.py", line 203, in call
    Y = self.forward_layer.call(X, mask)
  File "/usr/lib/python2.7/site-packages/keras/layers/recurrent.py", line 201, in call
    input_shape = K.int_shape(x)
  File "/usr/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 128, in int_shape
    raise Exception('Not a Keras tensor:', x)
Exception: ('Not a Keras tensor:', Reshape{3}.0)


How can I solve this error?

Thanks
Yasser




Ben Johnson

unread,
Dec 28, 2016, 8:16:18 PM12/28/16
to Keras-users, Yasser Hifny, zhyuya...@googlemail.com
I believe you have to downgrade from 1.2.0 to 1.1.1 

There was a bug introduced recently that messes up the TimeDistributed layers.
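(To check which version you're on -- and `pip install keras==1.1.1` will pin the older one:)

import keras
print(keras.__version__)  # 1.2.0 hits the TimeDistributed issue for me; 1.1.1 works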

~ Ben

Richard Liao

unread,
Dec 29, 2016, 11:04:50 AM12/29/16
to Keras-users, yhi...@gmail.com, zhyuya...@googlemail.com, bkj...@gmail.com
Hi all,

I have implemented the paper using Keras. Here is my github repository https://github.com/richliao/textClassifier and I have written a blog about the implementation: https://richliao.github.io/supervised/classification/2016/12/26/textclassifier-HATN/.

Feedback is very welcome. 

Richard

Ben Johnson

unread,
Dec 29, 2016, 11:08:56 AM12/29/16
to Richard Liao, Keras-users, Yasser Hifny, zhyuya...@googlemail.com
Ah wonderful.  Would you be willing to benchmark your method against the datasets that they used in the paper?  I'd really like to be able to reproduce the (very good) results that they report.

All of the Tang datasets can be found at: http://ir.hit.edu.cn/~dytang/paper/emnlp2015/emnlp-2015-data.7z (they're about 1G total)

(I've primarily been looking at the "Yelp 2013" dataset.)

Yasser Hifny

unread,
Dec 29, 2016, 11:31:02 AM12/29/16
to Keras-users, ric...@gmail.com, yhi...@gmail.com, zhyuya...@googlemail.com, bkj...@gmail.com
Thanks a lot. Could you please confirm whether this attention layer implementation is correct:

from keras.engine import Layer
from keras import initializations
from keras import backend as K


class Attention(Layer):
    '''Attention operation for temporal data.
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    '''
    def __init__(self, attention_dim, **kwargs):
        self.init = initializations.get('glorot_uniform')
        self.attention_dim = attention_dim
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.init((self.attention_dim, self.attention_dim),
                           name='{}_W'.format(self.name))
        self.b = K.zeros((self.attention_dim,), name='{}_b'.format(self.name))
        self.u = K.zeros((self.attention_dim,), name='{}_u'.format(self.name))
        self.trainable_weights += [self.W, self.b, self.u]

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[2])

    def call(self, x, mask=None):
        a = K.tanh(K.dot(x, self.W) + self.b)
        alpha = K.exp(K.dot(a, self.u))
        alpha = alpha / K.sum(alpha)
        return x * K.tile(alpha, (self.attention_dim, 1))


Thanks,
Yasser

bkj...@gmail.com

unread,
Dec 29, 2016, 11:39:35 AM12/29/16
to Keras-users, ric...@gmail.com, yhi...@gmail.com, zhyuya...@googlemail.com, bkj...@gmail.com
I've always wondered whether someone somewhere has written an attention layer for Keras w/ a nice API.  In the meantime, I've personally favored building it out of other Keras layers rather than implementing a layer myself.  Something like
```
lstm = LSTM(rnn_dim, ..., return_sequences=True)

att = TimeDistributed(Dense(rnn_dim, activation='tanh'))(lstm)
att = TimeDistributed(Dense(1, bias=False))(att)
att = Reshape((rnn_dim,))(att)
att = Activation('softmax')(att)

lstm = Merge(mode='dot')([lstm, att])
```

Not sure if that would actually run as written -- I'm away from my machine, but that's the basic approach I've taken in the past.  Let me know if you think there's an error though.

Richard Liao

unread,
Dec 29, 2016, 11:48:27 AM12/29/16
to Keras-users, ric...@gmail.com, yhi...@gmail.com, zhyuya...@googlemail.com, bkj...@gmail.com
Will do...

Richard Liao

unread,
Dec 30, 2016, 1:40:32 PM12/30/16
to Keras-users, ric...@gmail.com, yhi...@gmail.com, zhyuya...@googlemail.com, bkj...@gmail.com
Looks fine, if x * K.tile(alpha, (self.attention_dim, 1)) really computes the dot product over the last dimension. I don't know how to use K.tile; I will look into it.

I used TimeDistributed(Dense()) to do the hidden dense layer operation. In your code you put everything in the custom layer, which I don't see any problem with. Did you get it running?


On Thursday, December 29, 2016 at 11:31:02 AM UTC-5, Yasser Hifny wrote:

Yasser Hifny

unread,
Dec 30, 2016, 3:41:17 PM12/30/16
to Keras-users, ric...@gmail.com, yhi...@gmail.com, zhyuya...@googlemail.com, bkj...@gmail.com
Thanks Richard for your feedback. I managed to compile the code and get the model summary so far (which may mean that the "call" function is not actually exercised yet).
I need to run an experiment to ensure it works. I will keep you updated.

Thanks,
Yasser

Yasser Hifny

unread,
Jan 1, 2017, 8:43:33 AM1/1/17
to Keras-users, ric...@gmail.com, yhi...@gmail.com, zhyuya...@googlemail.com, bkj...@gmail.com
Hi Ben and Richard,

I started from Ben's and Richard's code and added my attention layer and its testing code:

import numpy as np
import pandas as pd
import cPickle
from collections import defaultdict
import re

from bs4 import BeautifulSoup

import sys
import os

os.environ['KERAS_BACKEND']='theano'

from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

from keras.layers import Embedding
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding, Merge, Dropout, LSTM, GRU, Bidirectional, TimeDistributed, GlobalMaxPooling1D,GlobalAveragePooling1D
from keras.models import Model

from keras import backend as K
from keras.engine.topology import Layer, InputSpec
from keras import initializations

MAX_SENT_LENGTH = 100
MAX_SENTS = 15
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

def clean_str(string):
    """
    Tokenization/string cleaning for dataset
    Every dataset is lower cased.
    """
    string = re.sub(r"\\", "", string)    
    string = re.sub(r"\'", "", string)    
    string = re.sub(r"\"", "", string)    
    return string.strip().lower()

data_train = pd.read_csv('labeledTrainData.tsv', sep='\t')
print data_train.shape

import nltk
from nltk import tokenize
#nltk.download('punkt')

reviews = []
labels = []
texts = []

for idx in range(data_train.review.shape[0]):
    text = BeautifulSoup(data_train.review[idx])
    text = clean_str(text.get_text().encode('ascii','ignore'))
    texts.append(text)
    sentences = tokenize.sent_tokenize(text)
    reviews.append(sentences)
    
    labels.append(data_train.sentiment[idx])

tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)

data = np.zeros((len(texts), MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')

for i, sentences in enumerate(reviews):
    for j, sent in enumerate(sentences):
        if j< MAX_SENTS:
            wordTokens = text_to_word_sequence(sent)
            for k, word in enumerate(wordTokens):
                if k<MAX_SENT_LENGTH:
                    data[i,j,k] = tokenizer.word_index[word]
                    
word_index = tokenizer.word_index
print('Total %s unique tokens.' % len(word_index))

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]

print('Number of positive and negative reviews in training and validation set')
print y_train.sum(axis=0)
print y_val.sum(axis=0)

GLOVE_DIR = ""
embeddings_index = {}
f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Total %s word vectors.' % len(embeddings_index))

"""

embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
        
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SENT_LENGTH,
                            trainable=True)

sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(LSTM(100))(embedded_sequences)
sentEncoder = Model(sentence_input, l_lstm)

review_input = Input(shape=(MAX_SENTS,MAX_SENT_LENGTH), dtype='int32')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(LSTM(100))(review_encoder)
preds = Dense(2, activation='softmax')(l_lstm_sent)
model = Model(review_input, preds)

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

print("model fitting - Hierachical LSTM")
print model.summary()
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          nb_epoch=10, batch_size=50)
"""

# building Hierachical Attention network
embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
        
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SENT_LENGTH,
                            trainable=True,
                            mask_zero=True)

class AttLayer(Layer):
    def __init__(self, **kwargs):
        self.init = initializations.get('normal')
        #self.input_spec = [InputSpec(ndim=3)]
        super(AttLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape)==3
        #self.W = self.init((input_shape[-1],1))
        self.W = self.init((input_shape[-1],))
        #self.input_spec = [InputSpec(shape=input_shape)]
        self.trainable_weights = [self.W]
        super(AttLayer, self).build(input_shape)  # be sure you call this somewhere!

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W))
        
        ai = K.exp(eij)
        weights = ai/K.sum(ai, axis=1).dimshuffle(0,'x')
        
        weighted_input = x*weights.dimshuffle(0,1,'x')
        return weighted_input.sum(axis=1)

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[-1])
"""
sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(GRU(100, return_sequences=True))(embedded_sequences)
l_dense = TimeDistributed(Dense(200))(l_lstm)
l_att = AttLayer()(l_dense)
sentEncoder = Model(sentence_input, l_att)

review_input = Input(shape=(MAX_SENTS,MAX_SENT_LENGTH), dtype='int32')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(GRU(100, return_sequences=True))(review_encoder)
l_dense_sent = TimeDistributed(Dense(200))(l_lstm_sent)
l_att_sent = AttLayer()(l_dense_sent)
preds = Dense(2, activation='softmax')(l_att_sent)
model = Model(review_input, preds)
"""

from keras.engine import Layer
from keras import initializations
from keras import backend as K

class Attention(Layer):
    '''Attention operation for temporal data.
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    '''
    def __init__(self, attention_dim, **kwargs):
        self.init = initializations.get('glorot_uniform')
        self.attention_dim = attention_dim
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.init((self.attention_dim, self.attention_dim),
                           name='{}_W'.format(self.name))
        self.b = K.zeros((self.attention_dim,), name='{}_b'.format(self.name))
        self.u = K.zeros((self.attention_dim,), name='{}_u'.format(self.name))
        self.trainable_weights += [self.W, self.b, self.u]

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[2])

    def call(self, x, mask=None):
        a = K.tanh(K.dot(x, self.W) + self.b)
        alpha = K.exp(K.dot(a, self.u))
        alpha = alpha / K.sum(alpha)
        return x * K.tile(alpha, (self.attention_dim, 1))


x = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH,))

emb_words = TimeDistributed(embedding_layer)(x)

emb_sents = TimeDistributed(Bidirectional(GRU(100, return_sequences=True)))(emb_words)
emb_sents = TimeDistributed( Attention(200))(emb_sents)

emb_docs = Bidirectional(GRU(100, return_sequences=True))(emb_sents)
emb_docs = Attention(200)(emb_docs)

prediction = Dense(2, activation='softmax')(emb_docs)
model = Model(input=x, output=prediction)


model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

print("model fitting - Hierachical Attention networks")
print model.summary()
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          nb_epoch=10 , batch_size=50)

and I got this error

Traceback (most recent call last):
  File "textClassifierHATT_yasser.py", line 251, in <module>
    prediction = Dense(2, activation='softmax')(emb_docs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 470, in __call__
    self.assert_input_compatibility(x)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 411, in assert_input_compatibility
    str(K.ndim(x)))
Exception: Input 0 is incompatible with layer dense_1: expected ndim=2, found ndim=3


How can I fix this error?

Thanks,
Yasser


Yasser Hifny

unread,
Jan 1, 2017, 11:54:34 AM1/1/17
to Keras-users, ric...@gmail.com, yhi...@gmail.com, zhyuya...@googlemail.com, bkj...@gmail.com
Hi,

This is working code and I have tested it:
class Attention(Layer):
    '''Attention operation for temporal data.
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    '''
    def __init__(self, attention_dim, **kwargs):
        self.init = initializations.get('glorot_uniform')
        self.attention_dim = attention_dim
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.init((self.attention_dim, self.attention_dim),
                           name='{}_W'.format(self.name))
        self.b = K.zeros((self.attention_dim,), name='{}_b'.format(self.name))
        self.u = K.zeros((self.attention_dim,), name='{}_u'.format(self.name))
        self.trainable_weights += [self.W, self.b, self.u]

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[2])

    def call(self, x, mask=None):
        a = K.tanh(K.dot(x, self.W) + self.b)
        ai = K.exp(K.dot(a, self.u))
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)


x = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH,))

emb_words = TimeDistributed(embedding_layer)(x)

emb_sents = TimeDistributed(Bidirectional(GRU(100, return_sequences=True)))(emb_words)
emb_sents = TimeDistributed( Attention(200))(emb_sents)

emb_docs = Bidirectional(GRU(100, return_sequences=True))(emb_sents)
emb_docs = Attention(200)(emb_docs)

prediction = Dense(2, activation='softmax')(emb_docs)
model = Model(input=x, output=prediction)


and I got this result on your setup:


Epoch 4/10
20000/20000 [==============================] - 722s - loss: 0.1397 - acc: 0.9478 - val_loss: 0.2481 - val_acc: 0.9006

It turns out that the tile idea is not correct, and I used Richard's logic to compute the last two steps in the "call" function.

Thanks,
Yasser

alex.trem...@gmail.com

unread,
Jan 6, 2017, 7:23:16 PM1/6/17
to Keras-users, yhi...@gmail.com, zhyuya...@googlemail.com, bkj...@gmail.com
Can you elaborate, please? Is there a bug report you can reference? 
Thanks,
Alex

Alexander Measure

unread,
Jan 10, 2017, 6:03:28 PM1/10/17
to Keras-users, ric...@gmail.com, yhi...@gmail.com, zhyuya...@googlemail.com, bkj...@gmail.com
This doesn't work when using the TensorFlow backend because K.dot(a, self.u) is the dot product of a 3D tensor and a 1D tensor, which is not currently supported by the Keras backend. Anyone have any ideas for how to fix that? It looks like the backend might support it if we broadcast u into a 3D tensor first, but I haven't figured out how to do that yet.

Richard Liao

unread,
Jan 10, 2017, 8:27:06 PM1/10/17
to Alexander Measure, Keras-users, yhi...@gmail.com, zhyuya...@googlemail.com, Ben Johnson
I think you just need to reshape u into 2D (shape [u_dim, 1]) so that you can use tf.matmul(3D, 2D), which behaves like a dot product. tf.matmul will return a 3D result, but we can just reshape it back to 2D. 
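Something like this, I think (untested sketch using the generic backend calls; shapes follow the earlier posts):

# inside call(): a is [n_samples, n_steps, attention_dim], self.u is [attention_dim]
u_col = K.reshape(self.u, (-1, 1))        # [attention_dim, 1]
scores = K.dot(a, u_col)                  # [n_samples, n_steps, 1] -- a 3D x 2D dot
scores = K.squeeze(scores, axis=-1)       # back to [n_samples, n_steps]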

Alexander Measure

unread,
Jan 12, 2017, 7:20:19 PM1/12/17
to Keras-users, amea...@gmail.com, yhi...@gmail.com, zhyuya...@googlemail.com, bkj...@gmail.com
Thanks for the help. My backend-neutral rewrite is below. I'm not convinced it's working entirely correctly, I still need to run it on a reference data set, but I found what appear to be bugs in the other implementations mentioned so far.

Regarding bkj's suggestion to just use existing Keras layers, it works until you get to the merge dot product layer. Keras doesn't like the dimensions of the 2 inputs (the attention layer, which is [n_hidden], and the LSTM output which is [n_samples, n_steps, n_hidden]) and no amount of repeating or reshaping seemed to get it to do the dot product I was looking for. Maybe a multiply will work, still have to experiment with that more.

Regarding Richard's implementation, it looks like you're basically applying 2 sets of weights to the input of the dense layer before passing it through the tanh, the first one being the TimeDistributedDense and the second one being eij = K.tanh(K.dot(x, self.W)). The paper indicates that K.dot(x, self.W) is essentially the first dense layer of the attention network.

Regarding Yasser's implementation, you are initializing the self.u tensor to zeros, but this tensor is basically the weights for the attention neural network. No gradient will propagate through a layer with 0 weights so they will remain 0 and everything will be given an equal weight. As a result everything gets the same attention and you're basically doing average pooling. If you fix this, I think we basically have the same implementation (no coincidence since I used yours as a starting point). 

Please let me know if you spot any errors in my implementation below! I'll try to put it on Gist eventually but right now that appears to be down.

from keras.engine.topology import Layer
from keras import initializations
from keras import backend as K


class Attention(Layer):
    '''Attention operation for temporal data.
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    '''
    def __init__(self, attention_dim, **kwargs):
        self.init = initializations.get('glorot_uniform')
        self.attention_dim = attention_dim
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.init((self.attention_dim, self.attention_dim),
                           name='{}_W'.format(self.name))
        self.b = K.zeros((self.attention_dim,), name='{}_b'.format(self.name))
        self.u = self.init((self.attention_dim,), name='{}_u'.format(self.name))
        self.trainable_weights += [self.W, self.b, self.u]
        self.built = True

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[2])

    def call(self, x, mask=None):
        # Calculate the first hidden activations
        a1 = K.tanh(K.dot(x, self.W) + self.b)  # [n_samples, n_steps, n_hidden]
        # K.dot won't let us dot a 3D with a 1D so we do it with mult + sum
        mul_a1_u = a1 * self.u                  # [n_samples, n_steps, n_hidden]
        dot_a1_u = K.sum(mul_a1_u, axis=2)      # [n_samples, n_steps]
        # Calculate the per step attention weights
        a2_num = K.exp(dot_a1_u)                # [n_samples, n_steps]
        a2_den = K.sum(a2_num, axis=1)          # [n_samples]
        a2_den = K.expand_dims(a2_den)          # [n_samples, 1] so div broadcasts
        a2 = a2_num / a2_den                    # [n_samples, n_steps]
        a2 = K.expand_dims(a2)                  # [n_samples, n_steps, 1] so mult broadcasts
        # Apply attention weights to steps
        weighted_input = x * a2                 # [n_samples, n_steps, n_features]
        # Sum across the weighted steps to get the pooled activations
        return K.sum(weighted_input, axis=1)


Yasser Hifny

unread,
Jan 14, 2017, 2:22:44 PM1/14/17
to Keras-users, amea...@gmail.com, yhi...@gmail.com, zhyuya...@googlemail.com, bkj...@gmail.com
Thanks Alexander for your implementation. We need to pay attention to masking, as we do not support it.
I was thinking of testing bkj's suggestion since it may support masking, but according to your comment it does not work.

Thanks,
Yasser

Christos Baziotis

unread,
Jan 19, 2017, 12:28:00 PM1/19/17
to Keras-users, zhyuya...@googlemail.com
I have a question regarding the dimensions of self.W. Why is W 2D, as in Alexander Measure's layer, and not 1D, as in Yasser Hifny's post below?
class AttLayer(Layer):
    def __init__(self, **kwargs):
        self.init = initializations.get('normal')
        #self.input_spec = [InputSpec(ndim=3)]
        super(AttLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3
        #self.W = self.init((input_shape[-1], 1))
        self.W = self.init((input_shape[-1],))
        #self.input_spec = [InputSpec(shape=input_shape)]
        self.trainable_weights = [self.W]
        super(AttLayer, self).build(input_shape)  # be sure you call this somewhere!

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W))
        ai = K.exp(eij)
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[-1])

My understanding was that the second way is what the paper describes (plus the missing bias weight):

u_it = tanh(W_w h_it + b_w)

Surely I am missing something. Could someone please explain?

Alexander Measure

unread,
Jan 19, 2017, 3:51:41 PM1/19/17
to Keras-users, zhyuya...@googlemail.com
You're referring to an older version of Yasser's layer which does not work. His version that does work has a 2D self.W. 

Ww doesn't have to be 2 dimensional, but capital letters are typically used to indicate matrices which are 2D. Also, the paper says that uit is equivalent to the hidden activation of a one layer MLP, which would be a vector (one activation per neuron). (Ww)(hit) only produces a vector of activations if Ww is 2 dimensional. It does seem like overkill to give the hidden layer the same dimensions as the input (as I do in my layer), but I haven't experimented with modifying that.
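A quick shape check in numpy makes this concrete (h = RNN output dim and d = attention dim are just made-up numbers here; in my layer above I happen to use d equal to the input dim):

import numpy as np

h, d, n_steps = 200, 100, 30
h_it = np.random.rand(n_steps, h)       # GRU outputs, one row per word
W_w  = np.random.rand(h, d)             # 2D: projects each h_it to d hidden activations
b_w  = np.zeros(d)
u_w  = np.random.rand(d)                # single shared context vector
u_it = np.tanh(h_it.dot(W_w) + b_w)     # [n_steps, d] -- a vector of activations per word
scores = u_it.dot(u_w)                  # [n_steps]    -- one scalar attention score per word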

Alex

Christos Baziotis

unread,
Jan 19, 2017, 5:15:43 PM1/19/17
to Keras-users, zhyuya...@googlemail.com
Thanks Alexander!
I was struggling to understand why W had to be a square matrix h x h (h = dim of the RNN activations) and where that was written in the paper.
In the paper they do not define the dimensions of W.

Any ideas on how to add masking support? I tried a simple:
class Attention(Layer):
    def __init__(self, **kwargs):
        self.supports_masking = True

But the Dense layer I have next throws an error:

    probabilities = Dense(classes)(sentence)
    probabilities = Activation('softmax')(probabilities)
    model = Model(input=_input, output=probabilities)
    model.compile(optimizer=Adam(clipnorm=5., lr=0.001), loss='categorical_crossentropy')

ValueError: Layer dense_1 does not support masking, but was passed an input_mask: Elemwise{neq,no_inplace}.0

karu...@gmail.com

unread,
Jan 20, 2017, 4:51:39 AM1/20/17
to Keras-users, zhyuya...@googlemail.com
Is there a reason for not using masks in your implementation?

Christos Baziotis

unread,
Jan 20, 2017, 5:50:46 AM1/20/17
to Keras-users, zhyuya...@googlemail.com, karu...@gmail.com
I need you to tell me what you think regarding masking. I have this attention layer (I don't use the context vector u as in the paper, but this is irrelevant). I made 2 adjustments:

class Attention(Layer):
    def __init__(self, **kwargs):
        self.supports_masking = True  # this
        self.init = initializations.get('glorot_uniform')
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.W = self.init((input_shape[-1],), name='{}_W'.format(self.name))
        self.b = K.ones((input_shape[1],), name='{}_b'.format(self.name))
        self.trainable_weights = [self.W, self.b]
        super(Attention, self).build(input_shape)

    def compute_mask(self, input, input_mask=None):  # and this
        return None

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W) + self.b)
        ai = K.exp(eij)
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

    def get_output_shape_for(self, input_shape):
        return input_shape[0], input_shape[-1]


Now it works with masking. My thinking goes like this: since in call() we collapse the time dimension, we don't have to pass the mask to the next layer, since it makes no sense to do that.
I tried it and it works, but I don't know if this is correct and what the consequences are.

I would like to know what you think.

Pedro Cardoso

unread,
Jan 20, 2017, 6:38:24 AM1/20/17
to Christos Baziotis, Keras-users, zhyuya...@googlemail.com
Hi.

Indeed, you have no need to pass on the mask in your implementation, so that is good. But you could have used the mask in the calculation:


def call(self, x, mask=None):
    eij = K.tanh(K.dot(x, self.W) + self.b)
    ai = K.exp(eij)

    if mask is not None:
        ai = mask * ai
    ...

For this to work, you need to enable the mask in the Embedding layer, and add a Masking layer for the sentences.
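Something like this is what I mean (a rough sketch, with variable names borrowed from the earlier scripts, and not verified end to end):

from keras.layers import Masking

# word-level mask: padded word indices (0) are masked inside each sentence
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM,
                            input_length=MAX_SENT_LENGTH, mask_zero=True)

# sentence-level mask: an all-zero row (a padded sentence) gets masked out
review_input = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
masked_reviews = Masking(mask_value=0.)(review_input)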

Also, doesn't not having the context vector make this just a simple softmax?

Christos Baziotis

unread,
Jan 20, 2017, 8:00:36 AM1/20/17
to Keras-users, christos...@gmail.com, zhyuya...@googlemail.com, karu...@gmail.com
Thanks, I missed adding the mask calculation in call(). Regarding the context vector, it has to do with different kinds of attention; in many papers they do not use a context vector.

This is how it looks with the context vector:
    def call(self, x, mask=None):
        a = K.tanh(K.dot(x, self.W) + self.b)
        ai = K.exp(K.dot(a, self.u))

        if mask is not None:
            ai = mask * ai

        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

And this is without a context vector.
    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W))
        if self.bias:
            eij += self.b

        ai = K.exp(eij)

        if mask is not None:
            ai = mask * ai

        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

I have set mask_zero=True in my Embedding layer. Isn't that enough? Do I also have to add a Masking layer?

Christos Baziotis

unread,
Jan 20, 2017, 8:26:02 AM1/20/17
to Keras-users, christos...@gmail.com, zhyuya...@googlemail.com, karu...@gmail.com
Pedro, I tried this as you posted:
    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W) + self.b)

        ai = K.exp(eij)
        if mask is not None:
            ai = mask * ai



and I get loss: nan during training...

Pedro Cardoso

unread,
Jan 20, 2017, 9:23:30 AM1/20/17
to Christos Baziotis, Keras-users, zhyuya...@googlemail.com
Yes, I am getting the same with my own implementation.

The issue seems to be with sentences only containing 0. The sentence embedding is nan. BTW, this works well if you see a document as a single sentence, and do not do the hierarchical part.

That is why I was asking if removing the mask usage was studied or not :)

Trying to see how to multiply the mask with the layer "review_encoder"

Christos Baziotis

unread,
Jan 20, 2017, 9:51:27 AM1/20/17
to Keras-users, zhyuya...@googlemail.com
I tested this in a simple scenario. Simple sentence (sentiment) classification. No hierarchy there.

So if I enable masking in the embedding layer, what happens is that we will end up feeding some zero words to the RNN (LSTM/GRU).
So this means that some of the h_i's (the hidden states) will be zero.
And this finally means that some of the time dimensions (timesteps) in x in call() will be zero.

So instead of doing this:
K.tanh(K.dot(x, self.W))

We can do something like:
K.dot(K.not_equal(x, 0, axis=time_dim), self.W)

This won't work as written, obviously, but you get the idea.
I am not familiar with the Keras backend API, so I don't know how to do it.

I am not sure what I said is correct. Tell me what you think.
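Roughly, what I have in mind is something like this (I am not sure these are the right backend calls, so treat it as a sketch):

# mark a timestep as "real" if any of its features are non-zero
step_mask = K.cast(K.any(K.not_equal(x, 0), axis=-1), 'float32')   # [n_samples, n_steps]

ai = K.exp(K.tanh(K.dot(x, self.W)))
ai = ai * step_mask                                                  # zero out the padded steps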

Pedro Cardoso

unread,
Jan 20, 2017, 12:09:44 PM1/20/17
to Christos Baziotis, Keras-users, zhyuya...@googlemail.com
When you have a mask, the zero timesteps will not be processed in the RNN step.

For TimeDistributed it is more complicated. I changed the TimeDistributed wrapper, but am still getting NaNs.

For info, this was it:

class MaskedTimeDistributed(TimeDistributed):

    def call(self, X, mask=None):
        input_shape = self.input_spec[0].shape
        # batch size matters, use rnn-based implementation
        def step(x, states):
            output = self.layer.call(x)
            return output, []

        last_output, outputs, states = K.rnn(step, X,
                                             initial_states=[],
                                             mask = mask)
        y = outputs
        return y



Christos Baziotis

unread,
Jan 20, 2017, 12:32:21 PM1/20/17
to Keras-users, zhyuya...@googlemail.com
For now I have commented this out:
    def call(self, x, mask=None):
        a = K.tanh(K.dot(x, self.W))

        if self.bias:
            a += self.b

        ai = K.exp(K.dot(a, self.u))

        # if mask is not None:
        #     ai = mask * ai

        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)


I don't know what is going on, but applying the mask like this causes the NaNs...

Alexander Measure

unread,
Jan 20, 2017, 9:04:42 PM1/20/17
to Keras-users, zhyuya...@googlemail.com
I don't know if masking is responsible, but I was having lots of problems with NaNs without masking. I traced my problem to the softmax calculation, which is not numerically stable as implemented in my version, because exp can fairly easily produce results that exceed float32's ability to represent them. A numerically stable version (without masking) is here; the solution is to just use K.softmax. 
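In other words, the middle of call() becomes something like this (a sketch, following my earlier shape comments):

        a1 = K.tanh(K.dot(x, self.W) + self.b)   # [n_samples, n_steps, n_hidden]
        scores = K.sum(a1 * self.u, axis=2)      # [n_samples, n_steps]
        a2 = K.softmax(scores)                   # numerically stable softmax over the steps
        a2 = K.expand_dims(a2)                   # [n_samples, n_steps, 1]
        return K.sum(x * a2, axis=1)             # [n_samples, n_features]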

Christos Baziotis

unread,
Jan 21, 2017, 5:45:58 AM1/21/17
to Keras-users, zhyuya...@googlemail.com
What happens when you apply the mask in this version? Did you try that?
In my case applying the mask causes the NaN's. 

Christos Baziotis

unread,
Jan 21, 2017, 11:53:20 AM1/21/17
to Keras-users, zhyuya...@googlemail.com
Well, this works, but I don't know if it is correct. 
    def call(self, x, mask=None):
        uit = K.tanh(K.dot(x, self.W))

        if self.bias:
            uit += self.b

        ait = K.dot(uit, self.u)

        # apply mask
        if mask is not None:
            mask = K.cast(mask, 'float32')
            ait *= mask

        ai = K.exp(ait)

        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

Is this how you apply the mask? I can't find any example.

bkj...@gmail.com

unread,
Jan 22, 2017, 1:57:41 PM1/22/17
to Keras-users, zhyuya...@googlemail.com
Looks like you guys have been doing a bunch of work here.  Curious -- has anyone been able to reproduce the strong results reported in the original paper?  If we can replicate the MAX/AVE results, the structure of those models may help in debugging the ATT model.

Christos Baziotis

unread,
Jan 31, 2017, 9:02:05 AM1/31/17
to Keras-users, zhyuya...@googlemail.com
I experimented with this kind of attention in the context of sentence classification, so no hierarchy, but the logic is the same. I have to say that I have seen improvements, especially as the length of the input gets longer (words, sentences or whatever...).
But I can't say that it's dramatically better than the simple scenario.

On the other hand, every problem is different.
I suspect that in the case of document classification the gains will be amplified, because first you get better sentence representations (attention over words) and then a better doc representation (attention over sentences).

On a side note, masking doesn't work with TimeDistributed, look here: https://github.com/fchollet/keras/issues/5212
I tried to work with a hierarchy, and even though the mask over words works, when I apply the TimeDistributed layer over the representations of the sentences in order to feed them to the next RNN, the padded sentences are not masked.
If anyone has any ideas, please share them.

Alexander Measure

unread,
Feb 10, 2017, 4:21:37 PM2/10/17
to Keras-users, zhyuya...@googlemail.com, bkj...@gmail.com
I haven't tried to replicate the results on the datasets used in the paper, but I have gotten similar results (i.e. attention beats max pooling by a few points) on 2 other text classification datasets that are not public. I also managed to get your approach working (using pre-existing Keras layers); I think it is much better than the custom layer approach because it makes it much easier to experiment with different attention structures. Building the attention mechanism using existing Keras layers is done almost exactly as you indicated earlier, but with some minor modifications:

lstm = LSTM(rnn_dim, ..., return_sequences=True)                 # [n_samples, n_steps, rnn_dim]
att = TimeDistributed(Dense(rnn_dim, activation='tanh'))(lstm)   # [n_samples, n_steps, rnn_dim]
att = TimeDistributed(Dense(1, bias=False))(att)                 # [n_samples, n_steps, 1]
att = Reshape((n_steps,))(att)                                   # [n_samples, n_steps] (n_steps = sequence length)
att = Activation('softmax')(att)                                 # [n_samples, n_steps]
lstm = merge([att, lstm], mode='dot', dot_axes=(1, 1))           # [n_samples, rnn_dim]

Ben Johnson

unread,
Feb 10, 2017, 4:40:36 PM2/10/17
to Alexander Measure, Keras-users, zhyuya...@googlemail.com
Are you able to post an end-to-end example comparing regular LSTMs, HAN-MAX, HAN-ATT (using one of the standard Keras datasets ideally?)  That would be very much appreciated and I think would save a lot of people a lot of sweat.

~ Ben

Alexander Measure

unread,
Feb 11, 2017, 2:51:38 PM2/11/17
to Keras-users, amea...@gmail.com, zhyuya...@googlemail.com, bkj...@gmail.com
The Keras datasets are already pre-tokenized without regard to sentence structure so we'd need to find another dataset to demonstrate the hierarchical component. You can run attention on any sequence though, here's an example on the Keras IMDB dataset: https://gist.github.com/ameasure/6f3fbdcccab4f319ab8dea4c62206a73

Pedro Cardoso

unread,
Feb 11, 2017, 3:55:17 PM2/11/17
to Christos Baziotis, Keras-users, zhyuya...@googlemail.com
Regarding the masking: if you apply a Masking layer at the start, and the embeddings at the sentence level, you should be ok.

The embedding layer at the sentence level will create the mask.
At the document level, the TimeDistributed will not take the masking into consideration, but that is ok because:
1) the sentence is all zeros, so the output is zero.
2) attention at the document level - for the multiple sentences - will take into consideration the masking created by the Masking layer at the start.

Pedro Cardoso


chen50...@gmail.com

unread,
Feb 15, 2017, 9:08:51 PM2/15/17
to Keras-users, zhyuya...@googlemail.com, bkj...@gmail.com
How about this dataset: http://goo.gl/JyCnZq? See https://github.com/zhangxiangxiao/Crepe for details.

And what is the difference between the performance of a char-level NN and a word-level NN?



Pedro Cardoso

unread,
Mar 2, 2017, 9:29:11 AM3/2/17
to chen50...@gmail.com, Keras-users, zhyuya...@googlemail.com, bkj...@gmail.com
Also, I used IMDB data for evaluating the model


madhum...@gmail.com

unread,
Mar 24, 2017, 7:18:26 AM3/24/17
to Keras-users, zhyuya...@googlemail.com
1) I see here https://gist.github.com/cbaziotis/7ef97ccf71cbc14366835198c09809d2 that you have added a small (epsilon) value to avoid the NaN problem. I also see in the comments that masking does not work with the TimeDistributed layer. Is that still the case?

2) I am trying to understand the context vector they have introduced in the paper. I understand how attention works, but I am confused about why they have a separate query (context) vector representation for every word (in the case of word attention). If it is a vector representing what the properties of an important word should be, I would intuitively expect it to be a single vector for a sentence, not a different one for every word.

Regards,
Madhumita

kamaln...@gmail.com

unread,
Nov 26, 2017, 2:09:26 PM11/26/17
to Keras-users
Hi Richard, 

I read your blog and found it very interesting. When I tried to run the code you provided, I ran into the following problem.

<bound method Container.summary of <keras.engine.training.Model object at 0x7f2d5e366438>>
Traceback (most recent call last):
  File "LSTM_With_Attention.py", line 220, in <module>
    epochs=2, batch_size=1)
  File "/home/kamal/miniconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1575, in fit
    self._make_train_function()
  File "/home/kamal/miniconda3/lib/python3.6/site-packages/keras/engine/training.py", line 960, in _make_train_function
    loss=self.total_loss)
  File "/home/kamal/miniconda3/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/home/kamal/miniconda3/lib/python3.6/site-packages/keras/optimizers.py", line 226, in get_updates
    accumulators = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
  File "/home/kamal/miniconda3/lib/python3.6/site-packages/keras/optimizers.py", line 226, in <listcomp>
    accumulators = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
  File "/home/kamal/miniconda3/lib/python3.6/site-packages/keras/backend/theano_backend.py", line 275, in int_shape
    raise TypeError('Not a Keras tensor:', x)
TypeError: ('Not a Keras tensor:', Elemwise{add,no_inplace}.0)

I am using Keras 2.0.8 and theano 0.9.0. Any help would be highly appreciated

Thanks

aianu...@gmail.com

unread,
Dec 4, 2017, 6:04:40 AM12/4/17
to Keras-users
Hi

Did anyone try implementing the knowledge layer described in the paper titled "Leveraging Knowledge Bases in LSTMs for Improving Machine Reading"
(https://www.cs.cmu.edu/~bishan/papers/kblstm_acl2017.pdf)? It is similar to the attention mechanism, but I am facing problems implementing the knowledge layer. Can anyone help me solve this issue?

Thanks in Advance

tora...@gmail.com

unread,
Feb 27, 2018, 8:14:09 PM2/27/18
to Keras-users
Hi all, Richard,

I've been trying your code on my MacBook and it works fine. However, now I'm moving to a GPU server with tensorflow-gpu as the backend for Keras and am unfortunately getting errors at any Bidirectional layer. This happens for both your implementation of the RNN and the HANN. I'm wondering whether you or anyone else have tried any Keras implementation of HANN on GPU, especially with TensorFlow. I'm now using a tflearn implementation of the RNN model, but it's not easy to translate the attention layer using tflearn layers. I can explain the errors in more detail if needed, but at this point I'm looking for a tested implementation of HANN on GPU.





Quang Anh Dang

unread,
Feb 28, 2018, 9:06:00 PM2/28/18
to Keras-users
Hi Richard,

Thank you for sharing your code. However, I did not see where the two context vectors u_w and u_s (from the original paper) are defined in your AttLayer and trained together with W. Could you please elaborate a bit on this step?

~ QuangAnh



eng.ze...@gmail.com

unread,
Jul 22, 2018, 1:12:13 PM7/22/18
to Keras-users
Hi
I am facing this error:
"ImportError: cannot import name np_utils"                        

robo...@gmail.com

unread,
Sep 19, 2018, 8:44:37 PM9/19/18
to Keras-users
Hello,

This is another interesting implementation of the model, using exactly the same dataset: https://github.com/LukeZhuang/Hierarchical-Attention-Network

Has anyone ever gotten the same performance on the same (Tang) dataset?