I think that the HN-MAX is (roughly):
from keras.models import Model
from keras.layers import Input, Dense, Embedding, GRU, Bidirectional, TimeDistributed, GlobalMaxPooling1D
from keras.optimizers import SGD

max_sents = # maximum number of sentences per document
max_words = # maximum number of words per sentence

x = Input(shape=(max_sents, max_words,))
emb_words = TimeDistributed(Embedding(input_dim=max_features, output_dim=200, mask_zero=True))(x)
emb_sents = TimeDistributed(Bidirectional(GRU(50, return_sequences=True)))(emb_words)
emb_sents = TimeDistributed(GlobalMaxPooling1D())(emb_sents)
emb_docs = Bidirectional(GRU(50, return_sequences=True))(emb_sents)
emb_docs = GlobalMaxPooling1D()(emb_docs)
prediction = Dense(y_train.shape[1], activation='softmax')(emb_docs)
model = Model(input=x, output=prediction)
model.compile(loss='categorical_crossentropy', optimizer=SGD(momentum=0.9), metrics=['accuracy'])
Turning this into the HN-AVG variant is fairly straightforward, and for the HN-ATT variant you'd have to write a little attention unit, but that shouldn't be particularly difficult.
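For what it's worth, here is roughly what I mean by the HN-AVG variant -- a minimal sketch under the same placeholder setup as above, just swapping the max pooling for average pooling (GlobalAveragePooling1D):

# HN-AVG sketch: same hierarchy as HN-MAX above, with mean pooling instead of max pooling
from keras.layers import GlobalAveragePooling1D

emb_sents = TimeDistributed(Bidirectional(GRU(50, return_sequences=True)))(emb_words)
emb_sents = TimeDistributed(GlobalAveragePooling1D())(emb_sents)   # average word states per sentence
emb_docs = Bidirectional(GRU(50, return_sequences=True))(emb_sents)
emb_docs = GlobalAveragePooling1D()(emb_docs)                      # average sentence states per document
prediction = Dense(y_train.shape[1], activation='softmax')(emb_docs)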
Using the Yelp 2013 data from http://ir.hit.edu.cn/~dytang/, I haven't been able to reproduce the results in the paper (I'm getting within 1-2% with HN-MAX and HN-AVG). I'm not using pretrained word vectors, so perhaps that's the reason. I'd love for someone to modify this code to reproduce the results more exactly, as I'm not 100% sure where I'm going wrong.
~ Ben
(Also, the code as posted above is what they implemented in the paper -- I've been using the `rmsprop` optimizer in my experiments, since in the past I've had better luck with it for RNNs.)
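Concretely, that just means swapping the compile call, e.g.:

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])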
On Tuesday, December 27, 2016 at 7:25:42 PM UTC-5, bkj...@gmail.com wrote:
I think that the HN-MAX is (roughly) max_sents = # maximum number of sentences per document ...
import numpy as np
import pandas as pd
import cPickle
from collections import defaultdict
import re

from bs4 import BeautifulSoup

import sys
import os
os.environ['KERAS_BACKEND'] = 'theano'

from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.layers import Embedding
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding, Merge, Dropout, LSTM, GRU, Bidirectional, TimeDistributed, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.models import Model
from keras import backend as K
from keras.engine.topology import Layer, InputSpec
from keras import initializations

MAX_SENT_LENGTH = 100
MAX_SENTS = 15
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2
def clean_str(string):
    """
    Tokenization/string cleaning for dataset
    Every dataset is lower cased except
    """
    string = re.sub(r"\\", "", string)
    string = re.sub(r"\'", "", string)
    string = re.sub(r"\"", "", string)
    return string.strip().lower()
data_train = pd.read_csv('labeledTrainData.tsv', sep='\t')
print data_train.shape

import nltk
from nltk import tokenize
# nltk.download('punkt')

reviews = []
labels = []
texts = []
for idx in range(data_train.review.shape[0]):
    text = BeautifulSoup(data_train.review[idx])
    text = clean_str(text.get_text().encode('ascii', 'ignore'))
    texts.append(text)
    sentences = tokenize.sent_tokenize(text)
    reviews.append(sentences)
    labels.append(data_train.sentiment[idx])
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
data = np.zeros((len(texts), MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
for i, sentences in enumerate(reviews):
    for j, sent in enumerate(sentences):
        if j < MAX_SENTS:
            wordTokens = text_to_word_sequence(sent)
            for k, word in enumerate(wordTokens):
                if k < MAX_SENT_LENGTH:
                    data[i, j, k] = tokenizer.word_index[word]

word_index = tokenizer.word_index
print('Total %s unique tokens.' % len(word_index))
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
print('Number of positive and negative reviews in training and validation set')
print y_train.sum(axis=0)
print y_val.sum(axis=0)
GLOVE_DIR = ""
embeddings_index = {}
f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Total %s word vectors.' % len(embeddings_index))
"""
embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SENT_LENGTH,
                            trainable=True)
sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(LSTM(100))(embedded_sequences)
sentEncoder = Model(sentence_input, l_lstm)

review_input = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(LSTM(100))(review_encoder)
preds = Dense(2, activation='softmax')(l_lstm_sent)
model = Model(review_input, preds)

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

print("model fitting - Hierarchical LSTM")
print model.summary()
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          nb_epoch=10, batch_size=50)
"""
# building Hierarchical Attention network
embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SENT_LENGTH,
                            trainable=True,
                            mask_zero=True)
class AttLayer(Layer):
    def __init__(self, **kwargs):
        self.init = initializations.get('normal')
        # self.input_spec = [InputSpec(ndim=3)]
        super(AttLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3
        # self.W = self.init((input_shape[-1], 1))
        self.W = self.init((input_shape[-1],))
        # self.input_spec = [InputSpec(shape=input_shape)]
        self.trainable_weights = [self.W]
        super(AttLayer, self).build(input_shape)  # be sure you call this somewhere!

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W))
        ai = K.exp(eij)
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[-1])

"""
sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(GRU(100, return_sequences=True))(embedded_sequences)
l_dense = TimeDistributed(Dense(200))(l_lstm)
l_att = AttLayer()(l_dense)
sentEncoder = Model(sentence_input, l_att)

review_input = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(GRU(100, return_sequences=True))(review_encoder)
l_dense_sent = TimeDistributed(Dense(200))(l_lstm_sent)
l_att_sent = AttLayer()(l_dense_sent)
preds = Dense(2, activation='softmax')(l_att_sent)
model = Model(review_input, preds)
"""
from keras.engine import Layer
from keras import initializations
from keras import backend as K


class Attention(Layer):
    '''Attention operation for temporal data.
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    '''
    def __init__(self, attention_dim, **kwargs):
        self.init = initializations.get('glorot_uniform')
        self.attention_dim = attention_dim
        super(Attention, self).__init__(**kwargs)
    def build(self, input_shape):
        self.W = self.init((self.attention_dim, self.attention_dim),
                           name='{}_W'.format(self.name))
        self.b = K.zeros((self.attention_dim,), name='{}_b'.format(self.name))
        self.u = K.zeros((self.attention_dim,), name='{}_u'.format(self.name))
        self.trainable_weights += [self.W, self.b, self.u]

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[2])

    def call(self, x, mask=None):
        a = K.tanh(K.dot(x, self.W) + self.b)
        alpha = K.exp(K.dot(a, self.u))
        alpha = alpha / K.sum(alpha)
        return x * K.tile(alpha, (self.attention_dim, 1))
x = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH,))
emb_words = TimeDistributed(embedding_layer)(x)
emb_sents = TimeDistributed(Bidirectional(GRU(100, return_sequences=True)))(emb_words)
emb_sents = TimeDistributed(Attention(200))(emb_sents)
emb_docs = Bidirectional(GRU(100, return_sequences=True))(emb_sents)
emb_docs = Attention(200)(emb_docs)
prediction = Dense(2, activation='softmax')(emb_docs)
model = Model(input=x, output=prediction)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

print("model fitting - Hierarchical Attention networks")
print model.summary()
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          nb_epoch=10, batch_size=50)
class Attention(Layer):
    '''Attention operation for temporal data.
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    '''
    def __init__(self, attention_dim, **kwargs):
        self.init = initializations.get('glorot_uniform')
        self.attention_dim = attention_dim
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.init((self.attention_dim, self.attention_dim),
                           name='{}_W'.format(self.name))
        self.b = K.zeros((self.attention_dim,), name='{}_b'.format(self.name))
        self.u = K.zeros((self.attention_dim,), name='{}_u'.format(self.name))
        self.trainable_weights += [self.W, self.b, self.u]

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[2])

    def call(self, x, mask=None):
        a = K.tanh(K.dot(x, self.W) + self.b)
        ai = K.exp(K.dot(a, self.u))
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)
x = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH,))
emb_words = TimeDistributed(embedding_layer)(x)
emb_sents = TimeDistributed(Bidirectional(GRU(100, return_sequences=True)))(emb_words)
emb_sents = TimeDistributed(Attention(200))(emb_sents)
emb_docs = Bidirectional(GRU(100, return_sequences=True))(emb_sents)
emb_docs = Attention(200)(emb_docs)
prediction = Dense(2, activation='softmax')(emb_docs)
model = Model(input=x, output=prediction)
Epoch 4/10
20000/20000 [==============================] - 722s - loss: 0.1397 - acc: 0.9478 - val_loss: 0.2481 - val_acc: 0.9006
from keras.engine.topology import Layer
from keras import initializations
from keras import backend as K


class Attention(Layer):
    '''Attention operation for temporal data.
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    '''
    def __init__(self, attention_dim, **kwargs):
        self.init = initializations.get('glorot_uniform')
        self.attention_dim = attention_dim
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.init((self.attention_dim, self.attention_dim),
                           name='{}_W'.format(self.name))
        self.b = K.zeros((self.attention_dim,), name='{}_b'.format(self.name))
        self.u = self.init((self.attention_dim,), name='{}_u'.format(self.name))
        self.trainable_weights += [self.W, self.b, self.u]
        self.built = True

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[2])

    def call(self, x, mask=None):
        # Calculate the first hidden activations
        a1 = K.tanh(K.dot(x, self.W) + self.b)  # [n_samples, n_steps, n_hidden]
        # K.dot won't let us dot a 3D with a 1D so we do it with mult + sum
        mul_a1_u = a1 * self.u                  # [n_samples, n_steps, n_hidden]
        dot_a1_u = K.sum(mul_a1_u, axis=2)      # [n_samples, n_steps]
        # Calculate the per-step attention weights
        a2_num = K.exp(dot_a1_u)                # [n_samples, n_steps]
        a2_den = K.sum(a2_num, axis=1)          # [n_samples]
        a2_den = K.expand_dims(a2_den)          # [n_samples, 1] so div broadcasts
        a2 = a2_num / a2_den                    # [n_samples, n_steps]
        a2 = K.expand_dims(a2)                  # [n_samples, n_steps, 1] so mult broadcasts
        # Apply attention weights to steps
        weighted_input = x * a2                 # [n_samples, n_steps, n_features]
        # Sum across the weighted steps to get the pooled activations
        return K.sum(weighted_input, axis=1)
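To sanity-check the shape bookkeeping in that call, here is a plain NumPy re-run of the same arithmetic (illustrative only, not the Keras layer; the sizes n_samples=2, n_steps=3, n_hidden=4 are made up):

import numpy as np

n_samples, n_steps, n_hidden = 2, 3, 4          # made-up sizes for illustration
x = np.random.randn(n_samples, n_steps, n_hidden)
W = np.random.randn(n_hidden, n_hidden)
b = np.zeros(n_hidden)
u = np.random.randn(n_hidden)

a1 = np.tanh(np.dot(x, W) + b)                  # (2, 3, 4)
dot_a1_u = (a1 * u).sum(axis=2)                 # (2, 3)
a2 = np.exp(dot_a1_u)
a2 = a2 / a2.sum(axis=1, keepdims=True)         # per-step weights, each row sums to 1
pooled = (x * a2[:, :, None]).sum(axis=1)       # (2, 4)
print(pooled.shape)                             # (2, 4) == (n_samples, n_features)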
class AttLayer(Layer):
    def __init__(self, **kwargs):
        self.init = initializations.get('normal')
        # self.input_spec = [InputSpec(ndim=3)]
        super(AttLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3
        # self.W = self.init((input_shape[-1], 1))
        self.W = self.init((input_shape[-1],))
        # self.input_spec = [InputSpec(shape=input_shape)]
        self.trainable_weights = [self.W]
        super(AttLayer, self).build(input_shape)  # be sure you call this somewhere!

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W))
        ai = K.exp(eij)
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[-1])
u_{it} = \tanh(W_w h_{it} + b_w)
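That is the first step of the word-level attention in the Yang et al. paper; the remaining steps (with u_w the trainable context vector and h_it the GRU state for word t of sentence i) are:

\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t} \exp(u_{it}^{\top} u_w)}

s_i = \sum_{t} \alpha_{it} h_{it}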
class Attention(Layer):
    def __init__(self, **kwargs):
        self.supports_masking = True
probabilities = Dense(classes)(sentence)
probabilities = Activation('softmax')(probabilities)
model = Model(input=_input, output=probabilities)
model.compile(optimizer=Adam(clipnorm=5., lr=0.001), loss='categorical_crossentropy')
ValueError: Layer dense_1 does not support masking, but was passed an input_mask: Elemwise{neq,no_inplace}.0
class Attention(Layer):
    def __init__(self, **kwargs):
        self.supports_masking = True  # this
        self.init = initializations.get('glorot_uniform')
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.W = self.init((input_shape[-1],), name='{}_W'.format(self.name))
        self.b = K.ones((input_shape[1],), name='{}_b'.format(self.name))
        self.trainable_weights = [self.W, self.b]
        super(Attention, self).build(input_shape)

    def compute_mask(self, input, input_mask=None):  # and this
        return None

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W) + self.b)
        ai = K.exp(eij)
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

    def get_output_shape_for(self, input_shape):
        return input_shape[0], input_shape[-1]
    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W) + self.b)
        ai = K.exp(eij)
        if mask is not None:
            ai = mask * ai
        ...
For this to work, you need to activate the mask in the Embedding layer and add a Masking layer at the sentence level.
Also, without the context vector, doesn't this reduce to a simple softmax?
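A rough sketch of the wiring I have in mind (layer sizes are placeholders, Attention is the masked class from above, and I'm not sure the mask actually survives the TimeDistributed wrapper in older Keras versions):

from keras.models import Model
from keras.layers import Input, Embedding, Masking, TimeDistributed, Bidirectional, GRU

# word level: mask_zero=True makes the Embedding emit a mask for padded word ids
sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded = Embedding(MAX_NB_WORDS, EMBEDDING_DIM, mask_zero=True)(sentence_input)
word_states = Bidirectional(GRU(100, return_sequences=True))(embedded)
sent_vector = Attention()(word_states)           # masked Attention class from above
sentEncoder = Model(sentence_input, sent_vector)

# sentence level: mask out padded sentences (rows that encode to all zeros)
review_input = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
review_encoded = TimeDistributed(sentEncoder)(review_input)
review_encoded = Masking(mask_value=0.0)(review_encoded)
sent_states = Bidirectional(GRU(100, return_sequences=True))(review_encoded)
doc_vector = Attention()(sent_states)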
    def call(self, x, mask=None):
        a = K.tanh(K.dot(x, self.W) + self.b)
        ai = K.exp(K.dot(a, self.u))
        if mask is not None:
            ai = mask * ai
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)
    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W))
        if self.bias:
            eij += self.b
        ai = K.exp(eij)
        if mask is not None:
            ai = mask * ai
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)
    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W) + self.b)
        ai = K.exp(eij)
        if mask is not None:
            ai = mask * ai
and I get loss: nan during training...
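One possible culprit (just a guess, not a confirmed fix): if the mask zeroes out every step of a sequence, the softmax denominator becomes 0 and the division produces NaNs. Adding K.epsilon() to the denominator and casting the mask explicitly is the usual workaround -- a sketch of the same call with that change:

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W) + self.b)
        ai = K.exp(eij)
        if mask is not None:
            ai = ai * K.cast(mask, K.floatx())
        # add K.epsilon() so all-masked rows don't divide by zero
        weights = ai / K.cast(K.sum(ai, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        weighted_input = x * K.expand_dims(weights)
        return K.sum(weighted_input, axis=1)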
K.tanh(K.dot(x, self.W))
K.dot(K.not_equal(x, 0, axis=time_dim), self.W)
    def call(self, x, mask=None):
        a = K.tanh(K.dot(x, self.W))
        if self.bias:
            a += self.b
        ai = K.exp(K.dot(a, self.u))
        # if mask is not None:
        #     ai = mask * ai
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)
    def call(self, x, mask=None):
        uit = K.tanh(K.dot(x, self.W))
        if self.bias:
            uit += self.b
        ait = K.dot(uit, self.u)
        ai = K.exp(ait)
        # apply the mask after the exp, so padded steps get zero weight
        # (masking before the exp would leave them with exp(0) = 1)
        if mask is not None:
            ai *= K.cast(mask, 'float32')
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)
lstm = LSTM(rnn_dim, ..., return_sequences=True)                # [n_samples, n_steps, rnn_dim]
att = TimeDistributed(Dense(rnn_dim, activation='tanh'))(lstm)  # [n_samples, n_steps, rnn_dim]
att = TimeDistributed(Dense(1, bias=False))(att)                # [n_samples, n_steps, 1]
att = Flatten()(att)                                            # [n_samples, n_steps]  (Reshape((rnn_dim,)) only works if n_steps == rnn_dim)
att = Activation('softmax')(att)                                # [n_samples, n_steps]
lstm = merge([att, lstm], mode='dot', dot_axes=(1, 1))          # [n_samples, rnn_dim]
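If it helps, here is how that pattern could be wired into a complete (non-hierarchical) sentence classifier -- a sketch only, with made-up sizes (max_len, vocab_size, embed_dim, rnn_dim, n_classes) and the old Keras 1.x merge API:

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed, Flatten, Activation, merge

max_len, vocab_size, embed_dim, rnn_dim, n_classes = 100, 20000, 100, 128, 2  # made-up sizes

words = Input(shape=(max_len,), dtype='int32')
emb = Embedding(vocab_size, embed_dim)(words)
states = LSTM(rnn_dim, return_sequences=True)(emb)              # [n_samples, max_len, rnn_dim]

att = TimeDistributed(Dense(rnn_dim, activation='tanh'))(states)
att = TimeDistributed(Dense(1, bias=False))(att)                # one score per step
att = Flatten()(att)
att = Activation('softmax')(att)                                # attention weights over steps
pooled = merge([att, states], mode='dot', dot_axes=(1, 1))      # weighted sum of the states

preds = Dense(n_classes, activation='softmax')(pooled)
model = Model(input=words, output=preds)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])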
Hi all,
I have implemented the paper using Keras. Here is my GitHub repository, https://github.com/richliao/textClassifier, and I have written a blog post about the implementation: https://richliao.github.io/supervised/classification/2016/12/26/textclassifier-HATN/. Feedback is very welcome.
Richard
On Wednesday, December 28, 2016 at 8:16:18 PM UTC-5, Ben Johnson wrote:
I believe you have to downgrade from 1.2.0 to 1.1.1. There was a bug introduced recently that messes up the TimeDistributed layers.
~ Ben

On Wed, Dec 28, 2016 at 8:09 PM Yasser Hifny <yhi...@gmail.com> wrote:
Hi,
when testing your code:

from keras.models import Sequential, Model
from keras.layers import Input, Dense, TimeDistributed
from keras.layers import GRU, GlobalMaxPooling1D, Bidirectional, Embedding, LSTM
from keras.optimizers import SGD

max_sents = 100
max_words = 50

x = Input(shape=(max_sents, max_words,))
emb_words = TimeDistributed(Embedding(input_dim=1000, output_dim=200, mask_zero=True))(x)
emb_sents = TimeDistributed(Bidirectional(GRU(50, return_sequences=True)))(emb_words)
emb_sents = TimeDistributed(GlobalMaxPooling1D())(emb_sents)
emb_docs = Bidirectional(GRU(50, return_sequences=True))(emb_sents)
emb_docs = GlobalMaxPooling1D()(emb_docs)
prediction = Dense(44, activation='softmax')(emb_docs)
model = Model(input=x, output=prediction)
model.compile(loss='categorical_crossentropy', optimizer=SGD(momentum=0.9), metrics=['accuracy'])
print model.summary()

I got this error:

$ python han.py
Using Theano backend.
/usr/lib/python2.7/site-packages/keras/engine/topology.py:368: UserWarning: The `regularizers` property of layers/models is deprecated. Regularization losses are now managed via the `losses` layer/model property.
  warnings.warn('The `regularizers` property of '
Traceback (most recent call last):
  File "han.py", line 15, in <module>
    emb_sents = TimeDistributed(Bidirectional(GRU(50, return_sequences=True)))(emb_words)
  File "/usr/lib/python2.7/site-packages/keras/engine/topology.py", line 569, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/usr/lib/python2.7/site-packages/keras/engine/topology.py", line 632, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/usr/lib/python2.7/site-packages/keras/engine/topology.py", line 164, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/usr/lib/python2.7/site-packages/keras/layers/wrappers.py", line 129, in call
    y = self.layer.call(X)  # (nb_samples * timesteps, ...)
  File "/usr/lib/python2.7/site-packages/keras/layers/wrappers.py", line 203, in call
    Y = self.forward_layer.call(X, mask)
  File "/usr/lib/python2.7/site-packages/keras/layers/recurrent.py", line 201, in call
    input_shape = K.int_shape(x)
  File "/usr/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 128, in int_shape
    raise Exception('Not a Keras tensor:', x)
Exception: ('Not a Keras tensor:', Reshape{3}.0)

how to solve this error?
Thanks
Yasser
On Tuesday, December 27, 2016 at 7:26:57 PM UTC-5, bkj...@gmail.com wrote:
(And also, the code as posted above is what they implemented in the paper -- I've been using the `rmsprop` optimizer in my experiments since in the past I've had better luck using it with RNNs)
On Thursday, December 22, 2016 at 11:36:35 AM UTC-5, bkj...@gmail.com wrote:
Oh actually, apologies, it's been a while since I read the whole paper: they don't average the word or sentence embeddings, they encode them with a GRU. So it's actually a little more involved than fastText + reweighting, but I'll try implementing both variants.
I haven't implemented it, but I was going to look into it. I agree that the results are very impressive, though I'm actually a little confused about where the gains are coming from. They show that HN-ATT outperforms HN-AVE and HN-MAX, but the two simpler variants already outperform all of the other benchmarks. Correct me if I'm wrong, but HN-AVE just:
a) splits the document into a set of sentences
b) averages word embeddings within each sentence to get a set of sentence representations
c) averages the sentence representations to get a document representation
I think this is really just equivalent to the model from this paper, plus some reweighting of words based on the length of the sentences they're found in (a rough sketch of that averaging baseline is below). The linked paper actually evaluates on some of the same datasets as the HN and consistently underperforms HN-AVE by 2-3%. All of this would suggest that the gains come from the "hierarchical" part of "hierarchical attention networks", which I think is pretty interesting. I'm going to try to implement HN-AVE in the next couple of weeks, and will report back here if I find anything interesting.
~ Ben
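For concreteness, a rough NumPy sketch of the pure-averaging reading in a)-c) above (no GRU encoders, so the correction in the later message applies; the shapes and names here are made up):

import numpy as np

embed_dim = 100
embeddings = np.random.randn(20000, embed_dim)       # stand-in word embedding table

def hn_ave_document_vector(document):
    """document: list of sentences, each a list of word ids."""
    sent_vecs = []
    for sentence in document:                        # a) document as a set of sentences
        word_vecs = embeddings[sentence]             # (n_words, embed_dim)
        sent_vecs.append(word_vecs.mean(axis=0))     # b) average words within the sentence
    sent_vecs = np.stack(sent_vecs)                  # (n_sentences, embed_dim)
    return sent_vecs.mean(axis=0)                    # c) average the sentence representations

doc = [[12, 7, 301], [5, 99]]                        # two toy sentences
print(hn_ave_document_vector(doc).shape)             # (100,)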