help with text prediction with recurrent nets

dav...@gmail.com

May 27, 2015, 10:59:53 PM
to keras...@googlegroups.com
Hi Francois

I want to try to use Keras to build a simple text prediction tool with an LSTM.  To get started, I looked at the IMDB review -> sentiment example, but that seems to be more of a classification problem.  If I want to get started with the IMDB example:

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Embedding
from keras.layers.recurrent import LSTM

model = Sequential()
model.add(Embedding(max_features, 256))
model.add(LSTM(256, 128, activation='sigmoid', inner_activation='hard_sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(128, 1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')

model.fit(X_train, Y_train, batch_size=16, nb_epoch=10)
score = model.evaluate(X_test, Y_test, batch_size=16)

I'll use the "categorical_crossentropy" rather than 'binary_crossentropy', and to obtain my test data, I plan to sample fixed-length sentences from some large text source (say wikipedia) and convert the sentences into arrays of integers that map to unique characters.  So X_train and Y_train will just be sample[0:N-1] and sample[1:N] respectively.

However, is this the correct approach?  My fear is that inside the model, it will simply use all of X_train to predict each output, so the causality condition won't hold.  And I'm not sure what happens to the hidden states when I take the trained model and feed it further text.

The IMDB example follows more of a classification pattern, rather than a stream-of-input / stream-of-output pattern (like machine translation).  So I'm wondering if you can help me out and start me off with another code pattern I can follow when trying to do text prediction with Keras.

If I'm able to get something working well with this awesome library, maybe I can try to contribute it as one of the examples in the documentation as well!  To balance out the recent interest from Torch7 :)

Thanks

David (@hardmaru)

François Chollet

May 28, 2015, 12:57:25 AM
to dav...@gmail.com, keras...@googlegroups.com
Hi David,

As a side note, there is more than me on the mailing list ;-)

What you describe seems correct. Your input would be a 3D tensor with shape (sentences, timesteps(t), letters), and your targets would be 3D tensors with shape (sentences, timesteps(t+1), letters). The letter dimension would be binary in the case of the input, and would be a probability distribution (use categorical_softmax) in the case of the output.

In such a setting, only the input timesteps [0..t] would be used to compute the output at time t. The LSTM layer doesn't look into the future. 

Your network would look similar to:

model = Sequential()
model.add(Embedding(max_features, 256))
model.add(LSTM(256, 29, return_sequences=True))  # assuming 29 letters in the alphabet
model.add(Dropout(0.5))
model.add(Activation('categorical_softmax'))  # output is a probability distribution with shape (samples, timesteps, 29)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

After training, you can generate text with a loop:

while 1:
    next_letter = model.predict_classes(sentence_tensor)[0]
    sentence_tensor = append_letter(sentence_tensor, next_letter)

(Note: I haven't tested any of this...)
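(If it helps, append_letter above is just a placeholder name; a minimal version could look like this, assuming sentence_tensor is a 2D array of integer character indices:)

import numpy as np

def append_letter(sentence_tensor, next_letter):
    # sentence_tensor: shape (1, timesteps) of integer character indices;
    # append the newly predicted index as one more timestep
    new_step = np.array([[next_letter]], dtype=sentence_tensor.dtype)
    return np.concatenate([sentence_tensor, new_step], axis=1)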

And sure, a working example showing text generation or something similar would be really cool to have. 

Cheers,

Francois


François Chollet

May 28, 2015, 1:03:28 AM
to dav...@gmail.com, keras...@googlegroups.com
Alternatively: this would likely work better

model = Sequential()
model.add(Embedding(max_features, 256))
model.add(LSTM(256, 128, return_sequences=True))
model.add(Dropout(0.5))
model.add(TimeDistributedDense(128, 29))  # assuming 29 letters in the alphabet
model.add(Activation('categorical_softmax'))  # output is a probability distribution with shape (samples, timesteps, 29)

François Chollet

May 28, 2015, 1:08:59 AM
to david ha, keras...@googlegroups.com
Actually, use time_distributed_softmax instead of softmax. Man I'm tired.

david ha

May 28, 2015, 1:30:17 AM
to François Chollet, keras...@googlegroups.com
Thanks for the quick response, especially at this time!

I'm gonna have a play around with it.

p.nec...@gmail.com

May 28, 2015, 4:41:26 AM
to keras...@googlegroups.com, dav...@gmail.com
while 1:
    next_letter = model.predict_classes(sentence_tensor)[0]
    sentence_tensor = append_letter(sentence_tensor, next_letter)

^ This is a horrible way to do it, albeit the only way with Keras at the moment. For every new letter you add, the model has to recompute the entire sequence history all over again, so each step takes longer and longer.
It would be nice if you could retain the internal state from the last step when running the model.
It would also be nice to have a temperature option for the softmax. :))


I've been trying to do this letter-by-letter text prediction task for a while now by training on The Lord of the Rings, with not much luck so far. :))
Maybe it just needs more training time.


On Thursday, May 28, 2015 at 6:57:25 AM UTC+2, François Chollet wrote:

david ha

May 28, 2015, 8:06:11 AM
to François Chollet, keras...@googlegroups.com
Thanks for the model suggestion, Francois.

This is what I ended up doing to build the model (it took a few minutes to compile though...)

-----

max_features = len(myDictionary) # in a simple training text file, there were only 51 unique characters

model = Sequential()
model.add(Embedding(max_features, 256))
model.add(LSTM(256, 128, return_sequences=True))
model.add(Dropout(0.5))
model.add(TimeDistributedDense(128, max_features))
model.add(Activation('time_distributed_softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
# this took a while, and gave me a RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
# I looked it up and this warning seems to be okay for other users of Theano

-----

I'm a bit confused about what you said above:

"Your input would be a 3D tensor with shape (sentences, timesteps(t), letters), and your targets would be 3D tensors with shape (sentences, timesteps(t+1), letters). The letter dimension would be binary in the case of the input, and would be a probability distribution (use categorical_softmax) in the case of the output."

I wrote some helper functions to extract, say, 10000 samples from a text file, where each sequence has a length of 101, with all data converted to dtype=int32 (and there are only 51 unique integers in the data).

So now I have an ndarray called samples, which has a shape of (10000, 101), and I take the first 100 elements of each row to be the training inputs and the last 100 elements to be the training targets.

--------

In[138]: 
samples

Out[138]: 
array([[ 2, 25,  0, ...,  1, 15,  0],
       [ 6, 24, 15, ...,  3,  6,  7],
       [ 8,  5,  0, ..., 14,  2,  8],
       ..., 
       [17,  4,  6, ...,  0, 12,  3],
       [ 6,  3,  7, ...,  7,  0, 14],
       [10, 10,  0, ...,  7,  3, 19]], dtype=int32)
In[139]: 

samples.shape
Out[139]: (10000, 101)

In[140]:
X_train = samples[:, 0:-1]
Y_train = samples[:, 1:]

X_train.shape
Out[141]: (10000, 100)

Y_train.shape
Out[142]: (10000, 100)

--------

I want to figure out how to convert this X_train and Y_train into data that the above model can train on.  My understanding is that because we used the Embedding() layer, I can leave X_train as integers, each defining a single character, and I will probably need to unroll the Y_train data into a big set of [0, 0, 0, 0, 1, 0, 0, ..., 0, 0, 0] arrays for the output.  I'm having some trouble understanding how to construct the 3D tensor with the timesteps mentioned above.

(Sorry, I'm not really familiar with how tensors get constructed in Theano or Python!)  Thanks again for the help.

David



p.nec...@gmail.com

May 28, 2015, 8:33:43 AM
to keras...@googlegroups.com, francois...@gmail.com, dav...@gmail.com
You can't just train from indices to indices. You have to convert your targets into one-hot encodings. So a = 1 = [1 0 0 .... 0], b = 2 = [0 1 0 .... 0], etc.
On the input side the embedding layer handles the transformation; on the output side you have to do it yourself.

So your Y_train will be of size (batches, sequence_length, vocabulary_size). In your case that would be like (10000, 100, 51).
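To make that concrete, here is a minimal sketch of the conversion in numpy (using the shapes from your post; the names are just placeholders):

import numpy as np

# samples: the int32 array of shape (10000, 101) from your post, values in [0, 51)
vocabulary_size = 51

X_train = samples[:, :-1]   # (10000, 100) integer indices, fed to Embedding as-is
Y_indices = samples[:, 1:]  # (10000, 100) integer indices, still need one-hot encoding

# one-hot encode the targets -> (10000, 100, 51)
Y_train = np.zeros(Y_indices.shape + (vocabulary_size,), dtype='float32')
for i, sequence in enumerate(Y_indices):
    for t, char_index in enumerate(sequence):
        Y_train[i, t, char_index] = 1.0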

On Thursday, May 28, 2015 at 2:06:11 PM UTC+2, david ha wrote:
...

david ha

May 28, 2015, 10:18:49 AM
to p.nec...@gmail.com, keras...@googlegroups.com, François Chollet
Okay.  I thought I needed to convert the outputs to one-hot encodings.  Thanks very much for the clarification.

So Y_train would be an array of one-hot matrices, a 3D tensor with shape (10000, 100, 51), where every element is either 1 or 0.

And X_train would be an array of integer vectors, with shape (10000, 100).  X_train would not need to be a 3D tensor since the Embedding() layer would convert it from a normal 2D ndarray into a 3D tensor.  I think I get it now!

Regards

David

dav...@gmail.com

May 29, 2015, 9:25:09 PM
to keras...@googlegroups.com, dav...@gmail.com
I'm able to get the flow working, but so far it's just outputting gibberish.  I tried to use gradient clipping as well
(is this how to do it? RMSprop(clipnorm=5.0)):

nLayer0 = 256 # 256
nLayer1 = 128 # 128
# max_features is 51

model = Sequential()
model.add(Embedding(max_features, nLayer0))
model.add(LSTM(nLayer0, nLayer1, return_sequences=True))
model.add(Dropout(0.5))
model.add(TimeDistributedDense(nLayer1, max_features))
model.add(Activation('time_distributed_softmax'))

model.compile(loss='categorical_crossentropy', optimizer=RMSprop(clipnorm=5.0))

So far I'm only testing this on a smallish basic dataset (Paul Graham's essays), so I figure it would work even if I overfit a bit.  There must be a bug in my code somewhere.

When predicting the next letter, I tried to also extract the entire softmax probability distribution, and then sample from it, rather than just taking the character with the highest probability.  Perhaps I can also put in a 'temperature'-like parameter to bias the probabilities in some way as well.

p = model.predict_proba(x, batch_size=1, verbose=0)[0][-1]
next_char = gen_sample(p) # gen_sample samples from p
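For reference, gen_sample is just a helper I wrote; a rough sketch of it, with the 'temperature' idea bolted on, would be something like:

import numpy as np

def gen_sample(p, temperature=1.0):
    # reweight the softmax output: low temperature -> conservative picks,
    # high temperature -> more surprising characters
    p = np.asarray(p, dtype='float64')
    p = np.log(p + 1e-8) / temperature
    p = np.exp(p) / np.sum(np.exp(p))
    # draw a single character index from the reweighted distribution
    return np.argmax(np.random.multinomial(1, p, 1))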

François Chollet

Jun 15, 2015, 8:49:15 PM
to david ha, keras...@googlegroups.com
Hi David,

Sorry for not getting back to you earlier. I just added a text generation example that can write pretty decent Nietzschean philosophy after a couple hours of training on GPU:


It turns out that the "temperature" for sampling (or more generally the choice of the sampling strategy) is critical to get sensible results. 

To note: the script above is very inefficient due to the stateless nature of LSTM units in Keras. We definitely need to develop a stateful way to handle RNNs.

Cheers,

Francois


Stefan Otte

Jun 16, 2015, 3:46:16 AM
to François Chollet, keras...@googlegroups.com
Hey Francois,

can you give us some samples of "pretty decent Nietzschean philosophy" please! :)


Best,
 Stefan


david ha

Jun 16, 2015, 10:58:41 PM
to François Chollet, keras...@googlegroups.com
Great, thanks for the script example! I learned a great deal from it (in addition to the exercise of trying it out myself).

I actually implemented the temperature sampling feature after extracting the probabilities, but I think I didn't train the model enough, definitely not a few hours.

While it may be easy to hack in a way for the model to retain state, the challenge may be to design an elegant code pattern for the RNN model to retain its state, and be able to access  it, while at the same time keeping the existing simple and elegant way of doing things in keras.  It would be like trying to design a very simple interface for a very complicated machinery under the hood.

David


François Chollet

Jun 17, 2015, 1:39:55 AM
to david ha, keras...@googlegroups.com
While it may be easy to hack in a way for the model to retain state, the challenge may be to design an elegant code pattern for the RNN model to retain its state, and be able to access  it, while at the same time keeping the existing simple and elegant way of doing things in keras.  It would be like trying to design a very simple interface for a very complicated machinery under the hood.

Better still would be to design a very simple interface for a very simple machinery under the hood. Interfaces should aim for simplicity, but so should the code itself.

can you give us some samples of "pretty decent Nietzschean philosophy" please! :)

I made some modifications to the existing script (added two layers of Dropout, set maxlen to 20, step to 3). Here are a few word bites after 20 epochs:

"he has given it the sense of unity and self-control as look to the individuals and platoness of men in the soul and the common power, the madied of morals and presurable and belief in the same time and the conscience of their influence, which is the present the conscience of the common end" (I think "platoness" refers to being Platon-like. I like it.)

"the law is a goversion of the common." (I take it to mean, "law is the government of the plebe")

"will the same time and beings and art of the strong and self-distrust of the same and all not only a soul and still store of the same time and artist in sacrifice their own soul, and always the most distrous of a man." (yes, no doubt about that!)

"we can nation of everywhere, the strength of the foundation, and also us the most dinge in the master and art" (it even speaks a bit of German apparently).

p.nec...@gmail.com

Jun 17, 2015, 11:25:01 AM
to keras...@googlegroups.com, dav...@gmail.com
Is there any particular reason not to use the embedding layer for letters in a case like this?
Besides the fact that the vocabulary in this case is much smaller?
...

François Chollet

Jun 17, 2015, 11:53:45 AM
to p.nec...@gmail.com, keras...@googlegroups.com, david ha
Is there any particular reason not to use the embedding layer for letters in a case like this? Besides the fact that the vocabulary in this case is much smaller?

Yes: we want the output to be a probability distribution over characters. If each character was encoded by a dense vector learned with an Embedding layer, then output sampling would become a K-nearest neighbors problem over the embedding space, which would be much more complex to deal with than a dictionary lookup.
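To illustrate the difference (a sketch only; index_to_char, embedding_matrix and embedding_model are placeholder names):

import numpy as np

# Softmax output: decoding the prediction for the last timestep is a dictionary lookup.
probs = model.predict(x)[0, -1]              # shape (vocabulary_size,)
next_char = index_to_char[np.argmax(probs)]

# Embedding-valued output (a hypothetical model trained to regress embeddings):
# decoding would instead require a nearest-neighbour search over the embedding space.
predicted_vec = embedding_model.predict(x)[0, -1]   # shape (embedding_dim,)
distances = np.linalg.norm(embedding_matrix - predicted_vec, axis=1)
next_char = index_to_char[np.argmin(distances)]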


p.nec...@gmail.com

Jun 17, 2015, 2:18:05 PM
to keras...@googlegroups.com
Hm ok, I never considered the input embedding would affect the output in that kind of manner. I thought some form of embedding was going to be done automatically regardless. I'll recheck my code.

François Chollet

Jun 17, 2015, 2:21:37 PM
to p.nec...@gmail.com, keras...@googlegroups.com
Can you share your code? Depending on what your targets are, you might be fine. If you're only using the embedding for the input but you're mapping the output with a simple dictionary, then you'll be fine (i.e. your output is still a probability distribution over characters).

On 17 June 2015 at 11:18, <p.nec...@gmail.com> wrote:
Hm ok, I never considered the input embedding would affect the output in that kind of manner. I thought some form of embedding was going to be done automatically regardless. I'll recheck my code.

p.nec...@gmail.com

Jun 17, 2015, 2:37:22 PM
to keras...@googlegroups.com
I'm not at the computer right now. But I just use an embedding layer on the letters, then 2 LSTMs, and finish off with a TimeDistributedDense softmax.

Logically there shouldn't be any real difference, but I was just curious about the justification. The embedding layer avoids that sparse multiplication. I haven't yet gotten to the sampling part. :p

p.nec...@gmail.com

Jun 19, 2015, 9:23:14 AM
to keras...@googlegroups.com
Also, why use the return_sequences=False flag? Isn't it kind of redundant to train each next letter individually instead of using the whole sequence?
I thought the structure for letter prediction would look more like this:

model = Sequential()
model.add(Embedding(vocabsize, 256))
model.add(LSTM(256, rsize, return_sequences=True, truncate_gradient=-1))
model.add(Dropout(0.2))
model.add(LSTM(rsize, rsize, return_sequences=True, truncate_gradient=-1))
model.add(Dropout(0.2))
model.add(TimeDistributedDense(rsize, vocabsize))
model.add(Activation('time_distributed_softmax'))
rmsprop = RMSprop(lr=0.01, rho=0.99, epsilon=1e-6)  # optimizer, not a loss
model.compile(loss='categorical_crossentropy', optimizer=rmsprop, class_mode="categorical")

p.nec...@gmail.com

Jun 25, 2015, 6:32:31 AM
to keras...@googlegroups.com
Actually, I think I finally see what the logical difference would be, in case anyone else is having this dilemma.

By doing it as in the example, you avoid clinging to irrelevant long-term dependencies (a letter in a word rarely depends on another letter 200 characters back), while if you do it this way, you force the model to take such correlations into account.
Quite interesting.

François Chollet

Jun 25, 2015, 12:48:21 PM
to p.nec...@gmail.com, keras...@googlegroups.com
In my experience, character-by-character prediction has worked better for text generation. Then again, both are possible, as well as everything in between: for instance, you could use the past 200 characters to predict the next 5...


ionesc...@gmail.com

May 5, 2016, 5:08:55 PM
to Keras-users, dav...@gmail.com
Hello,

Is there an "official" example using stateful LSTMs? I am not able to get anything working myself. I have found this:


But it does not seem to learn anything at all, although it does appear to be quite a bit faster.
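For reference, the kind of setup I have been trying looks roughly like this (a sketch only; the sizes and the reset schedule are guesses on my part):

from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense, Activation

batch_size, seq_len, vocab_size = 32, 40, 60  # placeholder sizes

model = Sequential()
model.add(LSTM(128, batch_input_shape=(batch_size, seq_len, vocab_size),
               stateful=True, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_size)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

for epoch in range(10):
    # shuffle=False so that consecutive batches stay consecutive in the text,
    # which is the whole point of keeping the LSTM state across batches
    model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=1, shuffle=False)
    model.reset_states()  # reset the carried-over state between epochs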

Has anyone managed to get text generation working with stateful RNNs?

Thanks!

dha...@infocusp.in

Jul 12, 2017, 3:33:05 AM
to Keras-users, dav...@gmail.com
Hello,

I am trying to implement text generation using Keras. For that I am using the example code given on the Keras GitHub (https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py). But I am using as the input corpus the Sherlock Holmes canon (https://sherlock-holm.es/stories/plain-text/cano.txt), trimmed using (https://gist.github.com/rongjiecomputer/94154e0bf01ef19a4999fef70264c48a).

But it keeps outputting gibberish even after 50 epochs. It doesn't seem to make any progress from epoch to epoch. What results have you had so far?

Is there any change needed in the code?

Thanks
Dhaval