Having trouble understanding the stateful RNN example


KN

Mar 10, 2016, 7:27:39 AM
to Keras-users
In the documentation on stateful RNNs, the following statement is made:

"Making a RNN stateful means that the states for the samples of each batch will be reused as initial states for the samples in the next batch."

Given K samples per batch:

Batch1 = [sample1, sample2, ..., sampleK] producing [B1State1, B1State2, ..., B1StateK]
Batch2 = [sample1, sample2, ..., sampleK] producing [B2State1, B2State2, ..., B2StateK]

Does that statement mean that the state B1State1 is used as the initial state for arriving at B2State1?

If that's the case, I am not able to follow the stateful LSTM example on GitHub. The shape of `cos` is (50000, 1, 1), and with a batch size of 25, the 26th "point" of the time series (i.e. the first sample of the second batch) will use the state from the 1st point of the time series as its initial state. I would expect the initial state of the 26th point to be that of the 25th point...

Can someone please explain what's happening here? Thank you in advance.

KN.

alex.trem...@gmail.com

Apr 19, 2016, 5:43:20 PM
to Keras-users
I think you are correct to be confused. The batch_size of 25 means (as far as I can tell) that output[100] is predicted from input[100], and by recursion, also from input[75], input[50], input[25], input[0].

My guess is that this example works because the signals are predictable over that time-scale. It seems significant that the cosine period is also 25.

In fact, it appears that you get somewhat different learning curves if you use different batch sizes. You can also see what happens if you change the cosine period to something like 15, although it is not as sensitive to this as I expected.

I'm hoping someone will correct me if there is something I have gotten wrong here.

I would prefer example code for training on multiple time series, say 100 series with 500 time steps each. Use a batch size of 25 so that each batch has shape (25, 500, 1). Each of these batches must then be broken up into 500 sub-batches, each of shape (25, 1, 1). These sub-batches would be fed in using the train_on_batch API, and you'd call reset_states() after the 500 sub-batches, before the next batch of 25. This is like the code in the documentation that you linked to.
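For concreteness, a rough sketch of that loop in Keras (the hidden size of 32, the optimizer, and the random placeholder arrays are my own assumptions, not taken from the example):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# 100 series of 500 time steps with 1 feature, processed in batches of 25 series
n_series, n_steps, n_features = 100, 500, 1
batch_size = 25

data = np.random.randn(n_series, n_steps, n_features)   # placeholder inputs
targets = np.random.randn(n_series, n_steps, 1)         # placeholder targets

model = Sequential()
model.add(LSTM(32, batch_input_shape=(batch_size, 1, n_features), stateful=True))
model.add(Dense(1))
model.compile(loss='mse', optimizer='rmsprop')

for epoch in range(10):
    for start in range(0, n_series, batch_size):
        for t in range(n_steps):                          # 500 sub-batches of shape (25, 1, 1)
            x_t = data[start:start + batch_size, t:t + 1, :]
            y_t = targets[start:start + batch_size, t, :]
            model.train_on_batch(x_t, y_t)
        model.reset_states()                              # before moving on to the next 25 series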

-A

Keith Trnka

Apr 19, 2016, 6:26:04 PM
to Keras-users
I had a TON of trouble with that example. What I ended up learning is that it's not useful:
-Replacing all LSTM layers with Dense/ReLU achieves much lower error
-Reducing from 2 hidden layers to 1 achieves much lower error

The comment "since we are using stateful rnn tsteps can be set to 1" is very misleading. As far as I can tell, the stateful option does no more than what it says - sets initial hidden values to the corresponding ones from the previous batch (like KN says). It doesn't backprop into the previous batch. You still need to unroll time by stacking shifted versions of your input matrix. With or without statefulness this is what's controlling the backprop-through-time.

Statefulness just carries a tiny bit of information over into your BPTT so that your network has some capability of seeing information beyond the time-step window.

To mimic a traditional training setup you want batch size and time steps to be equal. If they're both 5 then you have the current input (call it input_i) and the inputs at 4 previous time steps (input_i-1, ..., input_i-4). With batch size 5, the aligned input in the previous batch is input_i-5 with its time-unrolled part input_i-6, ..., input_i-9. Stateful is copying the output hidden state from sample (i-5, ..., i-9) as the input hidden state for sample (i, ..., i-4), which mimics what a longer BPTT window would do, but without backprop across the batch.
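A minimal sketch of that setup (the hidden size, the optimizer, and the commented-out fit call are assumptions on my part):

from keras.models import Sequential
from keras.layers import LSTM, Dense

batch_size = 5   # equal to the number of unrolled time steps
timesteps = 5
features = 1

model = Sequential()
model.add(LSTM(16, batch_input_shape=(batch_size, timesteps, features), stateful=True))
model.add(Dense(1))
model.compile(loss='mse', optimizer='rmsprop')

# X must be ordered (not shuffled) so that sample i of each batch is the
# continuation of sample i of the previous batch, i.e. row k of X covers
# inputs k-4 ... k after time unrolling.
# model.fit(X, y, batch_size=batch_size, nb_epoch=1, shuffle=False)
# model.reset_states()   # between independent sequences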

... At least that's my understanding after struggling with it. I was hoping it was a way to avoid increasing my input size 10x for 10 time steps but it doesn't seem that way at all.

alex.trem...@gmail.com

Apr 19, 2016, 6:48:18 PM
to Keras-users
Thanks for sharing your insights. To me, this is key: "It doesn't backprop into the previous batch." So using tsteps=1 puts a pretty severe limitation on the model's ability to learn long-term relationships.

Also, if I understand you correctly, you are suggesting forming a Hankel matrix from the series for each batch to get the samples to line up correctly between batches. That seems right.

I was hoping this would provide a mechanism for working with series of variable length, but like you, my hopes are dashed.

Keith Trnka

Apr 19, 2016, 6:58:25 PM
to alex.trem...@gmail.com, Keras-users
Heh, I had to google Hankel matrix, but yes, each row of input becomes a Hankel matrix if I understand the Wikipedia page right. If you have a single input, keras.preprocessing.sequence.pad_sequences will work. Now that I know the Hankel term... it seems that scipy has scipy.linalg.hankel to do this too.

Here's the helper I'm using to convert my (num_samples, num_features) to (num_samples, num_time_steps, num_features) like Keras wants:
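A minimal sketch of such a helper (my own, assuming zero-padding for rows that don't yet have a full history) might be:

import numpy as np

def unroll_time(X, num_time_steps):
    """Turn (num_samples, num_features) into (num_samples, num_time_steps, num_features)
    by stacking each row together with the rows that precede it; rows near the
    start of the series are zero-padded where no history exists."""
    num_samples, num_features = X.shape
    out = np.zeros((num_samples, num_time_steps, num_features), dtype=X.dtype)
    for i in range(num_samples):
        window = X[max(0, i - num_time_steps + 1):i + 1]
        out[i, num_time_steps - window.shape[0]:] = window
    return out

# e.g. unroll_time(np.arange(6).reshape(6, 1), 3)[3] is [[1], [2], [3]]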





Dmitry Lukovkin

Apr 21, 2016, 7:04:54 AM
to Keras-users
Hi, Keith!

You are probably right, judging by the code.
States have shape (nb_samples, output_dim), so state[sample_N, :] corresponds to the 'final' state of the time sequence contained in that sample. It seems that the only way to obtain consecutive states for all timesteps is to stack shifted subsets of the input matrix.
But in that case training will take a long time for sufficiently long input sequences.

Have you compared accuracy of stateful and stateless implementations?

Best regards,
Dmitry Lukovkin

Keith Trnka

Apr 22, 2016, 1:20:59 AM
to Dmitry Lukovkin, Keras-users
I tried it out today though I'm not 100% sure that I've implemented it correctly.

I tried confirming the implementation by making X = random 1000x1, Y column 0 = mean of the previous 10 X values, Y column 1 = mean of the previous 20, with the time unrolling set to 10 steps. The first output is what I use to unit-test my plain LSTM, and I require that it improves over an MLP. The second (hopefully) should be something a stateful LSTM can model at this time setting. Strangely, what I found was that the stateless LSTM performed horribly when I added that second column (often worse than linear regression). As a sanity check I tested them both on just outputting the first column, and the odd thing is that the stateless LSTM typically had about half the error even though the full data is within the BPTT window. (Long story short, I *think* the code is working even if I don't quite understand my test.)
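Roughly, that synthetic check could be generated like this (whether the "previous 10" window includes the current step, and the seed, are guesses on my part):

import numpy as np

np.random.seed(0)                  # arbitrary seed
n = 1000
X = np.random.randn(n, 1)

Y = np.zeros((n, 2))
for i in range(n):
    Y[i, 0] = X[max(0, i - 10):i, 0].mean() if i > 0 else 0.0   # mean of (up to) previous 10 X
    Y[i, 1] = X[max(0, i - 20):i, 0].mean() if i > 0 else 0.0   # mean of (up to) previous 20 X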

On a small real-world data set, what I found:
-stateful had slightly higher error (with all other params the same and early stopping disabled)
-it's annoying to make sure all the arrays are multiples of the batch size; in particular, for the outputs you need to pad X_test and then unpad y_pred
-because we don't want to unroll time too much, my batch size was limited; I tried 8 for both time steps and batch size, but training is very slow compared to a batch size of 64. It would probably be faster to train stateless with 16 time steps, just due to the batch size
-I disabled the validation loss because I think that needs to be a multiple of the batch size as well, and it's annoying to deal with (actually, I don't know if there's a way to reset the state before doing the internal validation anyway)
-I had to disable a training "warmup" run I like to do: a single epoch at batch size 1. (I learned last week that doing this eliminated most of the NaN issues I was having before.)



Dmitry Lukovkin

Apr 22, 2016, 5:25:21 AM
to Keras-users, dmitry....@gmail.com
I don't think I completely understood your setup, but there is a very interesting point about whether the states are reset before the validation pass during an epoch. Probably not, since you have to manually reset states before the next epoch anyway.

Best regards,
Dmitry Lukovkin 

bais...@gmail.com

Oct 10, 2018, 11:02:38 AM
to Keras-users
Hi KT,

Thanks for your explanation. It helps me understand stateful LSTMs better. But I am still confused: should we reset the state when we start a new epoch?

I am really confused by stateful LSTMs. Could you please give me more info? Thank you very much.
Lei

On Wednesday, April 20, 2016 at 8:26:04 AM UTC+10, Keith Trnka wrote: