I'm using tf.estimator.train_and_evaluate on TF 1.4.1, training locally. If I set tf.estimator.TrainSpec(max_steps=None) and the training input_fn calls dataset.repeat(1), the training input function never throws an OutOfRangeError, so training never stops on its own. What I tried instead was specifying max_steps in the TrainSpec, but now I can only train the model once.
For example,
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=max_steps)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
This will train the model for the expected max_steps (say, 10,000 steps). When training finishes, I'll see this, as expected:
INFO:tensorflow:Saving dict for global step 10000: auc_eval = 0.743603, global_step = 10000, loss = 0.151304
DEBUG:tensorflow:Calling exporter with the `is_the_final_export=True`.
The problem is that if I now want to train the model for an additional 10,000 steps, the model won't train regardless of the max_steps value I pass (I've tried 10,000, 20,000, 1,000,000, and None); instead I get:
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 600 secs (eval_spec.throttle_secs) or training is finished.
INFO:tensorflow:Skipping training since max_steps has already saved.
INFO:tensorflow:Starting evaluation at 2018-02-02-12:19:09
INFO:tensorflow:Restoring parameters from models/test_model/model.ckpt-10000
INFO:tensorflow:Finished evaluation at 2018-02-02-12:19:46
INFO:tensorflow:Saving dict for global step 10000: auc_eval = 0.743603, global_step = 10000, loss = 0.151304
DEBUG:tensorflow:Calling exporter with the `is_the_final_export=True`.
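My understanding is that max_steps is an absolute target on the saved global_step, not a number of additional steps, which would explain why re-running with max_steps=10,000 skips. A rough pure-Python sketch of the guard I assume is being applied (illustrative only, not the actual TensorFlow internals):

```python
def should_skip_training(saved_global_step, max_steps):
    """Illustrative guard, not the real TF source: skip training when the
    checkpointed global_step has already reached max_steps."""
    if max_steps is None:
        return False
    return saved_global_step >= max_steps

# First run, fresh model: trains up to step 10,000.
print(should_skip_training(0, 10000))        # False
# Second run with the same max_steps: skipped, matching the log above.
print(should_skip_training(10000, 10000))    # True
# By this logic, a larger max_steps (or None) should train again...
print(should_skip_training(10000, 20000))    # False
```

...yet in my runs max_steps=20,000, 1,000,000, and None all still produce the "Skipping training" message, which is the part I can't explain.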
Any advice?