Breakpoints on PyCharm while training with tf.estimator


Roy

May 8, 2017, 12:57:23 AM
to Discuss
One can debug TensorFlow by printing output or by using tf.summary.

The PyCharm IDE, similarly to MATLAB, offers breakpoints, which I really like for development.

When I build a neural network and train it with a TF estimator (tf.contrib.learn.Estimator, with an input_fn returning feature/label dictionaries), breakpoints I add are only hit during a first run, which seems to be an initial 'construction' of the network; this helps verify tensor sizes, but afterwards, during training/evaluation, the breakpoints are ignored.

Two questions:
1. Is there a way to stop at breakpoints during training, so I can see, for example, how the weights and loss progress? I have a NaN loss and am trying to figure out why, so this could be helpful.
2. What is actually the reason for this initial run?

Thanks for your time!

Igor Pechersky

May 8, 2017, 1:46:23 AM
to Roy, Discuss

Will tfdbg [https://www.tensorflow.org/programmers_guide/debugger] do?



Roy

May 8, 2017, 5:44:36 AM
to Discuss
Thank you, Igor!

Indeed, I see that the documentation for using tfdbg with tf.contrib.learn is helpful now; when I checked it a while ago it was lacking.

It's worth checking out for me, although it doesn't let me set a breakpoint in PyCharm - thanks!



Roy

May 8, 2017, 8:21:49 AM
to Discuss
I used the guidelines here and here for debugging, and it's starting to look OK. The "only" problem is that this filter never actually stops the code; it just halts with an error. I also hear from some colleagues who work in "low-level" mode that they gave up too, as these filters never work.

Has anyone here managed to write their own code that stops in debug mode on a NaN/Inf? Any ideas?
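(A minimal sketch of one possible approach, not from this thread: a custom tf.train.SessionRunHook that fetches the loss on every step and pauses when it becomes non-finite. The after_run body is plain Python executed during training, so a breakpoint set there is actually hit; the loss tensor name is a hypothetical placeholder and depends on your graph.)

import numpy as np
import tensorflow as tf

class PdbOnNaNHook(tf.train.SessionRunHook):
    def __init__(self, loss_tensor_name="loss:0"):  # hypothetical name; depends on your graph
        self._loss_tensor_name = loss_tensor_name

    def begin(self):
        # Resolve the loss tensor after model_fn has built the graph.
        self._loss_tensor = tf.get_default_graph().get_tensor_by_name(self._loss_tensor_name)

    def before_run(self, run_context):
        # Ask the monitored session to also fetch the loss on each step.
        return tf.train.SessionRunArgs(self._loss_tensor)

    def after_run(self, run_context, run_values):
        loss = run_values.results
        if not np.isfinite(loss):
            # Plain Python executed every training step, so a PyCharm
            # breakpoint placed here will fire; pdb works from a terminal.
            import pdb
            pdb.set_trace()

# e.g. Estimator.fit(input_fn=..., monitors=[PdbOnNaNHook("my_loss:0")])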

Martin Wicke

May 8, 2017, 12:18:10 PM
to Roy, Shanqing Cai, Discuss
+Shanqing -- this question about tfdbg from dis...@tensorflow.org looks like one you would know the answer to.


Shanqing Cai

May 8, 2017, 1:00:11 PM
to Martin Wicke, Roy, Discuss
Roy - It is true that tfdbg's filters don't halt the graph run in the middle. But even without that capability, you can still figure out the reason behind things like NaNs and Infs using the tfdbg output. For example, when you do

tfdbg> run -f has_inf_or_nan

the program will drop back to the tfdbg command-line interface (CLI) after the first Session.run() in which any tensor contains Infs or NaNs. The list of tensors is sorted chronologically, so the one at the top is the "culprit", i.e., the one that first generated Infs or NaNs. You can click it to view its value and then use the "list_inputs" and "node_info" commands to reason about why the Infs and NaNs appeared.
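(A rough illustration of that workflow in the tfdbg CLI; the tensor/node names below are hypothetical placeholders, not from Roy's model.)

tfdbg> run -f has_inf_or_nan      # run until some tensor contains an inf or nan
tfdbg> pt cross_entropy/Log:0     # print the topmost (first offending) tensor
tfdbg> ni -t cross_entropy/Log    # node_info with the Python creation traceback
tfdbg> li cross_entropy/Log       # list_inputs: which tensors feed this node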

Let me know if you have more questions. 
--
---
Shanqing Cai
Software Engineer, Tools and Infrastructure
Applied Machine Intelligence
Google, Inc.

Roy

May 8, 2017, 4:15:15 PM
to Discuss, ca...@google.com
Thanks +Shanqing

In fact, when the net hits a NaN, the program is thrown back to the usual terminal mode and doesn't let me investigate the tensors with NaNs, as you describe and as described in the tutorial. The error message I see is pasted below.
I wonder if there is anything else that should be done besides the lines mentioned in the tutorial (and pasted in a separate post I made today), or maybe I should change some other settings or turn something off? Could it be because I use tf.contrib.learn.Estimator and not tf.estimator.Estimator (due to other problems)?

These are the lines I added:
from tensorflow.python import debug as tf_debug
...
# Wrap training in the tfdbg CLI and register the inf/nan filter.
debug_hook = tf_debug.LocalCLIDebugHook(ui_type="curses")
debug_hook.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
hooks = [debug_hook]
# tf.contrib.learn estimators take hooks through the `monitors` argument.
Estimator.fit(input_fn=..., monitors=hooks)


This is the message when the code halts and goes back to the terminal window (as you can see, the net trains for some time and saves checkpoints):
INFO:tensorflow:Saving checkpoints for 208 into /home/royfr/work/outputs/out_tf_roi_Adam/model.ckpt.
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "work/PycharmProjects/initial_projects_Roi/dev_roi/road3TF/road3TF_like_Dor.py", line 652, in <module>
    rtf.run_nn()
  File "work/PycharmProjects/initial_projects_Roi/dev_roi/road3TF/road3TF_like_Dor.py", line 240, in run_nn
    self.Estimator.fit(input_fn=lambda: self.input_fn(modekeys.TRAIN), monitors=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 281, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 430, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 978, in _train_model
    _, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 484, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 820, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 776, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 938, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 481, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

Roy

May 8, 2017, 4:21:50 PM
to Discuss
The problem with using tf.estimator.Estimator is that I can't define the model_fn as a method within a class and then pass it as self.model_fn, whereas with tf.contrib.learn.Estimator I can; with tf.estimator.Estimator it throws this error:
ValueError: model_fn (<bound method Road3TF.model_fn of <__main__.Road3TF instance at 0x7fffd19fbd40>>) has following not expected args: ['self']

Martin Wicke

May 8, 2017, 4:28:54 PM
to Roy, Discuss
Our checking code is a bit overzealous there. Can you file an issue?

You can work around that error by passing 

lambda features, labels, mode, ...: self.model_fn(features, labels, mode, ...)
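(A minimal sketch of that workaround, with placeholder class and method names rather than Roy's actual code; it assumes the model_fn only needs the standard (features, labels, mode) arguments.)

import tensorflow as tf

class MyModel(object):
    def model_fn(self, features, labels, mode):
        # Build the network here and return a tf.estimator.EstimatorSpec.
        raise NotImplementedError

    def build_estimator(self, model_dir):
        # The lambda hides `self`, so the Estimator's argument check only
        # sees (features, labels, mode) and stops complaining.
        return tf.estimator.Estimator(
            model_fn=lambda features, labels, mode: self.model_fn(features, labels, mode),
            model_dir=model_dir)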



Martin Wicke

May 9, 2017, 12:05:37 AM
to Roy, Discuss
No, it has nothing to do with that; Shanqing already addressed that behavior.

On May 8, 2017 8:25 PM, "Roy" <roy.fren...@gmail.com> wrote:
Thanks Martin, I'll try that.
However, is this a cause for tfdbg not stopping on a NaN?



Roy

May 9, 2017, 3:48:08 AM
to Discuss
Martin - apparently someone opened this issue just recently: #9654

