memory leak in sess.run


Vlad Firoiu

Dec 27, 2016, 8:11:51 PM
to Discuss
I've determined that tensorflow is leaking memory on each iteration of training. If I remove the call to sess.run, then memory usage remains constant; otherwise, `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss` shows that every so often a few MB are leaked. It appears that the amount leaked depends on the size of the graph (larger graph/batch size -> more leaked). The amount leaked does not appear to plateau - if I run for long enough (on the order of a day) I can even exhaust the system memory of 64GB.

I am using python 3.5, and I've tried multiple tensorflow versions (0.10, 0.11, 0.12) on different OSes (Ubuntu, CentOS) with both cpu and gpu, and the leak always seems to occur. Are there any ways of debugging this?
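For readers who want to reproduce the measurement described above, here is a minimal sketch of sampling `ru_maxrss` during a training loop. The helper names `peak_rss_mb` and `track_peak_rss` are hypothetical, and the `leaky_step` function is only a stand-in for a real `sess.run` call. Note that `ru_maxrss` is a high-water mark, so it never decreases: a flat value means no new peak was reached, and steady growth is what suggests a leak.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_peak_rss(step, num_steps, report_every=100):
    """Call step() num_steps times, sampling peak RSS every report_every steps."""
    samples = []
    for i in range(num_steps):
        step()
        if (i + 1) % report_every == 0:
            samples.append((i + 1, peak_rss_mb()))
    return samples


if __name__ == "__main__":
    retained = []  # deliberately keeps references so memory grows

    def leaky_step():
        # Hypothetical stand-in for sess.run(...); allocates and retains 1 MB.
        retained.append(b"x" * (1024 * 1024))

    for iteration, mb in track_peak_rss(leaky_step, 50, report_every=10):
        print(f"iter {iteration}: peak RSS {mb:.1f} MB")
```

Replacing `leaky_step` with the real training step and comparing the samples with and without the `sess.run` call is one way to repeat the experiment from the post.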

Derek Murray

Dec 27, 2016, 8:44:41 PM
to Vlad Firoiu, Discuss
There are some suggestions for how to debug a memory leak here:


Using tcmalloc can be particularly useful, because many common TensorFlow allocation patterns can lead to heap fragmentation with the standard malloc implementation.

If none of these suggestions work, please open a GitHub issue with a minimal program that reproduces the issue, and someone on the team will take a look.

Derek.
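One common way to try tcmalloc without rebuilding anything is to preload it into the process with the dynamic linker's `LD_PRELOAD` variable. The sketch below is only an illustration of that idea: the library path is an assumption that varies by distribution, and the helper names `env_with_preload` and `run_with_tcmalloc` are hypothetical.

```python
import os
import subprocess
import sys

# Assumed location of the tcmalloc shared library; it varies by distribution.
TCMALLOC_PATH = "/usr/lib/libtcmalloc.so.4"


def env_with_preload(library_path, base_env=None):
    """Return a copy of the environment with library_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{library_path}:{existing}" if existing else library_path
    return env


def run_with_tcmalloc(script, library_path=TCMALLOC_PATH):
    """Run a Python script in a child process with tcmalloc preloaded."""
    if not os.path.exists(library_path):
        raise FileNotFoundError(f"tcmalloc not found at {library_path}")
    result = subprocess.run(
        [sys.executable, script],
        env=env_with_preload(library_path),
        check=False,
    )
    return result.returncode
```

Calling `run_with_tcmalloc("train.py")`, where `train.py` is a placeholder for the training script, would then run it under the alternative allocator so its memory growth can be compared with a run under the standard malloc.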

--
You received this message because you are subscribed to the Google Groups "Discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to discuss+unsubscribe@tensorflow.org.
To post to this group, send email to dis...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/discuss/1da90762-9d2a-47f2-acdd-90d9744eb673%40tensorflow.org.

abolfazl....@gmail.com

Jan 2, 2017, 7:31:39 AM
to Discuss, vla...@gmail.com
Hello. None of these suggestions worked. Today we saw a huge memory leak on our server, which has 32 GB of RAM. We didn't see this in TensorFlow 0.9. My guess is that the leak is caused by TensorFlow queues.

Vlad Firoiu

May 18, 2017, 8:10:24 AM
to abolfazl....@gmail.com, Discuss
I'm getting this issue again after upgrading to tf 1.1, even with tcmalloc.

Martin Wicke

May 18, 2017, 7:58:56 PM
to Vlad Firoiu, abolfazl....@gmail.com, Discuss
Can you open an issue with a minimal piece of code to reproduce?

Vlad Firoiu

May 19, 2017, 10:15:31 AM
to Martin Wicke, Abolfazl Mahdizade, Discuss
Never mind, it seems the leak is in a different part of my code.