Re-trying eager op execution on failure

Ryan Nett

Jan 31, 2021, 12:32:56 AM

to SIG JVM

I'm experimenting with automatic cleanup for eager tensors, like JavaCPP does for tensors. I have it working on CPU using JavaCPP's detection (it allocates over the memory limit during execution, then pulls back). On GPU, however, I have to catch OOM exceptions on execute, run cleanup, and then retry. The catching works fine, but when I execute the same op again I get "expected 2 inputs, got 0", even though TFE_OpGetInputLength(opHandle, "value", TF_Status.newStatus()) returns 1 for both arguments.

The relevant code is here: https://github.com/rnett/java/blob/rn_eager_memory/tensorflow-core/tensorflow-core-api/src/main/java/org/tensorflow/EagerOperationBuilder.java#L297

To see this, run something that would force an OOM on a GPU.

Any thoughts on how to get around this? The easiest option I can see is to store set inputs/attributes in the builder and re-build the native op each call, which is less than ideal.
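For illustration, the "store inputs in the builder and re-build the native op" option could look roughly like the sketch below. Every name in it (RetryingOpBuilder, NativeOp, OutOfMemory, the factory and cleanup callbacks) is hypothetical and not part of TensorFlow Java or JavaCPP. It also assumes that a native op handle cannot be reused after a failed execute, which is what the error above suggests but which has not been confirmed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: none of these types come from TensorFlow Java or JavaCPP.
final class RetryingOpBuilder {

    /** Stand-in for a native eager op handle, assumed unusable after a failed execute. */
    interface NativeOp {
        String execute();
    }

    /** Stand-in for the OOM error surfaced by a GPU execute. */
    static final class OutOfMemory extends RuntimeException {
        OutOfMemory() {
            super("simulated GPU OOM");
        }
    }

    // Inputs are recorded on the Java side so the native op can be recreated.
    private final List<String> inputs = new ArrayList<>();
    private final Function<List<String>, NativeOp> nativeFactory;
    private final Runnable cleanup;

    RetryingOpBuilder(Function<List<String>, NativeOp> nativeFactory, Runnable cleanup) {
        this.nativeFactory = nativeFactory;
        this.cleanup = cleanup;
    }

    RetryingOpBuilder addInput(String input) {
        inputs.add(input);
        return this;
    }

    /** Builds a fresh native op and runs it; on OOM, cleans up and retries exactly once. */
    String execute() {
        try {
            return rebuild().execute();
        } catch (OutOfMemory firstFailure) {
            cleanup.run();              // e.g. release unreachable eager tensors
            return rebuild().execute(); // a second OOM propagates to the caller
        }
    }

    // Creates a new native op from a copy of the recorded inputs on every call.
    private NativeOp rebuild() {
        return nativeFactory.apply(new ArrayList<>(inputs));
    }
}
```

The cost is the one mentioned above: every input (and, in a fuller version, every attribute) is held twice, once in the builder and once in the native op, and the native op is created on each attempt rather than once.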

Samuel Audet

Jan 31, 2021, 10:18:52 PM

to Ryan Nett, SIG JVM

That's precisely the kind of thing I meant by "abysmal performance".
Java's GC was never meant to manage any resource other than heap memory.
I like to think of it as simply an extension of "the stack", where other
(functional) programming languages such as Haskell actually use GC on
the actual stack instead of coming up with a special purpose heap (but
doing it both cleanly and efficiently like that is harder than it looks,
and that still doesn't help us at all with resource management :)

Samuel
