Doing inference on large input sizes w/ a SavedModel seg faults (TF 2.0)

Peter Harrington

Aug 22, 2019, 8:14:19 PM
to TensorFlow Community Testing
Hi all,

I've been running into some issues using TF 2.0 (beta) with the SavedModel interface for inference, and was wondering if people had any suggestions. I have a pre-trained 3D U-net generator model, defined as a tf.keras.Model, which I exported to the SavedModel format for inference. I need to run inference on some very large inputs, so I am using SavedModel to strip away the unnecessary training bells and whistles and just serve the model for prediction. The input size I am trying to predict on is admittedly huge (1024x128x1024), but my network is fully convolutional and I am not trying to run it on a GPU -- I am running on a CPU node within an HPC system, and the node has over 380 GB of RAM available. That should be more than enough memory: as an order-of-magnitude estimate, I calculated the total memory required for all the feature maps and convolution kernels in my network (at float32 precision) and got ~47 GB. However, I have been unable to run inference at this size, because of several alarming things:

1) If I load the SavedModel and run inference on larger and larger input sizes, the peak RAM usage grows unreasonably quickly. For example, if I run a prediction for a single input of size (128x128x128), the peak RAM usage is over 5 GB. Increasing the input size to, say, (256x128x1024), the peak RAM usage is over 79 GB.

2) With a (512x128x1024) input, the prediction script dies with a segmentation fault. Monitoring RAM usage during the run, the peak is ~156 GB, which is a ton, but less than half the total RAM available on the compute node.

3) To inspect the seg fault more closely, I ran gdb and saw something surprising -- there was a 'Conv3DCustomBackpropInputOp' somewhere in the stack trace. Having exported my model to SavedModel with the 'serving_default' tag, I thought running inference from the SavedModel would eliminate any training utilities/other unnecessary 'fluff'. The presence of backprop ops suggests that what I am running is not an inference-only version of my model? This would also be consistent with the huge memory footprint. Below is the stack trace at the seg fault:

#0  0x00002aaac1b7d260 in void (anonymous namespace)::Col2im<float>(float const*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, float*) ()

   from /global/common/cori_cle7/software/tensorflow/gpu-tensorflow/2.0.0-beta-py36/lib/python3.6/site-packages/tensorflow/python/

#1  0x00002aaac1ce1006 in tensorflow::Conv3DCustomBackpropInputOp<Eigen::ThreadPoolDevice, float>::Compute(tensorflow::OpKernelContext*) ()

   from /global/common/cori_cle7/software/tensorflow/gpu-tensorflow/2.0.0-beta-py36/lib/python3.6/site-packages/tensorflow/python/

#2  0x00002aaaf4c97f6b in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()

   from /global/common/cori_cle7/software/tensorflow/gpu-tensorflow/2.0.0-beta-py36/lib/python3.6/site-packages/tensorflow/python/../

#3  0x00002aaaf4c89ad0 in std::_Function_handler<void (), std::_Bind<std::_Mem_fn<void (tensorflow::(anonymous namespace)::ExecutorState::*)(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long)> (tensorflow::(anonymous namespace)::ExecutorState*, tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long)> >::_M_invoke(std::_Any_data const&) ()

   from /global/common/cori_cle7/software/tensorflow/gpu-tensorflow/2.0.0-beta-py36/lib/python3.6/site-packages/tensorflow/python/../

#4  0x00002aaaf4d39e84 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()

   from /global/common/cori_cle7/software/tensorflow/gpu-tensorflow/2.0.0-beta-py36/lib/python3.6/site-packages/tensorflow/python/../

#5  0x00002aaaf4d38cf4 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()

   from /global/common/cori_cle7/software/tensorflow/gpu-tensorflow/2.0.0-beta-py36/lib/python3.6/site-packages/tensorflow/python/../

#6  0x00002aaaf5ac4421 in std::execute_native_thread_routine_compat (__p=<optimized out>)

    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/

#7  0x00002aaaaacda569 in start_thread () from /lib64/

#8  0x00002aaaaafe9a2f in clone () from /lib64/

I have posted a tar of the SavedModel and my scripts at, if anyone wants to see for themselves. The file contains the (simple) prediction script, and the script will run the script and monitor the peak memory usage while the prediction script is running.
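
The three peak-RAM figures reported above can be put on a per-voxel basis with a few lines of arithmetic. This is only a rough sketch: the figures are the approximate peaks quoted in the post (two of them are stated as "over" the value), 1 GB is assumed to mean 10^9 bytes, and the names `reported_peaks_gb` and `target_shape` are made up for the illustration.

```python
from math import prod

# Peak RAM (GB) as quoted in the post, keyed by input shape.
reported_peaks_gb = {
    (128, 128, 128): 5.0,
    (256, 128, 1024): 79.0,
    (512, 128, 1024): 156.0,
}
target_shape = (1024, 128, 1024)  # the size the poster ultimately wants to run

bytes_per_voxel = {}
for shape, peak_gb in reported_peaks_gb.items():
    voxels = prod(shape)
    bytes_per_voxel[shape] = peak_gb * 1e9 / voxels
    print(f"{shape}: {voxels:>11,d} voxels, ~{bytes_per_voxel[shape]:.0f} bytes/voxel")

# Naive linear extrapolation from the largest measured size (an assumption,
# not a measurement): doubling the voxel count doubles the peak.
largest = (512, 128, 1024)
projected_gb = bytes_per_voxel[largest] * prod(target_shape) / 1e9
print(f"projected peak for {target_shape}: ~{projected_gb:.0f} GB")
```

On these numbers the peak works out to roughly 2.3 to 2.4 kB per input voxel at all three sizes, so a linear extrapolation puts the full (1024x128x1024) input at about 312 GB. That says nothing about why the process segfaults at ~156 GB; it only shows how the quoted peaks relate to input size.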

Martin Wicke

Sep 18, 2019, 11:51:44 AM
to Peter Harrington, TensorFlow Community Testing
Saving the model will by itself not eliminate the backward pass. If you save it before you compute gradients, the backwards pass shouldn't be included.

Aakash Kumar Nain

Sep 18, 2019, 11:59:47 AM
to Martin Wicke, Peter Harrington, TensorFlow Community Testing
@Martin is this documented? If not, then it should be IMHO

Martin Wicke

Sep 18, 2019, 4:28:20 PM
to Aakash Kumar Nain, Kathy Wu, Peter Harrington, TensorFlow Community Testing
+Kathy Wu this would be a good thing to add to the SavedModel guide.

Kathy Wu

Sep 18, 2019, 9:02:58 PM
to Martin Wicke, Aakash Kumar Nain, Peter Harrington, TensorFlow Community Testing
As Martin mentioned, training with the serving_default function adds extra ops to the SavedModel. Saving Keras models also generates a lot of additional ops (in case the user loads the model back for retraining later). We should add an option to save an inference-only SavedModel that does not contain the extra Keras-specific functions and removes any forward/backward functions from the graph.

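As a note, tf.nn.conv3d_transpose appears to always generate the Conv3DBackpropInputV2 op in the forward pass, so a backprop-input kernel in the stack trace does not by itself mean gradients are being computed. I'm not sure this is the reason why so much memory is being used.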
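
One way to check the op-type observation is to trace a small forward-only function and list the op types in its graph. The following is a minimal sketch assuming a TensorFlow 2.x install; the function name `upsample`, the tiny shapes, and the all-ones filter are made up for the illustration and have nothing to do with the model in this thread.

```python
import tensorflow as tf

# Forward-only function: no GradientTape, no optimizer, nothing is trained.
@tf.function(input_signature=[tf.TensorSpec([1, 4, 4, 4, 1], tf.float32)])
def upsample(x):
    # Filter layout for conv3d_transpose:
    # [depth, height, width, out_channels, in_channels]
    filters = tf.ones([2, 2, 2, 1, 1])
    return tf.nn.conv3d_transpose(
        x, filters, output_shape=[1, 8, 8, 8, 1], strides=2, padding="SAME"
    )

# Tracing builds the graph without executing any kernels.
graph = upsample.get_concrete_function().graph
op_types = sorted({op.type for op in graph.get_operations()})
print(op_types)
```

If the observation holds, `Conv3DBackpropInputV2` shows up in the printed list even though the function never computes a gradient.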

@Martin Is there a way to profile memory usage in TensorFlow 2.0? 

Martin Wicke

Oct 22, 2019, 12:36:23 PM
to Kathy Wu,, TensorFlow Community Testing

+CK Luk do you know what our best practices for memory profiling are?

CK Luk

Oct 22, 2019, 1:41:18 PM
to Martin Wicke, xprof Developers, Kathy Wu, TensorFlow Community Testing
AFAIK, there is no TensorFlow specific memory profiler.
For CPU, the closest thing you can use today is probably pprof's heapz profiler (
The breakdown is shown at C-function level, not TF-Op level (see section 3 in

The Xprof team is developing a memory profiler for device (TPU, GPU) memory. 
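
In the absence of an op-level profiler, a coarse process-level number is still easy to get from the Python standard library. The sketch below is not a TensorFlow tool and not what pprof does; it only reads the process's peak resident set size before and after a call. It assumes a Unix system (the `resource` module is not available on Windows), and the helper names `peak_rss_bytes` and `run_and_report` are hypothetical.

```python
import resource
import sys

def peak_rss_bytes():
    """Peak resident set size of this process so far, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024

def run_and_report(fn, *args):
    """Call fn(*args) and return (result, peak before, peak after)."""
    before = peak_rss_bytes()
    result = fn(*args)
    after = peak_rss_bytes()
    return result, before, after

# Stand-in workload; in practice fn would be the SavedModel prediction call.
result, before, after = run_and_report(lambda n: len(bytearray(n)), 64 * 1024 * 1024)
print(f"peak RSS before: {before / 1e6:.0f} MB, after: {after / 1e6:.0f} MB")
```

Because the value is a high-water mark for the whole process, it cannot attribute memory to individual ops; it is only useful for comparing runs at different input sizes.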
