Model inference speed using saved_model signatures TF1.14 vs TF2.0beta1

Skyler Roh

Aug 22, 2019, 1:58:52 PM8/22/19
to TensorFlow Community Testing
Hi everyone, 

I have trained equivalent custom estimators on TF1.14 and TF2.0 (2000 inputs --> 256 dense --> 128 dense --> logit).
I then loaded each saved_model in TF1.14 and TF2.0 respectively to test prediction times, since these models need to run within very strict timing bounds.

As expected, in TF1.14 both models show increased total inference time for larger batches, but the scaling is sublinear with respect to batch size:
warming up...
Batch size	batch time	mean
  1		1.827ms		1.827ms
  5		2.183ms		0.437ms
 10		2.285ms		0.228ms
 20		2.601ms		0.130ms
 30		2.874ms		0.096ms
 40		3.247ms		0.081ms
 50		3.674ms		0.073ms
 70		4.268ms		0.061ms
100		5.272ms		0.053ms
120		6.181ms		0.052ms
150		6.794ms		0.045ms
170		7.837ms		0.046ms
200		8.320ms		0.042ms


However, the odd thing in TF2.0 is that the inference times seem to be roughly the same regardless of batch size, and slower overall for every batch size:
warming up...
Batch size	batch time	mean
  1		21.548ms		21.548ms
  5		23.487ms		4.697ms
 10		19.810ms		1.981ms
 20		19.354ms		0.968ms
 30		22.020ms		0.734ms
 40		22.368ms		0.559ms
 50		24.338ms		0.487ms
 70		22.637ms		0.323ms
100		19.137ms		0.191ms
120		21.951ms		0.183ms
150		22.061ms		0.147ms
170		23.557ms		0.139ms
200		18.811ms		0.094ms

The TF2.0 timing function looks something like this
"""
@tf.function
def time_batch(signature, records):
    print("warming up...")
    for i in range(1000):
        signature(tf.constant(np.array(records[0:10])))
    for batchsize in (1,5,10,20,30,40,50,70,100, 120, 150, 170, 200):
        data = records
        times = []
        t_total = 0.0
        for i in range(1000):
            batch = tf.constant(np.array(data[0:batchsize]))
            data = data[batchsize:]
            t0 = time.time()
            signature(batch)
            t1 = time.time()
            times.append((t1 - t0) * 1000)
            t_total += (t1-t0)
        print('%3d\t\t%3.3fms\t\t%3.3fms' % (batchsize, t_total, t_total/batchsize))
    return signature(batch)
"""

Any insight into this behavior would be greatly appreciated. Thanks!
Skyler

Martin Wicke

Aug 22, 2019, 2:21:46 PM8/22/19
to Skyler Roh, TensorFlow Community Testing
Is this measured on CPU? How do you execute the SavedModel? Using Servo? Loading into Python? Could you share the code you have used?


Skyler Roh

Aug 22, 2019, 2:31:02 PM8/22/19
to TensorFlow Community Testing, skyl...@gmail.com

Hi Martin,

These tests were performed in Python on Kubernetes containers with TF1.14 and TF2.0beta1 respectively. They are CPU-only pods.

For TF1.14, a session is established and the model is loaded into the graph like so:
def time_saved_model(saved_model_dir):
    tf.reset_default_graph()
    config = tf.ConfigProto(intra_op_parallelism_threads = 1, inter_op_parallelism_threads = 1, device_count={'CPU': 1})

    with tf.Session(graph=tf.Graph(), config=config) as session:
        tf.saved_model.loader.load(session, [tf.saved_model.tag_constants.SERVING], saved_model_dir)
        examples = session.graph.get_tensor_by_name('input_tensors:0')
        classes = session.graph.get_tensor_by_name('predictions/str_classes:0')
        scores = session.graph.get_tensor_by_name('predictions/probabilities:0')
        print("warming up...")
        for i in range(1000):
            session.run([classes, scores],
                              feed_dict = {examples: np.array(records[0:10])},
            #                  options=run_options, 
            #                  run_metadata=run_metadata
                             )
        print("Batch size\tbatch time\tmean")
        for batchsize in (1,5,10,20,30,40,50,70,100, 120, 150, 170, 200):
            data = records
            times = []
            t_total = 0.0
            for i in range(1000):
                batch = np.array(data[0:batchsize])
                data = data[batchsize:]
                t0 = time.time()
                session.run([classes, scores], 
                                  feed_dict = {examples: batch}
                                 )
                t1 = time.time()
                times.append((t1 - t0) * 1000)
                t_total += (t1-t0)
            print('%3d\t\t%3.3fms\t\t%3.3fms' % (batchsize, t_total, t_total/batchsize))


For TF2.0, the model was loaded with tf.saved_model.load, and predictions were made by calling the "serving_default" signature from the @tf.function described in my original post:
model = tf.saved_model.load(...)
signature = model.signatures["serving_default"]
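
End to end, the load-and-call path looks roughly like this (the export path, input shape, and dtype below are placeholders, not the actual values from my model; the signature is called with a single positional tensor, the same way as in the timing code above):

import numpy as np
import tensorflow as tf

# Placeholder export directory -- substitute your own saved_model path.
model = tf.saved_model.load("/models/my_estimator/1566500000")
signature = model.signatures["serving_default"]

# Placeholder batch: 10 examples x 2000 features, matching the model above.
batch = tf.constant(np.random.rand(10, 2000).astype(np.float32))

# Signature functions return a dict keyed by the signature's output names.
outputs = signature(batch)
print({name: tensor.shape for name, tensor in outputs.items()})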

Martin Wicke

Aug 22, 2019, 7:21:55 PM8/22/19
to Skyler Roh, TensorFlow Community Testing
It would be good to see your entire code to make sure what I'm saying isn't silly, but from the code you pasted, it seems like your timing is performed inside the @tf.function, which would execute only once, during tracing time. So in your 2.0 code, you're measuring tracing time, not compute time.


Skyler Roh

Aug 23, 2019, 6:31:55 PM8/23/19
to TensorFlow Community Testing, skyl...@gmail.com
Hi Martin, 

Thank you for your comment. I made the change you suggested, as seen below, creating a smaller @tf.function just for running the model signature, and that produced timings much closer to what I expected.

@tf.function
def batch_time(batch):
    return signature(batch)

def time_batch_sizes(records):
    print("warming up...")
    for i in range(1000):
        signature(tf.constant(np.array(records[0:10])))
    print("Batch size\tbatch time\tmean")
    for batchsize in (1,5,10,20,30,40,50,70,100, 120, 150, 170, 200):
        data = records
        times = []
        t_total = 0.0
        for i in range(1000):
            batch = tf.constant(np.array(data[0:batchsize]))
            data = data[batchsize:]
            t0 = time.time()
            batch_time(batch)
            t1 = time.time()
            times.append((t1 - t0) * 1000)
            t_total += (t1-t0)
        print('%3d\t\t%3.3fms\t\t%3.3fms' % (batchsize, t_total, t_total/batchsize))

warming up...
Batch size	batch time	mean
  1		2.093ms		2.093ms
  5		2.015ms		0.403ms
 10		2.162ms		0.216ms
 20		2.207ms		0.110ms
 30		2.304ms		0.077ms
 40		2.354ms		0.059ms
 50		2.458ms		0.049ms
 70		2.706ms		0.039ms
100		2.825ms		0.028ms
120		2.994ms		0.025ms
150		3.078ms		0.021ms
170		3.132ms		0.018ms
200		3.412ms		0.017ms


Could you explain a little bit further what you mean by tracing time?

Thanks!

Martin Wicke

Aug 23, 2019, 7:24:00 PM8/23/19
to Skyler Roh, TensorFlow Community Testing
Whatever is inside the function is traced on first execution and converted to a TensorFlow program. Any python timing code inside the function would be executed then (and only then). Hence you measured the time it takes to trace the code (convert it to a TensorFlow expression), not the time it takes to execute that expression. 
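
For illustration, here is a minimal standalone example of the effect (this toy function is just a stand-in, not your model):

import time

import tensorflow as tf

@tf.function
def double(x):
    # Python-level side effects run only while the function is being traced,
    # i.e. on the first call (or when a new input signature forces a retrace).
    print("tracing...")
    t0 = time.time()  # evaluated once, at trace time
    y = x * 2
    print("traced in %.3fms" % ((time.time() - t0) * 1000))
    return y

x = tf.constant([1.0, 2.0, 3.0])
double(x)  # prints "tracing..." and the trace-time measurement
double(x)  # prints nothing: the already-traced graph is executed directly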


Skyler Roh

Aug 26, 2019, 2:42:34 PM8/26/19
to TensorFlow Community Testing, skyl...@gmail.com
Hi Martin,

Thanks for the insight; that makes sense. I'll make a note to only wrap TensorFlow-native code in @tf.function.

This is probably a long shot, but do you have any idea when a TF2.0-based serving image will be released? We are working on a problem with low-latency requirements for real-time inference and would be eager to see whether these timing improvements extend to real serving calls.

Best,
Skyler

Martin Wicke

Aug 26, 2019, 8:05:14 PM8/26/19
to Skyler Roh, TensorFlow Community Testing
A serving image based on RC0 should be published in the coming weeks. However, you should not need one to be able to serve these models. tensorflow/serving works off of the SavedModel, and a 1.x tensorflow/serving container should be able to serve these models just fine, even if they're created with 2.0.
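
As a rough client-side sketch, calling such a container through tensorflow/serving's REST API would look something like this (host, port, model name, and input shape are placeholders; 8501 is the default REST port of the serving images):

import json
import urllib.request

import numpy as np

# Placeholder endpoint: the model name must match the --model_name the
# container was started with.
URL = "http://localhost:8501/v1/models/my_estimator:predict"

# One row per example; 2000 features matches the model described above.
instances = np.random.rand(10, 2000).astype(np.float32).tolist()
payload = json.dumps({"instances": instances}).encode("utf-8")

request = urllib.request.Request(
    URL, data=payload, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(request) as response:
    predictions = json.loads(response.read())["predictions"]
print(len(predictions))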


Skyler Roh

Aug 27, 2019, 2:18:32 PM8/27/19
to TensorFlow Community Testing, skyl...@gmail.com
Correct me if I am wrong, but the speed benefits of TF2 would not be observed when running on a 1.x serving container, since it executes the graph with older versions of the ops.
I was able to test a 2.0-trained model on 1.14 serving earlier and, as expected given the shared saved_model format, it performed very similarly on inference time.

Martin Wicke

Aug 27, 2019, 2:23:11 PM8/27/19
to Skyler Roh, TensorFlow Community Testing
As of today, most of the core execution infrastructure is shared between 1.x and 2.0, so I would not expect a speed advantage in serving.
