Hi,
First, given the specific shape of your GEMM here, with a very narrow right-hand side, gemmlowp is far from optimal. If you care about getting optimal performance for such a shape, you will need custom code. Since there are only two right-hand-side columns, this is roughly a GEMV: you will want to perform it in a single pass (no packing pass) over the large LHS matrix, and, if that large LHS matrix is a constant, you will typically want to consider pre-shuffling it for optimal traversal. Here is an example of that in TensorFlow Lite code; however, there I had only coded kernels for the 1-column (GEMV) and 4-column cases, not the 2-column case you have, though you could trivially derive that from the 4-column case:
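To make the idea concrete, here is a minimal scalar sketch of such a single-pass, 2-column kernel. This is not gemmlowp's or TFLite's actual code; the function name `GemvLike2Cols` and the layouts are assumptions for illustration, and a real kernel would use SIMD intrinsics and a pre-shuffled LHS layout instead of this naive loop:

```cpp
#include <cstdint>

// Hypothetical single-pass kernel computing a (rows x depth) * (depth x 2)
// product of uint8 matrices into int32 accumulators, with no packing pass.
// LHS is row-major; RHS is column-major (column 1 starts at rhs + depth).
void GemvLike2Cols(const std::uint8_t* lhs, int rows, int depth,
                   const std::uint8_t* rhs,   // depth x 2, column-major
                   std::int32_t* result) {    // rows x 2, row-major
  for (int r = 0; r < rows; ++r) {
    std::int32_t acc0 = 0, acc1 = 0;
    const std::uint8_t* lhs_row = lhs + r * depth;
    // One traversal of the LHS row feeds both output columns.
    for (int d = 0; d < depth; ++d) {
      const std::int32_t l = lhs_row[d];
      acc0 += l * rhs[d];           // RHS column 0
      acc1 += l * rhs[depth + d];   // RHS column 1
    }
    result[2 * r + 0] = acc0;
    result[2 * r + 1] = acc1;
  }
}
```

The point of the single pass is that each LHS byte is loaded exactly once and used for both output columns, which is why packing the LHS (as gemmlowp does for general GEMM) is wasted work at this shape.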
Going back to your more general benchmarking question.
Since the latency you're measuring here is so small (< 1 ms), there will inevitably be some fluctuation.
If you're OK with measuring only warm-cache performance, a typical approach to stabilizing the result is to run the workload many times in a loop and average. You may also want to ignore the first iteration, as special one-off costs occur only there.
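For example, a minimal timing harness along those lines could look like this, where `RunGemm` is a hypothetical stand-in for your actual gemmlowp call:

```cpp
#include <chrono>
#include <cstdio>

void RunGemm();  // stand-in for your actual gemmlowp invocation

int main() {
  const int kIterations = 1000;

  // Discard the first run: one-off costs (allocation, cache warm-up)
  // only occur here.
  RunGemm();

  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIterations; ++i) {
    RunGemm();
  }
  const auto end = std::chrono::steady_clock::now();

  const double seconds = std::chrono::duration<double>(end - start).count();
  std::printf("average latency: %.3f us\n", 1e6 * seconds / kIterations);
}
```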
If on the other hand you want to measure cold-cache performance, that is very hard to do in any case, and probably only meaningful when measured as integrated into your application, not as a micro-benchmark.