The problem of unstable running time


Daniel Wang

Jul 21, 2018, 3:11:53 AM
to gemmlowp
Hi guys,

When I run with sape(2048,640,2), using code like this:
gemmlowp::GemmWithOutputPipeline<::std::uint8_t, ::std::uint8_t, gemmlowp::L8R8WithLhsNonzeroBitDepthParams>(
&gemm_context, uint8_lhs, uint8_rhs, &actual_uint8_result_map, lhs_offset, rhs_offset, output_pipeline);

Why does the computation time fluctuate so much?

Experimental results:
On Qualcomm Snapdragon 835. 
When using gemm_context.set_max_num_threads(4), the time ranges from 0.6747 ms to 10.9 ms.
When using gemm_context.set_max_num_threads(1), the time ranges from 2.589 ms to 13.77 ms.



Is there any way to make the running time more stable?
thanks~~

Daniel Wang

Jul 21, 2018, 3:17:30 AM
to gemmlowp
Sorry, I wrote that wrongly: "with sape(2048,640,2)" should be "with shape(2048,640,2)", meaning matrix(2048×640) * matrix(640×2).

On Saturday, July 21, 2018 at 3:11:53 PM UTC+8, Daniel Wang wrote:

Benoit Jacob

Aug 13, 2018, 3:55:12 PM
to dane...@gmail.com, gemmlowp
Hi,

First, given the specific shape of your GEMM, with a very narrow right-hand side, gemmlowp is far from optimal. If you care about getting optimal performance for such a shape, you will need custom code. Since there are only two right-hand-side columns, this is roughly a GEMV: you will want to perform it in a single pass (no packing pass) over the large LHS matrix, and if that large LHS matrix is a constant, you will typically want to consider pre-shuffling it for optimal traversal. There is an example of that in TensorFlow Lite code; however, there I only wrote kernels for the 1-column (GEMV) and 4-column cases, not the 2-column case you have, though you could trivially derive it from the 4-column case.

Going back to your more general benchmarking question.
Since the latency you're measuring here is so small (under 1 ms at the low end), some fluctuation is inevitable.
If you're OK with measuring only warm-cache performance, a typical approach to stabilizing the result is to run the GEMM many times in a loop and average. You may also want to discard the first iteration, since special one-off costs occur only there.
If, on the other hand, you want to measure cold-cache performance, that is very hard to do in any case, and probably only meaningful when integrated into your application, not as a micro-benchmark.
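The loop-and-average approach can be sketched as a small helper. The name MeanWarmMs and the callable parameter are my own; the callable stands in for the gemmlowp::GemmWithOutputPipeline call above.

```cpp
#include <chrono>

// Warm-cache micro-benchmark helper: runs `fn` `iterations` times,
// discards the first run (where one-off costs such as allocation and
// cache warm-up land), and returns the mean latency in milliseconds
// over the remaining runs.
template <typename Fn>
double MeanWarmMs(Fn fn, int iterations) {
  double sum_ms = 0;
  for (int i = 0; i < iterations; ++i) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto stop = std::chrono::steady_clock::now();
    double ms =
        std::chrono::duration<double, std::milli>(stop - start).count();
    if (i > 0) sum_ms += ms;  // skip iteration 0: one-off costs
  }
  return sum_ms / (iterations - 1);
}
```

Reporting the minimum over many runs, instead of the mean, is another common way to reduce sensitivity to scheduler noise on a busy device.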
