inconsistency in timing af::where

Avinash Rangarajan

Jul 6, 2021, 4:14:57 AM
to ArrayFire Users

I have an af::array (2048x2048) for which I want to find the non-zero locations.

af::timer start1 = af::timer::start();
af::array index = af::where(mat); // first call
cout << "elapsed time (ms): " << af::timer::stop(start1) * 1000 << " ms" << endl;

start1 = af::timer::start();
index = af::where(mat); // second call
cout << "elapsed time (ms): " << af::timer::stop(start1) * 1000 << " ms" << endl;

First iteration:
  first call:  1.8792 ms
  second call: 0.7094 ms

Second iteration:
  first call:  16.8074 ms
  second call: 0.4738 ms

Third iteration:
  first call:  17.3236 ms
  second call: 0.4543 ms

Why are the timings so inconsistent?

Pradeep Garigipati

Jul 6, 2021, 6:15:08 AM
to Avinash Rangarajan, ArrayFire Users
Hello Avinash,

Accelerator/GPU warmup cost is included in the first run of any function/operation. This warmup cost covers multiple things, ranging from driver warmup time to kernel runtime compilation time. Please go through this tutorial to understand how to time ArrayFire functions. Once you have corrected your timing logic, if you still notice widely inconsistent runtimes, note that runtimes also depend on any other processes that are using the same accelerator/GPU on your system.
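For reference, the usual pattern (a minimal sketch, not the tutorial's exact code; the randu input is just a stand-in) is to do a warm-up call, evaluate the input, synchronize, and only then time:

#include <arrayfire.h>
#include <iostream>

int main() {
    af::array mat = af::randu(2048, 2048) > 0.9f; // hypothetical input array
    mat.eval();                       // make sure the input is already a buffer, not a pending JIT tree
    af::array warm = af::where(mat);  // warm-up call: driver init, kernel compilation, etc.
    af::sync();                       // wait for all queued work to finish

    af::timer t = af::timer::start();
    af::array idx = af::where(mat);
    af::sync();                       // make sure the kernel has actually finished before stopping the timer
    std::cout << "where: " << af::timer::stop(t) * 1000 << " ms" << std::endl;
    return 0;
}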

Feel free to share your timing code if, despite the above suggestions, you still see varying runtimes for the "where" API.

Hope that helps.
Pradeep.


Avinash Rangarajan

Jul 6, 2021, 8:33:05 AM
to Pradeep Garigipati, ArrayFire Users
I am sharing the main code and the where_function code. 

#include <arrayfire.h>
#include <iostream>

void where_fn(af::array peak_op, af::array output); // defined below

int main(int argc, char* argv[])
{
    auto width = 2048;
    auto height = 2048;
    auto actual_height = 1000;
    auto actual_width = 1000;

    af::array output = af::constant(0, width, height, f32);
    af::array guard  = af::constant(0, width, height, f32);
    output(af::seq(0, actual_height - 1), af::seq(0, actual_width - 1)) = af::randu(actual_height, actual_width, f32) * 30;
    guard(af::seq(0, actual_height - 1), af::seq(0, actual_width - 1))  = af::randu(actual_height, actual_width, f32) * 30;

    auto win_height   = 4;
    auto win_width    = 4;
    auto guard_width  = 1;
    auto guard_height = 1;

    for (int i = 0; i < 100; i++) {
        // window mask: ones everywhere except the central guard region
        auto mask = af::array(2 * win_height + 1, 2 * win_width + 1, b8);
        mask(af::seq(0, 2 * win_height), af::seq(0, 2 * win_width)) = 1;
        mask(af::seq(win_height - guard_height, win_height + guard_height), af::seq(win_width - guard_width, win_width + guard_width)) = 0;

        af::array output_size = af::array(width, height, f32);
        af::array c_output    = af::array(width, height, f32);
        output_size(af::span, af::span) = af::constant(0, output.dims(), f32);
        output_size(af::seq(0, actual_height - 1), af::seq(0, actual_width - 1)) = 1;

        // windowed sums and SNR estimate
        auto rd_window       = af::convolve2(output_size, mask);
        auto rd_windowed_sum = af::convolve2(output, mask);
        auto avg_noise       = rd_windowed_sum / rd_window;
        auto rd_snr          = output / avg_noise;
        af::array rd_window_output = rd_snr < 1.0f;
        af::replace(c_output, rd_window_output, rd_snr);

        /* peak finding */
        af::array peak_find = af::array(width, height, f32);
        af::array peak_op   = af::constant(0, peak_find.dims(), f32);
        af::array peak_loc  = af::maxfilt(c_output, 3, 3, AF_PAD_ZERO);

        peak_loc = (peak_loc != c_output);
        af::replace(peak_op, peak_loc, c_output);

        af::array compare_result = af::constant(0, peak_op.dims(), f32);
        auto com_val = output(af::span, af::span) / guard(af::span, af::span) - 0.2;
        auto com_out = !((com_val > 1) && (peak_op));
        af::replace(compare_result, com_out, peak_op);

        where_fn(compare_result, output);
    }
}

void where_fn(af::array peak_op, af::array output)
{
    af::timer start1 = af::timer::start();
    af::array index1 = af::where(peak_op);
    std::cout << "elapsed Time(in ms)_where_1: " << af::timer::stop(start1) * 1000 << " ms" << std::endl;

    af::timer start2 = af::timer::start();
    index1 = af::where(peak_op);
    std::cout << "elapsed Time(in ms)_where_2: " << af::timer::stop(start2) * 1000 << " ms" << std::endl;
}

Pradeep Garigipati

Jul 7, 2021, 12:11:16 AM
to ArrayFire Users, Avinash Rangarajan
For future users.

---------- Forwarded message ---------
From: Pradeep Garigipati <pra...@arrayfire.com>
Date: Wed, Jul 7, 2021 at 9:38 AM
Subject: Re: inconsistency in timing af::where
To: Avinash Rangarajan <avi.ranga...@gmail.com>


That is because `peak_op` is a JIT node during the first call to "where" inside any given iteration, while during the second "where" API call `peak_op` has already been converted from a JIT node to a buffer (memory), so it needs no evaluation.

It is not a problem but rather a consequence of passing the same `af::array`, which is a JIT node, to "where" twice in a single iteration.
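In code, that looks roughly like this (a sketch with a hypothetical expression, not your exact arrays):

af::array a    = af::randu(2048, 2048);
af::array expr = (a > 0.5f) * a;   // comparison/arithmetic result: a JIT node, nothing has executed yet
af::array i1   = af::where(expr);  // first call: evaluates `expr` into a buffer, then runs where's own kernel
af::array i2   = af::where(expr);  // second call: `expr` is already a buffer, so only where's kernel runs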



On Tue, Jul 6, 2021 at 6:11 PM Avinash Rangarajan <avi.ranga...@gmail.com> wrote:
In the where function, the first where call takes more time (4 ms) compared to the second where call (0.5 ms) in all 100 iterations of the for loop.

If I print af_print(peak_op.numdims()) before calling the first where function, the time it takes is 0.5 ms.

I am not sure where the problem is

Regards
R Avinash

Avinash Rangarajan

Jul 7, 2021, 12:30:41 AM
to Pradeep Garigipati, ArrayFire Users
Thanks Pradeep.

If I call af::count(peak_op) instead of af::where(peak_op), the execution time is around 0.5 ms. What is different between the two executions (if peak_op is a JIT node in both cases)?

Regards
R Avinash


Pradeep Garigipati

Jul 7, 2021, 12:49:53 AM
to Avinash Rangarajan, ArrayFire Users
JIT is a just-in-time compilation engine that converts basic arithmetic, comparison, conditional, tiling, etc. operations into device kernels at runtime for a given accelerator. I think the following articles explain the performance benefits of JIT better; please go through them:

https://arrayfire.com/performance-of-arrayfire-jit-code-generation/

https://arrayfire.com/performance-improvements-to-jit-in-arrayfire-v3-4/

replace/select produce JIT nodes, and any other basic arithmetic you might have also results in JIT operations. Hence, the input array to "where" is a JIT node on its first call. But "where" uses a custom hand-written kernel, so it has to evaluate the JIT node before it can run. Thus, in the second run, since the array is already evaluated, the runtime of that evaluation is eliminated.

Hope that helps, do go through the blogs to get a better understanding.

Avinash Rangarajan

Jul 8, 2021, 8:37:54 AM
to ArrayFire Users
In that case, is there a better method to retrieve the indices of the non-zero elements in an array, without using where, so as to reduce the time?

Regards
R Avinash

Pradeep Garigipati

Jul 8, 2021, 11:27:35 AM
to Avinash Rangarajan, ArrayFire Users
Maybe my earlier explanation was too terse or confusing. Let me try again.

"Where" isn't slow or inefficient, neither is JIT. The kind of operation(replace API) you are doing is creating the input to first "where" API call, therefore the runtime for first "where" call essentially includes the execution time for "replace" kernel and "where" kernel.

If the input array to the first "where" call is a non-JIT (already evaluated) af::array, then you should see the same execution time for same-sized arrays.

So, "where" is not taking any additional time just because it is the first call. If you want to see the difference, try the following timing:


void where_fn(af::array peak_op, af::array output)
{
    peak_op.eval(); // evaluate `peak_op` here instead of letting the first where() auto-trigger it
    af::sync();     // wait for that kernel to finish before the timers start

    af::timer start1 = af::timer::start();
    af::array index1 = af::where(peak_op);
    std::cout << "elapsed Time(in ms)_where_1: " << af::timer::stop(start1) * 1000 << " ms" << std::endl;

    af::timer start2 = af::timer::start();
    index1 = af::where(peak_op);
    std::cout << "elapsed Time(in ms)_where_2: " << af::timer::stop(start2) * 1000 << " ms" << std::endl;
}

Avinash Rangarajan

Jul 8, 2021, 11:37:21 AM
to Pradeep Garigipati, ArrayFire Users
Thanks for your patience, I finally understood. I had tried eval() without calling sync(), so I couldn't see the difference. Now I know the where API is not the time-consuming part; it's something above it that takes more time than I expected. I will look into that. JIT made me assume that the code above it was efficient and ran in less time; that was the confusion.

Regards
Avinash


Pradeep Garigipati

Jul 8, 2021, 11:57:56 AM
to Avinash Rangarajan, ArrayFire Users
JIT helps with combining multiple individual operations into a single GPU kernel. For example, let's take this code.

array a = constant(1, 10, 10);
array b = constant(1, 10, 10);
array c = constant(1, 10, 10);

array d = a - b;
array e = a + b;
array f = d * e / c;

Now, if each such operation resulted in its own kernel, that would cause a lot of kernel-launch overhead, especially when done repetitively. With JIT, all of the above element-wise operations are merged into a single kernel.

Something like the following (only for illustration, not the exact optimized JIT-generated code):

kernel void jit12345(global const float* a, global const float* b, global const float* c, global float* out) {
  unsigned int i = get_global_id(0);
  out[i] = (a[i] - b[i]) * (a[i] + b[i]) / c[i];
}

Doing something like this is definitely more efficient than launching 7 kernels (3 for the constant arrays, 1 for addition, 1 for subtraction, 1 for multiplication and 1 for division) for these operations. Therefore, JIT almost always results in faster runtimes.
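For completeness, here is a small sketch of forcing that fused kernel to run at a point of your choosing (for example, right before starting a timer), using the eval()/sync() calls discussed earlier:

af::array a = af::constant(1, 10, 10);
af::array b = af::constant(2, 10, 10);
af::array f = (a - b) * (a + b); // still a JIT node; no kernel has been launched yet
f.eval();                        // launch the single fused kernel now
af::sync();                      // block until the device has finished it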