User event and Callback usage for performance improvements

Francesc Lordan Gomis

Dec 20, 2016, 7:20:16 PM
to chai-dev
Hi,

I've been taking a look at several of the benchmarks, and I've seen that the applications always use in-order queues to submit the kernels to the GPU and wait for the events to finish with clFinish(). I think that by using an out-of-order queue and defining a wait list for each enqueued command you could get better performance when running the benchmarks.
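As a rough sketch (assuming the OpenCL 1.x host API; the ocl.clContext and ocl.clDevice names are just placeholders in the style of the snippet further below), creating such a queue could look like:

cl_int status;
/* out-of-order queue: commands are ordered only by their wait lists */
cl_command_queue queue = clCreateCommandQueue(ocl.clContext, ocl.clDevice,
                                              CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
                                              &status);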

For instance, in the CEDT code, the application copies the input data and waits until the operation has finished before enqueueing the two kernels. Once both have finished their execution, the application enqueues a read from the buffer, and then the GPU starts processing the following iteration while the CPU runs the last two kernels. Instead of doing that, you could submit all the GPU work at once while defining the dependencies among the commands. All the clEnqueue* operations take three parameters at the end (wait list size, wait list events, and an output event) that let you express these dependencies.
You could have something like:

cl_event event1, event2, event3, event4;

/* non-blocking write; signals event1 when the copy has finished */
clEnqueueWriteBuffer(ocl.clCommandQueue, d_in_out, CL_FALSE, 0, in_size,
                     h_in_out[rep], 0, NULL, &event1);
/* Gaussian kernel waits for the write */
clEnqueueNDRangeKernel(ocl.clCommandQueue, ocl.clKernel_gauss, 2, offset, gs, ls,
                       1, &event1, &event2);
/* Sobel kernel waits for the Gaussian kernel */
clEnqueueNDRangeKernel(ocl.clCommandQueue, ocl.clKernel_sobel, 2, offset, gs, ls,
                       1, &event2, &event3);
/* non-blocking read waits for the Sobel kernel */
clStatus = clEnqueueReadBuffer(ocl.clCommandQueue, d_in_out, CL_FALSE, 0, in_size,
                               h_in_out[rep], 1, &event3, &event4);
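At the end of the iteration, a single wait on the last event is then enough to synchronize with the host, for instance:

clWaitForEvents(1, &event4);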

The application would save all the overhead of coming back from the library to the application code and then going back to the library to submit new work (as already happens in between the kernel executions). In this way you can control GPU-GPU dependencies. To control CPU-GPU dependencies (for instance, reading a new frame from the input file) you can use user events, created with clCreateUserEvent (https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clCreateUserEvent.html). You can put such an event on the wait list of the clEnqueueWriteBuffer operation and trigger the execution with clSetUserEventStatus (https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clSetUserEventStatus.html). Conversely, to control GPU-CPU dependencies, you can define a callback function and attach it to an event so that the application is notified when the corresponding operation finishes (CL_COMPLETE command execution status): https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clSetEventCallback.html. Be careful when using callbacks: usually a single thread (internal to the library) executes them, and if a callback performs heavy computation it may block the notification of the other callbacks.
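A rough sketch of how both mechanisms could be combined with the snippet above (frame_ready and host_notify are just placeholder names for illustration):

/* GPU-CPU: callback invoked when the command reaches CL_COMPLETE.
   Keep it light; implementations often run all callbacks on one internal thread. */
void CL_CALLBACK host_notify(cl_event ev, cl_int exec_status, void *user_data) {
    /* e.g. set a flag or signal a semaphore for the host thread */
}

/* CPU-GPU: a user event gates the write until the host has produced the frame */
cl_int status;
cl_event frame_ready = clCreateUserEvent(ocl.clContext, &status);
clEnqueueWriteBuffer(ocl.clCommandQueue, d_in_out, CL_FALSE, 0, in_size,
                     h_in_out[rep], 1, &frame_ready, &event1);
/* ... host reads the next frame from the input file ... */
clSetUserEventStatus(frame_ready, CL_COMPLETE);   /* releases the pending write */

/* get notified when the final read (event4) has completed */
clSetEventCallback(event4, CL_COMPLETE, host_notify, NULL);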

If the only problem is profiling the time spent on each operation, then instead of using your own timer you can use clGetEventProfilingInfo (https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetEventProfilingInfo.html) to obtain the time an operation took to execute, the time it spent waiting in the queue, etc.
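For reference, a minimal sketch of that query (the command queue has to be created with the CL_QUEUE_PROFILING_ENABLE property for the counters to be recorded):

cl_ulong t_queued, t_submit, t_start, t_end;
clWaitForEvents(1, &event3);   /* make sure the command has completed */
clGetEventProfilingInfo(event3, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &t_queued, NULL);
clGetEventProfilingInfo(event3, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &t_submit, NULL);
clGetEventProfilingInfo(event3, CL_PROFILING_COMMAND_START,  sizeof(cl_ulong), &t_start,  NULL);
clGetEventProfilingInfo(event3, CL_PROFILING_COMMAND_END,    sizeof(cl_ulong), &t_end,    NULL);
/* values are in nanoseconds: (t_end - t_start) is the execution time,
   (t_start - t_submit) the time spent waiting in the queue */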

Liwen Chang

Dec 21, 2016, 4:39:54 PM
to chai-dev
Thanks Francesc,
I really like your idea.

It not only might be beneficial in terms of performance (depending on the workload of each task),
but it also seems to make the code structure cleaner.

Also, as you mentioned, event profiling can recover timestamps and serve as a timer.
From my understanding (or just impression), clGetEventProfilingInfo might depend heavily on the OpenCL stack
and might not be supported properly across stacks.
Since Chai is a benchmark suite, by design we might want it to be executable across OpenCL stacks.
That may just be an old impression of mine; maybe it has changed.

From your experience, did you ever run into cases where events, callbacks, or clGetEventProfilingInfo were not supported?

Francesc Lordan Gomis

Jan 10, 2017, 4:13:03 AM
to chai-dev
Hi Liwen,

I have only worked with implementations for the Intel GPU stack and Qualcomm's Snapdragon, and I had no problems with either of them. I guess all these features should work on any mature OpenCL implementation, since they have been available since version 1.1. However, the performance obtained might differ depending on the internals of each implementation (especially for the callbacks).

Liwen Chang

Jan 12, 2017, 3:40:34 AM
to chai-dev
Good to know.
We are also evaluating Chai on some mobile CPU-GPU SoCs and Intel IGPs.
Thanks for the insightful suggestion.
We will definitely check it out.