I've been looking at several of the benchmarks and noticed that the applications always use in-order queues to submit kernels to the GPU, and wait for the enqueued work to complete by calling clFinish(). I think you could get better performance when running the benchmarks by using an out-of-order queue and defining a wait list for each enqueued command.
For instance, the CEDT code copies the input data and waits until that operation has finished before enqueuing the two kernels. Once both kernels have finished executing, the application enqueues a read from the buffer. Then the GPU starts processing the next iteration while the CPU runs the last 2 kernels. Instead of doing that, you could submit all the GPU work at once and define dependencies among the commands. All the clEnqueue* calls take three trailing parameters (the wait list size, the wait list of events, and an output event). Those let you define these dependencies.
You could have something like:
cl_event event1;
cl_event event2;
cl_event event3;
cl_event event4;
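// Non-blocking write; its completion is signalled through event1.
clEnqueueWriteBuffer(ocl.clCommandQueue, d_in_out, CL_FALSE, 0, in_size, h_in_out[rep], 0, NULL, &event1);
// Each command waits on the previous command's event and signals its own.
clEnqueueNDRangeKernel(ocl.clCommandQueue, ocl.clKernel_gauss, 2, offset, gs, ls, 1, &event1, &event2);
clEnqueueNDRangeKernel(ocl.clCommandQueue, ocl.clKernel_sobel, 2, offset, gs, ls, 1, &event2, &event3);
// Non-blocking read; the host can wait on event4 when it needs the result.
clStatus = clEnqueueReadBuffer(ocl.clCommandQueue, d_in_out, CL_FALSE, 0, in_size, h_in_out[rep], 1, &event3, &event4);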
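
To illustrate why the wait lists alone should be enough to keep the write, gauss, sobel, read chain in order even when the queue is free to reorder commands, here is a rough toy model in plain C. It is not the OpenCL API and does not touch a device: `command_t`, `is_ready`, `run_out_of_order` and `demo_pipeline` are hypothetical names made up for this sketch, and the "queue" deliberately scans commands in reverse submission order as a worst case.

```c
#include <stdbool.h>

/* Toy model only, NOT the OpenCL API. Each command carries a wait list of
   indices of commands that must complete first (like event_wait_list). */
#define MAX_WAIT 4

typedef struct {
    const char *name;        /* label for readability only */
    int wait_list[MAX_WAIT]; /* indices of the commands this one depends on */
    int num_wait;            /* like num_events_in_wait_list */
    bool done;               /* like the command's event being complete */
} command_t;

/* A command may start once every command in its wait list is done. */
static bool is_ready(const command_t *cmds, const command_t *c) {
    for (int i = 0; i < c->num_wait; i++) {
        if (!cmds[c->wait_list[i]].done) return false;
    }
    return true;
}

/* Simulated out-of-order queue: it always tries the most recently submitted
   command first and runs the first one that is ready. The indices of the
   executed commands are written to order[]; returns how many commands ran. */
static int run_out_of_order(command_t *cmds, int n, int *order) {
    int executed = 0;
    bool progress = true;
    while (executed < n && progress) {
        progress = false;
        for (int i = n - 1; i >= 0; i--) {
            if (!cmds[i].done && is_ready(cmds, &cmds[i])) {
                cmds[i].done = true;
                order[executed++] = i;
                progress = true;
                break; /* rescan from the newest command */
            }
        }
    }
    return executed;
}

/* Same dependency chain as the snippet above:
   0 = write, 1 = gauss (waits on 0), 2 = sobel (waits on 1), 3 = read (waits on 2). */
static int demo_pipeline(int *order) {
    command_t cmds[4] = {
        {"write", {0}, 0, false},
        {"gauss", {0}, 1, false},
        {"sobel", {1}, 1, false},
        {"read",  {2}, 1, false},
    };
    return run_out_of_order(cmds, 4, order);
}
```

Calling `demo_pipeline(order)` with an `int order[4]` fills it with the execution order. Because every command depends on the one before it, the chain cannot be reordered within one iteration; any gain from an out-of-order queue would come from commands that have no dependency on each other, such as work belonging to different iterations.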