Question about clppScan_GPU


Mingcheng Chen

Aug 31, 2012, 1:49:08 PM
to cl...@googlegroups.com
Hi,

Thank you for your time!

I have recently been looking for a good implementation of prefix sum, and a friend referred me to your project, so I am now comparing the default and the "GPU" versions in your package.

From what I can tell, scan_GPU always uses only one thread block, right?

Thanks again!

Best regards,
Mingcheng

Chad Brewbaker

Aug 31, 2012, 1:59:25 PM
to cl...@googlegroups.com

I would have to look at the code.

Read Duane Merrill's paper if you want to know the state of the art. His idea was to segment the array so that each compute block computes a reduction over a contiguous portion of it in parallel; these reductions are then turned into a prefix sum on a single compute block. This much smaller prefix sum is then used to compute the prefix sum of each segment. The I/O cost is about 2 reads and 1 write per element, which is very good.
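
The three phases described above can be sketched sequentially in Python. This is only an illustration, not code from clpp or from the paper: the function name `reduce_then_scan`, the `block_size` parameter, and the choice of an exclusive scan are all assumptions, and each "block" is simulated by an ordinary loop rather than a GPU work-group.

```python
from itertools import accumulate


def reduce_then_scan(data, block_size=4):
    """Exclusive prefix sum via reduce, scan of block sums, then per-block scan.

    Hypothetical helper for illustration; blocks run one after another here,
    whereas on a GPU phases 1 and 3 would run all blocks in parallel.
    """
    # Split the input into contiguous segments, one per simulated block.
    segments = [data[i:i + block_size] for i in range(0, len(data), block_size)]

    # Phase 1: each block reduces its own segment (first read of each element).
    block_sums = [sum(seg) for seg in segments]

    # Phase 2: a single block scans the much smaller array of block sums.
    # The offset of block b is the sum of all blocks before it.
    block_offsets = [0] + list(accumulate(block_sums))[:-1]

    # Phase 3: each block scans its segment, seeded with its offset
    # (second read and the single write of each element).
    out = []
    for seg, offset in zip(segments, block_offsets):
        running = offset
        for x in seg:
            out.append(running)
            running += x
    return out
```

Each element is read once in phase 1 and once in phase 3, and written once in phase 3, which is where the "2 reads and 1 write per element" figure comes from; the block-sum array in phase 2 is small enough to be ignored.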

Chad

kr...@polarlights.net

Sep 1, 2012, 8:59:30 AM
to cl...@googlegroups.com
What do you mean by one thread block?

I would have to review the code, but as you know, a GPU has more constraints than a CPU, so on a GPU it is more efficient to work locally (at thread scope) first and then process the results block by block.

Does that answer your question?

BTW, the goal of this library is to 'use' the best possible algorithm for any device, automatically. It is up to the library to choose the right algorithm depending on your request. Unfortunately, I have no more time to work on it for now!

Krys

Mingcheng Chen

Sep 1, 2012, 9:40:27 AM
to cl...@googlegroups.com
I mean that when you track the global size and the work-group size before each call, they are always the same.
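
For reference, here is a small sketch of how those two numbers relate to the number of thread blocks (work-groups), assuming a one-dimensional launch. The helper name `num_work_groups` is hypothetical and not part of clpp or the OpenCL API; it only encodes the rule that the work-group count along a dimension is the global size divided by the work-group size.

```python
def num_work_groups(global_size, local_size):
    """Work-groups launched along one dimension (hypothetical helper)."""
    # OpenCL 1.x requires the global size to be a multiple of the
    # work-group (local) size; otherwise the enqueue call is rejected.
    if local_size <= 0 or global_size % local_size != 0:
        raise ValueError("global size must be a positive multiple of local size")
    return global_size // local_size
```

If the global size equals the work-group size, as in `num_work_groups(256, 256)`, the result is 1, i.e. a single work-group; a launch such as `num_work_groups(1024, 256)` would give 4.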
