I would have to look at the code.
Read Duane Merrill's paper if you want to know the state of the art. His idea was to segment the array so that each compute block reduces a contiguous portion of it in parallel. A single compute block then turns these per-segment reductions into a prefix sum. This much smaller prefix sum is then used to compute the prefix sum of each segment. The I/O cost is about 2 reads and 1 write per element, which is very good.
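To make the three phases concrete, here is a minimal sequential sketch in Python. It is not Merrill's implementation and not this library's code: the function name `segmented_inclusive_scan` and the `segment_size` parameter are made up for illustration, and the loops over segments stand in for what would be independent compute blocks on a GPU.

```python
def segmented_inclusive_scan(values, segment_size):
    """Inclusive prefix sum in three phases: reduce, scan the totals, scan each segment.

    Hypothetical helper for illustration only. Each list entry in `segments`
    plays the role of the contiguous portion handled by one compute block.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")

    segments = [values[i:i + segment_size]
                for i in range(0, len(values), segment_size)]

    # Phase 1: reduce every segment independently (one read per element).
    totals = [sum(segment) for segment in segments]

    # Phase 2: exclusive prefix sum over the per-segment totals.
    # This array is small, so a single block can handle it.
    offsets = []
    running = 0
    for total in totals:
        offsets.append(running)
        running += total

    # Phase 3: scan every segment independently, seeded with its offset
    # (one more read and one write per element).
    result = []
    for segment, offset in zip(segments, offsets):
        accumulator = offset
        for value in segment:
            accumulator += value
            result.append(accumulator)
    return result


if __name__ == "__main__":
    print(segmented_inclusive_scan([1, 2, 3, 4, 5, 6, 7], segment_size=3))
```

Phase 1 reads each element once, and phase 3 reads it again and writes the result, which is where the figure of about 2 reads and 1 write per element comes from.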
Chad
I have to review the code, but as you know, a GPU has more constraints than a CPU, so on a GPU it is more efficient to work locally first (at thread scope) and then combine the results per block.
Does that answer your question?
BTW, the goal of this library is to automatically use the best possible algorithm for any device. It is up to the library to choose the right algorithm depending on your request. Unfortunately, I have no more time to work on it for now!
Krys