We can focus on a huge problem, say finding EST fragments in the
complete human genome sequence.
size of EST database is 1GB
size of genome sequence is 3GB
It is impossible to complete this task in one kernel call.
In CUDA 4.0, we can try
1) one PFAC context binds to several GPUs. Divide the patterns and
the inputs into small chunks, iterate over all combinations of
(pattern chunk, input chunk), then merge the results.
The overhead of merging is huge, so if we had a space-efficient
version of PFAC, each pattern chunk could be larger and
we could reduce the number of merges.
2) How about MPI? We need an MPI + PFAC example.
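
A rough sketch of option 1) in plain C++, to make the "iterate all
combinations, then merge" step concrete. Everything here is an
assumption for illustration: matchTile and tiledMatch are hypothetical
names, the naive CPU loop only stands in for one kernel call, and the
result convention (for each input position, the 1-based ID of the
longest pattern that starts there, 0 for none) is assumed, not taken
from the PFAC API. Each input chunk carries an overlap of
(longest pattern - 1) bytes so that a match crossing a chunk boundary
is not lost.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one kernel call on one tile:
// patterns[patBegin, patEnd) against one input chunk.
// Only the first `owned` positions of the chunk belong to this tile;
// the remaining bytes are overlap borrowed from the next chunk.
// Merging happens in place: a position is overwritten only when this
// tile finds a longer match than any earlier tile did.
static void matchTile(const std::string& chunk, std::size_t owned,
                      const std::vector<std::string>& patterns,
                      std::size_t patBegin, std::size_t patEnd,
                      int* bestId, std::size_t* bestLen)
{
    for (std::size_t pos = 0; pos < owned; ++pos) {
        for (std::size_t p = patBegin; p < patEnd; ++p) {
            const std::string& pat = patterns[p];
            if (pat.size() > bestLen[pos] &&
                chunk.compare(pos, pat.size(), pat) == 0) {
                bestLen[pos] = pat.size();
                bestId[pos] = static_cast<int>(p) + 1;  // 1-based global ID
            }
        }
    }
}

// Iterate over all (pattern chunk, input chunk) combinations and merge.
// textChunk is in bytes, patChunk is in number of patterns.
std::vector<int> tiledMatch(const std::string& text,
                            const std::vector<std::string>& patterns,
                            std::size_t textChunk, std::size_t patChunk)
{
    textChunk = std::max<std::size_t>(textChunk, 1);
    patChunk = std::max<std::size_t>(patChunk, 1);

    std::size_t maxLen = 0;
    for (const std::string& p : patterns) maxLen = std::max(maxLen, p.size());

    std::vector<int> bestId(text.size(), 0);
    std::vector<std::size_t> bestLen(text.size(), 0);
    if (maxLen == 0) return bestId;  // no non-empty patterns

    for (std::size_t pb = 0; pb < patterns.size(); pb += patChunk) {
        const std::size_t pe = std::min(pb + patChunk, patterns.size());
        for (std::size_t tb = 0; tb < text.size(); tb += textChunk) {
            const std::size_t owned = std::min(textChunk, text.size() - tb);
            // Copy the chunk plus (maxLen - 1) bytes of overlap.
            const std::string chunk = text.substr(tb, owned + maxLen - 1);
            matchTile(chunk, owned, patterns, pb, pe, &bestId[tb], &bestLen[tb]);
        }
    }
    return bestId;
}
```

For example, tiledMatch("ACGTACGTTT", {"ACG", "ACGT", "TT", "GTA"}, 3, 1)
should give the same result as a single untiled pass, whatever chunk
sizes are chosen. The number of tiles is
ceil(#patterns / patChunk) * ceil(text size / textChunk), which is why
larger pattern chunks mean fewer merges.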
GPU computing is useful when the app is HUGE.
If we only demo a small app that runs in milliseconds, then no one
cares, because a CPU can do that very well.