load unbalance issue: AC versus PFAC

Lung-Sheng Chien

unread,

Mar 4, 2011, 7:59:45 AM3/4/11

to pfacForum

There is already one paper [1] implementing AC on GPU against DNA
analysis.
As we know, load unbalance is an important issue on PFAC, side effect
of load unbalance causes performance fluctuation, ranging between
6Gbps to 100Gbps.

We have some idea to solve load unbalance of PFAC, however we need to
understand a fundamental problem:
complexity of AC is linear, O(N) whereas complexity of PFAC is psudo
linear O(\alpha * N), \alpha is maximum pattern length.

Conjecture: AC is much more tolerant of load unbalance than PFAC.
In order to prove or disprove this conjecture, we need to re-implement
AC on GPU.

extreme case: if pattern is aaa...a and input is also aaa...aaa..,
then load unbalance does not exist on PFAC, but performance of PFAC is
inverse proportional to length of pattern where performance of AC is
independent of length of the pattern.

[1] A. Tumeo and O. Villa, “Accelerating DNA analysis applications on
GPU clusters”, IEEE Symposium on Application Specific Processors
(SASP), Anaheim, CA, June 13-14, 2010, pp. 71-76

Message has been deleted

Cheng-Hung Lin

unread,

Mar 5, 2011, 9:50:44 PM3/5/11

to pfacForum

Consider the special case that matches a sequence of 'a' against a
pattern of multiple 'a'. In my experiments, when the pattern has more
than 32 instances of 'a', the performance of AC is better than PFAC.
The following table shows the performance for processing 128M of 'a'.
#.a AC PFAC_lib PFAC(global memory version)
25 5.07 8.54 6.59
32 5.07 6.74 5.17
50 5.07 4.38 3.33
100 5.07 2.24 1.69
200 5.07 1.05 0.84
400 5.07 0.52 0.42
The result is reasonable by calculating the frequence of memory
accesses.

Because the AC version cannot meet the memory coalesce, the frequence
of accessing global memory of AC is
128M x 1024 x 16 / 512

On the other hand, the frequence of accessing global memory of PFAC is
128M x # of 'a'

Therefore, when the # of 'a' is equal to 32, the above two equations
are equal and the performance of AC and PFAC are almost equal.

Cheng-Hung Lin

unread,

Mar 5, 2011, 9:56:18 PM3/5/11

to pfacForum

Additional remark:
In the AC version, each thread is in charge of 512 bytes. Because of
boundary detection problem described in our paper, each thread has to
scan 1024 bytes.

Lung-Sheng Chien

unread,

Mar 5, 2011, 10:24:12 PM3/5/11

to pfacForum

I don't think so, if we ignore writing the matched result, then

total amount of input of AC
(128M/512 threads) x (1024 read per thread) x (128byte, each thread
loads one cache line in Fermi)
= 128MB x 256

total amount of look-up table of AC
(128M/512 threads) x (1024 read per thread) = 128M x 2

total amount of input of PFAC
128MB x 1.5

total amount of look-up table of AC
128M x length(pattern)

If pattern length is 32, then

input of AC : input of PFAC = 256 : 1.5
lookup table of AC : lookup table of PFAC = 2 : 32

The key is kernel of AC has HUGE panelty on reading input data,
we need to modify the kernel first.

I summarize what we should do in the following:

1) put input to shared memory
2) write match result to shared memory, then finally write to global
memory together to reduce overhead of write
3) you don't need output table because we have reordered final states
4) consider bank-conflict of shared memory
5) read integer instead of character, this is done in the paper
"accelerating ...", please read it carefully

On 3月6日, 上午10時50分, Cheng-Hung Lin <bruceli...@gmail.com> wrote:

> > > (SASP), Anaheim, CA, June 13-14, 2010, pp. 71-76- 隱藏被引用文字 -
>
> - 顯示被引用文字 -

Reply all

Reply to author

Forward