I don't think so. If we ignore writing the matched results, then
total amount of input of AC
(128M/512 threads) x (1024 reads per thread) x (128 bytes, each thread
loads one cache line per read on Fermi)
= 128MB x 256
total amount of look-up table of AC
(128M/512 threads) x (1024 reads per thread) = 128M x 2
total amount of input of PFAC
128MB x 1.5
total amount of look-up table of PFAC
128M x length(pattern)
If pattern length is 32, then
input of AC : input of PFAC = 256 : 1.5
lookup table of AC : lookup table of PFAC = 2 : 32
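
To make the arithmetic above easy to re-check with other sizes, here is a small sketch in plain C++. The struct and function names are made up for illustration. I am also assuming that the "512" means 512 bytes of input per thread, and that every one-byte read pulls a full 128-byte cache line; the 1.5 factor for PFAC is simply taken from the numbers above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types; only the numbers come from the post.
struct Traffic {
    std::uint64_t input_bytes;    // bytes moved from memory to fetch input
    std::uint64_t table_lookups;  // number of look-up table reads
};

// AC estimate: one thread per segment, each thread issues reads_per_thread
// one-byte reads, and each read is assumed to load a whole cache line.
Traffic ac_traffic(std::uint64_t input_bytes, std::uint64_t segment_bytes,
                   std::uint64_t reads_per_thread,
                   std::uint64_t cache_line_bytes) {
    assert(segment_bytes > 0);
    const std::uint64_t threads = input_bytes / segment_bytes;
    const std::uint64_t reads = threads * reads_per_thread;
    return {reads * cache_line_bytes, reads};
}

// PFAC estimate: input traffic is taken as 1.5x the input size, and the
// look-up count as one read per input byte per pattern character.
Traffic pfac_traffic(std::uint64_t input_bytes, std::uint64_t pattern_length) {
    return {input_bytes * 3 / 2, input_bytes * pattern_length};
}
```

Calling `ac_traffic(128ull << 20, 512, 1024, 128)` and `pfac_traffic(128ull << 20, 32)` reproduces the 256 : 1.5 and 2 : 32 ratios.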
The key is that the AC kernel has a HUGE penalty on reading input data,
so we need to modify the kernel first.
To summarize, here is what we should do:
1) put the input into shared memory
2) write match results to shared memory first, then write them to global
memory together at the end to reduce write overhead
3) you don't need the output table because we have reordered the final states
4) consider bank conflicts in shared memory
5) read integers instead of characters; this is done in the paper
"accelerating ...", please read it carefully
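
As a rough illustration of item 5 only, here is a host-side C++ sketch of reading the input four characters at a time and unpacking them with shifts. It is not the kernel from the paper. The tiny matcher for the single pattern "ab" is a hypothetical stand-in for the real state-transition table, and the byte order assumes a little-endian machine (which holds for NVIDIA GPUs and x86).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the transition table, pattern "ab" only.
// State 0: nothing matched, 1: seen 'a', 2: matched "ab" (final state).
inline int next_state(int state, unsigned char c) {
    if (state == 1 && c == 'b') return 2;
    return c == 'a' ? 1 : 0;
}

// Scan len bytes, fetching four characters with each 32-bit load.
std::size_t count_matches_wordwise(const unsigned char* data, std::size_t len) {
    std::size_t matches = 0;
    int state = 0;
    std::size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        std::uint32_t word;
        std::memcpy(&word, data + i, sizeof word);  // one 4-byte read
        for (int k = 0; k < 4; ++k) {
            // Little-endian assumption: byte k sits at bits [8k, 8k+7].
            const unsigned char c =
                static_cast<unsigned char>((word >> (8 * k)) & 0xFFu);
            state = next_state(state, c);
            if (state == 2) ++matches;
        }
    }
    for (; i < len; ++i) {  // leftover bytes that do not fill a word
        state = next_state(state, data[i]);
        if (state == 2) ++matches;
    }
    return matches;
}
```

In a kernel the same unpacking would apply to an `int` loaded from the shared-memory copy of the input, so each thread issues one load per four characters instead of four.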
> > > (SASP), Anaheim, CA, June 13-14, 2010, pp. 71-76