Slow initialization in parsePatternFile step 5

Grigory Dobrov

Jan 19, 2012, 6:11:32 AM

to pfac...@googlegroups.com

Hello.

If I understand correctly, this code sorts the patterns and then searches for the position each pattern ended up in after sorting.

// step 4: sort patterns by lexicographic order
qsort( *rowPtr, rowIdxArray.size()-1, sizeof(char*), (int (*)(const void*, const void*)) pattern_cmp ) ;

*max_state_num_ptr = file_size + 1 ;
*pattern_num_ptr = rowIdxArray.size() - 1 ;

// step 5: compute f(final state) = patternID
for( int i = 0 ; i < *pattern_num_ptr ; i++){
    char *key = (*rowPtr)[i];
    // find patterns whose pointer is the same as "key"
    for( int j = 0 ; j < *pattern_num_ptr ; j++){
        if ( key == rowIdxArray[j] ){
            (*patternID_table_ptr)[i] = j + 1 ; // pattern number starts from 1
            break ;
        }
    }
}

The loop in step 5 is O(n^2), so if *pattern_num_ptr is large (about 10^6 in my case) the initialization takes a long time.

The simplest solution is to sort not the bare pattern pointers but structures like

struct StringIndex
{
    char *str;
    size_t index;
};

using a wrapper around the pattern_cmp function, and then read the pattern IDs straight from the index fields.

I think this improvement is simple and useful. I can share the code with you.
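
The idea can be sketched as follows. This is a minimal illustration, not the code from the PFAC library: sort_with_ids and string_index_cmp are hypothetical names, and strcmp only stands in for PFAC's pattern_cmp, whose exact behaviour is not shown in this thread. Each pattern pointer is paired with its original position before sorting, so the pattern IDs can be read from the sorted array in O(n log n) overall instead of the O(n^2) search.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Pattern pointer together with its position in the original (unsorted) list.
struct StringIndex
{
    char *str;
    std::size_t index;
};

// qsort comparator over StringIndex. strcmp is only a stand-in for
// PFAC's pattern_cmp (assumption); it is applied to the str fields.
static int string_index_cmp(const void *a, const void *b)
{
    const StringIndex *x = static_cast<const StringIndex *>(a);
    const StringIndex *y = static_cast<const StringIndex *>(b);
    return std::strcmp(x->str, y->str);
}

// Sorts `patterns` in place and returns, for every sorted position i,
// the 1-based ID the pattern had before sorting.
std::vector<int> sort_with_ids(std::vector<char *> &patterns)
{
    std::vector<StringIndex> items(patterns.size());
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        items[i].str = patterns[i];
        items[i].index = i;
    }
    if (!items.empty()) {
        std::qsort(&items[0], items.size(), sizeof(StringIndex), string_index_cmp);
    }
    std::vector<int> ids(patterns.size());
    for (std::size_t i = 0; i < items.size(); ++i) {
        patterns[i] = items[i].str;
        ids[i] = static_cast<int>(items[i].index) + 1; // pattern numbers start from 1
    }
    return ids;
}
```

For the patterns "abcd", "ab", "xyz" given in that order, the sorted order is "ab", "abcd", "xyz" and the returned IDs are 2, 1, 3. Since qsort is not stable, identical patterns may receive their IDs in either order.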

Lung-Sheng Chien

Jan 20, 2012, 1:24:24 AM

to pfacForum

Yes, you are right.

For some reason, we sort the patterns in lexicographic order before
constructing the PFAC state machine.
I don't remember the reason; it should be related to some work on
load imbalance.

Now I don't think we need to sort the patterns.
The only constraint is that pattern numbers should start from 1,
because 0 is reserved for the initial state and each pattern number
corresponds to a final state.

@Bruce, do you think we need to sort the patterns?

@Grigory, thanks for your suggestion; we will fix this bug in the next
release.
I would appreciate it if you could share the code.

Lung-Sheng

Grigory Dobrov

Jan 24, 2012, 10:34:38 AM

to pfacForum

Thank you for the answer.

I tried something like this and it works fine:
http://pastebin.com/unmrPyqX

There are some other modifications in my code, so I tried to leave only
the interesting lines. I hope I didn't forget anything.

I have one more question about PFAC. I am trying to access the same
PFAC_handle_t from different CPU threads and it is very unstable.
Sometimes I get "unspecified cuda failure" during cudaMalloc
or even cudaFree, but I've never seen an error from the PFAC library
itself. So I suppose there is some problem with access to PFAC_handle_t,
but so far I haven't found the race. Could you help me with my search? I
checked that there was enough device memory for multithreaded matching.
I'm calling PFAC_matchFromDeviceReduce on a GeForce 560 Ti. The main
question is: am I able to use PFAC in a multithreaded app?
Also, I don't understand why all PFAC functions can't be async.

Lung-Sheng Chien

Jan 25, 2012, 1:42:00 AM

to pfacForum

Thank you for the code, I will check it on the weekend.

As for your questions, I will try to clarify them as follows:

Q1: I am trying to access the same PFAC_handle_t from different
CPU threads and it is very unstable.

Ans: Page 15 of the user guide (PFAC_userGuide_r1.1) says
"PFAC context binds to only one GPU context and only default stream is
used in PFAC library.
Programmers can bind multiple PFAC contexts to multiple GPU by OpenMP
library or pThread library, please see OpenMP example in
$(PFAC_LIB_ROOT)/test/omp_PFAC.cpp or pThread example in
$(PFAC_LIB_ROOT)/test/SimpleMultiGPU_pthread.cpp"

More precisely, the PFAC context is allocated on the heap, so all threads
can share the same context.
However, the PFAC context binds to the current GPU, checks the compute
capability of that GPU and loads the proper dynamic library.
So you should keep the same device for the same PFAC context.
For example, the following code does not work.
-----------------------------------------------------
// suppose GPU0 is GTX480 and GPU1 is GTX280
cudaSetDevice(0); // set device 0 as current
PFAC_create( &handle ) ; // PFAC context binds to GPU0
PFAC_readPatternFromFile( handle, patternFile); // create state machine on GPU0
cudaSetDevice(1); // set device 1 as current
PFAC_matchFromHost( handle, ...); // this does not work because PFAC
                                  // will allocate its buffer on GPU1 and do pattern
                                  // matching on GPU1, but the state transition
                                  // table is on GPU0.
-----------------------------------------------------
Do you change the GPU device during your computation?

Since you got "unspecified cuda failure" during cudaMalloc or even
cudaFree, this is not a normal situation.

Q2: I checked that there was enough device memory for multithreaded
matching. I'm calling PFAC_matchFromDeviceReduce on a GeForce 560 Ti.
The main question is: am I able to use PFAC in a multithreaded app?

Ans: Yes, PFAC is thread-safe, so you can do multithreaded programming.
Please check $(PFAC_LIB_ROOT)/test/SimpleMultiGPU_pthread.cpp

I don't recommend having many threads share one PFAC context.
Suppose you have a huge problem and want to adopt a multi-GPU solution.
There are two extreme cases.
case 1: one small pattern set and two input streams.
You can create two threads bound to two different PFAC contexts,
with each PFAC context bound to a different GPU.
Each PFAC context has the same patterns but deals with a different
input stream.

case 2: one huge pattern set and one input stream.
You can divide the pattern set into two smaller ones.
Again you can create two threads bound to two different PFAC
contexts, with each PFAC context bound to a different GPU.
Each PFAC context has different patterns but processes the same
input stream.

If you only have one GPU and one pattern set, then you should be able
to share one PFAC context (i.e. share the same pattern set), with each
thread processing a different input stream.
However, if each thread has a different combination of pattern set and
input stream, then you cannot share a PFAC context among threads.

Q3: I don't understand why all PFAC functions can't be async.

Ans: On page 23 of the user guide, table 3 lists all main functions; only
one of them is asynchronous (PFAC_matchFromDevice() +
PFAC_TEXTURE_OFF).
The reason is cudaMalloc and texture binding.
(According to the CUDA programming guide, only the following operations
are asynchronous:
- Kernel launches;
- Device-device memory copies;
- Host-device memory copies of a memory block of 64 KB or less;
- Memory copies performed by functions that are suffixed with Async;
- Memory set function calls.
)

I'd recommend multithreaded programming to work around the synchronous
calls of PFAC.

Lung-Sheng


Cheng-Hung Lin

Jan 25, 2012, 10:50:34 AM

to pfac...@googlegroups.com

Sorting patterns in lexicographic order benefits prefix sharing when we divide the patterns into different groups. Of course, sorting is unnecessary for a single group of patterns.

--
Sincerely,
Cheng-Hung Lin

Lung-Sheng Chien

Jan 26, 2012, 1:02:40 AM

to pfacForum

I don't understand.
Do you want to divide the patterns into several small groups and use a
multi-GPU solution?

I just want to know why we sort the patterns. Do you remember?

Lung-Sheng


Cheng-Hung Lin

Jan 26, 2012, 10:11:43 AM

to pfac...@googlegroups.com

The reason is that we reorder the output vector to remove the output table. We sort the patterns to avoid the "prefix problem."

Consider compiling these two patterns, "abcd" and "ab", without sorting them.

When compiling the first pattern, the state machine is as follows.

3->(a)->4->(b)->5->(c)->6->(d)->1 (the final state of "abcd")

Then, when compiling the second pattern "ab", we have to renumber state 5 to 2, because it becomes the final state of "ab". This makes constructing the state machine more complicated.

If we sort the patterns first, we can simplify the construction of the PFAC state machine.
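
A small sketch of why sorting removes the renumbering step. It is an illustration only, not the library's construction routine: build_transitions is a hypothetical helper, and the numbering is an assumption taken from the example above (final states 1..n equal to the pattern numbers, initial state n+1, internal states after that). Because a prefix always sorts before any longer pattern that contains it, the state that ends a pattern is always created while that pattern is being inserted, so no existing state ever has to be renumbered.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// (state, input character) -> next state
typedef std::map<std::pair<int, char>, int> Transitions;

// Builds the transition table for `patterns` after sorting them.
// Assumed numbering: final states 1..n, initial state n+1,
// internal states n+2, n+3, ...
Transitions build_transitions(std::vector<std::string> patterns)
{
    std::sort(patterns.begin(), patterns.end());
    const int n = static_cast<int>(patterns.size());
    const int initial = n + 1;
    int next_internal = n + 2;
    Transitions delta;

    for (int id = 1; id <= n; ++id) {
        const std::string &p = patterns[id - 1];
        int state = initial;
        for (std::size_t k = 0; k < p.size(); ++k) {
            const std::pair<int, char> key(state, p[k]);
            Transitions::iterator it = delta.find(key);
            if (it != delta.end()) {
                state = it->second; // shared prefix: reuse the existing state
                continue;
            }
            // The last character of a pattern leads to its final state,
            // whose number is the pattern number itself.
            const bool last = (k + 1 == p.size());
            const int target = last ? id : next_internal++;
            delta[key] = target;
            state = target;
        }
    }
    return delta;
}
```

For "abcd" and "ab" this yields 3->(a)->4->(b)->1->(c)->5->(d)->2: "ab" is inserted first and ends in final state 1, and "abcd" simply continues from that state.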

--
Sincerely,
Cheng-Hung Lin

Dobrov Grigoriy

Jan 26, 2012, 11:13:38 AM

to pfac...@googlegroups.com

Thank you.

I've read the part of the manual about multiple GPUs, and the examples are also about multi-GPU systems, but my case is simpler.

I am using the PFAC library on one GPU and trying to access one PFAC handle from different streams. So I'm trying to match different texts against one dictionary. I understand that there is no performance benefit in this case, but my application uses this architecture and it would be nice not to need anything like locks.

Right now I take a mutex before calling PFAC_matchFromDeviceReduce; the memory allocations are not protected, and everything works fine. When I remove the mutex, I get errors. So sharing the context seems to be the problem, but I don't understand whether this is expected or not.

I have one more question, though it may be a stupid one. Why does PFAC bind the texture on every call? I haven't noticed any writes to the texture.

Many thanks for your help.


Lung-Sheng Chien

Jan 27, 2012, 12:10:09 AM

to pfacForum

Thanks, now it is clear. So we should keep sorting.

@Dobrov, I will adopt your idea for the sorting step and record your
contribution in the file $(PFAC_LIB_ROOT)/NOTICE.
Is this O.K. with you?

Lung-Sheng


Lung-Sheng Chien

Jan 27, 2012, 12:25:30 AM

to pfacForum

Do you have a small example that reproduces this bug?

What is your CUDA driver version? You need 4.0 (4.1 was released today).

If you install CUDA 4.0/4.1 and it still does not work, then it is
strange and I need to track down this bug.

Basically you have only one GPU and want to process many texts.
A good choice is to run these tasks in a single thread with multiple
CUDA streams.
However, PFAC does not support multiple streams (we could do that, but
have not implemented it yet).
And if you only have one thread, then you can only rely on the
asynchronous call (matchFromDevice without texture).

I think your approach is the right way to avoid synchronous blocking.
One PFAC context should be able to be shared by several threads.

Q: Why does PFAC bind the texture on every call?
PFAC just binds the pointer of a memory block to the texture reference;
then the hardware activates the texture cache on this memory block.
You can regard the texture cache as an L1 cache, so data is read from
DRAM, then goes to the L2 cache and then to the texture cache.
Also, the texture cache is not writable; you can check the file
src/PFAC_kernel.cu, where the function tex_lookup() uses tex1Dfetch to
read data.
In our experiments, the texture cache gives us about a 10% performance gain.
@Bruce, could you confirm this?

In fact, the speedup is not a big deal, and I also don't like texture
binding because it is synchronous.

Lung-Sheng


Dobrov Grigoriy

Jan 30, 2012, 5:08:43 AM

to pfac...@googlegroups.com

I have the 4.0 driver and the 4.0 SDK from Debian packages. By the way, the PFAC makefile determines the CUDA version for building sm21 in an incorrect way. sm21 won't be built on 4.1 because of these lines:

cuda_32 := $(if $(shell nvcc -V | grep 3.2),1,)
cuda_40 := $(if $(shell nvcc -V | grep 4.0),1,)
sm_21_support := $(if $(filter 1, $(cuda_32) $(cuda_40)),1,)

I suppose a simple regexp can solve this problem in an easier and more correct way.
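
The comparison such a check needs can be sketched like this. It is written in C++ only to illustrate the logic; the makefile itself would need a shell equivalent. The sketch assumes that the output of nvcc -V contains the text "release X.Y", and that every toolkit from 3.2 onward can build sm_21, which holds for the 4.x releases discussed here. supports_sm21 is a hypothetical name.

```cpp
#include <cstdio>
#include <cstring>

// Returns true if the given `nvcc -V` output reports release 3.2 or newer.
// Assumption: the output contains "release <major>.<minor>".
bool supports_sm21(const char *nvcc_version_output)
{
    const char *p = std::strstr(nvcc_version_output, "release ");
    int major = 0;
    int minor = 0;
    if (p == NULL || std::sscanf(p, "release %d.%d", &major, &minor) != 2) {
        return false; // version string not found: be conservative
    }
    // Compare numerically instead of matching one substring per release.
    return major > 3 || (major == 3 && minor >= 2);
}
```

Comparing the parsed numbers avoids listing each release by hand, and it also avoids accidental matches, because in grep 3.2 the dot matches any character.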

I attached the test file that reproduces my problem and a simple script to build the test. To simplify the code I used C++11 features, so you need a suitable compiler. I assume that the test is built in the PFAC test directory, but I didn't want to edit your makefile, so I attached a separate script.

The test can be launched in sync or async mode (the first and only parameter is -async; the default is sync mode). If the lock on pfac_mutex is in place, both modes work fine. But if you comment out this line

std::lock_guard<std::mutex> lock(pfac_mutex);

then the async version fails. It doesn't happen every time on my computer, but often enough to reproduce. I also inserted a simple check of the available memory in the thread procedure so that unwanted threads are not started.

----

It would be great to see my name in the NOTICE file. Thank you. It is O.K.


parallel_test.7z

Cheng-Hung Lin

Jan 30, 2012, 12:08:38 PM

to pfac...@googlegroups.com

Yes, binding the PFAC table to texture memory gives an average speedup of 12% compared to global memory.

--
Sincerely,
Cheng-Hung Lin

Lung-Sheng Chien

Feb 6, 2012, 1:27:58 AM

to pfac...@googlegroups.com

Could you check whether you can install g++-4.6 on your workstation and try Dobrov's case?

I cannot install g++-4.6 on my machine, which runs Fedora 13.

So I will try to implement functionality similar to Dobrov's case on my machine.

Thanks.

Lung-Sheng


Cheng-Hung Lin

Feb 6, 2012, 8:27:09 PM

to pfac...@googlegroups.com

Are you sure gcc 4.6 can support CUDA? We have installed gcc 4.6.2 on my machine, but the error message says "unsupported GNU version! gcc 4.6 and up are not supported!"

--
Sincerely,
Cheng-Hung Lin

Lung-Sheng Chien

Feb 7, 2012, 1:57:18 AM

to pfacForum

I used the pthread library to share one PFAC context and tested
PFAC_matchFromHost(); it only works for time-driven.
For space-driven it intermittently reports a kernel launch failure.
I will look deeper into the code to find the root cause.

Basically, CUDA 4.0 shifts the focus to the GPU context, and multiple host
threads can share one GPU context.
So we need to review the PFAC library and make sure it is compatible with
CUDA 4.0.

@Dobrov, thank you for this report.

I will try my best to find the root cause as soon as possible.

Lung-Sheng


Lung-Sheng Chien

Feb 7, 2012, 1:58:25 AM

to pfacForum

You can compile the PFAC library separately with gcc-4.3 or gcc-4.4, then
compile the tester with g++-4.6.

Lung-Sheng


Dobrov Grigoriy

Feb 7, 2012, 2:43:55 AM

to pfac...@googlegroups.com

I just changed the lines in host_config.h to compile with the 4.6 compiler. I compiled the SDK examples with the 4.6 compiler and everything worked fine, and the PFAC sync tests worked too. So I don't consider the compiler version to be the problem.

I also tried building the PFAC library with 4.4, but nothing changed: the async test still failed.


Cheng-Hung Lin

Feb 7, 2012, 10:43:11 AM

to pfac...@googlegroups.com

We have installed GCC 4.6 on the GTX 480 machine. You can use sudo update-alternatives --config gcc to change the gcc version.

BTW, the PFAC library does not seem to run under CUDA 4.1.

--
Sincerely,
Cheng-Hung Lin

Lung-Sheng Chien

Feb 8, 2012, 2:18:04 AM

to pfacForum

I have made some progress.
If I turn off the texture ( PFAC_setTextureMode(handle,
PFAC_TEXTURE_OFF) ), then it never fails.
It seems that we cannot trust texture binding if multiple threads share
one PFAC context (actually, the CUDA runtime does not return any error on
texture binding, but sometimes the kernel launch fails).

@Bruce, what is the error message under CUDA 4.1?

Lung-Sheng


Dobrov Grigoriy

Feb 8, 2012, 4:50:32 AM

to pfac...@googlegroups.com

I added the lines

PFAC_status = PFAC_setTextureMode(handle, PFAC_TEXTURE_OFF);
assert(PFAC_STATUS_SUCCESS == PFAC_status);

to my test code and it still fails. If you are using my test code, did you remember to remove or comment out the line

std::lock_guard<std::mutex> lock(pfac_mutex);

near the PFAC_matchFromDeviceReduce call?

With this line in place the test works fine with or without texture binding.


Cheng-Hung Lin

Feb 8, 2012, 11:18:04 AM

to pfac...@googlegroups.com

brucelin@GTX480:~/PFAC-v1.2/bin$ ./simple_example.exe

*** stack smashing detected ***: ./simple_example.exe terminated

======= Backtrace: =========

/lib/libc.so.6(__fortify_fail+0x37)[0x7f0ca9a07217]

/lib/libc.so.6(__fortify_fail+0x0)[0x7f0ca9a071e0]

/home/brucelin/PFAC-v1.2/lib/libpfac.so(+0x56e3)[0x7f0caacc56e3]

./simple_example.exe[0x400c40]

/lib/libc.so.6(__libc_start_main+0xfd)[0x7f0ca9926c4d]

./simple_example.exe[0x400ad9]

======= Memory map: ========

00400000-00402000 r-xp 00000000 08:01 34476508 /home/brucelin/PFAC-v1.2/bin/simple_example.exe

00601000-00602000 rw-p 00001000 08:01 34476508 /home/brucelin/PFAC-v1.2/bin/simple_example.exe

008b1000-008d2000 rw-p 00000000 00:00 0 [heap]

200000000-800000000 ---p 00000000 00:00 0

7f0ca87fa000-7f0ca88f2000 r-xp 00000000 08:01 34472889 /home/brucelin/PFAC-v1.2/lib/libpfac_sm20.so

7f0ca88f2000-7f0ca8af1000 ---p 000f8000 08:01 34472889 /home/brucelin/PFAC-v1.2/lib/libpfac_sm20.so

7f0ca8af1000-7f0ca8af3000 rw-p 000f7000 08:01 34472889 /home/brucelin/PFAC-v1.2/lib/libpfac_sm20.so

7f0ca8af3000-7f0ca8b09000 r-xp 00000000 08:01 25690306 /lib/libz.so.1.2.3.3

7f0ca8b09000-7f0ca8d08000 ---p 00016000 08:01 25690306 /lib/libz.so.1.2.3.3

7f0ca8d08000-7f0ca8d09000 r--p 00015000 08:01 25690306 /lib/libz.so.1.2.3.3

7f0ca8d09000-7f0ca8d0a000 rw-p 00016000 08:01 25690306 /lib/libz.so.1.2.3.3

7f0ca8d0a000-7f0ca93fe000 r-xp 00000000 08:01 53872307 /usr/lib/libcuda.so.285.05.33

7f0ca93fe000-7f0ca95fd000 ---p 006f4000 08:01 53872307 /usr/lib/libcuda.so.285.05.33

7f0ca95fd000-7f0ca96db000 rw-p 006f3000 08:01 53872307 /usr/lib/libcuda.so.285.05.33

7f0ca96db000-7f0ca9700000 rw-p 00000000 00:00 0

7f0ca9700000-7f0ca9707000 r-xp 00000000 08:01 25691196 /lib/librt-2.11.1.so

7f0ca9707000-7f0ca9906000 ---p 00007000 08:01 25691196 /lib/librt-2.11.1.so

7f0ca9906000-7f0ca9907000 r--p 00006000 08:01 25691196 /lib/librt-2.11.1.so

7f0ca9907000-7f0ca9908000 rw-p 00007000 08:01 25691196 /lib/librt-2.11.1.so

7f0ca9908000-7f0ca9a82000 r-xp 00000000 08:01 25691173 /lib/libc-2.11.1.so

7f0ca9a82000-7f0ca9c81000 ---p 0017a000 08:01 25691173 /lib/libc-2.11.1.so

7f0ca9c81000-7f0ca9c85000 r--p 00179000 08:01 25691173 /lib/libc-2.11.1.so

7f0ca9c85000-7f0ca9c86000 rw-p 0017d000 08:01 25691173 /lib/libc-2.11.1.so

7f0ca9c86000-7f0ca9c8b000 rw-p 00000000 00:00 0

7f0ca9c8b000-7f0ca9ca1000 r-xp 00000000 08:01 25690191 /lib/libgcc_s.so.1

7f0ca9ca1000-7f0ca9ea0000 ---p 00016000 08:01 25690191 /lib/libgcc_s.so.1

7f0ca9ea0000-7f0ca9ea1000 r--p 00015000 08:01 25690191 /lib/libgcc_s.so.1

7f0ca9ea1000-7f0ca9ea2000 rw-p 00016000 08:01 25690191 /lib/libgcc_s.so.1

7f0ca9ea2000-7f0ca9eaf000 r-xp 00000000 08:01 53873855 /usr/lib/libgomp.so.1.0.0

7f0ca9eaf000-7f0caa0ae000 ---p 0000d000 08:01 53873855 /usr/lib/libgomp.so.1.0.0

7f0caa0ae000-7f0caa0af000 r--p 0000c000 08:01 53873855 /usr/lib/libgomp.so.1.0.0

7f0caa0af000-7f0caa0b0000 rw-p 0000d000 08:01 53873855 /usr/lib/libgomp.so.1.0.0

7f0caa0b0000-7f0caa132000 r-xp 00000000 08:01 25691177 /lib/libm-2.11.1.so

7f0caa132000-7f0caa331000 ---p 00082000 08:01 25691177 /lib/libm-2.11.1.so

7f0caa331000-7f0caa332000 r--p 00081000 08:01 25691177 /lib/libm-2.11.1.so

7f0caa332000-7f0caa333000 rw-p 00082000 08:01 25691177 /lib/libm-2.11.1.so

7f0caa333000-7f0caa429000 r-xp 00000000 08:01 53874351 /usr/lib/libstdc++.so.6.0.13

7f0caa429000-7f0caa629000 ---p 000f6000 08:01 53874351 /usr/lib/libstdc++.so.6.0.13

7f0caa629000-7f0caa630000 r--p 000f6000 08:01 53874351 /usr/lib/libstdc++.so.6.0.13

7f0caa630000-7f0caa632000 rw-p 000fd000 08:01 53874351 /usr/lib/libstdc++.so.6.0.13

7f0caa632000-7f0caa647000 rw-p 00000000 00:00 0

7f0caa647000-7f0caa65f000 r-xp 00000000 08:01 25691194 /lib/libpthread-2.11.1.so

7f0caa65f000-7f0caa85e000 ---p 00018000 08:01 25691194 /lib/libpthread-2.11.1.so

7f0caa85e000-7f0caa85f000 r--p 00017000 08:01 25691194 /lib/libpthread-2.11.1.so

7f0caa85f000-7f0caa860000 rw-p 00018000 08:01 25691194 /lib/libpthread-2.11.1.so

7f0caa860000-7f0caa864000 rw-p 00000000 00:00 0

7f0caa864000-7f0caa866000 r-xp 00000000 08:01 25691176 /lib/libdl-2.11.1.so

7f0caa866000-7f0caaa66000 ---p 00002000 08:01 25691176 /lib/libdl-2.11.1.so

7f0caaa66000-7f0caaa67000 r--p 00002000 08:01 25691176 /lib/libdl-2.11.1.so

7f0caaa67000-7f0caaa68000 rw-p 00003000 08:01 25691176 /lib/libdl-2.11.1.so

7f0caaa68000-7f0caaabe000 r-xp 00000000 08:01 55578399 /usr/local/cuda/lib64/libcudart.so.4.1.28

7f0caaabe000-7f0caacbe000 ---p 00056000 08:01 55578399 /usr/local/cuda/lib64/libcudart.so.4.1.28

7f0caacbe000-7f0caacbf000 r--p 00056000 08:01 55578399 /usr/local/cuda/lib64/libcudart.so.4.1.28

7f0caacbf000-7f0caacc0000 rw-p 00057000 08:01 55578399 /usr/local/cuda/lib64/libcudart.so.4.1.28

7f0caacc0000-7f0caacc8000 r-xp 00000000 08:01 34476497 /home/brucelin/PFAC-v1.2/lib/libpfac.so

7f0caacc8000-7f0caaec7000 ---p 00008000 08:01 34476497 /home/brucelin/PFAC-v1.2/lib/libpfac.so

7f0caaec7000-7f0caaec8000 rw-p 00007000 08:01 34476497 /home/brucelin/PFAC-v1.2/lib/libpfac.so

7f0caaec8000-7f0caaee8000 r-xp 00000000 08:01 25690161 /lib/ld-2.11.1.so

7f0cab0c1000-7f0cab0c7000 rw-p 00000000 00:00 0

7f0cab0e2000-7f0cab0e3000 r--s f8009000 00:05 6429 /dev/nvidia2

7f0cab0e3000-7f0cab0e4000 r--s f4009000 00:05 6418 /dev/nvidia1

7f0cab0e4000-7f0cab0e5000 r--s f3009000 00:05 6413 /dev/nvidia0

7f0cab0e5000-7f0cab0e7000 rw-p 00000000 00:00 0

7f0cab0e7000-7f0cab0e8000 r--p 0001f000 08:01 25690161 /lib/ld-2.11.1.so

7f0cab0e8000-7f0cab0e9000 rw-p 00020000 08:01 25690161 /lib/ld-2.11.1.so

Aborted

--
Sincerely,
Cheng-Hung Lin

Cheng-Hung Lin

Feb 8, 2012, 12:41:32 PM

to pfac...@googlegroups.com

We also modified host_config.h and used GCC 4.6 to compile CUDA code under CUDA 4.1.

Except for our library examples, most CUDA programs run normally.


--
Sincerely,
Cheng-Hung Lin

Lung-Sheng Chien

Feb 9, 2012, 2:45:26 AM

to pfac...@googlegroups.com

@Dobrov,

I wrote a tester, the attached multiThreads_oneGPU.cpp, to check "multiple threads on one PFAC context".

My hypothesis is that texture binding breaks this scheme.

I noticed that you set space-driven in your tester.

The space-driven version of PFAC_matchFromDeviceReduce() actually has several texture bindings, one of which is not controlled by PFAC_setTextureMode.

I modified this part so that all texture bindings are controlled by PFAC_setTextureMode.

Please replace src/PFAC_reduce_inplace_kernel.cu with the attached one.

Also put the attached multiThreads_oneGPU.cpp into the subdirectory test and add this option to the Makefile.

(You can set the macros MATCH_REDUCE_ON, TEXTURE_ON and SPACE_DRIVEN_ON to control the different paths.)

Basically, even if you don't choose TEXTURE_OFF, it sometimes works, because the PFAC library tries to bind the texture when reading the patterns; if that texture binding fails, it does not bind to the texture anymore.

In your case the pattern file is huge, so the whole state machine may not be bound to texture.

Please check whether it still fails or not.

@Bruce, please also check this part.

If texture binding is the problem, then we have narrowed down the bug. I will write a small repro and post it on the CUDA forum.

@Dobrov, I am sorry that I have not worked with your tester.

I will find another hard disk to install Ubuntu on my machine; then I should be able to compile your code.

First of all, let us verify whether texture binding is the problem or not.

By the way, I prefer to disable texture binding because it is synchronous and only gives about 10% performance.

Actually, 10% is not a big deal if the pattern set is very large.

@Dobrov, what is the performance regression in your app without texture binding?

@Bruce, can we get a large application so that we can check whether texture binding matters or not?

Lung-Sheng


multiThreads_oneGPU.cpp

PFAC_reduce_inplace_kernel.cu

Dobrov Grigoriy

Feb 9, 2012, 6:13:03 AM

to pfac...@googlegroups.com

Well, I have two pieces of news: one great and one strange.

The great one is that after your patch everything works fine.

The strange one is that I get a huge performance gain when I disable texture mode, which is the opposite of your benchmarks.

I've tested parallel PFAC matching in my app. I hoped that removing the lock around the PFAC call would win some performance, but it turned out that the test speed doesn't change at all. So now we have a correct parallel PFAC version, but without any benefit from its parallelism. Anyway, many thanks for your work.


Lung-Sheng Chien

Feb 10, 2012, 1:17:38 AM

to pfacForum

Today I was informed that texture binding is not thread-safe.
The only way to make it thread-safe is a mutex, that is,
binding the texture, launching the kernel and unbinding the texture must
all be in one critical section.
Also, this mutex must be global. The pattern is

lock( mutex) ;
bind texture ;
kernel launch;
unbind texture;
unlock(mutex);
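
The locking pattern above can be sketched on the host in C++. This is only an illustration under assumptions: bind_texture, launch_kernel and unbind_texture are hypothetical stand-ins that merely track state (they are not PFAC or CUDA calls), and a std::mutex plays the role of the global mutex.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the real CUDA steps; they only track state so the
// locking pattern can be exercised on the host.
static std::mutex g_texture_mutex;    // one process-wide mutex
static bool g_texture_bound = false;  // models the single shared texture binding
static int g_launch_count = 0;        // number of completed "kernel launches"

void bind_texture()   { assert(!g_texture_bound); g_texture_bound = true; }
void launch_kernel()  { assert(g_texture_bound);  ++g_launch_count; }
void unbind_texture() { assert(g_texture_bound);  g_texture_bound = false; }

// Bind, launch and unbind form one critical section.
void match_with_texture() {
    std::lock_guard<std::mutex> guard(g_texture_mutex);
    bind_texture();
    launch_kernel();
    unbind_texture();
}

// Calls match_with_texture() from several host threads and returns the total
// number of launches that completed.
int run_threads(int num_threads, int calls_per_thread) {
    g_launch_count = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([calls_per_thread] {
            for (int i = 0; i < calls_per_thread; ++i) match_with_texture();
        });
    }
    for (auto& w : workers) w.join();
    return g_launch_count;
}
```

Because the lock is held across all three steps, no thread can rebind or unbind the texture while another thread's kernel still depends on it. The price is that texture-mode matches from different threads are serialized.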

To keep the PFAC library backward compatible, I think a mutex
is necessary.
(In the user manual we say PFAC can be called from pthreads or OpenMP;
this is not true.)

I will try to implement the mutex, and then multithreaded PFAC should work.
In the future, we can collect more information to judge whether
texture binding is necessary.

@Dobrov, I am surprised that you got a performance gain without
texture.
What speedup do you get?

As for multithreaded PFAC, I want to mention one point:
If you only have one GPU and share it among multiple threads, then you
get some benefit from concurrent kernel launch, which is supported on
Fermi cards.
However, the power of concurrent kernel launch is not in the computation
itself but in the overlap between PCIe transfer and computation on the GPU.
In other words, you cannot exceed the hardware limit.
For example, if you run two threads on one GPU with only one PFAC
context, then you have one PFAC state machine in device memory and
two input streams in device memory. When you launch two kernels from
two threads, the NVIDIA driver keeps these two jobs in the queue
and pushes them to the GPU. However, since you have only one GPU, these
two jobs must be executed sequentially. It is almost the same as
launching the kernel twice from a single thread.
The advantage here is the overlap of one kernel executed by thread 0 with
the data transfer via PCIe by thread 1.
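
A rough timing model of the overlap described above, as a sketch under assumptions: each job is one host-to-device copy followed by one kernel, the copy of one job can overlap the kernel of another but two kernels cannot overlap, and the function names and millisecond values are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Total time when every copy and every kernel runs back to back.
double serial_time(int jobs, double copy_ms, double kernel_ms) {
    assert(jobs >= 0);
    return jobs * (copy_ms + kernel_ms);
}

// Total time when the copy of job i+1 overlaps the kernel of job i
// (a two-stage pipeline: the slower stage sets the steady-state rate).
double overlapped_time(int jobs, double copy_ms, double kernel_ms) {
    assert(jobs >= 0);
    if (jobs == 0) return 0.0;
    return copy_ms + kernel_ms + (jobs - 1) * std::max(copy_ms, kernel_ms);
}
```

In this model two jobs with a 10 ms copy and a 10 ms kernel take 40 ms serially and 30 ms overlapped. If the kernel dominates (say a 2 ms copy and an 8 ms kernel), the overlapped time approaches the pure kernel time, so the gain from extra threads is small.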

So you need to analyze your app first: what is your strategy on a
single GPU?

@Bruce, could you confirm the efficiency of texture binding in various
applications?

Lung-Sheng

Cheng-Hung Lin

Feb 10, 2012, 3:43:17 PM

to pfacForum

@Lungsheng
The code multiThreads_oneGPU.cpp tests OK on a GTX480 with CUDA 4.1
and GCC 4.6.2.
I tested the PFAC-v1.2 library with our test pattern (27754 states) and
input (192 MB).
With texture memory, the throughput is 117 Gbps.
With texture memory disabled, the throughput is 99 Gbps.

Lung-Sheng Chien

Feb 11, 2012, 1:38:20 AM

to pfacForum

I know that benchmark.
I just want more applications to confirm this point.

The texture cache can fetch 32 bytes of data rather than a whole cache line.
It should be better than a typical load; I just worry about the penalty of
a cache miss.

If we have an application whose cache hit rate is low, then we can find out
which is better.
One way to measure the cache hit rate is through viper, the visual profiler
on Linux; you can find it in CUDA 4.1.

Lung-Sheng

Dobrov Grigoriy

Mar 11, 2012, 11:28:34 AM

to pfac...@googlegroups.com

Sorry, I missed your question about the speedup. I have many short patterns (the longest is 58 chars) and a long text. The speedup when I turned texture mode off was about 30 percent with CUDA 4.0.

Now I am trying CUDA 4.1 and it seems that PFAC doesn't want to work at all on sm_21. I can't successfully perform any match because of a memory error. The strange thing is that when I enable device debug mode (the -G option to nvcc) there are no errors. I reproduced it with the same test that I posted earlier. As there is no debug info, it's hard to know exactly where the error occurs.

Do you have any ideas about what to check?

Thank you

Lung-Sheng Chien

Mar 12, 2012, 4:42:45 AM

to pfacForum

What kind of memory error do you have?

In fact, we haven't tested the PFAC library on sm21 because we don't have
such a device.

I will try to find an sm21 device to reproduce the error.

However, I am on vacation now and will be available in one week.

Lung-Sheng

Dobrov Grigoriy

Mar 12, 2012, 5:20:14 AM

to pfac...@googlegroups.com

Well, I have a GeForce 560 Ti. Anyhow, I suppose the problem is not in sm21 but in the new LLVM frontend. There are several complaints about kernels breaking after the update from 4.0 to 4.1. When I set the compiler to Open64, the code begins to work. I've read some topics about the differences between 4.1 and 4.0, and one of the most important is the number of registers used. Can that be a problem for PFAC?

After some tests and forum messages like

http://forums.developer.nvidia.com/devforum/discussion/4386/cuda-4-1-broke-my-kernel-wont-be-executed/p1

http://forums.developer.nvidia.com/devforum/discussion/4161/cuda-4-1-slow-compared-with-4-0/p1

http://groups.google.com/group/thrust-users/browse_thread/thread/50e5fdc02a7515b7

http://forums.nvidia.com/index.php?showtopic=222547

I think that it's better to wait some time before upgrading.

Thank you for your attention. Have a good vacation!

Lung-Sheng Chien

Mar 13, 2012, 10:21:49 AM

to pfacForum

The PFAC kernels use 256 threads per thread block. Even if LLVM uses
more registers per thread, we can still launch at least one thread block
per SM.
Since you have tried Open64 and it works well,
I think we still need an sm21 device to know what's going on.
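
A back-of-the-envelope check of the register argument above. The sketch assumes the compute capability 2.x limits of 32768 32-bit registers per SM and at most 63 registers per thread, and it ignores register allocation granularity, shared memory and the other occupancy limits. blocks_per_sm_by_registers is a hypothetical helper, not part of PFAC.

```cpp
#include <cassert>

// Upper bound on resident thread blocks per SM imposed by the register file
// alone (integer division: partial blocks do not count).
int blocks_per_sm_by_registers(int regs_per_sm, int threads_per_block,
                               int regs_per_thread) {
    assert(regs_per_sm > 0 && threads_per_block > 0 && regs_per_thread > 0);
    return regs_per_sm / (threads_per_block * regs_per_thread);
}
```

With 256 threads per block, even the worst case of 63 registers per thread needs 256 * 63 = 16128 registers, so the register file alone still leaves room for two blocks per SM. Under these assumptions register pressure can lower occupancy but should not prevent a launch.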
