Thank you for the code; I will check it over the weekend.
As for your questions, I will try to clarify them below.
Q1: I am trying to access the same PFAC_handle_t from the different
cpu threads and it works very unstable.
Ans: On page 15 of the user guide (PFAC_userGuide_r1.1), it says
"PFAC context binds to only one GPU context and only default stream is
used in PFAC library.
Programmers can bind multiple PFAC contexts to multiple GPU by OpenMP
library or pThread library, please see OpenMP example in
$(PFAC_LIB_ROOT)/test/omp_PFAC.cpp or pThread example in
$(PFAC_LIB_ROOT)/test/SimpleMultiGPU_pthread.cpp"
More precisely, a PFAC context is allocated on the heap, so all threads
can share the same context.
However, when it is created, a PFAC context binds to the current GPU,
checks the compute capability of that GPU, and loads the proper dynamic library.
So you must keep the same device current whenever you use a given PFAC context.
For example, the following code does not work.
-----------------------------------------------------
// suppose GPU0 is GTX480 and GPU1 is GTX280
cudaSetDevice(0); // set device 0 as current
PFAC_create( &handle ) ; // PFAC context binds to GPU0
PFAC_readPatternFromFile( handle, patternFile ); // create state machine on GPU0
cudaSetDevice(1); // set device 1 as current
// This does not work: PFAC will allocate the buffer on GPU1 and do pattern
// matching on GPU1, but the state transition table is on GPU0.
PFAC_matchFromHost( handle, ...);
-----------------------------------------------------
Do you change the GPU device during your computation?
I ask because getting "unspecified cuda failure" during cudaMalloc or even
cudaFree is not a normal situation.
Q2: I checked that there was enough device memory for multithread
matching,
I'm calling PFAC_matchFromDeviceReduce on GeForce 560 Ti. The main
question is am i able to launch PFAC in multithreading app?
Ans: Yes, PFAC is thread-safe, so you can use it in a multithreaded program.
Please check $(PFAC_LIB_ROOT)/test/SimpleMultiGPU_pthread.cpp
However, I don't recommend having many threads share one PFAC context.
Suppose you have a huge problem and want to adopt a multi-GPU solution.
There are two extreme cases.
case 1: one small pattern set and two input streams.
You can create two threads bound to two different PFAC contexts,
with each PFAC context bound to a different GPU.
Each PFAC context has the same patterns but deals with a different
input stream.
case 2: one huge pattern set and one input stream.
You can divide the pattern set into two smaller ones.
Again, you can create two threads bound to two different PFAC
contexts, with each PFAC context bound to a different GPU.
Each PFAC context has different patterns but processes the same
input stream.
If you only have one GPU and one pattern set, then you should be able
to share one PFAC context (i.e. share the same pattern set), with each
thread processing a different input stream.
However, if each thread has a different combination of pattern set and
input stream, then you cannot share a PFAC context among threads.
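The thread layout of case 1 can be sketched as follows. This is only a CPU illustration under stated assumptions, not PFAC code: StandInContext, stand_in_match and match_streams are hypothetical names, and the naive matcher (it reports the first listed pattern that starts at each position) merely stands in for a PFAC match call. The point is the ownership pattern: every thread builds its own context from the same pattern set and processes only its own input stream.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical CPU stand-in for a PFAC context; not part of the PFAC API.
struct StandInContext {
    std::vector<std::string> patterns;  // pattern k has ID k + 1 (0 means "no match")
};

// Stand-in for a match call: result[pos] is the ID of the first listed
// pattern that starts at input[pos], or 0 if none does.
void stand_in_match(const StandInContext &ctx, const std::string &input,
                    std::vector<int> &result) {
    result.assign(input.size(), 0);
    for (std::size_t pos = 0; pos < input.size(); ++pos) {
        for (std::size_t k = 0; k < ctx.patterns.size(); ++k) {
            const std::string &p = ctx.patterns[k];
            if (!p.empty() && input.compare(pos, p.size(), p) == 0) {
                result[pos] = static_cast<int>(k) + 1;
                break;
            }
        }
    }
}

// Case 1: same pattern set, one private context and one input stream per thread.
std::vector<std::vector<int>> match_streams(const std::vector<std::string> &patterns,
                                            const std::vector<std::string> &streams) {
    assert(!patterns.empty());  // nothing to match against otherwise
    std::vector<std::vector<int>> results(streams.size());
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < streams.size(); ++t) {
        workers.emplace_back([&patterns, &streams, &results, t] {
            // With real PFAC, this is where the thread would select its GPU,
            // create its own context, and keep using that same device.
            StandInContext ctx{patterns};
            // Each thread writes only to its own slot, so no locking is needed.
            stand_in_match(ctx, streams[t], results[t]);
        });
    }
    for (std::thread &w : workers) {
        w.join();
    }
    return results;
}
```

Because no context and no result buffer is shared between threads, there is nothing to synchronize apart from the final join.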
Q3: I haven't realized why all PFAC functions can't be
async?
Ans: On page 23 of the user guide, table 3 lists all main functions; only
one of them is asynchronous (PFAC_matchFromDevice() +
PFAC_TEXTURE_OFF).
The reason is cudaMalloc and texture binding.
(According to the CUDA programming guide, only the following operations are
asynchronous:
Kernel launches;
Device-device memory copies;
Host-device memory copies of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
)
I recommend multithreaded programming to work around the synchronous
launch of PFAC.
Lung-Sheng
On Jan 24, 7:34 am, Grigory Dobrov <wsl...@gmail.com> wrote:
> Thank you for answer.
>
> I tried something like this and it works fine: http://pastebin.com/unmrPyqX
>
> There are some other modifications in my code so I tried to leave only
> interesting lines. Hope, I didn't forget anything
>
> I have one more question about PFAC. I am trying to access the same
> PFAC_handle_t from the different cpu threads and it works very
> unstable. Sometimes I get "unspecified cuda failure" during cudaMalloc
> or even cudaFree but I've never seen PFAC library's error. So I
> suppose that there is some problem in PFAC_handle_t access but by this
> time I haven't found the race. Could you help me in my searches? I
> checked that there was enough device memory for multithread matching,
> I'm calling PFAC_matchFromDeviceReduce on GeForce 560 Ti. The main
> question is am i able to launch PFAC in multithreading app?
> Also I haven't realized why all PFAC functions can't be async?
>
> On Jan 20, 10:24, Lung-Sheng Chien <lungshengch...@gmail.com> wrote:
>
> > Yes, you are right.
>
> > For some reason, we sort patterns by lexicographic order prior to
> > construction of the PFAC state machine.
> > I don't remember the reason, it should be related to some works on
> > load imbalance.
>
> > Now I don't think we need to sort the patterns.
> > The only constraint is that pattern numbers should start from 1,
> > because 0 is reserved for the initial state and each pattern number
> > corresponds to a final state.
>
> > @Bruce, do you think we need to sort the patterns?
>
> > @Grigory, thanks for your opinion, we will fix this bug in the next
> > release.
> > I would appreciate it if you could share the code.
>
> > Lung-Sheng
>
> > On Jan 19, 3:11 am, Grigory Dobrov <wsl...@gmail.com> wrote:
>
> > > Hello.
>
> > > If I understand correctly, in this code PFAC sorts the patterns and then
> > > looks up where each sorted pattern was in the original order.
>
> > > // step 4: sort patterns by lexicographic order
> > > qsort( *rowPtr, rowIdxArray.size()-1, sizeof(char*),
> > >        (int (*)(const void*, const void*)) pattern_cmp ) ;
> > >
> > > *max_state_num_ptr = file_size + 1 ;
> > > *pattern_num_ptr = rowIdxArray.size() - 1 ;
> > >
> > > // step 5: compute f(final state) = patternID
> > > for( int i = 0 ; i < *pattern_num_ptr ; i++){
> > >     char *key = (*rowPtr)[i];
> > >     // find patterns whose pointer is the same as "key"
> > >     for( int j = 0 ; j < *pattern_num_ptr ; j++){
> > >         if ( key == rowIdxArray[j] ){
> > >             (*patternID_table_ptr)[i] = j + 1 ; // pattern number starts from 1
> > >             break ;
> > >         }
> > >     }
> > > }
>
> > > The loop in step 5 is O(n^2).
> > > So if *pattern_num_ptr is large (about 10^6 in my case), the
> > > initialization takes a lot of time.
> > > The simplest solution is to sort not just the patterns but structures like
> > > struct StringIndex
> > > {
> > >     char *str;
> > >     size_t index;
> > > };
>
> > > with a wrapper around the pattern_cmp function, and then just read off
> > > the corresponding index values.
>
> > > I think this improvement is quite simple and useful. I can share the
> > > code with you.
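The StringIndex idea proposed in the quoted message can be sketched as follows. This is a minimal, self-contained illustration and not the PFAC source: build_pattern_id_table is a hypothetical name, the patterns are assumed to be NUL-terminated strings, and std::strcmp stands in for the library's pattern_cmp. Sorting (pointer, index) pairs carries each pattern's original position through the sort, so the ID table can be filled in a single pass after an O(n log n) sort instead of by the O(n^2) pointer search in step 5.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Helper type following the struct proposed in the thread.
struct StringIndex {
    const char *str;    // pointer to the pattern text
    std::size_t index;  // position of the pattern in the original order
};

// Hypothetical helper: returns a table where table[i] is the 1-based number
// of the pattern that ends up at sorted position i (0 stays reserved).
std::vector<int> build_pattern_id_table(const std::vector<const char *> &patterns) {
    std::vector<StringIndex> items(patterns.size());
    for (std::size_t j = 0; j < patterns.size(); ++j) {
        assert(patterns[j] != nullptr);
        items[j] = {patterns[j], j};
    }

    // Assumption: strcmp stands in for pattern_cmp. stable_sort keeps
    // duplicate patterns in their original relative order.
    std::stable_sort(items.begin(), items.end(),
                     [](const StringIndex &a, const StringIndex &b) {
                         return std::strcmp(a.str, b.str) < 0;
                     });

    // The original index travelled with each pattern, so no search is needed.
    std::vector<int> table(items.size());
    for (std::size_t i = 0; i < items.size(); ++i) {
        table[i] = static_cast<int>(items[i].index) + 1;  // numbers start from 1
    }
    return table;
}
```

For the patterns "he", "she", "his", "hers" given in that order, the sorted order is "he", "hers", "his", "she", so the table holds 1, 4, 3, 2.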