running path_similarity in parallel fails

rtwomey

Apr 10, 2013, 9:47:18 AM
to nltk...@googlegroups.com
Hi, 

I'm searching a set of sentences to find those that match search terms with a path_similarity() value above a threshold. This works well as a linear, single-process search, but it is rather slow. I was hoping to speed it up by splitting the data set into chunks and testing them in parallel.

In the attached code, the function that finds related sentences, findrelatedMP(), succeeds when I run it with only one process (serial execution):

rtwomeys-work-object-4:nltk2.0 rtwomey$ python find_related_exampleMP.py 

sentences:  ["OK.  I've got a ham sandwich here.  Is it energy or is it information?", 'where in the field of activity can you derive pleasure.', 'I keep wanting to run into kate.', 'school made more sense in high school.', "oil, chemical and atomic workers int'l union.", 'find some teeth, buddy.', 'donald judd had teeth.', "there's a part of me that goes out and meets something in each of these things that I see.  why am I so eager to identify?", "I don't think it's ever been exhausted, that sense of potential.", 'life as an uncoverable aesthetic phenomenon.'] 

search terms:  ['teeth', 'pleasure'] 

searching... * * * 
3 matches:
find some teeth, buddy.
where in the field of activity can you derive pleasure.
donald judd had teeth.

parallel: time elapsed: 1.11793398857

It fails when I try to execute with more than one process:

rtwomeys-work-object-4:nltk2.0 rtwomey$ python find_related_exampleMP.py 

sentences:  ["OK.  I've got a ham sandwich here.  Is it energy or is it information?", 'where in the field of activity can you derive pleasure.', 'I keep wanting to run into kate.', 'school made more sense in high school.', "oil, chemical and atomic workers int'l union.", 'find some teeth, buddy.', 'donald judd had teeth.', "there's a part of me that goes out and meets something in each of these things that I see.  why am I so eager to identify?", "I don't think it's ever been exhausted, that sense of potential.", 'life as an uncoverable aesthetic phenomenon.'] 

search terms:  ['teeth', 'pleasure'] 

searching...Process Process-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "find_related_exampleMP.py", line 42, in worker
    result = findRelatedStatementsSyns(sent, term_syns)
  File "find_related_exampleMP.py", line 27, in findRelatedStatementsSyns
    if checkSyns(word, term_syns, wn.NOUN):
  File "find_related_exampleMP.py", line 13, in checkSyns
    for syns in wn.synsets(word):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1201, in synsets
    for offset in index[form].get(p, [])]
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1059, in _synset_from_pos_and_offset
    synset = self._synset_from_pos_and_line(pos, data_file_line)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1159, in _synset_from_pos_and_line
    raise WordNetError('line %r: %s' % (data_file_line, e))
WordNetError: line 'n 0000 ~ 05808794 n 0000 | the cognitive processes involved in producing and understanding linguistic communication; "he didn\'t have the language to express his feelings"  \n': invalid literal for int() with base 10: 'n'
 + +^C
Traceback (most recent call last):
  File "find_related_exampleMP.py", line 133, in <module>
    main()
  File "find_related_exampleMP.py", line 117, in main
    results = findrelatedMP(sentences, 2, term_syns)
  File "find_related_exampleMP.py", line 70, in findrelatedMP
    resultdict.update(out_q.get())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 117, in get
    res = self._recv()
 

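It looks like _synset_from_pos_and_offset in wordnet.py seeks to an offset in the POS data file and then reads a line from it. If my two (or more) processes share that file's read position, one process can move it between another's seek and read, which would explain the malformed line in the WordNetError above.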
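
The suspected interference can be reproduced without NLTK. The sketch below is only an illustration under an assumption: that a file is opened once, before the worker processes are forked, so parent and child inherit the same descriptor. Whether NLTK's WordNet reader does exactly this is not confirmed here. The function name shared_offset_after_fork and the temporary file are hypothetical, and os.fork restricts it to POSIX systems.

```python
import os
import tempfile


def shared_offset_after_fork():
    """Return the parent's file offset after a forked child seeks
    on the same inherited descriptor (hypothetical demo helper)."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789" * 10)  # 100 bytes of dummy data
        path = tmp.name

    # Opened BEFORE the fork, so the child inherits this descriptor.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, 5, os.SEEK_SET)  # parent positions itself at byte 5
        pid = os.fork()
        if pid == 0:
            # Child: move the offset on the inherited descriptor, then exit.
            os.lseek(fd, 50, os.SEEK_SET)
            os._exit(0)
        os.waitpid(pid, 0)
        # Parent: ask for the current offset without moving it.
        return os.lseek(fd, 0, os.SEEK_CUR)
    finally:
        os.close(fd)
        os.unlink(path)
```

On a POSIX system the parent sees offset 50 rather than the 5 it set, because a descriptor inherited across fork shares one file offset between both processes. A seek-then-read sequence in one process can therefore be disturbed by a seek in the other.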

Does this mean the WordNet corpus reader, and with it path_similarity(), is not safe to use from multiple processes?

Do you have any suggestions for using path_similarity() in parallel?
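
One possible direction, offered as a stdlib-only sketch rather than a confirmed fix for NLTK: let each worker open its own file handle after it has started, so that no read position is shared. The names worker, lookup_parallel, demo and RECORD_SIZE are hypothetical, as is the fixed-width record file standing in for a WordNet data file. The out_q queue mirrors the one that appears in the traceback above. The "fork" start method is POSIX only.

```python
import multiprocessing as mp
import os
import tempfile

RECORD_SIZE = 8  # hypothetical fixed-width records, e.g. b"rec0042\n"


def worker(path, indexes, out_q):
    # Open the file INSIDE the worker: this process gets its own
    # handle and therefore its own independent read offset.
    found = {}
    with open(path, "rb") as fh:
        for i in indexes:
            fh.seek(i * RECORD_SIZE)
            found[i] = fh.read(RECORD_SIZE).decode("ascii")
    out_q.put(found)


def lookup_parallel(path, indexes, nprocs=2):
    ctx = mp.get_context("fork")  # raises ValueError where fork is unavailable
    out_q = ctx.Queue()
    chunks = [indexes[i::nprocs] for i in range(nprocs)]
    procs = [ctx.Process(target=worker, args=(path, chunk, out_q))
             for chunk in chunks]
    for p in procs:
        p.start()
    results = {}
    for _ in procs:
        results.update(out_q.get())  # collect before join to avoid blocking
    for p in procs:
        p.join()
    return results


def demo():
    # Build a throwaway file of 100 fixed-width records.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for i in range(100):
            tmp.write(("rec%04d\n" % i).encode("ascii"))
        path = tmp.name
    try:
        return lookup_parallel(path, list(range(100)))
    finally:
        os.unlink(path)
```

If the shared-offset guess is right, the analogous step for the WordNet case would be to make sure each worker process ends up with its own open data files rather than ones inherited from the parent. That has not been tested against NLTK here.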

Thanks!

Robert
find_relatedMP.py