problem with using nltk parser

aziya mehboob

Oct 16, 2017, 12:43:38 PM
to nltk-users
Hi, I am new to NLTK and am trying to extract certain key phrases from my text, which is not in English. I defined a rule using a regular expression, but it gives me the error below. Can someone tell me what the issue is?


Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nltk/chunk/regexp.py", line 1016, in parse
    chunk_struct.label()
AttributeError: 'str' object has no attribute 'label'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cvalue.py", line 43, in <module>
    candidates = main(domain_corpus, PATTERN)
  File "cvalue.py", line 32, in main
    chunks_freqs = chunk_sents(domain_sents, pos_pattern)
  File "cvalue.py", line 17, in chunk_sents
    for chk in chunker.parse(sent).subtrees():
  File "/usr/local/lib/python3.5/dist-packages/nltk/chunk/regexp.py", line 1208, in parse
    chunk_struct = parser.parse(chunk_struct, trace=trace)
  File "/usr/local/lib/python3.5/dist-packages/nltk/chunk/regexp.py", line 1018, in parse
    chunk_struct = Tree(self._root_label, chunk_struct)
  File "/usr/local/lib/python3.5/dist-packages/nltk/tree.py", line 106, in __init__
    "string" % type(self).__name__)
TypeError: Tree() argument 2 should be a list, not a string

George Orton

Oct 16, 2017, 12:54:27 PM
to nltk-...@googlegroups.com
Hello, it is a bit difficult to answer your question without seeing your code. However, the error message indicates that the entity you are performing the regex find on must be a list and not a string. Try converting the string to a list with list(string), where string is the entity you are currently performing the regex find on.


Dimitriadis, A. (Alexis)

Oct 16, 2017, 12:57:05 PM
to nltk-...@googlegroups.com
Hi Aziya,

All we can tell from the message is that somewhere there was a string where maybe a tree (or other structure) was expected. How could anyone guess what you were doing that led to this? Please distill your code down to something short that reproduces the problem, and write another email.

Alexis

Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis

George Orton

Oct 16, 2017, 1:25:05 PM
to nltk-...@googlegroups.com
Alexis's response was more to the point than mine. The argument you are supplying to Tree() must be a list, rather than, as I indicated, the argument you are providing to the regex find. Find out what type the argument you are providing to Tree() is, and if it is not a list, convert it to one. Depending on how the string is formatted, you can either enclose the entire string in square brackets ([]) or place each separate element of the string into a list one element at a time. Converting a string to a list is pretty basic Python. As Alexis states, your best bet is to provide us with a stripped-down version of your code to look at. George

aziya mehboob

Oct 16, 2017, 2:37:14 PM
to nltk-users

Thank you @Bio and @Alexis for replying. I have attached my code below. I have already tagged my corpus, and after tagging the result looks something like this: [['کچھ', 'Q'], ['روز', 'NN'], ['پہلے', 'NN'], ['ہی', 'I'], ['ریلیز', 'PN'], ['ہونے', 'VB'], ['والی', 'WALA'], ['فلم', 'NN'], ['باہو', 'PN'], ['بلی', 'PN'], ['2', 'CA'], ['دی', 'PN'], ['كنكلوژن', 'PN'], ['نے', 'P'], ['باکس', 'PN']]
I also tried enclosing the whole thing in ([]), but I still get this error.

import itertools
from collections import defaultdict

import nltk


def load_corpus():
    with open('testout_data/123.txt', 'r') as f:
        tagged_sents = f.read()
    return tagged_sents


def chunk_sents(tagged_sents, pos_pattern):
    chunk_freq_dict = defaultdict(int)
    chunker = nltk.RegexpParser(pos_pattern)
    all_chunks = list(itertools.chain.from_iterable(chunker.parse(tagged_sent) for tagged_sent in tagged_sents))
    print(all_chunks)
    #print(chunk_freq_dict)
    return chunk_freq_dict


def main(domain_corpus, pos_pattern):
    # STEP 1
    domain_sents = domain_corpus
    # Extract matching patterns
    chunks_freqs = chunk_sents(domain_sents, pos_pattern)
    return chunks_freqs


if __name__ == '__main__':
    PATTERN = r"""
       NP: {<NN.*|adj>*<NN.*>}
        """

    domain_corpus = load_corpus()
    candidates = main(domain_corpus, PATTERN)




George Orton

Oct 16, 2017, 2:51:22 PM
to nltk-...@googlegroups.com
Hi, it is not clear to me from your code where the problem lies. I would suggest using some print statements to try to localize it. Try adding these two statements just after the domain_sents = domain_corpus statement: print('domain_sents:', domain_sents) and, under that, print('type(domain_sents):', type(domain_sents)). These will tell you whether the problem originates in the domain_sents variable. If the type comes back as a string, then you know you must convert it to a list.

aziya mehboob

Oct 16, 2017, 4:36:20 PM
to nltk-users
Hi, thank you. This is the output, and yes, its type is str. That is the case both when I put it in ([]) and when I put it in [[]].
domain_sents: [['فوٹو', 'PN'], ['گرافر', 'PN'], ['انتونیو', 'PN'], ['ریپیسی', 'PN'], ['نے', 'P'], ['سنہ', 'NN'], ['2011', 'CA'], ['میں', 'P'], ['ری', 'ADJ'], ['سائیکل', 'NN'], ['ہونے', 'VB'], ['والی', 'WALA'], ['ذاتی', 'ADJ'], ['استعمال', 'NN'], ['کی', 'P'], ['اشیا', 'NN'], ['کا', 'P'], ['کچرا', 'NN'], ['جمع', 'ADJ'], ['کرنا', 'VB'], ['شروع', 'NN'], ['کیا', 'VB'], ['اور', 'CC'], ['چار', 'CA'], ['سال', 'NN'], ['بعد', 'NN'], ['ان', 'PP'], ['کی', 'P'], ['مدد', 'NN'], ['سے', 'SE'], ['طاقتور', 'ADJ'], ['تصاویر', 'NN'], ['کی', 'P'], ['ایک', 'CA'], ['سیریز', 'NN'], ['بنائی', 'VB'], ['جس', 'REP'], ['کی', 'P'], ['مدد', 'NN'], ['سے', 'SE'], ['انھوں', 'NN'], ['کے', 'P'], ['لوگوں', 'NN'], ['کے', 'P'], ['بحیثیت', 'ADV'], ['صارف', 'NN'], ['خیالات', 'NN'], ['بدلنے', 'VB'], ['کی', 'P'], ['کوشش', 'NN'], ['کی', 'P'], ['ہے', 'VB']]
type(domain_sents): <class 'str'>





Dimitriadis, A. (Alexis)

Oct 16, 2017, 4:55:37 PM
to nltk-...@googlegroups.com
I was afraid of that. Actually it is very clear from your code where the problem lies: You cannot read a tagged corpus with `open()` and `read()`.
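
To make the type problem concrete, here is a small sketch that is not part of the original code. The string `raw` is a shortened stand-in for what `f.read()` returns, and the `ast.literal_eval` step is only a stopgap that assumes the file really contains a valid Python literal; it is not a substitute for storing the corpus in a proper format.

```python
import ast

# Stand-in for the result of open(...).read(): one long str that only
# *looks* like a list of [word, tag] pairs.
raw = "[['روز', 'NN'], ['پہلے', 'NN'], ['ہی', 'I']]"

print(type(raw).__name__)  # str

# Iterating over a str yields single characters, so a loop such as
# "for tagged_sent in tagged_sents" hands one character at a time to the
# chunker, which is why it complains about receiving a string.
print(list(raw)[:3])

# Stopgap (assumption: the text is a valid Python literal): parse it back
# into real objects and turn each [word, tag] pair into a (word, tag) tuple.
parsed = ast.literal_eval(raw)
tagged_sent = [tuple(pair) for pair in parsed]
print(tagged_sent[0])
```

Wrapping the string in `[]` or calling `list()` on it does not help, because the contents remain text rather than (word, tag) tuples.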

You are passing a string to the regexp parser, but it expects a list of tagged sentences (a list of lists of (word, pos) tuples). But first you need to fix the code that created the files in `testout_data`, because the current format is useless. Please take the time to study the nltk book and find out how to write a tagged corpus to disk, and how to read it back in properly using `nltk.corpus.reader.TaggedCorpusReader`. Once your data is actually a list of tagged sentences, this code should work.
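
A minimal sketch of that round trip, under some assumptions: the file name `sample.pos`, the helper names `write_tagged_corpus` and `extract_np_chunks`, and the temporary directory are made up for illustration, while the sample sentence and the pattern are taken from this thread. It writes one sentence per line as word/TAG tokens, reads the file back with `TaggedCorpusReader`, and passes each tagged sentence to the chunker.

```python
import os
import tempfile

import nltk
from nltk.corpus.reader import TaggedCorpusReader

PATTERN = r"NP: {<NN.*|ADJ>*<NN.*>}"


def write_tagged_corpus(path, tagged_sents):
    """Write one sentence per line, each token as word/TAG (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for sent in tagged_sents:
            f.write(" ".join(f"{word}/{tag}" for word, tag in sent) + "\n")


def extract_np_chunks(corpus_root, fileids=r".*\.pos"):
    """Read the corpus back and collect the words of every NP chunk."""
    reader = TaggedCorpusReader(corpus_root, fileids)
    chunker = nltk.RegexpParser(PATTERN)
    chunks = []
    for sent in reader.tagged_sents():  # each sent is a list of (word, tag)
        tree = chunker.parse(sent)
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
            chunks.append([word for word, _tag in subtree.leaves()])
    return chunks


# Sample sentence in the list-of-(word, tag)-tuples form the chunker expects.
demo_sents = [[
    ("پاکستان", "PN"), ("کے", "P"), ("صوبہ", "NN"), ("سندھ", "NN"),
    ("سے", "SE"), ("لاپتہ", "ADJ"), ("سیاسی", "ADJ"), ("کارکنوں", "NN"),
]]

with tempfile.TemporaryDirectory() as root:
    write_tagged_corpus(os.path.join(root, "sample.pos"), demo_sents)
    np_chunks = extract_np_chunks(root)

print(np_chunks)
```

The chunks are collected inside the `with` block because `tagged_sents()` reads lazily from disk, so the files must still exist while the sentences are being iterated.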

Good luck,

Alexis

aziya mehboob

Oct 17, 2017, 7:14:32 AM
to nltk-users
Thank you, Alexis, for guiding me. I stored my tagged corpus in NLTK format, like this: پاکستان/PN کے/P صوبہ/NN سندھ/NN سے/SE لاپتہ/ADJ سیاسی/ADJ کارکنوں/NN
But now, when I try to pass it to the parser, it shows an error like this:
('کچھ', 'Q')

Traceback (most recent call last):
  File "cvalue.py", line 59, in <module>
    candidates = main(domain_corpus, PATTERN)
  File "cvalue.py", line 49, in main
    chunks_freqs = chunk_sents(domain_sents, pos_pattern)
  File "cvalue.py", line 34, in chunk_sents
    for chk in chunker.parse(sents).subtrees():

  File "/usr/local/lib/python3.5/dist-packages/nltk/chunk/regexp.py", line 1208, in parse
    chunk_struct = parser.parse(chunk_struct, trace=trace)
  File "/usr/local/lib/python3.5/dist-packages/nltk/chunk/regexp.py", line 1023, in parse
    chunkstr = ChunkString(chunk_struct)
  File "/usr/local/lib/python3.5/dist-packages/nltk/chunk/regexp.py", line 97, in __init__
    tags = [self._tag(tok) for tok in self._pieces]
  File "/usr/local/lib/python3.5/dist-packages/nltk/chunk/regexp.py", line 97, in <listcomp>
    tags = [self._tag(tok) for tok in self._pieces]
  File "/usr/local/lib/python3.5/dist-packages/nltk/chunk/regexp.py", line 107, in _tag
    raise ValueError('chunk structures must contain tagged '
ValueError: chunk structures must contain tagged tokens or trees


Dimitriadis, A. (Alexis)

Oct 17, 2017, 7:53:24 AM
to nltk-...@googlegroups.com
That looks better. Now read it with `TaggedCorpusReader`, call `.tagged_sents()` on the reader and pass the output to the chunker. You’re clearly still not doing that. Good luck.

Alexis


aziya mehboob

Oct 17, 2017, 8:49:12 AM
to nltk-users
Thank you, Alexis, but I am already doing that.
def load_corpus():
    corpus_root = os.path.abspath('../out1_data')
    mycorpus = nltk.corpus.reader.TaggedCorpusReader(corpus_root, '.*')
    #for infile in (mycorpus.fileids()):
        #print(infile)
    for sent in mycorpus.tagged_sents():
        print(sent)
        tagged_sents = sent
    return tagged_sents

def chunk_sents(tagged_sents, pos_pattern):
    chunk_freq_dict = defaultdict(int)
    chunker = nltk.RegexpParser(pos_pattern)

    chunked = []
    for s in tagged_sents:
        chunked.append(chunker.parse(s))
    print(chunked)

def main(domain_corpus, pos_pattern):
    # STEP 1
    domain_sents = domain_corpus
    #print("domain_sents:", domain_sents)
    #print("type(domain_sents):", type(domain_sents))

    # Extract matching patterns
    chunks_freqs = chunk_sents(domain_sents, pos_pattern)
    return chunks_freqs


if __name__ == '__main__':
    PATTERN = r"""
       NP: {<NN.*|ADJ>*<NN.*>}
        """

    domain_corpus = load_corpus()
    candidates = main(domain_corpus, PATTERN)

Dimitriadis, A. (Alexis)

Oct 17, 2017, 10:01:15 AM
to nltk-...@googlegroups.com
You should really get in the habit of testing your code and examining the output. Your `load_corpus()` wasn't pasted quite right, but it looks like it returns a single sentence only. Try this:

def load_corpus():
    corpus_root = os.path.abspath('../out1_data')
    mycorpus = nltk.corpus.reader.TaggedCorpusReader(corpus_root, '.*')
    return mycorpus.tagged_sents()
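
One way to examine the output before handing it to the chunker is a small shape check. This is only a sketch: `check_tagged_sents` is a hypothetical helper, not part of NLTK, and it inspects just the first few sentences.

```python
def check_tagged_sents(tagged_sents, limit=3):
    """Return True if the first `limit` items look like tagged sentences,
    i.e. each item is a sequence of (word, tag) tuples."""
    for i, sent in enumerate(tagged_sents):
        if i >= limit:
            break
        # A str or a bare (word, tag) tuple here means we were given
        # raw text or a single sentence instead of a list of sentences.
        if isinstance(sent, (str, tuple)):
            return False
        for token in sent:
            if not (isinstance(token, tuple) and len(token) == 2):
                return False
    return True


one_sentence = [("کچھ", "Q"), ("روز", "NN")]

print(check_tagged_sents(one_sentence))    # False: a single sentence
print(check_tagged_sents([one_sentence]))  # True: a list of sentences
print(check_tagged_sents("some raw text"))  # False: a plain string
```

A check like this would have flagged both earlier failures: the raw string returned by `read()` and the single sentence returned by the looping version of `load_corpus()`.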

aziya mehboob

Oct 17, 2017, 11:32:21 AM
to nltk-users
Thank you so much. It is working :)