Problem in loading the corpora


Freeda

Nov 23, 2011, 1:51:42 PM
to nltk-users, free...@gmail.com
Hi,
I am using the ConceptNet code, which imports the nltk module. I am
running the pcfgpattern.py file, which loads a tagged corpus like
this:
treebank_brown = LazyCorpusLoader(
    'treebank/combined', BracketParseCorpusReader, r'c.*\.mrg')

When I run this code, it prints the following line and then throws an error:
<BracketParseCorpusReader in '.../corpora/treebank/combined' (not loaded yet)>
Traceback (most recent call last):
  File "C:\Documents and Settings\personal\Desktop\ConceptNet-4.0rc2\ConceptNet-4.0rc2\csc\corpus\parse\pcfgpattern.py", line 472, in <module>
    theunigrams = UnigramProbDist.from_treebank()
  File "C:\Documents and Settings\personal\Desktop\ConceptNet-4.0rc2\ConceptNet-4.0rc2\csc\corpus\parse\pcfgpattern.py", line 386, in from_treebank
    for sent in treebank_brown.tagged_sents():
  File "C:\Python26\lib\site-packages\nltk\corpus\reader\api.py", line 401, in tagged_sents
    for fileid, enc in self.abspaths(fileids, True)])
  File "C:\Python26\lib\site-packages\nltk\corpus\reader\util.py", line 421, in concat
    raise ValueError('concat() expects at least one object!')
ValueError: concat() expects at least one object!

So how should I proceed? I badly need to run this code.

I also wanted to ask: is there any difference between treebank and
treebank_brown?

You can see the code which I am using in this link:
http://nullege.com/codes/show/src@c@o...@ConceptNet-4.0rc4@csc@corpus@pa...@pcfgpattern.py

--Freeda

David Gerő

Nov 23, 2011, 4:06:12 PM
to nltk-users
Hi Freeda,

Does the .../corpora/treebank/combined directory exist on your computer?
I ask because your error is raised in util.py, in the concat() function
used with ConcatenatedCorpusView objects:
    if len(docs) == 0:
        raise ValueError('concat() expects at least one object!')

The Brown corpus and the Penn Treebank differ in many ways. For example,
the Brown corpus dates from 1964 (see the manual: http://icame.uib.no/brown/bcm.html)
and the Penn Treebank from 1992 (http://www.cis.upenn.edu/~treebank/home.html).

Best regards,
David
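
A minimal sketch of how that check gets triggered, assuming (as the traceback suggests) that tagged_sents() builds one object per matching file and hands the list to concat(). The `concat` and `tagged_sents` functions below are simplified stand-ins written for illustration, not NLTK's real implementations, and the sample sentence is made up.

```python
def concat(docs):
    """Simplified stand-in for the check quoted above (not NLTK's real function)."""
    if len(docs) == 0:
        raise ValueError('concat() expects at least one object!')
    # Flatten the per-file lists into a single list of sentences.
    return [item for doc in docs for item in doc]


def tagged_sents(files):
    """Hypothetical reader method: one list of sentences per file, then concat."""
    return concat([sents for _name, sents in files])


# One matching file: works.
one_file = [('wsj_0001.mrg', [[('Pierre', 'NNP'), ('Vinken', 'NNP')]])]
print(tagged_sents(one_file))

# No matching files: the list handed to concat() is empty, so it raises.
try:
    tagged_sents([])
except ValueError as err:
    print('ValueError:', err)
```

If that assumption holds, the error message says nothing about the corpus content itself; it only says that zero files were selected.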

Freeda Dsouza

Nov 23, 2011, 10:40:43 PM
to nltk-...@googlegroups.com
Hi,
Thanks for the reply, David.
The error is because of this line:
treebank_brown = LazyCorpusLoader(
    'treebank/combined', BracketParseCorpusReader, r'c.*\.mrg')
Here it is trying to load the corpus using LazyCorpusLoader. I think the result will be null because it is not loaded.
The result is used later, in the function that contains:
    for sent in treebank_brown.tagged_sents():
I assume treebank_brown is null at this point, and that is why I am getting the error. So the problem starts at the first line, where treebank/combined is loaded.
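
On the "null" assumption: the line printed before the traceback in the first post, with its "(not loaded yet)" note, suggests the loader is a real object that postpones the actual loading until it is first used, rather than a null value. A minimal sketch of that lazy-proxy idea follows. `LazyLoader` and `FakeReader` are hypothetical names made up for illustration; this is not NLTK's implementation.

```python
class FakeReader:
    """Hypothetical stand-in for a corpus reader."""

    def __init__(self, root, pattern):
        self.root = root
        self.pattern = pattern

    def describe(self):
        return 'reader for %s matching %s' % (self.root, self.pattern)


class LazyLoader:
    """Hypothetical lazy proxy: builds the real reader on first attribute access."""

    def __init__(self, factory, *args):
        self._factory = factory
        self._args = args
        self._reader = None  # nothing has been built yet

    def __getattr__(self, name):
        # Python calls this only for attributes not found the normal way,
        # for example .describe, which lives on the real reader.
        if self._reader is None:
            self._reader = self._factory(*self._args)
        return getattr(self._reader, name)

    def __repr__(self):
        state = 'loaded' if self._reader is not None else 'not loaded yet'
        return '<%s in %r (%s)>' % (self._factory.__name__, self._args[0], state)


loader = LazyLoader(FakeReader, 'treebank/combined', r'c.*\.mrg')
print(loader is None)      # False
print(repr(loader))        # still reports "not loaded yet"
print(loader.describe())   # first real use builds the reader
print(repr(loader))        # now reports "loaded"
```

Under this reading, the failure is not caused by the object being empty at definition time; it shows up only when the reader is actually asked for sentences.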
------------------------------------------------------------------------------------------------------------------------------------------------------------
Yes, the .../corpora/treebank/combined folder exists on my computer. I noticed that the files in that folder are named "wsj_*.mrg", but the regex given when loading treebank/combined, "c.*\.mrg", expects names that start with "c", and no such files exist in that folder.

One more thing I noticed: even if I assume the regex is meant for the Brown corpus, it is still wrong, because the files in the brown folder do not end with ".mrg".

Waiting for your reply,
Thanks,
--Freeda
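
The mismatch described above can be checked directly with Python's `re` module. This is a small sketch, assuming file names are compared against the whole pattern. The file names are illustrative examples of the naming described in this thread, not a real directory listing.

```python
import re

PATTERN = r'c.*\.mrg'   # the pattern from the LazyCorpusLoader call


def matching(names, pattern):
    """Return the names that the pattern matches in full."""
    return [name for name in names if re.fullmatch(pattern, name)]


# Illustrative names only: "wsj_" + digits + ".mrg", as described above.
treebank_names = ['wsj_0001.mrg', 'wsj_0002.mrg', 'wsj_0003.mrg']
# Illustrative Brown-style names: they start with "c" but have no extension.
brown_names = ['ca01', 'ca02', 'cb01']

print(matching(treebank_names, PATTERN))          # []
print(matching(brown_names, PATTERN))             # []
print(matching(treebank_names, r'wsj_.*\.mrg'))   # all three names
```

Both lists come back empty for the original pattern, which is consistent with the empty list that concat() complains about.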


--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To post to this group, send email to nltk-...@googlegroups.com.
To unsubscribe from this group, send email to nltk-users+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nltk-users?hl=en.


Alexis Dimitriadis

Nov 24, 2011, 9:02:09 AM
to nltk-...@googlegroups.com
Looks like it's failing to find any files that match the path and
pattern you gave. Are you sure there are files starting with "c" in the
treebank/combined folder?

Best,

Alexis

Freeda Dsouza

Nov 24, 2011, 11:07:49 AM
to nltk-...@googlegroups.com
No. In the treebank/combined folder, the file names start with "wsj_", followed by numbers and a ".mrg" extension.
So I really don't know what the pattern should be.
The files in the brown folder start with "c", but they have no extension.

--Freeda


Alexis Dimitriadis

Nov 24, 2011, 11:41:34 AM
to nltk-...@googlegroups.com

On 24/11/2011 17:07, Freeda Dsouza wrote:
> No. In the treebank/combined folder, the file names start with "wsj_", followed by numbers and a ".mrg" extension.
> So I really don't know what the pattern should be.
> The files in the brown folder start with "c", but they have no extension.
>
> --Freeda


You're asking the function to load files from the treebank folder; it'll never see the brown folder unless you give it the appropriate path.

> treebank_brown = LazyCorpusLoader(
>     'treebank/combined', BracketParseCorpusReader, r'c.*\.mrg')
> I wanted to ask, is there any difference between treebank and
> treebank_brown?

treebank_brown is just a variable you define; you could call it crown_jewels and it would make no difference. It's just a place to keep the object that LazyCorpusLoader returns. Solution: read up on the basics of Python, so you can understand and modify the examples appropriately.

Good luck,

Alexis




Freeda Dsouza

Nov 24, 2011, 11:54:43 AM
to nltk-...@googlegroups.com
Yes, I understand that it is a variable. But I have a doubt because of this:
treebank_brown = LazyCorpusLoader(
    'treebank/combined', BracketParseCorpusReader, r'c.*\.mrg')
It loads the treebank/combined folder, and I assume it takes the files that match the regex above.
My point is that the treebank/combined folder doesn't contain any files that start with 'c'.
You can go through the code I am using:

http://nullege.com/codes/show/src@c@o...@ConceptNet-4.0rc4@csc@corpus@pa...@pcfgpattern.py
  
Here, to generate the unigrams, it calls this function:
def from_treebank(klass):
    from nltk.corpus import brown, treebank
    probdist = klass()
    for sent in treebank.tagged_sents():  # this loop works properly
        for word, tag in sent:
            probdist.inc(word.lower(), tag)
    for sent in treebank_brown.tagged_sents():  # this loop raises the error
        for word, tag in sent:
            probdist.inc(word.lower(), tag)
    for word, tag in get_lexicon():
        probdist.inc(word, tag, closed_class=False)
    for i in range(10): probdist.inc('can', 'VB')
    return probdist
My assumption is that it may not be able to load the corpus, so the result is null and it cannot call the tagged_sents function.
So what should I do now? I didn't write this code; it is from ConceptNet, and I am using one file from it.

--Freeda
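
One possible workaround, sketched under the assumption that skipping the missing files is acceptable for your purposes: catch the ValueError and treat that corpus as empty, so the counts come only from the data that is actually installed. `safe_tagged_sents` and the two small corpus classes are hypothetical names for illustration. This is not ConceptNet's or NLTK's own fix, and the resulting counts will differ from what the original code intended, since that code evidently expects treebank files whose names start with "c".

```python
def safe_tagged_sents(corpus):
    """Return corpus.tagged_sents(), or [] if the corpus matched no files."""
    try:
        return corpus.tagged_sents()
    except ValueError as err:
        # In this thread, this is raised when no file matches the pattern.
        print('skipping corpus: %s' % err)
        return []


class EmptyCorpus:
    """Hypothetical stand-in for a loader whose pattern matches nothing."""

    def tagged_sents(self):
        raise ValueError('concat() expects at least one object!')


class TinyCorpus:
    """Hypothetical stand-in for a corpus that loads fine."""

    def tagged_sents(self):
        return [[('The', 'DT'), ('cat', 'NN')]]


# Count (word, tag) pairs across both corpora, skipping the empty one.
counts = {}
for corpus in (TinyCorpus(), EmptyCorpus()):
    for sent in safe_tagged_sents(corpus):
        for word, tag in sent:
            key = (word.lower(), tag)
            counts[key] = counts.get(key, 0) + 1

print(counts)
```

In from_treebank, the second loop would then iterate over safe_tagged_sents(treebank_brown) instead of treebank_brown.tagged_sents(). Note that catching ValueError this broadly could also hide unrelated errors, so it is better treated as a stopgap than as a fix.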