Please help, new to nltk/python. Importing own corpora


Sania

Apr 8, 2012, 1:51:34 PM
to nltk-users
Hi Everyone!
Well I am just learning nltk and I have a problem importing my own
corpus.
I looked at chapter 2 in the book but when I try to follow the way
they teach it, I get errors.
Here is what I did.


>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root= 'Spring2012/work'
>>> reuterarticle=PlaintextCorpusReader(corpus_root, '.*')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/corpus/reader/plaintext.py", line 61, in __init__
    CorpusReader.__init__(self, root, fileids, encoding)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/corpus/reader/api.py", line 82, in __init__
    root = FileSystemPathPointer(root)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/data.py", line 155, in __init__
    raise IOError('No such file or directory: %r' % path)
IOError: No such file or directory: '/Users/Sania/Spring2012/work'



Then I thought, since it says that the second parameter in the
PlaintextCorpusReader is supposed to be an initializer, I can enter


>>> reuterarticle=PlaintextCorpusReader(corpus_root, '.txt\*')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/corpus/reader/plaintext.py", line 61, in __init__
    CorpusReader.__init__(self, root, fileids, encoding)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/corpus/reader/api.py", line 82, in __init__
    root = FileSystemPathPointer(root)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/data.py", line 155, in __init__
    raise IOError('No such file or directory: %r' % path)
IOError: No such file or directory: '/Users/Sania/Spring2012/work'


because the file name for my corpus is reuters1article.txt, so
the * means 0 or more .txt files within the root that I defined, so
within that folder... right?

What am I doing wrong? Any help would be appreciated.

Sania

Apr 8, 2012, 2:16:59 PM
to nltk-users
ok so now I changed it to...
>>> file_pattern=r".*\.txt"
>>> reuterarticle=PlaintextCorpusReader(corpus_root, file_pattern)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/corpus/reader/plaintext.py", line 61, in __init__
    CorpusReader.__init__(self, root, fileids, encoding)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/corpus/reader/api.py", line 82, in __init__
    root = FileSystemPathPointer(root)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/data.py", line 155, in __init__
    raise IOError('No such file or directory: %r' % path)
IOError: No such file or directory: '/Spring2012/work'


with the ".*\.txt" meaning any sequence of characters followed by .txt,
but I still get an error :(
Any ideas would be appreciated.
Thanks
Sania
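
For readers comparing the three file patterns tried in this thread, here is a minimal sketch of how each one behaves as a regular expression. It assumes the pattern has to match the whole relative file name, and it uses `re.fullmatch` from Python 3 rather than the Python 2.5 shown in the tracebacks. Note that in each traceback the IOError is raised at `FileSystemPathPointer(root)`, so the pattern itself is not what triggers the error.

```python
import re

# File name mentioned earlier in the thread.
filename = 'reuters1article.txt'

patterns = [
    r'.*',       # any sequence of characters, so any file name
    r'.txt\*',   # one character, then "txt", then a literal "*"
    r'.*\.txt',  # any sequence of characters, then a literal ".txt"
]

# Assumption: the pattern must match the entire file name.
results = {p: re.fullmatch(p, filename) is not None for p in patterns}

for pattern, matched in results.items():
    print(pattern, '->', matched)
```

Under that assumption, '.*' and '.*\.txt' would both pick up reuters1article.txt, while '.txt\*' would not, because the escaped star is a literal character rather than a wildcard.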

Kristina Striegnitz

Apr 8, 2012, 2:27:10 PM
to nltk-...@googlegroups.com
Hi Sania,

I think the problem may be that it is not finding the corpus_root path
that you are specifying. Notice how the very end of the error
message says: "IOError: No such file or directory:
'/SeniorSpring2012/CDS490'". Is that the correct and full path to the
directory containing your corpus files? It looks as if something may be
missing at the beginning of the path.

Kristina
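
One way to check this before constructing the reader is to resolve the path and confirm the directory exists. The sketch below is only an illustration: `resolve_corpus_root` is a hypothetical helper, not part of NLTK, and it is written for Python 3. The first traceback suggests that a relative root such as 'Spring2012/work' was resolved against the current working directory, which is why the error mentions '/Users/Sania/Spring2012/work'.

```python
import os


def resolve_corpus_root(corpus_root):
    """Return corpus_root as an absolute path, or raise if it is not a directory.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # Expand "~" and resolve a relative path against the current directory.
    expanded = os.path.abspath(os.path.expanduser(corpus_root))
    if not os.path.isdir(expanded):
        raise IOError('No such directory: %r (current directory is %r)'
                      % (expanded, os.getcwd()))
    return expanded


# Possible usage, assuming NLTK is installed and the directory exists:
# from nltk.corpus import PlaintextCorpusReader
# root = resolve_corpus_root('Spring2012/work')
# reuterarticle = PlaintextCorpusReader(root, r'.*\.txt')
```

If the helper raises, the message shows both the resolved path and the current directory, which makes a missing leading part of the path easier to spot.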

On Sun, Apr 8, 2012 at 2:14 PM, Sania <fantas...@gmail.com> wrote:
> ok so now I changed it to...
>
>>>> file_pattern=r".*\.txt"
>>>> reuterarticle=PlaintextCorpusReader(corpus_root, file_pattern)

> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/corpus/reader/plaintext.py", line 61, in __init__
>     CorpusReader.__init__(self, root, fileids, encoding)
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/corpus/reader/api.py", line 82, in __init__
>     root = FileSystemPathPointer(root)
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nltk/data.py", line 155, in __init__
>     raise IOError('No such file or directory: %r' % path)
> IOError: No such file or directory: '/SeniorSpring2012/CDS490'


>
> with the ".*\.txt" meaning any character followed by .txt
> but I still get an error :(
>
> Any ideas would be appreciated
>
> Thanks
> Sania
>


Sania

Apr 8, 2012, 2:34:29 PM
to nltk-users
YAY! Thank you so much!

Shikha Singh

Jan 4, 2015, 2:39:37 AM
to nltk-...@googlegroups.com
I am able to load one corpus, but it's not working with other corpora. Any help will be appreciated.