[nltk-users] loading your own corpus


rye

Apr 20, 2010, 6:06:25 PM
to nltk-users
Could anybody tell me why I am getting errors when loading my own
corpus? I am using Windows, and it keeps returning a "file does not
exist" error.

Also, I have built HTML pages for the interface of my language
tutorial system. I am just learning Python; does anybody know how I can
get my tagger called when the user types text in the textarea and
clicks submit, so that the text is POS-tagged?

--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To post to this group, send email to nltk-...@googlegroups.com.
To unsubscribe from this group, send email to nltk-users+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nltk-users?hl=en.

Steven Bird

Apr 20, 2010, 6:11:39 PM
to nltk-users
Hi -- please show us some code that demonstrates you are at least able
to open an individual file of the corpus, using something like the
following:

>>> contents = open("full-path-to-myfile.txt").read()
>>> len(contents)
239483

If you can do this, you can progress to loading the corpus using the
methods described here:

http://nltk.googlecode.com/svn/trunk/doc/howto/corpus.html

-Steven Bird
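
The two steps described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual setup: the temporary directory, the file name `sample.txt`, and its contents are made up, and it assumes NLTK is installed. `PlaintextCorpusReader` takes a root directory and a regular expression that selects the files belonging to the corpus.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical corpus: a temporary directory holding one small text file.
corpus_root = tempfile.mkdtemp()
sample_path = os.path.join(corpus_root, "sample.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("The cat sat on the mat. It purred.")

# Step 1: confirm that the file can be opened directly.
with open(sample_path, encoding="utf-8") as f:
    contents = f.read()
print(len(contents))

# Step 2: point a corpus reader at the directory and select files by pattern.
my_corpus = PlaintextCorpusReader(corpus_root, r".*\.txt")
print(my_corpus.fileids())
print(list(my_corpus.words("sample.txt"))[:4])
```

If step 1 fails, the problem is the path itself rather than anything in NLTK, which is why it is worth checking first.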

rye

Apr 21, 2010, 5:57:34 AM
to nltk-users
Hi Steven,

I tried it and I got this:

>>> import nltk
>>> contents = open("C:\Python26\web\html\sampl.txt").read()
>>> len(contents)
1808

I also did the examples on PlaintextCorpusReader. Do you mean that I
should use it to load my own corpus?

Best
Raula
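
A common cause of "file does not exist" errors on Windows is worth noting here. The path above opens because none of `\P`, `\w`, `\h` or `\s` is a recognized escape sequence, but a path containing `\t` or `\n` would be silently altered. The sketch below uses a made-up path (`C:\temp\new.txt`) to show the difference; whether this was the cause of the original error is not known from the thread.

```python
# Hypothetical Windows path, chosen because it contains \t and \n.
plain = "C:\temp\new.txt"    # \t becomes a tab, \n becomes a newline
raw = r"C:\temp\new.txt"     # raw string: backslashes are kept literally
forward = "C:/temp/new.txt"  # forward slashes are also accepted on Windows

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))
```

Writing Windows paths as raw strings, or with forward slashes, avoids the problem regardless of which letters follow the backslashes.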


iobike...@gmail.com

Jan 19, 2017, 1:17:25 PM
to nltk-users, ry...@yahoo.com
Hello, I am able to load the file, but I do not understand how to find out how many words there are in the document, or why open() is not following the path I copied and pasted from Finder on my Mac.

Below are some of my attempts to get the file to load.


>>> wordlists.words('THE_VOYAGE_OUT_Woolf.txt')
[u'Chapter', u'I', u'As', u'the', u'streets', u'that', ...]
>>> contents = open("Users/harleyhegel/Documents/f_verbs/Room_Of_Ones_Own_Woolf.txt").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'Users/harleyhegel/Documents/f_verbs/Room_Of_Ones_Own_Woolf.txt'

iobike...@gmail.com

Jan 19, 2017, 1:17:30 PM
to nltk-users
Yet this worked:

>>> len(wordlists.words('THE_VOYAGE_OUT_Woolf.txt'))
167425

Dimitriadis, A. (Alexis)

Jan 19, 2017, 4:46:41 PM
to nltk-...@googlegroups.com
You need to understand that you have been mixing two different ways of accessing data. 

`open()` is Python's built-in function for opening files through the operating system; you can use it to access any file, but you must tell it where the file is. Your `open()` call failed because of a small error: you must add a slash at the very beginning of the path, like this:

contents = open("/Users/harleyhegel/Documents/f_verbs/Room_Of_Ones_Own_Woolf.txt").read()

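Without the initial slash, the path is interpreted as a “relative path”, which means it is resolved against the current working directory of the Python process (the directory from which you started Python, unless it has been changed), not against the root of the file system.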
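
To see how a relative path would be resolved, the standard `os` module can be asked directly. A small sketch, using the path from the traceback above; the output depends on where Python was started, so none is shown.

```python
import os

relative = "Users/harleyhegel/Documents/f_verbs/Room_Of_Ones_Own_Woolf.txt"
absolute = "/" + relative  # the corrected path, with the leading slash

# The directory that relative paths are resolved against.
print(os.getcwd())

# False for the path without the leading slash.
print(os.path.isabs(relative))

# The full location that open(relative) would actually look for.
print(os.path.abspath(relative))
```

Printing `os.path.abspath(...)` before calling `open()` is a quick way to check whether Python is looking where you expect.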

Your call to `wordlists.words()`, on the other hand, accesses one of the files in the `wordlists` corpus. The nltk “knows” where its corpora are, so you don’t need to provide a path; just the filename (or no name at all, if you want all the files in a corpus). Note again that these are two separate ways of accessing data, and they support different options. `contents` is a single string with the entire contents of the file, while `corpus.words()` returns a list of tokens (words and punctuation). Read the nltk book with these distinctions in mind, to find out what you can do with nltk corpora.
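
The difference between a single string and a list of tokens can be illustrated without NLTK. This is only a sketch: the sample text is made up, and the regular expression is, to my knowledge, the pattern behind NLTK's `WordPunctTokenizer`, which plaintext corpus readers use by default; treat that as an assumption and check your NLTK version.

```python
import re

# Made-up sample text standing in for the contents of a file.
contents = "Hello, world. Hello again!"

# len() of a string counts characters, including spaces and punctuation.
print(len(contents))

# Split into runs of word characters or runs of punctuation.
tokens = re.findall(r"\w+|[^\w\s]+", contents)
print(tokens)

# len() of the token list counts words and punctuation marks.
print(len(tokens))
```

This is why `len(contents)` and `len(wordlists.words(...))` give very different numbers for the same file: the first counts characters, the second counts tokens.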

Alexis

