Creating and Storing a Corpus for Analysis

J Price

Nov 8, 2012, 2:24:33 PM
to nltk-...@googlegroups.com
I am new to NLTK but not new to coding. I would like to create a corpus and analyze its content. I have found information on reading and creating a corpus, but not on writing out the corpus once created. My plan is to go as far as chunking the data and then store it, so that I can analyze sentence and paragraph meanings.
 
I am able to use NLTK to read the .txt file (other file types to follow). I can read in the base text, process the tagging, and write out a new file to save the work. Once I get to chunking the data, the result comes back in tree format, and I am unsure of the best way to store this new data in a file. There is plenty of documentation on NLTK parsing, but little on creating a corpus for analysis. Where can I get more information on storing my corpus? It is for my own use only at the moment. What file format or database will provide the greatest flexibility in my analysis?
 
Thanks

nawafpower

Nov 24, 2012, 10:34:32 AM
to nltk-...@googlegroups.com
Same here. I have tried to create a folder inside the corpora folder and to access it with
from nltk.corpus import myCorpus

but it always failed. What I'm doing so far is moving the Gutenberg files to another folder and putting my own files inside gutenberg; this way I can access my files. But I'm wondering why my own folder didn't work, and if it can be made to, would anyone please let me know how.

Thanks,

Nawaf

Alexis Dimitriadis

Nov 24, 2012, 3:16:02 PM
to nltk-...@googlegroups.com
On 08/11/2012 20:24, J Price wrote:
I am able to use NLTK to read the .txt file (other file types to follow). I can read in the base text, process the tagging, and write out a new file to save the work. Once I get to chunking the data, the result comes back in tree format, and I am unsure of the best way to store this new data in a file.

On 24/11/2012 16:34, nawafpower wrote:
Same here. I have tried to create a folder inside the corpora folder and to access it with
from nltk.corpus import myCorpus


Dear Joseph and Nawaf,

NLTK corpora are stored as collections of text files. The NLTK corpus functionality is organized as a number of reader classes for various file formats.  You'll find them in nltk.corpus.reader. The nltk.corpus module also provides shortcuts to the corpora in nltk_data; they just launch the appropriate reader class with the path to the corpus files. But new corpora don't magically appear as objects in nltk.corpus; to read your own, instantiate the appropriate reader class. For example, in nltk/corpus/__init__.py you'll find the following:

    gutenberg = LazyCorpusLoader(
        'gutenberg', PlaintextCorpusReader, r'(?!\.).*\.txt')

PlaintextCorpusReader is imported from nltk.corpus.reader, where all the other reader classes can be found. You can use it directly without relying on LazyCorpusLoader; check the documentation.
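
For instance, something like this should work (a quick sketch, untested; the path is a placeholder for wherever your own .txt files live):

    from nltk.corpus import PlaintextCorpusReader

    # point the reader at your own directory of plain-text files;
    # the second argument is a regular expression selecting the files
    my_corpus = PlaintextCorpusReader('/path/to/my_corpus', r'.*\.txt')

    print(my_corpus.fileids())
    print(my_corpus.words()[:20])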

But indeed there's no support for writing corpora in the various supported formats. To do that, find a corpus that's similar to yours and emulate its format. You can then use the same reader to read your corpus. (For example, a look at the Brown corpus files reveals that they consist of space-separated tokens in the format word/tag.)
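
To make that concrete, here is a rough sketch (untested, with made-up file names): write your tagged sentences in that word/tag format, then read them back with TaggedCorpusReader, the class NLTK uses for Brown-style corpora.

    import os
    from nltk.corpus.reader import TaggedCorpusReader

    # 'tagged' stands in for your own tagged sentences, e.g. from nltk.pos_tag
    tagged = [[('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD')]]

    os.makedirs('my_tagged_corpus', exist_ok=True)
    with open('my_tagged_corpus/doc1.txt', 'w') as f:
        for sent in tagged:
            # Brown-style: space-separated word/tag tokens, one sentence per line
            f.write(' '.join('{}/{}'.format(w, t) for w, t in sent) + '\n')

    # read it back with the same reader class used for Brown-style corpora
    reader = TaggedCorpusReader('my_tagged_corpus', r'.*\.txt')
    print(reader.tagged_sents()[:1])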

I realize I haven't given you the exact chunked corpus format; that's because I don't know it. I could look for a chunked corpus among NLTK's offerings and track down its file format and the class used to read it, but I trust you'll have no trouble doing that yourself.
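
As a starting point, though, one representation I can point to is the CoNLL-style IOB encoding used by the conll2000 corpus: flatten each chunk tree to (word, tag, IOB) triples and read them back with ConllChunkCorpusReader. A rough, untested sketch with made-up names:

    import os
    from nltk.chunk import tree2conlltags
    from nltk.corpus.reader import ConllChunkCorpusReader
    from nltk.tree import Tree

    # a tiny stand-in for the tree your chunker returned
    chunk_tree = Tree('S', [Tree('NP', [('The', 'DT'), ('dog', 'NN')]),
                            ('barked', 'VBD')])

    os.makedirs('my_chunked_corpus', exist_ok=True)
    with open('my_chunked_corpus/doc1.txt', 'w') as f:
        # tree2conlltags flattens the tree to (word, pos, IOB-chunk) triples
        for word, pos, iob in tree2conlltags(chunk_tree):
            f.write('{} {} {}\n'.format(word, pos, iob))
        f.write('\n')  # a blank line ends the sentence

    # the third argument names the chunk types present in the files
    reader = ConllChunkCorpusReader('my_chunked_corpus', r'.*\.txt', ('NP',))
    print(reader.chunked_sents()[0])  # reconstructed as a Tree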

Best,

Alexis

nawafpower

Nov 25, 2012, 1:11:41 AM
to nltk-...@googlegroups.com
Dear Price,

I have managed to create a feature matrix using TF-IDF; I now have an 870x3 matrix: 870 tokens as columns and three rows for three different authors. I'm stuck at this point on how to go further: how to train and classify, and later test on a new set of documents by these authors. Can you or anyone help me? As for the labels, should there be one for each TF-IDF value, or one for the whole row? I don't care which classifier I use, Naive Bayes or SVM; the most important thing for me is to get this working and to be able to test on new files.

I appreciate your time and effort.

Nawaf


Steve Vogel

Nov 25, 2012, 6:49:03 AM
to nltk-...@googlegroups.com
Jacob Perkins just had a post on his blog about TF-IDF vs. bag-of-words features. It's at http://streamhacker.com/2012/11/22/text-classification-sentiment-analysis-nltk-scikitlearn/ and uses sklearn and NLTK. It might be worth looking at his post and https://github.com/japerk/nltk-trainer to see how he implemented it.
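
In case it helps, here's a minimal sketch of that kind of pipeline, assuming scikit-learn (the documents and labels below are made up). Note that the label goes with the whole document row, not with each individual TF-IDF value:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_docs = ['text of document one ...', 'text of document two ...',
                  'text of document three ...']
    train_labels = ['author_a', 'author_b', 'author_c']  # one label per document

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_docs)  # rows = docs, columns = tokens

    clf = MultinomialNB()
    clf.fit(X_train, train_labels)

    # new documents must go through the same fitted vectorizer
    X_new = vectorizer.transform(['an unseen document ...'])
    print(clf.predict(X_new))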

Clair Milanovich

May 6, 2020, 1:33:17 PM
to nltk-users
J Price -- I too was attempting to build my own corpus, and Alexis (above) gave us the line that did the trick.

Here is my code in case it helps somebody in the future. (I made a directory called books and put three Woolf books in it.)

import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus import LazyCorpusLoader

# next line is from Alexis on the GGroup nltk-users list; it looks for a
# 'books' directory under the nltk_data corpora path
books = LazyCorpusLoader('books', PlaintextCorpusReader, r'(?!\.).*\.txt')

# or point PlaintextCorpusReader at a directory yourself; the second
# argument is a regex, so it needs to match the whole .txt filename
corpus_root = '../corpora/books/'
wordlists = PlaintextCorpusReader(corpus_root, r'.*\.txt')

print(books.fileids())

['woolf-acts.txt', 'woolf-night-day.txt', 'woolf-voyage.txt']
