Text classification with only one category


Pierrot

Nov 17, 2012, 1:01:31 PM
to nltk-...@googlegroups.com
Hello,

I would like to develop a program that:
- takes a text as input,
- tells me whether the text belongs to a particular category, such as "politics", as output.

Ex 1:
Input: "The best sports coverage from around the world, covering: Football, Cricket, Golf, Rugby, WWE, Boxing"
Output: False (not "politics")

Ex 2:
Input: Obama announced an increase to U.S. troop levels of 17,000 in February 2009 to "stabilize a deteriorating situation in Afghanistan"
Output: True ("politics")

I have a list of words related to the vocabulary of "politics" that could serve as a corpus.
I think I need to use a Naive Bayes classifier and train it with my custom corpus, but I'm not sure. Is that the right approach?

Moreover, the Naive Bayes classifier examples I have seen on the net always use 2 categories (pos/neg for movie_reviews, male/female for gender). I have only one category with only one corpus, so how do I do it? Any idea, documentation, or code that could help me understand how to do it?

Pierrot






Nigel Legg

Nov 17, 2012, 3:40:27 PM
to nltk-...@googlegroups.com
I'm not sure that your example 2 is politics, unless "politics" includes any foreign policy or government action. A lot of actions would be classed as politics under that definition.



--
Regards,
Nigel Legg
07722 652866
http://twitter.com/nigellegg
http://uk.linkedin.com/in/nigellegg

Krzysztof Langner

Nov 18, 2012, 3:32:59 AM
to nltk-...@googlegroups.com
Hello Pierrot,

Your task is called binary text classification. You have 2 categories:
  1. Documents about politics (POLITICS)
  2. Documents not about politics (NOT_POLITICS)

It is similar to spam detection (spam and not spam).

As an introduction, I recommend some useful links:

  1. Video from the Coursera NLP course: https://class.coursera.org/nlp/lecture/preview/index (the section about text classification)
  2. http://nltk.org/book/ch06.html, the chapter of the NLTK book that describes how to train a simple classifier. But remember that the name in the book example corresponds, in your case, to a full document.


The first thing is to create a corpus. The corpus should consist of documents with text about politics and not about politics (full documents, not just words). You will need to classify these documents manually as POLITICS or NOT_POLITICS. You should split the corpus into 2 sets:

  1. Training set (to train the classifier)
  2. Test set (to check how good the classifier is)

It is even better to create 3 sets (adding a development set). You will find more explanation in the NLTK book.

Preparing a good corpus is probably the hardest part. At the very least, try to keep the same proportion of each category in the corpus as you expect in the real world.
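
For example, a minimal sketch of the split in Python (labeled_titles is a hypothetical stand-in for your hand-tagged data):

import random

# labeled_titles stands in for your manually tagged data:
labeled_titles = [
    ("Obama announced an increase to U.S. troop levels", "POLITICS"),
    ("The best sports coverage from around the world", "NOT_POLITICS"),
    # ... the rest of your tagged titles
]

random.shuffle(labeled_titles)          # avoid ordering bias
cut = int(len(labeled_titles) * 0.8)
train_titles = labeled_titles[:cut]     # e.g. 80% for training
test_titles = labeled_titles[cut:]      # 20% held out for testing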


Next you have to decide which features you will extract from the text and provide to the classifier. Maybe your list of words can be used for this. Simply create a feature vector over all the words: when the document contains a word, put 1; when not, put 0. Check the book for examples.
But remember: the more features you select, the more documents you will need to train the classifier.
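
A minimal sketch of that idea with NLTK's Naive Bayes classifier (the vocabulary list here is illustrative, and train_titles is the assumed training half from the split sketch above):

import nltk

def title_features(title, vocabulary):
    # One boolean feature per vocabulary word: True if the title
    # contains the word, False otherwise (the 1/0 above).
    words = set(w.lower() for w in nltk.word_tokenize(title))
    return {"contains(%s)" % w: (w in words) for w in vocabulary}

# vocabulary could be your list of politics-related words, or the
# most frequent words across the whole dataset.
vocabulary = ["obama", "troop", "election", "sports", "football"]

train_set = [(title_features(title, vocabulary), label)
             for (title, label) in train_titles]
classifier = nltk.NaiveBayesClassifier.train(train_set)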


BTW, if you are looking for examples on the net, don't look at movie reviews. That task is called sentiment analysis and is often more complicated than plain text classification (which is what you really need).

I hope it helps :-)

Regards
Krzysztof

Pierrot

Nov 18, 2012, 8:41:41 AM
to nltk-...@googlegroups.com
Hello Krzysztof,

Thanks a lot for your answer, it helps me a lot!
I'm going to check your links carefully.

I have just two other questions:

1. So I need to create a corpus with non-politics documents. But that can be a lot of things.
There are more documents not about politics than documents about politics. Does it need to be 50%-50% in my corpus?

2. You say the corpus is not just words.
In fact, in my case it's just a lot of article titles (about 200,000), each with a few words (5-10 words).
Is that not enough, or should the documents be longer?

Thanks again,

Pierrot

Bio

Nov 18, 2012, 9:15:15 AM
to nltk-...@googlegroups.com
Hi Pierrot, I thought I might add another resource for your consideration.

As Krzysztof pointed out, chapter 6 of the Natural Language Processing with Python text covers the text classification methods you are interested in. I would highly recommend giving it a thorough read. Jacob Perkins, a frequent contributor to this forum, has written some really easy-to-use and understandable code to help with the type of classification you are trying to accomplish. If you read ch. 6 you will see that one of the biggest challenges in classifying text is creating the feature extractor used to train your classifier. Jacob Perkins's code automates the entire classification process, even the creation of the feature extractor.

About 2 years ago, when I first started using NLTK, I attempted a classification project very similar to yours. I used Jacob Perkins's code and had excellent success; it gave me an accuracy of 95%. I haven't had a chance to look at the underlying methodology he uses to create a feature extractor, but I suspect it is very similar to the methodology suggested by Krzysztof.

I am not sure about the answer to your first question, but I would suggest using as large a corpus as you can reasonably create, and then adjusting the proportion between the two types of documents based on the accuracy you get. As for your second question, when I used Jacob Perkins's code I also classified sentences of typical length and the process worked fine. If you are using article titles to create your corpus, you may need to put a period (.) at the end of each one. I'm not sure exactly how his code handles this, but typically, when segmenting sentences in NLTK, you need some form of punctuation at the end of a sentence for the segmentation to occur properly, and in my experience article titles do not have ending punctuation.

Here are the links to Jacob Perkins's code:

Krzysztof Langner

Nov 18, 2012, 9:41:39 AM
to nltk-...@googlegroups.com


On Sunday, November 18, 2012 2:41:41 PM UTC+1, Pierrot wrote:

1. So I need to create a corpus with non-politics documents. But that can be a lot of things.
There are more documents not about politics than documents about politics. Does it need to be 50%-50% in my corpus?

Yes, you need both POLITICS and NOT_POLITICS documents so the classifier can learn features from both sets. The ratio should be similar to real-world examples; that gives the best results.
 

2. You say the corpus is not just words.
In fact, in my case it's just a lot of article titles (about 200,000), each with a few words (5-10 words).
Is that not enough, or should the documents be longer?

In your case the dataset ("dataset" is a better name here than "corpus") is a collection of document titles, each with a flag that marks it as POLITICS or NOT_POLITICS.
So the sample records look like:
("Obama announced an increase to U.S. troop levels ", POLITICS)
("The best sports coverage from around the world, ", NOT_POLITICS)

Based on this you will create the feature vectors. You will find out how to create a feature vector in the book.
Finding good features is where the creative part starts :-)

Since you need to mark the documents manually, you probably will not use all 200K documents :-) But try with as many as you can.
Later you will use half for training and half for testing. (Or better yet, also create a development set.)

If you have already marked all the documents, then great: use them all.
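
Continuing the assumed names from the earlier sketches (title_features, vocabulary, classifier, test_titles), the held-out half would then be used like this:

test_set = [(title_features(title, vocabulary), label)
            for (title, label) in test_titles]
print(nltk.classify.accuracy(classifier, test_set))

# Classifying a new, unseen title:
new_title = "Parliament votes on the new defence budget"
print(classifier.classify(title_features(new_title, vocabulary)))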

Have fun :-)
Krzysztof

Pierrot

Nov 20, 2012, 2:58:33 PM
to nltk-...@googlegroups.com
A big thanks to you, Krzysztof, and a big thanks to you, George!
It helps me a lot.

