Pierrot
--
It is similar to spam detection (spam and not spam).
As an introduction, I recommend some useful links:
The first thing is to create a corpus. The corpus should consist of documents (not single words) with text about politics and text not about politics. You will need to classify these documents manually as POLITICS or NOT_POLITICS. You should then split the corpus into 2 sets: a training set and a test set.
It is even better to create 3 sets. You will find more explanation in the NLTK book.
Preparing a good corpus is probably the hardest part. At the very least, try to keep the same proportion of each category in the corpus as you expect to see in the real world.
https://github.com/japerk/nltk-trainer
http://streamhacker.com/2010/10/25/training-binary-text-classifiers-nltk-trainer/
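The corpus split described above can be sketched in plain Python. This is only a minimal illustration, not what nltk-trainer does internally: the function name `stratified_split`, the 80/10/10 ratios, and the toy titles are all assumptions made up for the example. It splits each category separately, so all three sets keep the proportions of the full corpus.

```python
import random
from collections import defaultdict


def stratified_split(labeled_docs, train=0.8, dev=0.1, seed=0):
    """Split (text, label) pairs into training, dev-test and test sets,
    keeping each label's share roughly the same in all three sets."""
    # Group the documents by their manually assigned label.
    by_label = defaultdict(list)
    for text, label in labeled_docs:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_set, dev_set, test_set = [], [], []
    for label in sorted(by_label):
        docs = by_label[label]
        rng.shuffle(docs)
        n_train = int(len(docs) * train)
        n_dev = int(len(docs) * dev)
        train_set += docs[:n_train]
        dev_set += docs[n_train:n_train + n_dev]
        test_set += docs[n_train + n_dev:]  # whatever is left over
    return train_set, dev_set, test_set


# Toy corpus (made-up data): 20% POLITICS, 80% NOT_POLITICS.
corpus = (
    [("politics title %d" % i, "POLITICS") for i in range(20)]
    + [("other title %d" % i, "NOT_POLITICS") for i in range(80)]
)
train_set, dev_set, test_set = stratified_split(corpus)
```

Train on `train_set`, tune features against `dev_set`, and look at `test_set` only once at the end, so the final accuracy figure is not biased by the tuning.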
1. So I need to create a corpus with non-politics documents, but that can be a lot of things.
There are more documents not about politics than documents about politics. Does it need to be 50%-50% in my corpus?
2. You say the corpus is not just words.
In my case, it's just a lot of article titles (about 200,000), each with only a few words (5-10 words).
Is that enough, or should they be longer?