Classify content generated from Wikipedia based on the type associated with the Wikipedia page

wa...@userworkstech.com

Jul 16, 2014, 4:12:16 AM

to wekamooc...@googlegroups.com

We run a service that receives a lot of queries and searches the web for content relevant to each one.

We classify that content into a number of domains which we created ourselves.

Now that we have a large amount of classified content, we would like to classify the queries themselves based on it. Our data looks like this: one folder per class (around 1000 classes), named after the class, and each folder holds many files containing the content (text).

We are trying to figure out how to proceed in order to create a classifier.

How should we create attributes for the instances?

Do we have to use FilteredClassifier?

What approaches should we try in order to create a classifier and query it at run time to get the probable domain for a query?

Do we have to create an ARFF file manually before we can use a classifier?

Gabriel Santos

Jul 16, 2014, 4:34:52 AM

to wekamooc...@googlegroups.com

Hi,

For this kind of operation you will need a heap size of at least 1024M. Edit the file RunWeka.ini and set the value of the key maxheap to 1024M, if it is not already set to that.

Try converting the text from Wikipedia to ARFF or XRFF. See the following link for reference:
http://weka.wikispaces.com/Text+categorization+with+WEKA
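
To make the conversion step concrete, here is a minimal Java sketch of what it could look like for the folder-per-class layout described in the question. It uses only the standard library; the class name FoldersToArff and the attribute names text and domain are made up for illustration, and it assumes UTF-8 files small enough to hold in memory. Weka ships its own converter for this layout (TextDirectoryLoader), which is normally the better choice, so treat this only as a picture of the resulting file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: turns "one folder per class" into ARFF text.
final class FoldersToArff {

    // Quote a value for ARFF: escape backslashes and single quotes,
    // and flatten line breaks and tabs into single spaces.
    static String quote(String s) {
        String cleaned = s.replace("\\", "\\\\")
                          .replace("'", "\\'")
                          .replaceAll("[\\r\\n\\t]+", " ");
        return "'" + cleaned + "'";
    }

    // Read every sub-folder of root; the folder name becomes the class label.
    static Map<String, List<String>> readFolders(Path root) throws IOException {
        Map<String, List<String>> docs = new TreeMap<>();
        try (DirectoryStream<Path> classDirs = Files.newDirectoryStream(root)) {
            for (Path dir : classDirs) {
                if (!Files.isDirectory(dir)) continue;
                List<String> texts = new ArrayList<>();
                try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
                    for (Path f : files) {
                        if (Files.isRegularFile(f)) {
                            texts.add(new String(Files.readAllBytes(f), StandardCharsets.UTF_8));
                        }
                    }
                }
                docs.put(dir.getFileName().toString(), texts);
            }
        }
        return docs;
    }

    // Build the ARFF text: one string attribute plus a nominal class attribute.
    static String toArff(String relation, Map<String, List<String>> docsByClass) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(quote(relation)).append("\n\n");
        sb.append("@attribute text string\n");
        List<String> labels = new ArrayList<>();
        for (String label : docsByClass.keySet()) labels.add(quote(label));
        sb.append("@attribute domain {").append(String.join(",", labels)).append("}\n\n");
        sb.append("@data\n");
        for (Map.Entry<String, List<String>> e : docsByClass.entrySet()) {
            for (String doc : e.getValue()) {
                sb.append(quote(doc)).append(',').append(quote(e.getKey())).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FoldersToArff <root-folder>");
            return;
        }
        System.out.print(toArff("wikipedia_domains", readFolders(Paths.get(args[0]))));
    }
}
```

The text attribute is still a raw string at this point; it has to be turned into word attributes (for example with Weka's StringToWordVector filter) before most classifiers can use it.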

Then you can start exploring some clustering algorithms, such as SimpleKMeans, and try playing around with different "distanceFunctions".
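
To see why the choice of distance function matters, here is a small standalone Java sketch (not Weka code; the class DistanceSketch and its data are invented for illustration). It assigns a point to its nearest centroid, which is the assignment step of k-means, once with Euclidean and once with Manhattan distance, and the two disagree.

```java
import java.util.function.ToDoubleBiFunction;

// Illustrative only: nearest-centroid assignment under two distance functions.
final class DistanceSketch {

    // Straight-line distance: square root of the summed squared differences.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // City-block distance: sum of the absolute differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // Index of the centroid closest to point under the given distance.
    static int nearest(double[] point, double[][] centroids,
                       ToDoubleBiFunction<double[], double[]> distance) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++) {
            if (distance.applyAsDouble(point, centroids[i])
                    < distance.applyAsDouble(point, centroids[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] point = {0, 0};
        double[][] centroids = {{3, 3}, {0, 5}};
        // Euclidean: sqrt(18) is about 4.24 versus 5, so centroid 0 wins.
        System.out.println("euclidean -> " + nearest(point, centroids, DistanceSketch::euclidean));
        // Manhattan: 6 versus 5, so centroid 1 wins.
        System.out.println("manhattan -> " + nearest(point, centroids, DistanceSketch::manhattan));
    }
}
```

With word-count vectors the same effect shows up at scale, so it is worth comparing the resulting clusters rather than assuming one distance is best.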

Have fun!
Gabriel Santos
Macau
Community TA

wa...@userworkstech.com

Jul 16, 2014, 4:41:03 AM

to wekamooc...@googlegroups.com

Thanks for the reply.

I have already increased the heap size to 1024M with the -Xmx1024M option.

I have also converted the text to ARFF.

I will look at SimpleKMeans and distanceFunctions, as you suggested.

Gabriel Santos

Jul 16, 2014, 5:18:31 AM

to wekamooc...@googlegroups.com

If you are patient enough, the "More Data Mining with Weka" course will give you more insight, starting with lesson 2.4 on document classification. You will also learn how to prepare the text, for instance by removing stopwords.
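
As a rough idea of what that text preparation involves, here is a minimal Java sketch that lowercases a document, splits it into words, drops stopwords, and counts what is left. Each surviving word would then become one attribute of the instance. The class name BagOfWordsSketch and the tiny stopword list are placeholders for illustration; in Weka this job is normally done by the StringToWordVector filter rather than by hand.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: a hand-rolled bag-of-words with stopword removal.
final class BagOfWordsSketch {

    // A deliberately tiny stopword list; real lists are much longer.
    static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "of", "and", "is", "in", "to"));

    // Count the non-stopword tokens of a text, sorted alphabetically.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Lowercase, then split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {army=1, empire=1, history=1, roman=2}
        System.out.println(wordCounts("The history of the Roman Empire and the Roman army"));
    }
}
```

Note that the split pattern only keeps ASCII letters and digits, which is an assumption that will not suit non-English text.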

If you are happy so far with this course then you will love Weka even more!

Have fun!
Gabriel Santos
Macau
Community TA
