Help with Reuters corpus for algorithm testing

maarouf

Mar 26, 2007, 10:02:01 PM

Hello

First, I apologize if my questions are very basic. I am a theoretical
physicist who happens to have some interest in text categorization.

I need some help with the Reuters-21578 corpus. I downloaded a
90-category set and noticed that some documents were assigned to
multiple categories. I wrote some code to extract unique documents,
i.e. documents that are assigned to one category only. I ended up
with 3299 docs for the test set and 9598 docs for the training set.
Compared with the numbers in the literature, the test set count seems
OK, but it looks like I am missing 5 docs in the training set.

The other problem I am having is that some categories are empty (in
the unique sets I extracted). For example, the Corn category has no
docs in either set. So my test set count of 3299 is OK, but its Corn
category is empty.
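
To make the filtering step concrete, here is a minimal sketch of
extracting single-category documents and checking which categories
end up empty. The function name split_single_label and the toy
document ids and labels are made up for illustration; this is not
real Reuters data. It only shows how a category can vanish when
every one of its documents also carries a second label.

```python
from collections import Counter


def split_single_label(doc_labels):
    """Return (single, emptied).

    single  maps each doc id with exactly one category to that category.
    emptied lists categories that have no documents left after filtering.
    """
    single = {doc: cats[0] for doc, cats in doc_labels.items() if len(cats) == 1}
    all_cats = {c for cats in doc_labels.values() for c in cats}
    remaining = Counter(single.values())
    emptied = sorted(c for c in all_cats if remaining[c] == 0)
    return single, emptied


# Hypothetical toy labels (assumption: ids and categories are invented).
toy = {
    "d1": ["earn"],
    "d2": ["grain", "corn"],
    "d3": ["grain"],
    "d4": ["corn", "grain"],
}

single, emptied = split_single_label(toy)
print(single)   # {'d1': 'earn', 'd3': 'grain'}
print(emptied)  # ['corn']
```

In the toy data every "corn" document is also labeled "grain", so
"corn" has no documents once multi-label docs are dropped.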

Can somebody please help me with a link to download the ModApte set of
UNIQUE docs? Or am I wrong about the need to use a unique set of docs
for both the test and the training sets? Is it acceptable to use docs
assigned to multiple categories, on the condition that when
calculating recall and precision we account for that (by allowing a
doc to be a true positive of more than one category)? This seems a
bit unnatural to me.
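
The multi-label accounting described above can be sketched as
follows: each category is treated as its own yes/no decision per
document, so one document can contribute a true positive to several
categories, and the counts are then pooled (micro-averaged). This is
only a sketch under that assumption; the function names and the toy
gold/predicted labels are hypothetical.

```python
def per_category_counts(gold, pred, category):
    """Count true positives, false positives and false negatives for one category."""
    tp = fp = fn = 0
    for doc, gold_cats in gold.items():
        in_gold = category in gold_cats
        in_pred = category in pred.get(doc, set())
        if in_gold and in_pred:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_gold:
            fn += 1
    return tp, fp, fn


def micro_precision_recall(gold, pred, categories):
    """Pool the per-category counts, then compute precision and recall."""
    tp = fp = fn = 0
    for category in categories:
        t, f, n = per_category_counts(gold, pred, category)
        tp, fp, fn = tp + t, fp + f, fn + n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: d2 truly belongs to two categories.
gold = {"d1": {"earn"}, "d2": {"grain", "corn"}, "d3": {"grain"}}
pred = {"d1": {"earn"}, "d2": {"grain"}, "d3": {"grain", "corn"}}

precision, recall = micro_precision_recall(gold, pred, ["earn", "grain", "corn"])
print(precision, recall)  # 0.75 0.75
```

Here d2 counts as a true positive for "grain" and a false negative
for "corn", so a multi-label doc is scored once per category rather
than once overall.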

Any help is greatly appreciated.

Thanks,
Ahmed
