text classification example using 20 newsgroup

Yichao Jin

Aug 18, 2017, 5:44:41 AM
to BigDL User Group
Hi there, 

Looking more closely at the example code, I realized that the dataset used for the text classification example does not seem appropriate.

The dataset code (i.e., BigDL/pyspark/bigdl/dataset/news20.py) downloads the data from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz, in which the first few lines of each document state the category along with some other metadata.

At the same time, the sample code (i.e., https://github.com/intel-analytics/BigDL/blob/master/pyspark/bigdl/models/textclassifier/textclassifier.py) sets "sequence_len = 50", which means only the first 50 word vectors are used as features. It seems unfair to use these category-revealing lines to predict the category itself.

I think the right way to do this task is to use the clean dataset (i.e., http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz).
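
For illustration only, here is a minimal sketch of another way to keep the category-revealing lines out of the features: dropping the header block before tokenizing. It assumes each article follows the usual Usenet layout, where headers are separated from the body by the first blank line. The function name strip_headers and the sample text are hypothetical, and this is not how the 20news-bydate archive was produced.

```python
def strip_headers(article: str) -> str:
    """Return the body of a Usenet-style article.

    Assumption: headers and body are separated by the first blank line.
    If no blank line is found, the text is returned unchanged.
    """
    normalized = article.replace("\r\n", "\n")
    _headers, separator, body = normalized.partition("\n\n")
    return body if separator else normalized


# Hypothetical article; only the Xref/Newsgroups style of header is taken
# from real 20 Newsgroups files.
raw = (
    "Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960\n"
    "Newsgroups: alt.atheism\n"
    "Subject: hypothetical subject line\n"
    "\n"
    "This is the hypothetical body of the post.\n"
)

body = strip_headers(raw)
print(body)
```

One caveat with this approach: a file that has no header block but does contain a blank line would lose its first paragraph, so it only makes sense on the raw archives.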

I will try to make some changes to this example, see the results, and create a PR to fix this issue. 

Please feel free to correct me if there is anything wrong :)

Best Regards
Yichao

zhichao

Aug 22, 2017, 3:38:40 AM
to Yichao Jin, BigDL User Group
We do need to switch the link.
Feel free to create a PR for this. I'm not sure whether the content behind that link changed recently.
Previously we depended on http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz, which was temporarily broken, so we switched to the current one.

Refer to: https://github.com/intel-analytics/BigDL/tree/master/spark/dl/src/main/scala/com/intel/analytics/bigdl/example/textclassification

Thanks,
Zhichao

--
You received this message because you are subscribed to the Google Groups "BigDL User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigdl-user-group+unsubscribe@googlegroups.com.
To post to this group, send email to bigdl-user-group@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigdl-user-group/1e077a6d-51b9-4ab1-8886-396136e4d72c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Li, Zhichao

Aug 27, 2017, 8:42:38 PM
to Yichao Jin, BigDL User Group

We should revise it to match: https://github.com/intel-analytics/BigDL/tree/master/spark/dl/src/main/scala/com/intel/analytics/bigdl/example/textclassification

 

From: bigdl-us...@googlegroups.com [mailto:bigdl-us...@googlegroups.com] On Behalf Of Yichao Jin
Sent: Tuesday, August 22, 2017 6:23 PM
To: BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] text classification example using 20 newsgroup

 

I just checked the data from the CMU link. It uses the same data format.

 

The issue is that the first 50 words of any document in it look something like this:

Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960 alt.atheism.moderated:713 news.answers:7054 alt.answers:126
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!uunet!pipex!ibmpcug!mantis!mathew

So the pyspark example (https://github.com/intel-analytics/BigDL/blob/master/pyspark/bigdl/models/textclassifier/textclassifier.py) does not actually look at the content, only at this metadata, which contains the category itself (i.e., "alt.atheism" in the above example) at least twice.
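
To make the leak concrete, here is a small sketch that checks whether a label appears in the first sequence_len tokens of a document. It uses plain whitespace splitting, which is an assumption and may differ from the tokenizer in the BigDL example. The helper names and the body text are hypothetical; the Xref line is the one quoted above.

```python
def first_tokens(text, sequence_len=50):
    """Return the first `sequence_len` whitespace-separated tokens."""
    return text.split()[:sequence_len]


def leaking_tokens(text, label, sequence_len=50):
    """Return the tokens in the truncated window that contain the label."""
    return [tok for tok in first_tokens(text, sequence_len) if label in tok]


header = (
    "Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960 "
    "alt.atheism.moderated:713 news.answers:7054 alt.answers:126\n"
)
# Hypothetical body, chosen so that it never mentions the group name.
body = "Hypothetical body text that never names its own newsgroup.\n"

with_header = leaking_tokens(header + "\n" + body, "alt.atheism")
without_header = leaking_tokens(body, "alt.atheism")

print(with_header)     # ['alt.atheism:49960', 'alt.atheism.moderated:713']
print(without_header)  # []
```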

 

When I switched to the clean dataset, which does not contain the above metadata, and kept everything else the same, the accuracy immediately dropped to around 65%, though I believe that is mostly due to the short sequence_len setting (the original Keras example uses 1000).
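
As a rough sketch of why sequence_len matters, the snippet below clips or pads a token list to a fixed length and counts how many real tokens survive. The helper to_fixed_length, the pad token, padding at the end, and the 120-token document are all assumptions for illustration; the actual example may pad or truncate differently.

```python
def to_fixed_length(tokens, sequence_len, pad_token="<pad>"):
    """Clip `tokens` to `sequence_len`, padding at the end if too short."""
    clipped = tokens[:sequence_len]
    return clipped + [pad_token] * (sequence_len - len(clipped))


# Hypothetical document of 120 tokens.
doc = ["w%d" % i for i in range(120)]

short = to_fixed_length(doc, 50)    # the setting in the pyspark example
long = to_fixed_length(doc, 1000)   # the setting in the Keras example

kept_short = sum(tok != "<pad>" for tok in short)
kept_long = sum(tok != "<pad>" for tok in long)

print(len(short), kept_short)  # 50 50
print(len(long), kept_long)    # 1000 120
```

With sequence_len = 50 more than half of this hypothetical document is discarded, while with 1000 every token is kept and the rest is padding.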

 

So I am going to try the same setting and see the results. If you have any suggestions or insights, please do let me know :)

 

Best Regards

Yichao

 

