Using fasttext to classify news articles

ben...@googlemail.com

Sep 22, 2016, 10:59:40 PM

to fastText library

I'm trying to use fastText to classify news articles by topic (e.g. "sports", "economy", "science"), and I'm not having much luck. I was wondering if anyone had any advice on how I _should_ be doing it...

Currently, to build my training set, I dump out my hand-assigned tags followed by the headline and the entire article text, tokenised. For example (made-up):

__label__economy bank runs out of money . one of the big banks shut their doors today after admitting ...etc etc
boring article about nothing . articles like this one don ' t fit into any of the categories we ' re interested in ...
__label__education __label__science 90 % of school leavers think moon made of cheese . scientists are appalled as recent survey reveals ...
__label__science nasa : moon made of cheese . nasa announced today that their latest research proves that the moon is , in fact ... 

etc etc...

Of note: some articles have multiple labels, some have none. I've got about 20 topic labels in all.
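
For reference, a minimal Python sketch of producing lines in the format shown above. The function name `to_fasttext_line` is hypothetical, and it assumes the text has already been tokenised into a list of strings. An article with no labels simply gets no `__label__` prefix at all.

```python
def to_fasttext_line(labels, tokens, prefix="__label__"):
    """Format one article as a fastText supervised training line.

    labels: list of topic names (may be empty, may hold several).
    tokens: already-tokenised headline and body text.
    """
    tags = [prefix + label for label in labels]
    return " ".join(tags + list(tokens))


# A multi-label article: both tags go at the front of the same line.
print(to_fasttext_line(["education", "science"],
                       ["90", "%", "of", "school", "leavers"]))
# prints: __label__education __label__science 90 % of school leavers
```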

Then I'll build a model like this:

$ fasttext supervised -input articles.txt -output topics.model -wordNgrams 2
Read 3M words
Number of words:  55754
Number of labels: 15
Progress: 100.0%  words/sec/thread: 1732843  lr: 0.000000  loss: 2.331086  eta: 0h0m h-14m

But when I use the model to classify the same articles I used to train it, I seem to get labels that have no real correspondence to what they should be. One label always seems to dominate all the articles, eg:

$ fasttext predict-prob topics.model.bin articles.txt 3

__label__economy 0.285156 __label__education 0.236328 __label__science 0.0820313
__label__economy 0.265625 __label__education 0.222656 __label__science 0.0839844
__label__economy 0.242188 __label__education 0.207031 __label__science 0.0859375
... etc ...

The "test" option doesn't seem helpful for evaluating multi-label data - I think this is due to it assuming there should be a constant k labels assigned to each article...

$ fasttext test topics.model.bin articles.txt
P@1: 0.384
R@1: 0.197
Number of examples: 5499
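
To illustrate why a fixed k sits awkwardly with multi-label data, here is a rough Python sketch of precision and recall at k. It is an assumption about the general definition (correct predictions divided by predictions made, and by gold labels, respectively), not a copy of what the `test` command does internally; in particular, how fastText treats unlabelled lines is not covered here.

```python
def precision_recall_at_k(gold, ranked, k=1):
    """Micro-averaged precision and recall when exactly k labels are predicted.

    gold:   list of sets of true labels, one set per document.
    ranked: list of label lists, best guess first, one list per document.
    """
    correct = predicted = n_gold = 0
    for gold_labels, ranking in zip(gold, ranked):
        top_k = ranking[:k]
        correct += sum(1 for label in top_k if label in gold_labels)
        predicted += len(top_k)
        n_gold += len(gold_labels)
    precision = correct / predicted if predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


# Made-up data: one single-label, one two-label and one unlabelled article,
# all ranked the same way by a model dominated by "economy".
gold = [{"economy"}, {"education", "science"}, set()]
ranked = [["economy", "education"]] * 3
print(precision_recall_at_k(gold, ranked, k=1))
print(precision_recall_at_k(gold, ranked, k=2))
```

With k=1 the two-label article can never be fully recalled, and the unlabelled article always counts as a wrong prediction, so neither number can reach 1.0 on data like this.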

So I suspect I'm totally misunderstanding how the fasttext commandline tool _should_ be used for this kind of task.
Can anyone give me some advice as to what I might be doing wrong?

Thanks,
Ben.

simon.j...@gmail.com

Oct 14, 2016, 7:13:48 AM

to fastText library, ben...@googlemail.com

Hi, changing the number of epochs (the -epoch option) did the trick here (try 50, 100, 500, 1000). It sets how many passes the training makes over the whole collection.
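
A small Python sketch of setting up such a sweep. It only builds the command lines and does not run them. `-epoch` and `-lr` are fastText command-line options; the helper name `supervised_command` and the output prefixes are made up for illustration.

```python
def supervised_command(input_path, output_prefix, epoch, lr=None, word_ngrams=2):
    """Build the argument list for one `fasttext supervised` training run."""
    cmd = ["fasttext", "supervised",
           "-input", input_path,
           "-output", output_prefix,
           "-wordNgrams", str(word_ngrams),
           "-epoch", str(epoch)]
    if lr is not None:
        # Only pass -lr when a value is given, otherwise keep the default.
        cmd += ["-lr", str(lr)]
    return cmd


# One command per epoch setting; each writes to its own (hypothetical) prefix.
for epoch in (50, 100, 500, 1000):
    print(" ".join(supervised_command("articles.txt", "topics.epoch%d" % epoch, epoch)))
```

Each list could be passed to `subprocess.run` to actually train, keeping one model per setting so they can be compared on held-out data.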

Cheers!

ben...@googlemail.com

Oct 19, 2016, 10:55:08 PM

to fastText library, ben...@googlemail.com, simon.j...@gmail.com

On Saturday, October 15, 2016 at 12:13:48 AM UTC+13, simon.j...@gmail.com wrote:

Hi, changing the number of epochs (the -epoch option) did the trick here (try 50, 100, 500, 1000). It sets how many passes the training makes over the whole collection.

Ahh, thanks - this seems to be helping a lot!
The main issue now is evaluating accuracy during prediction. The 'test' operation assumes there should be exactly K labels for each and every classified document, but in my set any number of labels (or none) could apply to each one.
I think the right thing is to just accept any labels which have a probability above a given threshold instead. So I'm going to have a go at patching the 'predict' and 'test' operations to do this as an option instead of a fixed K.
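
Short of patching fastText itself, the same idea can be sketched as a post-processing step in Python over `predict-prob` output like the lines shown earlier. The function names and the 0.2 cut-off are illustrative assumptions only, and `predict-prob` would need to be run with a k large enough to list every candidate label.

```python
def parse_predict_prob(line, prefix="__label__"):
    """Turn one predict-prob output line into (label, probability) pairs."""
    fields = line.split()
    pairs = zip(fields[0::2], fields[1::2])  # label, prob, label, prob, ...
    return [(label[len(prefix):], float(prob))
            for label, prob in pairs if label.startswith(prefix)]


def labels_above(line, threshold):
    """Keep every label whose probability reaches the threshold (possibly none)."""
    return [label for label, prob in parse_predict_prob(line) if prob >= threshold]


line = "__label__economy 0.285156 __label__education 0.236328 __label__science 0.0820313"
print(labels_above(line, 0.2))
# prints: ['economy', 'education']
```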

Ben.

simon.j...@gmail.com

Oct 20, 2016, 3:20:32 AM

to fastText library, ben...@googlemail.com, simon.j...@gmail.com

Hi Ben, you could also look at something like Discounted Cumulative Gain or Mean Average Precision metrics, depending on whether your ground-truth labels are ranked or not.
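
As a rough illustration of the two metrics, a Python sketch using the usual textbook definitions. It assumes each ranking lists a label at most once, and it averages MAP only over documents that have at least one true label; both choices are assumptions rather than anything fastText provides.

```python
import math


def average_precision(gold, ranked):
    """Average precision of one ranking against an unordered set of true labels."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked, start=1):
        if label in gold:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(gold) if gold else 0.0


def mean_average_precision(gold_sets, rankings):
    """Mean of average_precision over documents with at least one true label."""
    scores = [average_precision(g, r) for g, r in zip(gold_sets, rankings) if g]
    return sum(scores) / len(scores) if scores else 0.0


def dcg(relevances):
    """Discounted cumulative gain for graded relevances listed in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


ranking = ["economy", "education", "science"]
print(average_precision({"education", "science"}, ranking))
print(dcg([0, 1, 1]))  # same ranking with binary relevance
```

Average precision suits unranked truth labels, while DCG makes more sense when some labels matter more than others and can be given graded relevance values.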

https://blog.lateral.io/2016/09/fasttext-based-hybrid-recommender/ -- some of these parameters and the explanation helped me a lot. Moreover, I found that increasing the learning rate may work better than increasing the number of epochs. Too high a learning rate or too many epochs, however, ends up with a very low loss (say 0.008 instead of 0.024) on the training data without a significant accuracy improvement on the test set (0.864), which probably indicates overfitting.

Cheers!

Alex Ott

Oct 25, 2016, 7:42:12 PM

to ben...@googlemail.com, fastText library

Do you have "stable" label combinations, or could there be any label combination?

If the former, then I suggest using "combined" labels instead of the individual ones...
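
One way the "combined" label idea might look in Python. The `+` separator and the `none` name for unlabelled articles are arbitrary choices made for this sketch; the only real constraint assumed is that the combined label contains no whitespace, so it stays a single token.

```python
def combine_labels(labels, prefix="__label__", sep="+", empty="none"):
    """Collapse a set of labels into one combined label.

    Labels are sorted so that the same combination always maps to the same name.
    """
    name = sep.join(sorted(set(labels))) if labels else empty
    return prefix + name


print(combine_labels(["science", "education"]))
# prints: __label__education+science
print(combine_labels([]))
# prints: __label__none
```

Every article then carries exactly one label, which fits the fixed k=1 assumption of `test`, at the cost of one class per distinct combination.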

Another useful thing: try reconsidering the label taxonomy so that there are fewer "similar" labels, for example education & science, etc.

--

With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott

congth...@gmail.com

Dec 27, 2016, 11:36:37 PM

to fastText library, ben...@googlemail.com, simon.j...@gmail.com

Hi, right now I have a similar issue to yours. One label dominates all the others, so I don't know how to choose the right probability threshold for my model. Do you have any ideas?
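
One possible approach, sketched in Python: try a handful of candidate thresholds on held-out data and keep the one with the best micro-averaged F1. The data below is made up, the function names are hypothetical, and a single global threshold is itself an assumption; a strongly dominant label might call for per-label thresholds instead.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over parallel lists of true and predicted label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def best_threshold(gold, scored, candidates):
    """Return the candidate threshold that maximises micro-F1.

    scored: per document, a list of (label, probability) pairs.
    """
    def f1_at(threshold):
        predicted = [{label for label, prob in doc if prob >= threshold}
                     for doc in scored]
        return micro_f1(gold, predicted)
    return max(candidates, key=f1_at)


gold = [{"economy"}, {"education", "science"}]
scored = [
    [("economy", 0.6), ("education", 0.3), ("science", 0.1)],
    [("science", 0.5), ("education", 0.4), ("economy", 0.1)],
]
print(best_threshold(gold, scored, [0.05, 0.2, 0.45, 0.7]))
# prints: 0.2
```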

Thanks

sruthip...@gmail.com

Aug 31, 2017, 7:55:03 AM

to fastText library, ben...@googlemail.com, simon.j...@gmail.com

Hi, how do I find the accuracy of a fastText classifier?

simon.j...@gmail.com

Sep 1, 2017, 1:39:44 AM

to fastText library, ben...@googlemail.com, simon.j...@gmail.com, sruthip...@gmail.com

Hi, there is a "fasttext test" command. See https://github.com/facebookresearch/fastText/blob/master/tutorials/supervised-learning.md for examples. Otherwise you could hold out your test and evaluation sets, run the text through the model to predict labels, and then write a custom evaluator that combines the input and the predictions and gives you the metrics you want.
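
A minimal sketch of that second route in Python, assuming the gold and predicted labels have already been collected as one set per document. The helper names, the 20% test fraction and the fixed seed are illustrative choices, not anything fastText prescribes.

```python
import random
from collections import Counter


def split_lines(lines, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and hold out a fraction of the lines for testing."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def per_label_scores(gold, predicted):
    """Precision and recall for each label, from parallel lists of label sets."""
    tp, n_pred, n_gold = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        n_pred.update(p)
        n_gold.update(g)
        tp.update(g & p)
    scores = {}
    for label in set(n_pred) | set(n_gold):
        precision = tp[label] / n_pred[label] if n_pred[label] else 0.0
        recall = tp[label] / n_gold[label] if n_gold[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Made-up example: the model predicts "economy" for everything.
gold = [{"economy"}, {"science"}]
predicted = [{"economy"}, {"economy"}]
for label, (precision, recall) in sorted(per_label_scores(gold, predicted).items()):
    print(label, precision, recall)
```

Per-label numbers make a dominant label easy to spot: it shows high recall with low precision, while the labels it crowds out show recall near zero.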

hope this helps,
