I'm trying to use fastText to classify news articles by topic (e.g. "sports", "economy", "science" etc.), and I'm not having much luck. I was wondering if anyone had any advice on how I _should_ be doing it...
Currently, to build my training set, I write out one line per article: my hand-assigned tags, then the headline and the entire article text, tokenised. For example (made-up):
__label__economy bank runs out of money . one of the big banks shut their doors today after admitting ...etc etc
boring article about nothing . articles like this one don ' t fit into any of the categories we ' re interested in ...
__label__education __label__science 90 % of school leavers think moon made of cheese . scientists are appalled as recent survey reveals ...
__label__science nasa : moon made of cheese . nasa announced today that their latest research proves that the moon is , in fact ...
etc etc...
Of note: some articles have multiple labels, some have none. I've got about 20 topic labels in all.
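To make the format concrete, here is a minimal Python sketch of how lines like the ones above could be generated. The function name `to_fasttext_line` is hypothetical, and it assumes the headline and body are already tokenised (punctuation split off) and that lowercasing is wanted, as in the made-up examples. It is an illustration of the format, not the exact script used.

```python
def to_fasttext_line(labels, headline, body):
    """Build one training line: zero or more __label__ prefixes, then the text."""
    prefix = " ".join(f"__label__{label}" for label in labels)
    # Headline and body are joined with " . " as in the examples above (assumption).
    text = f"{headline} . {body}"
    # fastText reads one example per line, so collapse newlines and repeated spaces.
    text = " ".join(text.lower().split())
    # An article with no labels yields a line with no __label__ prefix at all.
    return f"{prefix} {text}".strip()


if __name__ == "__main__":
    print(to_fasttext_line(["education", "science"],
                           "90 % of school leavers think moon made of cheese",
                           "scientists are appalled as recent survey reveals"))
```

An article with several tags gets several `__label__` prefixes, and an untagged article produces a line with text only.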
Then I'll build a model like this:
$ fasttext supervised -input articles.txt -output topics.model -wordNgrams 2
Read 3M words
Number of words: 55754
Number of labels: 15
Progress: 100.0% words/sec/thread: 1732843 lr: 0.000000 loss: 2.331086 eta: 0h0m h-14m
But when I use the model to classify the same articles I used to train it, the predicted labels bear no real correspondence to the ones I assigned. One label always seems to dominate across all the articles, e.g.:
$ fasttext predict-prob topics.model.bin articles.txt 3
__label__economy 0.285156 __label__education 0.236328 __label__science 0.0820313
__label__economy 0.265625 __label__education 0.222656 __label__science 0.0839844
__label__economy 0.242188 __label__education 0.207031 __label__science 0.0859375
... etc ...
The "test" option doesn't seem helpful for evaluating multi-label data. I think this is because it assumes a constant k labels should be assigned to each article:
$ fasttext test topics.model.bin articles.txt
P@1: 0.384
R@1: 0.197
Number of examples: 5499
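One way to evaluate without a fixed k would be to threshold the probabilities from predict-prob and compare the resulting label sets against the hand-assigned ones. The Python sketch below does that with micro-averaged precision and recall. The function names and the 0.2 threshold are hypothetical, and it assumes the predict-prob output has exactly one line per input line, in the same order, in the "label probability label probability ..." layout shown above.

```python
LABEL_PREFIX = "__label__"


def gold_labels(line):
    """Collect the __label__ tokens from one line of the training/test file."""
    return {tok for tok in line.split() if tok.startswith(LABEL_PREFIX)}


def predicted_labels(line, threshold):
    """Parse one predict-prob output line and keep labels at or above threshold."""
    tokens = line.split()
    pairs = zip(tokens[0::2], tokens[1::2])  # (label, probability) pairs
    return {label for label, prob in pairs if float(prob) >= threshold}


def micro_precision_recall(gold_lines, pred_lines, threshold=0.2):
    """Micro-averaged precision and recall over variable-size label sets."""
    tp = fp = fn = 0
    for gold_line, pred_line in zip(gold_lines, pred_lines):
        gold = gold_labels(gold_line)
        pred = predicted_labels(pred_line, threshold)
        tp += len(gold & pred)   # labels predicted and assigned
        fp += len(pred - gold)   # labels predicted but not assigned
        fn += len(gold - pred)   # labels assigned but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Because each article contributes however many labels it actually has, articles with zero or several labels are counted without forcing a fixed number of predictions per article.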
So I suspect I'm totally misunderstanding how the fasttext commandline tool _should_ be used for this kind of task.
Can anyone give me some advice as to what I might be doing wrong?
Thanks,
Ben.