Hi,
I assembled a training set of ~15k samples for this task by scraping. Here's an example:
.txt file:
Denzel and Pauletta Washington to host fundraiser for African American museum
.key file:
Denzel Washington
Washington
museums
However, most .key/.txt pairs are qualitatively worse than this, often containing just 1 keyword where a human would put down 3. The underlying tag set is high quality and human-annotated, but I had to whittle the tags down to those that actually appear in the headline, so the quality is reduced (every example retains at least one keyword in its .key file).
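For concreteness, the whittling step I describe could be sketched roughly like this. The function names and the crude suffix-stripping heuristic are my own illustration, not the actual pipeline; a real version would use a proper stemmer (e.g. Porter) so that tags like "museums" still match "museum" in the headline.

```python
def crude_stem(token: str) -> str:
    # Very rough stand-in for a real stemmer: lowercase and
    # strip a trailing 's' so "museums" matches "museum".
    t = token.lower()
    return t[:-1] if t.endswith("s") and len(t) > 3 else t

def keep_tag(tag: str, headline: str) -> bool:
    # Keep a tag only if every (stemmed) token of the tag
    # appears somewhere in the (stemmed) headline tokens.
    head = {crude_stem(w) for w in headline.split()}
    return all(crude_stem(w) in head for w in tag.split())

def make_key_entries(headline: str, tags: list[str]) -> list[str]:
    # Filter the human-annotated tag set down to the tags
    # recoverable from the headline text (the .key file contents).
    return [t for t in tags if keep_tag(t, headline)]

headline = ("Denzel and Pauletta Washington to host fundraiser "
            "for African American museum")
tags = ["Denzel Washington", "Washington", "museums", "Oscars"]
print(make_key_entries(headline, tags))
# Drops "Oscars", since it never appears in the headline.
```

This token-level matching also explains why multi-word tags like "Denzel Washington" survive even though they are not a contiguous substring of the headline.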
I tried to train Maui on these, but the model doesn't perform well. It usually outputs no results for similarly short inputs, and when it does output something, it's often incorrect (e.g. selecting words like 'who').
Any thoughts on what to do? Training on standard datasets (e.g. SemEval 2010) yields a model that outputs nothing at all for short sentences. I thought Maui could work for headlines, since it worked on Twitter in this ACL 2015 paper:
http://www.cs.cmu.edu/~lingwang/papers/acl2015-3.pdf. If anything, that seems like a harder problem, given Twitter's informal English.
Do I just not have enough training data? Or is the data quality perhaps too poor?
Regards,
Gautam Shine