Using NLP to Retrieve Product Names

Jules Courtois

Mar 20, 2017, 4:22:35 PM
to nltk-users
Good afternoon,

I'm currently trying to extract product names from web pages in order to associate those products with patent numbers. For example, I'd like to extract phrases like "TiVo Bolt and Bolt+", "TiVo Roamio", etc. from this page: https://www.tivo.com/legal/patents

So far, my best results have come from using NLTK to preprocess the parsed HTML and tokenize the text, then running spaCy's NER and/or noun-phrase chunking and comparing the resulting words against some English dictionaries (WordNet, NLTK's words corpus, Enchant). With this I get a precision of at best 5% and a recall of at best 15%, which is far too low to be useful.
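To make the dictionary-comparison step and the two scores concrete, here is a minimal sketch in plain Python. It is not the actual pipeline: the tiny `COMMON_WORDS` set is a hypothetical stand-in for WordNet or Enchant, a simple regex over capitalized runs replaces spaCy's NER and chunking, and the function names `extract_candidates` and `precision_recall` are made up for illustration.

```python
import re

# Hypothetical stand-in for a real English dictionary (WordNet, Enchant, ...).
COMMON_WORDS = {
    "the", "and", "or", "our", "is", "are", "by", "of", "one", "more",
    "patents", "products", "covered", "protected", "following",
}

# A run of capitalized tokens separated by whitespace; "+" may appear
# inside a token so that names like "Bolt+" survive.
CAP_RUN = re.compile(r"\b[A-Z][\w+]*(?:\s+[A-Z][\w+]*)*")


def extract_candidates(text):
    """Return capitalized phrases that are not made of dictionary words only."""
    candidates = set()
    for match in CAP_RUN.finditer(text):
        words = match.group().split()
        # Trim ordinary dictionary words from both ends of the run
        # (e.g. a sentence-initial "The").
        while words and words[0].lower() in COMMON_WORDS:
            words.pop(0)
        while words and words[-1].lower() in COMMON_WORDS:
            words.pop()
        if words:
            candidates.add(" ".join(words))
    return candidates


def precision_recall(predicted, gold):
    """Precision = TP / |predicted|, recall = TP / |gold|, on exact matches."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up sample text and gold set, for illustration only.
    sample = ("The TiVo Roamio is covered by the following patents. "
              "Our products TiVo Bolt and Bolt+ are protected.")
    gold = {"TiVo Roamio", "TiVo Bolt", "TiVo Mini"}
    found = extract_candidates(sample)
    print(sorted(found))
    print(precision_recall(found, gold))
```

Scoring on exact phrase matches like this is strict: a candidate such as "Bolt+" counts as a false positive if the hand-labelled set only lists the full phrase, which may be part of why the numbers look so low.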

I suppose one of the main obstacles is that web pages mapping products to patents don't contain very 'natural' language patterns.

I've been looking into CRFs and FACTORIE to maybe get better results, but creating a training corpus is a huge amount of work, which basically wouldn't be worth it compared to doing the extraction by hand.

One final idea I've had is to look into product databases, but that's far from NLP and I'd like to stick to NLP for now.

How could I improve the results I'm getting, either in the design of the pipeline or in its implementation?
Any ideas or suggestions are greatly appreciated.

Best,
Jules