Keyword Extraction


Dan Schultz

Oct 6, 2011, 12:26:21 PM
to meta-met...@googlegroups.com
The keyword extraction algorithm I put together for the demo last week was, well, thrown together.

I'm looking to make it real today, and I've posed a question on StackOverflow as I research techniques and tricks.  I'm also adding an API parameter "klen" which will allow you to specify how many words per keyword you are targeting (e.g. klen=2 would mean the returned keywords have 2 words each.)
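To make the "klen" idea concrete, here's a naive frequency-based sketch of what that parameter could do. The function name, stoplist, and ranking are all illustrative, not the actual API:

```python
from collections import Counter

def extract_keywords(text, klen=1, limit=10):
    """Return the `limit` most frequent keywords of `klen` words each.

    A hypothetical sketch of the proposed `klen` parameter: extract all
    n-grams of length `klen` and rank them by raw frequency.
    """
    # Toy stoplist for illustration; a real one would come from
    # something like nltk.corpus.stopwords.
    stopwords = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}
    tokens = [w for w in text.lower().split()
              if w.isalpha() and w not in stopwords]
    # Slide a window of size klen over the token stream.
    ngrams = zip(*(tokens[i:] for i in range(klen)))
    counts = Counter(" ".join(g) for g in ngrams)
    return [kw for kw, _ in counts.most_common(limit)]
```

So klen=1 returns single words, klen=2 returns two-word phrases, and so on.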

The Question on StackOverflow for anyone who cares to check (or if you know NLP folks and can point them to it): http://bit.ly/o4mQlk

Best,
 - Dan

--
Dan Schultz
P: (215) 400-1233
E: schu...@mit.edu
T: @slifty
W: http://www.pbs.org/idealab/dan_schultz/

Tathagata Dasgupta

Oct 6, 2011, 12:54:02 PM
to meta-met...@googlegroups.com
Did you have a look at http://bit.ly/nhxrKJ ? I had to do something
similar for one of my projects and the results looked acceptable.
@Laurian, what do you think of using pointwise mutual information for
keyword extraction?
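For anyone unfamiliar, PMI scores a word pair by how much more often the two words co-occur than chance would predict: PMI(x, y) = log2(p(x, y) / (p(x) p(y))). A minimal pure-Python sketch of the idea (NLTK's nltk.collocations module provides the same measure, among others):

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=1):
    """Rank adjacent word pairs by pointwise mutual information.

    Pure-Python sketch for illustration; a real implementation would
    likely use NLTK's collocation finders instead.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs make PMI unstable
        p_xy = count / (n - 1)          # bigram probability
        p_x = unigrams[x] / n           # unigram probabilities
        p_y = unigrams[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores, key=scores.get, reverse=True)
```

Pairs that always appear together (e.g. "new york") score high; pairs that co-occur only as often as chance score near zero.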

--
Cheers,
T

Dan Schultz

Oct 6, 2011, 1:00:24 PM
to meta-met...@googlegroups.com
Yeah T -- that's what I used to do the original implementation.  It only does bigrams (and trigrams) however, and doesn't work particularly well for small documents.

To identify unique 1-grams I need to do tf-idf which requires a corpus to base uniqueness on.  I'll just use a default (e.g. crappy) corpus for now, since nltk has a few default corpora built in.  I'm going to write up a proposed architecture for multi-document operations (this would allow someone to upload their own corpus).  I have some ideas on that front already.
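The tf-idf idea here, sketched in miniature. The corpus is passed in as a plain list of strings to stand in for wherever the document frequencies would actually come from (e.g. one of NLTK's built-in corpora); function name and scoring details are illustrative:

```python
import math
from collections import Counter

def tfidf_keywords(doc, corpus_docs, limit=5):
    """Rank a document's unigrams by tf-idf against a background corpus.

    `corpus_docs` is a stand-in for an NLTK corpus: term frequency comes
    from `doc`, document frequency from the background documents.
    """
    doc_tokens = doc.lower().split()
    tf = Counter(doc_tokens)
    n_docs = len(corpus_docs)
    # Document frequency: in how many background docs does each word appear?
    df = Counter()
    for d in corpus_docs:
        df.update(set(d.lower().split()))
    # Smoothed idf so words absent from the corpus don't divide by zero.
    scores = {
        w: (count / len(doc_tokens)) * math.log((1 + n_docs) / (1 + df[w]))
        for w, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:limit]
```

Words that are frequent in the document but rare in the background corpus float to the top, which is exactly the "uniqueness" being described.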

To identify good 2 and 3-grams in small bodies of text I'll need to think a bit more.

Higher-than-3-grams are a low priority, but would be nice to have just because why not.

Best,
 - Dan

Tathagata Dasgupta

Oct 6, 2011, 1:17:05 PM
to meta-met...@googlegroups.com
On Thu, Oct 6, 2011 at 12:00 PM, Dan Schultz <sli...@gmail.com> wrote:
> Yeah T -- that's what I used to do the original implementation.  It only
> does bigrams (and trigrams) however, and doesn't work particularly well for
> small documents.
140 char-ish small?

> To identify unique 1-grams I need to do tf-idf which requires a corpus to
> base uniqueness on.  I'll just use a default (e.g. crappy) corpus for now,

I think the included Wall Street Journal corpus should be less crappy
for this job: http://bit.ly/oEgPKw

--
Cheers,
T

Raynor Vliegendhart

Oct 7, 2011, 12:29:33 AM
to meta-met...@googlegroups.com
About short documents: I just did a quick search and stumbled upon this poster paper: http://www.cs.uiuc.edu/~hanj/pdf/www10_zli.pdf
I haven't read it yet (it's too early in the morning :p), so I can't tell whether it's useful. However, its existence does show that research is at least being done on the problem. :)
 
-Raynor

Raynor Vliegendhart

Oct 7, 2011, 1:30:50 AM
to meta-met...@googlegroups.com
Also, I'm not sure if this would be an interesting form of metadata for the Meta^2 Project, but there's also been some work on identifying the language used in a tweet: http://staff.science.uva.nl/~tsagias/wp-content/uploads/2011/01/dir2011-carter.pdf

-Raynor