Keyword Extraction


Dan Schultz

Oct 6, 2011, 12:26:21 PM
to meta-met...@googlegroups.com
The keyword extraction algorithm I put together for the demo last week was, well, thrown together.

I'm looking to make it real today, and I've posed a question on StackOverflow as I research techniques and tricks.  I'm also adding an API parameter "klen" which will allow you to specify how many words per keyword you are targeting (e.g. klen=2 would mean the returned keywords have 2 words each.)
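To make the "klen" idea concrete, here's a naive frequency-based sketch of what that parameter could do. The function name, stoplist, and ranking are all illustrative, not the actual API:

```python
from collections import Counter

def extract_keywords(text, klen=1, limit=10):
    """Return the `limit` most frequent keywords of `klen` words each.

    A hypothetical sketch of the proposed `klen` parameter: extract all
    n-grams of length `klen` and rank them by raw frequency.
    """
    # Toy stoplist for illustration; a real one would come from
    # something like nltk.corpus.stopwords.
    stopwords = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}
    tokens = [w for w in text.lower().split()
              if w.isalpha() and w not in stopwords]
    # Slide a window of size klen over the token stream.
    ngrams = zip(*(tokens[i:] for i in range(klen)))
    counts = Counter(" ".join(g) for g in ngrams)
    return [kw for kw, _ in counts.most_common(limit)]
```

So klen=1 returns single words, klen=2 returns two-word phrases, and so on.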

The Question on StackOverflow for anyone who cares to check (or if you know NLP folks and can point them to it): http://bit.ly/o4mQlk

Best,
 - Dan

--
Dan Schultz
P: (215) 400-1233
E: schu...@mit.edu
T: @slifty
W: http://www.pbs.org/idealab/dan_schultz/

Tathagata Dasgupta

Oct 6, 2011, 12:54:02 PM
to meta-met...@googlegroups.com
Did you have a look at http://bit.ly/nhxrKJ ? I had to do something
similar for one of my projects and the results looked acceptable.
@Laurian, what do you think of using pointwise mutual information for
keyword extraction?
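For anyone unfamiliar, PMI scores a word pair by how much more often the two words co-occur than chance would predict: PMI(x, y) = log2(p(x, y) / (p(x) p(y))). A minimal pure-Python sketch of the idea (NLTK's nltk.collocations module provides the same measure, among others):

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=1):
    """Rank adjacent word pairs by pointwise mutual information.

    Pure-Python sketch for illustration; a real implementation would
    likely use NLTK's collocation finders instead.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs make PMI unstable
        p_xy = count / (n - 1)          # bigram probability
        p_x = unigrams[x] / n           # unigram probabilities
        p_y = unigrams[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores, key=scores.get, reverse=True)
```

Pairs that always appear together (e.g. "new york") score high; pairs that co-occur only as often as chance score near zero.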

--
Cheers,
T

Dan Schultz

Oct 6, 2011, 1:00:24 PM
to meta-met...@googlegroups.com
Yeah T -- that's what I used to do the original implementation.  It only does bigrams (and trigrams) however, and doesn't work particularly well for small documents.

To identify unique 1-grams I need to do tf-idf which requires a corpus to base uniqueness on.  I'll just use a default (e.g. crappy) corpus for now, since nltk has a few default corpora built in.  I'm going to write up a proposed architecture for multi-document operations (this would allow someone to upload their own corpus).  I have some ideas on that front already.
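The tf-idf idea here, sketched in miniature. The corpus is passed in as a plain list of strings to stand in for wherever the document frequencies would actually come from (e.g. one of NLTK's built-in corpora); function name and scoring details are illustrative:

```python
import math
from collections import Counter

def tfidf_keywords(doc, corpus_docs, limit=5):
    """Rank a document's unigrams by tf-idf against a background corpus.

    `corpus_docs` is a stand-in for an NLTK corpus: term frequency comes
    from `doc`, document frequency from the background documents.
    """
    doc_tokens = doc.lower().split()
    tf = Counter(doc_tokens)
    n_docs = len(corpus_docs)
    # Document frequency: in how many background docs does each word appear?
    df = Counter()
    for d in corpus_docs:
        df.update(set(d.lower().split()))
    # Smoothed idf so words absent from the corpus don't divide by zero.
    scores = {
        w: (count / len(doc_tokens)) * math.log((1 + n_docs) / (1 + df[w]))
        for w, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:limit]
```

Words that are frequent in the document but rare in the background corpus float to the top, which is exactly the "uniqueness" being described.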

To identify good 2 and 3-grams in small bodies of text I'll need to think a bit more.

Higher-than-3-grams are a low priority, but would be nice to have just because why not.

Best,
 - Dan

Tathagata Dasgupta

Oct 6, 2011, 1:17:05 PM
to meta-met...@googlegroups.com
On Thu, Oct 6, 2011 at 12:00 PM, Dan Schultz <sli...@gmail.com> wrote:
> Yeah T -- that's what I used to do the original implementation.  It only
> does bigrams (and trigrams) however, and doesn't work particularly well for
> small documents.
140 char-ish small?

> To identify unique 1-grams I need to do tf-idf which requires a corpus to
> base uniqueness on.  I'll just use a default (e.g. crappy) corpus for now,

I think the included Wall Street Journal corpus should be less crappy
for this job: http://bit.ly/oEgPKw

--
Cheers,
T

Raynor Vliegendhart

Oct 7, 2011, 12:29:33 AM
to meta-met...@googlegroups.com
About short documents: I just did a quick search and stumbled upon this poster paper: http://www.cs.uiuc.edu/~hanj/pdf/www10_zli.pdf
I haven't read it yet (it's too early in the morning :p), so I can't tell whether it's useful. However, its existence does show that research is at least being done on the problem. :)
 
-Raynor

Raynor Vliegendhart

Oct 7, 2011, 1:30:50 AM
to meta-met...@googlegroups.com
Also, I'm not sure if this would be an interesting form of metadata for the Meta^2 Project, but there's also been some work on identifying the language used in a tweet: http://staff.science.uva.nl/~tsagias/wp-content/uploads/2011/01/dir2011-carter.pdf

-Raynor