Summarization using NLTK

Tristan Havelick

Feb 21, 2010, 8:42:51 PM
to nltk-dev
All,

I recently had a need to do some automatic document summarization in
python, and couldn't find a decent pre-existing python library to do
so. So, having toyed with nltk for a bit, I decided I could use it as
a base to code up a simple summarizer. It can be downloaded from
here:

http://tristanhavelick.com/summarize.zip (basic usage can be found in
test/summarize.doctest as well as at the end of this document)

I took the algorithm from the Java library Classifier4J here:
http://classifier4j.sourceforge.net/ but I hesitate to call it a port
of that package, as I have

(1) Used nltk wherever possible to do things like find word
frequencies and tokenize sentences rather than porting Classifier4J's
internal methods for doing such things
(2) Used a more Python-like coding style rather than a Java/C# style
(3) Tried to keep file formatting, unit tests and other style aspects
as similar to those used by nltk internally as I could.

I plan to officially release this code under some kind of open source
license soon, but I was curious whether it would be appropriate to have
it released as part of nltk itself. It seems appropriate, as
summarization is a fairly common NLP task, and other libraries that do
similar things to nltk (such as Lemur and Mahout) have summarization
capabilities.

Take a look and let me know what you think. I've really enjoyed
working with nltk, and I'd love to hear whether I'd be able to bring
anything to the table, this package or otherwise.

Here is the basic usage:

>>> import summarize


A SimpleSummarizer (currently the only summarizer) makes a summary by
using sentences with the most frequent words:

>>> ss = summarize.SimpleSummarizer()
>>> input = ("NLTK is a python library for working human-written text. "
...          "Summarize is a package that uses NLTK to create summaries.")
>>> ss.summarize(input, 1)
'NLTK is a python library for working human-written text.'
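
Editor's sketch: the idea of picking sentences with the most frequent words can be illustrated roughly as follows. This is not the code in summarize.zip and not Classifier4J's algorithm; it is a minimal standard-library approximation that assumes a simple scoring rule (sum of word frequencies per sentence), a regex-based sentence splitter in place of NLTK's tokenizers, and a tiny illustrative stopword list. The names `split_sentences`, `content_words` and `summarize_by_frequency` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be longer).
STOPWORDS = {"a", "an", "and", "are", "is", "of", "that", "the", "to"}


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def content_words(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOPWORDS]


def summarize_by_frequency(text, num_sentences):
    """Return the highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))
    # Score each sentence by the total document frequency of its words.
    scores = [sum(freq[w] for w in content_words(s)) for s in sentences]
    # sorted() is stable, so tied sentences keep their original order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Because the splitter treats '!' and '?' like '.', a sentence ending in an exclamation mark is kept as its own unit. Summing frequencies favours longer sentences, so a real implementation would probably normalise the score.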

You can request as many sentences in the summary as you like.

>>> input = ("NLTK is a python library for working human-written text. "
...          "Summarize is a package that uses NLTK to create summaries. "
...          "A Summariser is really cool. "
...          "I don't think there are any other python summarisers.")
>>> ss.summarize(input, 2)
"NLTK is a python library for working human-written text. I don't think there are any other python summarisers."

Unlike the original algorithm from Classifier4J, this summarizer works
correctly with punctuation other than periods:

>>> input = ("NLTK is a python library for working human-written text! "
...          "Summarize is a package that uses NLTK to create summaries.")
>>> ss.summarize(input, 1)
'NLTK is a python library for working human-written text!'

Thanks,

Tristan Havelick

Tristan Havelick

Feb 21, 2010, 8:49:33 PM
to nltk-dev
Oh, I should mention: the version posted here was written on a Windows
machine and hasn't been tested in a Unix environment yet. I think it
will work, but I'm not 100% sure. I should be able to test it on
Linux later this week.



Nitin Madnani

Feb 21, 2010, 9:18:36 PM
to nltk...@googlegroups.com
Tristan,

This is really a good start to a summarization module in NLTK, which is
something I have wanted to do for a while but haven't found the time
for. Here are a few thoughts and ideas:

(1) I ran this code on the mac and it threw up the following error:

ValueError: line 13 of the docstring for summarize.doctest has
inconsistent leading whitespace: '\r'

So, I ran dos2unix on the files and then everything ran fine.
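
Editor's sketch: where dos2unix is not available, the same line-ending conversion can be approximated in Python. This assumes the only problem is Windows CRLF line endings, as the '\r' in the error suggests; the helper names `normalize_newlines` and `convert_file` are hypothetical.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path):
    """Rewrite the file at `path` in place with LF line endings."""
    # Binary mode avoids any platform newline translation on read or write.
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(normalize_newlines(data))
```

Calling `convert_file` on each of the affected files, for example test/summarize.doctest, should have an effect similar to running dos2unix on them.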

(2) I would make a couple of suggestions as to coding style. These are
just my suggestions and other more knowledgeable people on this list
should certainly overrule me if I am wrong. I would suggest using the
built-in 'sorted()' function rather than the 'sort()' method of a
list. That method won't work in Python 3.0 and it's nice to use
forward looking code. Also, you shouldn't split list comprehensions on
two lines. Finally, instead of:

if len(output_sentences) >= num_sentences: break

I would recommend doing:

if len(output_sentences) >= num_sentences:
    break

Assuming the other folks are okay with it, this would probably fit
best under nltk_contrib for now.

Cheers,
Nitin

Steven Bethard

Feb 21, 2010, 9:48:58 PM
to nltk...@googlegroups.com
On Sun, Feb 21, 2010 at 6:18 PM, Nitin Madnani <nmad...@umiacs.umd.edu> wrote:
> (2) I would make a couple of suggestions as to coding style. These are just
> my suggestions and other more knowledgeable  people on this list should
> certainly overrule me if I am wrong. I would suggest using the built-in
> 'sorted()' function rather than the 'sort()' method of a list. That method
> won't work in Python 3.0 and it's nice to use forward looking code.

Maybe I misunderstand you, but list.sort certainly still exists in Python 3.0:

Python 3.1 (r31:73574, Jun 26 2009, 20:21:35) [MSC v.1500 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> lst = [1, 5, 3, 2, 4]
>>> lst.sort()
>>> lst
[1, 2, 3, 4, 5]

Steve
--
Where did you get that preposterous hypothesis?
Did Steve tell you that?
--- The Hiphopopotamus

Nitin Madnani

Feb 21, 2010, 10:02:05 PM
to nltk...@googlegroups.com
Steve,

I was under the mistaken impression that the list.sort method was
being deprecated in favor of the sorted method. Of course, upon closer
inspection of the 'What's new in Python 3.0' document, I found out
that the change affects dictionaries: the 'keys()' method now returns
a view instead of a list, so you can't call the sort method on the
result of the keys() method.
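
Editor's sketch: the Python 3 behaviour described here can be checked with a few lines. The dictionary contents and variable names are made up for illustration.

```python
counts = {"python": 2, "nltk": 2, "summary": 1}

# In Python 3, keys() returns a view object, which has no sort() method.
try:
    counts.keys().sort()
except AttributeError:
    keys_has_sort = False
else:
    keys_has_sort = True

# sorted() accepts any iterable, including a keys view, and returns a new list.
ordered = sorted(counts.keys())

# list.sort() still exists; it just needs a real list to work on.
as_list = list(counts.keys())
as_list.sort()
```

After this runs under Python 3, `keys_has_sort` is False, and `ordered` and `as_list` hold the same alphabetically sorted keys.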

I apologize for the confusion. FWIW, I personally find the sorted()
method nicer to use even though it's a bit less efficient than the
in-place sort.

Nitin

Nitin Madnani

Feb 21, 2010, 10:12:23 PM
to nltk...@googlegroups.com
Whoops,

I meant the sorted() 'function' not 'method' :)

Nitin

Steven Bird

Feb 21, 2010, 10:55:23 PM
to nltk...@googlegroups.com
Thanks Tristan.

On 22 February 2010 13:18, Nitin Madnani <nmad...@umiacs.umd.edu> wrote:
> Assuming the other folks are okay with it, this would probably fit best
> under nltk_contrib for now.

Yes, I'd be happy for it to go there, and would welcome multiple
approaches. It should be easy to define an interface class for a
summarizer, and to have all implementations inherit that, as we've
done for other processing classes in NLTK.
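
Editor's sketch: one possible shape for such an interface, written with Python's `abc` module. The class names `SummarizerI` and `LeadSummarizer` are hypothetical (the trailing "I" loosely follows NLTK's naming of interface classes such as TaggerI), and the `summarize(text, num_sentences)` signature is assumed from the usage shown earlier in this thread. The toy subclass only demonstrates the inheritance pattern.

```python
import re
from abc import ABC, abstractmethod


class SummarizerI(ABC):
    """Hypothetical interface that every summarizer would implement."""

    @abstractmethod
    def summarize(self, text, num_sentences):
        """Return a summary of `text` with at most `num_sentences` sentences."""


class LeadSummarizer(SummarizerI):
    """Toy baseline: return the first `num_sentences` sentences."""

    def summarize(self, text, num_sentences):
        # Naive split after '.', '!' or '?' followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        return " ".join(sentences[:num_sentences])
```

With this layout, a frequency-based summarizer and any alternative approach could be swapped freely by code that only depends on `SummarizerI`.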

Note that we have some coding guidelines here:
http://code.google.com/p/nltk/wiki/DevelopersGuide

-Steven
