Extract sentences from Brown corpus


Andrew Lloyd

unread,
Oct 10, 2013, 8:21:20 AM10/10/13
to nltk-...@googlegroups.com
Hi,

I'm trying to extract sentences from the Brown corpus. After importing the corpus, I ran the command below:

l1 = len(brown.sents())

l1 is 57340.

But according to multiple sources, the number of sentences is about 52,000.

For example, http://ir.shef.ac.uk/cloughie/papers/sentences.pdf (page 17, section 5.3.1) gives 52,108 sentences. How do I extract that many sentences using NLTK (probably with some pre-processing of the text)?

Thanks.

sujitpal

unread,
Oct 10, 2013, 12:15:00 PM10/10/13
to nltk-...@googlegroups.com
Hi Andrew, 

I printed the first 10 sentences of the Brown corpus using the code below. Is this what you were looking for?

>>> import nltk
>>> from nltk.corpus import brown
>>> [" ".join(sent) for sent in brown.sents()[0:10]]

-sujit

Andrew Lloyd

unread,
Oct 10, 2013, 9:39:38 PM10/10/13
to nltk-...@googlegroups.com
Hi Sujit,

Thanks for the help, but I'm looking for a way to get ~52,000 sentences (i.e. a correct segmentation of the Brown corpus into sentences). Your approach still gives 57,340 sentences in total, which seems to be an incorrect number. There is presumably some preprocessing involved, but I'm not sure what exactly.
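[Editor's note: one guess at the kind of preprocessing involved is dropping items from brown.sents() whose final token is not sentence-final punctuation, since the corpus stores headlines and titles as "sentences" too. A minimal sketch on a hand-made stand-in for brown.sents() — the sample data, the punctuation set, and the helper name are illustrative assumptions, not from this thread:]

```python
# Stand-in for brown.sents(): each item is a list of tokens, which is
# the shape NLTK's sents() actually returns.
sample_sents = [
    ["Atlanta", "Primary"],                                        # headline-like, no final punctuation
    ["The", "jury", "said", "it", "found", "no", "evidence", "."],
    ["Does", "it", "end", "with", "a", "question", "mark", "?"],
]

# Assumed punctuation set; a real filter might need to handle more cases.
SENT_FINAL = {".", "!", "?"}

def looks_like_sentence(tokens):
    """Heuristic: keep only token lists ending in sentence-final punctuation."""
    return bool(tokens) and tokens[-1] in SENT_FINAL

real_sentences = [s for s in sample_sents if looks_like_sentence(s)]
print(len(real_sentences))  # 2 -- the headline-like item is filtered out
```

On the real corpus one would apply the same filter to brown.sents(); whether that reproduces the 52,108 figure exactly is an open question.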

sujitpal

unread,
Oct 11, 2013, 12:22:29 PM10/11/13
to nltk-...@googlegroups.com
Oh, okay. Sorry, I didn't read the last part of your original email (and I don't know the answer to the question in that part, either).

-sujit

Alexis Dimitriadis

unread,
Oct 14, 2013, 7:05:32 AM10/14/13
to nltk-...@googlegroups.com
Since nobody in the know has commented, I'll venture some comments:

The NLTK's brown module breaks up sentences using an automatic sentence tokenizer, so perhaps it is simply wrong. But the NLTK total is about 10% higher than the counts you cite, and I really doubt the tokenizer is that far off: I've worked with the corpus and have not noticed anything like a 10% rate of tokenization errors.

Is the figure 52108 authoritative? That's not clear from the article you link to. If it's based on the parsed version of the corpus, it should be reliable; but then why does the author also cite alternative estimates? The NLTK's sents() list includes headlines and other interspersed textual material (a quick count gave me 2615 "sentences" that do not end with punctuation), and it likely follows different rules about how to treat quotations, etc.
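[Editor's note: Alexis's quick count can be reproduced with a one-liner over brown.sents(). The sketch below runs on a small stand-in list so it is self-contained; the sample data is invented, and the exact punctuation test Alexis used is an assumption:]

```python
import string

# Stand-in for nltk.corpus.brown.sents(); against the real corpus the same
# expression would be: sum(1 for s in brown.sents() if ...)
sents = [
    ["Atlanta", "Primary"],                                  # headline, no punctuation
    ["It", "recommended", "two", "changes", "."],
    ["``", "Only", "a", "relative", "handful", "''", "remained", "."],
]

# Count "sentences" whose last token does not end in a punctuation character.
no_final_punct = sum(1 for s in sents if s[-1][-1] not in string.punctuation)
print(no_final_punct)  # 1 -- only the headline lacks final punctuation
```

Run over the full corpus, a count of this kind is what would yield a figure like Alexis's 2615.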

I'd take a good look at the NLTK's sentences, and if you don't notice pervasive errors, maybe you can just accept them. If you need something authoritative, your best option might be to get your hands on the parsed version of the Brown corpus (it's not in the NLTK).

Best,

Alexis

Andrew Lloyd

unread,
Oct 14, 2013, 12:01:36 PM10/14/13
to nltk-...@googlegroups.com
Thanks for the reply, Alexis. You're right: the 52108 figure seems to come from a parsed version of the Brown corpus (the Penn Treebank), not the version in NLTK.


