Line breaks in sentence tokenization


Erick Fonseca

Jul 10, 2012, 9:55:33 AM
to nltk...@googlegroups.com
Hi,
I recently found out that the Punkt tokenizer apparently does not treat line breaks as sentence delimiters. It also removes them, which made a bit of a mess of some parts of my corpus.
I believe it would be useful to treat line breaks as delimiters, or at least have a parameter for doing so. 

Anyways, for now, I'd be glad if some Punkt expert could help me out with what I should change in the NLTK code.

Joel Nothman

Jul 10, 2012, 6:34:09 PM
to nltk...@googlegroups.com, Erick Fonseca

Hi Erick,

IIRC, the original Punkt implementation assumed the text may be wrapped
(after all, it was going for a generic, multi-lingual solution that would
apply over a variety of news corpora).

Paragraph and line markers are stored in
PunktBaseClass._tokenize_words(..). They are interpreted as SBD cues for
training only in PunktTrainer._get_orthography_data(..).

However, Punkt rightly also assumes that -- at application time -- you
shouldn't be passing in text where sentence boundaries are known.

What's wrong with:

sentences = []
for para in text.split('\n'):
    sentences.extend(punkt.tokenize(para))

?
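
For reference, the same idea can be written out as a self-contained sketch. The names `split_sentences_by_line` and `naive_tokenize` are hypothetical and used only for illustration; `naive_tokenize` is a crude stand-in so the example runs without NLTK installed. With NLTK available, a Punkt tokenizer's `tokenize` method would presumably be passed in its place.

```python
import re
from typing import Callable, Iterable, List


def naive_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for a real sentence tokenizer:
    # splits after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]


def split_sentences_by_line(
    text: str, tokenize: Callable[[str], Iterable[str]]
) -> List[str]:
    # Treat every line break as a known sentence boundary and let the
    # tokenizer look for further boundaries only within each line.
    sentences: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between paragraphs
        sentences.extend(tokenize(line))
    return sentences


if __name__ == "__main__":
    sample = "A headline without a period\nFirst sentence. Second one.\n\nLast line"
    for sentence in split_sentences_by_line(sample, naive_tokenize):
        print(sentence)
```

Passing the tokenizer in as a callable keeps the line handling separate from the boundary detection, so the same wrapper should work with whichever tokenizer is actually in use.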

- Joel

Erick Fonseca

Jul 11, 2012, 11:19:06 AM
to Joel Nothman, nltk...@googlegroups.com
Thank you for your reply.

I see. I could certainly use this approach of extending a sentences
list, but I thought it would be cleaner if Punkt took care of it all.
I didn't think that wrapped text could be a common issue, though.

Cheers,
Erick Fonseca


Joel Nothman

Jul 17, 2012, 2:05:51 AM
to Erick Fonseca, nltk...@googlegroups.com

In the traditional corpora that computational linguists deal with, wrapped
text is a very common issue, as it comes from newswire or emails.

As its role is sentence boundary detection, I think it's quite reasonable
for Punkt to ignore newlines. At the end of the day, if there's
information you already know, you shouldn't be asking a tool to guess it
for you!

- Joel