Hi Erick,
IIRC, the original Punkt implementation assumed the text might be
line-wrapped (after all, it was aiming for a generic, multilingual solution
that would apply across a variety of news corpora).
Paragraph-start and line-start markers are set on the tokens in
PunktBaseClass._tokenize_words(..); they are interpreted as SBD cues only
during training, in PunktTrainer._get_orthography_data(..).
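Those flags only carry information if the training text keeps its original
line breaks, so train on the raw, still-wrapped corpus. A minimal sketch
(corpus.txt is just a hypothetical stand-in for your training data):

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical file; the point is to keep the corpus's original wrapping
# so _tokenize_words can flag line and paragraph starts.
raw_text = open('corpus.txt', encoding='utf-8').read()

trainer = PunktTrainer()
trainer.train(raw_text)  # _get_orthography_data is applied during training
punkt = PunktSentenceTokenizer(trainer.get_params())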
However, Punkt also (rightly) assumes that, at application time, you won't
be passing in text whose sentence boundaries are already known.
What's wrong with:
sentences = []
for para in text.split('\n'):
    sentences.extend(punkt.tokenize(para))
?
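With punkt a trained PunktSentenceTokenizer (e.g. from the sketch above),
the same thing as a one-liner:

sentences = [s for para in text.split('\n')
             for s in punkt.tokenize(para)]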
- Joel