I'm putting together my own corpus, and my problem right now is that sents() isn't segmenting the way I'd like. I've already segmented my text files myself, so each file has exactly one sentence per line. I think I need sents() to treat the newline character as the only sentence delimiter. My text files are UTF-8, use Unix-style line terminators, and have no byte order mark. I'm using nltk.corpus.CategorizedPlaintextCorpusReader.
How can I get sents() to ignore punctuation when splitting sentences and just treat each line as a sentence?
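To make the goal concrete, here is a minimal sketch of the behavior I'm after. It assumes that passing nltk.tokenize.LineTokenizer as the reader's sent_tokenizer argument is an acceptable way to get line-based sentences; I haven't confirmed this is the recommended approach. The temporary directory, the "news" category folder, the file name, and the cat_pattern are all made up for illustration and are not my real corpus layout.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.tokenize import LineTokenizer

# Hypothetical layout: one subdirectory per category, one sentence per line.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "news"))
sample_path = os.path.join(root, "news", "sample.txt")
with open(sample_path, "w", encoding="utf-8", newline="\n") as f:
    # Each line contains sentence-final punctuation in the middle on purpose.
    f.write("Dr. Smith arrived at 5 p.m. He was late.\n")
    f.write("Is this one sentence? Yes, it is.\n")

reader = CategorizedPlaintextCorpusReader(
    root,
    r".*\.txt",                      # file ids to include
    cat_pattern=r"(\w+)/.*",         # assumed: category = top-level folder name
    sent_tokenizer=LineTokenizer(),  # split sentences on line breaks only
    encoding="utf-8",
)

sentences = reader.sents()
print(len(sentences))
for sent in sentences:
    print(sent)
```

With this setup each line should come back as a single sentence, even though it contains periods and question marks. Words are still split by the reader's default word tokenizer, so punctuation shows up as separate tokens inside each sentence. One caveat I'm aware of: the reader still breaks the file into paragraphs at blank lines first, and LineTokenizer discards blank lines by default, so blank lines would not turn into empty sentences.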
Thanks.