CategorizedPlaintextCorpusReader, Overriding sents() Sentence Segmentation


Matt Miller

May 21, 2023, 2:44:17 AM
to nltk-users
I'm putting together my own corpus, and my problem right now is that sents() isn't segmenting the way I'd like. I've already segmented the text files myself, so each file has exactly one sentence per line. So I think I need sents() to treat the newline character as the only sentence delimiter. My text files are UTF-8, use Unix-style line terminators, and have no byte order mark. I'm using nltk.corpus.CategorizedPlaintextCorpusReader.

How can I get sents() to ignore any punctuation, and just treat each line as a sentence?


Matt Miller

Jun 9, 2023, 10:47:25 AM
to nltk-users
I found that nltk.corpus.CategorizedPlaintextCorpusReader takes an optional "sent_tokenizer" argument, and that the following tokenizer gives one sentence per line:

sent_tokenizer = nltk.RegexpTokenizer('[^\n]+')
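
For anyone landing here later, a minimal end-to-end sketch of how that tokenizer might be wired into the reader. The temporary directory, the "notes" category, the file name, and the cat_pattern are all made up for illustration; adjust them to your own corpus layout. Passing sent_tokenizer explicitly should also mean the default Punkt model is not needed for sents().

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Build a tiny throwaway corpus: one category directory, one file,
# one sentence per line (hypothetical layout, for illustration only).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "notes"))
path = os.path.join(root, "notes", "a.txt")
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("Dr. Smith arrived. He was late\nNo punctuation here\n")

reader = CategorizedPlaintextCorpusReader(
    root,
    r".*\.txt",                      # which files belong to the corpus
    cat_pattern=r"(\w+)/.*",         # category = name of the top-level directory
    sent_tokenizer=RegexpTokenizer(r"[^\n]+"),  # every non-empty line is a sentence
)

# Each sentence comes back as a list of word tokens.
sentences = [list(sent) for sent in reader.sents(categories="notes")]
for sent in sentences:
    print(sent)
```

The first line contains two periods but still comes back as a single sentence, because only the newline ends a match. As far as I can tell, nltk.tokenize.LineTokenizer would be another way to split on line breaks, though I have only tried the RegexpTokenizer version.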