CategorizedPlaintextCorpusReader, Overriding sents() Sentence Segmentation


Matt Miller

May 21, 2023, 2:44:17 AM
to nltk-users
I'm putting together my own corpus, and sents() isn't segmenting it the way I'd like. I've already segmented my text files the way I want, with exactly one sentence per line, so I think I need sents() to treat the newline character as the only sentence delimiter. My text files are UTF-8, use Unix-style line terminators, and have no byte order mark. I'm using nltk.corpus.CategorizedPlaintextCorpusReader.

How can I get sents() to ignore any punctuation, and just treat each line as a sentence?

Thanks.




Matt Miller

Jun 9, 2023, 10:47:25 AM
to nltk-users
I found that nltk.corpus.CategorizedPlaintextCorpusReader accepts an optional "sent_tokenizer" argument, and that the following tokenizer gives one sentence per line:

sent_tokenizer = nltk.RegexpTokenizer('[^\n]+')
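
For anyone who wants to see it end to end, here is a minimal sketch of how that tokenizer might be wired into the reader. The file names, sample text, and cat_pattern are made up for illustration and are not from my actual corpus. The default word tokenizer is left in place, so the punkt sentence model should not be needed.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Hypothetical two-file corpus with one sentence per line (illustration only).
root = tempfile.mkdtemp()
samples = {
    "news_a.txt": "Dr. Smith arrived at 5 p.m. today\nHe left early.\n",
    "fiction_b.txt": "Wait... what? No!\nShe ran.\n",
}
for name, text in samples.items():
    # UTF-8, Unix line endings, no byte order mark.
    with open(os.path.join(root, name), "w", encoding="utf-8", newline="\n") as f:
        f.write(text)

# Every maximal run of non-newline characters counts as one sentence.
line_tokenizer = RegexpTokenizer(r"[^\n]+")

reader = CategorizedPlaintextCorpusReader(
    root,
    r".*\.txt",
    cat_pattern=r"([a-z]+)_.*",  # assumed naming scheme: <category>_<anything>.txt
    sent_tokenizer=line_tokenizer,
)

news_sents = reader.sents(categories="news")
fiction_sents = reader.sents(categories="fiction")

for sent in fiction_sents:
    print(sent)
```

Punctuation inside a line no longer splits it, so "Wait... what? No!" stays a single sentence. One caveat: the reader still splits each file into paragraphs at blank lines before the sentence tokenizer runs, which should be harmless here because a blank line never contains a sentence. nltk.tokenize.LineTokenizer may also work as a drop-in alternative to the regular expression.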