Tokenize paragraphs


Robert Schafish

unread,
Jan 3, 2022, 4:25:50 AM1/3/22
to nltk-users
I want to tokenize the paragraphs in a block of text. I searched for "NLTK tokenize paragraphs" on the internet and found a few postings referencing NLTK, but when I look on the NLTK site I can find no paragraph-tokenizing module. Is there, or was there, such a module? If not, I would appreciate any references on how to tokenize paragraphs.

Cheers, BobS

Julius Hamilton

unread,
Jan 3, 2022, 11:37:16 AM1/3/22
to nltk-...@googlegroups.com
Great question.

Well, if you're lucky, your text simply separates paragraphs with newlines, often represented by the literal escape sequence "\n" in the file or string of text.

If each paragraph occupies a single line like this, you just have to split the string on newlines. Python offers a built-in method for this, str.splitlines(), which returns a list of strings.
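For example (a minimal sketch; the sample text is invented):

```python
# Each paragraph occupies exactly one line in this sample.
text = "First paragraph here.\nSecond paragraph here.\nThird paragraph here."

# str.splitlines() splits on every newline and drops the newline characters.
paragraphs = text.splitlines()
# paragraphs -> ['First paragraph here.', 'Second paragraph here.', 'Third paragraph here.']
```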

In reality, it may be more complicated.

If there are runs of multiple newlines, you could first collapse them to single newline characters with import re; re.sub("\n+", "\n", text), I think.
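Putting the two steps together (a sketch; the sample string is invented): collapse newline runs first, then split.

```python
import re

text = "Para one.\n\n\nPara two.\n\nPara three."

# Collapse any run of newlines down to a single newline...
normalized = re.sub(r"\n+", "\n", text)

# ...so each paragraph occupies one line, and splitlines() separates them.
paragraphs = normalized.splitlines()
# paragraphs -> ['Para one.', 'Para two.', 'Para three.']
```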

If your text has paragraphs on a human, conceptual level but they aren't structured precisely enough typographically for simple string-processing methods, I'd use spaCy. spaCy uses machine learning to analyse and process text.

I don't know of a specific method it has for paragraph identification, but it does have sentence segmentation. Usually with spaCy you load a language model, pass it a string, and a number of NLP operations are carried out. You can then access different kinds of language elements as attributes of the resulting Doc object. For sentences, it would look like this:

import spacy

text = "some long string"

# en_core_web_sm means "English, core, web, small": a small English
# pipeline trained on web text, i.e. a model of English grammar and
# vocabulary. There are other pipelines you could load, for English or
# other languages.
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# At this point spaCy has already carried out a number of NLP operations,
# and the results are available as attributes of the Doc object. doc.sents
# is a generator, so wrap it in list() if you need a list of sentences:
sentences = list(doc.sents)

I’m learning this stuff myself at the moment, so I can help you a bit more if you have more questions.

Best,
Julius




rscha...@gmail.com

unread,
Jan 3, 2022, 1:56:35 PM1/3/22
to nltk-...@googlegroups.com

Julius, thanks for the suggestions; it is likely that splitting on newlines will work. I will give it a try. I am working with transcripts from corporate earnings calls, so the text structure is conventional. I want to extract paragraphs that contain any keyword from a list (solar, wind, renewable, etc.). I can do this for sentences but not yet for paragraphs, which would provide more context.
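A minimal sketch of that paragraph-level keyword filter, assuming paragraphs are separated by blank lines (the sample transcript and keyword list are invented):

```python
import re

transcript = (
    "We opened two new plants this quarter.\n\n"
    "Our solar capacity grew 40%, and wind projects are on schedule.\n\n"
    "Guidance for next year is unchanged."
)
keywords = ["solar", "wind", "renewable"]

# Split the transcript into paragraphs on blank lines, dropping empties.
paragraphs = [p for p in re.split(r"\n\s*\n", transcript) if p.strip()]

# Keep only paragraphs mentioning any keyword (case-insensitive).
matches = [p for p in paragraphs
           if any(k in p.lower() for k in keywords)]
# matches -> ['Our solar capacity grew 40%, and wind projects are on schedule.']
```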


Cheers, BobS

Alexis

unread,
Jan 3, 2022, 2:10:54 PM1/3/22
to nltk-users
And here is the nltk answer: use the predefined BlanklineTokenizer, which treats a blank line as the sign of a paragraph break:

import nltk
from nltk.tokenize import BlanklineTokenizer

text = nltk.corpus.brown.raw()[:1000]
paras = BlanklineTokenizer().tokenize(text)

This tokenizer is just an instance of nltk's RegexpTokenizer, using the regex '\s*\n\s*\n\s*' as the gap pattern, so you can easily roll your own for other document formats.
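For instance, the same blank-line split can be reproduced with the standard library's re module if you want to tweak the separator pattern yourself (a sketch with an invented sample):

```python
import re

text = "First paragraph.\n\nSecond paragraph,\nsame one.\n \nThird."

# Split on blank lines (which may contain stray whitespace), mirroring
# BlanklineTokenizer's gap pattern; drop any empty pieces.
paras = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
# paras -> ['First paragraph.', 'Second paragraph,\nsame one.', 'Third.']
```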

Alexis