How to divide text data into paragraphs/sentences on the basis of special rules?


saurabh vyas

Mar 6, 2017, 3:06:21 AM
to nltk-users
I am trying to segment a text file containing many paragraphs (in English) into separate paragraphs, each of which needs to be assigned to a particular class on the basis of the following rule:

Scan the text from start to end; in whichever sentence/paragraph the first occurrence of a "key" word appears, classify that paragraph as some class c_key.

My thinking is that if I apply this rule, every paragraph will be assigned a class, which I can then use for further analysis.

Dimitriadis, A. (Alexis)

Mar 7, 2017, 4:29:48 AM
to nltk-...@googlegroups.com
Are you asking how to divide text into paragraphs? If so, it depends on the format of the text. In Word documents etc., each newline indicates a new paragraph, so you'd just use `text.split("\n")` (where `text` is a string variable containing the text of your file). In other formats, paragraphs are separated by a blank line (two consecutive newlines), so you'd use `text.split("\n\n")`. If you are working with XML or HTML data, paragraph boundaries would be indicated by the markup in the original (until your corpus reader strips the markup to extract the words).
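
As a rough sketch of how the splitting could be combined with the keyword rule from the question: the code below assumes blank-line-separated paragraphs, and reads the rule as "each paragraph gets the class of whichever key word appears earliest in it". The `KEYWORDS` mapping, the class labels, and the helper names `split_paragraphs` and `classify` are all made up for illustration, so adjust them to your data.

```python
import re

# Hypothetical key words and class labels, invented for this example.
KEYWORDS = {"price": "c_price", "delivery": "c_delivery"}


def split_paragraphs(text, separator="\n\n"):
    """Split text on the separator and drop empty or whitespace-only pieces."""
    return [p.strip() for p in text.split(separator) if p.strip()]


def classify(paragraph, keywords=KEYWORDS):
    """Return the class of the key word that occurs earliest, or None."""
    best_pos, best_label = None, None
    for word, label in keywords.items():
        # Whole-word, case-insensitive match for this key word.
        match = re.search(r"\b" + re.escape(word) + r"\b", paragraph, re.IGNORECASE)
        if match and (best_pos is None or match.start() < best_pos):
            best_pos, best_label = match.start(), label
    return best_label


text = "The price went up.\n\nDelivery was late, and the price too.\n\nNothing here."
labelled = [(classify(p), p) for p in split_paragraphs(text)]
```

Paragraphs that contain no key word come back with `None`, so you can decide separately what to do with them. If your paragraphs are separated by single newlines instead, pass `separator="\n"`.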

Alexis


Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis


saurabh vyas

Mar 7, 2017, 9:46:27 AM
to nltk-users
Yes, I was basically looking for something like that. Thanks.