Hi,
I am trying to split some texts from Project Gutenberg into sentences for further analysis with the NLTK Stanford parser, but the nltk.parse.stanford package seems to require input that has already been segmented: either single sentences or lists of tokenized sentences.
From the nltk docs:
parse_sents(sentences, verbose=False)
Use StanfordParser to parse multiple sentences. Takes multiple sentences as a list where each sentence is a list of words. Each sentence will be automatically tagged with this StanfordParser instance’s tagger. If whitespaces exists inside a token, then the token will be treated as separate tokens.
Parameters:
sentences (list(list(str))) – Input sentences to parse
Return type:
iter(iter(Tree))
raw_parse(sentence, verbose=False)
Use StanfordParser to parse a sentence. Takes a sentence as a string; before parsing, it will be automatically tokenized and tagged by the Stanford Parser.
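For context, here is a minimal sketch of the list(list(str)) shape that parse_sents expects. str.split() is only a stand-in for a real word tokenizer here, so the Stanford jars are not needed just to illustrate the input format:

```python
# Sketch of the list(list(str)) input that parse_sents takes.
# str.split() is a placeholder; a real pipeline would use a proper
# word tokenizer such as nltk.word_tokenize.
segmented = ['"Hello", he said.', 'Then he left.']
tokenized = [sentence.split() for sentence in segmented]
print(tokenized)
# [['"Hello",', 'he', 'said.'], ['Then', 'he', 'left.']]
```

The open question is how to produce the `segmented` list correctly in the first place.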
I have tried the nltk sent_tokenize function, but it doesn't seem to segment dialogue properly. For example, the sentence

    "Hello", he said.

is split into two sentences:

    "Hello",
    he said.
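For illustration, fragments like these could in principle be re-joined after sent_tokenize with a heuristic merge. This is only a sketch: merge_dialogue_splits is a hypothetical helper, not an nltk API, and the "continuation starts with a lowercase letter" rule is an assumption that will not cover every case:

```python
def merge_dialogue_splits(sentences):
    """Re-join fragments that a sentence tokenizer split apart inside
    dialogue, e.g. ['"Hello",', 'he said.'].

    Assumption: a fragment beginning with a lowercase letter is the
    continuation of the preceding sentence (a dialogue tag), not a
    new sentence.
    """
    merged = []
    for s in sentences:
        s = s.strip()
        if merged and s and s[0].islower():
            # Treat this fragment as a continuation of the previous one.
            merged[-1] = merged[-1] + " " + s
        else:
            merged.append(s)
    return merged

print(merge_dialogue_splits(['"Hello",', 'he said.']))
# ['"Hello", he said.']
```

A rule like this obviously misfires on genuine sentences that start lowercase, so I would prefer a tokenizer that handles dialogue correctly in the first place.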
If anyone has any insight into whether nltk can segment text into sentences in the way the parser needs, it would be greatly appreciated.
Thanks,
Liam