Sentence segmentation to handle dialogue/quotations using nltk and stanford parser


liampwa...@gmail.com

Sep 28, 2016, 7:38:49 PM
to nltk-users
Hi,

I am trying to split some texts from Project Gutenberg into sentences for further analysis with the nltk Stanford parser, but the nltk.parse.stanford package seems to require input that has already been segmented into sentences (or lists of sentences).
 
From the nltk docs:
 
parse_sents(sentences, verbose=False)
    Use StanfordParser to parse multiple sentences. Takes multiple sentences as a list where each sentence is a list of words. Each sentence will be automatically tagged with this StanfordParser instance’s tagger. If whitespace exists inside a token, then the token will be treated as separate tokens.
    Parameters: sentences (list(list(str))) – Input sentences to parse
    Return type: iter(iter(Tree))

raw_parse(sentence, verbose=False)
    Use StanfordParser to parse a sentence. Takes a sentence as a string; before parsing, it will be automatically tokenized and tagged by the Stanford Parser.
 
 
I have tried the nltk sent_tokenize function, but it doesn’t seem to segment dialogue properly: e.g. the sentence - “Hello”, he said. - is split into two sentences, - “Hello” - and - he said.
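To make the behaviour I’m after concrete, here is a crude quote-aware splitter in plain Python. The regex is my own rough heuristic, not anything from nltk: it only breaks after terminal punctuation (optionally followed by a closing quote) when the next character starts a new sentence, so - “Hello”, he said. - stays together. It is obviously no substitute for a proper segmenter:

```python
import re

# Rough heuristic, not from nltk: break after ., ! or ? (optionally
# followed by a closing quote) only when the next character opens a
# new sentence (a capital letter or an opening quote).
_SENT_BOUNDARY = re.compile(
    r'(?<=[.!?][”"])\s+(?=[A-Z“"])'   # e.g.  ...said.” She ...
    r'|(?<=[.!?])\s+(?=[A-Z“"])'      # e.g.  ...said. She ...
)

def split_sentences(text):
    return [s.strip() for s in _SENT_BOUNDARY.split(text) if s.strip()]

print(split_sentences('“Hello”, he said. She nodded. “Fine,” she replied.'))
# ['“Hello”, he said.', 'She nodded.', '“Fine,” she replied.']
```

It keeps dialogue attached to its attribution because the comma after the closing quote never matches the boundary pattern, but it would misfire on abbreviations (“Mr. Smith”), so I’d much rather use something trained.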

If anyone has any insight into whether nltk can segment text into sentences in this way, it would be greatly appreciated.

Thanks,

Liam
 