Hi,
I am trying to split some texts from Project Gutenberg into sentences for further analysis with the NLTK Stanford parser, but the nltk.parse.stanford package seems to require input that has already been segmented: either single sentences or lists of tokenized sentences.
From the nltk docs:
parse_sents(sentences, verbose=False)
Use StanfordParser to parse multiple sentences. Takes multiple sentences as a list where each sentence is a list of words. Each sentence will be automatically tagged with this StanfordParser instance’s tagger. If whitespaces exists inside a token, then the token will be treated as separate tokens.
Parameters:
sentences (list(list(str))) – Input sentences to parse
Return type:
iter(iter(Tree))
raw_parse(sentence, verbose=False)
Use StanfordParser to parse a sentence. Takes a sentence as a string; before parsing, it will be automatically tokenized and tagged by the Stanford Parser.
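For context, here is a minimal sketch of the list(list(str)) shape that parse_sents expects. str.split() is only a stand-in for a real word tokenizer here, so the Stanford jars are not needed just to illustrate the input format:

```python
# Sketch of the list(list(str)) input that parse_sents takes.
# str.split() is a placeholder; a real pipeline would use a proper
# word tokenizer such as nltk.word_tokenize.
segmented = ['"Hello", he said.', 'Then he left.']
tokenized = [sentence.split() for sentence in segmented]
print(tokenized)
# [['"Hello",', 'he', 'said.'], ['Then', 'he', 'left.']]
```

The open question is how to produce the `segmented` list correctly in the first place.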
I have tried the nltk sent_tokenize function, but it doesn't seem to segment dialogue properly. For example, the sentence

    "Hello", he said.

is split into two sentences:

    "Hello",
    he said.
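For illustration, fragments like these could in principle be re-joined after sent_tokenize with a heuristic merge. This is only a sketch: merge_dialogue_splits is a hypothetical helper, not an nltk API, and the "continuation starts with a lowercase letter" rule is an assumption that will not cover every case:

```python
def merge_dialogue_splits(sentences):
    """Re-join fragments that a sentence tokenizer split apart inside
    dialogue, e.g. ['"Hello",', 'he said.'].

    Assumption: a fragment beginning with a lowercase letter is the
    continuation of the preceding sentence (a dialogue tag), not a
    new sentence.
    """
    merged = []
    for s in sentences:
        s = s.strip()
        if merged and s and s[0].islower():
            # Treat this fragment as a continuation of the previous one.
            merged[-1] = merged[-1] + " " + s
        else:
            merged.append(s)
    return merged

print(merge_dialogue_splits(['"Hello",', 'he said.']))
# ['"Hello", he said.']
```

A rule like this obviously misfires on genuine sentences that start lowercase, so I would prefer a tokenizer that handles dialogue correctly in the first place.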
If anyone has any insight into whether nltk can segment text into sentences in the way the parser needs, it would be greatly appreciated.
Thanks,
Liam