I have a few questions about how to use the Berkeley parser plugin in ClearTK. I noticed that to use the Berkeley parser, texts have to be split into sentences, and then each sentence has to be tokenized and POS tagged.

1- What is the proper way to perform these steps (i.e. sentence boundary detection, tokenization and POS tagging) so that I can obtain a syntactic parse tree for a text? (See the pipeline sketch below.)

2- Is it possible to remove these constraints so that the Berkeley parser plugin works on raw text (i.e. without token annotations and POS tags), using only a sentence annotator (e.g. the OpenNLP sentence annotator) before the parser?

3- I have read the code of the Berkeley parser wrapper, and I am afraid the parser does not parse some sentences properly, especially sentences containing tokens that need to be normalized to the Penn Treebank convention (e.g. '(' should be converted to '-LRB-' before the parsing step). (See the bracket-mapping sketch below.)
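
To make question 1 concrete, here is a rough sketch of the kind of pipeline I have in mind, built with uimaFIT's AggregateBuilder. The sentence, token and POS component class names below are my assumptions and may differ between ClearTK versions and module layouts; the model path is just a placeholder.

import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.fit.factory.AggregateBuilder;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.jcas.JCas;

public class BerkeleyParsingPipelineSketch {
  public static void main(String[] args) throws Exception {
    AggregateBuilder builder = new AggregateBuilder();
    // 1. sentence boundary detection (OpenNLP-based; class name is an assumption)
    builder.add(org.cleartk.opennlp.tools.SentenceAnnotator.getDescription());
    // 2. tokenization (class name is an assumption)
    builder.add(org.cleartk.token.tokenizer.TokenAnnotator.getDescription());
    // 3. POS tagging (class name is an assumption)
    builder.add(org.cleartk.opennlp.tools.PosTaggerAnnotator.getDescription());
    // 4. constituency parsing with the Berkeley parser wrapper
    //    ("/path/to/model.gr" is a placeholder for an actual grammar file)
    builder.add(org.cleartk.berkeleyparser.ParserAnnotator.getDescription("/path/to/model.gr"));

    AnalysisEngine pipeline = builder.createAggregate();
    JCas jCas = JCasFactory.createJCas();
    jCas.setDocumentText("I've provided new evidence.");
    pipeline.process(jCas);
    pipeline.collectionProcessComplete();
  }
}

And this is the kind of Penn Treebank token normalization I am referring to in question 3 (just a sketch of the standard bracket mapping; the wrapper would need to apply something like it to token texts before handing them to the parser):

import java.util.HashMap;
import java.util.Map;

public class PtbBracketNormalizer {
  // Standard Penn Treebank escapes for bracket characters.
  private static final Map<String, String> BRACKETS = new HashMap<>();
  static {
    BRACKETS.put("(", "-LRB-");
    BRACKETS.put(")", "-RRB-");
    BRACKETS.put("[", "-LSB-");
    BRACKETS.put("]", "-RSB-");
    BRACKETS.put("{", "-LCB-");
    BRACKETS.put("}", "-RCB-");
  }

  // Returns the PTB form of a token, or the token unchanged if no mapping applies.
  public static String normalize(String token) {
    return BRACKETS.getOrDefault(token, token);
  }
}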
Thank you for your response. I just finished an update for the Berkeley parser wrapper that addresses my points. Is it fine if I create an issue (as mentioned on the website) and submit my patch for it?
Hi,

Following up on my previous emails, I believe the Berkeley parser wrapper adds duplicate annotations to a document. More precisely, the parser adds two TopTreebankNode annotations for each sentence, and for each TerminalTreebankNode annotation it creates one extra TreebankNode annotation. Here is a test case that demonstrates the problem:

@Test
public void test() throws ResourceInitializationException, AnalysisEngineProcessException {
  AnalysisEngine engine = AnalysisEngineFactory.createEngine(
      ParserAnnotator.getDescription(MODEL_PATH));
  String sent = "I've provided new evidence.";
  // expected parse:
  // (ROOT (S (@S (NP (NN I)) (VP (VBN 've) (S (VP (VBN provided) (S (NP (JJ new) (NN evidence))))))) (. .)))
  jCas.setDocumentText(sent);
  new Sentence(jCas, 0, sent.length()).addToIndexes();
  engine.process(jCas);
  engine.collectionProcessComplete();
  // expected counts for the parse above, i.e. without any duplicated annotations
  Assert.assertEquals(1, JCasUtil.select(jCas, TopTreebankNode.class).size());
  Assert.assertEquals(6, JCasUtil.select(jCas, TerminalTreebankNode.class).size());
  Assert.assertEquals(14, JCasUtil.select(jCas, TreebankNode.class).size());
}
Please let me know if this is indeed the case, so that I can create an issue on GitHub and submit a patch for it.

Thanks,
Majid