How to use the Berkeley parser plugin in ClearTK


Majid Laali

unread,
Oct 18, 2015, 4:08:05 PM10/18/15
to cleartk-users
Hi, 

I have a few questions about how to use the Berkeley parser plugin in ClearTK: 
I noticed that to use the Berkeley parser, texts have to be split into sentences, and each sentence has to be tokenized and POS-tagged.
1- I was wondering what the proper way is to perform these steps (i.e. sentence boundary detection, tokenization and POS tagging), so that I can obtain a syntactic parse tree of a text.
2- Is it possible to remove these constraints so that the Berkeley parser plugin works with raw text (i.e. without token annotations and POS tags) and only requires a sentence annotator (e.g. the OpenNLP sentence annotator) before the parser?
3- I have read the code of the Berkeley parser wrapper. I am afraid the parser does not parse some sentences properly, especially sentences containing tokens that need to be normalized to the Penn Treebank conventions (e.g. '(' should be converted to '-LRB-' before the parsing step).

Thanks, 
Majid

--------------
Majid Laali, Ph.D. Candidate, 
Computer Science & Software Engineering Department
Concordia University

Lee Becker

unread,
Oct 18, 2015, 10:30:59 PM10/18/15
to cleart...@googlegroups.com
On Sun, Oct 18, 2015 at 2:08 PM, Majid Laali <mjl...@gmail.com> wrote:
I have a few questions about how to use the Berkeley parser plugin in ClearTK: 
I noticed that to use the Berkeley parser, texts have to be split into sentences, and each sentence has to be tokenized and POS-tagged.
1- I was wondering what the proper way is to perform these steps (i.e. sentence boundary detection, tokenization and POS tagging), so that I can obtain a syntactic parse tree of a text.

As it is currently written, the prerequisites for the ParserAnnotator require Sentences and POS-tagged Tokens to exist in the CAS prior to running.  The standard way to get this preprocessing would be to run:
org.cleartk.opennlp.tools.SentenceAnnotator;
org.cleartk.token.tokenizer.TokenAnnotator;
org.cleartk.opennlp.tools.PosTaggerAnnotator;

or you can use ClearNLP to do tokenization and POS tagging:
org.cleartk.opennlp.tools.SentenceAnnotator;
org.cleartk.clearnlp.Tokenizer;
org.cleartk.clearnlp.MpAnalyzer;
org.cleartk.clearnlp.PosTagger;
(with ClearTK branch feature/issue419 you could use ClearNLP's Sentence segmenter as well, but it has not been merged yet as it requires Java 8)
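For reference, here is a minimal sketch of how the OpenNLP-based chain above might be assembled with uimaFIT's AggregateBuilder, ending with the Berkeley ParserAnnotator. The no-argument getDescription() factory methods on the preprocessing annotators, and the org.cleartk.berkeleyparser package name, are assumptions based on the pattern of ParserAnnotator.getDescription(...) used elsewhere in this thread; check each annotator's actual API before relying on this.

```java
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AggregateBuilder;
import org.cleartk.berkeleyparser.ParserAnnotator;
import org.cleartk.opennlp.tools.PosTaggerAnnotator;
import org.cleartk.opennlp.tools.SentenceAnnotator;
import org.cleartk.token.tokenizer.TokenAnnotator;

class ParsingPipeline {
  // Assembles the preprocessing chain followed by the Berkeley parser.
  // The getDescription() calls are assumed by analogy with
  // ParserAnnotator.getDescription(...); verify against each annotator.
  static AnalysisEngineDescription build(String modelPath) throws Exception {
    AggregateBuilder builder = new AggregateBuilder();
    builder.add(SentenceAnnotator.getDescription());        // sentence boundaries
    builder.add(TokenAnnotator.getDescription());           // tokenization
    builder.add(PosTaggerAnnotator.getDescription());       // POS tagging
    builder.add(ParserAnnotator.getDescription(modelPath)); // constituency parse
    return builder.createAggregateDescription();
  }
}
```

The resulting description can be run over a collection reader with SimplePipeline.runPipeline, or turned into a single engine with AnalysisEngineFactory.createEngine.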


2- Is it possible to remove these constraints so that the Berkeley parser plugin works with raw text (i.e. without token annotations and POS tags) and only requires a sentence annotator (e.g. the OpenNLP sentence annotator) before the parser?

We would need to update / extend the wrappers in cleartk-berkeleyparser so that the calls to the Berkeley Parser APIs handle tokenization and POS tagging, much like we do with ClearNLP's APIs.

 
3- I have read the code of the Berkeley parser wrapper. I am afraid the parser does not parse some sentences properly, especially sentences containing tokens that need to be normalized to the Penn Treebank conventions (e.g. '(' should be converted to '-LRB-' before the parsing step).

There's probably not much we can do on the ClearTK side for this.
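For anyone hitting this, one workaround outside ClearTK is to normalize token texts before they reach the parser. The class below is a hypothetical helper, not part of ClearTK or the Berkeley parser; it simply maps the bracket tokens to the Penn Treebank escape sequences that Treebank-trained grammars expect.

```java
import java.util.Map;

// Hypothetical pre-parsing helper (not part of ClearTK): maps raw bracket
// tokens to the Penn Treebank escape sequences a Treebank-trained parser
// expects to see in its input.
class PtbEscaper {
  private static final Map<String, String> ESCAPES = Map.of(
      "(", "-LRB-", ")", "-RRB-",
      "{", "-LCB-", "}", "-RCB-",
      "[", "-LSB-", "]", "-RSB-");

  // Returns the PTB escape for bracket tokens; any other token is unchanged.
  static String escape(String token) {
    return ESCAPES.getOrDefault(token, token);
  }
}
```

Applied to each token's covered text before the parse call, this reproduces the '(' to '-LRB-' conversion mentioned above.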

Let me know if you need any clarification.
Lee

Majid Laali

unread,
Oct 18, 2015, 10:53:10 PM10/18/15
to cleart...@googlegroups.com
Hi Lee, 

Thank you for your response. I just finished an update to the Berkeley parser wrapper that addresses my points. Is it fine if I create an issue (as mentioned on the website) and submit my patch for it?

Thanks, 
Majid 


--
You received this message because you are subscribed to the Google Groups "cleartk-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cleartk-user...@googlegroups.com.
To post to this group, send email to cleart...@googlegroups.com.
Visit this group at http://groups.google.com/group/cleartk-users.
For more options, visit https://groups.google.com/d/optout.

Lee Becker

unread,
Oct 19, 2015, 1:45:38 AM10/19/15
to cleart...@googlegroups.com

On Sun, Oct 18, 2015 at 8:53 PM, Majid Laali <mjl...@gmail.com> wrote:
Thank you for your response. I just finished an update to the Berkeley parser wrapper that addresses my points. Is it fine if I create an issue (as mentioned on the website) and submit my patch for it?

Yes please.  I'll try to review it in a timely manner.

Majid Laali

unread,
Oct 19, 2015, 5:10:00 PM10/19/15
to cleart...@googlegroups.com
Please check issue #420 and the "fixes #420" pull request.

Thanks, 
Majid

--------------
Majid Laali, Ph.D. Candidate, 
Computer Science & Software Engineering Department
Concordia University
1515 St. Catherine St. West, EV9-401
Montreal, Quebec, Canada   H3G 1M8


Majid Laali

unread,
Oct 22, 2015, 6:59:43 PM10/22/15
to cleart...@googlegroups.com
Hi, 

Following up on my previous emails, I believe the Berkeley parser wrapper adds duplicate annotations to a document. More precisely, the parser adds two TopTreebankNode annotations for each sentence, and for each TerminalTreebankNode annotation it creates one extra TreebankNode annotation. Here is a test case that demonstrates the issue:

@Test
public void test() throws ResourceInitializationException, AnalysisEngineProcessException {
  AnalysisEngine engine = AnalysisEngineFactory.createEngine(
      ParserAnnotator.getDescription(MODEL_PATH));

  String sent = "I've provided new evidence.";
  // (ROOT (S (@S (NP (NN I)) (VP (VBN 've) (S (VP (VBN provided) (S (NP (JJ new) (NN evidence))))))) (. .)))

  jCas.setDocumentText(sent);
  new Sentence(jCas, 0, sent.length()).addToIndexes();

  engine.process(jCas);
  engine.collectionProcessComplete();

  Assert.assertEquals(1, JCasUtil.select(jCas, TopTreebankNode.class).size());
  Assert.assertEquals(6, JCasUtil.select(jCas, TerminalTreebankNode.class).size());
  Assert.assertEquals(14, JCasUtil.select(jCas, TreebankNode.class).size());
}


Please let me know if this is correct, so that I can create an issue on GitHub and submit a patch for it.

Thanks, 
Majid




Majid Laali, PhD Student, Concordia University

Lee Becker

unread,
Oct 23, 2015, 12:07:55 AM10/23/15
to cleart...@googlegroups.com

On Thu, Oct 22, 2015 at 4:59 PM, Majid Laali <mjl...@gmail.com> wrote:
Hi, 

Following up on my previous emails, I believe the Berkeley parser wrapper adds duplicate annotations to a document. More precisely, the parser adds two TopTreebankNode annotations for each sentence, and for each TerminalTreebankNode annotation it creates one extra TreebankNode annotation. Here is a test case that demonstrates the issue:

Please let me know if this is correct, so that I can create an issue on GitHub and submit a patch for it.

Thanks, 
Majid

Is this happening in master or in the Issue/420 branch?

Majid Laali

unread,
Oct 23, 2015, 3:27:56 PM10/23/15
to cleart...@googlegroups.com
It happens in the master branch. I resolved the issue in Issue/420.




Majid Laali, PhD Student, Concordia University
