Working with the Ontonotes corpus


Ross Hendrickson

Dec 8, 2011, 9:10:05 AM
to cleart...@googlegroups.com
Hey all,

I'm trying to do some work with the Ontonotes corpus, but I'm failing at getting a Collection Reader up and running. I tried working off the PropBankGoldReader, but that was a little too complex, so I backed off and tried to just get the TreebankParsingExample to work with a single file. That is also failing. The stack trace for the TreebankParsingExample is below. Any pointers?

Also, how do you create a new type system within UimaFIT? I believe the types are automatically created from the type XML files, but are you supposed to create the XML files using the normal UIMA plugin and then do something in code? A pointer to a good example project to look at would be helpful there as well. What I'm trying to do is as follows:

1. Get a simple pipeline up and running that does the following
 - read in PTB parse files from Ontonotes
 - sets up a treebank view
 - runs through an analysis engine that does nothing 
 - writes the cas to an xmi so I can look at it via CVD
2. Add into the AE a simple tree based annotator
3. Add into the pipeline a datawriter to format the data for the mallet maxent package (cleartk internal)
4. Create a collection reader for my verb sense data
5. Create a new view for the verb sense data
6. Add into the AE a simple annotator that uses the sense view
7. Expand the AE's/Annotators to include a dozen or so other things.

Along the way I'm proposing to write several tutorials on getting up and running with the ClearTK framework and UimaFIT. Any suggestions would be welcome. I believe I have wiki access, so I will start stubbing these out. What else am I going to do over Christmas break?

1. Writing a new Collection Reader
2. Using UIMA's tools to troubleshoot
3. Setting up a Type System
4. Working with Custom Views
5. Creating an Analysis Engine
6. Creating a new Feature Extractor
7. Creating a new Datawriter for a new ML/Analysis package
8. Aggregating AE's
9. Building a Simple Pipeline
10. Creating an Experiment Pipeline

Thanks for all the help guys.

Ross

Stack Trace
Dec 8, 2011 6:53:20 AM org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl callAnalysisComponentProcess(405)
SEVERE: Exception occurred
org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:391)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:409)
at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
at org.uimafit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:209)
at org.uimafit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:145)
at org.cleartk.examples.treebank.TreebankParsingExample.main(TreebankParsingExample.java:128)
Caused by: java.lang.IllegalArgumentException: Parentheses counts do not match for treebank sentence: (TOP (S (NP-SBJ (NP (NNP Pierre)
at org.cleartk.syntax.constituent.util.TreebankFormatParser.splitSentences(TreebankFormatParser.java:504)
at org.cleartk.syntax.constituent.util.TreebankFormatParser.inferPlainText(TreebankFormatParser.java:204)
at org.cleartk.syntax.constituent.TreebankGoldAnnotator.process(TreebankGoldAnnotator.java:109)
at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
... 9 more
Dec 8, 2011 6:53:20 AM org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl processAndOutputNewCASes(275)
SEVERE: Exception occurred
org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:391)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:409)
at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
at org.uimafit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:209)
at org.uimafit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:145)
at org.cleartk.examples.treebank.TreebankParsingExample.main(TreebankParsingExample.java:128)
Caused by: java.lang.IllegalArgumentException: Parentheses counts do not match for treebank sentence: (TOP (S (NP-SBJ (NP (NNP Pierre)
at org.cleartk.syntax.constituent.util.TreebankFormatParser.splitSentences(TreebankFormatParser.java:504)
at org.cleartk.syntax.constituent.util.TreebankFormatParser.inferPlainText(TreebankFormatParser.java:204)
at org.cleartk.syntax.constituent.TreebankGoldAnnotator.process(TreebankGoldAnnotator.java:109)
at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
... 9 more
Exception in thread "main" org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:391)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:409)
at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
at org.uimafit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:209)
at org.uimafit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:145)
at org.cleartk.examples.treebank.TreebankParsingExample.main(TreebankParsingExample.java:128)
Caused by: java.lang.IllegalArgumentException: Parentheses counts do not match for treebank sentence: (TOP (S (NP-SBJ (NP (NNP Pierre)
at org.cleartk.syntax.constituent.util.TreebankFormatParser.splitSentences(TreebankFormatParser.java:504)
at org.cleartk.syntax.constituent.util.TreebankFormatParser.inferPlainText(TreebankFormatParser.java:204)
at org.cleartk.syntax.constituent.TreebankGoldAnnotator.process(TreebankGoldAnnotator.java:109)
at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
... 9 more

Steven Bethard

Dec 8, 2011, 10:09:20 AM
to cleart...@googlegroups.com
On Thu, Dec 8, 2011 at 3:10 PM, Ross Hendrickson
<ross.hen...@gmail.com> wrote:
> I'm trying to do some work with the Ontonotes corpus but I'm failing at
> getting a Collection Reader up and running. I tried working off the
> PropBankGoldReader, that was a little too complex, so I backed off and tried
> to just get the TreebankParsingExample to work with a single file. That is
> also failing. The stack trace for the TreebankParsingExample is below.

The exception is "Parentheses counts do not match for treebank
sentence: (TOP (S (NP-SBJ (NP (NNP Pierre)". My guess is that the
TreebankParsingExample expects trees one per line, and your trees span
several lines. TreebankFormatParser is really a pretty poorly
implemented class. I have it on my TODO list to implement a better
version using http://jparsec.codehaus.org/. But don't hold your
breath - it might be quite a while before I get around to that...
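
If the one-tree-per-line guess is right, one possible workaround is to preprocess the .parse file so that every tree sits on a single line before handing it to the example. The following is only a minimal sketch using the Java standard library. TreebankLineJoiner and joinTrees are hypothetical names, not part of ClearTK, and the code assumes that literal parentheses in the text are escaped (as -LRB-/-RRB- in Penn Treebank files), so that every raw parenthesis is tree structure.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a ClearTK class): rewrites multi-line trees as one tree per string. */
final class TreebankLineJoiner {

  /** Returns each complete bracketed tree in the text, with whitespace runs collapsed. */
  static List<String> joinTrees(String text) {
    List<String> trees = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    int depth = 0; // number of currently open parentheses
    for (char c : text.toCharArray()) {
      if (Character.isWhitespace(c)) {
        // Inside a tree, collapse any whitespace run (newlines included) to one space.
        if (depth > 0 && current.charAt(current.length() - 1) != ' ') {
          current.append(' ');
        }
        continue;
      }
      if (depth == 0 && c != '(') {
        throw new IllegalArgumentException("Unexpected character outside of a tree: " + c);
      }
      current.append(c);
      if (c == '(') {
        depth++;
      } else if (c == ')') {
        depth--;
        if (depth == 0) { // the outermost parenthesis just closed: one tree is complete
          trees.add(current.toString());
          current.setLength(0);
        }
      }
    }
    if (depth != 0) {
      throw new IllegalArgumentException("Unbalanced parentheses, depth at end of input: " + depth);
    }
    return trees;
  }

  public static void main(String[] args) {
    String multiLine = "(TOP (S (NP-SBJ (NNP Pierre))\n        (VP (VBZ sleeps))))\n";
    for (String tree : joinTrees(multiLine)) {
      System.out.println(tree);
    }
  }
}
```

Writing the returned strings out one per line would give a file in the layout the example is assumed to expect. If the input still fails after that, the cause is something other than line breaks.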

> pointers? Also, how do  you create a new type system within UIMA fit? I
> believe they are automatically created from the type xml files, but are you
> supposed to create the XML files using the normal UIMA plugin and then do
> something in code? A pointer to a good example project to look at would be
> helpful there as well.

I'm not 100% sure which part you're having trouble with, so here's the
whole process:

(1) Write an XML file defining the type system you want to use.
Typically you would put this XML file under src/main/resources/
somewhere. For actually creating and editing the XML file, you can
follow the UIMA tutorial here:

http://uima.apache.org/d/uimaj-2.4.0/tutorials_and_users_guides.html#ugr.tug.aae.defining_types

(2) Use JCasGen to compile your XML into Java files. In traditional
UIMA, you would use the JCasGen button (as described in the UIMA
tutorial) and be careful to always click that button every time you
modify your type system. In ClearTK, the parent pom provides a way of
having JCasGen invoked via Maven. Add a snippet like the following to
your pom.xml (assuming you've already added cleartk-release as a
<parent> as instructed in the User Setup page):

<properties>
  <!-- Generate type system files using ${jcasgen.typesystem} -->
  <jcasgen.typesystem>path/to/your/TypeSystem.xml</jcasgen.typesystem>
</properties>

<build>
  <plugins>
    <!-- Generate type system files using ${jcasgen.typesystem} -->
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>exec-maven-plugin</artifactId>
    </plugin>
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>build-helper-maven-plugin</artifactId>
    </plugin>
  </plugins>
</build>

If you want to see how exec-maven-plugin and build-helper-maven-plugin
have been configured to make this work, you can take a look at the
cleartk-parent pom - they invoke org.uimafit.util.JCasGenPomFriendly
and add ${project.build.directory}/generated-sources/jcasgen to your
build path:

http://cleartk.googlecode.com/svn/repo/org/cleartk/cleartk-parent/1.0.0/cleartk-parent-1.0.0.pom

(3) Once you have compiled your XML type system into Java classes, you
can reference it directly in your Java code. You can start writing
your code now.

(4) When you go to run your CollectionReader, you'll want to make sure
UIMA loads the type system along with your code. In traditional UIMA,
you would reference your XML type system in the XML descriptor for
your CollectionReader. In UimaFIT, you pass in an appropriate
TypeSystemDescription when you create your CollectionReader with
CollectionReaderFactory. You can either do this explicitly, using
TypeSystemDescriptionFactory to create a TypeSystemDescription, or you
can ask UimaFIT to automatically load your type system for you, by
adding a src/main/resources/META-INF/org.uimafit/types.txt file that
contains something like "classpath*:path/to/your/TypeSystem.xml". The
UimaFIT documentation for this approach is here:

http://code.google.com/p/uimafit/wiki/TypeDescriptorDetection

(5) You have now created a type system, compiled it to Java classes,
written a CollectionReader using those Java classes, and loaded the
type system alongside your code when you run your CollectionReader.


> What I'm trying to do is as follows
>
> 1. Get a simple pipeline up and running that does the following
>  - read in PTB parse files from Ontonotes
>  - sets up a treebank view
>  - runs through an analysis engine that does nothing
>  - writes the cas to an xmi so I can look at it via CVD
> 2. Add into the AE a simple tree based annotator
> 3. Add into the pipeline a datawriter to format the data for the mallet
> maxent package (cleartk internal)
> 4. Create a collection reader for my verb sense data
> 5. Create a new view for the verb sense data
> 6. Add into the AE a simple annotator that uses the sense view
> 7. Expand the AE's/Annotators to include a dozen or so other things.

You'll probably have more questions as you work through these, but
let's sort out the type system and CollectionReader issues first. Let
me know how far you get after following my steps above.

> Along the way I'm proposing to write several tutorials around getting up and
> running writing code within the cleartk framework and working with UIMAfit.
> Any suggestions would be welcome. I believe I have wiki access so I will
> start stubbing these out. What else am I going to do over Christmas break?
>
> 1. Writing a new Collection reader
> 2. Using UIMA's tools to trouble shoot.
> 3. Setting up Type System
> 4. Working with Custom Views
> 5. Creating an Analysis Engine
> 6. Creating a new Feature Extractor
> 7. Creating a new Datawriter for a new ML/Analysis package
> 7. Aggregating AE's
> 8. Building a Simple Pipeline
> 9. Creating an Experiment Pipeline

This would be awesome. Feel free to steal any of my text above that is
useful for your purposes. I've tried to point out which things are
UIMA, which things are UimaFIT and which things are ClearTK - we
definitely need to get better at that...

Steve
--
Where did you get that preposterous hypothesis?
Did Steve tell you that?
        --- The Hiphopopotamus

Ross Hendrickson

Dec 8, 2011, 11:51:07 AM
to cleart...@googlegroups.com
Steve,

Thanks for the speedy reply. I think it might also be a good idea to write a "how ClearTK is not traditional UIMA" guide. I've been through most of the UIMA tutorials, and that has in some ways confused me as to how to do things with ClearTK and UimaFIT. Thanks again for the pointers; they were extremely helpful, and just reading them cleared up a bunch of issues in my head. I'll see what I can get banged out using the info you provided. I am still relatively new to Maven, so the pom information already has lights turning on in my head. I'll start outlining stuff on the wiki after I work through each one on my own. Thanks again.

Sincerely,

Ross


--
You received this message because you are subscribed to the Google Groups "cleartk-users" group.
To post to this group, send email to cleart...@googlegroups.com.
To unsubscribe from this group, send email to cleartk-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cleartk-users?hl=en.


gun...@gmail.com

Oct 29, 2012, 10:07:32 PM
to cleart...@googlegroups.com
Hi everyone,

I'm trying to implement a WSD system within the ClearTK framework using OntoNotes. Thus I would love to get a discussion going about making ClearTK directly compatible with OntoNotes!

This includes, but is not limited to...
1. adding appropriate types - There seems to be infrastructure in place for Treebank, Propbank and named entities, but types for word senses and coreferences are needed. It would probably be worthwhile to make these a permanent addition to cleartk-type-system.

2. adding/improving necessary corpus readers - There are existing corpus readers for Treebank and Propbank (though these might need some editing). My main confusion is in how I can create a single corpus reader that produces annotated documents based on multiple different OntoNotes files (.parse, .prop, .sense, etc.), or if this is even the right way to go about doing this. My thought was to read in the treebank parse first, then add annotations from the other corresponding files. Maybe it would be better (easier?) to interface with the database, instead of using the raw OntoNotes data.

Any insight on the matter would be much appreciated.

Thanks!
James Gung

Steven Bethard

Oct 30, 2012, 9:54:22 AM
to cleart...@googlegroups.com
On Tue, Oct 30, 2012 at 3:07 AM, <gun...@gmail.com> wrote:
> 1. adding appropriate types - There seems to be infrastructure in place for
> Treebank, Propbank and named entities, but types for word senses and
> coreferences are needed. It would probably be worthwhile to make these a
> permanent addition to cleartk-type-system.

I agree. It might be worth looking at what DKPro does here. We've
talked about trying to merge our type systems in the past, so if we're
going to add some types, we should aim to be as compatible with the
DKPro types as seems reasonable.

Here are their coref types:

https://code.google.com/p/dkpro-core-asl/source/browse/de.tudarmstadt.ukp.dkpro.core-asl/trunk/de.tudarmstadt.ukp.dkpro.core.api.coref-asl/src/main/resources/desc/type/coref.xml

They may have some other types we should look at. It might be worth
asking them directly.

> 2. adding/improving necessary corpus readers - There are existing corpus
> readers for Treebank and Propbank (though these might need some editing). My
> main confusion is in how I can create a single corpus reader that produces
> annotated documents based on multiple different OntoNotes files (.parse,
> .prop, .sense, etc.), or if this is even the right way to go about doing
> this. My thought was to read in the treebank parse first, then add
> annotations from the other corresponding files. Maybe it would be better
> (easier?) to interface with the database, instead of using the raw Ontonotes
> data.

Typically, I would suggest reading all of the files into different
views as a first step. Then your code to parse the different formats
actually goes into JCasAnnotator_ImplBase classes, which look at, say,
a .prop view, and convert it to annotations on top of the plain text
(CAS.NAME_DEFAULT_SOFA) view.
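
The first step, collecting every annotation layer that belongs to one document before any format-specific parsing happens, can be sketched without any UIMA dependency. This is only an illustration under assumptions: OntoNotesFileGrouper and groupByDocument are hypothetical names, and it assumes the layers of one document share a path and differ only in extension (.parse, .prop, .sense). In a real CollectionReader, each entry of the inner map would be the file whose contents go into its own view.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: groups annotation-layer files by the document they belong to. */
final class OntoNotesFileGrouper {

  /**
   * Maps each document stem (the path without its extension) to a map from
   * layer name (the extension, e.g. "parse") to the full path of that layer's file.
   */
  static Map<String, Map<String, String>> groupByDocument(List<String> paths) {
    Map<String, Map<String, String>> documents = new TreeMap<>();
    for (String path : paths) {
      int dot = path.lastIndexOf('.');
      int slash = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
      // Skip names with no extension, dot-files, and names that end in a dot.
      if (dot <= slash + 1 || dot == path.length() - 1) {
        continue;
      }
      String stem = path.substring(0, dot);
      String layer = path.substring(dot + 1);
      documents.computeIfAbsent(stem, key -> new TreeMap<>()).put(layer, path);
    }
    return documents;
  }
}
```

Each outer entry then corresponds to one CAS, and an annotator for a given layer only needs to look up the view that was filled from that layer's file.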

Steve

James Gung

Oct 31, 2012, 12:25:59 AM
to cleart...@googlegroups.com
Thanks for the advice! I'll attack the corpus reader as you suggested. I'll post anything else I dig up about DKPro's types.

James


Torsten Zesch

Oct 31, 2012, 12:00:03 PM
to cleart...@googlegroups.com
Hi,

Some notes about the DKPro types.

Steven already pointed to the CoRef types.
They are already publicly available and we have a wrapper for the
Stanford CoRef tool that uses the types here:
http://code.google.com/p/dkpro-core-gpl/source/browse/de.tudarmstadt.ukp.dkpro.core-gpl/trunk/de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl/src/main/java/de/tudarmstadt/ukp/dkpro/core/stanfordnlp/StanfordCoreferenceResolver.java

We also have types for WSD, but this is more complicated as this part
of the code is not yet open source.
We are working on it, but it will take a bit more time.

I asked my colleague Tristan Miller (who mainly works on that part of
the code), and he provided the following details:

--------
Our type system for word senses is very generic; it's not tied to any
one sense inventory or corpus format. This has both advantages and
disadvantages. Because the system is hierarchical it's difficult to
describe it in prose, but here's a rough attempt.

We have four basic UIMA annotation types:

1. Sense is intended to represent a given annotation of a single sense.
It has three features:

a) id (String) is a string which identifies the sense in an
implementation-defined manner. For example, if you are working with
WordNet, this feature can hold the WordNet sense key or WordNet synset
ID.

b) confidence (Double) is a numeric representation of how confident
the annotator is that this sense applies to the subject of annotation.
The values it takes, and whether or not it's even used, are
implementation-defined.

c) description (String) is intended to store a human-readable
description of this sense in an implementation-defined manner. For
example, it may be a reference to the WordNet definition.


2. LexicalItemConstituent represents a constituent of a given subject
of disambiguation. It has two features:

a) constituentType (String) is an identifier which describes the type
of constituent (for example, a head word, particle, separable verb
prefix, etc.). The values it takes, and whether or not it's even
used, are implementation-defined.

b) id (String) is the unique identifier for the subject of
disambiguation this constituent is a part of.


3. WSDItem is intended to represent a given subject of disambiguation
(i.e., a target word). It has four features:

a) subjectOfDisambiguation (String) is the canonical lexical form of
the subject of disambiguation—that is, the lexical form the system
uses to look up this item in the sense inventory.

b) constituents (FSArray) is an array of LexicalItemConstituents for
this subject of disambiguation. It should have at least one element.

c) id (String) is a unique identifier for the subject of disambiguation.

d) pos (String) is an identifier for the part of speech for this
subject of disambiguation. The values it takes, and whether or not
it's even used, are implementation-defined.


4. WSDResult represents how a given disambiguation algorithm (or gold
standard) assigns Senses to a given WSDItem. It has five features:

a) wsdItem (WSDItem) is the WSDItem this WSDResult applies to.

b) senses (FSArray) is an array of Senses which are assigned to wsdItem.

c) disambiguationMethod (String) is an identifier for the
disambiguation algorithm (or gold standard) which produced this
assignment of Senses.

d) senseInventory (String) is an identifier for the sense inventory
from which the Senses are drawn.

e) comment (String) is an optional comment.
---------
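
To make the hierarchy described above easier to follow, here is a rough sketch of the four types as plain Java classes. These are not the DKPro classes (the real types would be UIMA annotation types generated from XML descriptors, with FSArray features rather than Java lists); they are stand-ins that only mirror the feature names and types listed above. WsdTypesDemo and all example values, including the sense id and inventory name, are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One assigned sense; the id format depends on the sense inventory in use. */
final class Sense {
  final String id;
  final double confidence;
  final String description;

  Sense(String id, double confidence, String description) {
    this.id = id;
    this.confidence = confidence;
    this.description = description;
  }
}

/** One constituent (head word, particle, ...) of a subject of disambiguation. */
final class LexicalItemConstituent {
  final String constituentType;
  final String id; // id of the WSDItem this constituent is a part of

  LexicalItemConstituent(String constituentType, String id) {
    this.constituentType = constituentType;
    this.id = id;
  }
}

/** A subject of disambiguation, i.e. a target word. */
final class WSDItem {
  final String subjectOfDisambiguation;
  final List<LexicalItemConstituent> constituents;
  final String id;
  final String pos;

  WSDItem(String subjectOfDisambiguation, List<LexicalItemConstituent> constituents,
      String id, String pos) {
    if (constituents.isEmpty()) {
      throw new IllegalArgumentException("A WSDItem should have at least one constituent");
    }
    this.subjectOfDisambiguation = subjectOfDisambiguation;
    this.constituents = Collections.unmodifiableList(new ArrayList<>(constituents));
    this.id = id;
    this.pos = pos;
  }
}

/** The senses that one algorithm (or gold standard) assigns to one WSDItem. */
final class WSDResult {
  final WSDItem wsdItem;
  final List<Sense> senses;
  final String disambiguationMethod;
  final String senseInventory;
  final String comment; // optional, may be null

  WSDResult(WSDItem wsdItem, List<Sense> senses, String disambiguationMethod,
      String senseInventory, String comment) {
    this.wsdItem = wsdItem;
    this.senses = Collections.unmodifiableList(new ArrayList<>(senses));
    this.disambiguationMethod = disambiguationMethod;
    this.senseInventory = senseInventory;
    this.comment = comment;
  }
}

/** Builds one example result from made-up placeholder values. */
final class WsdTypesDemo {
  static WSDResult example() {
    LexicalItemConstituent head = new LexicalItemConstituent("head", "item-1");
    WSDItem item = new WSDItem("bank", Collections.singletonList(head), "item-1", "NOUN");
    Sense sense = new Sense("bank-sense-1", 1.0, "placeholder description");
    return new WSDResult(item, Collections.singletonList(sense), "gold", "toy-inventory", null);
  }
}
```

The constituent and the item are linked only through the shared id string, which follows the description above; a WSDResult then ties one item to any number of senses from a named inventory.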

If you would like to give it a try, I can provide the type descriptors
and probably also a small example of how to use them.

-Torsten


James Gung

Oct 31, 2012, 3:05:13 PM
to cleart...@googlegroups.com
Thanks Torsten--this is super helpful! I would definitely like to try using this type system.

James