Baseline systems for POS tagging, (named entity) chunking, and syntactic parsing...

Nicolas Hernandez

Apr 5, 2011, 6:35:30 AM
to cleartk-users
Hello Everyone,

I dare to ask some newbie questions. I had the dream... of finding baseline
systems that show how to develop both the trainer and the tagger for basic NLP
tasks such as POS tagging, (named entity) chunking, and syntactic parsing...
I would like to build models for processing French (we have some
resources for these tasks).

The following thread
http://groups.google.com/group/cleartk-users/browse_thread/thread/19dc22579bb551e1/e64de22be4e4aacb?lnk=gst&q=chunk#e64de22be4e4aacb
was talking about that.

I must confess I am a bit lost. The "do it yourself" approach is not so easy,
since you have to understand how uimaFIT works, the global project
structure (the dependencies between the subprojects, which package and
class does what, and what the default values of the annotators are), which
machine learning algorithm is better for which task... and finally
you also have to consider the source version, since I found some examples in the
test directories.

The current tutorial covers training and using a
POS tagger (http://code.google.com/p/cleartk/wiki/Tutorial).

I saw a wiki page for the chunking task, unfortunately out of date:
http://code.google.com/p/cleartk/wiki/ChunkTokenizer. I found
cleartk-token/src/test/java/org/cleartk/token/tokenizer/chunk/BuildTestTokenChunkModel.java,
but it depends on some packages not available in the binary release.

I have not yet faced the problem of parsing.

Does anyone have some basic examples of AEs that perform the
previously mentioned tasks, or some pointers to look at for a better
general understanding?

It would save me some time...

Thanks in advance


Steven Bethard

Apr 5, 2011, 6:53:21 AM
to cleart...@googlegroups.com
On Tue, Apr 5, 2011 at 12:35 PM, Nicolas Hernandez
<nicolas....@gmail.com> wrote:
> I dare to ask some newbie questions. I had the dream... of finding baseline
> systems that show how to develop both the trainer and the tagger for basic NLP
> tasks such as POS tagging, (named entity) chunking, and syntactic parsing...

Yeah, it's a bit overwhelming and we don't have enough documentation.
Sorry about that!

For part of speech tagging, look at the package
org.cleartk.examples.pos. Basically, the ExamplePOSAnnotator is the
same as the tutorial, BuildTestExamplePosModel shows you how to train
the model, and RunExamplePOSAnnotator shows you how to apply your
trained model to new data.

For named entity chunking, I don't think we have any example code for
training a model. But if you don't mind looking at a slightly
different task, you can see how the chunking code is used in
org.cleartk.timeml.event.EventAnnotator, which is a chunk-based event
annotator. Basically, you define a FeatureExtractor, and then you
create an AnalysisEngineDescription based on configuring a
org.cleartk.chunker.Chunker to use your FeatureExtractor. EventTrain
shows you how to train the model (it trains a few other models as well
at the same time, but you can ignore those), and EventAnnotate shows
you how to apply the model to new data.
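
As a rough illustration of what "chunking" means at the token level, here is a minimal Java sketch of the common BIO labeling scheme, where each token is labeled as beginning a chunk (B-), being inside one (I-), or being outside any chunk (O). This is not ClearTK code: the class name BioEncoder, the method encode, and the span representation are hypothetical and only meant to show the kind of per-token labels a chunk-based classifier is typically trained on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of ClearTK: turns chunk spans into per-token BIO labels. */
final class BioEncoder {

    /**
     * @param numTokens number of tokens in the sentence
     * @param spans     chunks as {firstTokenIndex, lastTokenIndexInclusive}, assumed non-overlapping
     * @param type      chunk type used as the label suffix, e.g. "PER"
     * @return one label per token: "B-type", "I-type", or "O"
     */
    static List<String> encode(int numTokens, List<int[]> spans, String type) {
        String[] labels = new String[numTokens];
        Arrays.fill(labels, "O"); // tokens outside every chunk
        for (int[] span : spans) {
            labels[span[0]] = "B-" + type; // first token of the chunk
            for (int i = span[0] + 1; i <= span[1]; i++) {
                labels[i] = "I-" + type; // remaining tokens of the chunk
            }
        }
        return new ArrayList<>(Arrays.asList(labels));
    }

    public static void main(String[] args) {
        // Five tokens, e.g. "Nicolas Hernandez asked a question";
        // tokens 0-1 form the only annotated chunk.
        List<int[]> spans = new ArrayList<>();
        spans.add(new int[] {0, 1});
        System.out.println(encode(5, spans, "PER")); // prints [B-PER, I-PER, O, O, O]
    }
}
```

With labels like these, training a chunker reduces to training a classifier that predicts one label per token from the features you extract for it.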

For syntactic parsing, we don't really have any code in ClearTK other
than wrappers to various syntactic parsers provided by others (e.g.
OpenNLP, Berkeley, Stanford). So if you want to train a new syntactic
parser, you'll probably have to work through their APIs. If you do end
up going this route, we'd of course welcome any contributions that
made this easier.

Steve
--
Where did you get that preposterous hypothesis?
Did Steve tell you that?
        --- The Hiphopopotamus

Nicolas Hernandez

Nov 24, 2011, 12:48:52 PM
to cleart...@googlegroups.com
Thanks, Steven, for your answer.

Sorry to bother you again...

I recently had another look at the code, and it seems the pointers (I
mean the names of the classes) you gave me concerning chunking
(performed in the event package) are not up to date. I cannot find any
class named EventTrain or EventAnnotate.

Thank you for helping me again to understand how to develop a chunker
(train and annotate) with ClearTK.

/Nicolas

--
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67

Steven Bethard

Nov 26, 2011, 9:05:25 PM
to cleart...@googlegroups.com
On Thu, Nov 24, 2011 at 10:48 AM, Nicolas Hernandez
<nicolas....@gmail.com> wrote:
> I recently had another look at the code, and it seems the pointers (I
> mean the names of the classes) you gave me concerning chunking
> (performed in the event package) are not up to date. I cannot find any
> class named EventTrain or EventAnnotate.

Sorry about that. I'm not quite sure how we lost EventTrain, though
the "correct" way to train an EventAnnotator is now via the TempEval
data, using TempEval2010TaskBExtents. However, that's pretty far from
a simple example now, so I wouldn't look at that. I also discovered
that it doesn't actually help event identification to train it as a
chunking task - you get better accuracy just training it as a word
classification task, so I've subsequently converted EventAnnotator to
a simple CleartkAnnotator instead of a Chunker - which means it's not
a great chunking example anymore.
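
To make the difference between the two formulations concrete, here is a small Java sketch. It is not ClearTK code: the class LabelDecoder, its methods, and the label names "EVENT", "B", "I", and "O" are hypothetical choices for illustration. In the word classification view, each token labeled EVENT becomes its own one-token annotation; in the chunking view, a B label opens a span and the I labels that follow extend it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not ClearTK code: two ways of reading per-token labels. */
final class LabelDecoder {

    /** Word classification: every token labeled "EVENT" is its own one-token span. */
    static List<int[]> decodeWordLabels(List<String> labels) {
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i < labels.size(); i++) {
            if (labels.get(i).equals("EVENT")) {
                spans.add(new int[] {i, i});
            }
        }
        return spans;
    }

    /** Chunking: "B" opens a span, following "I" labels extend it, anything else closes it. */
    static List<int[]> decodeBioLabels(List<String> labels) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means no chunk is currently open
        for (int i = 0; i < labels.size(); i++) {
            String label = labels.get(i);
            boolean continues = label.equals("I") && start >= 0;
            if (!continues && start >= 0) {
                spans.add(new int[] {start, i - 1}); // close the open chunk
                start = -1;
            }
            if (label.equals("B") || (label.equals("I") && start < 0)) {
                start = i; // open a new chunk; a stray "I" is treated like "B"
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, labels.size() - 1}); // chunk running to the last token
        }
        return spans;
    }

    public static void main(String[] args) {
        // Spans are printed as [firstToken, lastToken], both inclusive.
        for (int[] span : decodeWordLabels(Arrays.asList("O", "EVENT", "EVENT", "O"))) {
            System.out.println("word:  " + Arrays.toString(span));
        }
        for (int[] span : decodeBioLabels(Arrays.asList("O", "B", "I", "O", "B"))) {
            System.out.println("chunk: " + Arrays.toString(span));
        }
    }
}
```

Note that the word classification view cannot merge adjacent tokens into one annotation, which is the main thing the chunking view adds.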

That said, if you just want to see an example, look at revision 2843,
back when EventAnnotator was still a Chunker. Here is
EventAnnotator:

http://code.google.com/p/cleartk/source/browse/trunk/cleartk-timeml/src/main/java/org/cleartk/timeml/event/EventAnnotator.java?r=2843

Here's EventTrain:

http://code.google.com/p/cleartk/source/browse/trunk/cleartk-timeml/src/main/java/org/cleartk/timeml/event/EventTrain.java?r=2843

And here's EventAnnotate:

http://code.google.com/p/cleartk/source/browse/trunk/cleartk-timeml/src/main/java/org/cleartk/timeml/event/EventAnnotate.java?r=2843

Hope that helps,

Steve
