CrossValidation: SequenceAnnotator.isTraining() vs. ViewNames.SYSTEM_VIEW

Renaud.Richardet

Jan 20, 2012, 6:43:33 PM

to cleartk-users

Hello,

I have just started to use ClearTK and really like it, because it
allows me to reuse the CollectionReaders I have already developed, and
test different ML models without rewriting the feature extraction
every time. On top of that, I find the code really easy to understand
and well structured. So, thanks for sharing it with the world :-)

What I would like to do is 10-fold cross validation of a protein NER
on the Biocreative2 corpus. I have managed to get started with the
org.cleartk.eval.Evaluation components, but got stuck with the
org.cleartk.eval.provider.AnnotationEvaluator...
1) On the wiki tutorial, one can use if (this.isTraining()) {...}
within a CleartkSequenceAnnotator to differentiate between training
and evaluation.
2) From what I understand from the Evaluation components, one has to
use different views to store annotations in gold_standard_view and
system_view, in order to use AnnotationEvaluator.

Question: Is it possible to use the mechanism of 1) in 2)? If not, how
shall I write my CleartkSequenceAnnotator? More precisely: how shall I
use the 2 views? Just copy the gold standard annotations into the
system_view, and add the annotation from testing?

Any pointers appreciated.

All the best, Renaud

Philip Ogren

Jan 25, 2012, 11:47:19 PM

to cleart...@googlegroups.com

Hi Renaud,

Thank you for the kind words about ClearTK. I'm glad the code is easy for you to read because the documentation is not nearly as complete as we would like it to be! It remains a work in progress.

I'm not sure if we have a complete end-to-end example that fully uses the evaluation package. Probably the most likely candidate would be in the timeml sub-project which looks to have a fair bit of code in it that does evaluation (see the org.cleartk.timeml.eval package in cleartk-timeml).

You may also find it useful to look at this wiki on the uimaFIT project page which I think addresses your question fairly well and gives some nice background on how to work with views in an experimental setup:

http://code.google.com/p/uimafit/wiki/RunningExperiments

Hope this helps.

Philip


Lee Becker

Jan 26, 2012, 12:16:22 AM

to cleartk-users

Hi Renaud,

The cleartk-eval module is well suited to help you do exactly this
kind of evaluation. We really need to provide a more concrete example
to illustrate usage of the eval tools (feel free to file an issue),
but in the meantime here is an overview of what you need to get
started with the evaluation components.

At the highest level you have the Evaluation class, which provides
convenience methods for running cross validation and holdout set
evaluation:
Evaluation.runCrossValidation()
Evaluation.runHoldoutEvaluation()

Both of these methods accept three required parameters along with any
optional training arguments:
1) CorpusReaderPipeline - provides the collection readers corresponding
to your training / testing data.
2) CleartkPipelineProvider - provides the pipeline for processing your
training and testing data. This usually corresponds to the
pipeline of analysis engines you would normally run your CASes
through.
3) EvaluationPipelineProvider - provides a pipeline that analyzes
your CASes for evaluation. This is typically where you compare
the gold and system views.

You will likely need to provide implementations for each of the above
three classes to match your own evaluation needs. Typically the
CorpusReaderPipeline will extend NameBasedReaderProvider or
FixedFoldsXmiCorpusFactory.
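
To make the fold logic concrete, here is a rough plain-Java sketch of what a k-fold split over a list of document names looks like. This is not ClearTK code: KFoldSketch, Fold, and split() are made-up names for illustration only, and the actual reader providers may assign documents to folds differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of k-fold splitting; hypothetical, not part of ClearTK. */
public class KFoldSketch {

  /** One fold: the document names to train on and the ones held out for testing. */
  public static final class Fold {
    public final List<String> train;
    public final List<String> test;

    Fold(List<String> train, List<String> test) {
      this.train = train;
      this.test = test;
    }
  }

  /** Splits names into k folds; the document at index i is held out in fold (i % k). */
  public static List<Fold> split(List<String> names, int k) {
    if (k < 2 || k > names.size()) {
      throw new IllegalArgumentException("need 2 <= k <= number of documents");
    }
    List<Fold> folds = new ArrayList<>();
    for (int fold = 0; fold < k; fold++) {
      List<String> train = new ArrayList<>();
      List<String> test = new ArrayList<>();
      for (int i = 0; i < names.size(); i++) {
        // Each document lands in the test set of exactly one fold.
        (i % k == fold ? test : train).add(names.get(i));
      }
      folds.add(new Fold(train, test));
    }
    return folds;
  }

  public static void main(String[] args) {
    List<String> names = new ArrayList<>();
    for (int i = 0; i < 20; i++) {
      names.add("doc" + i + ".xmi"); // hypothetical file names
    }
    for (Fold fold : split(names, 10)) {
      System.out.println("train=" + fold.train.size() + " test=" + fold.test.size());
    }
  }
}
```

For each fold you would train a model on the train names, run the resulting pipeline over the test names, and hand the processed CASes to the evaluation pipeline.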

The simplest way to get an EvaluationPipelineProvider is to use the
BatchBasedEvaluationPipelineProvider. Its constructor accepts a list
of analysis engines to run after running the cleartkPipeline. In my
experience, I usually write a new analysis engine that extends
org.uimafit.component.JCasAnnotator_ImplBase. Usually I make my gold
and system views have identical text, so the offsets are aligned
across views. This allows me to write process methods similar to
this:

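// Sketch only: Token and YourAnnotation stand for your own JCas types.
// Needs org.apache.uima.jcas.JCas, org.apache.uima.cas.CASException,
// org.apache.uima.analysis_engine.AnalysisEngineProcessException and
// org.uimafit.util.JCasUtil.
public void process(JCas jcas) throws AnalysisEngineProcessException {
  JCas goldView;
  JCas sysView;
  try {
    goldView = jcas.getView("GOLDVIEW");
    sysView = jcas.getView("SYSTEM_VIEW");
  } catch (CASException e) {
    throw new AnalysisEngineProcessException(e);
  }

  for (Token token : JCasUtil.select(jcas, Token.class)) {
    // selectSingle expects exactly one YourAnnotation in the view; in
    // real code, look up the annotation aligned with this token instead.
    YourAnnotation goldAnnotation = JCasUtil.selectSingle(goldView, YourAnnotation.class);
    YourAnnotation sysAnnotation = JCasUtil.selectSingle(sysView, YourAnnotation.class);

    // Do some sort of comparison and tallying of accuracy, precision, recall, etc.
    if (goldAnnotation.getLabel().equals(sysAnnotation.getLabel())) {
      this.numAgree++;
    } else {
      this.numDisagree++;
    }
  }
}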

Your initialize() method will set up any counters, and you will
typically do the final computation and write out the evaluation
results in collectionProcessComplete().
There are also convenience classes for tallying precision / recall or
building confusion matrices in org.cleartk.eval.util.
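
As a rough illustration of the tallying itself, here is a plain-Java sketch of span-based precision / recall / F1. It is not the org.cleartk.eval.util API: SpanScoreSketch and its methods are hypothetical names, and it assumes an exact-match criterion in which a system span counts only if its offsets and label both match a gold span. In an annotator, add() would be called once per CAS from process(), and the scores would be read in collectionProcessComplete().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Plain-Java illustration of exact-match span scoring; hypothetical, not part of ClearTK. */
public class SpanScoreSketch {
  private int truePositives;
  private int falsePositives;
  private int falseNegatives;

  /** Encodes a span as "begin-end:label" so spans can be compared as strings. */
  public static String span(int begin, int end, String label) {
    return begin + "-" + end + ":" + label;
  }

  /** Adds the gold and system spans of one document to the running counts. */
  public void add(Set<String> gold, Set<String> system) {
    Set<String> matched = new HashSet<>(gold);
    matched.retainAll(system); // spans present in both views
    truePositives += matched.size();
    falsePositives += system.size() - matched.size();
    falseNegatives += gold.size() - matched.size();
  }

  public double precision() {
    int denominator = truePositives + falsePositives;
    return denominator == 0 ? 0.0 : (double) truePositives / denominator;
  }

  public double recall() {
    int denominator = truePositives + falseNegatives;
    return denominator == 0 ? 0.0 : (double) truePositives / denominator;
  }

  public double f1() {
    double p = precision();
    double r = recall();
    return p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
  }

  public static void main(String[] args) {
    SpanScoreSketch score = new SpanScoreSketch();
    // Offsets and labels below are made-up example data.
    Set<String> gold = new HashSet<>(Arrays.asList(
        span(0, 5, "PROTEIN"), span(10, 14, "PROTEIN")));
    Set<String> system = new HashSet<>(Arrays.asList(
        span(0, 5, "PROTEIN"), span(20, 25, "PROTEIN")));
    score.add(gold, system);
    System.out.println("P=" + score.precision() + " R=" + score.recall() + " F1=" + score.f1());
  }
}
```

Because the gold and system views share the same text, offsets from the two views can be compared directly like this.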

Sorry if this was long-winded and confusing. There are a lot of parts
to eval, but once you've wrapped your brain around it, it's not too
hard to use. Reading the Javadocs for the *Provider classes
should give you more details, but as always, keep asking questions.

Richard Eckart de Castilho

Jan 26, 2012, 4:11:21 PM

to cleart...@googlegroups.com

On 26.01.2012 at 06:16, Lee Becker wrote:

> The simplest way to get an EvaluationPipelineProvider is to use the
> BatchBasedEvaluationPipelineProvider. Its constructor accepts a list
> of analysis engines to run after running the cleartkPipeline.

I don't want to let this opportunity pass to mention a little framework that
I have recently released: DKPro Lab.

With DKPro Lab you can nicely model complex batch pipelines, including ML pipelines on which
you perform cross-validation and parameter sweeping. I'm afraid there is really not much
documentation so far, but there is a ClearTK-ML example coming with it.

Here you can find a ClearTK Maxent-based POS-tagger example:

http://code.google.com/p/dkpro-lab/source/browse/de.tudarmstadt.ukp.dkpro.lab/de.tudarmstadt.ukp.dkpro.lab.ml.example/src/test/java/de/tudarmstadt/ukp/dkpro/lab/ml/example/PosExampleMaxEnt.java

The JARs are available from the UKP OSS Maven repository:
http://zoidberg.ukp.informatik.tu-darmstadt.de/artifactory/webapp/search/artifact?q=lab

See http://code.google.com/p/dkpro-lab/wiki/DeveloperSetup on how to configure Maven to use it.

Feel free to get back to me if you want to use it and have any questions.

Best,

-- Richard
