ClearTK machine learning classifiers with DKPro NLP components

Alain Loisel

Apr 7, 2016, 5:43:07 AM
to cleartk-users
Hi,

I am trying to use ClearTK and DKPro together so that I can build machine learning classifiers on top of a large set of NLP components. I have already managed to create some pipelines that train models from plain-text training files. However, I am now trying to create a classifier from training files in XML. To do that, I am using the DKPro component XmlReaderXPath, and I am trying to plug it into my ClearTK Evaluation_ImplBase class.

To do that, I tried to replace the ClearTK CollectionReader:
CollectionReader reader = CollectionReaderFactory.createCollectionReader(
    UriCollectionReader.getDescriptionFromFiles(files));

with the DKPro one that I need:
CollectionReader reader = createReader(
    XmlReaderXPath.class,
    XmlReaderXPath.PARAM_SOURCE_LOCATION, LN_ROOT,
    XmlReaderXPath.PARAM_PATTERNS, new String[] { "[+]*.txt" },
    XmlReaderXPath.PARAM_XPATH_EXPRESSION, "/*/*[local-name()='body']/*[local-name()='body.content']/*[local-name()='bodytext']/*[local-name()='p']",
    XmlReaderXPath.PARAM_LANGUAGE, "en"
);
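
For reference, the XPath expression itself can be checked outside UIMA. The following is only a minimal sketch using the JDK's javax.xml.xpath API, not anything from DKPro or ClearTK. The class name XPathCheck, the method selectParagraphs, and the sample document (including its namespace) are made up for illustration, since the real training files are not shown here. It assumes the files have a namespaced root element, which would explain the local-name() tests.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates the XPath expression from the reader
// configuration against an in-memory XML string.
class XPathCheck {
    // Same expression as passed to PARAM_XPATH_EXPRESSION.
    static final String EXPRESSION =
        "/*/*[local-name()='body']/*[local-name()='body.content']"
        + "/*[local-name()='bodytext']/*[local-name()='p']";

    // Invented sample document; the element names follow the expression,
    // the namespace URI is an assumption.
    static final String SAMPLE =
        "<doc xmlns=\"http://example.org/ns\"><body><body.content><bodytext>"
        + "<p>First paragraph.</p><p>Second paragraph.</p>"
        + "</bodytext></body.content></body></doc>";

    // Returns the text content of every node selected by EXPRESSION.
    static List<String> selectParagraphs(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Parse namespace-aware so that local-name() sees the element names.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(EXPRESSION, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        for (String text : selectParagraphs(SAMPLE)) {
            System.out.println(text);
        }
    }
}
```

If the expression selects the expected paragraphs here, the remaining problem is in the pipeline setup rather than in the XPath.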

Unfortunately, it fails with this exception:
Caused by: org.apache.uima.cas.CASRuntimeException: No sofaFS with name UriView found.
at org.apache.uima.cas.impl.CASImpl.getSofa(CASImpl.java:670)
at org.apache.uima.cas.impl.CASImpl.getView(CASImpl.java:2570)
at org.apache.uima.jcas.impl.JCasImpl.getView(JCasImpl.java:1402)
at org.cleartk.util.ViewURIUtil.getURI(ViewURIUtil.java:86)
at org.cleartk.util.ae.UriToDocumentTextAnnotator.process(UriToDocumentTextAnnotator.java:79)
at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
... 14 more

I think the problem could be that UriView is a sofa created by ClearTK that does not exist in DKPro. I would need much more experience with uimaFIT to solve this by myself. Thank you very much for your help. I am posting this message to both Google groups because I don't know whether the answer will involve modifying DKPro or ClearTK classes.

  
Alain Loisel, 
Comp linguist. 

Richard Eckart de Castilho

Apr 7, 2016, 5:58:27 AM
to cleart...@googlegroups.com
Hi,

ClearTK has a different approach to reading data than DKPro Core.

My understanding is that ClearTK first creates a very simple CAS in the reader, basically just pointing to the URL of the data.
Then an AE (e.g. UriToDocumentTextAnnotator) is used to access that URL and load its content into the CAS.
I think they use views in that process (e.g. the UriView to store the URL of the data, and maybe the default view for the content).

DKPro Core readers do all of that in one step. The reader gets the data and puts it directly into the default view.
So if you use a DKPro Core reader, you should not need to use the UriToDocumentTextAnnotator.

It has been a while, but I was able to use the cleartk-ml modules together with DKPro Core readers, writers, and analysis components without problems.
I didn't use any of the non-ML analysis components in ClearTK at that point. ClearTK-ML was (probably still is) largely type-system agnostic.
But the other components in ClearTK are tied to the ClearTK type system which differs from the DKPro Core type system.

Cheers,

-- Richard (from DKPro Core)