Hi,
I am trying to use ClearTK and DKPro together so that I can build machine learning classifiers with a large set of NLP components. I have already managed to create some pipelines that train models from plain-text training files. However, I am now trying to build a classifier from training files in XML. To do that, I am using the DKPro component XmlReaderXPath, and I am trying to use it in my ClearTK Evaluation_ImplBase class.
I tried to replace the ClearTK CollectionReader:
CollectionReader reader = CollectionReaderFactory.createCollectionReader(
    UriCollectionReader.getDescriptionFromFiles(files));
with the DKPro one that I need:
CollectionReader reader = createReader(
    XmlReaderXPath.class,
    XmlReaderXPath.PARAM_SOURCE_LOCATION, LN_ROOT,
    XmlReaderXPath.PARAM_PATTERNS, new String[] { "[+]*.txt" },
    XmlReaderXPath.PARAM_XPATH_EXPRESSION, "/*/*[local-name()='body']/*[local-name()='body.content']/*[local-name()='bodytext']/*[local-name()='p']",
    XmlReaderXPath.PARAM_LANGUAGE, "en"
);
Unfortunately, it fails with this exception:
Caused by: org.apache.uima.cas.CASRuntimeException: No sofaFS with name UriView found.
at org.apache.uima.cas.impl.CASImpl.getSofa(CASImpl.java:670)
at org.apache.uima.cas.impl.CASImpl.getView(CASImpl.java:2570)
at org.apache.uima.jcas.impl.JCasImpl.getView(JCasImpl.java:1402)
at org.cleartk.util.ViewURIUtil.getURI(ViewURIUtil.java:86)
at org.cleartk.util.ae.UriToDocumentTextAnnotator.process(UriToDocumentTextAnnotator.java:79)
at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
... 14 more
I think the problem could be that UriView is a Sofa created by ClearTK that does not exist in DKPro. I think I would need much more experience with uimaFIT to solve this by myself. Thank you very much for your help. I am posting this message to both Google groups, as I don't know whether the answer will involve modifying DKPro or ClearTK classes.
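P.S. To make my guess about the missing view concrete, here is a toy model in plain Java. It does not use any UIMA, ClearTK or DKPro classes: ToyCas, UriViewDemo and their methods are names I made up, and the two "readers" only reflect my current understanding (the ClearTK reader stores a URI in a view called UriView, while the DKPro reader puts the text directly into the default view). It is only a sketch of the lookup failure I believe I am seeing, not a fix.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a CAS: a set of named views, each holding one string.
// NOT the UIMA API; every name in this class is hypothetical.
final class ToyCas {
    static final String DEFAULT_VIEW = "_InitialView";
    static final String URI_VIEW = "UriView";

    private final Map<String, String> views = new HashMap<>();

    void createView(String name, String content) {
        views.put(name, content);
    }

    boolean hasView(String name) {
        return views.containsKey(name);
    }

    // Fails when the view was never created, like the message in my trace.
    String getView(String name) {
        String content = views.get(name);
        if (content == null) {
            throw new IllegalStateException("No sofaFS with name " + name + " found.");
        }
        return content;
    }
}

final class UriViewDemo {
    // Assumption: models a reader that only records the document URI in UriView.
    static ToyCas uriStyleReader(String uri) {
        ToyCas cas = new ToyCas();
        cas.createView(ToyCas.URI_VIEW, uri);
        return cas;
    }

    // Assumption: models a reader that writes the text into the default view
    // and never creates UriView.
    static ToyCas textStyleReader(String text) {
        ToyCas cas = new ToyCas();
        cas.createView(ToyCas.DEFAULT_VIEW, text);
        return cas;
    }

    // Models a downstream annotator that expects UriView to exist.
    static String uriConsumer(ToyCas cas) {
        return cas.getView(ToyCas.URI_VIEW);
    }

    public static void main(String[] args) {
        // Works: the view the consumer asks for was created by the reader.
        System.out.println(uriConsumer(uriStyleReader("file:/data/doc1.xml")));

        // Fails: the text-style reader never created UriView.
        try {
            uriConsumer(textStyleReader("Some paragraph text"));
        } catch (IllegalStateException e) {
            System.out.println("Lookup failed: " + e.getMessage());
        }
    }
}
```

If this picture is right, then either something has to create the UriView after the DKPro reader runs, or the annotator that looks it up (UriToDocumentTextAnnotator in my trace) should not be in the pipeline when the reader already provides the text. I do not know which of the two is the intended approach.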
Alain Loisel,
Comp linguist.