I started by running the example "UimaPipelineOnHadoop" and ran into some trouble. My understanding of how this works with dkpro-bigdata is as follows (please correct me if I am wrong; I really want to understand this stuff :) ):
You specify a path on your local file system from which the CollectionReader loads the txt files. These files are converted into a sequence file, which is then written to HDFS at the location given by the first command-line argument. The second argument specifies the location where the job output (that of the reducer) will be stored.
hadoop jar <project_jar_file>.jar edu.sunysb.cs.dsl.lydia2.annotatorhadoop.UimaPipelineOnHadoop /user/sabanerjee/annotatorhadoop/ /user/sabanerjee/annotatorhadoop/output
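To make sure I understand the output side of this flow: I assume the resulting sequence file could later be read back with the plain Hadoop SequenceFile API, along these lines (a minimal sketch, assuming the Hadoop client jars are on the classpath; the class name and path argument are my own, not from dkpro-bigdata):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical reader for the sequence files the pipeline produces.
public class ReadCasSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // e.g. one part file of the job output
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            // Instantiate key/value Writables of whatever classes the file declares
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                // Each value should hold one serialized CAS; here we just print the key
                System.out.println(key);
            }
        } finally {
            reader.close();
        }
    }
}
```

Is that roughly the right picture of what ends up in HDFS?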
Feb 19, 2014 6:35:06 PM org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl callAnalysisComponentProcess(410)
SEVERE: Exception occurred
org.apache.uima.analysis_engine.AnalysisEngineProcessException
at de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.CASWritableSequenceFileWriter.process(CASWritableSequenceFileWriter.java:144)
at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:378)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:298)
at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:224)
at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:145)
at de.tudarmstadt.ukp.dkpro.bigdata.hadoop.DkproHadoopDriver.run(DkproHadoopDriver.java:158)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at edu.sunysb.cs.dsl.lydia2.annotatorhadoop.UimaPipelineOnHadoop.main(UimaPipelineOnHadoop.java:80)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: java.io.IOException: wrong value class: de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.BinCasWithTypeSystemWritable is not class de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.BinCasWritable
at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append(SequenceFile.java:1177)
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1039)
at de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.CASWritableSequenceFileWriter.process(CASWritableSequenceFileWriter.java:139)
... 14 more
It looks like the driver configures the job with job.setOutputValueClass(BinCasWritable.class); while the writer emits BinCasWithTypeSystemWritable values, which would explain the "wrong value class" mismatch above.
I have another question. Suppose this issue is resolved and the pipeline runs successfully: where will the sequence file of the annotated CAS objects be saved? Will it be the same location as the first argument? And if I want to retrieve and process it, can that be done outside Hadoop?