Exception while running sample code, UimaPipelineOnHadoop

Samudra Banerjee

Feb 19, 2014, 7:23:53 PM
to dkpro-big...@googlegroups.com
Hi Experts,

I started by running the example "UimaPipelineOnHadoop" and ran into some trouble. My understanding of how this works with dkpro-bigdata is as follows (correct me if I am wrong, I really want to understand this stuff :) ):

You specify a path on your local file system from which the CollectionReader loads the txt files. These files are converted into a sequence file that is written to HDFS at the location given by the first argument in args. The second argument specifies the location where the job output (that of the reducer) will be stored.

Am I right? 


Now the problem is when I run this code on hadoop using the following command, 

hadoop jar <project_jar_file>.jar edu.sunysb.cs.dsl.lydia2.annotatorhadoop.UimaPipelineOnHadoop /user/sabanerjee/annotatorhadoop/ /user/sabanerjee/annotatorhadoop/output

I get the following exception:

Feb 19, 2014 6:35:06 PM org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl callAnalysisComponentProcess(410)
SEVERE: Exception occurred
org.apache.uima.analysis_engine.AnalysisEngineProcessException
at de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.CASWritableSequenceFileWriter.process(CASWritableSequenceFileWriter.java:144)
at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:378)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:298)
at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:224)
at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:145)
at de.tudarmstadt.ukp.dkpro.bigdata.hadoop.DkproHadoopDriver.run(DkproHadoopDriver.java:158)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at edu.sunysb.cs.dsl.lydia2.annotatorhadoop.UimaPipelineOnHadoop.main(UimaPipelineOnHadoop.java:80)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: java.io.IOException: wrong value class: de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.BinCasWithTypeSystemWritable is not class de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.BinCasWritable
at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append(SequenceFile.java:1177)
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1039)
at de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.CASWritableSequenceFileWriter.process(CASWritableSequenceFileWriter.java:139)
... 14 more

Setting job.setOutputValueClass(BinCasWritable.class) in the configure() method does not seem to help. Any idea what is going wrong here?
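For reference, a minimal sketch of what that attempt looks like in the driver (I am assuming configure(JobConf) is the hook that dkpro-bigdata calls on the driver class):

// inside my UimaPipelineOnHadoop driver (sketch only)
@Override
public void configure(JobConf job) {
    // try to force the value class the SequenceFile writer expects
    job.setOutputValueClass(BinCasWritable.class);
}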

I have another question. Suppose this issue is resolved and the pipeline runs successfully: where will the sequence file of annotated CAS objects be saved? Will it be in the same location as the first argument? And if I want to retrieve and process it later, can that be done outside Hadoop?

Regards,
Samudra

Samudra Banerjee

Feb 20, 2014, 8:00:58 PM
to dkpro-big...@googlegroups.com
I tried running the start_local.sh script using the following command:

./start_local.sh /user/sabanerjee/annotatorhadoop/ /user/sabanerjee/annotatorhadoop/output

Here, /user/sabanerjee/annotatorhadoop/ is a path on my local HDFS instance. Also, I made the following change in UimaPipelineOnHadoop.java:

return createReader(TextReader.class, TextReader.PARAM_PATH, "/home/sabanerjee/dkpro-bigdata/docs/",
                TextReader.PARAM_PATTERNS, new String[] { INCLUDE_PREFIX + "*.txt" },
                TextReader.PARAM_LANGUAGE, "en");

So the steps I followed are:

1. Run "mvn clean install" from the directory where the git repo has been cloned. This builds the jars.
2. Run start_local.sh in the examples package.

I get the following message:

14/02/20 19:38:47 INFO mapred.FileInputFormat: Total input paths to process : 0
14/02/20 19:38:48 INFO mapred.JobClient: Running job: job_201402172050_0021
14/02/20 19:38:49 INFO mapred.JobClient:  map 0% reduce 0%
14/02/20 19:39:03 INFO mapred.JobClient: Job complete: job_201402172050_0021
14/02/20 19:39:03 INFO mapred.JobClient: Counters: 4
14/02/20 19:39:03 INFO mapred.JobClient:   Job Counters 
14/02/20 19:39:03 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=5009
14/02/20 19:39:03 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/02/20 19:39:03 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/02/20 19:39:03 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0

When I open the jobtracker URL, I do not see any map or reduce tasks. The line "14/02/20 19:38:47 INFO mapred.FileInputFormat: Total input paths to process : 0" suggests that the job cannot find the input files, so I think I am misunderstanding something here. Any insights on how to run this simple example would be great!
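For what it is worth, a quick way to check whether the collection reader actually wrote anything to the input directory (same paths as above) would be:

hadoop fs -ls /user/sabanerjee/annotatorhadoop/
hadoop fs -ls /user/sabanerjee/annotatorhadoop/output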

Regards,
Samudra

Hans-Peter Zorn

Feb 21, 2014, 3:18:24 AM
to dkpro-big...@googlegroups.com
Hi,

I just fixed a bug in CASWritableSequenceFileWriter, please have a look.

However, my impression was that your data is already on HDFS and you want to process it directly. This is
what Text2CASInputFormat is for, as explained in the example I posted here previously.

The UimaPipelineOnHadoop example assumes your data is on a local disk and can be read by a UIMA
CollectionReader. It is then read locally and transferred to HDFS. If this step fails, no data is on HDFS
when the actual job starts, which is why you see the output below.

So the question is what format your input data has. If a CollectionReader is available and the data is
not too large, you can use buildCollectionReader() and let dkpro-bigdata transfer it to the cluster.

Otherwise it is a good idea to use an InputFormat that creates CASes directly in Hadoop, such as
Text2CASInputFormat or the other InputFormats in dkpro.bigdata.hadoop.io.*.
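As a very rough sketch (untested; this assumes the driver's configure(JobConf) hook and that Text2CASInputFormat implements the old mapred InputFormat interface), wiring it in would look something like:

@Override
public void configure(JobConf job) {
    // build CASes directly from text records on HDFS instead of
    // staging the data through a local CollectionReader
    job.setInputFormat(Text2CASInputFormat.class);
}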

The output consists of serialized CASes in SequenceFiles. You can read them outside of Hadoop using
the Hadoop SequenceFile API locally. I thought there was an example for that, but it seems there isn't one.

If you need that, I can provide you with an example.
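In the meantime, a rough, untested sketch using the plain Hadoop SequenceFile API would look something like the following (the key and value classes are read from the file header, so nothing dkpro-specific has to be hard-coded):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class DumpSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // e.g. a part-00000 file from the job output directory
        Path path = new Path(args[0]);
        FileSystem fs = path.getFileSystem(conf);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            // the key and value classes are stored in the SequenceFile header
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                // the value holds the serialized CAS; deserialize it here with the
                // matching CASWritable type - we just print the key for illustration
                System.out.println(key + " (" + value.getClass().getSimpleName() + ")");
            }
        }
        finally {
            reader.close();
        }
    }
}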

The usual approach is to perform the computations on Hadoop, store the results as text or CSV on HDFS,
and transfer them at the end.
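For example, the final (text/CSV) output can be pulled back to the local file system with something like (paths as in your example):

hadoop fs -getmerge /user/sabanerjee/annotatorhadoop/output results.txt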

-hp 

Samudra Banerjee

Feb 21, 2014, 11:30:50 AM
to Hans-Peter Zorn, dkpro-big...@googlegroups.com
Hi Hans,

Thank you so much for the detailed explanation.

Yes, the ultimate goal is to read data directly from HDFS, but first I wanted to familiarize myself with dkpro-bigdata by running the examples and seeing how things work. I have a bunch of text files (not very large, around 40 MB in total) in the location I specified in the UimaPipelineOnHadoop collection reader. The sequence file to be read by the Hadoop framework was probably not getting generated because of the bug, right? So am I right in assuming that the input arguments to the "start_local.sh" script should just be empty locations on HDFS to hold the sequence files and other output?

Again, the format I will ultimately be using is a set of unannotated serialized CAS objects stored in HDFS, either as a sequence file or as an archive, and for that I may have to use Text2CASInputFormat, because the full set will be around 10 GB in size. I will move to that soon, once I understand the system and am able to run "something" on Hadoop :)

The ability to read sequence files using the SequenceFile API sounds good. I might need the example, but maybe not immediately. I will get back to you when I need it. Thanks for that.

Regards,
Samudra

Samudra Banerjee
First Year Graduate Student
Department of Computer Science
State University of New York
Stony Brook, NY 11790
631-496-6939


Hans-Peter Zorn

Feb 21, 2014, 11:36:17 AM
to dkpro-big...@googlegroups.com, Hans-Peter Zorn
Hi Samudra,



The sequence file to be read by the Hadoop framework was probably not getting generated because of the bug, right? So am I right in assuming that the input arguments to the "start_local.sh" script should just be empty locations on HDFS to hold the sequence files and other output?

Yes, exactly. If you specify a buildCollectionReader() method, neither the input nor the output directory should exist yet; they will be created during the job. The input is filled using
the collection reader, then the job runs, and the results end up in the output directory.
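So if a run fails halfway through, you may have to remove the leftovers before trying again, e.g. (using the paths from your example):

hadoop fs -rmr /user/sabanerjee/annotatorhadoop/output
hadoop fs -rmr /user/sabanerjee/annotatorhadoop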
 

The ability to read sequence files using the SequenceFile API sounds good. I might need the example, but maybe not immediately. I will get back to you when I need it. Thanks for that.

Ok, hope everything works out. Sounds like an interesting project.

Best,
-hp

Samudra Banerjee

Feb 21, 2014, 5:57:55 PM
to Hans-Peter Zorn, dkpro-big...@googlegroups.com
Hi Hans,

Thanks for your wishes!

I still seem to be facing a few issues with this. I see a file named part-00000 inside the specified input directory. I presume this is the sequence file that gets generated, right?
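As far as I can tell, hadoop fs -text should be able to decode the SequenceFile if one wants to peek at it:

hadoop fs -text /user/sabanerjee/annotatorhadoop/part-00000 | head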

The map tasks seem to fail. I get the following output on the console:

INFO: Found [4808] resources to be read
compressing
14/02/21 17:18:02 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/02/21 17:18:02 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
14/02/21 17:18:02 INFO compress.CodecPool: Got brand-new compressor
14/02/21 17:18:18 INFO mapred.FileInputFormat: Total input paths to process : 1
14/02/21 17:18:18 INFO mapred.JobClient: Running job: job_201402172050_0038
14/02/21 17:18:19 INFO mapred.JobClient:  map 0% reduce 0%
14/02/21 17:18:42 INFO mapred.JobClient: Task Id : attempt_201402172050_0038_m_000001_0, Status : FAILED
Error: org.apache.hadoop.fs.LocalFileSystem.listFiles(Lorg/apache/hadoop/fs/Path;Z)Lorg/apache/hadoop/fs/RemoteIterator;
14/02/21 17:18:42 INFO mapred.JobClient: Task Id : attempt_201402172050_0038_m_000000_0, Status : FAILED
Error: org.apache.hadoop.fs.LocalFileSystem.listFiles(Lorg/apache/hadoop/fs/Path;Z)Lorg/apache/hadoop/fs/RemoteIterator;
14/02/21 17:18:50 INFO mapred.JobClient: Task Id : attempt_201402172050_0038_m_000001_1, Status : FAILED
Error: org.apache.hadoop.fs.LocalFileSystem.listFiles(Lorg/apache/hadoop/fs/Path;Z)Lorg/apache/hadoop/fs/RemoteIterator;
14/02/21 17:18:51 INFO mapred.JobClient: Task Id : attempt_201402172050_0038_m_000000_1, Status : FAILED
Error: org.apache.hadoop.fs.LocalFileSystem.listFiles(Lorg/apache/hadoop/fs/Path;Z)Lorg/apache/hadoop/fs/RemoteIterator;
14/02/21 17:19:00 INFO mapred.JobClient: Task Id : attempt_201402172050_0038_m_000001_2, Status : FAILED
Error: org.apache.hadoop.fs.LocalFileSystem.listFiles(Lorg/apache/hadoop/fs/Path;Z)Lorg/apache/hadoop/fs/RemoteIterator;
14/02/21 17:19:00 INFO mapred.JobClient: Task Id : attempt_201402172050_0038_m_000000_2, Status : FAILED
Error: org.apache.hadoop.fs.LocalFileSystem.listFiles(Lorg/apache/hadoop/fs/Path;Z)Lorg/apache/hadoop/fs/RemoteIterator;
14/02/21 17:19:10 INFO mapred.JobClient: Job complete: job_201402172050_0038
14/02/21 17:19:10 INFO mapred.JobClient: Counters: 7
14/02/21 17:19:10 INFO mapred.JobClient:   Job Counters
14/02/21 17:19:10 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=20047
14/02/21 17:19:10 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/02/21 17:19:10 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/02/21 17:19:10 INFO mapred.JobClient:     Launched map tasks=8
14/02/21 17:19:10 INFO mapred.JobClient:     Data-local map tasks=8
14/02/21 17:19:10 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/02/21 17:19:10 INFO mapred.JobClient:     Failed map tasks=1
14/02/21 17:19:10 INFO mapred.JobClient: Job Failed: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201402172050_0038_m_000001
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at de.tudarmstadt.ukp.dkpro.bigdata.hadoop.DkproHadoopDriver.run(DkproHadoopDriver.java:217)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at de.tudarmstadt.ukp.dkpro.bigdata.examples.UimaPipelineOnHadoop.main(UimaPipelineOnHadoop.java:93)

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:160)

I went to jobtracker to check the exact error and I see the following:

2014-02-21 17:18:39,463 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName sabanerjee for UID 1086 from the native implementation
2014-02-21 17:18:39,465 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoSuchMethodError: org.apache.hadoop.fs.LocalFileSystem.listFiles(Lorg/apache/hadoop/fs/Path;Z)Lorg/apache/hadoop/fs/RemoteIterator;
    at de.tudarmstadt.ukp.dkpro.bigdata.hadoop.UIMAMapReduceBase.copyDir(UIMAMapReduceBase.java:189)
    at de.tudarmstadt.ukp.dkpro.bigdata.hadoop.UIMAMapReduceBase.close(UIMAMapReduceBase.java:171)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

I also tried including the hadoop-core jar in the pom.xml of the examples project:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>

and I can now see the hadoop-core-1.2.1.jar in the lib folder.

However, that does not seem to help. Could there be a Hadoop version conflict here? I am new to Hadoop as well, so I don't have much of an idea!


Regards,
Samudra

Samudra Banerjee
First Year Graduate Student
Department of Computer Science
State University of New York
Stony Brook, NY 11790
631-496-6939

Hans-Peter Zorn

Feb 22, 2014, 6:41:53 AM
to dkpro-big...@googlegroups.com, Hans-Peter Zorn
Hi,

yes, this looks like a Hadoop version conflict. Which version is installed on your cluster and which one do you use locally?
You can specify the Hadoop version as a property in your pom. This way, the dependencies of all the dkpro-bigdata
submodules will also use this version.

e.g. for MapReduce 1 / CDH 4.5:

<properties>
    <hadoop.version>2.0.0-mr1-cdh4.5.0</hadoop.version>
</properties>

For Hadoop 2, use:

<hadoop.version>2.2.0</hadoop.version>


Best,
hp

Samudra Banerjee

Feb 22, 2014, 1:38:20 PM
to Hans-Peter Zorn, dkpro-big...@googlegroups.com
OK, I have Hadoop 1.2.1 installed locally, and that is where I am currently trying to run the code. I will try out this change.

Thanks,

Samudra Banerjee
First Year Graduate Student
Department of Computer Science
State University of New York
Stony Brook, NY 11790
631-496-6939
