Not able to run hadoop index task on the remote hadoop cluster


Geeta Iyer

Apr 9, 2014, 7:51:02 AM
to druid-de...@googlegroups.com
Hi,

I am trying to run the Hadoop index task using the indexing service. My indexing service runs on one machine, and I have a separate remote Hadoop cluster.
As an example, I am trying to run the wikipedia example. I have loaded the input data wikipedia_data.json onto my HDFS cluster.
I have specified an HDFS location as the output path for the segments.

In my overlord's runtime configuration, I have specified druid.storage.type=hdfs and specified the druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>.
I have also specified druid.indexer.fork.property.druid.storage.type=hdfs and druid.indexer.fork.property.druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>.
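
Put together, the deep-storage entries in my overlord runtime.properties look roughly like this (namenode host and path are placeholders):

```properties
# Deep storage on HDFS (placeholder host/path)
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>

# Forward the same settings to the forked peon processes
druid.indexer.fork.property.druid.storage.type=hdfs
druid.indexer.fork.property.druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>
```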

I am running against Hadoop 2.2. I have added the Hadoop client jars, the Hadoop conf files, and the druid-hdfs-storage jar to the classpath when starting the overlord. This is the command I use to start the overlord:

java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:/usr/lib/hadoop/client/*:/home/fptiuser/hdfs-storage/druid-hdfs-storage-0.6.83-SNAPSHOT.jar:/etc/hadoop/conf/*:config/overlord io.druid.cli.Main server overlord

In the wikipedia_index_hadoop_task.json, I have changed the pathSpec to read the input data from HDFS ("paths" : "hdfs://<namenode>:8020/<path>/wikipedia_data.json").
I have also specified "hadoopDependencyCoordinates" : ["org.apache.hadoop:hadoop-client:2.2.0"].

When I submit this task to the indexing service, I see two issues:
1. The MR job gets launched as a LocalJobRunner job (job_localxxxx). Is there anything else that I need to configure so that the job runs on the remote Hadoop cluster?

2. The job fails with the following error:
java.lang.IllegalArgumentException: Pathname /<path>/wikipedia/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2014-04-09T11:32:45.222Z/0 from hdfs://<name-node>:8020/<path>/wikipedia/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2014-04-09T11:32:45.222Z/0 is not a valid DFS filename.
 at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:184)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:92)
        at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:817)
        at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:813)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:813)
        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:806)
        at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1933)
        at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.serializeOutIndex(IndexGeneratorJob.java:404)
        at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:384)
        at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:247)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:645)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:405)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:445)





Fangjin Yang

Apr 9, 2014, 11:52:00 PM
to druid-de...@googlegroups.com
Hi Geeta, see inline.

On Wednesday, April 9, 2014 4:51:02 AM UTC-7, Geeta Iyer wrote:
Hi,

I am trying to run the Hadoop index task using the indexing service. My indexing service runs on one machine, and I have a separate remote Hadoop cluster.
As an example, I am trying to run the wikipedia example. I have loaded the input data wikipedia_data.json onto my HDFS cluster.
I have specified an HDFS location as the output path for the segments.

In my overlord's runtime configuration, I have specified druid.storage.type=hdfs and specified the druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>.
I have also specified druid.indexer.fork.property.druid.storage.type=hdfs and druid.indexer.fork.property.druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>.

I am running against Hadoop 2.2. I have added the Hadoop client jars, the Hadoop conf files, and the druid-hdfs-storage jar to the classpath when starting the overlord. This is the command I use to start the overlord:

java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:/usr/lib/hadoop/client/*:/home/fptiuser/hdfs-storage/druid-hdfs-storage-0.6.83-SNAPSHOT.jar:/etc/hadoop/conf/*:config/overlord io.druid.cli.Main server overlord

In the wikipedia_index_hadoop_task.json, I have changed the pathSpec to read the input data from HDFS ("paths" : "hdfs://<namenode>:8020/<path>/wikipedia_data.json").
I have also specified "hadoopDependencyCoordinates" : ["org.apache.hadoop:hadoop-client:2.2.0"].

When I submit this task to the indexing service, I see two issues:
1. The MR job gets launched as a LocalJobRunner job (job_localxxxx). Is there anything else that I need to configure so that the job runs on the remote Hadoop cluster?

Make sure you also specify that you are using the druid-hdfs-storage module. If Druid cannot find the module (either provided as an extension or on the classpath), it will default to using local storage.
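
As a sketch (the exact coordinates depend on the Druid version you are running), the module can be pulled in as an extension via runtime.properties instead of being placed on the classpath manually:

```properties
# Load the HDFS deep-storage module as an extension
# (version shown is a placeholder; match it to your Druid build)
druid.extensions.coordinates=["io.druid.extensions:druid-hdfs-storage:0.6.83"]
druid.storage.type=hdfs
```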

2. The job fails with the following error:
java.lang.IllegalArgumentException: Pathname /<path>/wikipedia/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2014-04-09T11:32:45.222Z/0 from hdfs://<name-node>:8020/<path>/wikipedia/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2014-04-09T11:32:45.222Z/0 is not a valid DFS filename.
        (stack trace identical to the one quoted above)


FWIW, we are working on better Druid support for different versions of Hadoop. It is a bit cumbersome right now.

Geeta Iyer

Apr 10, 2014, 6:57:51 AM
to druid-de...@googlegroups.com
Hi,

I am specifying druid-hdfs-storage in the classpath while starting the overlord.
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:/usr/lib/hadoop/client/*:/home/fptiuser/hdfs-storage/druid-hdfs-storage-0.6.83-SNAPSHOT.jar:/etc/hadoop/conf/*:config/overlord io.druid.cli.Main server overlord

The input data is being read from HDFS properly, but the problem happens when we try to write the segments to HDFS.

I am not able to understand why it is triggering a local job; I have put the hadoop conf on the classpath.
Also, I am not able to understand why it cannot construct a correct name for the HDFS segment output path.

I tried running the same thing using HadoopDruidIndexerJob from the same machine. I just modified the config to specify segmentOutputPath, etc.; everything else is the same.
This is the command I use to run the Hadoop Job directly.

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath /etc/hadoop/conf:lib/*:/usr/lib/hadoop/client/* io.druid.cli.Main index hadoop ~/druid_examples/wikipedia_hadoop_conf.json

This triggered the Hadoop job on the remote cluster.

So why would it trigger a Hadoop job on the remote cluster when I run this directly, but not when I go via the indexing service?

Geeta Iyer

Apr 10, 2014, 7:22:00 AM
to druid-de...@googlegroups.com
Attached the overlord console log...
overlord.log

Geeta Iyer

Apr 10, 2014, 7:32:18 AM
to druid-de...@googlegroups.com
Attached the overlord runtime properties...


runtime.properties

Fangjin Yang

Apr 11, 2014, 12:38:01 AM
to druid-de...@googlegroups.com
I believe this was resolved over IRC. Can we share what the final resolution was to close out the thread?

Nishant Bangarwa

Apr 11, 2014, 3:21:28 AM
to druid-de...@googlegroups.com
Correcting the classpath to include "/etc/hadoop/conf" instead of "/etc/hadoop/conf/*" worked.
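
For context: a Java classpath wildcard (`dir/*`) expands only to the .jar files in that directory, so `/etc/hadoop/conf/*` silently adds nothing (the directory contains XML files, not jars). Without core-site.xml and mapred-site.xml visible as classpath resources, Hadoop falls back to its defaults, including LocalJobRunner. A corrected launch command would look roughly like this (jar path and version taken from the earlier posts; adjust to your install):

```shell
# /etc/hadoop/conf (no trailing /*) puts the directory itself on the
# classpath, so the *-site.xml files are loadable as resources; the
# wildcard form only expands to .jar files and adds nothing here.
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -classpath lib/*:/usr/lib/hadoop/client/*:/home/fptiuser/hdfs-storage/druid-hdfs-storage-0.6.83-SNAPSHOT.jar:/etc/hadoop/conf:config/overlord \
  io.druid.cli.Main server overlord
```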


--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/31ace5ed-0bd7-474e-a8ff-024223c9786f%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.




tarun gulyani

Jul 27, 2014, 3:09:00 PM
to druid-de...@googlegroups.com
Hi Geeta,

I am also getting the same error. Can you please help me out? How did you resolve this?

Geeta Iyer

Jul 27, 2014, 11:17:39 PM
to druid-de...@googlegroups.com
I modified the classpath to include /etc/hadoop/conf instead of /etc/hadoop/conf/*.





--
Thanks,
Geeta

Deepak Jain

Jul 30, 2014, 5:57:27 AM
to druid-de...@googlegroups.com
This is how I run it, and it worked:

export DRUID_HOME=/home/hdfs/druid-services-0.6.109-SNAPSHOT
java -Xmx12g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath $DRUID_HOME/lib/*:/etc/hadoop/conf/:$DRUID_HOME/config/overlord io.druid.cli.Main server overlord