Not able to run hadoop index task on the remote hadoop cluster

Geeta Iyer

unread,

Apr 9, 2014, 7:51:02 AM4/9/14

to druid-de...@googlegroups.com

Hi,

I am trying the run the hadoop index task using the indexing service. My indexing service runs on one machine. I have a separate remote hadoop cluster.

As an example, I am trying to run the wikipedia example. I have loaded the input data wikipedia_data.json on my hdfs cluster.

I have specified output path for the segments as a hdfs location.

In my overlord's runtime configuration, I have specified druid.storage.type=hdfs and specified the druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>.

I have also specified druid.indexer.fork.property.druid.storage.type=hdfs and druid.indexer.fork.property.druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>.

I am running again hadoop 2.2. I have added the hadoop client jars, the hadoop conf files and druid-hdfs-storage jar in the class path, while starting the overlord. This is the command I use to start the overlord:

java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:/usr/lib/hadoop/client/*:/home/fptiuser/hdfs-storage/druid-hdfs-storage-0.6.83-SNAPSHOT.jar:/etc/hadoop/conf/*:config/overlord io.druid.cli.Main server overlord

In the wikipedia_index_hadoop_task.json, I have changed pathSpec to read the input data from hdfs. ("paths" : "hdfs://<namenode>:8020/<path>/wikipedia_data.json".

I have also specified "hadoopDependencyCoordinates" : ["org.apache.hadoop:hadoop-client:2.2.0"]

When I submit this task to the indexing service, I see two issues:

1. The MR job gets launched as LocalJobRunner (job_localxxxx). In there anything else that I need to configure, so that the job runs on the remote hadoop cluster?

2. The job fails with the following error:

java.lang.IllegalArgumentException: Pathname /<path>/wikipedia/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2014-04-09T11:32:45.222Z/0 from hdfs://<name-node>:8020/<path>/wikipedia/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2014-04-09T11:32:45.222Z/0 is not a valid DFS filename.

at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:184)

at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:92)

at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:817)

at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:813)

at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)

at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:813)

at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:806)

at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1933)

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.serializeOutIndex(IndexGeneratorJob.java:404)

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:384)

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:247)

at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)

at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:645)

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:405)

at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:445)

Fangjin Yang

unread,

Apr 9, 2014, 11:52:00 PM4/9/14

to druid-de...@googlegroups.com

Hi Geeta, see inline.

On Wednesday, April 9, 2014 4:51:02 AM UTC-7, Geeta Iyer wrote:

Hi,

I am trying the run the hadoop index task using the indexing service. My indexing service runs on one machine. I have a separate remote hadoop cluster.
As an example, I am trying to run the wikipedia example. I have loaded the input data wikipedia_data.json on my hdfs cluster.
I have specified output path for the segments as a hdfs location.

In my overlord's runtime configuration, I have specified druid.storage.type=hdfs and specified the druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>.
I have also specified druid.indexer.fork.property.druid.storage.type=hdfs and druid.indexer.fork.property.druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>.

I am running again hadoop 2.2. I have added the hadoop client jars, the hadoop conf files and druid-hdfs-storage jar in the class path, while starting the overlord. This is the command I use to start the overlord:

java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:/usr/lib/hadoop/client/*:/home/fptiuser/hdfs-storage/druid-hdfs-storage-0.6.83-SNAPSHOT.jar:/etc/hadoop/conf/*:config/overlord io.druid.cli.Main server overlord

In the wikipedia_index_hadoop_task.json, I have changed pathSpec to read the input data from hdfs. ("paths" : "hdfs://<namenode>:8020/<path>/wikipedia_data.json".
I have also specified "hadoopDependencyCoordinates" : ["org.apache.hadoop:hadoop-client:2.2.0"]

When I submit this task to the indexing service, I see two issues:
1. The MR job gets launched as LocalJobRunner (job_localxxxx). In there anything else that I need to configure, so that the job runs on the remote hadoop cluster?

Make sure you also specify that you are using the druid-hdfs-storage module. If Druid cannot find the module (either provided as an extension or on the classpath), it will default to using local storage.

2. The job fails with the following error:
java.lang.IllegalArgumentException: Pathname /<path>/wikipedia/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2014-04-09T11:32:45.222Z/0 from hdfs://<name-node>:8020/<path>/wikipedia/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2014-04-09T11:32:45.222Z/0 is not a valid DFS filename.
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:184)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:92)
at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:817)
at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:813)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:813)
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:806)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1933)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.serializeOutIndex(IndexGeneratorJob.java:404)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:384)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:247)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:645)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:405)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:445)

FWIW, we are working on better Druid support for different versions of Hadoop. It is a bit cumbersome right now.

Geeta Iyer

unread,

Apr 10, 2014, 6:57:51 AM4/10/14

to druid-de...@googlegroups.com

Hi,

I am specifying druid-hdfs-storage in the classpath while starting the overlord.

java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:/usr/lib/hadoop/client/*:/home/fptiuser/hdfs-storage/druid-hdfs-storage-0.6.83-SNAPSHOT.jar:/etc/hadoop/conf/*:config/overlord io.druid.cli.Main server overlord

The input data is being read from hdfs properly. But when we try to write the segment to hdfs that's when the problem happens.

I am not able to understand why it is triggering a local job. I have put the hadoop conf in the path.

Also, I am not able to understand why it is not able to construct a correct name for hdfs output path segments.

I tried running the same thing by using the HadoopDruidIndexerJob from the same machine. I just modified the config to specify segmentOutputPath, etc. Rest everything is same.

This is the command I use to run the Hadoop Job directly.

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath /etc/hadoop/conf:lib/*:/usr/lib/hadoop/client/* io.druid.cli.Main index hadoop ~/druid_examples/wikipedia_hadoop_conf.json

This trigged the hadoop job in the remote cluster.

So why would it trigger a hadoop job on remote cluster when I run this, but not when I go via the indexer service?

Geeta Iyer

unread,

Apr 10, 2014, 7:22:00 AM4/10/14

to druid-de...@googlegroups.com

Attached the overlord console log...

overlord.log

Geeta Iyer

unread,

Apr 10, 2014, 7:32:18 AM4/10/14

to druid-de...@googlegroups.com

Attached the overlord runtime properties...

On Wednesday, April 9, 2014 5:21:02 PM UTC+5:30, Geeta Iyer wrote:

runtime.properties

Fangjin Yang

unread,

Apr 11, 2014, 12:38:01 AM4/11/14

to druid-de...@googlegroups.com

I believe this was resolved over IRC. Can we share what the final resolution was to close out the thread?

Nishant Bangarwa

unread,

Apr 11, 2014, 3:21:28 AM4/11/14

to druid-de...@googlegroups.com

correcting the classpath to include "/etc/hadoop/conf" instead of "/etc/hadoop/conf/*" worked.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/31ace5ed-0bd7-474e-a8ff-024223c9786f%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--

Nishant

Software Engineer

|

METAMARKETS

m	+91-9729200044

nishant....@metamarkets.com

tarun gulyani

unread,

Jul 27, 2014, 3:09:00 PM7/27/14

to druid-de...@googlegroups.com

Hi Geeta,

I am also getting same error. Can you please help me out, how did you resolved this?

Geeta Iyer

unread,

Jul 27, 2014, 11:17:39 PM7/27/14

to druid-de...@googlegroups.com

I modified the classpath to include /etc/hadoop/conf instead of /etc/hadoop/conf/*

--
You received this message because you are subscribed to a topic in the Google Groups "Druid Development" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/druid-development/50_nMe39QlQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to druid-developm...@googlegroups.com.

To post to this group, send email to druid-de...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/e5db8f63-a949-456b-bd2e-fa547fcce01f%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
Thanks,
Geeta

Deepak Jain

unread,

Jul 30, 2014, 5:57:27 AM7/30/14

to druid-de...@googlegroups.com

This is how i run and it worked.

export DRUID_HOME=/home/hdfs/druid-services-0.6.109-SNAPSHOT

java -Xmx12g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath $DRUID_HOME/lib/*:/etc/hadoop/conf/:$DRUID_HOME/config/overlord io.druid.cli.Main server overlord

Reply all

Reply to author

Forward