Hi,
I am trying to run the Hadoop index task using the indexing service. My indexing service runs on one machine, and I have a separate remote Hadoop cluster.
As an example, I am trying to run the wikipedia example. I have loaded the input data wikipedia_data.json onto my HDFS cluster, and I have specified an HDFS location as the output path for the segments.
In my overlord's runtime configuration, I have set druid.storage.type=hdfs and druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>.
I have also set druid.indexer.fork.property.druid.storage.type=hdfs and druid.indexer.fork.property.druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>.
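Putting those together, the deep-storage section of my overlord's runtime.properties looks like this (namenode and path elided as above):

```properties
# Deep storage for the overlord itself
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>

# Forwarded to the forked peon processes that run the tasks
druid.indexer.fork.property.druid.storage.type=hdfs
druid.indexer.fork.property.druid.storage.storageDirectory=hdfs://<namenode>:8020/<path>
```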
I am running against Hadoop 2.2. I have added the Hadoop client jars, the Hadoop conf files, and the druid-hdfs-storage jar to the classpath when starting the overlord. This is the command I use to start the overlord:
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -classpath lib/*:/usr/lib/hadoop/client/*:/home/fptiuser/hdfs-storage/druid-hdfs-storage-0.6.83-SNAPSHOT.jar:/etc/hadoop/conf/*:config/overlord \
  io.druid.cli.Main server overlord
In wikipedia_index_hadoop_task.json, I have changed the pathSpec to read the input data from HDFS ("paths" : "hdfs://<namenode>:8020/<path>/wikipedia_data.json").
I have also specified "hadoopDependencyCoordinates" : ["org.apache.hadoop:hadoop-client:2.2.0"].
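For reference, the relevant part of my task spec looks roughly like this (other fields omitted, layout simplified from memory):

```json
{
  "type" : "index_hadoop",
  "pathSpec" : {
    "type" : "static",
    "paths" : "hdfs://<namenode>:8020/<path>/wikipedia_data.json"
  },
  "hadoopDependencyCoordinates" : ["org.apache.hadoop:hadoop-client:2.2.0"]
}
```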
When I submit this task to the indexing service, I see two issues:
1. The MR job gets launched via the LocalJobRunner (job_localxxxx). Is there anything else that I need to configure so that the job runs on the remote Hadoop cluster?
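My assumption was that having the cluster's Hadoop conf files on the classpath would be enough to route jobs to the remote cluster; in particular, I believe the cluster's mapred-site.xml contains the usual setting (paraphrased from memory):

```xml
<configuration>
  <!-- Should send jobs to the YARN cluster instead of the LocalJobRunner -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

If that setting is not being picked up by the peon process, that would be consistent with the job_localxxxx IDs I am seeing.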
2. The job fails with the following error:
java.lang.IllegalArgumentException: Pathname /<path>/wikipedia/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2014-04-09T11:32:45.222Z/0 from hdfs://<name-node>:8020/<path>/wikipedia/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2014-04-09T11:32:45.222Z/0 is not a valid DFS filename.
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:184)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:92)
at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:817)
at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:813)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:813)
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:806)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1933)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.serializeOutIndex(IndexGeneratorJob.java:404)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:384)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:247)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:645)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:405)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:445)