Can't get hadoop index task to work

Zach Cox

unread,

Apr 10, 2015, 1:31:54 PM4/10/15

to druid...@googlegroups.com

Hi - I'm trying to submit a hadoop index task to Overlord and get it to run on a remote Hadoop cluster. Currently I'm doing this all locally in Docker containers as a PoC [1]. So all of the Druid nodes run in separate Docker containers, and Hadoop (HDFS and Yarn) also run in a Docker container [2].

I can submit the task [3] to the Overlord, and it seems to successfully get the data from HDFS, index it and create a segment [4].

However there are two problems:

1) the job is not run on the remote hadoop cluster; it uses LocalJobRunner

2) the job is unable to write the segment to HDFS: a "Pathname ... is not a valid DFS filename" exception gets thrown [4]

I must not be setting things up properly, but can't find anything else to try. I would hugely appreciate any pointers in the right direction.

Thanks,

Zach

[1] https://github.com/Banno/druid-docker/tree/install-hadoop

[2] https://registry.hub.docker.com/u/sequenceiq/hadoop-docker/

[3] https://github.com/zcox/druid-pageviews/blob/master/task5-hadoop.json

[4] https://gist.githubusercontent.com/zcox/bc2861977509cb111683/raw/56b5f0cd2b0f8fcc812393886650b39b54fb33a1/gistfile1.txt

slim bouguerra

unread,

Apr 10, 2015, 1:59:25 PM4/10/15

to Zach Cox, druid...@googlegroups.com

Hi Zach

i am trying to understand your setup.

So when you say a remote Hadoop cluster what does remote means ?

You might know that druid expect to load the configuration about the Hadoop cluster out of the class path, are you providing that ?

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/CANjo42wfmq%3D2eZvBrf9x8ouWMX65fFrwKFdVgdXhQkvxibdc0Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Prajwal Tuladhar

unread,

Apr 10, 2015, 2:21:30 PM4/10/15

to slim bouguerra, Zach Cox, druid...@googlegroups.com

Hi Zach,

I had similar error but after upgrading to 0.7.1, it went away. Can't be sure but fix was probably in the PR [1].

[1] https://github.com/druid-io/druid/pull/1121

To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/F77660E6-42E2-47E9-8079-A8114E8D24EB%40gmail.com.

For more options, visit https://groups.google.com/d/optout.

--

Cheers,

Praj

Zach Cox

unread,

Apr 10, 2015, 3:59:30 PM4/10/15

to Prajwal Tuladhar, slim bouguerra, druid...@googlegroups.com

Hi Slim & Prajwal - I upgraded to Druid 0.7.1.1 and now my problem #2 is solved. The hadoop index job properly writes the segment to HDFS, and the historical node picks it up and serves it. Queries are answered correctly. You can see the index task output [1].

However, I still have problem #1, as you can see in [1]. The job still uses LocalJobRunner, and the job does not run on the "remote" Hadoop cluster. I think I do not have the Hadoop config files set up properly on the Overlord node. I'm including a mapred-site.xml on the Overlord [2] and it just has:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop:8032</value>
  </property>
</configuration>

That file gets placed in /opt/hadoop/etc/hadoop on the Overlord node, and that dir is included in the Overlord's classpath [3].

Because fig uses Docker links between these containers, Docker writes a line like this in the Overlord's /etc/hosts:

172.17.0.159 hadoop

That IP allows the Overlord node to communicate with the Hadoop node.

The hadoop image I'm using [4] ends up running Yarn, not MapReduce (i.e. no job tracker). The Yarn resource manager listens on 8032:

<name>yarn.resourcemanager.address</name>

<source>programatically</source>

</property>

I am not a Hadoop expert, and just kinda took a guess that's what I should use in the Overlord's mapred-site.xml file. I'm guessing I just don't have that, or some other file hadoop conf xml file, containing the right settings. Any idea what I should try next?

Thanks,

Zach

[1] https://gist.githubusercontent.com/zcox/01e8abfe0922b49ea55a/raw/a40658e2a6f59f06dfa241b00d42d9a8ba3616b2/gistfile1.txt

[2] https://github.com/Banno/druid-docker/blob/install-hadoop/hadoop-base/mapred-site.xml

[3] https://github.com/Banno/druid-docker/blob/install-hadoop/hadoop-base/Dockerfile#L19

[4] https://registry.hub.docker.com/u/sequenceiq/hadoop-docker/

Zach Cox

unread,

Apr 10, 2015, 6:30:13 PM4/10/15

to Prajwal Tuladhar, slim bouguerra, druid...@googlegroups.com

Well with this commit [1] with more settings in mapred-site.xml and yarn-site.xml on Overlord, the job now does get submitted to the remote hadoop, however the job fails [2]. I'm not sure if it fails due to a Druid problem or a Hadoop problem - will continue to debug and report progress.

[1] https://github.com/Banno/druid-docker/commit/05c808c8eefe2e500b7fd139fc3a64601b9148cc

[2] https://gist.githubusercontent.com/zcox/37245d41825633c9b8b4/raw/b651a1d00ec55aa5065182f5dd556726eecb892c/gistfile1.txt

Gian Merlino

unread,

Apr 10, 2015, 9:55:03 PM4/10/15

to druid...@googlegroups.com, pr...@infynyxx.com, slim.bo...@gmail.com

Hi Zach,

The error "File file:/tmp/hadoop-yarn/staging/root/.staging/job_1428703369275_0003/job.xml does not exist" makes me think that something is not wired up right with the hadoop filesystems. That path should be on HDFS. What's your core-site.xml fs.defaultFS set to? Can you try setting it to your HDFS NameNode if it's not already?

Zach Cox

unread,

Apr 11, 2015, 9:57:30 AM4/11/15

to Gian Merlino, druid...@googlegroups.com, Prajwal Tuladhar, Slim Bouguerra

Hi Gian - core-site.xml in this Hadoop Docker container [1] contains only this:

<name>fs.defaultFS</name>

</property>

</configuration>

cbb067ae55a4 is the Docker container's hostname, and /etc/hosts maps it to the container's IP.

bash-4.1# hostname

cbb067ae55a4

bash-4.1# cat /etc/hosts

172.17.0.193 cbb067ae55a4

127.0.0.1 localhost

::1 localhost ip6-localhost ip6-loopback

fe00::0 ip6-localnet

ff00::0 ip6-mcastprefix

ff02::1 ip6-allnodes

ff02::2 ip6-allrouters

I also tried upgrading from Hadoop 2.4.1 to 2.6.0 and get the same error [2].

[1] https://github.com/sequenceiq/hadoop-docker/blob/master/core-site.xml.template

[2] https://gist.githubusercontent.com/zcox/44e35eb369c26491fe8f/raw/f593122acdf468119a2737c4f3f894135508922c/gistfile1.txt

To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/e0b90dbd-4c85-498c-9984-4fa5102802f6%40googlegroups.com.

Gian Merlino

unread,

Apr 11, 2015, 11:09:35 AM4/11/15

to druid...@googlegroups.com, gi...@metamarkets.com, pr...@infynyxx.com, slim.bo...@gmail.com

Is that core-site.xml (and all the other xmls) also on the classpath of your druid middleManager?

Zach Cox

unread,

Apr 11, 2015, 11:25:44 AM4/11/15

to Gian Merlino, druid...@googlegroups.com, Prajwal Tuladhar, Slim Bouguerra

Hi Gian - sorry, I was confused earlier. Now I have this core-site.xml on Overlord [1]

<name>fs.default.name</name>

<value>hdfs://hadoop:9000</value>

</property>

<name>fs.defaultFS</name>

<value>hdfs://hadoop:9000</value>

</property>

</configuration>

The job now fails in a different way [2]:

Diagnostics: java.net.UnknownHostException: hadoop

The Overlord container does have an entry in /etc/hosts for hadoop:

[ root@fcb4f04c9d05:/data ]$ cat /etc/hosts

172.17.0.7 fcb4f04c9d05

127.0.0.1 localhost

::1 localhost ip6-localhost ip6-loopback

fe00::0 ip6-localnet

ff00::0 ip6-mcastprefix

ff02::1 ip6-allnodes

ff02::2 ip6-allrouters

172.17.0.6 hadoop_1

172.17.0.2 postgres

172.17.0.2 postgres_1

172.17.0.3 druiddocker_zookeeper_1

172.17.0.6 hadoop

172.17.0.3 zookeeper

172.17.0.3 zookeeper_1

172.17.0.6 druiddocker_hadoop_1

172.17.0.2 druiddocker_postgres_1

However, that doesn't exist in /etc/hosts in the Hadoop container itself:

bash-4.1# cat /etc/hosts

172.17.0.6 0eed1244556e

127.0.0.1 localhost

::1 localhost ip6-localhost ip6-loopback

fe00::0 ip6-localnet

ff00::0 ip6-mcastprefix

ff02::1 ip6-allnodes

ff02::2 ip6-allrouters

I'm guessing something related to hdfs://hadoop:9000 gets put into the job in the Overlord container, but then when the job actually runs in the Hadoop container that hadoop host name is unresolvable. I'll try to figure out a better way for these hostnames to get resolved, and report back.

Thanks,

Zach

[1] https://github.com/Banno/druid-docker/blob/install-hadoop/hadoop-base/core-site.xml

[2] https://gist.githubusercontent.com/zcox/fca6a58e3736ed97d3c2/raw/294e7242cf2dff36fe73bd6f42ccd1821f9eb93e/gistfile1.txt

To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/780ca729-927a-4fcd-9a52-19b0a78a1f6e%40googlegroups.com.

Zach Cox

unread,

Apr 11, 2015, 12:15:46 PM4/11/15

to Gian Merlino, druid...@googlegroups.com, Prajwal Tuladhar, Slim Bouguerra

I changed some things to use resolvable host names (e.g. hadoop.dev.banno.com) [1] and now the job seems to get farther but still fails, see druid task output [2] and yarn syslog [3]. The yarn error boils down to:

2015-04-11 12:04:52,192 ERROR [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.lang.NoSuchMethodError: org.apache.hadoop.ipc.RPC.getProtocolProxy(Ljava/lang/Class;JLjava/net/InetSocketAddress;Lorg/apache/hadoop/security/UserGroupInformation;Lorg/apache/hadoop/conf/Configuration;Ljavax/net/SocketFactory;ILorg/apache/hadoop/io/retry/RetryPolicy;Ljava/util/concurrent/atomic/AtomicBoolean;)Lorg/apache/hadoop/ipc/ProtocolProxy;

I have no idea what that means, will have to dig into it.

[1] https://github.com/Banno/druid-docker/commit/9429f42b9614cbdf8536e2e60f91541481ae1b04

[2] https://gist.githubusercontent.com/zcox/7b70f447a99bd7e9aded/raw/124b4d87e8d7aa67313ed6205380e89264327ff6/gistfile1.txt

[3] https://gist.githubusercontent.com/zcox/0d72c2dd45d23842b2a1/raw/618ff09f43ed7f2df87c97b93703516912f8897d/gistfile1.txt

Zach Cox

unread,

Apr 11, 2015, 1:24:08 PM4/11/15

to Gian Merlino, druid...@googlegroups.com, Prajwal Tuladhar, Slim Bouguerra

When googling that error message, I got a few hints about issues with different hadoop versions on the classpath.

I switched to Hadoop 2.4.1 [1] [2] and now the yarn error message is slightly different, but pretty much the same:

2015-04-11 13:10:40,853 ERROR [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.lang.NoSuchMethodError: org.apache.hadoop.http.HttpConfig.isSecure()Z
	at org.apache.hadoop.yarn.conf.YarnConfiguration.<clinit>(YarnConfiguration.java:322)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1374)
2015-04-11 13:10:40,869 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting with status 1

If I examine the job's classpath deps in hdfs it does seem that both hadoop 2.4.1 and 2.3.0 jars were uploaded there. Maybe that's the problem? I'm clearly specifying "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.4.1"] in the hadoop index task. Why would both 2.4.1 and 2.3.0 jars be uploaded there?

[1] https://github.com/Banno/druid-docker/commit/ef0e9096f7c6a222aa04d223c87d79eae4fd26ae

[2] https://github.com/zcox/druid-pageviews/commit/df794d3c3780e38b478d4bb118b2e72cbdd08aa6

[3] https://gist.githubusercontent.com/zcox/56c77860fb75bb844f80/raw/a3e550d95155d84fb7365a81a69dd9fb368f1f74/gistfile1.txt

Zach Cox

unread,

Apr 11, 2015, 1:43:59 PM4/11/15

to Gian Merlino, druid...@googlegroups.com, Prajwal Tuladhar, Slim Bouguerra

I also put this in common.runtime.properties:

druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:2.4.1"]

However there are still both hadoop 2.4.1 and 2.3.0 jars in /druid-working/classpath in hdfs.

How do I get just the hadoop 2.4.1 jars uploaded to hdfs?

Gian Merlino

unread,

Apr 12, 2015, 1:31:50 PM4/12/15

to druid...@googlegroups.com, gi...@metamarkets.com, pr...@infynyxx.com, slim.bo...@gmail.com

You might need to compile a version of Druid based against hadoop 2.4.1 instead of 2.3.0.

Zach Cox

unread,

Apr 12, 2015, 3:06:12 PM4/12/15

to Gian Merlino, druid...@googlegroups.com, Prajwal Tuladhar, Slim Bouguerra

I had a hunch that the hadoop 2.3.0 jars were being included by the druid-hdfs-storage extension, so I removed that from druid.extensions.coordinates. Now the job seems to get farther but still fails. See task output from overlord [1] and job syslog output [2]. The exception in [2] is this:

2015-04-12 14:51:25,093 ERROR [IPC Server handler 2 on 35430] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1428864489761_0001_m_000000_0 - exited : com.metamx.common.RE: Failure on row[{"eventId":"e1", "timestamp":"2015-03-24T14:00:00Z", "userId":"u1", "url":"http://site.com/1"}]
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:98)
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:44)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.NullPointerException
	at io.druid.indexer.HadoopDruidIndexerConfig.getBucket(HadoopDruidIndexerConfig.java:350)
	at io.druid.indexer.IndexGeneratorJob$IndexGeneratorMapper.innerMap(IndexGeneratorJob.java:231)
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:94)
	... 9 more

Not sure what the problem is though, this exact same data set gets indexed just fine by a non-hadoop index task [3]. For reference, [4] is the hadoop index task I'm using, and [5] is the data set. Am I specifying something incorrectly in the hadoop index task spec json?

[1] https://gist.githubusercontent.com/zcox/b6eddaea76343baebad9/raw/a1cdfda0abaaa65d9a245df39fce7ed17a778946/gistfile1.txt

[2] https://gist.githubusercontent.com/zcox/0fc76242702fb803c0c1/raw/d3c64a1bde1f50e377bedbc4027e1de90b6652f6/gistfile1.txt

[3] https://github.com/zcox/druid-pageviews/blob/master/task5.json

[4] https://github.com/zcox/druid-pageviews/blob/master/task5-hadoop.json

[5] https://github.com/zcox/druid-pageviews/blob/master/events.json

To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/93560a4f-6fc5-4232-9588-84bf2880bcce%40googlegroups.com.

Gian Merlino

unread,

Apr 13, 2015, 2:52:01 PM4/13/15

to druid...@googlegroups.com, gi...@metamarkets.com, pr...@infynyxx.com, slim.bo...@gmail.com

Nothing looks wrong with your configs at first glance. This looks sort of like a time bucketing problem- are all of your services running in the same time zone?

Also, do you have logs from one of the mappers that failed? You should be able to get them from the hadoop web ui. There might be something interesting in there.

Zach Cox

unread,

Apr 20, 2015, 1:15:44 PM4/20/15

to Gian Merlino, druid...@googlegroups.com, pr...@infynyxx.com, slim.bo...@gmail.com

Hi Gian,

I'm starting all Druid nodes with -Duser.timezone=UTC [1].

I can't get to any mapper-specific logs via yarn - I can only seem to get the container's syslog output [2].

The exception seems to occur at this line: [3]

final ShardSpec actualSpec = shardSpecLookups.get(timeBucket.get().getStart()).getShardSpec(rollupGran.truncate(inputRow.getTimestampFromEpoch()), inputRow);

Specifically, shardSpecLookups.get(timeBucket.get().getStart()) appears to return null. In the task logs [4] the index spec has this in tuningConfig:

      "shardSpecs" : {
        "2015-03-24T00:00:00.000Z" : [ {
          "actualSpec" : {
            "type" : "none"
          },
          "shardNum" : 0
        } ]
      },

Any ideas why that shardSpecLookups.get() call would return null?

Thanks,

Zach

[1] https://github.com/Banno/druid-docker/blob/install-hadoop/base/start-node.sh#L36

[2] https://gist.githubusercontent.com/zcox/0fc76242702fb803c0c1/raw/d3c64a1bde1f50e377bedbc4027e1de90b6652f6/gistfile1.txt

[3] https://github.com/druid-io/druid/blob/druid-0.7.1.1/indexing-hadoop/src/main/java/io/druid/indexer/HadoopDruidIndexerConfig.java#L350

[4] https://gist.githubusercontent.com/zcox/b6eddaea76343baebad9/raw/a1cdfda0abaaa65d9a245df39fce7ed17a778946/gistfile1.txt

To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/baa1437c-16f6-4137-96c4-fb681dbc7a11%40googlegroups.com.

Zach Cox

unread,

Apr 22, 2015, 3:16:21 PM4/22/15

to Gian Merlino, druid...@googlegroups.com, pr...@infynyxx.com, slim.bo...@gmail.com

I finally got the Hadoop batch index task to work - the problem was that the Hadoop Docker container I was using is based on CentOS which appears to use Eastern time zone by default. After forcing it to use UTC then the Druid mapreduce job ran successfully, created the segment and put it in hdfs.

I also had to use Hadoop 2.3.0. If I use any other version of Hadoop, then the dependency jars for both that version and 2.3.0 are used in the mapreduce job, which causes problems. I think the druid-hdfs-storage extension is always pulling in 2.3.0. Will look into that next.

Taylor Jones

unread,

Apr 22, 2015, 8:27:59 PM4/22/15

to druid...@googlegroups.com, slim.bo...@gmail.com, pr...@infynyxx.com, gi...@metamarkets.com

I just ran into this exact issue. The Hadoop job is labeling the segments put in the working directory with a timezone suffix and it later on looks for the same segments without the timezone suffix. I was able to get the indexer to complete successfully by omitting the -Duser.timezone=UTC bit in the, but I would prefer to standardize on UTC time. Is there a way to force the Hadoop job to use UTC, or did you have to actually update the host OS on all the datanodes?

Eric Tschetter

unread,

Apr 22, 2015, 9:14:35 PM4/22/15

to Taylor Jones, druid...@googlegroups.com, slim.bo...@gmail.com, pr...@infynyxx.com, Gian Merlino

Taylor, you should be able to add -Duser.timezone=UTC in the
java.child.opts, which *should* make the child JVMs on Hadoop use the
UTC timezone.

Or, is that what you said you removed to get things to work?

--Eric

> https://groups.google.com/d/msgid/druid-user/e420d70a-dc6e-45f4-b352-b4b7e50a42b3%40googlegroups.com.

Zach Cox

unread,

Apr 22, 2015, 9:44:00 PM4/22/15

to Taylor Jones, druid...@googlegroups.com, slim.bo...@gmail.com, pr...@infynyxx.com, gi...@metamarkets.com

Hi Taylor,

I had to use UTC everywhere: on every single Druid node using -Duser.timezone=UTC and also on every Hadoop node. Since the Hadoop I'm using is based on CentOS I had to do that at the OS level, using:

rm /etc/localtime

ln -s /usr/share/zoneinfo/UTC /etc/localtime

To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/e420d70a-dc6e-45f4-b352-b4b7e50a42b3%40googlegroups.com.

Taylor Jones

unread,

Apr 23, 2015, 9:48:28 AM4/23/15

to druid...@googlegroups.com, pr...@infynyxx.com, slim.bo...@gmail.com, monit...@gmail.com, gi...@metamarkets.com

Oh, I'm sorry you had to do that. I actually found a much easier way of doing it, and it's actually listed on the Druid site here: http://druid.io/docs/0.7.1.1/Hadoop-Configuration.html. These lines are the magic words, they let you specify Java opts to the map/reduce JVMs (including timezone instructions):

  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-server -Xmx1536m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-server -Xmx2560m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
  </property>

...

Zach Cox

unread,

Apr 23, 2015, 10:06:30 AM4/23/15

to Taylor Jones, druid...@googlegroups.com, pr...@infynyxx.com, slim.bo...@gmail.com, gi...@metamarkets.com

Nice! I didn't realize you could specify that in mapred-site.xml. I still think it's a good idea to always use UTC at the OS level, but those map/reduce jvm settings also seem like a good idea.

To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/1f63a169-d80d-4052-9287-aed440ccc430%40googlegroups.com.

Reply all

Reply to author

Forward