How to set Hadoop tmp directory


Wayne Adams

Jan 9, 2014, 6:06:59 PM1/9/14
to druid-de...@googlegroups.com
Hi, all:

  I'm running a 0.6.39 Indexing Service Hadoop task and it is failing with:

org.apache.hadoop.util.DiskChecker$DiskErrorException: 
Could not find any valid local directory for output/spill0.out

/tmp is full.  The partition isn't big enough to handle the output even if I clean it out.  I have set java.io.tmpdir to a different directory for all my Druid processes.  I have also tried (on the Middle Manager)

-Dhadoop.mapred.child.java.opts=-Djava.io.tmpdir=/data1/tmp/hadoop

and

-Dhadoop.mapred.child.java.opts=-Dhadoop.tmp.dir=/data1/tmp/hadoop
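
One thing worth ruling out: when the value of a -D flag itself contains -D flags, the inner string usually needs quoting so it reaches the child JVM as a single value rather than being split on whitespace. A sketch using the paths above (whether the middle manager forwards hadoop.mapred.child.java.opts to spawned Hadoop children at all is an assumption to verify):

```shell
# Quote the nested flags so they travel together as the single value of
# hadoop.mapred.child.java.opts.
-Dhadoop.mapred.child.java.opts="-Djava.io.tmpdir=/data1/tmp/hadoop -Dhadoop.tmp.dir=/data1/tmp/hadoop"
```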

and /tmp still fills up.  I'm running as user "ec2-user", and whatever process is creating these tmp files appears to be appending the user id to the string "hadoop-".  Here is a sample of the contents of /tmp:

[ec2-user@ip-10-9-177-94 tmp]$ du *
4    druid
4    druid-indexing/am_extract_H00
4    druid-indexing/am_extract
4    druid-indexing/am_extract_istest_100000
16    druid-indexing
35648    hadoop-ec2-user/mapred/local/taskTracker/ec2-user/jobcache/job_local_0001/attempt_local_0001_m_000006_0/output
35652    hadoop-ec2-user/mapred/local/taskTracker/ec2-user/jobcache/job_local_0001/attempt_local_0001_m_000006_0
35648    hadoop-ec2-user/mapred/local/taskTracker/ec2-user/jobcache/job_local_0001/attempt_local_0001_m_000014_0/output
35652    hadoop-ec2-user/mapred/local/taskTracker/ec2-user/jobcache/job_local_0001/attempt_local_0001_m_000014_0
35648    hadoop-ec2-user/mapred/local/taskTracker/ec2-user/jobcache/job_local_0001/attempt_local_0001_m_000000_0/output

and so on.  Virtually all of the disk space used in /tmp is under /tmp/hadoop-ec2-user.
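
To rank which subtrees are actually eating /tmp without eyeballing raw du output, a small standalone script along these lines can help (a generic diagnostic sketch, not part of Druid or Hadoop):

```python
import os

def dir_sizes(root):
    """Return (subdir, total_bytes) pairs for root's immediate
    subdirectories, largest first."""
    results = []
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    pass  # file vanished mid-scan (common under /tmp)
        results.append((entry.name, total))
    return sorted(results, key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    for name, size in dir_sizes("/tmp"):
        print(f"{size:>12}  {name}")
```

Against the /tmp shown above, hadoop-ec2-user would land at the top of the list.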

I don't believe workingPath is an option here: this is the Indexing Service, and the docs say workingPath is one of the properties that must not be set in the Hadoop config file, since it is managed internally by the Indexing Service.  Perhaps that means there's a property that can be set on either the overlord or the middle manager -- does anyone know?
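
For what it's worth, later Druid releases accept a jobProperties map in the Hadoop task's tuningConfig for exactly this kind of per-job Hadoop setting; whether 0.6.39 supports it is something to verify against that version's docs. A sketch of the relevant fragment of an "index_hadoop" task spec:

```json
{
  "type": "index_hadoop",
  "tuningConfig": {
    "jobProperties": {
      "hadoop.tmp.dir": "/data1/tmp/hadoop",
      "mapred.local.dir": "/data1/tmp/hadoop/mapred/local"
    }
  }
}
```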

Thanks -- Wayne

Fangjin Yang

Jan 10, 2014, 1:16:14 AM1/10/14
to druid-de...@googlegroups.com
Hi Wayne,

Have you tried setting hadoop.tmp.dir in your Hadoop configs as opposed to the indexing service node configs? FWIW, I'd recommend running the index task if you are indexing things locally, and the Hadoop task when your data volume requires it. Local Hadoop indexing should still be fine, but using a real Hadoop cluster would write intermediate files to HDFS instead of your local disk.
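
With a standalone Hadoop install, the usual home for this setting is core-site.xml (with mapred.local.dir in mapred-site.xml, if needed); a sketch, assuming a classic Hadoop 1.x-style config directory:

```xml
<!-- core-site.xml: base directory for Hadoop's temporary storage -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data1/tmp/hadoop</value>
  </property>
</configuration>
```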

-- FJ

Wayne Adams

Jan 10, 2014, 4:23:52 PM1/10/14
to druid-de...@googlegroups.com
Hi Fangjin:

  This is related to my other post about tracing an ingest, of course, but...

  It appears that under the Indexing Service I can use either what I call the "simple" index task ("index") or the Hadoop index task ("index_hadoop"), and that I should use the simple one if I'm running locally.  Up until a few minutes ago I had been using "index_hadoop".  I don't understand what you mean about setting hadoop.tmp.dir in my Hadoop configs -- I don't have a Hadoop install anywhere; I've been relying on Druid to start a local Hadoop instance.  So I still don't know how to set hadoop.tmp.dir.  I've tried ensuring it will be passed to the Peon by setting

     -Ddruid.indexer.fork.property.hadoop.tmp.dir=/data1/tmp/hadoop

in my overlord properties, but that property is clearly not being used.
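
A note on the mechanism: druid.indexer.fork.property.* strips the prefix and passes the remainder to the spawned peon as a system property, so it only helps for properties Druid itself reads -- Hadoop takes hadoop.tmp.dir from its config files, not directly from a JVM system property. Getting a JVM-level option like java.io.tmpdir into the peon would instead go through the middle manager's peon JVM options; in later Druid docs that is druid.indexer.runner.javaOpts, though whether 0.6.x has the same knob is an assumption to verify:

```properties
# Middle manager runtime.properties (property name from later Druid docs;
# verify it exists in 0.6.x before relying on it):
druid.indexer.runner.javaOpts=-server -Djava.io.tmpdir=/data1/tmp/hadoop
```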

  When I run the "index_hadoop" task under the Indexing Service, any file greater than a minimal size causes the job to crash because /tmp fills up.

  I just tried running the simple "index" job under the Indexing Service, and it crashed for the same reason.  The only operational/troubleshooting difference between the two attempts: with the "index" job I can briefly see the "no space left on device" message in the log if I keep refreshing the Overlord console in the browser, but when an "index" job fails, the log gets deleted at the end and you get a 404 if you try to view it from the Overlord console.

  But in either case I can't seem to stop indexing temp files from going to /tmp.  I've set java.io.tmpdir for all my Druid processes.  I think setting hadoop.tmp.dir may have suppressed some of the output files going to /tmp, but it's the "persistent" directory that is causing the problem now, i.e. /tmp/persistent/.

<later, that same day...>

  Further info -- it appears some property names have changed, and it isn't clear to me whether this property needs to be set on the overlord, on the middle manager, or set to propagate to child processes, but on both of those nodes I now have the following:

druid.indexer.taskDir=/data1/tmp/persistent
druid.indexer.runner.taskDir=/data1/tmp/persistent
druid.indexer.fork.property.druid.indexer.taskDir=/data1/tmp/persistent
druid.indexer.fork.property.druid.indexer.runner.taskDir=/data1/tmp/persistent

This has gotten rid of almost all of the writes to /tmp/persistent, but I am still seeing the following, which eventually causes my process to crash from lack of space:

2014-01-10 21:17:31,161 INFO [task-runner-0] io.druid.indexing.common.index.YeOldePlumberSchool - Spilling index[1] with rows[500000] to: /tmp/persistent/task/index_am_extract_2aug2013_simple_index_2014-01-10T21:07:34.055-05:00

Do you know what property I would need to set to redirect this output?
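
For reference, the /tmp/persistent/task path matches what later Druid docs describe as the peon task directory, controlled by druid.indexer.task.baseDir and druid.indexer.task.baseTaskDir; the 0.6.x names may differ, so treat these as assumptions to verify against the docs for your version:

```properties
# Peon-side task storage (property names from later Druid docs):
druid.indexer.task.baseDir=/data1/tmp
druid.indexer.task.baseTaskDir=/data1/tmp/persistent/task
# Forwarded from the middle manager to spawned peons:
druid.indexer.fork.property.druid.indexer.task.baseTaskDir=/data1/tmp/persistent/task
```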

Thanks much -- Wayne