Hadoop disk usage (mapred/local cache cleanup) issue

Allen JB

Jun 23, 2016, 5:49:22 AM

to Druid User

Hi,

I have a single-node pseudo-cluster set up using Imply.io (1.2.1, installed manually), Hadoop (2.7.1, installed using the Apache BigTop repo) and MySQL for metadata on CentOS 7.2. I noticed that disk usage is growing faster than I would expect.

Investigating, I found a large amount of disk usage under /var/lib/hadoop-hdfs/cache/imply/mapred/local in the form of many directories whose names appear to be timestamps (e.g. '1463738866867', '1463738866870').

"ls -1a | wc -l" returns 1,499,549

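Because `ls -1a` sorts its output and also counts `.` and `..`, a `find`-based count may be quicker and slightly more accurate on a directory this large. The sketch below also decodes a directory name on the assumption that it is a Unix timestamp in milliseconds; that is a guess from the names, not something confirmed in this thread. `count_subdirs` and `name_to_utc` are hypothetical helper names.

```shell
# Count immediate subdirectories without sorting them first.
# Unlike "ls -1a", this does not include "." and "..".
count_subdirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Assumption: the directory name is epoch time in milliseconds.
# Tries the GNU date form first, then falls back to the BSD/macOS form.
name_to_utc() {
    secs=$(( $1 / 1000 ))
    date -u -d "@$secs" '+%Y-%m-%d %H:%M:%S' 2>/dev/null ||
        date -u -r "$secs" '+%Y-%m-%d %H:%M:%S'
}

# Example (path taken from the post above):
#   count_subdirs /var/lib/hadoop-hdfs/cache/imply/mapred/local
#   name_to_utc 1463738866867
```

Under that assumption, 1463738866867 decodes to 2016-05-20 10:07:46 UTC, which is in line with the May 20 modification times in the listings.
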
Looking into a number of these directories, they all seem to contain a single file:

[root@server 1463738866890]# ls -al
total 55800
drwxr-xr-x 2 imply imply 55 May 20 11:07 .
drwxr-xr-x 1499549 imply imply 38014976 Jun 23 10:20 ..
-rw-r--r-- 1 imply imply 8976 May 20 11:07 .tmp_aws-java-sdk-dynamodb-1.10.21.jar.crc

[root@server 1463738866879]# ls -al
total 55796
drwxr-xr-x 2 imply imply 60 May 20 11:07 .
drwxr-xr-x 1499549 imply imply 38014976 Jun 23 10:20 ..
-rw-r--r-- 1 imply imply 4160 May 20 11:07 .tmp_aws-java-sdk-swf-libraries-1.10.21.jar.crc

[root@server 1463738866893]# ls -al
total 55792
drwxr-xr-x 2 imply imply 50 May 20 11:07 .
drwxr-xr-x 1499549 imply imply 38014976 Jun 23 10:20 ..
-rw-r--r-- 1 imply imply 1956 May 20 11:07 .tmp_aws-java-sdk-sqs-1.10.21.jar.crc

It looks to me like something isn't getting cleaned up correctly. Is this a known issue? Where should I look next to work out what is/isn't happening and why? Are there any settings I should look at to improve the setup so that this doesn't happen?

Thanks in advance

AllenJB

Gian Merlino

Jul 1, 2016, 5:05:41 AM

to druid...@googlegroups.com

Hey Allen,

Do you mean /var/lib/hadoop-hdfs/cache/imply/mapred/local on hdfs or on your actual filesystem? What is your hadoop.tmp.dir set to?

Gian

Allen JB

Jul 1, 2016, 1:36:09 PM

to Druid User

Hi,

The specified path is what I see on the local filesystem, not on hdfs.

hadoop.tmp.dir does not appear to be set either in the imply configuration or the hadoop configuration (under /etc/hadoop), so I assume it should be using the default value.
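
One way to confirm what Hadoop actually resolves is to ask the client for the effective value rather than reading the config files by hand. In stock Hadoop 2.x the default for hadoop.tmp.dir is /tmp/hadoop-${user.name}, and mapreduce.cluster.local.dir defaults to ${hadoop.tmp.dir}/mapred/local, so a path under /var/lib/hadoop-hdfs/cache suggests that some packaged configuration may be overriding it. A minimal sketch, assuming the `hdfs` CLI is on the PATH of the user running it; `show_hadoop_conf` is a hypothetical wrapper name.

```shell
# Print the effective value of a Hadoop configuration key,
# as resolved by the hdfs client from its default and site config files.
show_hadoop_conf() {
    if command -v hdfs >/dev/null 2>&1; then
        hdfs getconf -confKey "$1"
    else
        echo "hdfs command not found" >&2
        return 127
    fi
}

# Example:
#   show_hadoop_conf hadoop.tmp.dir
```

Running it as the same user that runs the Druid/Imply services may matter, since the default value includes the user name.
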

AllenJB

Mark

Aug 8, 2016, 12:17:07 PM

to Druid User

I have the same issue (druid-0.8.3), and it seems to be related to local Hadoop ingestion not cleaning up after itself.

I also have many folders in this local Hadoop directory:

/data/hadoop-tmp/username/mapred/local/
ls -1a | wc -l
1293570

A couple of quotes about local Hadoop ingestion:

https://groups.google.com/d/msg/druid-user/kvvQtb4F1Lw/jFc-ndAJBAAJ

Fangjin Yang

26 Jun

Answers duplicated with https://groups.google.com/forum/#!topic/druid-user/SFYlum_wu38.

Do not use local hadoop ingestion for anything beyond a small POC data set and expect good performance.

https://groups.google.com/d/msg/druid-user/SFYlum_wu38/9TsEW8YJBAAJ

Fangjin Yang

26 Jun

The local hadoop task is _only_ meant for quickstarts and PoCs, it is not designed to be performant at all. For ingestion of large batch static files, we recommend using a remote Hadoop cluster or if you have your data in Kafka and are using Kafka 0.9.1, you can stream your data via the new Kafka indexing task.

Be aware that if you wish to use a remote Hadoop Cluster, it may require a custom Druid distribution.

Mark

Aug 8, 2016, 2:04:08 PM

to Druid User

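The following commands might be helpful. The first lists the top-level directories last modified more than three days ago; the second deletes them: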

find /data/hadoop-tmp/username/mapred/local -mindepth 1 -maxdepth 1 -type d -mtime +3 -exec echo {} \;
find /data/hadoop-tmp/username/mapred/local -mindepth 1 -maxdepth 1 -type d -mtime +3 -exec rm -rf {} \;
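
The two commands above can be wrapped so that listing is the default and deletion has to be requested explicitly. This is a sketch under the same assumptions as the original commands (top-level directories only, age judged by modification time); `clean_mapred_local` and its `--delete` argument are hypothetical names, not part of Hadoop or Druid. It is probably safest to run it when no indexing job is active, since a live job may still be using its directory.

```shell
# Usage: clean_mapred_local DIR [DAYS] [--delete]
# Lists top-level directories under DIR last modified more than DAYS days ago
# (default 3). Deletes them only when --delete is given as the third argument.
clean_mapred_local() {
    local dir="$1" days="${2:-3}" mode="${3:-}"

    # Refuse to run against a path that is not a directory.
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi

    if [ "$mode" = "--delete" ]; then
        # "+" passes many directories to each rm invocation.
        find "$dir" -mindepth 1 -maxdepth 1 -type d -mtime "+$days" -exec rm -rf {} +
    else
        # Dry run: only print what would be removed.
        find "$dir" -mindepth 1 -maxdepth 1 -type d -mtime "+$days" -print
    fi
}

# Example:
#   clean_mapred_local /data/hadoop-tmp/username/mapred/local 3
#   clean_mapred_local /data/hadoop-tmp/username/mapred/local 3 --delete
```

Batching with `-exec ... {} +` instead of `\;` may make a noticeable difference when there are over a million directories to remove.
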
