Hadoop disk usage (mapred/local cache cleanup) issue

Allen JB

Jun 23, 2016, 5:49:22 AM
to Druid User
Hi,

I have a single node pseudo-cluster setup using Imply.io (1.2.1, installed manually), Hadoop (2.7.1, installed using the Apache BigTop repo) and MySQL for metadata on CentOS 7.2. I noticed that the disk usage seems to be going up faster than I might expect.

Investigating, I found a large amount of disk usage under /var/lib/hadoop-hdfs/cache/imply/mapred/local, in the form of many directories whose names appear to be timestamps (e.g. '1463738866867', '1463738866870').

"ls -1a | wc -l" returns 1,499,549

Looking into a number of these directories, they all seem to contain a single file:
[root@server 1463738866890]# ls -al
total 55800
drwxr-xr-x       2 imply imply       55 May 20 11:07 .
drwxr-xr-x 1499549 imply imply 38014976 Jun 23 10:20 ..
-rw-r--r--       1 imply imply     8976 May 20 11:07 .tmp_aws-java-sdk-dynamodb-1.10.21.jar.crc
[root@server 1463738866879]# ls -al
total 55796
drwxr-xr-x       2 imply imply       60 May 20 11:07 .
drwxr-xr-x 1499549 imply imply 38014976 Jun 23 10:20 ..
-rw-r--r--       1 imply imply     4160 May 20 11:07 .tmp_aws-java-sdk-swf-libraries-1.10.21.jar.crc
[root@server 1463738866893]# ls -al
total 55792
drwxr-xr-x       2 imply imply       50 May 20 11:07 .
drwxr-xr-x 1499549 imply imply 38014976 Jun 23 10:20 ..
-rw-r--r--       1 imply imply     1956 May 20 11:07 .tmp_aws-java-sdk-sqs-1.10.21.jar.crc
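
A rough sketch for sizing up a directory like this, in case it helps others with the same symptom. The function names count_job_dirs and dir_name_to_date are made up for illustration, and the second one assumes GNU date (as shipped on CentOS) and assumes the directory names are epoch timestamps in milliseconds, which the listing above suggests but does not prove. Note that "ls -1a" also counts "." and "..", so the find-based count will be two lower.

```shell
# Hypothetical helper: count the immediate subdirectories of a directory.
# Unlike "ls -1a | wc -l" this skips "." and ".." and does not sort,
# which matters when there are over a million entries.
count_job_dirs() {
  find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Hypothetical helper: treat a directory name as epoch milliseconds
# (an assumption) and print it as a UTC date. Requires GNU date for "-d @".
dir_name_to_date() {
  date -u -d "@$(( $1 / 1000 ))" '+%Y-%m-%d %H:%M:%S UTC'
}

# Example usage (paths taken from the post above):
#   count_job_dirs /var/lib/hadoop-hdfs/cache/imply/mapred/local
#   dir_name_to_date 1463738866867
```

If the decoded dates line up with the modification times shown by ls, that would support the idea that each directory was created by a separate job or task run.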


It looks to me like something isn't getting cleaned up correctly. Is this a known issue? Where should I look next to work out what is/isn't happening and why? Are there any settings I should look at to improve the setup so that this doesn't happen?

Thanks in advance
AllenJB

Gian Merlino

Jul 1, 2016, 5:05:41 AM
to druid...@googlegroups.com
Hey Allen,

Do you mean /var/lib/hadoop-hdfs/cache/imply/mapred/local on hdfs or on your actual filesystem? What is your hadoop.tmp.dir set to?

Gian

Allen JB

Jul 1, 2016, 1:36:09 PM
to Druid User
Hi,

The specified path is what I see on the local filesystem, not on hdfs.

hadoop.tmp.dir does not appear to be set either in the imply configuration or the hadoop configuration (under /etc/hadoop), so I assume it should be using the default value.
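
For anyone retracing this check, here is a minimal sketch of how one might confirm whether the property is set anywhere. The helper name find_hadoop_prop is hypothetical, and /etc/hadoop is simply the location mentioned above. As far as I know, stock Hadoop defaults hadoop.tmp.dir to /tmp/hadoop-${user.name} and the MapReduce local directory to ${hadoop.tmp.dir}/mapred/local, so a path under /var/lib/hadoop-hdfs/cache may mean a packaged config file overrides it; that is an inference, not something confirmed in this thread.

```shell
# Hypothetical helper: list the XML config files under a directory that
# define a given Hadoop property. -F treats the pattern as a fixed string,
# so the dots in the property name are not regex wildcards.
find_hadoop_prop() {
  grep -RlF --include='*.xml' "<name>$1</name>" "$2"
}

# Example usage (the config location is an assumption):
#   find_hadoop_prop hadoop.tmp.dir /etc/hadoop
#   find_hadoop_prop mapreduce.cluster.local.dir /etc/hadoop
#
# The stock Hadoop CLI can also print the effective value it resolves:
#   hdfs getconf -confKey hadoop.tmp.dir
```

If neither property shows up in any file, the value the Druid/Imply process actually uses may come from its own classpath rather than /etc/hadoop.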

AllenJB

Mark

Aug 8, 2016, 12:17:07 PM
to Druid User
I have the same issue (druid-0.8.3), and it seems to be related to local Hadoop ingestion not cleaning up after itself.

I also have many folders in this local Hadoop directory.

/data/hadoop-tmp/username/mapred/local/
ls -1a | wc -l
1293570


A couple of quotes about local Hadoop ingestion:


Fangjin Yang
26 Jun
Answers duplicated with https://groups.google.com/forum/#!topic/druid-user/SFYlum_wu38.

Do not use local hadoop ingestion for anything beyond a small POC data set and expect good performance.


https://groups.google.com/d/msg/druid-user/SFYlum_wu38/9TsEW8YJBAAJ

Fangjin Yang
26 Jun
The local hadoop task is _only_ meant for quickstarts and PoCs, it is not designed to be performant at all. For ingestion of large batch static files, we recommend using a remote Hadoop cluster or if you have your data in Kafka and are using Kafka 0.9.1, you can stream your data via the new Kafka indexing task.



Be aware that if you wish to use a remote Hadoop Cluster, it may require a custom Druid distribution.

Mark

Aug 8, 2016, 2:04:08 PM
to Druid User
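The following commands might be helpful. The first is a dry run that lists the top-level directories last modified more than three days ago; the second deletes them.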

find /data/hadoop-tmp/username/mapred/local -mindepth 1 -maxdepth 1 -type d -mtime +3 -exec echo {} \;
find /data/hadoop-tmp/username/mapred/local -mindepth 1 -maxdepth 1 -type d -mtime +3 -exec rm -rf {} \;
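
The two commands above can be folded into one slightly more defensive sketch. The function name clean_mapred_local and its arguments are made up for illustration. It only touches directories whose names consist entirely of digits (matching the timestamp-style names seen in this thread), and it deletes nothing unless "delete" is passed explicitly. It assumes that nothing still running needs directories older than the cutoff, which I have not verified.

```shell
# Hypothetical helper: clean_mapred_local DIR DAYS [delete]
# Lists (default) or removes top-level, all-digit directories in DIR
# that were last modified more than DAYS days ago.
clean_mapred_local() {
  dir=$1
  days=$2
  mode=${3:-dry-run}
  if [ "$mode" = "delete" ]; then
    # "{} +" batches many paths into each rm call, which is much cheaper
    # than one rm process per directory when there are a million of them.
    find "$dir" -mindepth 1 -maxdepth 1 -type d \
      -name '[0-9]*' ! -name '*[!0-9]*' \
      -mtime +"$days" -exec rm -rf {} +
  else
    find "$dir" -mindepth 1 -maxdepth 1 -type d \
      -name '[0-9]*' ! -name '*[!0-9]*' \
      -mtime +"$days" -print
  fi
}

# Example usage (path from the post above):
#   clean_mapred_local /data/hadoop-tmp/username/mapred/local 3
#   clean_mapred_local /data/hadoop-tmp/username/mapred/local 3 delete
```

Running the dry run first and checking its output is the whole point of the split. The delete form could then go into a cron job if the directories keep coming back.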


