Hi,
I have a single-node pseudo-cluster setup using Imply.io (1.2.1, installed manually), Hadoop (2.7.1, installed from the Apache BigTop repo) and MySQL for metadata, on CentOS 7.2. I noticed that disk usage seems to be increasing faster than I would expect.
Investigating, I found a large amount of disk usage under /var/lib/hadoop-hdfs/cache/imply/mapred/local, in the form of many directories whose names appear to be timestamps (e.g. '1463738866867', '1463738866870').
In that directory, "ls -1a | wc -l" returns 1,499,549.
Looking into a number of these directories, each seems to contain a single file:
[root@server 1463738866890]# ls -al
total 55800
drwxr-xr-x 2 imply imply 55 May 20 11:07 .
drwxr-xr-x 1499549 imply imply 38014976 Jun 23 10:20 ..
-rw-r--r-- 1 imply imply 8976 May 20 11:07 .tmp_aws-java-sdk-dynamodb-1.10.21.jar.crc
[root@server 1463738866879]# ls -al
total 55796
drwxr-xr-x 2 imply imply 60 May 20 11:07 .
drwxr-xr-x 1499549 imply imply 38014976 Jun 23 10:20 ..
-rw-r--r-- 1 imply imply 4160 May 20 11:07 .tmp_aws-java-sdk-swf-libraries-1.10.21.jar.crc
[root@server 1463738866893]# ls -al
total 55792
drwxr-xr-x 2 imply imply 50 May 20 11:07 .
drwxr-xr-x 1499549 imply imply 38014976 Jun 23 10:20 ..
-rw-r--r-- 1 imply imply 1956 May 20 11:07 .tmp_aws-java-sdk-sqs-1.10.21.jar.crc
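To check whether every directory follows the same pattern, I put together a rough sketch. (On my box LOCAL_DIR would be /var/lib/hadoop-hdfs/cache/imply/mapred/local; here I populate a temp dir with sample data so the sketch runs standalone.)

```shell
#!/bin/sh
# Sketch: count the timestamp directories and check whether any of them
# contain anything other than .crc files.
LOCAL_DIR=$(mktemp -d)

# Sample data mimicking what I see on disk (assumed layout).
mkdir -p "$LOCAL_DIR/1463738866890" "$LOCAL_DIR/1463738866879"
printf 'x' > "$LOCAL_DIR/1463738866890/.tmp_aws-java-sdk-dynamodb-1.10.21.jar.crc"
printf 'x' > "$LOCAL_DIR/1463738866879/.tmp_aws-java-sdk-swf-libraries-1.10.21.jar.crc"

# Count top-level directories, and any files that are NOT .crc files.
dirs=$(find "$LOCAL_DIR" -mindepth 1 -maxdepth 1 -type d | wc -l)
stray=$(find "$LOCAL_DIR" -mindepth 2 -type f ! -name '*.crc' | wc -l)
echo "directories: $dirs, non-crc files: $stray"
```

On my real data the "non-crc files" count comes back 0, which is what makes me think these are leftover temp checksum files.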
It looks to me like something isn't being cleaned up correctly. Is this a known issue? Where should I look next to work out what is (or isn't) happening, and why? Are there any settings I should adjust to improve the setup so that this doesn't happen?
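In the meantime, would it be safe to clean these up with something like the following? (A sketch only; the 7-day retention window is my guess, and the temp dir with sample data is just so the sketch runs standalone; I'd test against a copy before running it on the real LOCAL_DIR.)

```shell
#!/bin/sh
# Sketch: remove digit-named directories not modified in over 7 days,
# leaving recent ones alone. Uses GNU find (-regex), as on CentOS 7.
LOCAL_DIR=$(mktemp -d)

# Sample data: one old directory and one recent one (assumed names).
mkdir -p "$LOCAL_DIR/1463738866890"
touch -d '30 days ago' "$LOCAL_DIR/1463738866890"
mkdir -p "$LOCAL_DIR/1466700000000"   # recent; should survive the cleanup

# Only top-level directories whose names are all digits, older than 7 days.
find "$LOCAL_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +7 \
  -regex '.*/[0-9]+' -exec rm -rf {} +

remaining=$(find "$LOCAL_DIR" -mindepth 1 -maxdepth 1 -type d | wc -l)
echo "remaining: $remaining"
```

Even if that's safe as a stopgap, I'd still like to understand why the cleanup isn't happening on its own.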
Thanks in advance
AllenJB