Hi all,
I'm having some trouble using Tachyon and understanding some of its concepts. I'm using Tachyon 0.5.0.
We would like to try using Tachyon as a caching layer on top of a parallel file system (GPFS) for use with Hadoop and Spark. I have a few issues that I'm not sure how to diagnose:
1. When listing which files are in memory on slaves, I get the following spew:
$ tachyon tfs lsr tachyon://some-slave-hostname:29998/terasort_in
14/10/29 15:50:51 INFO USER_LOGGER: Trying to connect master @ some-slave-hostname/10.141.129.117:29998
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'user_getUserId'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
2. When I try to stage in data from GPFS, I get different spew:
$ tachyon tfs copyFromLocal $SCRATCH/terasort_in tachyon://master-hostname:19998/terasort_in
14/10/29 15:51:53 INFO USER_LOGGER: Trying to connect master @ master-hostname/10.141.129.30:19998
14/10/29 15:51:53 INFO USER_LOGGER: User registered at the master master-hostname/10.141.129.30:19998 got UserId 12
14/10/29 15:51:53 INFO USER_LOGGER: Trying to get local worker host : master-hostname
14/10/29 15:51:53 INFO USER_LOGGER: Connecting local worker @ master-hostname/10.141.129.30:29998
14/10/29 15:51:53 INFO USER_LOGGER: Folder /gpfs/ewan/workdir/ewan.master-hostname.46082/tachyon/ramdisk/users/12 was created!
14/10/29 15:51:53 INFO USER_LOGGER: /gpfs/ewan/workdir/ewan.master-hostname.46082/tachyon/ramdisk/users/12/413390602240 was created!
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 0
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 1
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 2
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Local tachyon worker does not have enough space (763588) or no worker for 385 413390602240
14/10/29 15:52:01 INFO USER_LOGGER: /gpfs/ewan/workdir/ewan.master-hostname.46082/tachyon/ramdisk/users/12/414464344064 was created!
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 0
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 1
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 2
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Local tachyon worker does not have enough space (1047552) or no worker for 386 414464344064
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
...
3. Of course, with all this strange spew, I'm not hopeful about running jobs. For example, when I run the Hadoop terasort benchmark I get an error:
$ hadoop jar /path/to/hadoop-2.3.0-cdh5.0.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort -libjars /path/to/tachyon-0.5.0/core/target/tachyon-0.5.0-jar-with-dependencies.jar -Dmapreduce.job.maps=186 -Dmapreduce.job.reduces=96 tachyon://$(hostname):19998/terasort_in tachyon://$(hostname):19998/terasort_out
...
2014-10-29 15:45:48,890 INFO Job (Job.java:monitorAndPrintJob) - Running job: job_1414578744579_0003
2014-10-29 15:45:51,903 INFO Job (Job.java:monitorAndPrintJob) - Job job_1414578744579_0003 running in uber mode : false
2014-10-29 15:45:51,904 INFO Job (Job.java:monitorAndPrintJob) - map 0% reduce 0%
2014-10-29 15:45:51,915 INFO Job (Job.java:monitorAndPrintJob) - Job job_1414578744579_0003 failed with state FAILED due to: Application application_1414578744579_0003 failed 2 times due to AM Container for appattempt_1414578744579_0003_000002 exited with exitCode: -1000 due to: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class tachyon.hadoop.TFS not found
This would appear to be because the tachyon-client-0.5.0-jar-with-dependencies.jar isn't found, even though it's in the $HADOOP_HOME/lib directory as requested by the documentation [1] and it's set as a -libjars parameter. The class should be available in the jar:
$ unzip -l /path/to/hadoop-2.3.0-cdh5.0.0/lib/tachyon-client-0.5.0-jar-with-dependencies.jar | grep tachyon.hadoop.TFS
13244 07-19-2014 19:53 tachyon/hadoop/TFS.class
N.B.: I haven't explicitly set the HADOOP_CLASSPATH for the NodeManager processes to use $HADOOP_HOME/lib.
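For concreteness, the step I have not done would look roughly like the sketch below. It is only a guess at the shape of the fix: I'm assuming that HADOOP_HOME points at the Hadoop install (the fallback path is the one from the commands above), that the NodeManager processes would actually pick up HADOOP_CLASSPATH from their environment, and the TACHYON_CLIENT_JAR variable name is just something I made up for illustration.

```shell
# Assumption: HADOOP_HOME points at the Hadoop install; fall back to the path used above.
HADOOP_HOME="${HADOOP_HOME:-/path/to/hadoop-2.3.0-cdh5.0.0}"

# Hypothetical helper variable naming the Tachyon client jar under $HADOOP_HOME/lib.
TACHYON_CLIENT_JAR="$HADOOP_HOME/lib/tachyon-client-0.5.0-jar-with-dependencies.jar"

# Append the jar, keeping any existing entries and avoiding a leading ':'.
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$TACHYON_CLIENT_JAR"

# Print one classpath entry per line for inspection.
printf '%s\n' "$HADOOP_CLASSPATH" | tr ':' '\n'
```

I haven't verified that this changes anything for the AM container, so treat it as a description of what is missing from my setup rather than a known fix.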
I appreciate any help possible here.
Thanks!
-Ewan
[1] http://tachyon-project.org/Running-Hadoop-MapReduce-on-Tachyon.html