Tachyon copyDir

EwanH

Oct 29, 2014, 12:18:32 PM

to tachyo...@googlegroups.com

Hi all,
I'm having some trouble with using Tachyon and also with understanding some concepts. I'm using Tachyon 0.5.0.

We would like to try using Tachyon as a caching layer on top of a parallel file system (GPFS) for use with Hadoop and Spark. I have a few issues that I'm not sure how to diagnose:

1. When listing which files are in memory on slaves, I get the following spew:

$ tachyon tfs lsr tachyon://some-slave-hostname:29998/terasort_in
14/10/29 15:50:51 INFO USER_LOGGER: Trying to connect master @ some-slave-hostname/10.141.129.117:29998
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'user_getUserId'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'

2. When I try to stage in data from GPFS, I get different spew:

$ tachyon tfs copyFromLocal $SCRATCH/terasort_in tachyon://master-hostname:19998/terasort_in
14/10/29 15:51:53 INFO USER_LOGGER: Trying to connect master @ master-hostname/10.141.129.30:19998
14/10/29 15:51:53 INFO USER_LOGGER: User registered at the master master-hostname/10.141.129.30:19998 got UserId 12
14/10/29 15:51:53 INFO USER_LOGGER: Trying to get local worker host : master-hostname
14/10/29 15:51:53 INFO USER_LOGGER: Connecting local worker @ master-hostname/10.141.129.30:29998
14/10/29 15:51:53 INFO USER_LOGGER: Folder /gpfs/ewan/workdir/ewan.master-hostname.46082/tachyon/ramdisk/users/12 was created!
14/10/29 15:51:53 INFO USER_LOGGER: /gpfs/ewan/workdir/ewan.master-hostname.46082/tachyon/ramdisk/users/12/413390602240 was created!
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 0
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 1
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 2
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Local tachyon worker does not have enough space (763588) or no worker for 385 413390602240
14/10/29 15:52:01 INFO USER_LOGGER: /gpfs/ewan/workdir/ewan.master-hostname.46082/tachyon/ramdisk/users/12/414464344064 was created!
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 0
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 1
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 2
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Local tachyon worker does not have enough space (1047552) or no worker for 386 414464344064
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Can not write cache.
...

3. Of course, with all this strange spew, I'm not hopeful about running jobs. For example, when I run the Hadoop terasort benchmark I get an error:

hadoop jar /path/to/hadoop-2.3.0-cdh5.0.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort -libjars /path/to/tachyon-0.5.0/core/target/tachyon-0.5.0-jar-with-dependencies.jar -Dmapreduce.job.maps=186 -Dmapreduce.job.reduces=96 tachyon://$(hostname):19998/terasort_in tachyon://$(hostname):19998/terasort_out
...
2014-10-29 15:45:48,890 INFO Job (Job.java:monitorAndPrintJob) - Running job: job_1414578744579_0003
2014-10-29 15:45:51,903 INFO Job (Job.java:monitorAndPrintJob) - Job job_1414578744579_0003 running in uber mode : false
2014-10-29 15:45:51,904 INFO Job (Job.java:monitorAndPrintJob) - map 0% reduce 0%
2014-10-29 15:45:51,915 INFO Job (Job.java:monitorAndPrintJob) - Job job_1414578744579_0003 failed with state FAILED due to: Application application_1414578744579_0003 failed 2 times due to AM Container for appattempt_1414578744579_0003_000002 exited with exitCode: -1000 due to: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class tachyon.hadoop.TFS not found

This would appear to be because the tachyon-client-0.5.0-jar-with-dependencies.jar isn't found, even though it's in the $HADOOP_HOME/lib directory as requested by the documentation [1] and it's set as a -libjars param. The class should be available in the jar:

$ unzip -l /path/to/hadoop-2.3.0-cdh5.0.0/lib/tachyon-client-0.5.0-jar-with-dependencies.jar | grep tachyon.hadoop.TFS
13244 07-19-2014 19:53 tachyon/hadoop/TFS.class

N.B.: I haven't explicitly set the HADOOP_CLASSPATH for the NodeManager processes to use $HADOOP_HOME/lib.

I appreciate any help possible here.

Thanks!

-Ewan

[1] http://tachyon-project.org/Running-Hadoop-MapReduce-on-Tachyon.html

EwanH

Oct 29, 2014, 12:20:00 PM

Oops, I forgot the part about copyDir. I don't understand this command. If I push a file to Tachyon doesn't it automatically push the files to other nodes? That's like the whole point. What is the purpose of the copyDir command?

David Capwell

Oct 30, 2014, 4:05:13 PM

So, a quick question. You said that you want to use Tachyon with GPFS, which is from IBM. As of 0.5, the list of what's allowed is hard-coded (master has it configurable). Are the comments above with GPFS or with HDFS?

Rest of my comments are inline

On Wednesday, 29 October 2014 17:18:32 UTC+1, EwanH wrote:
1. When listing which files are in memory on slaves, I get the following spew:

$ tachyon tfs lsr tachyon://some-slave-hostname:29998/terasort_in
14/10/29 15:50:51 INFO USER_LOGGER: Trying to connect master @ some-slave-hostname/10.141.129.117:29998
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'user_getUserId'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'

I have seen this a few times when the server and the client were compiled at different times. Did you compile Tachyon yourself? This message is coming from the Apache Thrift server.

2. When I try to stage in data from GPFS, I get different spew:

$ tachyon tfs copyFromLocal $SCRATCH/terasort_in tachyon://master-hostname:19998/terasort_in
14/10/29 15:51:53 INFO USER_LOGGER: Trying to connect master @ master-hostname/10.141.129.30:19998
14/10/29 15:51:53 INFO USER_LOGGER: User registered at the master master-hostname/10.141.129.30:19998 got UserId 12
14/10/29 15:51:53 INFO USER_LOGGER: Trying to get local worker host : master-hostname
14/10/29 15:51:53 INFO USER_LOGGER: Connecting local worker @ master-hostname/10.141.129.30:29998
14/10/29 15:51:53 INFO USER_LOGGER: Folder /gpfs/ewan/workdir/ewan.master-hostname.46082/tachyon/ramdisk/users/12 was created!
14/10/29 15:51:53 INFO USER_LOGGER: /gpfs/ewan/workdir/ewan.master-hostname.46082/tachyon/ramdisk/users/12/413390602240 was created!
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 0
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 1
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 2
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Local tachyon worker does not have enough space (763588) or no worker for 385 413390602240

If this is happening for the same reason I think it is, then setting tachyon.user.quota.unit.bytes to something like 1-2 will fix it. https://tachyon.atlassian.net/browse/TACHYON-202

If this still happens after setting it low, let me know.
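
As a rough illustration of why a smaller quota unit might help: the log above shows the client asking for 8388608 bytes at a time, while the WARN line reports 763588. The sketch below only does that arithmetic. It assumes the number in parentheses in the WARN line is the worker's free space, and `request_fits` is a hypothetical helper, not part of Tachyon.

```shell
# Return 0 if a request of $1 bytes fits into $2 available bytes.
request_fits() {
  [ "$1" -le "$2" ]
}

requested=8388608   # bytes per request, from the "Failed to request" log lines
available=763588    # number in parentheses in the WARN line (assumed free space)

echo "requested: $((requested / 1024 / 1024)) MiB, available: $((available / 1024)) KiB"
if request_fits "$requested" "$available"; then
  echo "request fits"
else
  echo "request does not fit"
fi
```

Under that assumption, a request unit of a few bytes would fit into the remaining space where an 8 MiB unit cannot.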

You wouldn't need to set HADOOP_CLASSPATH for the NodeManager. Only your MR or Spark code will care. For MR, it should read from HADOOP_HOME/lib, which is now commons home. Are you able to get the classpath that was used? I forget whether YARN lets you see the classpath of a failed job.

If you can get the classpath, verify that the Hadoop path you added Tachyon to is there. If not, you can always have MR handle adding the jar for you: ./bin/hadoop ... -libjars /path/to/tachyon.jar
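
One way to do that verification is to split the classpath on `:` and look for the directory the jar was placed in. This is a minimal sketch: `classpath_covers_dir` is a hypothetical helper, the sample classpath is made up for illustration, and on a real cluster the first argument would be the output of `yarn classpath`.

```shell
# Return 0 if the classpath contains the directory itself or its wildcard
# entry (dir/*). Entries for sub-directories or single files do not count.
classpath_covers_dir() {
  cp_string=$1
  dir=${2%/}   # drop a trailing slash, if any
  printf '%s\n' "$cp_string" | tr ':' '\n' | grep -q -F -x -e "$dir" -e "$dir/*"
}

# Hypothetical values; replace sample_cp with "$(yarn classpath)" on a cluster.
hadoop_home=/path/to/hadoop
sample_cp="$hadoop_home/etc/hadoop:$hadoop_home/share/hadoop/mapreduce/*:$hadoop_home/share/hadoop/yarn/*"

if classpath_covers_dir "$sample_cp" "$hadoop_home/lib"; then
  echo "$hadoop_home/lib is on the classpath"
else
  echo "$hadoop_home/lib is NOT on the classpath"
fi
```

If the directory is missing, either move the jar into a directory that is listed or pass it explicitly with -libjars.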

EwanH

Nov 3, 2014, 9:15:37 AM

On Thursday, 30 October 2014 21:05:13 UTC+1, David Capwell wrote:

so quick question. You said that you want to use Tachyon with GPFS which is from IBM. As of 0.5 the list of whats allowed is hard-coded (master has it configurable). Are the comments above with GPFS or with HDFS?

The comments above are with GPFS. I'm just treating it as a vanilla POSIX file system at the moment. Each node has its own working directory and I'm not trying to use any file placement optimizations.

On Wednesday, 29 October 2014 17:18:32 UTC+1, EwanH wrote:
1. When listing which files are in memory on slaves, I get the following spew:

$ tachyon tfs lsr tachyon://some-slave-hostname:29998/terasort_in
14/10/29 15:50:51 INFO USER_LOGGER: Trying to connect master @ some-slave-hostname/10.141.129.117:29998
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'user_getUserId'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'
14/10/29 15:50:51 ERROR USER_LOGGER: Invalid method name: 'liststatus'

I have seen this a few times when the server and the client were compiled at different times. Did you compile Tachyon yourself? This message is coming from the Apache Thrift server.

I'm using the jar provided with the 0.5.0 release found here: https://github.com/amplab/tachyon/releases/download/v0.5.0/tachyon-0.5.0-bin.tar.gz

The md5sum is:
83258d53ecd9d80d35437d5492c20af6 /path/to/tachyon/core/tachyon-0.5.0-jar-with-dependencies.jar

2. When I try to stage in data from GPFS, I get different spew:

$ tachyon tfs copyFromLocal $SCRATCH/terasort_in tachyon://master-hostname:19998/terasort_in
14/10/29 15:51:53 INFO USER_LOGGER: Trying to connect master @ master-hostname/10.141.129.30:19998
14/10/29 15:51:53 INFO USER_LOGGER: User registered at the master master-hostname/10.141.129.30:19998 got UserId 12
14/10/29 15:51:53 INFO USER_LOGGER: Trying to get local worker host : master-hostname
14/10/29 15:51:53 INFO USER_LOGGER: Connecting local worker @ master-hostname/10.141.129.30:29998
14/10/29 15:51:53 INFO USER_LOGGER: Folder /gpfs/ewan/workdir/ewan.master-hostname.46082/tachyon/ramdisk/users/12 was created!
14/10/29 15:51:53 INFO USER_LOGGER: /gpfs/ewan/workdir/ewan.master-hostname.46082/tachyon/ramdisk/users/12/413390602240 was created!
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 0
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 1
14/10/29 15:52:01 INFO USER_LOGGER: Failed to request 8388608 bytes local space. Time 2
14/10/29 15:52:01 WARN USER_LOGGER: Fail to cache for: Local tachyon worker does not have enough space (763588) or no worker for 385 413390602240

If this is happening for the same reason I think it is, then setting tachyon.user.quota.unit.bytes to something like 1-2 will fix it. https://tachyon.atlassian.net/browse/TACHYON-202

Off-topic: That page requires a login and my Apache login doesn't work for that. Also, the following ticket doesn't exist: https://issues.apache.org/jira/browse/TACHYON-202
I eventually got to see it, but it's surprising that it's private and requires an entirely different login than the rest of the Apache JIRAs. Is this due to Tachyon's early status or is it perhaps a misconfiguration?

If this still happens after setting it low, let me know.

I set -Dtachyon.user.quota.unit.bytes=1 when starting tachyon master and slaves but I still get the error when copying. I also tried setting it as part of the TACHYON_JAVA_OPTS when I ran the actual script but that didn't help either.

Using 'yarn classpath' I see that it doesn't look in HADOOP_HOME/lib; instead it's looking in HADOOP_HOME/share/hadoop/{mapreduce|yarn|hdfs|...}/*. I put my Tachyon jar in the mapreduce sub-directory and I get a new error message. If I remove the jar from the directory I get the previous error message.

java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.Job.getInstance(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/mapreduce/Job;
        at org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:283)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:326)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
        at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

My guess is that there is some configuration option that is defaulted to some version of Hadoop (which isn't 2.3.0-cdh5.0.0) so it's not happy. I don't see anything listed here that would help it along:

https://github.com/amplab/tachyon/wiki/Configuration-Settings

If you can get the classpath, verify that the Hadoop path you added Tachyon to is there. If not, you can always have MR handle adding the jar for you: ./bin/hadoop ... -libjars /path/to/tachyon.jar

Thanks,
Ewan

Haoyuan Li

Nov 4, 2014, 1:53:50 PM

Ewan,

When you ran Tachyon on GPFS, did the Tachyon UI work? By default, the UI should be at "host:19999".

Best,

Haoyuan

David Capwell

Nov 12, 2014, 2:11:57 PM

Thanks for the feedback.

"I'm using the jar provided with the 0.5.0 release found here: https://github.com/amplab/tachyon/releases/download/v0.5.0/tachyon-0.5.0-bin.tar.gz"

Then it's not the issue that I am aware of. This needs to be investigated more.

"Off-topic: That page requires a login and my Apache login doesn't work for that. Also, the following ticket doesn't exist: https://issues.apache.org/jira/browse/TACHYON-202
I eventually got to see it, but it's surprising that it's private and requires an entirely different login than the rest of the Apache JIRAs. Is this due to Tachyon's early status or is it perhaps a misconfiguration?"

It's been asked to go public for a while now, but it seems that it's going to stay private... I do hope that this changes in the future.

"I set -Dtachyon.user.quota.unit.bytes=1 when starting tachyon master and slaves but I still get the error when copying. I also tried setting it as part of the TACHYON_JAVA_OPTS when I ran the actual script but that didn't help either."

That param only has an effect on the client code, but if your client is consuming TACHYON_JAVA_OPTS then this sounds like a different issue than the one I know of. I need to look into this more.

"My guess is that there is some configuration option that is defaulted to some version of Hadoop (which isn't 2.3.0-cdh5.0.0) so it's not happy. I don't see anything listed here that would help it along:"

This is because Tachyon's uber jar contains Hadoop as well. So you have two versions of Hadoop in the classpath...

As of this point in time, Tachyon needs to be compiled against every version and every distro in order to have the right classpath. There have been efforts in the past to fix this, but they caused issues for Spark, so that work was put on hold. Most likely what we will need to do is have an uber jar for Spark (which already ships with Tachyon...) and a normal setup for Hadoop.
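
To see whether a given jar bundles Hadoop classes, one option is to count the matching entries in its `unzip -l` listing. This is only a sketch: `count_bundled_classes` is a hypothetical helper, and the sample listing below is invented apart from the TFS.class line quoted earlier in the thread.

```shell
# Count .class entries in an `unzip -l`-style listing (read from stdin) whose
# path starts with the given package directory, e.g. org/apache/hadoop/.
count_bundled_classes() {
  prefix=$1
  awk -v p="$prefix" '$NF ~ /\.class$/ && index($NF, p) == 1 { n++ } END { print n + 0 }'
}

# Illustrative listing; sizes and dates on the last two lines are made up.
sample_listing='    13244  07-19-2014 19:53   tachyon/hadoop/TFS.class
     2048  07-19-2014 19:53   org/apache/hadoop/mapreduce/Job.class
     1024  07-19-2014 19:53   org/apache/hadoop/conf/Configuration.class'

printf '%s\n' "$sample_listing" | count_bundled_classes org/apache/hadoop/
```

Against a real jar the call would look like `unzip -l /path/to/tachyon-client-0.5.0-jar-with-dependencies.jar | count_bundled_classes org/apache/hadoop/`; a non-zero count suggests the jar carries its own copy of Hadoop.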

David Capwell

Nov 12, 2014, 2:25:27 PM

https://tachyon.atlassian.net/browse/TACHYON-229 for the liststatus issue.


Qianhao Dong

Nov 17, 2014, 4:13:50 AM

For the first issue, maybe you should use port 19998, which is the default port of TachyonMaster (29998 is the default port of TachyonWorker).

e.g. $ tachyon tfs lsr tachyon://some-slave-hostname:19998/terasort_in

I have tested it on my machine and I also got the spew when I used the incorrect port.
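
A small guard along these lines could catch the mistake before the command is run. It is a sketch only: the two port numbers are the defaults mentioned above, and `tachyon_uri_port` and `check_master_uri` are hypothetical helpers, not part of the Tachyon CLI.

```shell
# Default ports as stated in this thread: master 19998, worker 29998.
MASTER_PORT=19998
WORKER_PORT=29998

# Print the port of a tachyon://host:port/path URI (prints nothing if absent).
tachyon_uri_port() {
  printf '%s\n' "$1" | sed -n 's|^tachyon://[^/:]*:\([0-9][0-9]*\).*$|\1|p'
}

# Return 0 if the URI targets the master port; otherwise explain on stderr
# and return 1.
check_master_uri() {
  port=$(tachyon_uri_port "$1")
  if [ "$port" = "$MASTER_PORT" ]; then
    return 0
  fi
  if [ "$port" = "$WORKER_PORT" ]; then
    echo "port $port is the default worker port; try the master port $MASTER_PORT" >&2
  else
    echo "unexpected or missing port in: $1" >&2
  fi
  return 1
}

check_master_uri "tachyon://some-slave-hostname:29998/terasort_in" \
  || echo "use port $MASTER_PORT instead"
```

The check only knows about the default ports, so a cluster configured with non-default ports would need the two variables adjusted.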
