Standalone Java Spark programming and core-site.xml

Albert Kwong

Nov 7, 2013, 4:16:26 AM
to tachyo...@googlegroups.com
Hi there,

I am following the instructions at http://spark.incubator.apache.org/docs/latest/java-programming-guide.html to build a standalone Java program, and I would like to save some files to Tachyon as follows:

counts.saveAsTextFile("tachyon://localhost:19998/README.wc");


This step fails with the following message:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: tachyon

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1383)

...

My Spark + Tachyon environment should be fine, because I can execute the same command in spark-shell correctly. After experimenting for a while, I found that this error is probably related to $SPARK_HOME/conf/core-site.xml, which exists but is not read by the Java app.
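For context, a minimal diagnostic sketch of what Hadoop does with that scheme: it builds a Configuration from any core-site.xml found on the classpath and reads fs.<scheme>.impl to pick the FileSystem class, so you can check what the program actually sees:

    import org.apache.hadoop.conf.Configuration;

    // new Configuration() loads core-site.xml from the classpath; if this prints
    // null, Hadoop cannot resolve tachyon:// and throws
    // "No FileSystem for scheme: tachyon".
    Configuration conf = new Configuration();
    System.out.println("fs.tachyon.impl = " + conf.get("fs.tachyon.impl"));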

Does anyone know how to fix this issue? Here's my full program. Thanks!


import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class WordCount {

    private static final String SPARK_HOME = "/usr/local/spark";
    private static final String CLUSTER_URL = "spark://localhost:7077";
    private static final String APP_NAME = "wordcount";
    private static final String APP_JAR = "target/scala-2.10/jobone_2.10-1.0-SNAPSHOT.jar";

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(CLUSTER_URL, APP_NAME, SPARK_HOME,
                new String[] { APP_JAR });

        String logFile = SPARK_HOME + "/README.md"; // Should be some file on your system

        JavaRDD<String> logData = sc.textFile(logFile).cache();

        // Split each line into words.
        JavaRDD<String> word = logData.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String s) {
                return Arrays.asList(s.split(" "));
            }
        });

        // Pair each word with a count of one.
        // (Note: in Spark 1.0+ this call is mapToPair rather than map.)
        JavaPairRDD<String, Integer> ones = word.map(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        // Sum the counts for each word.
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(
                new Function2<Integer, Integer, Integer>() {
                    public Integer call(Integer i1, Integer i2) {
                        return i1 + i2;
                    }
                });

        counts.saveAsTextFile("tachyon://localhost:19998/README.wc");
    }
}





Wang Tao

Nov 7, 2013, 11:04:29 AM
to tachyo...@googlegroups.com
Have you already added the Tachyon impl to $SPARK_HOME/conf/core-site.xml, as described here: https://github.com/amplab/tachyon/wiki/Running-Spark-on-Tachyon ?

On Thursday, November 7, 2013 at 5:16:26 PM UTC+8, Albert Kwong wrote:

Albert Kwong

Nov 7, 2013, 11:34:33 PM
to tachyo...@googlegroups.com
Yup, and I've already confirmed that spark-shell with Tachyon is working.

But it seems the Java program is not reading core-site.xml properly.

Albert

Wang Tao

Nov 8, 2013, 11:06:50 AM
to tachyo...@googlegroups.com
You may try adding the Tachyon impl to $HADOOP_HOME/conf/hdfs-site.xml.

Since I've never used it this way, I am not sure if it works.

On Friday, November 8, 2013 at 12:34:33 PM UTC+8, Albert Kwong wrote:

Haoyuan Li

Nov 14, 2013, 1:13:56 AM
to tachyo...@googlegroups.com
Has the Java problem been fixed?

Haoyuan

Albert Kwong

Nov 14, 2013, 1:34:11 AM
to tachyo...@googlegroups.com
Thanks for following up. The problem still exists.

I have core-site.xml set up correctly for spark-shell; I just couldn't figure out how to do the same in a standalone Java program. Your help is appreciated! Thank you.

Albert

Haoyuan Li

Nov 14, 2013, 9:45:46 AM
to Albert Kwong, tachyo...@googlegroups.com

In particular, please make sure core-site.xml is on your Java program's classpath.
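A quick way to verify that from inside the program (a minimal sketch; the resource lookup mirrors how Hadoop's Configuration locates the file):

    // If this prints null, core-site.xml is not on the classpath, and Hadoop's
    // Configuration will never see the fs.tachyon.impl property.
    java.net.URL url = Thread.currentThread().getContextClassLoader()
        .getResource("core-site.xml");
    System.out.println("core-site.xml found at: " + url);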

Haoyuan




--
Haoyuan Li
Algorithms, Machines, People Lab, EECS, UC Berkeley

Ramaraju Indukuri

Jul 26, 2014, 5:08:06 PM
to tachyo...@googlegroups.com
Hi Albert, have you resolved this issue? I have the exact same problem.

Ram

Henry Saputra

Jul 26, 2014, 9:01:57 PM
to Ramaraju Indukuri, tachyo...@googlegroups.com
Hi guys,

Could you make sure these things are set:
1. Export HADOOP_CONF_DIR to point to your Hadoop configuration directory in the environment where you run the Java program.
2. Make sure that in ${HADOOP_CONF_DIR}/core-site.xml you have added this content:

<property>
  <name>fs.tachyon.impl</name>
  <value>tachyon.hadoop.TFS</value>
</property>

so that the Hadoop FileSystem layer knows how to choose the right implementation for the tachyon:// scheme.
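If relying on HADOOP_CONF_DIR is awkward, an alternative sketch (with a hypothetical config path; the tachyon-client jar still has to be on the classpath) is to load the file into the Hadoop Configuration explicitly:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Load core-site.xml explicitly instead of relying on the classpath.
    // "/usr/local/hadoop/conf/core-site.xml" is a hypothetical example path.
    Configuration conf = new Configuration();
    conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
    FileSystem fs = FileSystem.get(URI.create("tachyon://localhost:19998/"), conf);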


Good luck, hope this helps.


- Henry

Albert Kwong

Jul 26, 2014, 9:27:35 PM
to Ramaraju Indukuri, tachyo...@googlegroups.com
Hi Ram, I have since put Tachyon aside.

Albert

Ramaraju Indukuri

Jul 27, 2014, 12:15:35 PM
to tachyo...@googlegroups.com, iram...@gmail.com
No luck, Henry. Per http://tachyon-project.org/Running-Spark-on-Tachyon.html, the core-site.xml with the fs.tachyon.impl property should be in $SPARK_HOME/conf, but I added the property to core-site.xml in $HADOOP_HOME/conf as you suggested and restarted Hadoop, Tachyon, and Spark.

I added HADOOP_CONF_DIR to spark-env.sh and to my environment as well (through .bashrc on Ubuntu).

I also see this in the Spark worker log:

Spark Command: java -cp /home/ubuntu/tachyon/client/target/tachyon-client-0.5.0-jar-with-dependencies.jar:::/home/ubuntu/spark/conf:/home/ubuntu/spark/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/ubuntu/spark/lib/datanucleus-api-jdo-3.2.1.jar:/home/ubuntu/spark/lib/datanucleus-rdbms-3.2.1.jar:/home/ubuntu/spark/lib/datanucleus-core-3.2.2.jar:/home/ubuntu/hadoop/conf/ -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://172.31.42.159:7070

Ramaraju Indukuri

Jul 27, 2014, 12:35:10 PM
to tachyo...@googlegroups.com, iram...@gmail.com
A couple more points.

I am using Spark 1.0.0 with Tachyon 0.5 (the site says I need to recompile, which I did not, since my spark-shell works fine). Could this be the issue? If so, why would the shell work while the standalone job does not?

I am using Scala and sbt for the standalone job. I checked the job against HDFS directly and it works, but it fails once I change the input file URL in the code to tachyon://ip:port/.

Regards,
Ram



On Saturday, July 26, 2014 9:01:57 PM UTC-4, Henry Saputra wrote:

Ramaraju Indukuri

Jul 27, 2014, 2:30:21 PM
to tachyo...@googlegroups.com, iram...@gmail.com
One more attempt...
val conf = new SparkConf()
  .set("fs.tachyon.impl", classOf[tachyon.hadoop.TFS].getName)
  .setJars(Seq(..., "/home/ubuntu/tachyon/client/target/tachyon-client-0.5.0-jar-with-dependencies.jar"))

Did not work either.
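One possible explanation for why that did not work (an educated guess based on how Spark 1.x builds its Hadoop Configuration): plain SparkConf keys are not forwarded to Hadoop; only keys carrying the spark.hadoop. prefix are copied in, with the prefix stripped. A sketch of that variant, in Java:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Spark copies any "spark.hadoop.*" key into the Hadoop Configuration with
    // the prefix stripped, so fs.tachyon.impl becomes visible to the
    // FileSystem lookup.
    SparkConf conf = new SparkConf()
        .setAppName("wordcount")
        .set("spark.hadoop.fs.tachyon.impl", "tachyon.hadoop.TFS");
    JavaSparkContext sc = new JavaSparkContext(conf);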

On Saturday, July 26, 2014 9:01:57 PM UTC-4, Henry Saputra wrote:

Haoyuan Li

Jul 29, 2014, 2:24:37 PM
to tachyo...@googlegroups.com, iram...@gmail.com
What's the error here?

Best,

Haoyuan

Ramaraju Indukuri

Jul 29, 2014, 2:47:41 PM
to tachyo...@googlegroups.com, iram...@gmail.com

Hi Haoyuan, the error is:

[error] (run-main-0) java.io.IOException: No FileSystem for scheme: tachyon

java.io.IOException: No FileSystem for scheme: tachyon

        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1443)

        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)

        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)

        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)

        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)

        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)

        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)

        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)

        at scala.Option.getOrElse(Option.scala:120)

        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)

        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)

        at scala.Option.getOrElse(Option.scala:120)

        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)

        at org.apache.spark.rdd.FilteredRDD.getPartitions(FilteredRDD.scala:29)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)

        at scala.Option.getOrElse(Option.scala:120)

        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)

        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1094)

        at org.apache.spark.rdd.RDD.count(RDD.scala:847)

        at hadooputil$.main(test.scala:24)

        at hadooputil.main(test.scala)

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

        at java.lang.reflect.Method.invoke(Method.java:606)

[trace] Stack trace suppressed: run last *:run for the full output.

Ramaraju Indukuri

Jul 29, 2014, 9:31:04 PM
to tachyo...@googlegroups.com
Finally found the solution. Just add the following:

    sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")

I am not sure why the Spark driver is not picking up core-site.xml through Hadoop, though. There might be a way to make it happen through the sbt configuration, but this is much simpler and IMHO should be documented on the website.
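For a standalone Java program like Albert's, the equivalent call (a sketch assuming the Spark 1.x Java API) would be:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Register the Tachyon FileSystem implementation programmatically, before
    // the first action that touches a tachyon:// path.
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("wordcount"));
    sc.hadoopConfiguration().set("fs.tachyon.impl", "tachyon.hadoop.TFS");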

Regards
Ram

Ramaraju Indukuri

Jul 29, 2014, 9:31:59 PM
to tachyo...@googlegroups.com
I do appreciate Henry and HY taking a look at this problem. Many thanks guys.

Regards
Ram

On Thursday, November 7, 2013 4:16:26 AM UTC-5, Albert Kwong wrote:

Henry Saputra

Jul 29, 2014, 9:48:40 PM
to Ramaraju Indukuri, tachyo...@googlegroups.com
Awesome!

It is odd that the Hadoop conf directory is not picked up, but I'm glad it works.

Please do send a pull request to update the website. This is useful information.

John Yost

Jan 4, 2015, 10:22:14 AM
to tachyo...@googlegroups.com
Great work!  This solves the problem I just encountered.  Well done and thanks for sharing your solution with the rest of us.

Best,

--John

Seung-Hwan Lim

Feb 2, 2015, 10:26:59 AM
to tachyo...@googlegroups.com
I'm using Spark 1.2.0 and am encountering the same problem.

I'm using Spark in my Python application.

   sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
AttributeError: 'SparkContext' object has no attribute 'hadoopConfiguration'

Bill Metangmo

Apr 21, 2015, 4:31:08 AM
to tachyo...@googlegroups.com
Maybe you could do: sc.hadoopConfiguration().set("fs.tachyon.impl", "tachyon.hadoop.TFS");

It works for me with Spark 1.2.1.

Calvin Jia

Apr 23, 2015, 2:48:54 PM
to tachyo...@googlegroups.com
Hi Bill, thanks for providing your solution!

Haoyuan Li

May 3, 2015, 9:58:09 PM
to tachyo...@googlegroups.com
It would be great to have this in Tachyon's documentation.

Best,

Haoyuan