PySpark input path errors


t...@vindicotech.com

Jan 18, 2016, 4:42:57 PM
to Google Cloud Dataproc Discussions
Hi,

Every time I try to load a text file with PySpark, I get the error below, even when I specify the full path and the path exists. How can I resolve this?

sc.textFile("/test/test.csv").count()

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: /test/test.txt
        at org.apache.hadoop.mapred.LocatedFileStatusFetcher.getFileStatuses(LocatedFileStatusFetcher.java:155)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:237)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:908)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)


linuxa...@gmail.com

Jan 19, 2016, 12:51:51 AM
to Google Cloud Dataproc Discussions
Hi there,

The error itself says that the file you are trying to access does not exist. If you saved your data as a text file and list the output directory, you will see that it was written as five partition files of type text:
ls -l  /test/
part-00000  part-00001  part-00002  part-00003  part-00004 
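
If that is what happened, point textFile at the output directory (or a glob) rather than at a single file. A minimal sketch, assuming a running pyspark shell where sc is already defined; /test/output is a hypothetical path:

rdd = sc.parallelize(["a", "b", "c"])
# saveAsTextFile writes a directory of part-* files, not a single file
rdd.saveAsTextFile("/test/output")
# textFile accepts a directory or a glob and reads all the part files in it
sc.textFile("/test/output").count()         # returns 3
sc.textFile("/test/output/part-*").count()  # also returns 3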

Please provide more details.

Dennis Huo

Jan 19, 2016, 1:45:25 PM
to Google Cloud Dataproc Discussions
What do you see on the cluster if you run "hadoop fs -ls -R /test"?
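
Also note that a schemeless path such as /test/test.csv resolves against the cluster's default filesystem, which on Dataproc is HDFS, not the local disk of the machine you are typing on. A minimal sketch with hypothetical paths, assuming a pyspark shell on the cluster:

sc.textFile("/test/test.csv")           # default filesystem (HDFS on Dataproc)
sc.textFile("file:///tmp/test.csv")     # local filesystem on the cluster nodes
sc.textFile("gs://my-bucket/test.csv")  # Google Cloud Storage (hypothetical bucket)

If the file only exists on your local machine, copy it into HDFS first (e.g. with hadoop fs -put) before reading it from Spark.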