Can't read Parquet with Spark 2.0

1,162 views

kiss.k...@gmail.com

unread,
Jul 29, 2016, 3:53:40 AM7/29/16
to Alluxio Users
alluxio1.1.1
hadoop2.7
spark2.0-hadoop2.7

I can create a table using the hdfs protocol: sqlContext.createExternalTable("tpc1.catalog_sales","hdfs://master1:9000/tpctest/catalog_sales","parquet")

but with alluxio I get this error:

scala> sqlContext.createExternalTable("tpc1.catalog_sales","alluxio://master1:9000/tpctest/catalog_sales","parquet")
16/07/29 15:17:22 WARN TaskSetManager: Lost task 15.0 in stage 5.0 (TID 51, slave1): java.io.IOException: Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://master1:9000/tpctest/catalog_sales/_common_metadata; isDirectory=false; length=3654; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:812)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://master1:9000/tpctest/catalog_sales/_common_metadata; isDirectory=false; length=3654; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:239)
at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.io.IOException
at alluxio.AbstractClient.checkVersion(AbstractClient.java:115)
at alluxio.AbstractClient.connect(AbstractClient.java:178)
at alluxio.AbstractClient.retryRPC(AbstractClient.java:325)
at alluxio.client.file.FileSystemMasterClient.getStatus(FileSystemMasterClient.java:185)
at alluxio.client.file.BaseFileSystem.getStatus(BaseFileSystem.java:175)
at alluxio.client.file.BaseFileSystem.getStatus(BaseFileSystem.java:167)
at alluxio.hadoop.HdfsFileInputStream.<init>(HdfsFileInputStream.java:89)
at alluxio.hadoop.AbstractFileSystem.open(AbstractFileSystem.java:519)
at alluxio.hadoop.FileSystem.open(FileSystem.java:25)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:406)
at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
... 5 more

Pei Sun

unread,
Jul 29, 2016, 11:05:05 AM7/29/16
to kiss.k...@gmail.com, Alluxio Users
It looks like getStatus failed somehow. Can you send me the full log? Also, can you tell me how you populated catalog_sales? Is it a bunch of directories or files? A screenshot would be good.

--
You received this message because you are subscribed to the Google Groups "Alluxio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to alluxio-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Pei Sun

Pei Sun

unread,
Jul 29, 2016, 11:07:04 AM7/29/16
to kiss.k...@gmail.com, Alluxio Users
Also, judging from your command, you are creating the table from HDFS instead of Alluxio?
--
Pei Sun

kiss.k...@gmail.com

unread,
Jul 31, 2016, 9:44:04 PM7/31/16
to Alluxio Users, kiss.k...@gmail.com
Thanks for your reply, Pei Sun. I populated the TPC-DS tables' data using https://github.com/databricks/spark-sql-perf.git. The data format is Parquet. I can create a table with: sqlContext.createExternalTable("tpc1.catalog_sales","hdfs://master1:9000/tpctest/catalog_sales","parquet"), but if I change hdfs to alluxio, I get the errors above.


On Friday, July 29, 2016 at 11:07:04 PM UTC+8, Pei Sun wrote:

Pei Sun

unread,
Aug 1, 2016, 12:16:31 PM8/1/16
to kiss.k...@gmail.com, Alluxio Users
Hi,
    I actually tried this recently and didn't encounter this problem. To reproduce what you did, can you share some sample code?

Pei

kevin

unread,
Aug 1, 2016, 9:34:59 PM8/1/16
to Pei Sun, Alluxio Users
Thank you.

To generate the data from spark-shell:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
  
import com.databricks.spark.sql.perf.tpcds.Tables
val tables = new Tables(sqlContext, "/home/dcos/tpcds-kit-master/tools", 1)
tables.genData("hdfs://master1:9000/tpctest", "parquet", true, false, false, false, false)

To create the tables from spark-shell:

sqlContext.sql("CREATE DATABASE tpc1")
sqlContext.sql("use tpc1")
sqlContext.createExternalTable("tpc1.call_center","hdfs://master1:9000/tpctest/call_center","parquet")  //success
sqlContext.sql("select count(1) from call_center").show
sqlContext.createExternalTable("tpc1.catalog_sales","alluxio://master1:9000/tpctest/catalog_sales","parquet")  //fail

kiss.k...@gmail.com

unread,
Aug 2, 2016, 10:44:17 PM8/2/16
to Alluxio Users
Hi, all:
This problem has been resolved. The final test was based on Alluxio 1.2 and Spark 2.0.

It turns out the earlier error was caused by using the wrong port in the Alluxio URI.
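For reference, the failing command reused the HDFS NameNode port (9000) in the alluxio:// URI, while the Alluxio master serves client RPCs on its own port, which is 19998 in a stock deployment (an assumption here; check alluxio.master.port if your alluxio-site.properties overrides it). A minimal sketch of the difference, using only java.net.URI since the real call needs a live cluster:

```scala
import java.net.URI

// Wrong: 9000 is the HDFS NameNode port carried over from the hdfs:// URI,
// so the Alluxio client's connect/version check fails with the IOException
// seen in the stack trace above.
val wrong = new URI("alluxio://master1:9000/tpctest/catalog_sales")

// Right, assuming the default Alluxio master RPC port (19998):
val right = new URI("alluxio://master1:19998/tpctest/catalog_sales")

// The table creation would then be (requires a live Spark + Alluxio cluster,
// not run here):
//   sqlContext.createExternalTable("tpc1.catalog_sales", right.toString, "parquet")

println(s"wrong port = ${wrong.getPort}, right port = ${right.getPort}")
```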

On Friday, July 29, 2016 at 3:53:40 PM UTC+8, kiss.k...@gmail.com wrote:

Pei Sun

unread,
Aug 2, 2016, 10:49:38 PM8/2/16
to kiss.k...@gmail.com, Alluxio Users
I am glad you have resolved your problem.




--
Pei Sun