Can't read Parquet with Spark 2.0

1,162 views

kiss.k...@gmail.com

unread,
Jul 29, 2016, 3:53:40 AM7/29/16
to Alluxio Users
alluxio1.1.1
hadoop2.7
spark2.0-hadoop2.7

I can create a table using the hdfs protocol: sqlContext.createExternalTable("tpc1.catalog_sales","hdfs://master1:9000/tpctest/catalog_sales","parquet")

but with alluxio I get this error:

scala> sqlContext.createExternalTable("tpc1.catalog_sales","alluxio://master1:9000/tpctest/catalog_sales","parquet")
16/07/29 15:17:22 WARN TaskSetManager: Lost task 15.0 in stage 5.0 (TID 51, slave1): java.io.IOException: Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://master1:9000/tpctest/catalog_sales/_common_metadata; isDirectory=false; length=3654; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:812)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://master1:9000/tpctest/catalog_sales/_common_metadata; isDirectory=false; length=3654; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:239)
at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.io.IOException
at alluxio.AbstractClient.checkVersion(AbstractClient.java:115)
at alluxio.AbstractClient.connect(AbstractClient.java:178)
at alluxio.AbstractClient.retryRPC(AbstractClient.java:325)
at alluxio.client.file.FileSystemMasterClient.getStatus(FileSystemMasterClient.java:185)
at alluxio.client.file.BaseFileSystem.getStatus(BaseFileSystem.java:175)
at alluxio.client.file.BaseFileSystem.getStatus(BaseFileSystem.java:167)
at alluxio.hadoop.HdfsFileInputStream.<init>(HdfsFileInputStream.java:89)
at alluxio.hadoop.AbstractFileSystem.open(AbstractFileSystem.java:519)
at alluxio.hadoop.FileSystem.open(FileSystem.java:25)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:406)
at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
... 5 more

Pei Sun

unread,
Jul 29, 2016, 11:05:05 AM7/29/16
to kiss.k...@gmail.com, Alluxio Users
It looks like getStatus failed somehow. Can you send me the full log? Also, can you tell me how you populated catalog_sales? Is it a bunch of directories or files? A screenshot would be good.

--
You received this message because you are subscribed to the Google Groups "Alluxio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to alluxio-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Pei Sun

Pei Sun

unread,
Jul 29, 2016, 11:07:04 AM7/29/16
to kiss.k...@gmail.com, Alluxio Users
Also, judging from your command, you are creating the table from HDFS instead of Alluxio?
--
Pei Sun

kiss.k...@gmail.com

unread,
Jul 31, 2016, 9:44:04 PM7/31/16
to Alluxio Users, kiss.k...@gmail.com
Thanks for your reply, Pei Sun. I populated the TPC-DS tables' data using https://github.com/databricks/spark-sql-perf.git. The data format is Parquet. I can create a table with: sqlContext.createExternalTable("tpc1.catalog_sales","hdfs://master1:9000/tpctest/catalog_sales","parquet"), but if I change hdfs to alluxio, I get the errors above.


On Friday, July 29, 2016 at 11:07:04 PM UTC+8, Pei Sun wrote:

Pei Sun

unread,
Aug 1, 2016, 12:16:31 PM8/1/16
to kiss.k...@gmail.com, Alluxio Users
Hi,
    I actually tried this recently and didn't encounter this problem. To reproduce what you did, can you share some sample code?

Pei

kevin

unread,
Aug 1, 2016, 9:34:59 PM8/1/16
to Pei Sun, Alluxio Users
Thank you.

To generate the data from spark-shell:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
  
import com.databricks.spark.sql.perf.tpcds.Tables
val tables = new Tables(sqlContext, "/home/dcos/tpcds-kit-master/tools", 1)
tables.genData("hdfs://master1:9000/tpctest", "parquet", true, false, false, false, false)

To create the tables from spark-shell:

sqlContext.sql("CREATE DATABASE tpc1")
sqlContext.sql("use tpc1")
sqlContext.createExternalTable("tpc1.call_center","hdfs://master1:9000/tpctest/call_center","parquet")  //success
sqlContext.sql("select count(1) from call_center").show
sqlContext.createExternalTable("tpc1.catalog_sales","alluxio://master1:9000/tpctest/catalog_sales","parquet")  //fail

kiss.k...@gmail.com

unread,
Aug 2, 2016, 10:44:17 PM8/2/16
to Alluxio Users
Hi, all:
This problem has been resolved. The final test was based on Alluxio 1.2 and Spark 2.0.

It turns out the earlier error was caused by using the wrong port in the Alluxio URI.
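For reference, the failing command reused the HDFS NameNode port (9000) in the alluxio:// URI, while the Alluxio master serves client RPCs on its own port, which is 19998 in a stock deployment (an assumption here; check alluxio.master.port if your alluxio-site.properties overrides it). A minimal sketch of the difference, using only java.net.URI since the real call needs a live cluster:

```scala
import java.net.URI

// Wrong: 9000 is the HDFS NameNode port carried over from the hdfs:// URI,
// so the Alluxio client's connect/version check fails with the IOException
// seen in the stack trace above.
val wrong = new URI("alluxio://master1:9000/tpctest/catalog_sales")

// Right, assuming the default Alluxio master RPC port (19998):
val right = new URI("alluxio://master1:19998/tpctest/catalog_sales")

// The table creation would then be (requires a live Spark + Alluxio cluster,
// not run here):
//   sqlContext.createExternalTable("tpc1.catalog_sales", right.toString, "parquet")

println(s"wrong port = ${wrong.getPort}, right port = ${right.getPort}")
```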

On Friday, July 29, 2016 at 3:53:40 PM UTC+8, kiss.k...@gmail.com wrote:

Pei Sun

unread,
Aug 2, 2016, 10:49:38 PM8/2/16
to kiss.k...@gmail.com, Alluxio Users
I am glad you have resolved your problem.




--
Pei Sun