Spark could not read footer for file parquet error


José Raúl Pérez Rodríguez

Jun 27, 2019, 8:44:55 AM
to Alluxio Users
Hi, 

I am using Alluxio 2.0.0 and trying to read a Parquet file for testing, like:

sparkSession.read.parquet("alluxio://[master]:19998/store_sales/")

I got the following error in Spark (no logs appeared on the master or worker nodes). Note that the seek position in the root cause below, 436207616, is far past the file length of 119342549:

Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=alluxio://[master]:19998/store_sales/part-00000-e7e8c7ca-84ca-4ea4-b3b3-ad8164ca0525-c000.snappy.parquet; isDirectory=false; length=119342549; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:498)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:485)
	at scala.collection.parallel.AugmentedIterableIterator$class.flatmap2combiner(RemainsIterator.scala:132)
	at scala.collection.parallel.immutable.ParVector$ParVectorIterator.flatmap2combiner(ParVector.scala:62)
	at scala.collection.parallel.ParIterableLike$FlatMap.leaf(ParIterableLike.scala:1072)
	at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
	at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
	at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
	at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
	at scala.collection.parallel.ParIterableLike$FlatMap.tryLeaf(ParIterableLike.scala:1068)
	at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
	at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
	at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinTask.doJoin(ForkJoinTask.java:341)
	at scala.concurrent.forkjoin.ForkJoinTask.join(ForkJoinTask.java:673)
	at scala.collection.parallel.ForkJoinTasks$WrappedTask$class.sync(Tasks.scala:378)
	at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync(Tasks.scala:443)
	at scala.collection.parallel.ForkJoinTasks$class.executeAndWaitResult(Tasks.scala:426)
	at scala.collection.parallel.ForkJoinTaskSupport.executeAndWaitResult(TaskSupport.scala:56)
	at scala.collection.parallel.ParIterableLike$ResultMapping.leaf(ParIterableLike.scala:958)
	at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
	at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
	at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
	at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
	at scala.collection.parallel.ParIterableLike$ResultMapping.tryLeaf(ParIterableLike.scala:953)
	at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
	at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
	at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.IllegalArgumentException: Seek position past the end of the read region (block or file). [436207616]
	at alluxio.core.client.runtime.com.google.common.base.Preconditions.checkArgument(Preconditions.java:202)
	at alluxio.client.block.stream.BlockInStream.seek(BlockInStream.java:316)
	at alluxio.client.file.FileInStream.updateStream(FileInStream.java:313)
	at alluxio.client.file.FileInStream.read(FileInStream.java:126)
	at alluxio.hadoop.HdfsFileInputStream.read(HdfsFileInputStream.java:98)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.apache.parquet.hadoop.util.H1SeekableInputStream.read(H1SeekableInputStream.java:60)
	at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:67)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:472)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:445)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:421)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:491)
	... 32 more
Thanks, 

José Raúl Pérez Rodríguez

Jun 27, 2019, 2:24:19 PM
to Bin Fan, Alluxio Users
Hi Bin, 

I executed your command and here is the response:

119342549       PERSISTED 06-25-2019 16:23:43:640   0% /store_sales/part-00000-e7e8c7ca-84ca-4ea4-b3b3-ad8164ca0525-c000.snappy.parquet

I have also checked that the Spark query works fine when using HDFS directly as the data source; what I am doing is a simple sparkSession.read.parquet("[path]").count.
The client library is in the same jar as the Spark job, so I think it is available to both the driver and the executors.

I have read about this problem in another (Spark-related) post, but the reasons it can happen there do not apply in this case, so I have no idea why a wrong offset is being generated when reading the file.
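
For reference, the comparison is just the following (a minimal sketch; host names and ports are placeholders):

// read the same table through Alluxio and directly from HDFS
val viaAlluxio = sparkSession.read.parquet("alluxio://[master]:19998/store_sales/").count()       // fails with the footer error
val viaHdfs    = sparkSession.read.parquet("hdfs://[namenode]:8020/alluxio/store_sales/").count() // works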

Thanks,

Best Regards




On Thu, Jun 27, 2019 at 8:09 PM Bin Fan (<bin...@alluxio.com>) wrote:
Hi José,

It looks to me like this file alluxio://[master]:19998/store_sales/part-00000-e7e8c7ca-84ca-4ea4-b3b3-ad8164ca0525-c000.snappy.parquet
might be corrupted for some reason.
Can you compare its length as reported by Alluxio, by running

bin/alluxio fs ls /store_sales/part-00000-e7e8c7ca-84ca-4ea4-b3b3-ad8164ca0525-c000.snappy.parquet

and also check its length in the original data source?

- Bin




Bin Fan

Jun 27, 2019, 2:27:56 PM
to José Raúl Pérez Rodríguez, Alluxio Users
Thanks José,

Can you check the length of this file in HDFS too? I want to make sure Alluxio "remembers" the correct version of this file.

You can run

hdfs dfs -ls /path/to/your/part-00000-e7e8c7ca-84ca-4ea4-b3b3-ad8164ca0525-c000.snappy.parquet

to check the file length.

José Raúl Pérez Rodríguez

Jun 27, 2019, 4:37:29 PM
to Bin Fan, Alluxio Users
Hi Bin, it seems okay; it shows the same length:

-rw-r--r--   3 hdfs supergroup  119342549 2019-06-25 18:23 /alluxio/store_sales/part-00000-e7e8c7ca-84ca-4ea4-b3b3-ad8164ca0525-c000.snappy.parquet

Thanks, 

Best Regards

Bin Fan

Jun 27, 2019, 4:41:32 PM
to José Raúl Pérez Rodríguez, Alluxio Users
In this case, can you run 

bin/alluxio fs free /store_sales/part-00000-e7e8c7ca-84ca-4ea4-b3b3-ad8164ca0525-c000.snappy.parquet

wait a few minutes, and re-run the Spark query?
This free command will evict the data blocks from Alluxio space, so your next query will load them from HDFS again.
Let's see if the same error happens again.

- Bin

José Raúl Pérez Rodríguez

Jun 27, 2019, 5:21:29 PM
to Bin Fan, Alluxio Users
Hi Bin, the same result.

I don't know if it is normal that replication and blocksize are 0 here.

Could not read footer for file: FileStatus{path=alluxio://[host]:19998/store_sales/part-00000-e7e8c7ca-84ca-4ea4-b3b3-ad8164ca0525-c000.snappy.parquet; isDirectory=false; length=119342549; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}

Thanks, 

Best Regards

Bin Fan

Jun 27, 2019, 5:29:16 PM
to José Raúl Pérez Rodríguez, Alluxio Users
Can you try 

bin/alluxio fs rm --alluxioOnly /store_sales/part-00000-e7e8c7ca-84ca-4ea4-b3b3-ad8164ca0525-c000.snappy.parquet

which will remove the metadata of this file in Alluxio (without touching the file in HDFS)?
Then run your query again?

On Thu, Jun 27, 2019 at 2:21 PM José Raúl Pérez Rodríguez <jrp...@stratio.com> wrote:
Hi Bin, the same result.

I don't know if it is normal that replication and blocksize are 0 here.

It simply means that nothing has been read into Alluxio space yet.
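
If you want to inspect those fields directly, something like this in spark-shell shows what the Alluxio Hadoop client reports (a sketch, using the same path as in the error):

// print the FileStatus fields Spark sees through the Alluxio Hadoop client
import org.apache.hadoop.fs.{FileSystem, Path}
val p = new Path("alluxio://[master]:19998/store_sales/part-00000-e7e8c7ca-84ca-4ea4-b3b3-ad8164ca0525-c000.snappy.parquet")
val fs = p.getFileSystem(spark.sparkContext.hadoopConfiguration)
val st = fs.getFileStatus(p)
println(s"length=${st.getLen} replication=${st.getReplication} blockSize=${st.getBlockSize}")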

José Raúl Pérez Rodríguez

Jun 27, 2019, 5:43:04 PM
to Bin Fan, Alluxio Users
Hi Bin, the same result.

It could be a permissions issue, because I was having trouble passing the tests due to Kerberos authentication problems. I solved that by forcing Kerberos authentication in the alluxio.underfs.hdfs.HdfsUnderFileSystem constructor, using UserGroupInformation.loginUserFromKeytab, and performing this just once, at class instantiation. It worked for me; it was the only way I found to get past the Kerberos authentication issue in the tests.
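
In spirit, the workaround is the following (a Scala sketch of what the attached Java change does; the principal and keytab path are hypothetical placeholders):

// log in from a keytab exactly once per JVM, at first instantiation
import org.apache.hadoop.security.UserGroupInformation

object KerberosLoginOnce {
  // the lazy val ensures loginUserFromKeytab runs only on first access
  lazy val login: Unit = UserGroupInformation.loginUserFromKeytab(
    "hdfs/_HOST@EXAMPLE.COM",              // hypothetical Kerberos principal
    "/etc/security/keytabs/hdfs.keytab")   // hypothetical keytab path
}

// in the constructor: KerberosLoginOnce.login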

Thanks, 

Best Regards
(Attachment: HdfsUnderFileSystem.java)

Jiacheng Liu

Jul 30, 2019, 9:19:37 PM
to Alluxio Users
Hi José,

Sorry for the delayed reply. Can you please try reading directly from HDFS, without going through Alluxio, as below?

sparkSession.read.parquet("hdfs://[master]:8020/{hdfs path}/store_sales/")

I see you created another ticket, "Kerberos error on some tests on runTests command", on the same day. I assume you have set up the Alluxio configuration for connecting to secure HDFS correctly for this test as well, so you shouldn't have to call UserGroupInformation.loginUserFromKeytab() manually in the constructor. We are trying to reproduce the error and work on a fix. Meanwhile, it would be great if you could confirm that reading the parquet file directly from HDFS works. Thanks!

Best regards,
Jiacheng

On Thursday, June 27, 2019 at 2:43:04 PM UTC-7, José Raúl Pérez Rodríguez wrote:

Jiacheng Liu

Jul 31, 2019, 2:11:13 PM
to Alluxio Users
Hi José,

I wasn't able to reproduce this issue. Here's what I did:
1. Connect Alluxio to secure HDFS (Alluxio version 2.0.0-RC4), following the guide on connecting Alluxio to secure HDFS.
2. Start the Spark shell, specifying the principal and keytab: spark-shell --principal hdfs/all...@ALLUXIO.COM --keytab {path_to_keytab}/alluxio.keytab
3. Read and write parquet files from and to both Alluxio and HDFS. I was able to do both without running into the exception in Alluxio's BlockInStream.
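
For reference, step 3 was along these lines in spark-shell (a sketch; paths, host names, and ports are placeholders):

// read a parquet file from HDFS, write it into Alluxio, and read it back
val df = spark.read.parquet("hdfs://[namenode]:8020/tmp/test_parquet")
df.write.mode("overwrite").parquet("alluxio://[master]:19998/tmp/test_parquet")
spark.read.parquet("alluxio://[master]:19998/tmp/test_parquet").count()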

This exception looks more related to the format of this particular parquet file. You are already reading the file, so Alluxio must have connected to the Kerberized HDFS and logged in successfully. May I know how this parquet file was generated?

Thanks,
Jiacheng

Jiacheng Liu

Sep 1, 2019, 2:44:19 PM
to Alluxio Users
Hi José,

We are closing this case since there's not enough information. If you have more findings, please feel free to re-open this via GitHub issue https://github.com/Alluxio/alluxio/issues/9398. Thanks!

Best,
Jiacheng

On Tue, Jul 30, 2019 at 6:19 PM Jiacheng Liu <jiach...@gmail.com> wrote: