Cascading has issues reading parquet data generated by Spark


vg1234

Nov 9, 2017, 6:39:01 PM11/9/17
to cascading-user
When I try to read Parquet data that was generated by Spark in Cascading, it throws the following error:



Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file ""
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.<init>(DeprecatedParquetInputFormat.java:103)
at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat.getRecordReader(DeprecatedParquetInputFormat.java:47)
at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
at cascading.util.Util.retry(Util.java:1044)
at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.elementData(ArrayList.java:418)
at java.util.ArrayList.get(ArrayList.java:431)
at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
at org.apache.parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:83)
at org.apache.parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:77)
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:293)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)

This is mostly seen when the Parquet data contains nested structures.

I have seen similar issues in Spark, but those were handled within Spark itself.

I'm not sure whether this is related to the Parquet version or whether Cascading has a bug.
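One possibility worth checking (an assumption, not confirmed in this thread): older and newer Spark versions encode nested lists/maps in Parquet differently (a legacy 2-level layout vs. the standard 3-level layout from the Parquet spec), and some parquet-mr readers mishandle one of the two. If that is the cause, rewriting the data with Spark's legacy-format flag toggled may produce files the Cascading reader can consume. A sketch, assuming Spark 2.x; the DataFrame `df` and the paths are hypothetical placeholders:

```scala
// Sketch only: rewrite the data so nested types use Spark's legacy
// (2-level) Parquet layout instead of the standard 3-level one.
// Whether true or false helps depends on which layout the reader expects.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

val df = spark.read.parquet("/data/in")   // original Spark-written data (placeholder path)
df.write.parquet("/data/out-legacy")      // rewritten with the legacy nested layout
```

Comparing the schemas of a file that reads correctly against one that fails (e.g. with `parquet-tools schema`) would confirm whether the nested-list encoding is actually the difference.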



Chris K Wensel

Nov 10, 2017, 12:42:51 PM11/10/17
to cascadi...@googlegroups.com
You might ping the Parquet mailing list; this doesn't look like a Cascading issue. You might also verify the Parquet versions used in Spark and in your Cascading app.

ckw




vg1234

Nov 14, 2017, 12:54:11 PM11/14/17
to cascading-user
The Parquet version is the same (1.7.0) in both the Spark and Cascading apps.

Vikas Gandham