When I tried reading Parquet data that was generated by Spark from a Cascading job, it threw the following error:
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file ""
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.<init>(DeprecatedParquetInputFormat.java:103)
at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat.getRecordReader(DeprecatedParquetInputFormat.java:47)
at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
at cascading.util.Util.retry(Util.java:1044)
at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.elementData(ArrayList.java:418)
at java.util.ArrayList.get(ArrayList.java:431)
at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
at org.apache.parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:83)
at org.apache.parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:77)
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:293)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
This mostly happens when the Parquet file contains nested structures.
I have seen similar issues in Spark, but those were handled by Spark itself.
I am not sure whether this is related to the Parquet version or whether it is a bug in Cascading.
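For reference, a minimal sketch of the kind of Spark job that produces the data I am trying to read. The schema and output path here are illustrative, not my actual job; the point is only that the rows contain a nested struct:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative nested schema -- not the actual production schema.
case class Address(city: String, zip: String)
case class Person(name: String, address: Address)

object WriteNestedParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nested-parquet-repro")
      .getOrCreate()
    import spark.implicits._

    // Write rows containing a nested struct; flat schemas read fine from
    // Cascading, but files like this one trigger the exception above.
    Seq(Person("a", Address("x", "1")), Person("b", Address("y", "2")))
      .toDS()
      .write
      .parquet("hdfs:///tmp/nested-parquet-repro") // hypothetical path

    spark.stop()
  }
}
```

Spark itself reads this output back without any problem; the failure only appears when the same path is read through Cascading's Parquet tap (which goes through `DeprecatedParquetInputFormat`, as the stack trace shows).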