Continuing the discussion from the forum. I do not have a valid log file for the sample project I created, but I am attaching the code. I am probably doing something wrong in the way I am generating the sample log, and I will recheck that. The error I get with the actual project indicates that the file is missing the Parquet magic number, i.e. it is not a Parquet file; details below.
If you can spot an issue in the way I am coding this, that would help. Apart from the file and the schema, it is basically the same code. The sample log is not Parquet-encoded.
I mainly want to know whether Parquet is a mandatory encoding for this library. I ask because Elephant Bird seems to support non-Parquet files, while
https://github.com/saurfang/sparksql-protobuf does not seem to support non-Parquet input.
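For reference, here is a minimal sketch of the kind of read that produces the stack trace below. This is an assumption about the rough shape of my code (the actual attached sample is not reproduced here); the object name, master setting, and path are illustrative. It simply loads the raw protobuf log through Spark SQL's Parquet data source, which checks for the "PAR1" magic bytes at the tail of the file and throws the "not a Parquet file" error when they are absent.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReadProtoLog {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("proto-parquet-check").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // This is the call that fails: allproto.log is plain protobuf output,
    // not Parquet, so the footer/magic-number check below cannot succeed.
    val df = sqlContext.read.parquet("file:/Users/skoppar/workspace/pyspark-beacon/stream/allproto.log")
    df.printSchema()

    sc.stop()
  }
}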
java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/Users/skoppar/workspace/pyspark-beacon/stream/allproto.log is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [55, 73, 67, 10]
at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:812)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: file:/Users/skoppar/workspace/pyspark-beacon/stream/allproto.log is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [55, 73, 67, 10]