JSON File Samples and Schema Detection


annelise....@gmail.com

Aug 9, 2018, 9:01:09 AM
to Kylo Community
Hello,
I'm trying to upload a JSON file while setting up a feed, to detect the schema of the table. I've tried with both sample JSON files from the Kylo GitHub and with files of our own, and I get the two errors below in the Kylo server logs. Any suggestions?

(1)

2018-08-07 20:49:56 INFO  http-nio-8420-exec-5:SparkFileSchemaParserService:206 - Created temporary file file:/tmp/kylo-spark-parser874516531883217874.dat success? true
2018-08-07 20:49:56 INFO  http-nio-8420-exec-5:SparkFileSchemaParserService:137 - Script sqlContext.read.json("file:///tmp/kylo-spark-parser874516531883217874.dat").limit(10).toDF()
2018-08-07 20:49:56 ERROR http-nio-8420-exec-5:SparkFileSchemaParserService:101 - Error parsing file JSON: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 1 times, most recent failure: Lost task 0.0 in stage 79.0 (TID 59, localhost, executor driver): java.io.FileNotFoundException: /tmp/blockmgr-2207eb29-af1c-4516-8069-c8c02a7c30f2/32/temp_shuffle_1eb6b692-8499-4f2f-802f-57747f445764 (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:


(2)

2018-08-07 20:50:37 INFO  http-nio-8420-exec-8:SparkFileSchemaParserService:206 - Created temporary file file:/tmp/kylo-spark-parser4221910046203569500.dat success? true
2018-08-07 20:50:37 INFO  http-nio-8420-exec-8:SparkFileSchemaParserService:137 - Script sqlContext.read.json("file:///tmp/kylo-spark-parser4221910046203569500.dat").limit(10).toDF()
2018-08-07 20:50:37 ERROR http-nio-8420-exec-8:SparkFileSchemaParserService:101 - Error parsing file JSON: java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
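[Editor's note: the AnalysisException above typically surfaces when every sampled row failed to parse, so the only column Spark can infer is _corrupt_record. A common cause is a pretty-printed JSON document instead of the line-delimited JSON (one object per line) that Spark's json() reader expects by default. A minimal Python sketch of the distinction, using hypothetical sample strings rather than the actual contents of books2.json:]

```python
import json

# JSON Lines: every line is a complete JSON object.
json_lines = '{"title": "A"}\n{"title": "B"}\n'

# Pretty-printed array: valid JSON as a whole file, but no individual
# line is parseable on its own.
pretty = '[\n  {\n    "title": "A"\n  },\n  {\n    "title": "B"\n  }\n]\n'

def parse_as_json_lines(text):
    """Parse each line independently, mimicking a line-based JSON reader."""
    records, corrupt = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            corrupt.append(line)  # Spark would route these to _corrupt_record
    return records, corrupt

recs, bad = parse_as_json_lines(json_lines)
print(len(recs), len(bad))   # 2 0 -> two records, nothing corrupt

recs, bad = parse_as_json_lines(pretty)
print(len(recs), len(bad))   # 0 8 -> every line lands in _corrupt_record
```

In Spark 2.2+ the multiLine option (spark.read.option("multiLine", true).json(...)) can read such pretty-printed files, though Kylo's generated script shown in the log does not set it.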

Scott Reisdorf

Aug 9, 2018, 11:36:31 AM
to Kylo Community
Can you share the JSON file you are trying to use?

Just so I understand the problem correctly, you are using the "Data Ingest" template to create a feed and get this error when you upload the JSON file to help create the schema for the target Hive table?

annelise....@gmail.com

Aug 9, 2018, 11:44:57 AM
to Kylo Community
I've attached the one from the Kylo GitHub page. I can't attach the other one we were testing with, though.

We're using the S3 Ingest template, but yes, we get the error when we upload the JSON file to help create the schema for the target Hive table. The Kylo UI just displays an error about the file format; the errors posted above are from the kylo-services log.
books2.json

Greg Hart

Aug 9, 2018, 11:55:53 AM
to Kylo Community
Hi,

It looks like part of the error message was cut off. Could you please try again and attach your entire /var/log/kylo-services/kylo-services.log and /var/log/kylo-services/kylo-spark-shell.log files?

annelise....@gmail.com

Aug 9, 2018, 12:09:15 PM
to Kylo Community
Attached.
kylo-services.log

Greg Hart

Aug 9, 2018, 12:52:41 PM
to Kylo Community
Hi,

This error is typically caused when files are deleted from /tmp/ while Kylo is running. You can try killing the Spark processes with 'pkill -e -f SparkShellApp' and they will be restarted the next time you upload a sample file. Alternatively, you can restart kylo-services and that will also restart the Spark processes.
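[Editor's note: the recovery steps above as a short shell sketch, assuming a default Kylo install and run on the Kylo host; the commands are guarded with || true so they are safe to paste even where the processes aren't running:]

```shell
# Kill the Spark shell processes Kylo manages; Kylo relaunches them the
# next time a sample file is uploaded. '-e' echoes what was killed.
pkill -e -f SparkShellApp || true

# Alternatively, restart kylo-services, which also restarts the Spark
# processes (systemd hosts would use 'systemctl restart kylo-services').
service kylo-services restart || true
```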

Jagrut Sharma

Aug 9, 2018, 12:56:23 PM
to Kylo Community
Hi Annelise - Try with the attached file (books.json) and see if you still get the same issue.

Thanks.
--
Jagrut
books.json

annelise....@gmail.com

Aug 9, 2018, 3:02:44 PM
to Kylo Community
Hi Greg,
After restarting kylo-services, I'm able to upload the JSON file to create the schema. Thank you! 

annelise....@gmail.com

Aug 9, 2018, 3:03:43 PM
to Kylo Community
Hi Jagrut, 
This file worked for me both before and after I restarted kylo-services. Some other valid JSON files didn't work before the restart, but they do now. Thanks!