How to view and "accept malformed JSON" with CDAP 4.2 Wrangler in a Realtime Pipeline


aremash

Jul 13, 2017, 11:50:21 AM
to CDAP User
The Wrangler in CDAP 4.2 has nice new features and is much improved over version 4.1.  I'm using it to "wrangle" a JSON file, which appeared to be fine in the Data Preparation screen, but when running it in a Realtime Pipeline, it produces the following error:

ERROR

co.cask.wrangler.Wrangler#263 Error threshold reached '1' : com.google.gson.stream.MalformedJsonException: Use JsonReader.setLenient(true) to accept malformed JSON at line 1 column 14

After increasing the error threshold, it complains about later columns. 

Question 1:
How can I view the columns of my JSON again?  The tab for my JSON file disappeared from the Data Preparation screen (now it has only a CSV file I'm wrangling in a Batch Pipeline).  And clicking on the Wrangle button in the Wrangler Properties shows no data, just prompts to "select or upload a file" even though I already did this before.  Is it no longer visible because I set the File Path (in the File Source that feeds the Wrangler) to be a directory instead of a single file? 

Question 2:
How can I "Use JsonReader.setLenient(true) to accept malformed JSON," please?


Edwin Elia

Jul 13, 2017, 6:34:08 PM
to CDAP User
Hi Aremash,

Regarding your question number 1: did the JSON and CSV files have the same filename? Also, were you using the Upload feature instead of the File Browser? If so, the upload uses the filename as the tab's unique identifier, so it will overwrite the existing tab if the filename is the same.

For your question number 2, there is currently no way to set JsonReader.setLenient(true). Can I ask more about your use case? Do the data you uploaded and the data you are trying to consume from a directory have the same format? Does the JSON file contain one single JSON record or multiple records? Our parse-as-json directive assumes that each record in the column is one single JSON object.
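For example (field names invented for illustration), parse-as-json expects each incoming record to hold one complete JSON object, like:

{"id": 1, "name": "alpha"}

A record that carries only a fragment of an object or array, such as a single line of a pretty-printed file, would fail with a MalformedJsonException like the one you are seeing.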

Best,
Edwin Elia

aremash

Jul 14, 2017, 9:59:19 AM
to CDAP User
Hi Edwin,
Thanks for your reply.

1.  No, my JSON & CSV files have different names and are in different local directories.  I guess I might have to go through the wrangling again, which is a pain because I had >100 Directives.  I know I can copy & paste Directives between the Wrangler Properties screens once the Wranglers are established in the Pipeline, but I don't see a way to do that when starting with Data Preparation.  Data Preparation has a button to download/export the Directives (as a text file) but no way to upload/import such a file.  Would Cask please add this feature?

2.  Yes, the data I uploaded and the data I am trying to consume from a directory have the same format.  For my testing so far (I'm using the Sandbox version of CDAP 4.2 in Docker), I'm using the exact same file.  My JSON file is actually one big array, and I want to parse each element as a separate record.  Additionally, each name-value pair is on its own line in the file.  So it isn't in the one-object-per-record layout the parser seems to expect, but I got the Wrangler to shape it the way I want--though only in the Data Preparation view (before it disappeared); I get the error when trying to Deploy & Run, and I don't see a way to troubleshoot or debug it.  So I guess I'll try starting over....
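Schematically (with invented field names), the file looks something like this:

[
  {
    "fieldA": "value1",
    "fieldB": "value2"
  },
  {
    "fieldA": "value3"
  }
]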

aremash

Jul 14, 2017, 1:30:52 PM
to CDAP User
I created a copy of my JSON file (same data, different file name) and went through Data Preparation again with fewer Directives, resulting in a slightly different output schema.  After Deploying & Running on that same file, it remained in Data Preparation, so that's good.  But I still get the same error:


Error threshold reached '1' : com.google.gson.stream.MalformedJsonException: Use JsonReader.setLenient(true) to accept malformed JSON at line 1 column 14

This time I can see what "column 14" is in Data Preparation: it's a JSON field that is not present in the first record but is in some later records.  During Data Preparation, it just shows as null/empty in records like that, which is what I want.  But when Deploying & Running, it apparently doesn't like that the field doesn't exist (at least not in the first record).  FYI: the Null check box is checked for this field in the Output Schema, its Type is String, and I don't see anything invalid about the field name or value when it appears in later records.  What can be done to get the Wrangler to work on sparse data like this, please?
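If it helps, my understanding is that checking the Null box just makes the field nullable in the output schema, i.e. an Avro-style union along these lines (field name invented):

{ "name": "fieldB", "type": [ "string", "null" ] }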


aremash

Jul 17, 2017, 4:04:20 PM
to CDAP User
I used Data Preparation on a different JSON file, which is similar but has a different schema, and I got the exact same error again:


ERROR

co.cask.wrangler.Wrangler#263 Error threshold reached '1' : com.google.gson.stream.MalformedJsonException: Use JsonReader.setLenient(true) to accept malformed JSON at line 1 column 14

What does the #263 refer to?

Column 14 in the Wrangler is always populated in this file (unlike in my previous file).  But does this error refer to a column number in the Wrangler output, a column in the raw input, or something at an intermediate point in the processing?

In both JSON files/Pipelines, I'm actually Filtering out the original line 1 (which has 15 columns populated, all of which are removed).  I don't see anything invalid or unusual about that JSON format either. 

How can this be debugged, please?





Edwin Elia

Jul 17, 2017, 4:21:02 PM
to CDAP User
Hi Aremash,

The "column 14" in that error message refers to the character at index 14, not to a data column.  Do you have scrubbed data that you can send us so we can debug it on our end?
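For example, if the raw text of a record were:

{"a": 1, bad: 2}

Gson would report something like "malformed JSON at line 1 column 10", because the unquoted name starts at the 10th character; it is not pointing at a 10th field.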

Best,
Edwin Elia

aremash

Jul 18, 2017, 11:10:26 AM
to CDAP User
For the record, I sent sample data to Edwin in a private message yesterday.  Once there's a conclusion, we'll reply here with it for everyone.

aremash

Jul 20, 2017, 7:21:09 PM
to CDAP User
FYI:  Edwin determined "The problem is that the File source in Pipeline will send one record per line, therefore it is throwing that error that the JSON is malformed. We have to modify the file source to accept record delimiter."
I don't see a way to do this in a Realtime Pipeline, but I found a way to do it in a Batch Pipeline, which I have working now.

In the File Properties Configuration, enter this in the File System Properties:


{
  "textinputformat.record.delimiter": "`"
}
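As I understand it, this is the standard Hadoop property that controls what TextInputFormat treats as the record boundary.  Because the backtick never appears in my file, the whole file arrives as a single record, which the Wrangler can then parse as one JSON document.  Programmatically, it would be roughly equivalent to this sketch (class name invented):

import org.apache.hadoop.conf.Configuration;

public class DelimiterSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Records now end at "`" instead of "\n"; a delimiter that never
        // occurs in the data makes the entire file arrive as one record.
        conf.set("textinputformat.record.delimiter", "`");
    }
}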