[JIRA] (PLUGIN-643) BigQuery source: fix handling of nested record with same name as parent record

Prashant Jaikumar (Jira)

Mar 16, 2021, 4:06:48 PM

to cdap...@googlegroups.com

Prashant Jaikumar created an issue

CDAP Plugins / PLUGIN-643

BigQuery source: fix handling of nested record with same name as parent record

Issue Type:	Bug
Assignee:	Unassigned
Attachments:	bq_schema.json, flatten_recursive-cdap-data-pipeline.json
Created:	16/Mar/21 1:06 PM
Priority:	Major
Reporter:	Prashant Jaikumar

When the BQ source reads from a table that contains a nested record with the same name as its parent record, it generates a recursive schema.

In the attached pipeline, there is a record named `record` that contains a nested record with the same name. When we attempt to write the flattened record to the BQ sink, schema generation recurses without terminating, which causes a StackOverflowError.

```
java.lang.Exception: null
at io.cdap.cdap.internal.app.runtime.AbstractContext.lambda$initializeProgram$6(AbstractContext.java:605) ~[na:na]
at io.cdap.cdap.internal.app.runtime.AbstractContext.execute(AbstractContext.java:560) ~[na:na]
at io.cdap.cdap.internal.app.runtime.AbstractContext.initializeProgram(AbstractContext.java:597) ~[na:na]
at io.cdap.cdap.app.runtime.spark.SparkRuntimeService.initialize(SparkRuntimeService.java:433) ~[io.cdap.cdap.cdap-spark-core2_2.11-6.4.0-SNAPSHOT.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkRuntimeService.startUp(SparkRuntimeService.java:208) ~[io.cdap.cdap.cdap-spark-core2_2.11-6.4.0-SNAPSHOT.jar:na]
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:47) ~[com.google.guava.guava-13.0.1.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkRuntimeService$5$1.run(SparkRuntimeService.java:404) [io.cdap.cdap.cdap-spark-core2_2.11-6.4.0-SNAPSHOT.jar:na]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_282]
java.lang.StackOverflowError: null
at java.lang.System$2.getEnumConstantsShared(System.java:1252) ~[na:1.8.0_282]
at java.util.EnumSet.getUniverse(EnumSet.java:407) ~[na:1.8.0_282]
at java.util.EnumSet.noneOf(EnumSet.java:110) ~[na:1.8.0_282]
at com.google.api.client.util.GenericData.<init>(GenericData.java:55) ~[na:na]
at com.google.api.client.json.GenericJson.<init>(GenericJson.java:36) ~[na:na]
at com.google.api.services.bigquery.model.TableFieldSchema.<init>(TableFieldSchema.java:30) ~[na:na]
at com.google.cloud.hadoop.io.bigquery.output.BigQueryTableFieldSchema.<init>(BigQueryTableFieldSchema.java:34) ~[na:na]
at io.cdap.plugin.gcp.bigquery.sink.BigQuerySinkUtils.generateTableFieldSchema(BigQuerySinkUtils.java:106) ~[na:na]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[na:1.8.0_282]
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) ~[na:1.8.0_282]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[na:1.8.0_282]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[na:1.8.0_282]
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_282]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_282]
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) ~[na:1.8.0_282]
at io.cdap.plugin.gcp.bigquery.sink.BigQuerySinkUtils.generateTableFieldSchema(BigQuerySinkUtils.java:122) ~[na:na]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[na:1.8.0_282]
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) ~[na:1.8.0_282]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[na:1.8.0_282]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[na:1.8.0_282]
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_282]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_282]
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) ~[na:1.8.0_282]
at io.cdap.plugin.gcp.bigquery.sink.BigQuerySinkUtils.generateTableFieldSchema(BigQuerySinkUtils.java:122) ~[na:na]
...
```
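
The repeated generateTableFieldSchema frames in the trace suggest a recursive walk over a schema that ends up containing itself. Below is a minimal, self-contained sketch of that failure mode and one possible guard. `RecordNode`, `countRecords` and `isRecursive` are hypothetical names used only for illustration; this is not the plugin's code and not necessarily the fix that was adopted.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RecursiveSchemaGuard {
  // Hypothetical stand-in for a record schema: a name plus nested record fields.
  static final class RecordNode {
    final String name;
    final List<RecordNode> children = new ArrayList<>();

    RecordNode(String name) {
      this.name = name;
    }
  }

  // Walks the schema depth-first and counts record nodes. The 'path' set holds the
  // records currently being visited; RecordNode does not override equals, so
  // membership is by object identity. Meeting a record already on the path means
  // the schema contains itself, so we fail fast instead of recursing forever.
  static int countRecords(RecordNode node, Set<RecordNode> path) {
    if (!path.add(node)) {
      throw new IllegalArgumentException("Recursive record schema: " + node.name);
    }
    int count = 1;
    for (RecordNode child : node.children) {
      count += countRecords(child, path);
    }
    path.remove(node);
    return count;
  }

  // Convenience wrapper: true if the walk detects a self-containing record.
  static boolean isRecursive(RecordNode root) {
    try {
      countRecords(root, new HashSet<>());
      return false;
    } catch (IllegalArgumentException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    // Two distinct records that merely share a name: a finite schema.
    RecordNode parent = new RecordNode("record");
    parent.children.add(new RecordNode("record"));
    System.out.println(isRecursive(parent));

    // The same-named child resolved back to the parent: a schema that contains itself.
    RecordNode looped = new RecordNode("record");
    looped.children.add(looped);
    System.out.println(isRecursive(looped));
  }
}
```

Without the `path` check, the second case would recurse until the stack is exhausted, which matches the shape of the trace above.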

This message was sent by Atlassian Jira (v1001.0.0-SNAPSHOT#100155-sha1:2c3a71f)

Jetron Saiti (Jira)

Apr 15, 2021, 6:54:09 AM

to cdap...@googlegroups.com

Jetron Saiti commented on PLUGIN-643

Re: BigQuery source: fix handling of nested record with same name as parent record

The issue isn't specific to nested records in the BQ source; it is a schema reference issue. We reproduced it in Wrangler and with a File source. When a schema contains referenced records, the parent record is saved as a known record and is reused for every record that has the same name, which causes the failure.

While debugging we found that the issue may be with StructuredRecord when there are nested records, so we performed some other tests:

1. Created a JSON dataset with nested records where the child has the same name as the parent and tried to parse it in Wrangler. The output was wrong.
2. Created a JSON dataset with nested records where the child has the same name as the parent and tried to parse it with the File plugin as the source. Pressing the Validate button reports an issue in the Output Schema.
3. Created a unit test with StructuredRecord where the child has the same name as the parent. The output was wrong:

 
Schema mainSchema = Schema.recordOf("record", Schema.Field.of("Field", Schema.of(Schema.Type.STRING)));
Schema subSchema = Schema.recordOf("record", Schema.Field.of("record", mainSchema));

Output: {"type":"record","name":"record","fields":[{"name":"record","type":"record"}]}

If we read the source code, we can see that this is related to the ability to reference other schemas (see SchemaTypeAdapter): https://github.com/cdapio/cdap/blob/83798c0cc09cb616a54d1065962e07c2d04c21ce/cdap-api-common/src/main/java/io/cdap/cdap/internal/io/SchemaTypeAdapter.java#L367-L370
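
To illustrate the known-record behaviour described above, here is a simplified, self-contained model of a serializer that writes a record's full definition the first time its name is seen and only the name afterwards. `KnownRecordDemo`, `Rec` and `toJson` are hypothetical names; this is a sketch of the mechanism under that assumption, not CDAP's actual SchemaTypeAdapter code.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class KnownRecordDemo {
  // Hypothetical, simplified schema model: a record has a name and named fields.
  // A field type is either a nested Rec or a primitive type name held as a String.
  static final class Rec {
    final String name;
    final Map<String, Object> fields = new LinkedHashMap<>();

    Rec(String name) {
      this.name = name;
    }
  }

  // Writes the full definition the first time a record name is seen and only the
  // quoted name afterwards. Records are tracked purely by name, so two different
  // records that share a name are treated as the same record.
  static String toJson(Rec rec, Set<String> knownNames) {
    if (!knownNames.add(rec.name)) {
      return "\"" + rec.name + "\"";
    }
    StringBuilder sb = new StringBuilder();
    sb.append("{\"type\":\"record\",\"name\":\"").append(rec.name).append("\",\"fields\":[");
    boolean first = true;
    for (Map.Entry<String, Object> field : rec.fields.entrySet()) {
      if (!first) {
        sb.append(',');
      }
      first = false;
      Object type = field.getValue();
      String typeJson = (type instanceof Rec)
          ? toJson((Rec) type, knownNames)
          : "\"" + type + "\"";
      sb.append("{\"name\":\"").append(field.getKey())
          .append("\",\"type\":").append(typeJson).append('}');
    }
    return sb.append("]}").toString();
  }

  public static void main(String[] args) {
    // Mirrors the unit test: a child record named "record" inside a parent named "record".
    Rec inner = new Rec("record");
    inner.fields.put("Field", "string");
    Rec outer = new Rec("record");
    outer.fields.put("record", inner);
    // The child's own field "Field" is lost: it is written as a reference to the parent.
    System.out.println(toJson(outer, new HashSet<>()));
  }
}
```

In this model the child's definition is replaced by a reference to the parent's name, so a reader resolving that reference would reconstruct a record that contains itself. Giving the child a distinct name makes the model emit its full definition.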

Prashant Jaikumar (Jira)

Apr 16, 2021, 1:32:07 AM

to cdap...@googlegroups.com

Prashant Jaikumar commented on PLUGIN-643

Re: BigQuery source: fix handling of nested record with same name as parent record

Based on this, it seems to be a CDAP issue. Reassigning to Bhooshan to prioritize.
