[JIRA] (PLUGIN-643) BigQuery source: fix handling of nested record with same name as parent record

Prashant Jaikumar (Jira)

Mar 16, 2021, 4:06:48 PM
to cdap...@googlegroups.com
Prashant Jaikumar created an issue
 
CDAP Plugins / Bug PLUGIN-643
BigQuery source: fix handling of nested record with same name as parent record
Issue Type: Bug
Assignee: Unassigned
Attachments: bq_schema.json, flatten_recursive-cdap-data-pipeline.json
Created: 16/Mar/21 1:06 PM
Priority: Major
Reporter: Prashant Jaikumar

When BQ source reads from a table that contains a nested record with the same name as the parent record, it generates a recursive schema.

In the attached pipeline, there's a record named `record` that contains a nested record with the same name. When we attempt to write the flattened record to the BQ sink, schema generation recurses without terminating, which causes a StackOverflowError.

```
java.lang.Exception: null
at io.cdap.cdap.internal.app.runtime.AbstractContext.lambda$initializeProgram$6(AbstractContext.java:605) ~[na:na]
at io.cdap.cdap.internal.app.runtime.AbstractContext.execute(AbstractContext.java:560) ~[na:na]
at io.cdap.cdap.internal.app.runtime.AbstractContext.initializeProgram(AbstractContext.java:597) ~[na:na]
at io.cdap.cdap.app.runtime.spark.SparkRuntimeService.initialize(SparkRuntimeService.java:433) ~[io.cdap.cdap.cdap-spark-core2_2.11-6.4.0-SNAPSHOT.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkRuntimeService.startUp(SparkRuntimeService.java:208) ~[io.cdap.cdap.cdap-spark-core2_2.11-6.4.0-SNAPSHOT.jar:na]
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:47) ~[com.google.guava.guava-13.0.1.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkRuntimeService$5$1.run(SparkRuntimeService.java:404) [io.cdap.cdap.cdap-spark-core2_2.11-6.4.0-SNAPSHOT.jar:na]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_282]
java.lang.StackOverflowError: null
at java.lang.System$2.getEnumConstantsShared(System.java:1252) ~[na:1.8.0_282]
at java.util.EnumSet.getUniverse(EnumSet.java:407) ~[na:1.8.0_282]
at java.util.EnumSet.noneOf(EnumSet.java:110) ~[na:1.8.0_282]
at com.google.api.client.util.GenericData.<init>(GenericData.java:55) ~[na:na]
at com.google.api.client.json.GenericJson.<init>(GenericJson.java:36) ~[na:na]
at com.google.api.services.bigquery.model.TableFieldSchema.<init>(TableFieldSchema.java:30) ~[na:na]
at com.google.cloud.hadoop.io.bigquery.output.BigQueryTableFieldSchema.<init>(BigQueryTableFieldSchema.java:34) ~[na:na]
at io.cdap.plugin.gcp.bigquery.sink.BigQuerySinkUtils.generateTableFieldSchema(BigQuerySinkUtils.java:106) ~[na:na]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[na:1.8.0_282]
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) ~[na:1.8.0_282]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[na:1.8.0_282]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[na:1.8.0_282]
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_282]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_282]
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) ~[na:1.8.0_282]
at io.cdap.plugin.gcp.bigquery.sink.BigQuerySinkUtils.generateTableFieldSchema(BigQuerySinkUtils.java:122) ~[na:na]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[na:1.8.0_282]
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) ~[na:1.8.0_282]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[na:1.8.0_282]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[na:1.8.0_282]
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_282]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_282]
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) ~[na:1.8.0_282]
at io.cdap.plugin.gcp.bigquery.sink.BigQuerySinkUtils.generateTableFieldSchema(BigQuerySinkUtils.java:122) ~[na:na]
...
```
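The repeated `generateTableFieldSchema` frames suggest a recursive walk over record fields that never reaches a base case once a record contains itself. The following is a minimal sketch of one way to guard such a walk, assuming hypothetical `RecordSchema` and `SchemaField` stand-ins. It uses neither the CDAP `Schema` class nor the plugin's actual code, and it is not the fix that was applied.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a record schema; NOT the CDAP Schema class.
final class RecordSchema {
  final String name;
  final List<SchemaField> fields = new ArrayList<>();

  RecordSchema(String name) {
    this.name = name;
  }
}

// Hypothetical field: 'record' is null for a primitive field.
final class SchemaField {
  final String name;
  final RecordSchema record;

  SchemaField(String name, RecordSchema record) {
    this.name = name;
    this.record = record;
  }
}

final class RecursionGuardSketch {
  // Collects dotted paths of primitive fields. Fails fast with a clear
  // message instead of overflowing the stack when a record contains itself.
  static List<String> flatten(RecordSchema schema) {
    List<String> out = new ArrayList<>();
    walk(schema, "", new HashSet<>(), out);
    return out;
  }

  private static void walk(RecordSchema schema, String prefix,
                           Set<RecordSchema> onPath, List<String> out) {
    // RecordSchema does not override equals/hashCode, so this is an identity check
    // against the records on the current path from the root.
    if (!onPath.add(schema)) {
      throw new IllegalArgumentException(
          "Recursive schema at '" + prefix + "' (record '" + schema.name + "')");
    }
    for (SchemaField f : schema.fields) {
      String path = prefix.isEmpty() ? f.name : prefix + "." + f.name;
      if (f.record == null) {
        out.add(path);
      } else {
        walk(f.record, path, onPath, out);
      }
    }
    onPath.remove(schema); // siblings may legitimately reuse the same record type
  }

  public static void main(String[] args) {
    RecordSchema loop = new RecordSchema("record");
    loop.fields.add(new SchemaField("record", loop)); // record that contains itself
    try {
      flatten(loop);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Tracking only the records on the current path, rather than every record seen so far, keeps the guard from rejecting schemas that reuse one record type in sibling fields.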

This message was sent by Atlassian Jira (v1001.0.0-SNAPSHOT#100155-sha1:2c3a71f)

Jetron Saiti (Jira)

Apr 15, 2021, 6:54:09 AM
to cdap...@googlegroups.com
Jetron Saiti commented on Bug PLUGIN-643
 
Re: BigQuery source: fix handling of nested record with same name as parent record

The issue isn't specific to nested records in the BQ source; it's a schema reference issue. We reproduced it in Wrangler and with a File source. When records are referenced by name, the parent record is saved as a known record and is then used for every record that has the same name, which causes the failure.

While debugging, we found that the issue may be with StructuredRecord when there are nested records, so we performed some other tests:

1. Created a JSON dataset with nested records where the child has the same name as the parent and tried to parse it in Wrangler; the output was wrong.
2. Created a JSON dataset with nested records where the child has the same name as the parent and tried to parse it with the File plugin as the source; pressing the Validate button reports an issue in the Output Schema.
3. Created a unit test with StructuredRecord where the child has the same name as the parent; the output was wrong:

Schema mainSchema = Schema.recordOf("record", Schema.Field.of("Field", Schema.of(Schema.Type.STRING)));
Schema subSchema = Schema.recordOf("record", Schema.Field.of("record", mainSchema));

Output: {"type":"record","name":"record","fields":[{"name":"record","type":"record"}]}

If we read the source code of StructuredRecord, we can see that this is related to the ability to reference other schemas: https://github.com/cdapio/cdap/blob/83798c0cc09cb616a54d1065962e07c2d04c21ce/cdap-api-common/src/main/java/io/cdap/cdap/internal/io/SchemaTypeAdapter.java#L367-L370
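To illustrate the behaviour described above, here is a minimal sketch of a serializer that writes a bare name reference for any record whose name it has already seen. The `NamedRecord`, `NamedField`, and `toJson` names are hypothetical; this is not the `SchemaTypeAdapter` code, only an assumption about the mechanism that happens to reproduce the output from the unit test.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical record type; NOT the CDAP Schema class.
final class NamedRecord {
  final String name;
  final List<NamedField> fields = new ArrayList<>();

  NamedRecord(String name) {
    this.name = name;
  }
}

// Hypothetical field: exactly one of 'primitive' or 'record' is set.
final class NamedField {
  final String name;
  final String primitive;
  final NamedRecord record;

  NamedField(String name, String primitive, NamedRecord record) {
    this.name = name;
    this.primitive = primitive;
    this.record = record;
  }
}

final class NameReferenceSketch {
  // Serializes a record; a record whose NAME was already written is emitted
  // as a bare name reference, even if it is a structurally different record.
  static String toJson(NamedRecord r, Set<String> knownNames) {
    if (!knownNames.add(r.name)) {
      return "\"" + r.name + "\"";
    }
    StringBuilder sb = new StringBuilder();
    sb.append("{\"type\":\"record\",\"name\":\"").append(r.name).append("\",\"fields\":[");
    for (int i = 0; i < r.fields.size(); i++) {
      NamedField f = r.fields.get(i);
      if (i > 0) {
        sb.append(',');
      }
      sb.append("{\"name\":\"").append(f.name).append("\",\"type\":");
      sb.append(f.record != null ? toJson(f.record, knownNames) : "\"" + f.primitive + "\"");
      sb.append('}');
    }
    return sb.append("]}").toString();
  }

  public static void main(String[] args) {
    // Mirrors mainSchema / subSchema from the unit test above.
    NamedRecord mainSchema = new NamedRecord("record");
    mainSchema.fields.add(new NamedField("Field", "string", null));
    NamedRecord subSchema = new NamedRecord("record");
    subSchema.fields.add(new NamedField("record", null, mainSchema));
    System.out.println(toJson(subSchema, new HashSet<>()));
  }
}
```

In this sketch the inner record's `Field` is lost because only the name is written. A reader that resolves the name `record` back to the enclosing record would then build a schema that contains itself, which is consistent with the recursive schema reported in the issue.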


Prashant Jaikumar (Jira)

Apr 16, 2021, 1:32:07 AM
to cdap...@googlegroups.com

Based on this, it seems to be a CDAP issue; reassigning to Bhooshan to prioritize.
