[BQ Sink Feature Request] Truncate table if upstream is empty

230 views

Skip to first unread message

Omar Ismail

unread,

Jul 6, 2021, 12:46:27 PM7/6/21

to cdap...@googlegroups.com

Hi everyone,

I am having trouble creating an Issue tracker from the CDAP page so thought I'd post here:

Problem Statement

I have a Data Fusion pipeline that moves data from Cloud SQL Server to BigQuery that is not correctly putting data in the target BigQuery table; this happens when the table in Cloud SQL has been truncated so there are 0 records in the source, but when running the pipeline, the BQ table does not get truncated.

Looking at the pipeline log, I see why no BQ job ran. In the logs it says:

Importing into table 'XYZ' from 0 paths; path[0] is '(empty)'; awaitCompletion: true

As paths is an empty list (because no avro file was materialized due to query returning nothing), no files were uploaded. When a pipeline is run, the query results are saved as an avro file for storage, and the BigQuery sink imports these avro files stored in GCS. As no avro files were made, nothing was imported! I found the code snippet that is causing the BQ table not to be truncated as well:

https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/bigquery/src/main/java/com/google/cloud/hadoop/io/bigquery/BigQueryHelper.java#L131-L174

As no avro file gets written (due to the source table being empty), nothing happens as gcsPath field will be empty. The gcsPath variable is used to setSourceUris, and from the doc:

https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/ExternalDataConfiguration.html#setSourceUris-java.util.List-

it is required and cannot be empty.

Work-around

Adding a Conditional between the source and the sink:

The conditional statement is just:

token['CloudSQL MySQL']['output'] > 0

where "CloudSQL MySQL" is the Label name of the inputting plugin. I tested the attached pipeline by creating a MySQL table with no records, and a BQ table with lots of records. When I ran the pipeline, the BQ table was truncated. I then added a record to the MySQL table, and ran the pipeline to test if the True statement worked, and it did; the BQ table had one record!

Constraints

Would like to do this without using Replication

Feature Request

Avoid use of conditional and have the BQ sink truncate the table if it detects that zero records have been outputted from the upstream source.

Best,

Omar

Vitalii Tymchyshyn

unread,

Jul 12, 2021, 8:12:26 PM7/12/21

to CDAP User

Hi!

Thank you for the report.

Please follow this feature request that was registered: https://cdap.atlassian.net/browse/PLUGIN-442