Variant Transforms working, BigQuery generated table consisting of 2924 rows from 30 VCF files

hze...@gmail.com

Aug 27, 2021, 9:21:38 PM

to GCP Life Sciences Discuss

Dear Cloud life sciences team

I would like to understand how Variant Transforms works.

I ran 30 VCF files through Variant Transforms using the wildcard approach. The run produced a BigQuery table named residue with 2924 rows.

I would like to know what this table, generated collectively from those VCF files, represents. Most of the fields were empty. The Chr1-Chr22, Chr X, and Chr Y tables were empty; only the residue and sample_info tables were non-empty.

I would also like to know how many fields and rows Variant Transforms generates for a single VCF file.

My key question is: what does the residue table represent, and what is its significance?

What is the significance of Variant Transforms?

My other question: can gVCF files be processed by Variant Transforms?

What is the role of BigQuery in Variant Transforms?

And my last question: what is the role of Avro in Variant Transforms?

best regards

Aaron

Thanks in advance.

Your help will be highly appreciated.

Randi Cowin

Aug 30, 2021, 5:22:41 PM

to hze...@gmail.com, GCP Life Sciences Discuss

Hi Aaron,

Variant Transforms is a tool that uses a predefined ETL transformation to load variant call data from VCFs into BigQuery for downstream analyses. BigQuery is Google Cloud's serverless cloud data warehouse. A good overview of how and when to use Variant Transforms (its significance), and of the role of BigQuery as the target for a Variant Transforms pipeline and downstream analyses, is provided here.

To use Variant Transforms and get accurate loading to BigQuery, you must first create a new dataset in BigQuery. Variant Transforms will create and hydrate the tables in a new dataset. If you are appending data to existing tables in an existing dataset, there are additional flags you must pass as part of the command. If you need to merge samples across VCF files, please see Variant Merging.

If this was new data in a new dataset, the "empty" rows could be due to a misunderstanding of the BigQuery schema and the data types BigQuery uses to save on storage and optimize performance (structs and arrays, for example). If you would like a more flattened view of the data, you can either flatten it in your SQL queries or run Variant Transforms with the option that creates sample-optimized tables instead of the standard schema. To better understand the BigQuery schema, please see the documentation here.
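To make the nested schema more concrete, here is a minimal Python sketch of what flattening does. It is only an illustration: the row below is made up, and the field names (`reference_name`, `start_position`, `reference_bases`, `alternate_bases.alt`, `call.name`, `call.genotype`) are assumptions that may not match your generated schema exactly. In BigQuery standard SQL, the comparable step is a join against `UNNEST(call)`.

```python
from typing import Any, Dict, Iterator, List


def flatten_calls(row: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per (variant, call) pair, similar to UNNEST(call)."""
    # Collapse the repeated alternate_bases records into one comma-separated string.
    alts = ",".join(a["alt"] for a in row.get("alternate_bases", []))
    for call in row.get("call", []):
        yield {
            "reference_name": row["reference_name"],
            "start_position": row["start_position"],
            "reference_bases": row["reference_bases"],
            "alternate_bases": alts,
            "sample": call["name"],
            "genotype": "/".join(str(g) for g in call["genotype"]),
        }


# Hypothetical row: one variant whose two samples live in a repeated `call` field.
example_row = {
    "reference_name": "chr1",
    "start_position": 12345,
    "reference_bases": "A",
    "alternate_bases": [{"alt": "G"}],
    "call": [
        {"name": "sample_a", "genotype": [0, 1]},
        {"name": "sample_b", "genotype": [1, 1]},
    ],
}

flat_rows: List[Dict[str, Any]] = list(flatten_calls(example_row))
for flat in flat_rows:
    print(flat)
```

One nested row becomes two flat rows here, one per sample, which is why a table can look sparse in the preview even though the data is present inside the repeated fields.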

If your tables are definitely completely empty, I would use the preprocessor to confirm that you do not have one or more malformed or inconsistent VCF files.
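As a rough idea of what "inconsistent" can mean, the Python sketch below compares INFO and FORMAT header definitions across files and reports fields whose Number or Type differ. This is not the Variant Transforms preprocessor and does not reproduce its report; the function names and the two in-memory headers are hypothetical and only illustrate one kind of check.

```python
import re
from typing import Dict, List, Tuple

# Matches header lines such as ##INFO=<ID=DP,Number=1,Type=Integer,...>
HEADER_RE = re.compile(r"^##(INFO|FORMAT)=<ID=([^,>]+),Number=([^,>]+),Type=([^,>]+)")


def collect_definitions(vcf_text: str) -> Dict[Tuple[str, str], Tuple[str, str]]:
    """Map (section, ID) -> (Number, Type) for one file's header."""
    defs: Dict[Tuple[str, str], Tuple[str, str]] = {}
    for line in vcf_text.splitlines():
        if not line.startswith("##"):
            break  # the ## metadata block has ended
        match = HEADER_RE.match(line)
        if match:
            section, field_id, number, type_ = match.groups()
            defs[(section, field_id)] = (number, type_)
    return defs


def find_conflicts(files: Dict[str, str]) -> List[str]:
    """Report fields whose Number/Type definitions differ between files."""
    seen: Dict[Tuple[str, str], Tuple[Tuple[str, str], str]] = {}
    conflicts: List[str] = []
    for name, text in files.items():
        for key, definition in collect_definitions(text).items():
            if key not in seen:
                seen[key] = (definition, name)
            elif seen[key][0] != definition:
                first_def, first_name = seen[key]
                conflicts.append(
                    f"{key[0]}/{key[1]}: {first_name} has {first_def}, "
                    f"{name} has {definition}"
                )
    return conflicts


# Hypothetical headers: the two files disagree on the type of INFO/DP.
file_a = '##fileformat=VCFv4.2\n##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth">\n'
file_b = '##fileformat=VCFv4.2\n##INFO=<ID=DP,Number=1,Type=Float,Description="Depth">\n'

conflicts = find_conflicts({"a.vcf": file_a, "b.vcf": file_b})
for conflict in conflicts:
    print(conflict)
```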

For the tables that did contain data: Variant Transforms creates two non-chromosome tables in the standard schema. The first is sample_info, which lists the file names that the samples came from, and the second is the residual table. Is this the "residue" table you referred to? Or did you use sharding, and this is the output of the residual shard?

If it is the residual table and there was no data in any of the chromosome tables, my guess is that your VCFs list the data differently than Variant Transforms expects (for human sequencing data), or that the data are not human. Can you confirm this?
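One quick way to check this guess is to count which contig names actually appear in the first column of your VCF records. The Python sketch below does that under an assumed naming convention (chr1 to chr22, chrX, chrY); the names Variant Transforms really accepts depend on its sharding configuration, so treat the `EXPECTED` set and the `route` function as hypothetical.

```python
from collections import Counter
from typing import Iterable

# Assumed convention for this sketch only: chr1..chr22, chrX, chrY.
EXPECTED = {f"chr{n}" for n in range(1, 23)} | {"chrX", "chrY"}


def route(reference_name: str) -> str:
    """Return the table a record would land in under the assumed convention."""
    return reference_name if reference_name in EXPECTED else "residual"


def count_by_table(vcf_lines: Iterable[str]) -> Counter:
    """Count data records per destination table, skipping header and blank lines."""
    counts: Counter = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom = line.split("\t", 1)[0]  # first tab-separated column is CHROM
        counts[route(chrom)] += 1
    return counts


# Hypothetical records: only the first one uses an expected contig name.
records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "NC_000001.11\t200\t.\tC\tT\t50\tPASS\t.",
    "scaffold_12\t300\t.\tG\tA\t50\tPASS\t.",
]

table_counts = count_by_table(records)
print(table_counts)
```

If nearly every record falls under "residual" in a check like this, the contig naming in the VCFs is the likely reason the chromosome tables are empty.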

If none of these explanations make sense, can you please tell us how you ran Variant Transforms? Using Docker or through GitHub? Can you send us your command? It would also be helpful to have screenshots of your schema and tables in BigQuery.

You can use gVCFs with Variant Transforms.

Avro files are generated from the VCF files to improve load performance: they let the data be loaded into BigQuery quickly and at no cost for the data load.

Finally, there is no fixed number of fields or columns you can expect from loading the data into BigQuery using Variant Transforms. This will depend on the number of samples you have, as well as on whether your VCFs have additional headers outside the standard v4.3 specification.

Thank you,

Randi
