How Variant Transforms works: BigQuery table of 2924 rows generated from 30 VCF files


hze...@gmail.com

Aug 27, 2021, 9:21:38 PM
to GCP Life Sciences Discuss
Dear Cloud Life Sciences team,

I would like to understand how Variant Transforms works.
I ran Variant Transforms on 30 VCF files, using the wildcard approach.
The run produced a BigQuery table named residue with 2924 rows.
I would like to know what this table, generated collectively from those VCF files, represents. Most of the fields were empty.
The tables for chr1 to chr22, chrX, and chrY were empty;
only the residue and sample_info tables were non-empty.
I would also like to know how many fields and rows a single VCF file generates.
My key question is: what does the residue table represent, and what is its significance?
What is the significance of Variant Transforms?
My other question: can gVCF files be processed by Variant Transforms?
What is the role of BigQuery in Variant Transforms?
And my last question: what is the role of Avro in Variant Transforms?




Best regards,
Aaron

Thanks in advance. Your help will be highly appreciated.

Randi Cowin

Aug 30, 2021, 5:22:41 PM
to hze...@gmail.com, GCP Life Sciences Discuss
Hi Aaron,

Variant Transforms is a tool that uses a predefined ETL transformation to load variant call data from VCFs into BigQuery for downstream analyses. BigQuery is Google Cloud's serverless cloud data warehouse. A good overview of how and when to use Variant Transforms (its significance), and of the role of BigQuery as the target for a Variant Transforms pipeline and downstream analyses, is provided here.

To use Variant Transforms and load data into BigQuery accurately, you must first create a new dataset in BigQuery. Variant Transforms will create and populate the tables in that new dataset. If you are appending data to existing tables in an existing dataset, there are additional flags you must pass as part of the command. If you need to merge samples across VCF files, please see Variant Merging.

If this was new data in a new dataset, the "empty" rows could be due to a misunderstanding of the BigQuery schema and of the data types BigQuery uses to save on storage and optimize performance (structs and arrays, for example). If you would like a more flattened view of the data, you can flatten it in the way you write your SQL query, or run Variant Transforms with a command to create sample-optimized tables instead of using queries against the standard schema. To better understand the BigQuery schema, please see the documentation here.
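To illustrate why nested columns can look "empty" in a table preview, here is a minimal Python sketch of what flattening does. It is not a query against BigQuery; it only mimics, on a single made-up row, the effect of unnesting a repeated record in SQL. The field names (`alternate_bases`, `call`, `sample_id`, `genotype`) are assumptions about the standard schema, so please check them against your own table.

```python
# Hypothetical row shaped like a nested variants table.
# Field names are assumptions; verify them against your own schema.
row = {
    "reference_name": "chr1",
    "start_position": 12345,
    "reference_bases": "A",
    "alternate_bases": [{"alt": "G"}],        # repeated record (array of structs)
    "call": [                                 # one entry per sample
        {"sample_id": 1, "genotype": [0, 1]},
        {"sample_id": 2, "genotype": [1, 1]},
    ],
}


def flatten_calls(variant_row):
    """Yield one flat dict per (variant, call) pair, similar to unnesting `call`."""
    alts = ",".join(a["alt"] for a in variant_row["alternate_bases"])
    for call in variant_row["call"]:
        yield {
            "reference_name": variant_row["reference_name"],
            "start_position": variant_row["start_position"],
            "reference_bases": variant_row["reference_bases"],
            "alternate_bases": alts,
            "sample_id": call["sample_id"],
            "genotype": "/".join(str(g) for g in call["genotype"]),
        }


flat_rows = list(flatten_calls(row))
for flat in flat_rows:
    print(flat)
```

One nested row with two calls becomes two flat rows, one per sample, which is usually the shape people expect when they first look at the table.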

If your tables are definitely completely empty, I would use the preprocessor to confirm that you did not have one or more malformed or inconsistent VCF files.

For the tables that did contain data: Variant Transforms creates two non-chromosome tables in the standard schema. The first is sample_info, which lists the file names that the samples came from, and the second is the residual table. Is this the "residue" table you referred to? Or did you use sharding, and this is the output of the residual shard?

If it is the residual table, and there was no data in any of the chromosome tables, my guess is that your VCFs listed the data differently from what Variant Transforms expects (for human sequencing data), or that the data were not human. Can you confirm this?
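One quick way to check this guess is to tally the contig names in the CHROM column of your VCFs. The sketch below is only an illustration: the rule in `shard_for` (accept `1` to `22`, `X`, `Y`, with or without a `chr` prefix, and send everything else to "residual") is an assumption for demonstration, not the actual sharding configuration that Variant Transforms ships with, and the sample lines are made up.

```python
from collections import Counter

# Assumption: chromosome tables are keyed on names like "chr1" or "1".
# The real sharding config used by Variant Transforms may differ.
EXPECTED = {str(n) for n in range(1, 23)} | {"X", "Y"}


def shard_for(contig):
    """Return a hypothetical table suffix for a contig name."""
    name = contig[3:] if contig.lower().startswith("chr") else contig
    name = name.upper()
    return "chr" + name if name in EXPECTED else "residual"


def count_shards(vcf_lines):
    """Count data records per hypothetical shard, skipping header lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        contig = line.split("\t", 1)[0]
        counts[shard_for(contig)] += 1
    return counts


# Made-up records for illustration only.
sample_lines = [
    "##fileformat=VCFv4.3",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "X\t200\t.\tC\tT\t50\tPASS\t.",
    "NC_000001.11\t300\t.\tG\tA\t50\tPASS\t.",
    "scaffold_12\t400\t.\tT\tC\t50\tPASS\t.",
]
counts = count_shards(sample_lines)
print(dict(counts))
```

If nearly all of your records fall under "residual" in a tally like this, the contig names in the files are the first thing to look at.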

If none of these explanations make sense, can you please tell us how you ran Variant Transforms? Using Docker or through GitHub? Can you send us your command? It would also be helpful to have screenshots of your schema and tables in BigQuery.

You can use gVCFs with Variant Transforms.

Avro files are generated from the VCF files so that the data can be loaded into BigQuery quickly and at no cost for the data load.

Finally, there is no fixed number of fields or columns you can expect from loading the data into BigQuery using Variant Transforms. This will depend on the number of samples you have, as well as on whether your VCFs have additional headers beyond the standard v4.3 specification.
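To get a feel for what drives the shape of the output for a given file, you can summarize its header. The sketch below is a rough illustration in plain Python: it lists the INFO and FORMAT IDs declared in the header and the sample names on the `#CHROM` line. The helper name `summarize_header` and the example header are made up, and how each of these maps onto BigQuery columns is something to confirm in the schema documentation.

```python
def summarize_header(header_lines):
    """Collect INFO IDs, FORMAT IDs, and sample names from VCF header lines."""
    info_ids, format_ids, samples = [], [], []
    for line in header_lines:
        if line.startswith("##INFO=<ID="):
            info_ids.append(line[len("##INFO=<ID="):].split(",", 1)[0])
        elif line.startswith("##FORMAT=<ID="):
            format_ids.append(line[len("##FORMAT=<ID="):].split(",", 1)[0])
        elif line.startswith("#CHROM"):
            # The first 9 columns are fixed (CHROM..FORMAT); the rest are samples.
            samples = line.rstrip("\n").split("\t")[9:]
    return {"info": info_ids, "format": format_ids, "samples": samples}


# Made-up header for illustration only.
example_header = [
    "##fileformat=VCFv4.3",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
]
summary = summarize_header(example_header)
print(summary)
```

Running this over each of your 30 files and comparing the summaries is also a cheap way to spot headers that are inconsistent with one another.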

Thank you,

Randi
