Hi Bashir,
Great questions, and I hope you are doing well! Let me answer both:
1) Regarding the first: no, I have not run BigQuery on Cloud Storage, as I was unsure of the monetary cost of optimizing something like that -- and I assumed it would be fairly fast given the network bandwidth within a Google data center -- but I have done something different. A while ago I built my own functional query engine so that I could tweak multiple things (e.g. pinning threads to specific CPU cores, indexing, caching, etc.). Unfortunately it's not code I can share, but the logic of how it works should become fairly obvious shortly: it is a set of small Python scripts covering most of the SQL/relational-algebra operations, composed through shell pipes (a minimal sketch of one such stage follows the list below). So out of curiosity I applied them to gnomAD chrY, working within the constraints available to me at the university.
So the scripts are the following:
1) The Sharder/Cacher/Indexer script -- this pipes in the data, caches specific information from the VCF file, and indexes it along the way:

bin/data.py --region $((REGION_OFFSET)):$((REGION_OFFSET + STEP)) --file $1 --index chry | bin/select.py --filters variant_type --columns chrom,variant_type --headers chrom,variant_type --with-index | bin/cache.py --path chry/cache/variant_type --file $((REGION_OFFSET)).txt

2) The Query (notice how similar it is to SQL) -- this reads the cached data, queries for the requested information, and saves it for downstream analysis:
bin/select.py --columns variant_type --cache chry/cache/variant_type/${FILE} | bin/count.py --with-headers | bin/save.py --path query_results --file ${FILE}

3) The Aggregate/Combiner query -- this combines the results from query (2) and reports the totals:
bin/aggregate.py --path query_results
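To make the composable-scripts idea concrete, here is a minimal sketch of what a count.py-style stage could look like. This is illustrative rather than my actual code -- only the --with-headers flag is taken from the commands above; the internals here are my own simplification:

    #!/usr/bin/env python3
    # sketch of a count.py-style stage: read CSV rows from stdin,
    # tally the values of the last column, and print "value: count" lines
    import sys
    from collections import Counter

    def main():
        rows = iter(sys.stdin)
        if "--with-headers" in sys.argv[1:]:
            next(rows, None)  # drop the header row before counting
        counts = Counter(line.rstrip("\n").split(",")[-1]
                         for line in rows if line.strip())
        for value, n in counts.most_common():
            print(f"{value}: {n}")

    if __name__ == "__main__":
        main()

Because every stage just reads rows on stdin and writes its output to stdout, the stages compose through ordinary shell pipes, and sharding falls out of running the same pipeline over different --region windows.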
Here are the results, on a single compute node with 16 cores, using the cached data -- this is to test for the worst-case scenario:
snv: 537783
multi-snv: 385189
mixed: 207795
indel: 26660
multi-indel: 10173
real 0m31.169s
user 0m52.751s
sys 0m12.698s

Based on the above, the query over the cached results took about 30 seconds with 1 worker (a 16-core/thread machine) and bare-minimum optimization. The caching took a bit longer, but I didn't measure it since I was running on a single machine, and it usually only has to be done once per piece of information in a dataset. As for the query itself, it is fairly clear how to get it under a second with multiple workers, and one could improve on several fronts (inverted indices, hashes, query-plan caches, etc.). If I take something like chr2, which let's say is about 100 times larger, a naive run with multiple workers would come in at about 1 min 30 sec by a conservative estimate, and it would probably be much faster with additional optimizations that drive the time complexity down to a small constant multiplier -- basically still measured in seconds.
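To sketch the multi-worker point, here is roughly how the per-shard query fans out over a process pool and the partial counts combine (hypothetical code -- the shard directory matches the caching command above, but the per-row layout and the header row are my assumptions):

    # hypothetical sketch: run the per-shard variant_type count in parallel
    # and merge the partial results, mirroring the select | count | aggregate flow
    from collections import Counter
    from multiprocessing import Pool
    from pathlib import Path

    def count_shard(path):
        # tally the variant_type column (last field) in one cached shard,
        # assuming index,chrom,variant_type rows under a header line
        counts = Counter()
        with open(path) as shard:
            next(shard, None)  # skip the header row
            for line in shard:
                if line.strip():
                    counts[line.rstrip("\n").split(",")[-1]] += 1
        return counts

    if __name__ == "__main__":
        shards = sorted(Path("chry/cache/variant_type").glob("*.txt"))
        with Pool() as pool:  # defaults to one worker process per core
            total = sum(pool.map(count_shard, shards), Counter())
        for variant_type, n in total.most_common():
            print(f"{variant_type}: {n}")

The same pattern extends past one node: partition the shards across machines and add one final combine step.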
2) Regarding the conversion from VCF to CSV, that is straightforward, as the VCF header defines the columns for you beyond the 8 mandatory ones (CHROM,...,INFO). In my approach the logic is already in data.py, but it would be easy to add an external schema parameter to data.py that maps a schema structure over the incoming data flow and filters that way.
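To illustrate the header-driven part, here is a rough sketch of the idea -- this is not data.py itself, and a real version would need proper CSV quoting, since VCF INFO values can contain commas:

    import sys

    def vcf_to_csv(stream, out=sys.stdout):
        # stream a VCF and use its #CHROM header line to derive the CSV columns
        for line in stream:
            if line.startswith("##"):
                continue  # meta-information lines (INFO/FORMAT definitions, etc.)
            if line.startswith("#CHROM"):
                # header row: the 8 mandatory columns plus any extras (FORMAT, samples)
                columns = line.lstrip("#").rstrip("\n").split("\t")
                out.write(",".join(columns) + "\n")
                continue
            out.write(",".join(line.rstrip("\n").split("\t")) + "\n")

    if __name__ == "__main__":
        with open(sys.argv[1]) as vcf:
            vcf_to_csv(vcf)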
In fact, I already create the CSV on the fly as the filters are applied through the select.py query:

bin/data.py --region 944:1000 --file data/gnomad.genomes.v3.1.2.sites.chrY.vcf --index chry | bin/select.py --filters variant_type --columns chrom --headers chrom,variant_type --with-index | head -n7
index,chrom,variant_type
index_chry_0,chrY,snv
index_chry_1,chrY,multi-snv
index_chry_2,chrY,multi-snv
index_chry_3,chrY,snv
index_chry_4,chrY,snv
index_chry_5,chrY,multi-snv

Basically, what I'm driving at is that this was done on one simple machine, and it follows naturally that it should be easier to do in BigQuery with Cloud Storage, given all of their optimizations (including network). If it's not, then I have some ideas for how to make it that fast (possibly faster) with the current design of BigQuery and Cloud Storage, though that would require a separate discussion given the time/costs/etc. involved.
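For reference, once the data is loaded into a table, the equivalent count is a single grouped query through the BigQuery client -- a sketch with a made-up table name, and one I have not actually run (per my answer to your first question):

    # hypothetical sketch: the same variant_type count against a BigQuery table
    # (the project/dataset/table name is invented for illustration)
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT variant_type, COUNT(*) AS n
        FROM `my-project.gnomad.chrY_sites`
        GROUP BY variant_type
        ORDER BY n DESC
    """
    for row in client.query(query).result():
        print(f"{row.variant_type}: {row.n}")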
Thank you again for your generous help to the Bioinformatics community.
Hope it helps,
~p