Hi Bashir,
Great questions, and I hope you are doing well! Let me answer both:
1) Regarding the first: no, I have not run BigQuery on Cloud Storage, as I was unsure of the monetary cost of optimizing something like that -- and I assumed it would be fairly fast given the network bandwidth within a Google data center -- but I have done something different. A while ago I built my own functional query engine so that I could tweak multiple things (e.g. pinning threads to specific CPU cores, indexing, caching, etc.). Unfortunately it's not code I can share, but the logic of how it works should become fairly obvious shortly: it is a set of small Python scripts covering most of the SQL/relational-algebra operations, composed through shell pipes (a minimal sketch of one such stage follows the list below). So out of curiosity I applied them to gnomAD chrY, working within the constraints available to me at the university.
So the scripts are the following:
1) The Sharder/Cacher/Indexer script -- this pipes in the data, caches specific information from the VCF file, and indexes it along the way:

bin/data.py --region $((REGION_OFFSET)):$((REGION_OFFSET + STEP)) --file $1 --index chry | bin/select.py --filters variant_type --columns chrom,variant_type --headers chrom,variant_type --with-index | bin/cache.py --path chry/cache/variant_type --file $((REGION_OFFSET)).txt

2) The Query (notice how similar it is to SQL) -- this reads the cached data, queries for the requested information, and saves it for downstream analysis:
bin/select.py --columns variant_type --cache chry/cache/variant_type/${FILE} | bin/count.py --with-headers | bin/save.py --path query_results --file ${FILE}

3) The Aggregate/Combiner query -- this combines the results from query (2) and reports the totals:
bin/aggregate.py --path query_results
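To make the composable-scripts idea concrete, here is a minimal sketch of what a count.py-style stage could look like. This is illustrative rather than my actual code -- only the --with-headers flag is taken from the commands above; the internals here are my own simplification:

    #!/usr/bin/env python3
    # sketch of a count.py-style stage: read CSV rows from stdin,
    # tally the values of the last column, and print "value: count" lines
    import sys
    from collections import Counter

    def main():
        rows = iter(sys.stdin)
        if "--with-headers" in sys.argv[1:]:
            next(rows, None)  # drop the header row before counting
        counts = Counter(line.rstrip("\n").split(",")[-1]
                         for line in rows if line.strip())
        for value, n in counts.most_common():
            print(f"{value}: {n}")

    if __name__ == "__main__":
        main()

Because every stage just reads rows on stdin and writes its output to stdout, the stages compose through ordinary shell pipes, and sharding falls out of running the same pipeline over different --region windows.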
Here are the results, on a single compute node with 16 cores, using the cached data -- this is to test for the worst-case scenario:
snv: 537783
multi-snv: 385189
mixed: 207795
indel: 26660
multi-indel: 10173
real 0m31.169s
user 0m52.751s
sys 0m12.698s

Based on the above, the query over the cached results took about 30 seconds with 1 worker (a 16-core/thread machine) and bare-minimum optimization. The caching took a bit longer, but I didn't measure it since I was running on a single machine, and it usually only has to be done once per piece of information in a dataset. As for the query itself, it is fairly clear how to get it under a second with multiple workers, and one could improve on several fronts (inverted indices, hashes, query-plan caches, etc.). If I take something like chr2, which let's say is about 100 times larger, a naive run with multiple workers would come in at about 1 min 30 sec by a conservative estimate, and it would probably be much faster with additional optimizations that drive the time complexity down to a small constant multiplier -- basically still measured in seconds.
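To sketch the multi-worker point, here is roughly how the per-shard query fans out over a process pool and the partial counts combine (hypothetical code -- the shard directory matches the caching command above, but the per-row layout and the header row are my assumptions):

    # hypothetical sketch: run the per-shard variant_type count in parallel
    # and merge the partial results, mirroring the select | count | aggregate flow
    from collections import Counter
    from multiprocessing import Pool
    from pathlib import Path

    def count_shard(path):
        # tally the variant_type column (last field) in one cached shard,
        # assuming index,chrom,variant_type rows under a header line
        counts = Counter()
        with open(path) as shard:
            next(shard, None)  # skip the header row
            for line in shard:
                if line.strip():
                    counts[line.rstrip("\n").split(",")[-1]] += 1
        return counts

    if __name__ == "__main__":
        shards = sorted(Path("chry/cache/variant_type").glob("*.txt"))
        with Pool() as pool:  # defaults to one worker process per core
            total = sum(pool.map(count_shard, shards), Counter())
        for variant_type, n in total.most_common():
            print(f"{variant_type}: {n}")

The same pattern extends past one node: partition the shards across machines and add one final combine step.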
2) Regarding the conversion from VCF to CSV, that is straightforward, as the VCF header defines the columns for you beyond the 8 mandatory ones (CHROM,...,INFO). In my approach the logic is already in data.py, but it would be easy to add an external schema parameter to data.py that maps a schema structure over the incoming data flow and filters that way.
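To illustrate the header-driven part, here is a rough sketch of the idea -- this is not data.py itself, and a real version would need proper CSV quoting, since VCF INFO values can contain commas:

    import sys

    def vcf_to_csv(stream, out=sys.stdout):
        # stream a VCF and use its #CHROM header line to derive the CSV columns
        for line in stream:
            if line.startswith("##"):
                continue  # meta-information lines (INFO/FORMAT definitions, etc.)
            if line.startswith("#CHROM"):
                # header row: the 8 mandatory columns plus any extras (FORMAT, samples)
                columns = line.lstrip("#").rstrip("\n").split("\t")
                out.write(",".join(columns) + "\n")
                continue
            out.write(",".join(line.rstrip("\n").split("\t")) + "\n")

    if __name__ == "__main__":
        with open(sys.argv[1]) as vcf:
            vcf_to_csv(vcf)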
In fact, I already create the CSV on the fly as the filters are applied through the select.py query:

bin/data.py --region 944:1000 --file data/gnomad.genomes.v3.1.2.sites.chrY.vcf --index chry | bin/select.py --filters variant_type --columns chrom --headers chrom,variant_type --with-index | head -n7
index,chrom,variant_type
index_chry_0,chrY,snv
index_chry_1,chrY,multi-snv
index_chry_2,chrY,multi-snv
index_chry_3,chrY,snv
index_chry_4,chrY,snv
index_chry_5,chrY,multi-snv

Basically, what I'm driving at is that this was done on one simple machine, and it follows naturally that it should be easier to do in BigQuery with Cloud Storage, given all of their optimizations (including network). If it's not, then I have some ideas for how to make it that fast (possibly faster) with the current design of BigQuery and Cloud Storage, though that would require a separate discussion given the time/costs/etc. involved.
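For reference, once the data is loaded into a table, the equivalent count is a single grouped query through the BigQuery client -- a sketch with a made-up table name, and one I have not actually run (per my answer to your first question):

    # hypothetical sketch: the same variant_type count against a BigQuery table
    # (the project/dataset/table name is invented for illustration)
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT variant_type, COUNT(*) AS n
        FROM `my-project.gnomad.chrY_sites`
        GROUP BY variant_type
        ORDER BY n DESC
    """
    for row in client.query(query).result():
        print(f"{row.variant_type}: {row.n}")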
Thank you again for your generous help to the Bioinformatics community.
Hope it helps,
~p