Hello,
A recent run of the workflow triggered a spending anomaly on our Google Cloud Platform billing.
On Terra, the workflow says the run cost $100, but when I dig into the billing report there is a $720 data transfer charge, and the run actually cost $925. Our BigQuery billing export attributes this cost to a multi-region transfer of 38.66 terabytes of data from the gs://regev-lab bucket.
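As a rough sanity check on these figures, the effective transfer rate and the gap between the reported and billed cost can be worked out directly. This is only a sketch: it assumes decimal units (1 TB = 1000 GB), which the billing export does not confirm, and the variable names are mine.

```python
# Figures quoted from the Terra UI and the GCP billing report.
terra_reported_usd = 100.0    # cost shown by Terra for the run
billed_run_usd = 925.0        # actual run cost in the billing report
transfer_charge_usd = 720.0   # data transfer line item
data_moved_tb = 38.66         # volume transferred from gs://regev-lab

# Assumption: decimal terabytes (1 TB = 1000 GB).
GB_PER_TB = 1000

# Effective price paid per GB moved between locations.
rate_per_gb = transfer_charge_usd / (data_moved_tb * GB_PER_TB)

# How far the billed cost is from what Terra displayed.
cost_ratio = billed_run_usd / terra_reported_usd

print(f"effective transfer rate: ${rate_per_gb:.4f} per GB")
print(f"billed cost vs. Terra estimate: {cost_ratio:.2f}x")
```

The ratio of billed to reported cost is where the "9 times" figure comes from.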
After searching the WDL of the workflow, I found that the smartseq2_per_plate function pulls the reference genome specified in the workflow input JSON on a per-sample basis.
This was fine years ago, when Broad/Terra policy was to use multi-regional buckets, but the default is now US-Central1. gs://regev-lab, the source of the reference genome, is still a multi-region bucket, so workflows incur a "Network Data Transfer GCP Multi-region within Northern America" charge that is 9 times what is expected (there would be little or no transfer charge if the references were in the same region).
I understand that some users may operate out of other regions, which necessitates the multi-region bucket. However, if most users are running on Terra in US-Central1, the simplest fix might be to create a bucket with the references there.
Alternatively, since the reference genome is specified as a string at the workflow level, might it be possible to temporarily localize the reference across regions to a bucket (such as the workspace bucket), from which the workflow then copies it to the per-sample shards?
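To illustrate what I mean by staging, here is a minimal sketch of the idea rather than a proposed implementation. The function names, the `staged_references/` prefix, the object path under gs://regev-lab, and the workspace bucket name are all hypothetical. It only builds the `gsutil cp` command and the new URI, and does not run anything.

```python
from typing import List


def staged_reference_uri(reference_uri: str, workspace_bucket: str) -> str:
    """Return the URI the reference would have once staged in the workspace bucket."""
    prefix = "gs://"
    if not reference_uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {reference_uri}")
    # Split "bucket/path/to/object" into the bucket name and the object path.
    _bucket, _sep, object_path = reference_uri[len(prefix):].partition("/")
    if not object_path:
        raise ValueError(f"URI has no object path: {reference_uri}")
    # Hypothetical prefix; the object path is kept so the layout is unchanged.
    return f"gs://{workspace_bucket}/staged_references/{object_path}"


def build_stage_command(reference_uri: str, workspace_bucket: str) -> List[str]:
    """Build (but do not run) a single cross-region copy of the reference."""
    destination = staged_reference_uri(reference_uri, workspace_bucket)
    return ["gsutil", "cp", reference_uri, destination]


if __name__ == "__main__":
    # Both values below are placeholders, not real paths.
    reference = "gs://regev-lab/path/to/reference.tar.gz"
    workspace = "fc-example-workspace"
    print(" ".join(build_stage_command(reference, workspace)))
    print("per-sample shards would read:", staged_reference_uri(reference, workspace))
```

The intent is that the cross-region transfer happens once per run, and the per-sample shards then read the staged copy from a bucket in the same region as the compute.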
Thanks,
Mark Godek
Data Manager, Bioinformatics Specialist
Shalek Lab