Google Lifesciences API slow file staging

Andrew Schriefer

Mar 15, 2021, 4:52:30 PM
to Nextflow

I have been using nextflow-21.03.0-edge with the google-lifesciences executor, and I noticed that an alignment job that relies on a large index file (50 GB) stored in a Google bucket takes much longer than it does locally. This led me to believe that staging the index file from the bucket was taking a long time.

I manually created a c2-standard-16 machine to benchmark gsutil downloading a GRCh38 reference sequence (details below). I found that the gsutil options -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' increased download speed by almost an order of magnitude for the cloud-sdk:slim image (30 MiB/s to 240 MiB/s). The cloud-sdk:alpine image showed only a small improvement (30 MiB/s to 40 MiB/s). I have not tested upload performance yet.

Would it be possible to specify these arguments to the google-lifesciences executor to improve staging performance?
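
As a rough illustration of what passing these options through would involve, here is a minimal Python sketch that assembles the gsutil command line from a dictionary of overrides. The helper name build_gsutil_cp and its parameters are hypothetical and are not part of Nextflow or gsutil; only the two GSUtil option names and the -m / -o flags come from the commands tested in this post. The sketch only builds and prints the command; it does not run gsutil.

```python
import shlex


def build_gsutil_cp(src, dst, boto_options=None, parallel=True):
    """Return the argv list for a `gsutil cp` call (hypothetical helper)."""
    argv = ["gsutil"]
    if parallel:
        argv.append("-m")
    # -o is a top-level gsutil option, so it goes before the `cp` subcommand.
    for key, value in (boto_options or {}).items():
        argv += ["-o", f"{key}={value}"]
    argv += ["cp", src, dst]
    return argv


# Option values taken from the benchmark in this post.
tuned = {
    "GSUtil:parallel_thread_count": 1,
    "GSUtil:sliced_object_download_max_components": 8,
}

cmd = build_gsutil_cp(
    "gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa",
    ".",
    boto_options=tuned,
)
print(shlex.join(cmd))
```

An executor setting could, in principle, feed a dictionary like tuned into the staging command in the same way, but how Nextflow would expose that is an open question here.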

I took the options I tested from this article which did gsutil benchmarking as well:

The reference file I tested is here (2.9 GiB):

Alpine Test:

>gsutil -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .
Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...
| [1/1 files][  2.9 GiB/  2.9 GiB] 100% Done  26.6 MiB/s ETA 00:00:00
Operation completed over 1 objects/2.9 GiB.

>gsutil -m -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .
Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...
/ [0 files][  2.9 GiB/  2.9 GiB]   40.7 MiB/s

Slim test:

>gsutil -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .                             
Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...
\ [1/1 files][  2.9 GiB/  2.9 GiB] 100% Done  31.7 MiB/s ETA 00:00:00

>gsutil -m -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .
Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...
/ [1/1 files][  2.9 GiB/  2.9 GiB] 100% Done 239.4 MiB/s ETA 00:00:00
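
To put the numbers above in perspective, here is a small Python sketch that computes the speedup per image and a rough staging time for the 50 GB index. The throughput figures are copied from the gsutil output above. Treating 50 GB as 50 GiB, and assuming the rate measured on the 2.9 GiB file would hold for a larger file, are both assumptions; the function names are made up for this example.

```python
# Throughput in MiB/s, copied from the gsutil runs above.
measured = {
    "alpine": {"default": 26.6, "tuned": 40.7},
    "slim": {"default": 31.7, "tuned": 239.4},
}


def speedup(rates):
    """Ratio of tuned to default throughput."""
    return rates["tuned"] / rates["default"]


def staging_minutes(size_gib, rate_mib_s):
    """Minutes needed to download size_gib at a sustained rate_mib_s."""
    return size_gib * 1024 / rate_mib_s / 60


INDEX_GIB = 50  # assumption: the 50 GB index is treated as 50 GiB

for image, rates in measured.items():
    before = staging_minutes(INDEX_GIB, rates["default"])
    after = staging_minutes(INDEX_GIB, rates["tuned"])
    print(f"{image}: {speedup(rates):.1f}x, {before:.0f} min -> {after:.0f} min")
```

Under those assumptions the slim image improves by roughly 7.5x and the alpine image by roughly 1.5x.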

Paolo Di Tommaso

Mar 15, 2021, 5:02:09 PM
to nextflow
This is interesting. Please consider submitting a pull request.

Andrew Schriefer

Mar 15, 2021, 5:09:22 PM
to Nextflow
I believe the poor performance on Alpine Linux is related to this code snippet from the GoogleCloudPlatform GitHub here:

It may be that the multiprocessing option which improved performance on the Debian-based cloud-sdk:slim image is ignored on Alpine.

# On Windows and Alpine Linux, Python multi-processing presents various
# challenges so we retain compatibility with the established parallel mode
# operation, i.e. one process and 24 threads.
should_prohibit_multiprocessing, unused_os = ShouldProhibitMultiprocessing()
if should_prohibit_multiprocessing:
  DEFAULT_PARALLEL_PROCESS_COUNT = 1
  DEFAULT_PARALLEL_THREAD_COUNT = 24
else:
  DEFAULT_PARALLEL_PROCESS_COUNT = min(multiprocessing.cpu_count(), 32)
  DEFAULT_PARALLEL_THREAD_COUNT = 5
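
To make the suspected difference concrete, here is a minimal Python sketch of how the default process count could differ between the two images. It is not gsutil's implementation: is_alpine and default_process_count are hypothetical names, detecting Alpine through the ID field of /etc/os-release is an assumption, and the "one process when multiprocessing is prohibited" rule is taken from the comment in the snippet above.

```python
import multiprocessing


def is_alpine(os_release_text):
    """Return True if os-release content declares ID=alpine (assumed check)."""
    for line in os_release_text.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "ID":
            return value.strip().strip("\"'") == "alpine"
    return False


def default_process_count(prohibit_multiprocessing, cpu_count=None):
    """Default number of gsutil processes, per the comment in the snippet."""
    if prohibit_multiprocessing:
        return 1  # single process on Windows and Alpine
    if cpu_count is None:
        cpu_count = multiprocessing.cpu_count()
    return min(cpu_count, 32)


# A c2-standard-16 machine has 16 vCPUs.
alpine_release = 'NAME="Alpine Linux"\nID=alpine\n'
debian_release = 'NAME="Debian GNU/Linux"\nID=debian\n'
for name, text in [("alpine", alpine_release), ("debian", debian_release)]:
    print(name, default_process_count(is_alpine(text), cpu_count=16))
```

If this reading is right, setting parallel_thread_count=1 on Alpine would leave a single process with a single thread, which might explain why the sliced download gains so little there.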

Andrew Schriefer

Mar 16, 2021, 1:51:33 PM
to Nextflow

Paolo Di Tommaso

Mar 16, 2021, 1:53:45 PM
to nextflow
Awesome, thanks a lot.
