Hello,
I have been using nextflow-21.03.0-edge with the google-lifesciences executor and I noticed that an alignment job which relies on a large index file (50 GB) stored in a google bucket takes much longer than it does locally. This led me to believe the index file staging from the bucket was taking a long time.
I manually created a c2-standard-16 machine to benchmark gsutil downloading a GRCh38 reference sequence (details below). I found that the gsutil options -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' increased download speed by almost an order of magnitude for the cloud-sdk:slim image (roughly 30 MiB/s to 240 MiB/s). The cloud-sdk:alpine image showed only a small improvement (roughly 30 MiB/s to 40 MiB/s). I have not tested upload performance yet.
Would it be possible to specify these arguments to the google-lifesciences executor to improve staging performance?
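As a possible interim workaround (an untested sketch, not something I have verified with the google-lifesciences executor), gsutil can also read the same two settings from the [GSUtil] section of a boto configuration file, located through the BOTO_CONFIG environment variable. The file path below is an arbitrary choice for illustration, and whether the executor's staging step would pick the file up is an assumption.

```shell
# Write a boto config holding the two options benchmarked in this post.
# The file location is hypothetical; any readable path should work.
boto_file="${TMPDIR:-/tmp}/gsutil-staging.boto"
cat > "$boto_file" <<'EOF'
[GSUtil]
parallel_thread_count = 1
sliced_object_download_max_components = 8
EOF

# Point gsutil at this file instead of the default ~/.boto.
export BOTO_CONFIG="$boto_file"

# A plain copy would then use the settings without any -o flags (not run here):
# gsutil -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .
```

Note that BOTO_CONFIG replaces the default configuration path, so any credentials kept in ~/.boto would need to be provided some other way.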
I took the options I tested from this article which did gsutil benchmarking as well:
The reference file I tested is here (2.9 GiB):
gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa
Alpine test:
>gsutil -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .
Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...
| [1/1 files][ 2.9 GiB/ 2.9 GiB] 100% Done 26.6 MiB/s ETA 00:00:00
Operation completed over 1 objects/2.9 GiB.
>gsutil -m -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .
Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...
/ [0 files][ 2.9 GiB/ 2.9 GiB] 40.7 MiB/s
Slim test:
>gsutil -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .
Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...
\ [1/1 files][ 2.9 GiB/ 2.9 GiB] 100% Done 31.7 MiB/s ETA 00:00:00
>gsutil -m -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .
Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...
/ [1/1 files][ 2.9 GiB/ 2.9 GiB] 100% Done 239.4 MiB/s ETA 00:00:00
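To put these rates in terms of staging time, here is a rough back-of-the-envelope sketch. It assumes the reported rate holds steady for the whole transfer and treats the 50 GB index as 50 GiB; the helper name estimate_seconds is made up for this example.

```shell
# Estimate transfer time in whole seconds.
# Usage: estimate_seconds SIZE_GIB RATE_MIB_PER_S
estimate_seconds() {
  awk -v size_gib="$1" -v rate="$2" \
    'BEGIN { printf "%.0f\n", size_gib * 1024 / rate }'
}

# 2.9 GiB reference at the two slim-image rates measured above
estimate_seconds 2.9 31.7     # default options
estimate_seconds 2.9 239.4    # sliced download options

# 50 GiB index (assumed size) at the same two rates
estimate_seconds 50 31.7
estimate_seconds 50 239.4
```

Under those assumptions the index would take about 27 minutes to stage at the default rate and under 4 minutes with the sliced download options.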