Google Lifesciences API slow file staging

Andrew Schriefer

Mar 15, 2021, 4:52:30 PM

to Nextflow

Hello,

I have been using nextflow-21.03.0-edge with the google-lifesciences executor and I noticed that an alignment job which relies on a large index file (50 GB) stored in a google bucket takes much longer than it does locally. This led me to believe the index file staging from the bucket was taking a long time.

I manually created a c2-standard-16 machine to benchmark gsutil downloading a GRCh38 reference sequence (details below). I found that the gsutil options -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' increased download speed by almost an order of magnitude for the cloud-sdk:slim image (about 30 MiB/s to 240 MiB/s). The cloud-sdk:alpine image showed only a small improvement (about 30 MiB/s to 40 MiB/s). I have not tested upload performance yet.

Would it be possible to specify these arguments to the google-lifesciences executor to improve staging performance?
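
For illustration, here is a minimal sketch of how the tested options could be attached to a staging command. The helper name gsutil_cp_command and its parameters are hypothetical, not part of Nextflow or gsutil; only the two -o settings and the -m flag come from the benchmark described above.

```python
import shlex


def gsutil_cp_command(src, dest, thread_count=1, max_components=8):
    """Build a gsutil cp argument list with per-invocation -o overrides.

    Hypothetical helper: the option names are the ones benchmarked in this
    post, the function itself is only an illustration.
    """
    return [
        "gsutil",
        "-m",
        "-o", f"GSUtil:parallel_thread_count={thread_count}",
        "-o", f"GSUtil:sliced_object_download_max_components={max_components}",
        "cp", src, dest,
    ]


cmd = gsutil_cp_command(
    "gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa",
    ".",
)
# Print a shell-safe rendering of the command (requires Python 3.8+).
print(shlex.join(cmd))
```

Keeping the thread count and component count as parameters would let an executor expose them as configuration rather than hard-coding the values I happened to test.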

I took the options I tested from this article, which also benchmarks gsutil:

https://jbrojbrojbro.medium.com/slice-up-your-life-large-file-download-optimization-50ee623b708c

The reference file I tested is here (2.9 GiB):

gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa

Alpine Test:

>docker run --rm -it gcr.io/google.com/cloudsdktool/cloud-sdk:alpine bash

>gsutil -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .

Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...

| [1/1 files][ 2.9 GiB/ 2.9 GiB] 100% Done 26.6 MiB/s ETA 00:00:00

Operation completed over 1 objects/2.9 GiB.

>gsutil -m -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .

Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...

/ [0 files][ 2.9 GiB/ 2.9 GiB] 40.7 MiB/s

Slim test:

>docker run --rm -it gcr.io/google.com/cloudsdktool/cloud-sdk:slim bash

>gsutil -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .

Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...

\ [1/1 files][ 2.9 GiB/ 2.9 GiB] 100% Done 31.7 MiB/s ETA 00:00:00

>gsutil -m -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .

Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...

/ [1/1 files][ 2.9 GiB/ 2.9 GiB] 100% Done 239.4 MiB/s ETA 00:00:00
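
To put the numbers above side by side, the following sketch computes the speedup and the approximate transfer time for the 2.9 GiB file from the throughputs shown in the progress lines. It assumes the reported rate is sustained for the whole transfer, which is a simplification; the tuned Alpine figure is taken from the last progress line shown for that run.

```python
MIB_PER_GIB = 1024

# Throughputs in MiB/s from the runs above: (default options, tuned options)
runs = {
    "alpine": (26.6, 40.7),
    "slim": (31.7, 239.4),
}


def speedup(default_mibs, tuned_mibs):
    """Ratio of tuned throughput to default throughput."""
    return tuned_mibs / default_mibs


def transfer_seconds(size_gib, mibs):
    """Seconds to move size_gib at a constant rate of mibs MiB/s."""
    return size_gib * MIB_PER_GIB / mibs


for image, (default, tuned) in runs.items():
    print(
        f"{image}: {speedup(default, tuned):.2f}x faster, "
        f"{transfer_seconds(2.9, default):.0f}s -> "
        f"{transfer_seconds(2.9, tuned):.0f}s for 2.9 GiB"
    )
```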

Paolo Di Tommaso

Mar 15, 2021, 5:02:09 PM

to nextflow

This is interesting. Please consider submitting a pull request.

--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nextflow/1e8c6229-da0a-41fd-a826-a942c865039fn%40googlegroups.com.

Andrew Schriefer

Mar 15, 2021, 5:09:22 PM

to Nextflow

I believe the poor performance on Alpine Linux is related to this code snippet in the GoogleCloudPlatform gsutil repository on GitHub:

https://github.com/GoogleCloudPlatform/gsutil/blob/431064e37f4df8c6e7b2c186ec7ca0726b7445c3/gslib/commands/config.py

It may be that the multiprocessing option which improved performance on the Debian-based cloud-sdk:slim image is ignored on Alpine.

# On Windows and Alpine Linux, Python multi-processing presents various
# challenges so we retain compatibility with the established parallel mode
# operation, i.e. one process and 24 threads.
should_prohibit_multiprocessing, unused_os = ShouldProhibitMultiprocessing()
if should_prohibit_multiprocessing:
  DEFAULT_PARALLEL_PROCESS_COUNT = 1
  DEFAULT_PARALLEL_THREAD_COUNT = 24
else:
  DEFAULT_PARALLEL_PROCESS_COUNT = min(multiprocessing.cpu_count(), 32)
  DEFAULT_PARALLEL_THREAD_COUNT = 5
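
As a rough illustration of what that snippet implies for my test machine, here is a standalone sketch that mirrors the branch logic. The function default_parallelism is hypothetical and only restates the quoted defaults; it does not call gsutil, and whether Alpine actually takes the first branch at runtime is my assumption.

```python
import multiprocessing


def default_parallelism(prohibit_multiprocessing, cpu_count=None):
    """Return (process_count, thread_count) per the quoted gsutil defaults."""
    if cpu_count is None:
        cpu_count = multiprocessing.cpu_count()
    if prohibit_multiprocessing:
        # Windows / Alpine path: one process, many threads.
        return 1, 24
    # Other platforms: one process per CPU (capped at 32), 5 threads each.
    return min(cpu_count, 32), 5


# A c2-standard-16 machine has 16 vCPUs.
print(default_parallelism(True, cpu_count=16))   # (1, 24)
print(default_parallelism(False, cpu_count=16))  # (16, 5)
```

If this reading is right, Alpine would run a single process regardless of CPU count, which would be consistent with the smaller improvement I measured there.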

Andrew Schriefer

Mar 16, 2021, 1:51:33 PM

to Nextflow

I opened a pull request here:

https://github.com/nextflow-io/nextflow/pull/1973

Paolo Di Tommaso

Mar 16, 2021, 1:53:45 PM

to nextflow

Awesome, thanks a lot.
