Google Lifesciences API slow file staging


Andrew Schriefer

Mar 15, 2021, 4:52:30 PM
to Nextflow
Hello,

I have been using nextflow-21.03.0-edge with the google-lifesciences executor, and I noticed that an alignment job which relies on a large index file (50 GB) stored in a Google bucket takes much longer than it does locally. This led me to suspect that staging the index file from the bucket was taking a long time.

I manually created a c2-standard-16 machine to run some benchmarks of gsutil downloading a GRCh38 reference sequence (details below). I found that the gsutil options -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' increased download speed by almost an order of magnitude on the cloud-sdk:slim image (30 MiB/s to 240 MiB/s). The cloud-sdk:alpine image showed only a small improvement (30 MiB/s to 40 MiB/s). I have not tested upload performance yet.
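For what it's worth, the same overrides can also be made persistent through gsutil's boto configuration instead of per-command -o flags. A rough sketch (whether the Life Sciences staging step would honor a BOTO_PATH set this way is an assumption I have not tested):

# Write the tuning options to a separate boto config file rather than
# editing ~/.boto in place, so any existing credentials are left untouched.
cat > /tmp/gsutil_tuning.boto <<'EOF'
[GSUtil]
parallel_thread_count = 1
sliced_object_download_max_components = 8
EOF

# BOTO_PATH accepts a colon-separated list of config files; the default
# config still loads and the tuning file overrides it.
export BOTO_PATH="$HOME/.boto:/tmp/gsutil_tuning.boto"

# Subsequent gsutil invocations then pick up the overrides without -o flags:
gsutil -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .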

Would it be possible to specify these arguments to the google-lifesciences executor to improve staging performance?

I took the options I tested from this article, which did gsutil benchmarking as well:

The reference file I tested is here (2.9 GiB):
gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa 

Alpine Test:

>gsutil -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .
Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...
| [1/1 files][  2.9 GiB/  2.9 GiB] 100% Done  26.6 MiB/s ETA 00:00:00
Operation completed over 1 objects/2.9 GiB.

>gsutil -m -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .
Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...
/ [0 files][  2.9 GiB/  2.9 GiB]   40.7 MiB/s

Slim Test:

>gsutil -m cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .                             
Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...
\ [1/1 files][  2.9 GiB/  2.9 GiB] 100% Done  31.7 MiB/s ETA 00:00:00

>gsutil -m -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' cp gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa .
Copying gs://gcp-public-data--broad-references/hg38/v0/GRCh38.primary_assembly.genome.fa...
/ [1/1 files][  2.9 GiB/  2.9 GiB] 100% Done 239.4 MiB/s ETA 00:00:00

Paolo Di Tommaso

Mar 15, 2021, 5:02:09 PM
to Nextflow
This is interesting. Consider submitting a pull request.


Andrew Schriefer

Mar 15, 2021, 5:09:22 PM
to Nextflow
I believe the poor performance on Alpine Linux is related to a code snippet from the GoogleCloudPlatform GitHub repository (quoted below):

It may be that the multiprocessing mode, which improved performance on the Debian-based cloud-sdk image, is disabled on Alpine:

# On Windows and Alpine Linux, Python multi-processing presents various
# challenges so we retain compatibility with the established parallel mode
# operation, i.e. one process and 24 threads.
should_prohibit_multiprocessing, unused_os = ShouldProhibitMultiprocessing()
if should_prohibit_multiprocessing:
  DEFAULT_PARALLEL_PROCESS_COUNT = 1
  DEFAULT_PARALLEL_THREAD_COUNT = 24
else:
  DEFAULT_PARALLEL_PROCESS_COUNT = min(multiprocessing.cpu_count(), 32)
  DEFAULT_PARALLEL_THREAD_COUNT = 5
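Spelling out what those defaults would mean for the machines in my tests (back-of-the-envelope only; the os-release check below is just a generic way to confirm the base image, not gsutil's own detection logic):

# Which base image am I on? (generic check, not gsutil's detection code)
grep '^ID=' /etc/os-release    # ID=alpine vs ID=debian

# Worker pools implied by the defaults above on a c2-standard-16 (16 vCPUs):
#   cloud-sdk:alpine : 1 process x 24 threads -> all workers are threads in
#                      a single Python interpreter  (~30-40 MiB/s observed)
#   cloud-sdk:slim   : min(16, 32) = 16 processes x 5 threads = 80 workers
#                      spread across separate processes (~240 MiB/s observed)
echo "$(( $(nproc) < 32 ? $(nproc) : 32 )) processes x 5 threads"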

Andrew Schriefer

Mar 16, 2021, 1:51:33 PM
to Nextflow

Paolo Di Tommaso

Mar 16, 2021, 1:53:45 PM
to Nextflow
Awesome, thanks a lot.
