current status of "dsub" ?

Sheila Reynolds

Oct 29, 2021, 8:07:39 PM
to GCP Life Sciences Discuss
Hi all,

I'm not sure if this is the right place to ask this question, but I thought I'd give it a try ;-)
Is dsub still being supported, or should people be phasing out usage of dsub and switching to <what>? I haven't been following the development of the GA4GH WES and TES standards too closely, so if someone could kindly point me towards the latest and greatest, I'd appreciate that! Oh, and cross-platform and/or platform-agnostic is important to me.

thanks,

Sheila

Sheila Reynolds

Oct 30, 2021, 12:04:17 PM
to GCP Life Sciences Discuss
as a follow-up question -- would it be relatively easy, say, to use Cromwell via a Python API/SDK to replicate what dsub does?  if I didn't want to go all in on a WDL workflow?  or should I just embrace WDL (if I want to use Cromwell)

thanks,

Sheila

Paul Grosu

Oct 30, 2021, 4:37:57 PM
to GCP Life Sciences Discuss
Hi Sheila,

Cromwell is not the only one - and some are Python based:


Galaxy might be another one, but not sure as compared to the previous two.

Again, it depends on what your current process is and the minimal requirements for you to shift over. I'd try them all, or maybe you could provide some simple examples here of some of your workflows to see what might be the easiest transition in your case. I'm also not sure of the current dsub roadmap, but that's probably more for the Google folks to expand on.

Hope it helps,
~p

Sean Davis

Oct 31, 2021, 1:06:58 PM
to GCP Life Sciences Discuss
Hi, Sheila.

I've used nextflow in the context of GCP and it works well. Nextflow runs on the JVM (Groovy) and installs with one line. No server is needed at all. Pipelines written using nextflow run on HPC clusters or workstations without modification, though a configuration for each environment is needed. Nextflow relies on containers for cloud execution. It has a very large user community and is very actively developed.

Snakemake is also quite good, but I haven't used it in a few years and have adopted nextflow instead. 

Sean

Joe Slagel

Nov 1, 2021, 10:28:59 AM
to Sean Davis, GCP Life Sciences Discuss
Sheila,

My opinion is to use the right tool for the right kind of job. I see systems such as cromwell, nextflow, and snakemake as ideal tools for running more sophisticated analyses that have multiple steps, where you want to leverage their abilities to do distributed computing, error handling, tailored resources, etc. But they can also be more complicated to write and manage, so to me they are a better fit for repeated, standardized processes. I see dsub filling a very important and useful niche as more of a tool for quick, ad-hoc distributed processing: things where you might only need to run a program on a large dataset. For example, I once helped out a bioinformatician who was trying to run ericscript on a set of 700+ files from GDC. He was running it on his laptop, found that a single file was taking more than 12 hours to run, and didn't want to wait a few months for the processing to finish. Since the files were already in GCP, it was trivial to grab an ericscript docker image and set up a quick dsub job from the command line to process all 700 files mostly in parallel, in under a day, using preemptible VMs to keep the costs down. You certainly can write a cromwell/nextflow/snakemake pipeline to do the same processing; it just seemed easier and faster to do it with dsub.
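
To make the "one program over many files" pattern concrete, here is a minimal Python sketch of how such a job could be prepared. It only builds a dsub tasks file (one row per input) and the command-line arguments; it does not launch anything. The flags (`--provider google-cls-v2`, `--tasks`, `--preemptible`, `--logging`, `--wait`) are as I understand them from the dsub README. The bucket paths, project ID, image name, and `my_tool` command are hypothetical placeholders, not the actual ericscript invocation.

```python
def build_tasks_tsv(input_uris, output_prefix):
    """Return the text of a dsub --tasks file: a header row, then one row per input."""
    # Each header cell names a dsub flag plus the environment variable
    # that the command will see inside the container.
    rows = ["\t".join(["--input INPUT_FILE", "--output OUTPUT_FILE"])]
    for uri in input_uris:
        name = uri.rsplit("/", 1)[-1]
        rows.append("\t".join([uri, f"{output_prefix}/{name}.out"]))
    return "\n".join(rows) + "\n"


def build_dsub_argv(project, logging_uri, image, tasks_path, command):
    """Assemble the dsub command line as an argv list (nothing is executed here)."""
    return [
        "dsub",
        "--provider", "google-cls-v2",
        "--project", project,
        "--regions", "us-central1",
        "--logging", logging_uri,
        "--image", image,
        "--tasks", tasks_path,
        "--preemptible",  # use preemptible VMs to keep costs down
        "--command", command,
        "--wait",
    ]


if __name__ == "__main__":
    # Hypothetical inputs; in practice the list would come from a bucket listing.
    inputs = [f"gs://my-bucket/in/sample{i}.bam" for i in range(3)]
    print(build_tasks_tsv(inputs, "gs://my-bucket/out"), end="")
    print(build_dsub_argv(
        project="my-project",
        logging_uri="gs://my-bucket/logs",
        image="my-registry/my-tool:latest",
        tasks_path="tasks.tsv",
        command='my_tool "${INPUT_FILE}" > "${OUTPUT_FILE}"',
    ))
```

Writing the TSV to `tasks.tsv` and running the resulting argv (for example with `subprocess.run`) would start one task per row, which is what lets the 700 files run mostly in parallel.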

-Joe

--
You received this message because you are subscribed to the Google Groups "GCP Life Sciences Discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gcp-life-sciences-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gcp-life-sciences-discuss/feb309d9-ffc7-459f-bfe7-3990e523f3f2n%40googlegroups.com.

mboo...@google.com

Nov 1, 2021, 12:00:42 PM
to GCP Life Sciences Discuss
Hi Sheila!

dsub certainly is still being supported, although we have not made any significant updates in a while.
I think that today, dsub continues to be a very good tool for a category of tasks, just as Joe has indicated.

FWIW, we explored the idea of creating a Cromwell back-end for dsub a few years back, but have not yet had the chance to build it.

-Matt

Paul Grosu

Nov 1, 2021, 5:43:20 PM
to GCP Life Sciences Discuss

Hi Matt,

So would it be okay to assume that dsub is a wholly/mainly Google/Verily directly-supported project under the DataBiosphere umbrella (since before it was under googlegenomics)?

Thanks,
~p

Kyle Vernest

Nov 1, 2021, 5:52:38 PM
to Sheila Reynolds, GCP Life Sciences Discuss
Hi Sheila,

Just one more thing to add to the conversation. The University of Melbourne was previously developing a product, Janis, that takes Python and converts it to CWL and WDL. It might be worth having a look and potentially trying it out: it could be a great option for using Cromwell with just Python, without needing to write WDL.


--

Kyle Vernest
Associate Director, Product Management
Data Sciences Platform
Broad Institute of MIT and Harvard
105 Broadway
Cambridge, MA 02142
kvernest@broadinstitute.org


mboo...@google.com

Nov 2, 2021, 2:17:36 PM
to GCP Life Sciences Discuss
Hi Paul!

Your description sounds right to me. dsub originated at Verily and has been maintained there over the years, with occasional contributions from engineers at Google.

Thanks,

-Matt

Sheila Reynolds

Nov 2, 2021, 3:44:35 PM
to mboo...@google.com, GCP Life Sciences Discuss
thanks all (and a special "hi" to Matt B and Sean D! I hope you are both doing well!)

Matt, on a related note, can you comment on the pipelines API -- which I see has graduated to "gcloud beta lifesciences pipelines" :-)   
It looks like that is still actively maintained etc -- is it also considered a GA4GH TES API or are there still differences in the detailed specs?

cheers,

Sheila


mboo...@google.com

Nov 2, 2021, 6:16:14 PM
to GCP Life Sciences Discuss
Looking at the TES spec:
and comparing it to the Life Sciences REST API:
it looks structurally similar, but there does not appear to be any convergence there.
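
As a rough illustration of "structurally similar, but not converged", the Python sketch below builds the same single-container job as a TES task body and as a Life Sciences `pipelines.run` request body. The field names follow my reading of the GA4GH TES v1 schema and the Cloud Life Sciences v2beta REST reference; the image, command, and resource values are made-up placeholders, and neither body is sent anywhere.

```python
import json


def tes_task(image, command, cpu_cores, ram_gb, preemptible):
    """Task body in the shape of the GA4GH TES v1 schema."""
    return {
        "name": "example-task",
        "executors": [{"image": image, "command": command}],
        "resources": {
            "cpu_cores": cpu_cores,
            "ram_gb": ram_gb,
            "preemptible": preemptible,
        },
    }


def life_sciences_request(image, command, machine_type, regions, preemptible):
    """Request body in the shape of Cloud Life Sciences v2beta pipelines.run."""
    return {
        "pipeline": {
            "actions": [{"imageUri": image, "commands": command}],
            "resources": {
                "regions": regions,
                "virtualMachine": {
                    "machineType": machine_type,
                    "preemptible": preemptible,
                },
            },
        }
    }


if __name__ == "__main__":
    cmd = ["echo", "hello"]
    print(json.dumps(tes_task("ubuntu:20.04", cmd, 1, 4.0, True), indent=2))
    print(json.dumps(
        life_sciences_request("ubuntu:20.04", cmd, "n1-standard-1", ["us-central1"], True),
        indent=2,
    ))
```

Both describe a list of containers to run plus a resources block, but the names and nesting differ (`executors`/`image`/`command` versus `actions`/`imageUri`/`commands`, abstract cores and RAM versus a concrete machine type), so a body written for one API cannot be posted to the other without translation.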

-Matt

Paul Grosu

Nov 2, 2021, 11:44:59 PM
to GCP Life Sciences Discuss
Hi Matt,

That's awesome and many thanks for the continued support to the community!

Cheers,
Paul

Paul Grosu

Nov 2, 2021, 11:56:08 PM
to GCP Life Sciences Discuss
Hi Sheila,

Matt is exactly right!

Basically those (dsub/pipelines vs TES) are two different worlds. Though Google initially started implementing the first GA4GH version, there was a parallel project -- if I'm remembering right -- with the Broad (Institute) for something simple to shift their sequencing computation from local, on-site machines to the Cloud. That was initially called JES (Job Execution Service), out of which the Cromwell client became what it is today. JES then became the Google Genomics Pipelines API, since it looked like more people might benefit from it. That's how dsub came to be, as it provides the key requirements to interface with Pipelines. With PAPIv2 (version 2 of the Pipelines API), as it is sometimes called today, you get the server for free, which can save you quite a bit of money.

If you want to use TES, then you would need to set up a server on the Google Cloud infrastructure, for example Funnel: https://github.com/ohsu-comp-bio/funnel

The discussions are still going on with TES, and you are welcome to join those meetings. TES is more a schema that folks will implement for any kind of Cloud infrastructure, so it has to be more comprehensive and flexible. PAPIv2 has been stable for a while, since its scope is narrower: it aims to be minimally necessary and sufficient for docker-as-a-service on Google Cloud resources.

I'm sure Google folks can fill in if I forgot something.

Hope it helps,
~p