COVID-19 / galaxy / data / biohackathon


Rutger Vos

Mar 27, 2020, 7:47:46 AM
to sp...@temple.edu, virtual biohackathon COVID-19 2020
Hi Sergei,

I hope you are well!

I saw the work that the Galaxy people (you included) have been doing for that "no business as usual" preprint. Very nice.

Meanwhile, the BioHackathon people are gearing up to try to do something useful. We're starting to think that maybe the key thing right now is to make sure there's relatively unified access to viral genome data. Do you agree or are there other more urgent barriers?

More specifically, there are some challenges with getting GISAID data, i.e. the viral genomes taken directly from patients and not submitted to INSDC databases. Do you see that as an issue to try to address?

Thanks! All the best!

Dr. Rutger A. Vos
Researcher / Bioinformatician
+31717519600 - +31627085806
Darwinweg 2, 2333 CR Leiden
Postbus 9517, 2300 RA Leiden

Pjotr Prins

Mar 27, 2020, 8:50:52 AM
to Rutger Vos, sp...@temple.edu, virtual biohackathon COVID-19 2020
Galaxy gets mentioned in several threads now, but the Galaxy folks are
not commenting. What do we do? I think Galaxy is partly funded by
Elixir, maybe they can pull some people in?

It may be a good idea to use an existing Galaxy instance. But how do
you develop against such a beast? And does it support workflow
backends we want to develop?

Personally I think we can have a light-weight web interface on top of
an existing API.

Uploading largish files is non-trivial. Maybe we can use Arvados or
Galaxy to upload the data into some storage and use an Arvados or
Galaxy API to develop the pipelines? Or should we use something like
Amazon S3 or Google Drive instead?

We can hit the ground running if we use existing infrastructure. But
what existing infrastructure to use?

Pj.


Frederik Coppens

Mar 27, 2020, 9:36:11 AM
to virtual biohackathon COVID-19 2020
Hi

A brief reply on this: I was involved in rolling out the Galaxy workflows of the paper mentioned (and am co-leading the ELIXIR Galaxy Community).

There are a LOT of communication channels, and it is hard to keep track. ELIXIR has also already channelled resources into this and is channelling more. Many of us are also involved in national initiatives with insane deadlines, which does not help us stay on top of things. The regular Gitter channel is probably the best place to get hold of Galaxy people.

For data, I'm reluctant to set up yet another place; EBI has an action plan (https://www.ebi.ac.uk/covid-19). EBI is also in contact with GISAID.

ELIXIR also made an overview https://elixir-europe.org/covid-19-resources

I don't see the point of developing a workflow backend at this point; do we really want to spend time on this now? In Galaxy we are mainly focussing on implementing the analysis workflows and enabling their rollout on different infrastructures.

usegalaxy.eu is the focal point on the EU side for me, and has already been running massive amounts of analyses (see https://covid19.galaxyproject.org).
The Laniakea project in Italy is also of interest for setting up custom Galaxy environments: https://laniakea-elixir-it.github.io

Hope this already helps a bit.

cheers

Frederik

Pjotr Prins

Mar 27, 2020, 9:42:26 AM
to Frederik Coppens, virtual biohackathon COVID-19 2020
Thanks for your quick response Frederik. It clarifies we should not
count on Galaxy right now. I think our aims are quite different
and I really appreciate all the hard work you are putting in!

So, folks, what are the other options? We would like an existing
framework to upload files and run pipelines.

Pj.


Frederik Coppens

Mar 27, 2020, 10:50:02 AM
to virtual biohackathon COVID-19 2020
That's not what I meant, and I'm not sure how your aims differ. We are trying to enable researchers to do analyses and reproduce those of others (on their own data).

Galaxy instances across the world can be used for running workflows (usegalaxy.org, usegalaxy.eu, usegalaxy.org.au to name a few). There are 1000s of CPU cores (and some GPUs) available behind these systems.
We can add/wrap tools if they are not yet there.
Data can be uploaded using the ways that already exist. But Galaxy is not a data repository, so making it the central place to find all data is not in scope IMO.

F


On Friday 27 March 2020 14:42:26 UTC+1, pjotrp wrote:

Peter Amstutz

Mar 27, 2020, 10:58:26 AM
to Pjotr Prins, Rutger Vos, sp...@temple.edu, virtual biohackathon COVID-19 2020
Hi Pjotr & all,

Curii would like to sponsor an Arvados cluster for the biohackathon
which can be used for data upload, management, and sharing as well as
running CWL workflows. We can provide the sysadmin support and have
contacted a cloud vendor to find out if they will donate credits for
this project. We are waiting to hear back from them and once we have
the credits we will start setting up a biohackathon cluster
immediately.

Thanks,
Peter

Tony Wildish

Mar 27, 2020, 1:52:16 PM
to Pjotr Prins, Rutger Vos, sp...@temple.edu, virtual biohackathon COVID-19 2020
Hi,

On 27/03/2020 12:50, Pjotr Prins wrote:
[...]
> Uploading largish files is non-trivial. Maybe we can use Arvados or
> Galaxy to upload the data into some storage and use an Arvados or
> Galaxy API to develop the pipelines? Or should we use something like
> Amazon S3 or Google Drive instead?
>
> We can hit the ground running if we use existing infrastructure. But
> what existing infrastructure to use?
[...]

On this specific point: Google Drive won't scale, and you'll be bandwidth-limited fairly quickly. S3 or Google Cloud Storage are vastly more practical.

On a more general point, I'm struggling to get my head around what's going to happen in this project. There are a lot of good ideas about analyses to run, interfaces and portals to create etc, but to me, there's also a lot missing. Please bear with me in the following, it may sound confrontational, but that's not my intent. Email isn't the most expressive medium, and I'm trying to be constructive here.

For example, the discussions on infrastructure revolve around 'what can we use', not 'what do we need'. If we don't address that then we'll be limited in what we can achieve, because we simply won't have prepared the resources we need to do the job.

E.g., the Serratus thread made a reference to searching all public data in SRA, ~100 PB. That is _not_ a trivial task, not even with cloud-native technology, yet there's no discussion of what it will take to do that.
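
To put a rough number on that, here is a minimal back-of-envelope sketch. Only the ~100 PB figure comes from the thread; the per-worker read rate of 100 MB/s and the worker counts are assumptions chosen purely for illustration, and the model ignores compute time, egress limits, and retries.

```python
def scan_days(dataset_bytes: float, per_worker_bytes_per_s: float, workers: int) -> float:
    """Days needed to read the dataset once, assuming perfectly parallel workers."""
    if per_worker_bytes_per_s <= 0 or workers <= 0:
        raise ValueError("read rate and worker count must be positive")
    seconds = dataset_bytes / (per_worker_bytes_per_s * workers)
    return seconds / 86_400  # seconds per day


PB = 10**15
MB = 10**6

SRA_BYTES = 100 * PB   # size quoted in the Serratus thread
PER_WORKER = 100 * MB  # ASSUMPTION: sustained read rate per worker, in bytes/s

for workers in (1, 100, 10_000):  # ASSUMPTION: illustrative fleet sizes
    days = scan_days(SRA_BYTES, PER_WORKER, workers)
    print(f"{workers:>6} workers: {days:,.1f} days")
```

Under these assumptions a single worker would need roughly 32 years just to read the data once, 100 workers about 116 days, and 10,000 workers a bit over a day. The real numbers will differ, but this is the kind of estimate a provider needs before they can size an offer.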

Here at EBI I lead a team of cloud architects/consultants whose job it is to help EBI research teams move into the cloud. Our initial discussions with them always follow the same pattern; the only questions that get reasonably clear answers are 'what are you trying to achieve' and 'when do you want it done by'.

When we ask 'how much data do you have', 'how much CPU time do you need', 'how much data will you produce', the answer is normally vague, often not numerical. So we don't know if we should help them build a horse and cart, a truck, or a freight train, yet the choice is crucial to the success of their project.

Until we know what we're trying to do, with numbers, we can't know what infrastructure we need. Once we know, we can ask providers if they can provide that. Between us we have contacts in Google, Amazon and Oracle cloud, but to ask for 'something', without saying what, is not likely to encourage their generosity.

Similarly for the research infrastructures, ELIXIR, EOSC, and the EBI Embassy Cloud. Our Embassy cloud team are willing to help, but their first question was 'what do you need'. What should I tell them?

Unless I've missed something somewhere, I think this is a discussion that needs to happen, rather urgently.

Finally, as I say, I'm aware that my tone here may come across as confrontational. If I offend anyone, I apologise unreservedly, that is certainly not my intention. I want to contribute to this hackathon to help make it a success, but I'm not a biologist, my background is in computing. As such, I'm having real trouble understanding the computational needs of this project, which is why I'm waving my little red flag.

Cheers,
Tony

Rutger Vos

Mar 28, 2020, 1:51:55 PM
to Sergei Pond, virtual biohackathon COVID-19 2020, Pjotr Prins
Hi Sergei,
 
> Not sure what threads you are talking about:)

These are threads in the Google Group for a virtual biohackathon against COVID-19 that starts a week from now. The group address is in the CC. It would be wonderful if you could find the time and energy to join in, or at least keep an eye on what's going on.

> I wholeheartedly agree that raw read data for genomes need to be available.
> I don’t think hosting it would be a problem (that’s what NCBI SRA is ostensibly for), but many sequencing groups we got in touch with are too overwhelmed to keep up.
 
I also don't think hosting is the issue. It would make the most sense if everything simply ended up quickly in INSDC databases, i.e. NCBI SRA or that EBI thing. The main challenge seems to be that a lot of data goes into GISAID (great) but is kind of inconvenient to get back out again (not so great). More to do with licensing than anything else, as far as I know.

> Always happy to have more people involved, but I am kind of coming in mid-way through a long conversation, so I am not entirely sure how to get it going.

I guess the first thing I'm wondering about is how you see the data landscape, whether there are bottlenecks in getting data into Galaxy. For example, maybe the GISAID thing is not a problem in your view.
 
Thanks, and nice to hear from you - hope you're well!

Rutger

 



--

Kind regards,

Dr. Rutger A. Vos
Researcher / Bioinformatician
Darwinweg 2, 2333 CR Leiden
Postbus 9517, 2300 RA Leiden

Birgit Meldal

Mar 29, 2020, 7:16:15 AM
to Rutger Vos, Sergei Pond, virtual biohackathon COVID-19 2020, Pjotr Prins

"That EBI thing" is called ENA and there is a direct COVID-19 entry point here for all COVID data at EBI ;-)

https://www.ebi.ac.uk/ena/pathogens/covid-19

The pages behind the links are constantly updated so please refresh each day! Many of our resources are on long release cycles but we are putting out emergency releases. E.g. UniProt very quickly built a page just for the COVID data and it's been updated at least twice last week, but it's not on their main page until the April release. IntAct/Complex Portal (= molecular interactions, my field) are planning a release for the end of next week or start of the following week. I don't know the ENA release cycle as I work at the protein end of things...

Compared to GISAID it's all open and accessible by API :)
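
As a rough illustration of what API access can look like, the sketch below only builds a search URL and does not call the service. It assumes the ENA Portal API search endpoint and its `result`, `query`, `fields`, `format`, and `limit` parameters, and uses NCBI Taxonomy ID 2697049 for SARS-CoV-2. None of these details come from this thread, and the helper name `ena_search_url` is hypothetical, so check everything against the current ENA documentation before relying on it.

```python
from urllib.parse import urlencode

# ASSUMPTION: ENA Portal API search endpoint (not stated in this thread).
ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"
SARS_COV_2_TAXON = 2697049  # NCBI Taxonomy ID for SARS-CoV-2


def ena_search_url(taxon_id, result="sequence",
                   fields=("accession", "description"), limit=10):
    """Build a search URL for records under the given taxon (hypothetical helper)."""
    params = {
        "result": result,                  # which ENA data set to search
        "query": f"tax_tree({taxon_id})",  # taxon plus all of its descendants
        "fields": ",".join(fields),        # columns to return
        "format": "tsv",
        "limit": limit,                    # keep the first request small
    }
    return f"{ENA_SEARCH}?{urlencode(params)}"


url = ena_search_url(SARS_COV_2_TAXON)
print(url)
```

The resulting URL could then be fetched with any HTTP client, for example `urllib.request.urlopen(url)`. Keeping `limit` small for a first request is a cheap way to confirm the field names before pulling everything.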

Happy Sunday!
Birgit

-- 
----------------------------------
Dr. Birgit Meldal
Senior Complex Portal & IntAct Curator
European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD
United Kingdom

+44 1223 494107
bme...@ebi.ac.uk

http://www.ebi.ac.uk/intact/
http://www.ebi.ac.uk/complexportal

@complexportal
@intact_project
@bmeldal