The GISAID situation

Rutger Vos

Mar 28, 2020, 1:36:52 PM
to virtual biohackathon COVID-19 2020
Hi all,

I'm getting the general sense that whoever I ask about their experiences reusing GISAID data for serious work is dissatisfied with how this is going. The complaints are roughly these:
  • there is no download API, so data cannot be pulled into workflows
  • they are slow to create user accounts
  • they are unpredictable in assigning download rights to individual accounts
  • they are slow to respond to help requests or other emails
So far I've learned that this has been a problem for Galaxy, for nextstrain, and for individual researchers. In this thread, I would like to discuss strategies for addressing this. 

In an ideal world, the data would just be freely accessible the way INSDC data are. Assuming that there are licensing issues, the next best thing is that users at least get an account quickly and can then fetch data with some kind of API token. 

This seems very solvable. Maybe it is simply a matter of persuading them to do this, perhaps with our help?

I wonder if some kind of letter, signed by a sufficiently large number of participants, might be compelling. What do you think?

(I guess another option might be opening up the letter and naming and shaming them in other ways - but I assume they're just very busy/understaffed/unsure how to comply with the licensing scheme they're locked in.)

Kind regards,

Dr. Rutger A. Vos
Researcher / Bioinformatician
+31717519600 - +31627085806
Darwinweg 2, 2333 CR Leiden
Postbus 9517, 2300 RA Leiden

Pjotr Prins

Mar 28, 2020, 3:36:38 PM
to Rutger Vos, virtual biohackathon COVID-19 2020
Maybe write a public letter? If we sign it, it is effectively public. We
can do it on GitHub, like we did for the manifesto:

https://github.com/pjotrp/bioinformatics

The other option is that, once we have our infrastructure set up, we
just post it as an open alternative, and then write a paper on the
importance thereof.


Rutger Vos

Mar 29, 2020, 5:49:17 AM
to Pjotr Prins, virtual biohackathon COVID-19 2020
I would like to refrain from going public until they have had a chance to explain their reasons. At this point we are simply guessing what is going on behind their doors.
--

Kind regards,

Dr. Rutger A. Vos
Researcher / Bioinformatician
Darwinweg 2, 2333 CR Leiden
Postbus 9517, 2300 RA Leiden

José María Fernández

Mar 29, 2020, 6:32:34 AM
to Rutger Vos, Pjotr Prins, virtual biohackathon COVID-19 2020, Salvador Capella

Hi everyone,
    From my point of view, GISAID may simply not have the technical expertise to do what the community needs at this moment. We could use such a letter to ask to act as a data proxy, only for coronavirus-related sequences, so that we could make those sequences openly available. One of the outcomes of the Virtual BioHackathon could then be an open API (with controlled access to sensitive data, if they allow us that), which GISAID could later reuse on their own servers.

    Hopefully the legal restrictions on redistribution could be partially loosened.

    Best,
        José María

PS: If they agree, we, either as BSC or as ELIXIR, could provide the public storage for the data they are willing to openly share with the community

To unsubscribe from this group and stop receiving emails from it, send an email to virtual-biohacka...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/virtual-biohackathon/CAATi6nm%3DA3obvHM5h7Dfovk_-XxKxjRDiu9j6m2XWywnHLXW8w%40mail.gmail.com.
--
"There is no reason why anybody would want a computer in their home" -
	Ken Olson, founder of DEC 1977
"640K ought to be enough for anybody" - Bill Gates, 1981 
"Nobody will ever outgrow a 20Mb hard drive." - ???

"Premature optimization is the root of all evil." - Donald Knuth
"Los ordenadores son inútiles. Sólo pueden darte respuestas" - Pablo Ruíz Picasso

José María Fernández González
Senior Research Scientist
e-mail: jose.m.f...@bsc.es
INB Node, Life Sciences Department
Torre Girona Building, 1st floor, Barcelona Supercomputing Center
C/. Jordi Girona, 31
Zip Code: 08034				Barcelona (Spain)
Phone: (+34) 934117074

Birgit Meldal

Mar 29, 2020, 7:24:41 AM
to José María Fernández, Rutger Vos, Pjotr Prins, virtual biohackathon COVID-19 2020, Salvador Capella

Here's a very naive question:

To get round the licensing and access issues for GISAID, wouldn't it be better to convince anyone submitting to GISAID to also submit to an INSDC database? Then the data is FAIR by default (as long as it complies with the sample consent in the first place).

Disclaimer: I work at the protein end of things and never heard of GISAID before last week, hence why this might be a very naive comment :)

Birgit

-- 
----------------------------------
Dr. Birgit Meldal
Senior Complex Portal & IntAct Curator
European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD
United Kingdom

+44 1223 494107
bme...@ebi.ac.uk

http://www.ebi.ac.uk/intact/
http://www.ebi.ac.uk/complexportal

@complexportal
@intact_project
@bmeldal

ababaian

Mar 29, 2020, 12:22:03 PM
to virtual biohackathon COVID-19 2020
To play devil's advocate (I met a developer of EpiFlu and we had pretty much this discussion):

If it were possible to release this data openly and publicly, then I believe they would. The problem is two-fold:

1) Clinicians/biologists who do the legwork and generate the data are often not credited when data is public and free. They therefore can't demonstrate their contribution to granting agencies to secure continued funding.
2) As above, they are also rapidly outcompeted by large informatics groups who can do the analysis and make the discoveries much faster than they could.

GISAID is a compromise, and in my opinion (at least for influenza) a necessary evil. Biomedical researchers don't have the same mentality as bioinformaticians; this is an unfortunate truth. People hoard biological samples, data, and information because that's what they were taught to do and that's what they teach. GISAID is a wedge into that armor and is opening people up to data-sharing.

I would not go on the offensive against them. The issue isn't what the organizers want, it's what they are bound to by the researchers who submitted the data. A productive action would be to request that SARS-CoV-2 sequences in particular be released publicly during this pandemic, not from GISAID itself but from the biologists who submitted the data. It again will be the wedge for open science policies. This can solve many of the access issues and is far more likely to succeed. If you write an open letter, I would address it to the greater community, try to have GISAID sign on, and ask that all SARS-CoV-2 sequences be free.



Pjotr Prins

Mar 29, 2020, 12:33:05 PM
to ababaian, virtual biohackathon COVID-19 2020
So, one thing we can do is allow for credit and attribution. You don't
have to make data hard to access for that. Free software shows the
way.

Pjotr Prins

Mar 29, 2020, 12:37:49 PM
to Pjotr Prins, ababaian, virtual biohackathon COVID-19 2020

To underwrite your response: access to data is going to be one of the
main challenges in biomedical science in the coming years. To build
analysis systems that are fast and useful we need access to data.
Currently almost all biomedical data is hidden and/or hard to access.

At the Japan Biohackathon we looked at MediKanren - a tool that does
amazing analysis. Unfortunately no one can use it because the
underlying data comes from hidden sources and the different licenses
do not allow use. This is what we need to do something about. And we
are nowhere close to solving that! FAIR data also does not address
this problem fully.

Pj.

David Yu Yuan

Mar 29, 2020, 12:59:09 PM
to Pjotr Prins, ababaian, virtual biohackathon COVID-19 2020
Since the spirit of this hackathon is open data and open science, we should only consider like-minded repositories. We have spent more time than necessary talking about GISAID.

INSDC is undeniably the largest and most open repository. No matter which of the three databases (DDBJ, ENA or GenBank) you submit your data to, it will eventually end up in all three. There is already enough data, and comprehensive enough, for this hackathon to get started. As governments are pouring money into state-sponsored projects against COVID-19, we can expect much more data to show up in the INSDC databases very quickly.

In my opinion, we should start considering how to access one of the INSDC databases now. However, I do want to hear different opinions on why we cannot or should not use INSDC databases as the source and target. If so, what are the viable alternatives?


Best regards,

David Yuan

Rutger Vos

Mar 29, 2020, 4:41:47 PM
to David Yu Yuan, Pjotr Prins, ababaian, virtual biohackathon COVID-19 2020
The thing is, we should do both. We should be able to fetch data from all of them. The INSDC data is easy, the Bio* toolkits can, for the most part, do that already. But a lot of the viral genomes (hundreds of them) are now only ending up in GISAID and that impedes analysis.

I get ababaian's point fully, and it is something that other domains are dealing with as well. For example, biodiversity data (species occurrences) is painstakingly collected by individual researchers and institutions that have a tendency to want to hoard that data. This is why it was difficult to access these data automatically via the aggregator GBIF.org for ages. The way it works now is that you can fetch these data via an API (accessible by R packages) where you use a personalized API key (implying your agreement with the data-sharing provisions) and where the downloaded data contains metadata about the submitters, who should be cited upon reuse.

GISAID needs something like this. I'm sure we all agree with crediting data submitters, and appending something like the credits section that nextstrain adds to their reports is really not an issue. It should just be easier, and automated.
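
To make the "easier, and automated" part concrete, here is a minimal sketch of generating a credits section from submitter metadata that travels with the downloaded records. The record layout and field names (`accession`, `submitting_lab`, `authors`) and the example values are assumptions made up for illustration; they are not the format of GISAID, GBIF, or nextstrain.

```python
# Hypothetical records, as they might arrive alongside downloaded sequences.
# All field names and values are illustrative assumptions.
EXAMPLE_RECORDS = [
    {"accession": "EXAMPLE_0001", "submitting_lab": "Lab A",
     "authors": ["A. Author", "B. Author"]},
    {"accession": "EXAMPLE_0002", "submitting_lab": "Lab B",
     "authors": ["D. Author"]},
    {"accession": "EXAMPLE_0003", "submitting_lab": "Lab A",
     "authors": ["B. Author", "C. Author"]},
]


def build_credits(records):
    """Group accessions and authors by submitting lab, keeping first-seen order."""
    by_lab = {}
    for rec in records:
        entry = by_lab.setdefault(
            rec["submitting_lab"], {"accessions": [], "authors": []}
        )
        entry["accessions"].append(rec["accession"])
        for author in rec["authors"]:
            # Skip authors already listed for this lab.
            if author not in entry["authors"]:
                entry["authors"].append(author)
    return by_lab


def format_credits(by_lab):
    """Render a plain-text acknowledgements section, one line per lab."""
    lines = ["We gratefully acknowledge the submitters of the sequences used:"]
    for lab in sorted(by_lab):
        entry = by_lab[lab]
        authors = ", ".join(entry["authors"])
        accessions = ", ".join(entry["accessions"])
        lines.append(f"- {lab}: {authors} ({accessions})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_credits(build_credits(EXAMPLE_RECORDS)))
```

Since the credits are derived purely from the metadata, any workflow that fetches sequences could append them to its report without manual effort.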

Hilmar Lapp

Mar 29, 2020, 5:26:24 PM
to Rutger Vos, virtual biohackathon COVID-19 2020
The salient points have, I think, already been made in this thread by a number of people, so I won’t repeat them all here. Yes, the technical end of the problem ought to be very solvable, but I don’t think it’s a technical problem. It’s a social problem, and these are almost always very hard to solve.

I have heard time and again from numerous scientists who produce sequence data that isn’t easy to recreate or for which the samples aren’t easy to come by (a lot of biodiversity sequence data is in this category) that the fact that the INSDC databases provide very little or no means of ensuring everyone gets credited is a major issue. The great majority of microbiome data is also private, not public. 

If this group wanted to contribute to addressing this problem on the technical end, then I think targeting this issue would be a good candidate. However, any solutions would take a long time to have an effect, because the desired effect would be culture change. If this group wanted to contribute to addressing the problem on the social end, one candidate could be to create valuable products from INSDC data and to credit the contributors of the sequence data aggressively.

Another candidate might be to instigate something akin to the Ft. Lauderdale agreement, i.e., an agreement among everyone in the field as to what types of analyses others reusing the data can do and publish on their own, and which ones are considered to be the first right of the sequence producer(s), in exchange for early sequence release. However, in contrast to a whole genome, a single viral sequence on its own probably yields quite limited insight, though I may be wrong about that.

My $0.02.

  -hilmar



LJ.Garcia

Mar 30, 2020, 12:13:44 PM
to Rutger Vos, Pjotr Prins, virtual biohackathon COVID-19 2020
Dear all,

ELIXIR has kindly offered to contact them; let's wait a bit to see what happens.

Kind regards,


Rutger Vos

Mar 30, 2020, 5:28:29 PM
to LJ.Garcia, Pjotr Prins, virtual biohackathon COVID-19 2020
That's great!

ababaian

Mar 30, 2020, 7:50:10 PM
to virtual biohackathon COVID-19 2020
Maybe we can also remind ourselves of the Bermuda Principles, an exemplary case of biologists agreeing to be part of something bigger than themselves. It would be appropriate to adopt similar principles during the pandemic, our projects included.

Artem

Rutger Vos

Mar 31, 2020, 8:35:39 AM
to ababaian, virtual biohackathon COVID-19 2020
They're not doing anything wrong in regard to those principles, though, right?

Rutger


Erik Garrison

Mar 31, 2020, 9:49:48 AM
to virtual biohackathon COVID-19 2020
> I would not go on the offensive against them. The issue isn't what the organizers want, it's what they are bound to by the researchers who submitted the data. A productive action would be to request that SARS-CoV-2 sequences in particular be released publicly during this pandemic, not from GISAID itself but from the biologists who submitted the data. It again will be the wedge for open science policies. This can solve many of the access issues and is far more likely to succeed. If you write an open letter, I would address it to the greater community, try to have GISAID sign on, and ask that all SARS-CoV-2 sequences be free.

Could we just ask GISAID to allow us (such as via ELIXIR) to communicate with all the biologists and clinicians who have uploaded SARS-CoV-2 sequences, and invite them to submit to an open repository (any one in INSDC) with a CC-BY-SA-NC license augmented to handle the particulars of this situation with respect to publication and citation?

As far as I can tell, it's going to be impossible for GISAID to change the license on the submitted sequences. But the people submitting them retain IP, and so they could submit them elsewhere. With the right campaign and motivation many of them might.

LJ.Garcia

Mar 31, 2020, 10:58:00 AM
to Pjotr Prins, Tazro Ohta, Jennifer Harrow, virtual biohackathon COVID-19 2020

Dear All,


This week all groups should self-appoint coordinators. We are asking people to step up and volunteer. It is important to update the Wiki page so it includes a short description of the topic and the sort of projects you would want to tackle during the hacking, the appointed coordinator(s), communication channels, any scheduled meetings, available resources (data and tools) you could use, participants already involved, skills needed, and any other information you think could help participants know what to expect from the topic. If you are looking for inspiration, these topics are a good starting point:



Jennifer Harrow, coordinator of the ELIXIR Europe Tools Platform, has offered to help organise the agenda for the coming week so we have spaces to present updates on topics, gather feedback and brief on subjects of general interest. We'll have a streaming presentation and meeting every day around a topic of general interest that cuts across initiatives. There will be two meetings a day, so that people in different time zones can find an option that works for them. Webinars on supporting computing platforms may also be offered. Some examples of the topics:


  1. Use of ELIXIR/EBI resources - what is on offer?

  2. Contributing to Galaxy - how to leverage existing infrastructure?

  3. Creating a workflow with CWL - how do I create and share a COVID-19 workflow?


If you want to offer a live presentation/webinar from your group, please reply to this thread or email us directly. We would like these sessions to be as informative and interactive as possible.


Best wishes, 


Hilmar Lapp

Mar 31, 2020, 1:05:12 PM
to Erik Garrison, virtual biohackathon COVID-19 2020
A CC-BY-SA-NC license (like any other CC license) is unsuited for this purpose, precisely because it is a license. That is, it asserts copyright and then, based on that, grants rights that one otherwise would not have (none by default beyond fair use). (CC0 is the exception: even though it is often referred to as a license, it is actually a waiver of rights by the author.)

However, a license would almost certainly not hold up if challenged in court because the data it seeks to cover are facts of nature. (Hand-edited alignments may be different.) EU law allows for sweat of the brow to count as well, but that would only apply to the database as a whole, not individual records. US copyright law does not admit sweat of the brow in whether a given work is eligible for copyright.

This is why dbGaP, GISAID, etc. use Data Use Agreements (DUAs). Once agreed to, these are in fact legally enforceable, but their terms, rights, or restrictions are not based in asserting copyright. Instead, they are based on withholding access until one agrees. (This is also the Achilles heel of DUAs. If you receive data from someone who used it under a DUA but who doesn’t bind you under the same terms, you are not bound by the original DUA. This is why DUAs typically come with very strong stipulations on whether and how you may share the data with others who have not also signed the same DUA.)

  -hilmar


Gianluca Della Vedova

Mar 31, 2020, 4:18:23 PM
to virtual biohackathon COVID-19 2020
Hilmar,
this is a wonderful summary of the situation. And it shows why CC0
(public domain) should be the standard, together with a social norm that
enforces proper acknowledgments. Which is essentially what we have as
researchers: what has been published can be freely reused (just as if it
were CC0) but must be cited appropriately.

It is time to have the same standards for ideas and data, but it takes a
lot of time to change social norms.

Best,


Gianluca Della Vedova
https://gianluca.dellavedova.org

Erik Garrison

Mar 31, 2020, 4:55:29 PM
to Hilmar Lapp, virtual biohackathon COVID-19 2020
Thank you for the clear breakdown of the situation. These are works of nature. Courts have finally clarified that finding one doesn't let you patent it or own it, which is what leads to the DUA.

This suggests an alternative arrangement that could perhaps align with the needs of those contributing the sequences and those who want to use them for derivative research. 

We can suggest that GISAID sponsor a consortium paper that describes the SARS-CoV-2 pangenome. Everyone who has contributed sequences so far gets authorship (if they agree to participate of course). The sequences up to now are published in the paper. The paper will get a crazy number of citations, which will help contributors. This is better than the present situation, where people are basically citing GISAID. Everyone gets credit and we all get to share the genomes of these dangerous little works of nature. 

We could find a way to update this incrementally. Maybe there are subsequent papers. Maybe future sequences get shared openly under something like Ft. Lauderdale.

Fields, Christopher J

Mar 31, 2020, 5:12:38 PM
to Gianluca Della Vedova, virtual biohackathon COVID-19 2020
Related to this (now on Twitter) from Francis:

https://twitter.com/bffo/status/1245070041076436992

chris