Serratus: Ultra-deep Coronaviridae homology search

ababaian

Mar 26, 2020, 12:53:06 PM
to virtual biohackathon COVID-19 2020
Serratus

COVID-19 came out of seemingly nowhere. We will search all SRA sequence data to identify new members of the family Coronaviridae to trace the lineage of SARS-CoV-2.

Objectives

1) Create a phylogenetic tree for Coronaviridae with all available sequences.
2) Identify libraries with novel coronaviruses by searching all public data on SRA (~100 PB).
3) Assemble putative coronavirus genomes and return to step 1).

We're currently building a framework for highly cost-efficient skimming/alignment of data from SRA. Since February, SRA has been mirrored to AWS S3, so we can access all of the data at almost no cost using AWS services.

Rutger Vos

Mar 26, 2020, 2:38:21 PM
to ababaian, virtual biohackathon COVID-19 2020
I think this is a good idea, but I would strongly suggest a couple of considerations:
  1. This should be integrated into existing projects that are building viral phylogenies.
  2. The tree building and downstream analysis should be partitioned by gene, using codon alignments.
Although this is a bit outside my expertise, it seems vital to be able to track how rapidly the different genes are evolving, whether they are under selection (dN/dS ratio), and whether there are conserved regions within the genes. I suppose that would be useful information for identifying epitopes that could be targeted by treatments and vaccines, no?
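
To make the dN/dS idea concrete, here is a minimal Python sketch that classifies codon differences between two codon-aligned sequences as synonymous or nonsynonymous. It assumes the standard genetic code, and the function name `count_codon_differences` is made up for illustration. It only produces raw counts: a real dN/dS estimate also normalizes by the number of synonymous and nonsynonymous sites and corrects for multiple substitutions, which is what dedicated tools do.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Both inputs must be codon-aligned (same length, multiple of 3).
    Codons containing gaps or ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be codon-aligned and of equal length")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        # Accept RNA input by mapping U to T.
        codon_a = seq_a[i:i + 3].upper().replace("U", "T")
        codon_b = seq_b[i:i + 3].upper().replace("U", "T")
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base: not classifiable
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

For example, comparing `ATGGCTAAA` with `ATGGCCAGA` gives one synonymous change (GCT and GCC both encode alanine) and one nonsynonymous change (AAA lysine versus AGA arginine).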

--
You received this message because you are subscribed to the Google Groups "virtual biohackathon COVID-19 2020" group.
To unsubscribe from this group and stop receiving emails from it, send an email to virtual-biohacka...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/virtual-biohackathon/faceff95-a443-4804-9643-9cefead4e500%40googlegroups.com.


--

Kind regards,

Dr. Rutger A. Vos
Researcher / Bioinformatician

+31717519600 - +31627085806
Darwinweg 2, 2333 CR Leiden
Postbus 9517, 2300 RA Leiden

ababaian

Mar 26, 2020, 2:48:27 PM
to virtual biohackathon COVID-19 2020
Agreed on both points. I am not an expert on building phylogenetic trees, and if we can team up with another group that is doing this and can share that data, I'd be very happy. I've seen a lot of work being done on SARS-CoV-2 phylogenetics but not so much on coronaviruses in a broader context. If you know of such a group, please let me know!

The main focus here is identifying novel coronavirus sequences by skimming SRA.

Fotis Psomopoulos

Mar 26, 2020, 2:50:58 PM
to Rutger Vos, ababaian, virtual biohackathon COVID-19 2020
Hi all,

I completely echo Rutger's point. There is a similar idea listed under the ML topic on GitHub: applying clustering at different levels as a form of phylogeny.

Kind regards,

Fotis  



-- 
Fotis E. Psomopoulos
Assistant Research Professor
INAB - Institute of Applied Biosciences
CERTH - Center for Research and Technology Hellas
Thermi 57001, Greece

Phone: +30 2310 498 478
Fax  : +30 2310 498 270
ORCID: 0000-0002-0222-4273

While I may be sending this email outside my normal office hours, I have no expectation to receive a reply outside yours.

Rutger Vos

Mar 26, 2020, 3:24:29 PM
to ababaian, virtual biohackathon COVID-19 2020
I wouldn't necessarily call myself an expert in building phylogenetic trees (though I've dabbled here and there), but I would very much be willing and able to participate in an activity that extends one of the viral tree-building pipelines to add some analysis of rates of adaptive evolution. For example, take an existing tree (Nextstrain, or from Serratus, or whatever), partition the underlying alignment by gene (as codon alignments), and run something like codeml (PAML) or BranchSiteREL (HyPhy) to characterize rates of stabilizing or directional evolution in each of the viral genes. I imagine this might fall under the Workflows (or maybe BioStatistics) activities.
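
The partitioning step described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function name `partition_by_gene`, the gene names, and the coordinates are hypothetical, coordinates are taken to be 0-based half-open column indices in the alignment, and the alignment is assumed to already be in frame. The resulting per-gene codon alignments would then be written out and passed to a tool such as codeml or HyPhy, which is not shown here.

```python
def partition_by_gene(alignment, gene_coords):
    """Slice a whole-genome alignment into per-gene codon alignments.

    alignment:   dict mapping sequence name -> aligned sequence
                 (all sequences must have the same length)
    gene_coords: dict mapping gene name -> (start, end), given as
                 0-based half-open alignment column indices
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")

    partitions = {}
    for gene, (start, end) in gene_coords.items():
        # A codon alignment must span a whole number of codons.
        if (end - start) % 3 != 0:
            raise ValueError(f"{gene}: span is not a whole number of codons")
        partitions[gene] = {
            name: seq[start:end] for name, seq in alignment.items()
        }
    return partitions


# Toy example with made-up sequences and coordinates.
toy_alignment = {
    "virus_1": "ATGAAATAAATGCCCTGA",
    "virus_2": "ATGAAGTAAATGCCTTGA",
}
toy_genes = {"geneA": (0, 9), "geneB": (9, 18)}
toy_partitions = partition_by_gene(toy_alignment, toy_genes)
```

Keeping each gene as its own alignment is what makes it possible to estimate selection pressure per gene rather than averaging over the whole genome.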

ababaian

Mar 26, 2020, 3:31:22 PM
to virtual biohackathon COVID-19 2020
That type of analysis is foundational! I mean there's also a lot of important information outside of coding regions to consider. Mainly regulatory motifs in untranslated regions and/or structural units in the RNA genome. If you can account for the biology that is known and do that type of evolutionary analysis I believe this is how to convert sequence alignments into meaningful interpretations of pathogenesis and virulence.

"Nothing in biology makes sense, except in the light of evolution."

Rutger Vos

Mar 26, 2020, 3:47:21 PM
to ababaian, virtual biohackathon COVID-19 2020
I think it would be a lot of fun to work on this, but I think we have to be very careful about where we direct our attention. At a hackathon we can build things, but at this particular one the things we're building have to be as urgently useful as possible. I'd work on this if we connect it to the rest of the community, avoid duplication, and keep it going. I'm not planning to do an ephemeral proof of concept - then I'd rather work on FAIR sequence data access for everyone.

ababaian

Mar 26, 2020, 3:58:56 PM
to virtual biohackathon COVID-19 2020
: ) For sure, to me there's a clear application for a freely available dataset for very deep-evolutionary conservation analysis for Coronaviridae with relation to understanding SARS-CoV-2.

ababaian

Mar 31, 2020, 11:54:42 AM
to virtual biohackathon COVID-19 2020
Quick Update:
- I've clarified the core goals of Serratus
- We have the #serratus channel on the virtual biohackathon slack

-------------
The SARS-CoV-2 pandemic will infect millions and has already crippled the global economy.

While there is an intense research effort to sequence SARS-CoV-2 isolates to understand the evolution of the virus in real time, our understanding of where it originated is limited by the sparse characterization of other members of the Coronaviridae family (genomes are available for only 53 of 436 CoV species).

We are re-analyzing all RNA-sequencing data in the NCBI Sequence Read Archive (SRA) to discover new members of Coronaviridae. Our initial focus is mammalian RNA-sequencing libraries, followed by avian/vertebrate, metagenomic, and finally all 1.12M entries (5.72 petabytes).

The impact of this research is immediate upon data release. Rich evolutionary data increases the precision of research on conserved, and therefore functional, elements of CoVs. A specific application is that a broader CoV annotation will improve the specificity of RT-PCR primer design for diagnosing COVID-19 and can predict possible sources of false positives in clinical tests.
-------------

ababaian

Mar 18, 2021, 2:40:05 AM
to virtual biohackathon COVID-19 2020
Kind of crazy to think this was almost a year ago! I just wanted to give everyone an update and possibly recruit some "fresh blood".

We've fully built out Serratus, and it's capable of aligning in excess of 1 million NGS datasets per day at a cost of under 1 cent per dataset. With this we've aligned 5.7 million libraries in the Sequence Read Archive (10.2 petabases) to uncover in excess of 100,000 novel species of RNA viruses (defined by RdRP identity >10% diverged). This is about an order of magnitude increase in the number of known RNA virus species available in GenBank and other public databases.
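
As a back-of-the-envelope check on those figures, the short Python sketch below turns the quoted bounds into totals and shows one way to read the species threshold. The constants come straight from the numbers above; the function names and the `best_rdrp_identity` parameter (identity to the closest known RdRP, as a fraction) are hypothetical, and treating ">10% diverged" as "less than 90% identity" is an interpretation rather than the exact rule used by the pipeline.

```python
# Figures quoted in the message above.
LIBRARIES_ALIGNED = 5_700_000        # libraries aligned
MAX_COST_CENTS_PER_LIBRARY = 1       # "under 1 cent per dataset"
MIN_LIBRARIES_PER_DAY = 1_000_000    # "in excess of 1 million ... per day"
SPECIES_DIVERGENCE_THRESHOLD = 0.10  # ">10% diverged" in RdRP


def cost_upper_bound_usd(n_libraries, cents_per_library):
    """Upper bound on total cost in US dollars."""
    return n_libraries * cents_per_library / 100


def days_upper_bound(n_libraries, libraries_per_day):
    """Upper bound on wall-clock days at the minimum quoted throughput."""
    return n_libraries / libraries_per_day


def is_novel_species(best_rdrp_identity, threshold=SPECIES_DIVERGENCE_THRESHOLD):
    """True if divergence from the closest known RdRP exceeds the threshold."""
    return (1.0 - best_rdrp_identity) > threshold


total_cost = cost_upper_bound_usd(LIBRARIES_ALIGNED, MAX_COST_CENTS_PER_LIBRARY)
total_days = days_upper_bound(LIBRARIES_ALIGNED, MIN_LIBRARIES_PER_DAY)
```

Under these bounds, the whole run comes to less than 57,000 US dollars and less than about 5.7 days of alignment time.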

You can find more information in our pre-print: https://www.biorxiv.org/content/10.1101/2020.08.07.241729v2

We now have a massive dataset of viruses to characterize, and an additional 11,200 coronavirus assemblies (half are non-SARS-CoV-2). I've been doing some work on the evolutionary conservation of splice variants in CoV, but if anyone is interested in helping out with this, or in doing a systematic recombination analysis, we're always looking for more collaborators.

If you have any interest in _any_ RNA virus, we also have literally thousands of uncharacterized viruses that need some TLC. Please do reach out!

Cheers,

Artem

ababaian

Mar 18, 2021, 2:44:28 AM
to virtual biohackathon COVID-19 2020
P.S. We're actively developing a user interface for "Earth's Virome": think of it as searching for a particular virus across all the data, or exploring virus families and geography. If anyone is interested in helping with web / UI / database development, we could really use help in that area.

