The International Committee on Taxonomy of Viruses (ICTV) classifies viruses into families, genera and species and provides a regulated system for their nomenclature that is universally used in virus descriptions. Virus taxonomic assignments have traditionally been based upon virus phenotypic properties such as host range, virion morphology and replication mechanisms, particularly at family level. However, gene sequence comparisons provide a clearer guide to their evolutionary relationships and provide the only information that may guide the incorporation of viruses detected in environmental (metagenomic) studies that lack any phenotypic data.
A rapid and objective means to explore metagenomic viral diversity and make informed recommendations for their assignments at each taxonomic layer is essential. GRAViTy provides one means to make rule-based assignments at family and order levels in a manner that preserves the integrity and underlying organisational principles of the current ICTV taxonomy framework. Such methods are increasingly required as the vast virosphere is explored.
Virus taxonomy is a man-made construct that seeks to describe and catalogue the vast diversity of known viruses and their genetic interrelationships. Viruses are formally classified into orders, families, genera and species by the International Committee on Taxonomy of Viruses (ICTV). This organisation maintains a universal taxonomy of viruses that encapsulates their extraordinary genetic and structural diversity. Viral diversity is far greater than that encountered in other organisms, with major differences in genetic material (RNA or DNA), its configuration (double or single stranded) and the orientation of the encoded genes. Viral genomes may be segmented, with segments often co-packaged together or, more frequently, packaged into separate virions that are then all required to productively infect a cell. Virion morphology and size vary widely: particles may have icosahedral or more complex symmetries, or may form filamentous, rectangular, bullet-shaped or even bottle-shaped nucleocapsids. Some viruses are enveloped in a host-derived lipid bilayer. Finally, viral genomes are hugely variable in size and in their complements of genes, ranging from fewer than 2000 bases encoding 2 genes to 2.5 million base pairs encoding over 2500 genes [1].
The broadest division of viruses is the Baltimore classification, in which assignments are based on genome configuration as follows: I: dsDNA; II: ssDNA; III: dsRNA; IV: ssRNA, sense orientation of genes; V: ssRNA, antisense orientation; VI: ssRNA with reverse transcription into a dsDNA replication intermediate; and VII: dsDNA with a ssRNA replication intermediate [2]. With the exception of groups VI and VII, members of which show substantial similarities in genome organisation and replication strategies, this functional division splits viruses into groups that are largely or entirely unrelated to each other in evolutionary terms. However, the division is coarse, with several groups, most evidently group I, each containing several unlinked virus groups.
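The seven Baltimore groups listed above can be written down as a simple lookup. The following is a minimal illustrative sketch only; the function name `baltimore_group` and its parameters are hypothetical and are not part of any ICTV or GRAViTy tool.

```python
# Illustrative encoding of the seven Baltimore groups described in the text.
# All names here (BALTIMORE_GROUPS, baltimore_group) are hypothetical.
BALTIMORE_GROUPS = {
    "I": "dsDNA",
    "II": "ssDNA",
    "III": "dsRNA",
    "IV": "ssRNA, sense orientation of genes",
    "V": "ssRNA, antisense orientation",
    "VI": "ssRNA with reverse transcription (dsDNA replication intermediate)",
    "VII": "dsDNA with a ssRNA replication intermediate",
}


def baltimore_group(nucleic_acid, strands, sense=None, reverse_transcribing=False):
    """Return the Baltimore group (Roman numeral) for a genome configuration.

    nucleic_acid: "DNA" or "RNA"
    strands: "ds" or "ss"
    sense: "+" or "-" (only needed for non-reverse-transcribing ssRNA genomes)
    reverse_transcribing: True for groups VI and VII
    """
    key = (nucleic_acid.upper(), strands.lower())
    if reverse_transcribing:
        if key == ("RNA", "ss"):
            return "VI"
        if key == ("DNA", "ds"):
            return "VII"
        raise ValueError("reverse-transcribing genomes must be ssRNA or dsDNA")
    if key == ("DNA", "ds"):
        return "I"
    if key == ("DNA", "ss"):
        return "II"
    if key == ("RNA", "ds"):
        return "III"
    if key == ("RNA", "ss"):
        if sense == "+":
            return "IV"
        if sense == "-":
            return "V"
        raise ValueError("ssRNA genomes need sense '+' or '-'")
    raise ValueError(f"unrecognised genome configuration: {key}")
```

For example, `baltimore_group("RNA", "ss", sense="-")` returns `"V"`. The lookup also makes the coarseness of the scheme apparent: every dsDNA virus without reverse transcription falls into group I, whatever its evolutionary origin.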
While current ICTV taxonomy has incorporated this diverse collection of evolutionarily related and unrelated groups into a single, overarching framework, there are further challenges from the explosion in virus nucleotide sequence data that have been accrued from next generation or high-throughput sequencing (HTS) methods. Their application to aquatic and terrestrial environmental samples, as well as to the gut microbiome, has revealed an astonishing diversity of virus sequences, many bacteriophages, but others likely infecting a range of eukaryotes, including amoebae, algae, insects, fish and plants [3,4,5,6,7]. The majority of such sequences do not match any of those of viruses in currently assigned taxa, and clearly, the ICTV classification would have to be greatly expanded to incorporate this much greater dataset of viruses.
Recently, the ICTV, on advice from an expert group [8], expressed the intention to consider the incorporation of viruses known only by their nucleotide sequences into the formal taxonomy. Classification of such viruses would be subject to coding-complete genome sequences being available, with appropriate quality control to ensure sequence accuracy and avoid problems of misassembly [8]. However, these newly described viruses lack information on the phenotypic properties that have historically been used in classification, such as virion structure, pathogenicity in their hosts, replication mechanisms and epidemiology/transmission routes. It was therefore proposed that the genome sequence itself may be used to infer a number of properties that can serve as attributes to assist in their taxonomic assignments.
The policy to accept metagenomic-derived sequences into the ICTV taxonomy is not entirely new, and large numbers of further species and genera have been assigned within existing families in recent years [9]. Many such taxonomy additions, particularly at the level of species or genus, can be justified because there is an existing framework of taxon assignments within such families, often based upon phenotypic properties of isolates of their founder members.
However, the incorporation of viruses that are much more divergent from the existing virus datasets is far more problematic. The ICTV taxonomy provides little information that might guide decisions on whether to assign more divergent viruses to existing families or, conversely, to justify the creation of new virus families or orders. Indeed, there is little or no systematic information on what genomic attributes delineate these higher taxonomic divisions. Does simple possession of homologous genes, or of shared organisational features such as gene order and segmentation, suffice to justify family assignment? Do the genes that encode structural proteins, and which therefore define virion morphology, need to be shared? Is there any consistency in how viruses are currently divided into families and orders at the genomic level? These uncertainties require urgent resolution if further classification of the more divergent viruses discovered in recent HTS and related investigations is to proceed on a rational and consistent basis in the future.
A list of 3854 eukaryotic viruses for which complete genome sequences are available was assembled (Additional file 1: Table S1, Additional file 2: Table S2). These exemplify each of the current ICTV taxonomy assignments down to species level. This information was drawn from the ICTV Master Species List, the Virus Metadata Repository and further assignments approved by the ICTV Executive Committee in July 2017, currently under ratification vote. This dataset provides the most complete and up-to-date collection of viruses with defined assignments.
The first step in the analysis was the extraction of information on genomic features from the complete genome sequence of each virus. This use of multiple features extracted from viral sequences as potential contributors to taxonomy assignments contrasts with traditional phylogenetic methods, in which viruses are often represented by only small, highly conserved portions of their genomes, such as the catalytic core of RNA-dependent RNA polymerase (RdRp) gene sequences for different groups of RNA viruses. Features extracted included gene complements, genomic organisation and metrics of gene homology. Herein, viruses are annotated with databases of protein profile hidden Markov models (PPHMMs) and genomic organisation models (GOMs). Instead of a molecular sequence, each virus is represented by a PPHMM signature and a GOM signature. A PPHMM signature is simply a list of the degrees of similarity, at the amino acid level, of the genes present in the virus to the various PPHMMs in the database. Similarly, a GOM signature is a list of the degrees of similarity of its genomic organisation to the various GOMs in the database. Additional file 3: Table S3 summarises the PPHMMs used in this study.
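To make the notion of a signature concrete, the sketch below represents a virus as a fixed-length list of similarity scores, one per PPHMM in the database. It is an illustration under stated assumptions rather than the implementation used in this study: the input format (one dictionary of profile hits per gene), the rule of keeping the best score per profile, and the use of a generalised Jaccard index to compare two signatures are all assumptions made here for the example, and the function names are hypothetical.

```python
def pphmm_signature(gene_hits, n_profiles):
    """Build a signature: one similarity score per PPHMM in the database.

    gene_hits: list with one dict per gene, mapping a PPHMM index to the
               similarity score of that gene against that profile
               (assumed input format; scores are non-negative).
    n_profiles: number of PPHMMs in the database.
    Assumption: when several genes match the same profile, keep the best score.
    """
    signature = [0.0] * n_profiles
    for hits in gene_hits:
        for profile_idx, score in hits.items():
            signature[profile_idx] = max(signature[profile_idx], score)
    return signature


def generalised_jaccard(sig_a, sig_b):
    """Generalised Jaccard similarity of two equal-length, non-negative
    signatures: sum of element-wise minima over sum of element-wise maxima."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    numerator = sum(min(x, y) for x, y in zip(sig_a, sig_b))
    denominator = sum(max(x, y) for x, y in zip(sig_a, sig_b))
    return numerator / denominator if denominator > 0 else 0.0


# Toy example with a database of four profiles (made-up scores).
virus_a = pphmm_signature([{0: 120.0, 2: 35.5}, {1: 80.0}], n_profiles=4)
virus_b = pphmm_signature([{0: 100.0}, {3: 50.0}], n_profiles=4)
similarity = generalised_jaccard(virus_a, virus_b)
```

A GOM signature could be held in the same way, as one similarity value per genomic organisation model, so that every virus is described by two numeric lists of fixed length regardless of its genome size or gene content.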
Picorna-, Mononega- and Herpesvirales were, however, not monophyletic. For Picornavirales, members of the Caliciviridae and Solinviviridae families, which are not classified into this order, were embedded within the clade, while Potyviridae showed a sister relationship. However, this phylogeny is consistent with the previously noted relationships of these groups based on RdRp phylogenies, and originates from discrepancies between the relationships of replication genes and those of the structural proteins that define capsid morphology and symmetry [11].
The inclusion of the three families (Herpes-, Alloherpes- and Malacoherpesviridae) in the order Herpesvirales is primarily based upon their characteristic capsid morphology, without readily detectable sequence homology that defines this order [12]. We found that the only shared profile across these three families was between their genes coding for DNA packaging terminase, consistent with previous analyses [13]. Nevertheless, the herpesvirus families were collectively embedded within a larger clade of large DNA viruses that exhibit detectable similarity to each other through homologous DNA polymerase, protein kinase and ribonucleotide reductase genes (Baculo-, Nudi-, Hytrosa-, Asco-, Irido-, Asfar-, Marseille-, Phycodna-, Pox-, Mimi- and Nimaviridae). This higher-level grouping showed 100% bootstrap support but excluded the Polydnaviridae, which showed a much lower degree of relatedness to other large DNA viruses (74% bootstrap support). The two genera of this latter family are considered to be independently derived, perhaps from an ancestral nudivirus (Bracovirus) and another large cytoplasmic DNA virus (Ichnovirus) [14]. They nevertheless formed a bootstrap-supported but highly divergent clade, reflecting shared profiles of their cysteine-rich protein-coding genes (c4.1 and d9.2 of the Hyposoter fugitivus ichnovirus, homologues in Campoletis sonorensis ichnovirus and CRP1 and CRP3 proteins of the Cotesia congregata bracovirus [15,16,17]). Finally, there was further support for all DNA viruses possessing DNA polymerase forming a larger clade that encompassed Adenoviridae and Lavidaviridae, from which the small DNA virus families Polyomaviridae and Papillomaviridae were excluded (100% bootstrap support). These two latter families were, however, linked through PPHMM matches of their E1 and NS proteins, corresponding to the previously noted protein sequence homology [18, 19], creating a relatively deeply branching clade with 98% bootstrap support.