Eric,
Thanks much for your interest and questions about Gnodes. I'll be happy to help you, and others, try this out and get it working for your project if you want. It will help me greatly to get feedback on details that don't work smoothly.
Gnodes offers something important beyond what other genome size measurement tools do: measures of the main genome components, which help you understand where an assembly differs from the actual DNA content, and decide whether that matters to you. Under-assembly of the repetitive centromere/telomere "junk" portions may not matter much, versus under-assembly of coding genes. The model plant project got the 80% of the genome that is euchromatic/non-repetitive fully assembled over 20 years ago, and only recently has almost fully assembled the remaining centromere/telomere portions and other highly repetitive spans.
BUSCO-called genes are what I call unique conserved genes (UCG), as there are other routes to finding them (i.e. expert-curated gene sets). I use gene sets built on genome assemblies (e.g. NCBI RefSeq) and RNA-assembled gene sets. For the critical UCG subset, these are often equivalent today, so the source doesn't matter much. If the genome assemblies are bad, though, then UCG drawn from them can also be bad, causing problems.
Q: Would the number of BUSCO sequences (150-5000 depending on the clade and assembly quality) be sufficient for Gnodes to estimate genome size?
A: About 500 UCG seems sufficient; a lower number may work (200?). There is a minimum length of ~300-600 bases for a UCG-CDS to accurately calculate the unit coverage depth; alignment effects creep in if shorter. Larger sample sizes have more reliability, which can be helpful, as all of these genome statistics are messy.
A good UCG sample shows good statistical properties (low error, most UCG have nearly the same values), and I'm now confident it provides an accurate unit coverage depth, which is the critical statistic for these measures. All the k-mer and other genome size estimators, and assembly software, are based around this formula: G = L*N/C, where L*N (DNA bases sequenced: read length times read count) is a near-certain measure, but C, the unit coverage depth, is hard to measure reliably.
The formula also relies on an assumption of even coverage depth across the DNA sequencing, which doesn't always hold. 10-year-old Illumina DNA is often very uneven (I wrestle with old model fly and plant DNA like this; Gnodes paper #1 has some examples). The DNA base count (L*N) can also be tricky: contaminants and non-nuclear DNA inflate it, and method artifacts can affect it (e.g. PCR-amplified versus PCR-free DNA).
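To make the arithmetic concrete, here is a minimal sketch of the idea behind these measures, not Gnodes' actual code: take per-UCG coverage depths, drop CDS shorter than a minimum length, use a robust summary (the median) as the unit coverage depth C, and apply G = L*N/C. All numbers are invented for illustration.

```python
import statistics

# (gene id, CDS length in bases, mean read depth) -- made-up values
ucg_depths = [
    ("g1", 900, 24.8), ("g2", 1200, 25.3),
    ("g3", 250, 41.0),   # too short: alignment effects distort its depth
    ("g4", 600, 25.1), ("g5", 1500, 24.6), ("g6", 800, 25.0),
]

MIN_CDS_LEN = 300  # per the ~300-600 base minimum noted above
depths = [d for _, length, d in ucg_depths if length >= MIN_CDS_LEN]
C = statistics.median(depths)   # robust unit coverage depth

L, N = 150, 20_000_000          # read length, read count (hypothetical)
G = L * N / C                   # genome size estimate
print(f"C = {C}x, G = {G/1e6:.0f} Mb")  # C = 25.0x, G = 120 Mb
```

The median (rather than mean) keeps one anomalous gene like g3 above, or a mismeasured depth, from skewing C, which matters since G is inversely proportional to it.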
Gnodes provides 2+ types of estimates of G: LN/C and mapped-LN/C, the latter using only the DNA mapped to the assembly. Mapping cuts out contaminants, but also cuts out valid parts not in your assembly; that is okay if the missing parts are high-identity duplicates, but not so good for missing unique DNA.
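A small illustration of the two estimates, with invented numbers (not from any real dataset): the all-reads estimate LN/C includes contaminant reads, while the mapped estimate drops reads that do not align to the assembly.

```python
total_bases  = 3.0e9   # all sequenced DNA bases (L*N)
mapped_bases = 2.7e9   # bases in reads that map to the assembly
C = 25.0               # unit coverage depth from UCG

est_all    = total_bases / C    # includes contaminant/non-nuclear DNA
est_mapped = mapped_bases / C   # excludes anything not in the assembly
print(f"all-reads: {est_all/1e6:.0f} Mb, mapped: {est_mapped/1e6:.0f} Mb")
# prints: all-reads: 120 Mb, mapped: 108 Mb
# the 12 Mb gap is unmapped DNA: contaminant and/or genuinely missing content
```

The gap between the two numbers is where judgment is needed: contamination argues for the mapped estimate, missing unique sequence argues against it.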
One of Gnodes' measures is of missing unique gene sequence; this is one way that independently RNA-assembled genes help validate your genome assembly. Some recent genome assemblies I've seen are missing 1% to 5% or more of *unique gene sequences*, which can be extrapolated to the whole genome. Those missed mRNA genes are easier to validate as true organismal content, versus contaminant, than a value like "10% of DNA doesn't map to your assembly".
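A rough sketch of that extrapolation, with invented numbers: if some fraction of unique gene sequence is absent from the assembly, assume roughly the same fraction of the whole genome is missing (a simplifying assumption, as the text notes the missing parts may not be representative).

```python
def extrapolate_full_size(assembly_size, missing_unique_fraction):
    """Implied genome size if unique-sequence loss is uniform genome-wide."""
    return assembly_size / (1.0 - missing_unique_fraction)

# hypothetical: 120 Mb assembly missing 3% of unique gene bases
implied = extrapolate_full_size(120e6, 0.03)
print(f"implied genome size: {implied/1e6:.1f} Mb")  # ~123.7 Mb
```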
Q: Is there an issue of circularity in providing only BUSCO cds when BUSCO is used by Gnodes?
A: Probably not. Although I've not tried this, it should work. There may be an effect of DNA that belongs to missing genes being mapped onto the UCG, inflating the unit coverage depth C. There are low-identity cut-offs in Gnodes, but for this use it may be worth testing with a known sample like the At-plant genes: UCG-only versus the full gene set.
Q: .. maybe actual genome size estimates will remain accurate?
A: There are no fully reliable ways to measure genome sizes, in my now-broad experience trying to find one. The molecular/cytology methods usually work, if agreement with DNA measures means they work, but they can also be way off (or be measuring differently sized genomes). The various DNA measures (k-mer, Gnodes-like, and assembly) can agree, but often disagree, sometimes by very large percentages.
Behind many of the genome size measurement and assembly problems are the very-high-copy, high-identity repeats and duplications, including genes, but mostly the structure of centromeres and telomeres, satellite DNA or higher-order repeats, and/or actively replicating transposons. This leads to large *biological* fluctuations in genome sizes within populations. Cases I've noted include an F1 hybrid that drops 30% in size. It is also clear that different populations of a species vary, e.g. by temperature/climate, where cold leads to larger genome sizes by extending telomere/centromere and similar repeats (noted in the model plant, daphnia, maybe fruitfly).
Q: Pipeline Usage
This way should work as a test:
The pipeline includes running the versions of RepeatMasker and BUSCO that I use, which not everyone has.
This gnodes_pipe writes shell scripts, which I often edit, and which you will want to change for production use.
E.g. run your own annotation tools (BUSCO, RepeatMasker, etc.) and create the few tables that Gnodes then uses to measure the major contents of your genomes. These annotations are given as GFF, Fasta, and/or simple tables (e.g. genecds.idtable has IDs, a type of CDS/TE/other, and "busco" or "UCG" to flag unique genes).
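As a hypothetical sketch of such an ID table: the exact columns below (ID, type, unique-gene flag) are my reading of the description above, not the documented Gnodes format, so check the pipeline's generated tables for the real layout.

```python
# invented example rows for a genecds.idtable-style tab-separated file
rows = [
    ("gene0001", "CDS",   "UCG"),   # unique conserved (BUSCO-like) gene
    ("gene0002", "CDS",   ""),      # ordinary coding gene
    ("te0001",   "TE",    ""),      # transposable element
    ("rep0001",  "other", ""),      # other repeat/annotation
]

with open("genecds.idtable", "w") as fh:
    for gene_id, gtype, flag in rows:
        fh.write(f"{gene_id}\t{gtype}\t{flag}\n")
```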
For related species one might reuse the same transposon, repeat, and even gene sequences, mapping those to each assembly with blast or another tool, and avoid having to run the sometimes slow tools (RepeatMasker/RepeatModeler) to make independent annotations. I do that some; it works, with the caveat that you may miss some species-specific changes: a quick-and-dirty versus careful-slow trade-off.
I am working out a test run data package for this, so you all can try it without too many hassles.
- Don Gilbert