Eric,
Thanks much for your interest and questions about Gnodes. I'll be happy to help you, and others, try this out and get it working for your project if you want. It will help me greatly to get feedback on details that don't work smoothly.
Gnodes offers something important beyond what other genome size measurement tools do: measures of the main genome components, which help you understand where an assembly differs from the actual DNA content, and decide whether that matters to you. Under-assembly of the repetitive centromere/telomere "junk" portions may not matter much, versus under-assembly of coding genes. The model plant project got the 80% of the genome that is euchromatic/non-repetitive fully assembled over 20 years ago, and only recently has almost fully assembled the remaining centromere/telomere portions and other highly repetitive spans.
BUSCO-called genes are what I call unique conserved genes (UCG), as there are other routes to finding them (i.e. expert-curated gene sets). I use gene sets built on genome assemblies (e.g. NCBI RefSeq) and RNA-assembled gene sets. For the critical UCG subset, these are often equivalent today, so the source doesn't matter much. If the genome assemblies are bad, though, then UCG drawn from them can also be bad, causing problems.
Q: Would the number of BUSCO sequences (150-5000 depending on the clade and assembly quality) be sufficient for Gnodes to estimate genome size?
A: About 500 UCG seems sufficient; a lower number may work (200?). There is a minimum length of ~300-600 bases for a UCG-CDS to accurately calculate the unit coverage depth; alignment effects creep in if shorter. Larger sample sizes have more reliability, which can be helpful, as all of these genome statistics are messy.
A good UCG sample shows good statistical properties (low error, most UCG have nearly the same values), and I'm now confident it provides an accurate unit coverage depth, which is the critical statistic for these measures. All the k-mer and other genome size estimators, and assembly software, are based around this formula: G = L*N/C, where L*N (DNA bases sequenced: read length times read count) is a near-certain measure, but C, the unit coverage depth, is hard to measure reliably.
The formula also relies on an assumption of even coverage depth across the DNA sequencing, which doesn't always hold. 10-year-old Illumina DNA is often very uneven (I wrestle with old model fly and plant DNA like this; Gnodes paper #1 has some examples). The DNA base count (L*N) can also be tricky: contaminants and non-nuclear DNA inflate it, and method artifacts can affect it (e.g. PCR-amplified versus PCR-free DNA).
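To make the arithmetic concrete, here is a minimal sketch of the idea behind these measures, not Gnodes' actual code: take per-UCG coverage depths, drop CDS shorter than a minimum length, use a robust summary (the median) as the unit coverage depth C, and apply G = L*N/C. All numbers are invented for illustration.

```python
import statistics

# (gene id, CDS length in bases, mean read depth) -- made-up values
ucg_depths = [
    ("g1", 900, 24.8), ("g2", 1200, 25.3),
    ("g3", 250, 41.0),   # too short: alignment effects distort its depth
    ("g4", 600, 25.1), ("g5", 1500, 24.6), ("g6", 800, 25.0),
]

MIN_CDS_LEN = 300  # per the ~300-600 base minimum noted above
depths = [d for _, length, d in ucg_depths if length >= MIN_CDS_LEN]
C = statistics.median(depths)   # robust unit coverage depth

L, N = 150, 20_000_000          # read length, read count (hypothetical)
G = L * N / C                   # genome size estimate
print(f"C = {C}x, G = {G/1e6:.0f} Mb")  # C = 25.0x, G = 120 Mb
```

The median (rather than mean) keeps one anomalous gene like g3 above, or a mismeasured depth, from skewing C, which matters since G is inversely proportional to it.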
Gnodes provides 2+ types of estimates of G: LN/C and mapped-LN/C, the latter using only the DNA mapped to the assembly. Mapping cuts out contaminants, but also cuts out valid parts not in your assembly; that is okay if the missing parts are high-identity duplicates, but not so good for missing unique DNA.
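A small illustration of the two estimates, with invented numbers (not from any real dataset): the all-reads estimate LN/C includes contaminant reads, while the mapped estimate drops reads that do not align to the assembly.

```python
total_bases  = 3.0e9   # all sequenced DNA bases (L*N)
mapped_bases = 2.7e9   # bases in reads that map to the assembly
C = 25.0               # unit coverage depth from UCG

est_all    = total_bases / C    # includes contaminant/non-nuclear DNA
est_mapped = mapped_bases / C   # excludes anything not in the assembly
print(f"all-reads: {est_all/1e6:.0f} Mb, mapped: {est_mapped/1e6:.0f} Mb")
# prints: all-reads: 120 Mb, mapped: 108 Mb
# the 12 Mb gap is unmapped DNA: contaminant and/or genuinely missing content
```

The gap between the two numbers is where judgment is needed: contamination argues for the mapped estimate, missing unique sequence argues against it.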
One of Gnodes' measures is of missing unique gene sequence; this is one way that independently RNA-assembled genes help validate your genome assembly. Some recent genome assemblies I've seen are missing 1% to 5% or more of *unique gene sequences*, which can be extrapolated to the whole genome. Those missed mRNA genes are easier to validate as true organismal content, versus contaminant, than a value like "10% of DNA doesn't map to your assembly".
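A rough sketch of that extrapolation, with invented numbers: if some fraction of unique gene sequence is absent from the assembly, assume roughly the same fraction of the whole genome is missing (a simplifying assumption, as the text notes the missing parts may not be representative).

```python
def extrapolate_full_size(assembly_size, missing_unique_fraction):
    """Implied genome size if unique-sequence loss is uniform genome-wide."""
    return assembly_size / (1.0 - missing_unique_fraction)

# hypothetical: 120 Mb assembly missing 3% of unique gene bases
implied = extrapolate_full_size(120e6, 0.03)
print(f"implied genome size: {implied/1e6:.1f} Mb")  # ~123.7 Mb
```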
Q: Is there an issue of circularity in providing only BUSCO cds when BUSCO is used by Gnodes?
A: Probably not. Although I've not tried this, it should work. There may be an effect of DNA that belongs to missing genes being mapped onto the UCG, inflating the unit coverage depth C. There are low-identity cut-offs in Gnodes, but for this use it may be worth testing with a known sample like the At-plant genes: UCG-only versus the full gene set.
Q: .. maybe actual genome size estimates will remain accurate?
A: There are no fully reliable ways to measure genome sizes, in my now-broad experience trying to find one. The molecular/cytology methods usually work, if agreement with DNA measures means they work, but they can also be way off (or be measuring differently sized genomes). The various DNA measures (k-mer, Gnodes-like, and assembly) can agree, but often disagree, sometimes by very large percentages.
Behind many of the genome size measurement and assembly problems are the very-high-copy, high-identity repeats and duplications, including genes, but mostly the structure of centromeres and telomeres, satellite DNA or higher-order repeats, and/or actively replicating transposons. This leads to large *biological* fluctuations in genome sizes within populations. Cases I've noted include an F1 hybrid that drops 30% in size. It is also clear that different populations of a species vary, e.g. by temperature/climate, where cold leads to larger genome sizes by extending telomere/centromere and similar repeats (noted in the model plant, daphnia, maybe fruitfly).
Q: Pipeline Usage
This way should work as a test:
The pipeline includes running the versions of RepeatMasker and BUSCO that I use, which not everyone has.
This gnodes_pipe writes shell scripts, which I often edit, and which you will want to change for production use.
E.g. run your own annotation tools (BUSCO, RepeatMasker, etc.) and create the few tables that Gnodes then uses to measure the major contents of your genomes. These annotations are given as GFF, Fasta, and/or simple tables (e.g. genecds.idtable has IDs, a type of CDS/TE/other, and "busco" or "UCG" to flag unique genes).
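As a hypothetical sketch of such an ID table: the exact columns below (ID, type, unique-gene flag) are my reading of the description above, not the documented Gnodes format, so check the pipeline's generated tables for the real layout.

```python
# invented example rows for a genecds.idtable-style tab-separated file
rows = [
    ("gene0001", "CDS",   "UCG"),   # unique conserved (BUSCO-like) gene
    ("gene0002", "CDS",   ""),      # ordinary coding gene
    ("te0001",   "TE",    ""),      # transposable element
    ("rep0001",  "other", ""),      # other repeat/annotation
]

with open("genecds.idtable", "w") as fh:
    for gene_id, gtype, flag in rows:
        fh.write(f"{gene_id}\t{gtype}\t{flag}\n")
```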
For related species one might reuse the same transposon, repeat, and even gene sequences, mapping those to each assembly with blast or another tool, and avoid having to run the sometimes slow tools (RepeatMasker/RepeatModeler) to make independent annotations. I do that some; it works, with the caveat that you may miss some species-specific changes: a quick-and-dirty versus careful-slow trade-off.
I am working out a test run data package for this, so you all can try it without too many hassles.
- Don Gilbert