picking best gene model using gff

78 views

Skip to first unread message

Robert King

unread,

Apr 2, 2019, 1:14:46 PM4/2/19

to EvidentialGene

I've using the tr2aacds.pl script to produce a good transcriptome and I've used this in annotating a genome plus just transferred it to a genome using exonerate to produce gene models. I now want to pick the best gene model. combining these sets produces a high busco but lots of redundency which is why want to pick the best gene model. I can use the tr2aacds.pl script again using the transcripts produced from the gff files and using those ID's take gene models from each gff but I think I see something more comprehensive in the evigene scripts which i'm not sure how to run. Is there something to pick the best gene model from multiple gff?

Don Gilbert

unread,

Apr 3, 2019, 10:00:20 PM4/3/19

to EvidentialGene

Dear Robert King,

Your message has a few parts to it that warrant separate answers.

A > .. combining these sets produces a high busco but lots of redundency, which is why want to pick the best gene model.

B > Is there something to pick the best gene model from multiple gff?
There is but one of your needs may be better met without the GFF mapping, read further.

A)
First, while BUSCO is a very useful standard test, it has flaws,
one being that it doesn't recognize alternate transcripts (and proteins),
and people using it are rightly confused by its duplicated score,
or "lots of redundancy" .

With the model plant official gene set, BUSCO scores more "single-copy" genes
as duplicated than single, because it just doesn't recognize alternate
transcripts: busco -m proteins -c 8 -sp arabidopsis
    C:99.1%[S:37.8%,D:61.3%],F:0.3%,M:0.6%,n:1440
    1426    Complete BUSCOs (C)
    544    Complete and single-copy BUSCOs (S)
    882    Complete and duplicated BUSCOs (D)
    4    Fragmented BUSCOs (F)
    10    Missing BUSCOs (M)
    1440    Total BUSCO groups searched

My recommendations are to ignore BUSCO's complete-but-duplicated score,
and just count Complete, versus fragmented + missing in your
orthology assessment. There are other reasons for this than alternates.
But another answer to this is to separate out a single "best"
protein per locus. Best in terms of BUSCO is most-homologous, not

necessarily longest, or most expressed. If your gene assembly is rather fragmented,
there will also be apparent duplications as different fragments
often appear as separate loci (a common result for large, complex genes).

Your suggestion of mapping to chromosomes then
parsing gene locations is one way, and has its uses, but also
various problems: multiple mapped paralogs are one, combined with
likely mis-assemblies in gene duplication regions, as well as
hassles of parsing gene locations back to gene protein IDs.
One has to measure overlapping CDS in such GFF tables to
score loci, account for multiple locations per transcript, etc.

The Evigene method of calculating gene loci and alternates is
analogous to mapping to chromosomes. but simplified: overlapping
coding sequences by self-alignment tells you which transcripts
belong to one locus. I've measured the accuracy of this way
for standard/model organisms and it works well.

Evigene's tr2aacds does classify transcripts by gene locus, but it doesn't
take the next needed step of providing new gene locus identifiers. It
produces two subsets of validated sequences in okayset/, the "okay",
and "okalt", both of which should be used for BUSCO and other further
analyses. "okay" has the longest protein for each putative gene locus,
"okalt" has shorter, but often more homologous proteins which will
give you a higher completeness score with homology tests.

I strongly recommend you and all who use tr2aacds of EvidentialGene then
follow that component with this one, which produces gene locus documented
sequences:
$evigene/scripts/evgmrna2tsa2.pl -onlypubset -idprefix Aspecies1EVm -class aspecies.trclass

then use publicset/aspecies.aa_pub.fa, and other publicset/ files. for any
further analyses such as BUSCO, homology tests. This contains same validated
sequences as okayset/aspecies.{okay,okalt}.aa but with IDs that indicate their
gene locus and alternate transcripts.

Read this in your evigene/docs/evgpipe_sra2genes.help.txt or at
https://sourceforge.net/p/evidentialgene/blog/2018/03/gene-transcript-id-table-from-evgmrna2tsa/
----

A second BUSCO flaw is in its "transcriptome" mode, which uses a very
quick and *dirty* all frame translation of transcripts. The results of
this mode disagree with standard protein translations and also yeild
spurious homology matches (e.g. one gene transcript with one valid
protein can be partly matched to two different conserved genes by
faulty translation).

Always use another protein translator, with BUSCO's "protein" mode.
See evigene/docs/protein_calcs_compared.txt or
https://sourceforge.net/p/evidentialgene/blog/2019/01/proteins-from-transcripts-orf-methods-compared/

B) pick the best gene model from multiple gff

See this doc, evigene/docs/evg_overbestgene.help.txt

This is the basic method of evigene/scripts/overbestgene2.pl:
gene models with overlapping CDS exons are "the same locus", each model has some
form of evidence score, and the method picks out those models with highest
evidence score. The trick or trouble is mainly in applying various evidence scores,
and their combination, to return the best models that a human expert would pick.