[maker-devel] Augustus retraining


Panos Ioannidis

Mar 24, 2015, 7:31:21 AM
to maker-devel
Hello All,

I'm trying to retrain Augustus using EST data from the same species and realized that quite a few of the gene models I get based on EST data are incomplete (i.e. no start and/or stop codon).

Now, when I get to the "etraining" step in Augustus retraining (right after the time-consuming "optimize_augustus.pl" step), I get a warning for each gene that doesn't contain a start or stop codon.

.....
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon does not begin with start codon but with acg
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
....


Does anyone know whether training is compromised by such incomplete gene models? Do you usually exclude them from the training set?
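
One way to see how much of the training set is affected is to separate complete from incomplete models before running etraining. A minimal sketch in Python, assuming each model's spliced CDS has already been extracted on the coding strand with the stop codon included, and that the standard genetic code applies. The function names and the dict input are made up for illustration and are not part of Augustus or MAKER:

```python
# Assumption: standard genetic code, CDS given 5'->3' on the coding strand,
# stop codon included in the CDS string.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_cds(cds: str) -> str:
    """Label a CDS as complete or name which end is missing."""
    seq = cds.upper()
    has_start = seq.startswith("ATG")
    has_stop = len(seq) >= 3 and seq[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_start:
        return "no_stop"
    if has_stop:
        return "no_start"
    return "no_start_no_stop"


def split_training_set(cds_by_gene: dict) -> tuple:
    """Return (complete_ids, incomplete_ids) for a {gene_id: cds} mapping."""
    complete, incomplete = [], []
    for gene_id, cds in cds_by_gene.items():
        if classify_cds(cds) == "complete":
            complete.append(gene_id)
        else:
            incomplete.append(gene_id)
    return complete, incomplete
```

This only separates the two groups so they can be counted or trained on separately; it does not by itself say whether keeping the incomplete models hurts training.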

Oh, and by the way, the best guide to retraining Augustus is here. The official web page isn't bad, but doesn't explain in detail certain things.

Thanks,
Panos

Xabier Vázquez Campos

Mar 24, 2015, 8:06:39 AM
to Panos Ioannidis, maker-devel
Hi Panos,

Have you tried using webAugustus for the (re)training? I found it very convenient for generating the models for Augustus.

Cheers,

--
Xabier Vázquez Campos
PhD Candidate
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

Panos Ioannidis

Mar 24, 2015, 8:25:17 AM
to Xabier Vázquez Campos, maker-devel
Hi Xabier,

Thanks for your quick reply!

No, I haven't used WebAugustus, but I just checked it out and it looks like my training set is too big (~300 Mbp), so I can't even upload it!

Anyway, I prefer to train it locally because I have better control over each step. Also, I have already done the entire training procedure with fewer genes, but didn't get good gene-level sensitivity (~5%). So now I'm trying to repeat it using more of my scaffolds, but it appears I get a lot more incomplete models from exonerate (run through Maker).

P


Carson Holt

Mar 24, 2015, 10:15:10 AM
to Panos Ioannidis, maker-devel
Hi Panos,

ESTs and mRNA-seq assemblies will by their nature be partial. After a first round of training, you can run MAKER with protein and EST evidence together with the newly trained Augustus species file. Because MAKER gives hints to Augustus as it runs, the models it produces will be better than what you would get from running Augustus on its own. Then take these gene models and use them to retrain Augustus. This is the standard bootstrap retraining procedure, and it can be repeated as needed.
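
The control flow of that bootstrap procedure can be sketched as a small loop. This is only an illustration in Python, assuming two hypothetical callables that you would supply yourself: `train` stands for the Augustus training step (for example etraining, possibly followed by optimize_augustus.pl) and `annotate` stands for a MAKER run that uses the trained species together with EST and protein evidence. Neither name exists in MAKER or Augustus:

```python
from typing import Callable, TypeVar

Models = TypeVar("Models")   # whatever represents a set of gene models
Params = TypeVar("Params")   # whatever represents a trained species file


def bootstrap_training(
    seed_models: Models,
    train: Callable[[Models], Params],      # hypothetical: wraps Augustus training
    annotate: Callable[[Params], Models],   # hypothetical: wraps a MAKER run
    rounds: int = 1,
) -> Params:
    """Train on the seed models, then alternate annotate/retrain `rounds` times."""
    # First round: train on the (possibly partial) EST-derived models.
    params = train(seed_models)
    for _ in range(rounds):
        # MAKER predicts with the current parameters plus evidence hints.
        models = annotate(params)
        # Retrain on the improved models.
        params = train(models)
    return params
```

With `rounds=1` this is exactly one pass of "train, run MAKER, retrain"; larger values repeat the last two steps.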

More info on bootstrap training is here (written for SNAP, but the procedure is similar for Augustus): http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
Here is an excellent explanation of Augustus training: http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
And here is a tool to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training, so just take its output and convert it for Augustus): https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Finally, you can also manually edit the GFF3 file in Apollo (the legacy standalone version is easier to use), and then convert that file for bootstrap training.

—Carson

Panos Ioannidis

Mar 24, 2015, 10:31:38 AM
to Carson Holt, maker-devel
Hi Carson,

So you think it's okay to include incomplete gene models when training Augustus?

I'll certainly try the bootstrap method you're suggesting. Even though I did it for SNAP, for some weird reason I forgot it for Augustus :p Do you think, however, that I can get a big improvement in gene-level sensitivity? Currently, I have only 6%...

Thanks,
Panos

Carson Holt

Mar 24, 2015, 10:39:38 AM
to Panos Ioannidis, maker-devel
On your first round it is fine. It gives the predictor enough to work with, and then on the second round you use improved models. When you say 6% sensitivity, is that Augustus running on its own? If it’s inside of MAKER, that means you are not providing sufficient protein evidence (you need the full proteome of at least two related species). Also, is that gene-level, exon-level, or nucleotide-level sensitivity? With the gene-level measure, you only get a match when you perfectly match all transcripts in a gene (and those reference models may not be correct in the first place). This value will rarely go above 10% for any predictor. You need to use the nucleotide-level sensitivity/specificity metrics. The gene- and exon-level metrics are basically meaningless (unless it’s Drosophila, which is the only species annotated correctly enough to use them).
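
To make the difference between the three levels concrete, here is a toy calculation in Python. It is a simplified sketch, not how Augustus itself computes its accuracy report: it assumes a single strand and sequence, 1-based inclusive exon coordinates, and a made-up representation where a gene is a frozenset of transcripts and a transcript is a tuple of (start, end) exons.

```python
def nucleotides(genes):
    """All positions covered by any exon of any transcript."""
    return {pos for gene in genes for tx in gene
            for (start, end) in tx for pos in range(start, end + 1)}


def exons(genes):
    """All distinct (start, end) exons."""
    return {exon for gene in genes for tx in gene for exon in tx}


def sensitivity(reference, predicted):
    """Fraction of reference nucleotides / exons / genes recovered exactly."""
    ref_nt, ref_ex, ref_genes = nucleotides(reference), exons(reference), set(reference)
    return {
        "nucleotide": len(ref_nt & nucleotides(predicted)) / len(ref_nt),
        "exon": len(ref_ex & exons(predicted)) / len(ref_ex),
        # A gene counts only if every transcript matches exon for exon.
        "gene": len(ref_genes & set(predicted)) / len(ref_genes),
    }


# Two reference genes with one transcript each (toy coordinates).
reference = [
    frozenset({((1, 10), (21, 30), (41, 50))}),
    frozenset({((101, 110), (121, 130))}),
]
# Predictions that are each off by two bases at a single exon boundary.
predicted = [
    frozenset({((1, 10), (21, 30), (41, 48))}),
    frozenset({((103, 110), (121, 130))}),
]
result = sensitivity(reference, predicted)
```

In this toy case 46 of 50 reference nucleotides and 3 of 5 reference exons are recovered, but no gene matches exactly, so a single wrong exon boundary per gene is enough to push the gene-level value to zero.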

—Carson

Panos Ioannidis

Mar 24, 2015, 11:06:25 AM
to Carson Holt, maker-devel
Yes, 6% is gene-level sensitivity. Exon-level is 62% and nucleotide-level is 88%. I only mentioned gene-level, because that's the only metric mentioned in the Augustus web site.

I got these numbers outside of Maker. Actually, I only used Maker to generate the gff files needed to start the training (ran it using only EST evidence and only on a subset of my assembly, using this as a guide).

Now I've started running the second round of training, as you suggested. However, since I don't have data from closely related species, I'm only using UniRef50 as protein evidence.

P

Carson Holt

Mar 24, 2015, 11:38:24 AM
to Panos Ioannidis, maker-devel
I’d pick a couple of species that are as closely related as you can find.  Proteins will still align over large evolutionary distances, but you need a breadth of protein data that UniRef and even databases like Swiss-Prot won’t have (those databases are usually a little too conservative).

The gene level sensitivity/specificity values make certain assumptions about the models you are comparing with.  Unfortunately the only species these assumptions hold true for is Drosophila and maybe C. elegans to a point.  This is the reason you will rarely see species other than those two used for the gene and exon sensitivity/specificity metrics.

Thanks,
Carson