In addition, you should try running your pipeline through CEGMA
(http://korflab.ucdavis.edu/datasets/cegma/) to identify the expected
completeness of the genome. For example, if the genome is 70% complete,
then you should only expect to recover 70% of the genes. I believe CEGMA
can also be run online from the iPlant Discovery Environment and iPlant
Atmosphere images. Also, make sure you are including proteins with your
MAKER run. Not all genes will be expressed, so mRNAseq will only capture a
portion of the genes, and that portion can be as low as 50%.
Thanks,
Carson
Here is more explanation on each point:
1. In addition to any mRNA/EST data, you should provide full proteomes
from a minimum of two species as closely related as possible, and perhaps
a comprehensive database such as UniProt/Swiss-Prot. Note that, based on
experience, the comprehensive database cannot substitute for a related
species proteome. It can complement one, but not replace it. So you need
to supply full proteomes from at least some related species. mRNA/EST data
is not sufficient by itself, so make sure you have enough protein evidence.
2. All models are ultimately generated by the predictors (MAKER doesn’t
generate them itself), so care should be taken to train the predictors as
well as possible. Also train at least two predictors (SNAP and Augustus
are recommended). If they are both well trained, then they will be in
general concordance with one another. If they are not well trained, then
each program will produce very different models. So visually inspecting
their concordance can give you an idea of whether they need to be retrained.
3. More often than not, poor predictor performance is actually the result
of repeat-related complications. Many genomes that at first may seem
repeat-poor may actually contain novel repeats that can affect the
performance of the gene predictors. If you are getting fewer genes than
you expect, or the ab initio models from two independent predictors are
not in concordance, run something like RepeatScout to generate a
species-specific library. This may seem minor, but I have seen predictions
go from apparently random to textbook perfect just by producing a
species-specific library of novel repeats.
4. You can’t have gene models if you don’t have open reading frames to
translate through. Also, gene predictors need sequence upstream and
downstream of genes to work correctly, so if contigs are too short they
won’t be useful for prediction, even if the sum of the contigs is large
enough to encompass the whole genome. In general, any contig smaller than
10 kb is not annotatable, so you should aim for as high an N50 value as
possible.
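To make point 4 concrete, here is a minimal Python sketch (not part of MAKER) that computes the N50 of an assembly and the fraction of assembled bases sitting in contigs of at least 10 kb. The function names `contig_stats` and `read_fasta_lengths` and the file name in the usage note are made up for illustration, and the 10 kb default simply mirrors the rule of thumb above.

```python
def contig_stats(lengths, min_len=10_000):
    """Return (N50, fraction of assembled bases in contigs >= min_len)."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contigs given")
    total = sum(ordered)
    # N50: length of the contig at which the running sum of the
    # longest-first contigs first reaches half of the assembly size.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    usable = sum(length for length in ordered if length >= min_len)
    return n50, usable / total


def read_fasta_lengths(path):
    """Yield the sequence length of each record in a FASTA file."""
    length = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line)
    if length is not None:
        yield length
```

Something like `contig_stats(read_fasta_lengths("genome.fasta"))` (hypothetical file name) would then report both numbers for an assembly. A low second value suggests that much of the sequence sits in contigs too short to annotate.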
Annotating a new genome is sort of like hitting a moving target. No two
organisms are alike, so you usually have to identify what deficiencies
exist based on preliminary runs and then correct for them in subsequent runs.
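One rough way to look for such deficiencies after a preliminary run is to put a number on predictor concordance alongside the visual inspection described in point 2. The Python sketch below is a crude stand-in, not a MAKER feature: it reports the fraction of one predictor's features that overlap any feature from the other predictor on the same sequence and strand. The source and feature-type names in the usage note (column 2 and column 3 of the GFF3) are assumptions, so check your own output for the actual values.

```python
from collections import defaultdict


def load_intervals(gff3_path, source, feature_type):
    """Collect (start, end) per (seqid, strand) for one source and feature type."""
    intervals = defaultdict(list)
    with open(gff3_path) as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # embedded sequence follows; no more feature lines
            if line.startswith("#") or not line.strip():
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 9:
                continue
            if cols[1] == source and cols[2] == feature_type:
                key = (cols[0], cols[6])
                intervals[key].append((int(cols[3]), int(cols[4])))
    return intervals


def fraction_overlapped(a, b):
    """Fraction of intervals in a that overlap any interval in b
    on the same sequence and strand."""
    total = hits = 0
    for key, spans in a.items():
        others = b.get(key, [])
        for start, end in spans:
            total += 1
            if any(start <= o_end and o_start <= end
                   for o_start, o_end in others):
                hits += 1
    return hits / total if total else 0.0
```

For example, assuming the two predictors appear under the sources `snap_masked` and `augustus_masked` with feature type `match`, one could call `load_intervals` once per source and pass the two results to `fraction_overlapped`. A low fraction in either direction would be a hint, not proof, that retraining or a better repeat library is needed.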
Thanks,
Carson
mpi_evaluator [options] <eval_opts> <eval_bopts> <eval_exe>
Ok. Content looks good. When training SNAP, just make sure to use
gff3_merge to join the GFF3 files, and do not strip out the FASTA sequence
at the end.
Thanks,
Carson
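A quick way to confirm that a merged file still carries its sequence is to look for the `##FASTA` directive that the GFF3 format uses to introduce embedded sequence. This is a minimal Python sketch under that assumption. The function name `has_fasta_section` is made up for illustration and is not part of MAKER or SNAP.

```python
def has_fasta_section(gff3_path):
    """Return True if the file has a ##FASTA directive followed by
    at least one sequence header line."""
    seen_directive = False
    with open(gff3_path) as handle:
        for line in handle:
            if not seen_directive:
                if line.strip() == "##FASTA":
                    seen_directive = True
            elif line.startswith(">"):
                return True
    return False
```

If this returns False for the merged GFF3, the sequence was stripped somewhere along the way and the file should be regenerated before it is used for SNAP training.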
From: dhivya arasappan <daras...@gmail.com>
Date: Thursday, February 6, 2014 at 10:29 AM
To: Carson Holt <cars...@gmail.com>
Cc: Daniel Ence <de...@genetics.utah.edu>
Subject: Re: [maker-devel] maker annotation with cufflinks output
Sorry I was just trying to make it small enough to be approved by the mailing list.
Here is the whole file: