[maker-devel] mapping legacy annotation names forward to maker-updated annotations using map_forward=1


Lindy McBride

Jan 24, 2012, 10:43:30 PM
to maker...@yandell-lab.org

Hi All,

I am using MAKER on a lab cluster to update a legacy annotation with
RNAseq data. I am having trouble getting names and attributes from
the base annotation to be mapped forward to the final updated
annotation. Below I'll paste some relevant parameters from the
maker_opts.ctl file I am using and sections from the legacy gff file
and maker output gff file representing the same gene - for which the
name was not mapped forward. Most of the annotations have been
altered, at least in minor ways, from the legacy annotations and the
only thing I can think of to explain why map_forward is not working is
that the new and old genes are not identical. Any thoughts?

Thank you!
Lindy

maker_opts.ctl excerpt

maker_gff=legacy_annotation.gff3
est_gff=cufflinks_output.gff3
map_forward=1

legacy_annotation.gff3 excerpt

CH478573.1 VectorBase gene 27689 45150 . + . ID=AAEL015004; Name=AAEL015004; description=hypothetical protein;
CH478573.1 VectorBase mRNA 27689 45150 . + . ID=AAEL015004-RA; Parent=AAEL015004; Name=AAEL015004; description=hypothetical protein;
CH478573.1 VectorBase exon 27689 27997 . + . ID=E030486; Parent=AAEL015004-RA;
CH478573.1 VectorBase exon 28063 28292 . + . ID=E030487; Parent=AAEL015004-RA;
CH478573.1 VectorBase exon 43603 45150 . + . ID=E030488; Parent=AAEL015004-RA;
CH478573.1 VectorBase CDS 27689 27997 . + . ID=AAEL015004-PA; Parent=AAEL015004-RA;
CH478573.1 VectorBase CDS 28063 28292 . + . ID=AAEL015004-PA; Parent=AAEL015004-RA;
CH478573.1 VectorBase CDS 43603 43636 . + . ID=AAEL015004-PA; Parent=AAEL015004-RA;
CH478573.1 VectorBase three_prime_utr 43637 45150 . + . Parent=AAEL015004-RA;

maker output gff excerpt

CH478573.1 maker gene 27681 45143 . + . ID=maker-CH478573.1-augustus-gene-0.14;Name=maker-CH478573.1-augustus-gene-0.14;
CH478573.1 maker mRNA 27681 45143 . + . ID=maker-CH478573.1-augustus-gene-0.14-mRNA-1;Parent=maker-CH478573.1-augustus-gene-0.14;Name=maker-CH478573.1-augustus-gene-0.14-mRNA-1;_AED=0.03;_eAED=0.03;_QI=62|1|1|1|0.33|0.25|4|1395|172;
CH478573.1 maker exon 27681 27997 . + . ID=maker-CH478573.1-augustus-gene-0.14-mRNA-1:exon:0;Parent=maker-CH478573.1-augustus-gene-0.14-mRNA-1;
CH478573.1 maker exon 28063 28292 . + . ID=maker-CH478573.1-augustus-gene-0.14-mRNA-1:exon:1;Parent=maker-CH478573.1-augustus-gene-0.14-mRNA-1;
CH478573.1 maker exon 43603 43662 . + . ID=maker-CH478573.1-augustus-gene-0.14-mRNA-1:exon:2;Parent=maker-CH478573.1-augustus-gene-0.14-mRNA-1;
CH478573.1 maker exon 43775 45143 . + . ID=maker-CH478573.1-augustus-gene-0.14-mRNA-1:exon:3;Parent=maker-CH478573.1-augustus-gene-0.14-mRNA-1;
CH478573.1 maker five_prime_UTR 27681 27742 . + . ID=maker-CH478573.1-augustus-gene-0.14-mRNA-1:five_prime_utr;Parent=maker-CH478573.1-augustus-gene-0.14-mRNA-1;
CH478573.1 maker CDS 27743 27997 . + 0 ID=maker-CH478573.1-augustus-gene-0.14-mRNA-1:cds;Parent=maker-CH478573.1-augustus-gene-0.14-mRNA-1;
CH478573.1 maker CDS 28063 28292 . + 0 ID=maker-CH478573.1-augustus-gene-0.14-mRNA-1:cds;Parent=maker-CH478573.1-augustus-gene-0.14-mRNA-1;
CH478573.1 maker CDS 43603 43636 . + 1 ID=maker-CH478573.1-augustus-gene-0.14-mRNA-1:cds;Parent=maker-CH478573.1-augustus-gene-0.14-mRNA-1;
CH478573.1 maker three_prime_UTR 43637 43662 . + . ID=maker-CH478573.1-augustus-gene-0.14-mRNA-1:three_prime_utr;Parent=maker-CH478573.1-augustus-gene-0.14-mRNA-1;
CH478573.1 maker three_prime_UTR 43775 45143 . + . ID=maker-CH478573.1-augustus-gene-0.14-mRNA-1:three_prime_utr;Parent=maker-CH478573.1-augustus-gene-0.14-mRNA-1;

_______________________________________________
maker-devel mailing list
maker...@box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Carson Holt

Jan 25, 2012, 9:12:22 AM
to Lindy McBride, maker...@yandell-lab.org
The maker_gff option is for MAKER derived GFF3. You will need to give
your legacy annotation file to the model_gff option and then run again.

FYI: if you also supply the same file to pred_gff (at the same time as
model_gff), MAKER can also try to take the models and extend their UTRs
using the cufflinks_output.gff3 results.
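
As a minimal sketch of the change described above, the maker_opts.ctl
excerpt from the original post might become the following. The file names
are the ones from that post; the pred_gff line is optional and only needed
if you want MAKER to try extending UTRs on the legacy models.

```
# legacy annotation supplied as gene models so names can be mapped forward
model_gff=legacy_annotation.gff3
# optional: the same file again as predictions, so MAKER may extend UTRs
pred_gff=legacy_annotation.gff3
est_gff=cufflinks_output.gff3
map_forward=1
```

Note that maker_gff is no longer set here, since the legacy file is not
MAKER-derived GFF3.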

Here is a breakdown of how different evidence types are used.

**model_gff --> These are assumed to be high confidence gene models. These
will affect clustering of evidence. Because they are high confidence, the
clustering will slightly bias MAKER towards keeping rather than replacing
previous models for borderline cases. They can also be used for
maintaining names. MAKER is only allowed to keep or replace models and
cannot modify them. If no evidence supports them, MAKER can still keep
them because they are assumed to be high confidence (but MAKER will still
tag them with an AED score of 1 in those cases).

**pred_gff/snap/augustus/etc --> Gene predictions (lower confidence than
gene models). They do not affect evidence clustering. MAKER can keep
them as they are or modify them by trimming and adding exons based on EST
evidence. These will only be maintained in the final annotation set if
there is some form of evidence supporting their structure.

**est_gff/est --> These are assumed to be correctly assembled and aligned
around splice sites (MAKER uses exonerate to align around splice sites for
ESTs in FASTA files). MAKER can use them to infer gene models directly
(est2genome option), can use them as support for maintaining predictions,
and can use them to modify structure and add UTR to predictions. If you
let MAKER try and find alternative splice forms, they will be used to
identify support for splice variants. How these cluster with other
evidence will help MAKER infer gene boundaries in some cases. MAKER will
also use splice sites inferred from the ESTs to inform gene predictors
during the prediction step.

**protein_gff/protein --> MAKER uses exonerate to align around splice
sites for proteins in FASTA files. MAKER can use them to infer gene
models directly (protein2genome option), but only if they align correctly
around splice sites. MAKER can use them as support for maintaining
predictions (the CDS will be checked where possible to ensure the gene
prediction and protein alignment are in the same reading frame). How
these cluster with other evidence will help MAKER infer gene boundaries in
some cases. MAKER will also use ORFs inferred from the proteins to inform
gene predictors during the prediction step.


**repeat_gff/rmlib/model_org/repeat_protein --> Repeats will be masked to
stop ESTs and proteins from aligning to repetitive regions and to keep gene
prediction algorithms from calling exons in those regions. Many repeats
encode real proteins (e.g. retrotransposases and others). Because of this,
gene predictors and aligners are often confused by them (they can falsely
be added as exons onto gene calls, for example).

**other_gff --> These are GFF3 lines you just want MAKER to add to your
files. They normally represent things MAKER doesn't predict (promoter
regions, CpG islands, restriction sites, non-coding RNAs, etc). MAKER will
not attempt to validate the features, but will just pass them through "as
is" to the final GFF3 file.

**maker_gff --> This is primarily a convenience option which allows you to
provide a previous MAKER derived GFF3 file and then select what types of
features to keep using the "pass" options (model_pass, pred_pass,
est_pass, etc). This allows feature types to be mixed in the same file as
opposed to the model_gff option, for example, where all elements are
required to be part of a gene model or est_gff where all elements must
represent an EST. The maker_gff option only works with unmodified MAKER
derived GFF3 files as MAKER is looking for specific tags and will ignore
any lines that do not have them.
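
A hedged sketch of how maker_gff and the "pass" options might be combined
in maker_opts.ctl. The file name previous_maker_run.gff3 is hypothetical,
and which pass options to switch on is only an illustration (1 = keep that
feature type from the file, 0 = ignore it).

```
# hypothetical file: unmodified GFF3 output from an earlier MAKER run
maker_gff=previous_maker_run.gff3
# reuse the EST alignments and gene models stored in that file
est_pass=1
model_pass=1
# do not reuse the ab initio predictions stored in that file
pred_pass=0
```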


Thanks,
Carson


On 12-01-24 10:43 PM, "Lindy McBride" <lmcb...@mail.rockefeller.edu>
wrote:

Carson Holt

Jan 25, 2012, 2:21:44 PM
to Lindy McBride, maker-devel@yandell-lab.org List
Yes. Tophat can call splice site crossing reads all over the place. If
you are not using est2genome, it can sometimes help to support and
separate best models produced by alternate gene predictors. But if you
use it with est2genome, you end up with a lot of random splice sites being
called as genes. I usually throw it out as well when cufflinks data looks
good. You can also try trinity and other programs as alternatives to
cufflinks. Be aware that in fungi and other organisms whose genes commonly
have overlapping UTRs, cufflinks, trinity, and programs like them will
merge transcripts.
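
For reference, the cufflinks-only setup described in the quoted message
below would look roughly like this in maker_opts.ctl. This is a sketch
that assumes the same file name as in the original post, with the tophat
junction data simply left out of the evidence.

```
# cufflinks transcripts only; tophat junctions are not supplied
est_gff=cufflinks_output.gff3
# let MAKER infer gene models directly from the transcript alignments
est2genome=1
```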

Thanks,
Carson

On 12-01-25 2:10 PM, "Lindy McBride" <lmcb...@mail.rockefeller.edu> wrote:

>Thanks for your detailed answer and suggestion. I now have it working
>with model_gff and pred_gff.
>
>I'm finding that I get the best results so far when I send maker
>cufflinks data only and allow direct mapping to the genome. If I
>include tophat data, I cannot specify est2genome=1 or else I get a lot
>of tiny fragments being annotated.
>
>Cheers,
>Lindy

