[maker-devel] Annotation quality and converting gff3 to gtf


James Eckert

Apr 14, 2013, 5:07:33 PM

to maker...@yandell-lab.org

Hello,

I'm currently trying to figure out ways to evaluate the quality of annotations that MAKER produces. I'm working on a novel species, so there isn't a reference genome to compare the annotation quality to.

After a bit of searching on the web, I came across the EVAL tool, which I thought might be useful for checking the output quality. EVAL takes GTF files rather than GFF3, but MAKER seems to address this through its accessory scripts.

I first used the script "gff3_merge" to combine my whole annotation into a single gff3 file. Next I used "add_utr_start_stop_gff" to add the UTRs explicitly, which is needed for converting the gff3 file to gtf. The problem arose when I ran "gff3_to_eval_gtf". I was expecting the script to process the whole gff3 file, but it seems to have processed only one contig and written just two CDS lines. The same thing happens when running the "gff3_2_gtf" script.

Here is the command I'm running, along with the output:
gff3_to_eval_gtf assem_kmer_57_utr.gff3
NODE_20666_length_66353_cov_18.405483 maker CDS 8801 8984 . - 0 gene_id "1"; transcript_id "2";
NODE_20666_length_66353_cov_18.405483 maker CDS 8113 8717 . - 2 gene_id "1"; transcript_id "2";

My question is whether the "gff3_to_eval_gtf" and "gff3_2_gtf" scripts have a bug in them, or whether I'm running the process incorrectly. If the conversion can't be made to work, is there an alternative to EVAL that works with native MAKER annotations?

Attached is my whole genome gff3 file, along with the file I ran "gff3_to_eval_gtf" on.

assem_kmer-57_exp-44_covcutoff-auto_contigs.all.gff3

assem_kmer_57_utr.gff3

Thank you in advance for your help,

James

Carson Holt

Apr 16, 2013, 10:20:01 AM

to James Eckert, maker...@yandell-lab.org

The input GFF3 file you linked to contains only one gene, is that correct? If so, you should only get one gene in the output. The resulting GTF should contain only the genes (all the evidence is ignored).
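One quick way to check how many gene models a GFF3 file actually contains is to tally the feature types in column 3. The following is only a minimal sketch in Python, not part of MAKER: the function name `count_feature_types` and the sample lines (contig `contig1`, IDs `g1`, `hit1`) are made up for illustration, and it assumes a standard nine-column, tab-separated GFF3 with an optional `##FASTA` section at the end.

```python
from collections import Counter


def count_feature_types(gff3_lines):
    """Tally GFF3 feature types (column 3), skipping comments and the FASTA section."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue  # blank lines, comments, and other directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        counts[cols[2]] += 1
    return counts


# Hypothetical sample: one gene model plus one evidence alignment.
sample = [
    "##gff-version 3",
    "contig1\tmaker\tgene\t100\t900\t.\t-\t.\tID=g1",
    "contig1\tmaker\tmRNA\t100\t900\t.\t-\t.\tID=g1-mRNA-1;Parent=g1",
    "contig1\tmaker\tCDS\t100\t900\t.\t-\t0\tID=g1-mRNA-1:cds;Parent=g1-mRNA-1",
    "contig1\tblastx\tprotein_match\t100\t900\t.\t-\t.\tID=hit1",
    "##FASTA",
    ">contig1",
    "ACGT",
]

counts = count_feature_types(sample)
print(counts["gene"], "gene feature(s);", sum(counts.values()), "features in total")
```

For a real file, pass an open file handle instead of the list, for example `count_feature_types(open("assem_kmer_57_utr.gff3"))`. If the "gene" count is 1, a one-gene GTF is the expected result of the conversion.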

To convert for EVAL, use the command lines below. Note the flags: -g for gff3_merge makes sure you are only looking at genes, and the FASTA must be included in the file, so do not use the -n flag.

gff3_merge -d maker_datastore_index.log -g -o some_file.gff

add_utr_start_stop_gff some_file.gff > some_file2.gff

maker2eval some_file2.gff

Note that all versions of MAKER after 2.09 no longer have add_utr_start_stop_gff. The UTR is now always there explicitly, so you go straight from gff3_merge to maker2eval_gtf.

However, with that explanation given, I have to wonder whether EVAL is appropriate for you. EVAL requires a reference annotation set (which is assumed to be 100% perfect) for comparison, and you get a perfect score whenever your gene calls are exactly identical to the reference set (which in itself has obvious bias, but we won't get into that). Given that you have no reference set, it will not give you anything other than statistics on the distribution of intron and exon sizes.

An alternative measure of quality when there is no reference genome is AED, which is computed for each gene as part of the MAKER run. It is basically a variation of EVAL-like statistics run against evidence clusters rather than a reference genome. Or you can just use % domain content.
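As a rough illustration of how the AED values could be summarized, here is a small Python sketch. It is not a MAKER script. It assumes that each mRNA line in the MAKER GFF3 carries an `_AED=<value>` attribute in column 9, where 0 means the model agrees fully with the evidence and 1 means no evidence support. The function names, the sample lines, and the 0.5 cutoff are all chosen only for illustration.

```python
import re

# Matches "_AED=0.12" as its own attribute, but not "_eAED=0.12".
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def collect_aed(gff3_lines):
    """Return the AED value of every mRNA feature that has an _AED attribute."""
    values = []
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # sequence section, no more features
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        match = AED_RE.search(cols[8])
        if match:
            values.append(float(match.group(1)))
    return values


def fraction_at_or_below(values, cutoff):
    """Fraction of values that are <= cutoff (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(v <= cutoff for v in values) / len(values)


# Hypothetical mRNA lines with made-up AED values.
sample = [
    "contig1\tmaker\tmRNA\t100\t900\t.\t-\t.\tID=m1;Parent=g1;_AED=0.10;_eAED=0.12",
    "contig1\tmaker\tmRNA\t2000\t2900\t.\t+\t.\tID=m2;Parent=g2;_AED=0.40;_eAED=0.45",
    "contig2\tmaker\tmRNA\t50\t700\t.\t+\t.\tID=m3;Parent=g3;_AED=0.90;_eAED=0.95",
]

aed_values = collect_aed(sample)
share = fraction_at_or_below(aed_values, 0.5)
print(f"{len(aed_values)} transcripts, {share:.0%} with AED <= 0.5")
```

Plotting the cumulative fraction over a range of cutoffs gives a curve that can be compared between annotation runs on the same assembly.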

See these links for examples of the statistics:

http://www.biomedcentral.com/1471-2105/12/491

http://www.biomedcentral.com/1471-2105/10/67

Also a figure is attached with an example of quality analysis using combined AED, domain content, and comparative orthologs.

--Carson

