—Carson
SOBAcl --columns file --rows type --data type --data_type count \ data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff
my $annot = GAL::Annotation->new(qw(file.gff file.fasta); my $features = $annot->features; my $genes = $features->search( {type => ‘gene'} ); while (my $gene = $genes->next) { print $gene->feature_id . “\t"; print $gene->splice_complexity . “\n”; } }
Just a quick thought
The smallest summary of what you’re after might be the jaccard difference between you annotation as computed by bedtools http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
??
—Carson
> <maker_opts_run2.log>_______________________________________________
> maker-devel mailing list
> maker...@yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
_______________________________________________
maker-devel mailing list
maker...@yandell-lab.org
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
On Apr 25, 2016, at 5:55 AM, Florian <fdo...@students.uni-mainz.de> wrote:
Hello All,
First off, thank you all for your input! I took a look at all your suggestions and have some questions:
The SOBAcl tool is nice but I cant seem to find a way to get to the AED values MAKER produces. For example here is a line from my GFF file:
scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);Notice the _AED entry is in the 9th "field" combined with all the other descriptive data. Is there a way to get to this? The information about number and mean/distribution of length of genes, while certainly valuable, is hard to interpret for me. How would one classify improvement? More genes annotated? Less genes but longer averages?
For the moment I will take a look at GAL, though perl is not my strongest language.
For the scripts Michael provided I have attached the results. It would be great if you could send me a pdf version of the paper you mentioned.
The comparison script lists SN/SP/AC with >98% which indicates there should be no big changes between annotations right? But the cumulative AED graph shows a LOT entries have an AED value of 1 which would indicate the exact opposite?
You said 95% with less than 0.5 AED would be pretty good, soo only ~55% would mean this is a pretty bad annotation?
I am not sure if this is maybe to far off topic for the maker mailing list, but thank you for any clarification / input.
kind regards,
Florian
On 20.04.2016 15:16, Campbell, Michael wrote:
I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three. MAKER calculates a modified Jaccard distance between the MAKER generated gene models and the aligned evidence called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5. There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in swissprot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a pdf. http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract I also have a "use at your own risk" script on github that I use to compare MAKER runs two at a time. the script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful. https://github.com/mscampbell/Genome_annotation
The SOBA tool that Barry mentioned is a lot more flexible and if you are familiar with perl the GAL library does a lot of heavy lifting for you. Mike On Apr 19, 2016, at 5:44 PM, Cook, Malcolm <M...@stowers.org<mailto:M...@stowers.org>> wrote: Just a quick thought The smallest summary of what you’re after might be the jaccard difference between you annotation as computed by bedtoolshttp://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
?? From: maker-devel [mailto:maker-dev...@yandell-lab.org
] On Behalf Of Barry Moore Sent: Tuesday, April 19, 2016 4:37 PM To: Florian <fdo...@students.uni-mainz.de<mailto:fdo...@students.uni-mainz.de>>; maker-devel <maker...@yandell-lab.org<mailto:maker...@yandell-lab.org>> Cc: Campbell, Michael <mcam...@cshl.edu<mailto:mcam...@cshl.edu>> Subject: Re: [maker-devel] A way to compare 2 annotation runs? The Sequence Ontology provides some tools for this: SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA This simple example provides a table for two GFF3 files of the count of feature types: SOBAcl --columns file --rows type --data type --data_type count \ data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff More complex examples are available in the test file SOBA/t/sobacl_test.sh The GAL library is a perl library that works well with MAKER output and other valid GFF3 documents. I has some scripts that would provide metrics along the lines of what you’re looking for, but is primarily a programing library to make it easy to roll your own
If you’re OK with a little bit of perl code, modifying the synopsis code in the README a bit you can generate the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce: use GAL::Annotation; my $annot = GAL::Annotation->new(qw(file.gff file.fasta); my $features = $annot->features; my $genes = $features->search( {type => ‘gene'} ); while (my $gene = $genes->next) { print $gene->feature_id . “\t"; print $gene->splice_complexity . “\n”; } } Hope that helps, Barry On Apr 19, 2016, at 9:08 AM, Carson Holt <cars...@gmail.com<mailto:cars...@gmail.com>> wrote: I’m going to ask Michael Campbell to answer this. He wrote a protocols paper that will help. —Carson On Apr 19, 2016, at 6:08 AM, Florian <fdo...@students.uni-mainz.de<mailto:fdo...@students.uni-mainz.de>> wrote: Hello All, We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed but I cant really come up with a clever way to this. I reckon this has to be an often solved problem though I couldnt find a solution except an older entry in this mail-list but that wasnt helpful. So how are people assessing quality of a maker run? How do you say one run was 'better' than another? best regards & thanks for your input, Florian _______________________________________________ maker-devel mailing list
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list
<AED_cummulative.png><comparison.txt>maker...@yandell-lab.org<mailto:maker...@yandell-lab.org> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org