Hello all,
I am using PASA to annotate two de novo Trinity assemblies from two different but related species. PASA runs without problems in both cases. However, I have one question and one error:
1) Question: Running "Launch_PASA_pipeline.pl" generates a number of files, one of which is "alignment.validations.output". While examining this file, I found some "ERROR" entries in the 9th column, i.e. "alignment_valid coord_span". What does ERROR mean here?
2) Error: The "Loading pre-existing protein-coding gene annotation" section says that, before loading a gff3 file into the database, one has to check its validity with "pasa_gff3_validator.pl". I get an error when I run this on either of my gff3 files. Here are the errors:
Assembly_Species1 = "Fatal Error: cannot parse ID from entry chrM dictyBase Curator CDS 36 1658 . + . Parent=DDB0201582 at /sw/opt/PASA/misc_utilities/pasa_gff3_validator.pl line 54, <$fh> line 1."
Assembly_Species2 = "Fatal Error, cannot locate data entry for ID: [PPA1271346] at /sw/opt/PASA/misc_utilities/pasa_gff3_validator.pl line 119"
Here is the gene present at line 119:
GL290983 GenBank gene 43626 45778 . - . ID=PPA_G1268120;Name=PPL_00094
GL290983 GenBank mRNA 43626 45778 . - . ID=PPA1268122;Name=PPL_00094.t00;Parent=PPA_G1268120
GL290983 GenBank exon 43626 43645 . - . ID=PPA1268124;Name=exon-auto1268124;Parent=PPA1268122
GL290983 GenBank exon 43760 43798 . - . ID=PPA1268126;Name=exon-auto1268126;Parent=PPA1268122
GL290983 GenBank exon 43926 44396 . - . ID=PPA1268128;Name=exon-auto1268128;Parent=PPA1268122
GL290983 GenBank exon 44604 44752 . - . ID=PPA1268130;Name=exon-auto1268130;Parent=PPA1268122
GL290983 GenBank exon 44860 45138 . - . ID=PPA1268132;Name=exon-auto1268132;Parent=PPA1268122
GL290983 GenBank exon 45254 45326 . - . ID=PPA1268134;Name=exon-auto1268134;Parent=PPA1268122
GL290983 GenBank exon 45416 45592 . - . ID=PPA1268136;Name=exon-auto1268136;Parent=PPA1268122
GL290983 GenBank exon 45733 45778 . - . ID=PPA1268138;Name=exon-auto1268138;Parent=PPA1268122
I tried deleting this gene completely (gene, mRNA and exons) from line 119, but the validator then stopped at the same line 119 again. While searching the web for the Assembly_Species2 error, I found the following post: http://sourceforge.net/p/pasa/mailman/message/32505345/
I tried the solution Brian suggested in that post, "misc_utilities/gff3_file_to_proteins.pl gff3_file genome_db", but I am not sure what genome_db is here. Is it the original genome FASTA file? I also followed Jessica's suggestion to use the gff3 files directly without validating them. The problem is that the final output then contains the results without any modification: every entry says "original gene structure, not modified by PASA". So I am not sure about the output.
Could I ask everyone for their views on what is going on here? I would highly appreciate any suggestion or help.
Many thanks,
Reema Singh
Post-doctoral Research Assistant
The Pauline Schaap Lab and The Barton Group
Division of Cell and Developmental Biology and Division of Computational Biology
College of Life Sciences University of Dundee, Dundee, Scotland, UK
www.lifesci.dundee.ac.uk/groups/pauline_schaap/
www.compbio.dundee.ac.uk
twitter : @ReemaSingh28
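Both validator messages above point at ID/Parent bookkeeping in the gff3 file: the first entry is a CDS that carries only a Parent attribute, and the second names an ID that could not be located. Note also that in Perl error messages the "at SCRIPT line N" suffix gives the line of the script that raised the error, while "<$fh> line N" gives the last line read from the input, so "line 119" may refer to pasa_gff3_validator.pl rather than to line 119 of the gff3 file. The following is a minimal sketch of an independent check, not a reimplementation of the PASA validator; it assumes a tab-separated gff3 file, and the function name check_gff3_ids is hypothetical.

```python
import io


def check_gff3_ids(handle):
    """Return (features lacking an ID, Parent values that match no ID).

    Assumption: 9 tab-separated columns per feature line, attributes as
    key=value pairs separated by ';'. This is NOT the PASA validator logic.
    """
    seen_ids = set()
    parents = []      # (line number, parent id)
    missing_id = []   # (line number, feature type)
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines in this sketch
        attrs = {}
        for field in cols[8].split(";"):
            if "=" in field:
                key, value = field.split("=", 1)
                attrs[key.strip()] = value.strip()
        if "ID" in attrs:
            seen_ids.add(attrs["ID"])
        else:
            missing_id.append((lineno, cols[2]))
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                parents.append((lineno, parent))
    # Parents are resolved only after the whole file is read, so a child
    # may legitimately appear before its parent.
    dangling = [(n, p) for n, p in parents if p not in seen_ids]
    return missing_id, dangling


# Demo records taken from the post, re-joined with tabs (an assumption).
demo = "\n".join([
    "chrM\tdictyBase Curator\tCDS\t36\t1658\t.\t+\t.\tParent=DDB0201582",
    "GL290983\tGenBank\tgene\t43626\t45778\t.\t-\t.\tID=PPA_G1268120;Name=PPL_00094",
    "GL290983\tGenBank\tmRNA\t43626\t45778\t.\t-\t.\tID=PPA1268122;Name=PPL_00094.t00;Parent=PPA_G1268120",
    "GL290983\tGenBank\texon\t43626\t43645\t.\t-\t.\tID=PPA1268124;Name=exon-auto1268124;Parent=PPA1268122",
])
missing_id, dangling = check_gff3_ids(io.StringIO(demo))
print("features without ID:", missing_id)
print("Parent without matching ID:", dangling)
```

Running this over a whole file and searching for the reported ID (for example PPA1271346) among the dangling parents may locate the offending record more directly than editing by line number.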
I was just comparing the PASA annotation with the existing curated gene set and the Trinity-assembled transcripts. I found an example where Trinity assembled an existing transcript with an extended UTR, shown in the figure [ http://www.compbio.dundee.ac.uk/user/rsingh/Figure_PASA.jpeg ]. Here the transcript "comp5893_c34_seq1" aligns very well to chr1. Here is the description:
comp5893_c34_seq1 (a) = 1400 (Length) [Dark Blue in figure]
1-62 (a) = 2639972 2640033(chr1) 100% identity
63-381 = 2640305 2640623
382-1400 = 2640791 2641809
The curated gene DDB0191262 [orange] [DDB_G0269146] has exactly the same chromosome coordinates. Please see [http://dictybase.org/gene/DDB_G0269146/feature/DDB0191262] for detailed information. As evidence, I also checked the similarity between the DDB0191262 protein sequence and the longest ORF encoded by the comp5893_c34_seq1 transcript: they are the same protein, so the similarity is 100%. The read depth [green and dark red for the control and knockout samples] for this gene adds further confidence in the accuracy of this new transcript. But when I look at the PASA annotation, it shows only a fragment [red], and the rest of the transcript is not even among the valid alignments [blue fragments] or the failed alignments. This makes me curious to know what is happening here. Any ideas or opinions?
Also, if I go on to the annotation comparison and update, there would be no update to the existing annotation, even though the figure clearly shows the extended UTR [if I am not interpreting this wrongly]. Why is there this difference?
Thanks,
Reema
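As a quick consistency check on the alignment coordinates quoted above for comp5893_c34_seq1, the sketch below compares the transcript span and genome span of each block and computes the gaps between consecutive genomic blocks (the implied introns). It assumes 1-based inclusive coordinates on the forward strand; the function name summarize_blocks is hypothetical.

```python
def summarize_blocks(blocks):
    """blocks: list of (t_start, t_end, g_start, g_end), 1-based inclusive.

    Returns (spans, gaps): per-block (transcript length, genome length)
    and the genomic gap between each pair of consecutive blocks.
    """
    spans = [(t_end - t_start + 1, g_end - g_start + 1)
             for t_start, t_end, g_start, g_end in blocks]
    gaps = [blocks[i + 1][2] - blocks[i][3] - 1
            for i in range(len(blocks) - 1)]
    return spans, gaps


# Coordinates as quoted in the post for comp5893_c34_seq1 on chr1.
blocks = [
    (1, 62, 2639972, 2640033),
    (63, 381, 2640305, 2640623),
    (382, 1400, 2640791, 2641809),
]
spans, introns = summarize_blocks(blocks)
for (t_len, g_len), block in zip(spans, blocks):
    status = "ok" if t_len == g_len else "length mismatch"
    print(block, t_len, g_len, status)
print("implied intron lengths:", introns)
print("aligned transcript bases:", sum(t for t, _ in spans))
```

With the quoted coordinates, each block covers the same number of bases on the transcript and on the genome, and the three blocks together account for the full 1400-base transcript, which is consistent with the gap-free alignment described in the post.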
Hello Brian,
I got a very good annotation after finishing the PASA assembly comparison and update against the existing annotation. The updated annotation shows extended UTRs and corrected gene models. However, I still have some questions about the update:
Question 1: Why did PASA join genes after the comparison and update with the existing annotation, even though the rest of the evidence (read depth, Trinity transcript and the existing curated/predicted gene model) shows that these are two genes? [Attached figures Updation_Join_1 and Updation_Join_2]
Question 2: Why did PASA miss some annotation? All the features are highlighted in the attached figure, and the missed annotation is marked with the black box [attached figure Updation_Missing_1]
Best,
Reema
$PASA_HOME/misc_utilities/gtf_to_gff3_format.pl genes.gtf genome.fa > genes.converted.gff3
There is an error:
-parsing GTF file: gene-models-v1.0.gff
of lineannot get gene_id from MMa09781
at /home/bio_soft/PASApipeline-2.0.2/misc_utilities/../PerlLib/GTF_utils.pm line 85, <$fh> line 1.
GTF_utils::GTF_to_gene_objs('gene-models-v1.0.gff') called at /home/bio_soft/PASApipeline-2.0.2/misc_utilities/../PerlLib/GTF_utils.pm line 30
GTF_utils::index_GTF_gene_objs_from_GTF('gene-models-v1.0.gff', 'HASH(0x1bd1a68)') called at ../misc_utilities/gtf_to_gff3_format.pl line 27
Can you help me solve it? I do not know how to convert my file. Thank you very much!
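The input named in the log ends in ".gff", while gtf_to_gff3_format.pl expects GTF. One possibility (an assumption, not confirmed by the log) is that the ninth column of the file does not contain GTF-style gene_id "..." tags, which would explain the "get gene_id from MMa09781" message. A minimal sketch for inspecting the attribute column before conversion follows; the function name attribute_style is hypothetical.

```python
import re

# GTF attributes look like: gene_id "g1"; transcript_id "g1.t1";
GTF_GENE_ID = re.compile(r'gene_id\s+"[^"]+"')
# GFF3 attributes look like: ID=x;Parent=y
GFF3_TAG = re.compile(r'(^|;)\s*(ID|Parent)=')


def attribute_style(attr_column):
    """Classify a ninth-column string as 'gtf', 'gff3' or 'unknown'."""
    if GTF_GENE_ID.search(attr_column):
        return "gtf"
    if GFF3_TAG.search(attr_column):
        return "gff3"
    return "unknown"


examples = [
    'gene_id "g1"; transcript_id "g1.t1";',
    "ID=PPA1268122;Name=PPL_00094.t00;Parent=PPA_G1268120",
    "MMa09781",
]
styles = [attribute_style(e) for e in examples]
for example, style in zip(examples, styles):
    print(style, "<-", example)
```

If the first feature lines of the file classify as "unknown" or "gff3", the file is probably not in the format the GTF converter expects, and the conversion step would need a different starting point.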