Split genes interpretation

Carol Buitrago

Mar 10, 2020, 4:44:18 PM

to pasapipeline-users

Hi everyone!

I hope you are all well. I need a bit of help understanding the PASA update output.

I'm pretty new to the world of genome annotation, but I've done my best to annotate a genome using PASA-AUGUSTUS-PASAupdate.

After the last step of the PASA update, I get a gff3 file with some split genes in it, and I would like to understand how to interpret the names assigned to them.

For instance, I'll have something like this:

# PASA_UPDATE: split_gene_g3357-g3357.t1-m152, new gene addition, valid-1, status:[pasa:asmbl_6162,status:40], valid-1
Pver_Sc0000041_size824484 . gene 217005 219513 . - . ID=split_gene_g3357-g151;Name=%2A%2A%20NO%20NAME%20ASSIGNED%20%2A%2A
Pver_Sc0000041_size824484 . mRNA 217005 219513 . - . ID=split_gene_g3357-g3357.t1-m152;Parent=split_gene_g3357-g151;Name=%2A%2A%20NO%20NAME%20ASSIGNED%20%2A%2A
Pver_Sc0000041_size824484 . five_prime_UTR 219433 219513 . - . ID=split_gene_g3357-g3357.t1-m152.utr5p1;Parent=split_gene_g3357-g3357.t1-m152
Pver_Sc0000041_size824484 . five_prime_UTR 219161 219326 . - . ID=split_gene_g3357-g3357.t1-m152.utr5p2;Parent=split_gene_g3357-g3357.t1-m152
Pver_Sc0000041_size824484 . exon 219433 219513 . - . ID=split_gene_g3357-g3357.t1-m152.exon1;Parent=split_gene_g3357-g3357.t1-m152
Pver_Sc0000041_size824484 . exon 217005 219326 . - . ID=split_gene_g3357-g3357.t1-m152.exon2;Parent=split_gene_g3357-g3357.t1-m152
Pver_Sc0000041_size824484 . CDS 217451 219160 . - 0 ID=cds.split_gene_g3357-g3357.t1-m152;Parent=split_gene_g3357-g3357.t1-m152
Pver_Sc0000041_size824484 . three_prime_UTR 217005 217450 . - . ID=split_gene_g3357-g3357.t1-m152.utr3p1;Parent=split_gene_g3357-g3357.t1-m152

#PROT split_gene_g3357-g3357.t1-m152 split_gene_g3357-g151 MNIPEQGHKTPHTVSLIVEDGKEFKTLSHILSTASPFFEKLLSSNMKENQEGVIRLEVFTESLMKDVLEFIQTGNVRISTRENAEELVAAADYLCLSKLKSFAGKFIEQTMSSSNCVSTYFMAEKFDHSELLDSVRKFILSNFADVAQTENFPRLPSHEVEQLVSSDDIVINSEEDVFNAIFKWTLHKKSERSKEFCKLFSHVRLTFLSRDFLLKDVVTNEIVTENEDCIDRVNSALAWMNRKTDCDLSRPYSPRKAMETCVIAIWGEDKTFRPLFYVPENNDFYQLPGMAEPDCVPQHVFSCRGKLFFVAQEVNKSQYYDPDSNCWHPAPWTKTDSKPNWCKTNRREIGKLFLRAVLVVENEICFIEENLQIYSSCLCQFNLDKKSTTRSKDWLKTTRTCLVTLGQYIYVIGGTKSDVNIVPHCSRYDIVKNKWQKLANLRFARFRALGIGTQEKIYVAGGWLDFGGEITNTCEVYSVLTDEWHLIGRLTVPRDIGNMLSIDESLYVLGGVCHPVLGKIWSVESYEHEKDEWKENRHFLDIHKIMTFTACTFRLLKDVSVKLKHHGNS*

Does split_gene_g3357-g3357.t1-m152 result from the split of gene_g3357 or gene_g151? Please help me understand how to interpret these names.

I intend to report only the longest isoform of each gene, and I'm not sure whether I should add these split genes as isoforms of a parent gene. If so, which is the parent gene, gene_g3357 or gene_g151?

Any advice or suggestions will be greatly appreciated. I'd also appreciate it if you could share some documentation on how to interpret the output of PASA update.

Thanks in advance!

Brian Haas

Mar 10, 2020, 6:35:59 PM

to Carol Buitrago, pasapipeline-users

Hi Carol,

If you can get the PASA web portal set up, it'll be easier to navigate the reports of how updates were made and how they're connected to the original features. There's no great text-formatted file that explains it. In this case, I suspect the parent gene was g3357. PASA is mostly concerned here with ensuring that the identifiers are unique, and less with making them informative, though it would be nice if they were both unique and informative. ;-)
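On the naming question, the GFF3 itself records the relationship: the mRNA's Parent attribute points to split_gene_g3357-g151, which has its own gene line, so in the file it stands as a separate gene rather than as an isoform of g3357 or g151. For reporting one isoform per gene, here is a minimal Python sketch. It is not part of PASA; `parse_attributes` and `longest_isoform_per_gene` are made-up helper names, "longest" is assumed to mean the largest summed CDS length, and the code splits on whitespace because the records pasted in this thread are space-separated (real GFF3 files are tab-delimited).

```python
from collections import defaultdict
from urllib.parse import unquote


def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}, decoding %XX escapes."""
    pairs = (field.split("=", 1) for field in column9.strip().split(";") if "=" in field)
    return {key: unquote(value) for key, value in pairs}


def longest_isoform_per_gene(gff3_lines):
    """Return {gene_id: (mrna_id, cds_length)} for the longest-CDS mRNA of each gene."""
    mrna_to_gene = {}
    cds_length = defaultdict(int)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments such as '# PASA_UPDATE: ...'
        cols = line.split(None, 8)  # assumption: whitespace-separated, 9 columns
        if len(cols) < 9:
            continue
        feature_type, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if feature_type == "mRNA":
            mrna_to_gene[attrs["ID"]] = attrs["Parent"]
        elif feature_type == "CDS":
            # GFF3 coordinates are 1-based and inclusive
            cds_length[attrs["Parent"]] += end - start + 1
    best = {}
    for mrna_id, gene_id in mrna_to_gene.items():
        length = cds_length[mrna_id]
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (mrna_id, length)
    return best


# Three records copied from the post above
example = [
    "Pver_Sc0000041_size824484 . gene 217005 219513 . - . ID=split_gene_g3357-g151;Name=%2A%2A%20NO%20NAME%20ASSIGNED%20%2A%2A",
    "Pver_Sc0000041_size824484 . mRNA 217005 219513 . - . ID=split_gene_g3357-g3357.t1-m152;Parent=split_gene_g3357-g151;Name=%2A%2A%20NO%20NAME%20ASSIGNED%20%2A%2A",
    "Pver_Sc0000041_size824484 . CDS 217451 219160 . - 0 ID=cds.split_gene_g3357-g3357.t1-m152;Parent=split_gene_g3357-g3357.t1-m152",
]
print(longest_isoform_per_gene(example))
```

With these three records, the sketch reports split_gene_g3357-g3357.t1-m152 (CDS length 1710 bp) as the only, and therefore longest, isoform of split_gene_g3357-g151.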

hope this helps,

~brian

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

Carol Buitrago

Mar 11, 2020, 5:12:28 AM

to Brian Haas, pasapipeline-users

Hi Brian,

Thanks a lot for the quick reply. Indeed, I've only run PASA from the terminal and I'm not very familiar with the web portal, but I'll give it a try. Can I just upload the PASA output to view the updates?

Also, taking advantage of your responsiveness, I wanted to ask a question that came to mind while going through the PASA update gff3 file. Some of the novel gene models look pretty normal to me. By normal I mean that the name of the transcript matches the name of the parent gene.

In the first screenshot, the transcript name is novel_model_145_5de57afd and the parent gene is novel_gene_145_5de57afd.

However, other novel genes have a different parent gene (I mean the number in the name doesn't match). In the second screenshot, as in many other cases, the transcript name is novel_model_147_5de57afd and the parent gene is novel_gene_146_5de57afd.

Is this correct, or is it a bug? I downloaded PASA from the GitHub repository on November 19, 2019, which, to the best of my knowledge, is the latest version.

Thanks in advance for clarifying. And I agree with you, it would be awesome to have unique and informative names ;)

Brian Haas

Mar 11, 2020, 6:56:15 AM

to Carol Buitrago, pasapipeline-users

Hi Carol,

I expect there's a counter in there that gets incremented for each new gene or new isoform, and if additional models are added (i.e., alternatively spliced isoforms), then the counters will get out of sync. Again, this just ensures uniqueness; the names aren't super informative.
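To illustrate how that kind of drift could arise, here is a toy Python sketch. It is purely hypothetical and not taken from the PASA code: it assumes one counter for genes and a separate one for models, both starting at 145, and `assign_ids` is a made-up name. The suffix is copied from the IDs quoted in the question.

```python
from itertools import count


def assign_ids(isoforms_per_gene, start=145, suffix="5de57afd"):
    """Name genes and models from two independent counters (hypothetical scheme)."""
    gene_counter = count(start)
    model_counter = count(start)
    assignments = []
    for n_isoforms in isoforms_per_gene:
        gene_id = f"novel_gene_{next(gene_counter)}_{suffix}"
        for _ in range(n_isoforms):
            # every isoform consumes a model number, so a gene with two
            # isoforms pushes the model counter ahead of the gene counter
            model_id = f"novel_model_{next(model_counter)}_{suffix}"
            assignments.append((model_id, gene_id))
    return assignments


# Three novel genes; the second one has two isoforms
for model_id, gene_id in assign_ids([1, 2, 1]):
    print(model_id, "->", gene_id)
```

Under these assumptions, the first gene and its model share the number 145, while the second gene (146) receives models 146 and 147, which would produce pairs like the ones described in the question. Either way, the Parent attribute in the gff3, not the number in the ID, is what links a transcript to its gene.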

best,

~b

Brian Haas

Mar 11, 2020, 6:56:56 AM

to Carol Buitrago, pasapipeline-users

Also, the PASA web portal connects to the PASA database, so you shouldn't need to upload anything more.