PASA not updating annotations...

Nick Schurch

unread,

Apr 21, 2015, 10:10:18 AM4/21/15

to pasapipel...@googlegroups.com

Hi,

I have a suite of strand-specific (dUTP) RNA-seq data in a model organism that I'd like to use to create an experiment-specific annotation. I used Bridger to build initial de-novo transcripts and then used PASA (Release r20140417) to merge these transcripts. Finally, I used PASA to update the existing annotation. Bridger appears to be doing a pretty good job of building transcripts. The first part of PASA seems to be doing a reasonable job too (although occasionally it seems to throw away things it shouldn't), but I'm having an issue with the last aspect.

My annotation gff loaded to the database just fine, but after running the annotation update I get no original annotation s being updates, but instead thousands and thousands of novel genes. A careful look at whats going on, via the web portal, is leaving me very confused. Here is an example:

Here, the first two diagrams are the existing gene annotation, the bottom three are the PASA merges of the bridger transcripts (which look good) and the diagram with the red bar (which means what, exactly?) is the PASA assembly combined with the annotation (IIUC). In this case the annotation update lists this as a failure and indicates that "Major update in gene structure required.no protein sequence available for comparison.". This doesn't seem right at all! All of these transcript models look completely comparable to me. I'd love to know what I'm doing wrong here...

Nick

Brian Haas

unread,

Apr 21, 2015, 11:16:59 AM4/21/15

to Nick Schurch, pasapipel...@googlegroups.com

Hi Nick,

It looks like the original annotations are being imported without coding regions - or the file format isn't being parsed as it should by pasa.

Does the annotation import meet the GFF3 gene formatting requirements?

http://www.sequenceontology.org/gff3.shtml

with gene, mRNA, exon, and CDS features, and the corresponding parent relationships?

best,

~b

--
You received this message because you are subscribed to the Google Groups "pasapipeline-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pasapipeline-us...@googlegroups.com.
To post to this group, send email to pasapipel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pasapipeline-users/f87a7cd4-7819-4244-b9be-86eee4ee3ebe%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

Nick Schurch

unread,

Apr 21, 2015, 11:36:33 AM4/21/15

to pasapipel...@googlegroups.com, nicks...@gmail.com

Well, thats a good question. I've tried two sets of annotations, one downloaded from NCBI and one from Ensembl. The short answer is yes, both meet the gff3 formatting requirements. But... both fail when pasa_gff3_validator.pl is run on them. In particular, amongst other things, not every entry in these files has an ID tag (the official gff3 format only requires an ID tag if a specific entry has child entries), which causes pasa_gff3_validator.pl to fail when run on them.

So I basically ignored pasa_gff3_validator.pl and just loaded the gff3 annotations with Load_Current_Gene_Annotations.dbi. Interestingly, this fails completely for the Ensembl gff3, but works for the NCBI annotation. I get output that says persistent objects are being loaded to the database and there are indeed objects loaded that look sensible - they appear to be alternative transcripts with the correct coordinates andt he correct parent gene name.

That said, I also get a lot of warnings of the form:

Error, no gene feature found for G01040.1,G01040.1-Protein.... ignoring feature

So, the slightly longer answer (I think) is that while the files do meet the correct gff3 specification, PASA appears to have a different idea of what it's expecting from a gff3 file and the files don't appear to meet that. Does this look like the cause of the problem I'm running into? Should I be worried? How do I fix it?

Thanks in advance for your assistance...

Nick

Brian Haas

unread,

Apr 21, 2015, 3:30:57 PM4/21/15

to Nick Schurch, pasapipel...@googlegroups.com

Hi Nick,

The pasa gene validation script needs to be updated. Also, I should make sure that PASA is compatible with both of your gff3 files. Can you send me URLs to the gff3 files you're using? I'll aim to resolve this.

best,

~brian

Nick Schurch

unread,

Apr 21, 2015, 5:49:01 PM4/21/15

to Brian Haas, pasapipel...@googlegroups.com

Sure, this is they:

ftp://ftp.ensemblgenomes.org/pub/plants/release-26/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.26.gff3.gz

ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff

--

Cheers,

Nick Schurch

Brian Haas

unread,

Apr 21, 2015, 8:02:27 PM4/21/15

to Nick Schurch, pasapipel...@googlegroups.com

Hi Nick,

From what I could tell, the TAIR10_GFF3_genes.gff should have loaded ok with the genome 'TAIR10_chr_all.fas'.

In the image you provided, the top 2 entries have names 'GO104.2' and '.GO104.1'. Where did these identifiers come from?

Perhaps try reloading the TAIR10_GFF3_genes.gff file and rerunning the annotation comparison step.

~brian

Nick Schurch

unread,

Apr 22, 2015, 4:13:52 AM4/22/15

to Brian Haas, pasapipeline-users

Hi Brian, The image was one I prepared for a discussion with some people where we did not want the genes identified clearly, so I 'shopped the 'AT1' part of the ids off - I just reused that image here cos I'm lazy ;) The full IDS are there in the original database so I don't think this is the problem.

I actually use the ensembl genome fasta for all my analysis and really would prefer to use the much more complete ensemble TAIR10 gff file too, but that wasn't working so I tried the NCBI file. One complication is that the ensembl genome uses chromosome identifiers without the 'Chr' part, so I have a munged version of the NCBI gff that has the Chr part of the first column removed. I've attached it FYI. Other than that it's identical (I think!). This is the file that does appear to load, but gives the warnings above and produces the weird results shown in the image.

Nick

On 22 April 2015 at 01:02, Brian Haas <bh...@broadinstitute.org> wrote:

Hi Nick,

From what I could tell, the TAIR10_GFF3_genes.gff should have loaded ok with the genome 'TAIR10_chr_all.fas'.

In the image you provided, the top 2 entries have names 'GO104.2' and '.GO104.1'. Where did these identifiers come from?

Perhaps try reloading the TAIR10_GFF3_genes.gff file and rerunning the annotation comparison step.

~brian

On Tue, Apr 21, 2015 at 5:48 PM, Nick Schurch <nicks...@gmail.com> wrote:
Sure, this is they:

ftp://ftp.ensemblgenomes.org/pub/plants/release-26/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.26.gff3.gz
ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff

On 21 April 2015 at 20:30, Brian Haas <bh...@broadinstitute.org> wrote:
Hi Nick,

The pasa gene validation script needs to be updated. Also, I should make sure that PASA is compatible with both of your gff3 files. Can you send me URLs to the gff3 files you're using? I'll aim to resolve this.

best,

~brian

--
Cheers,

Nick Schurch

--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

--

Cheers,

Nick Schurch

TAIR10_GFF3_genes_noChr.gff.gz

Brian Haas

unread,

Apr 22, 2015, 10:08:26 AM4/22/15

to Nick Schurch, pasapipeline-users

I see. I'll track this down shortly and get back to you.

~brian

Brian Haas

unread,

Apr 22, 2015, 10:37:37 AM4/22/15

to Nick Schurch, pasapipeline-users

Hi Nick,

There could be a couple of different issues here, and I'm not sure which it is exactly.

First, the genome that I'm using has accessions 'Chr1, Chr2, ...', and the gff3 file didn't have the 'Chr' part, instead just '1, 2, ...'. When I added Chr as a prefix to the chromsome names in your gff3 file, it worked fine.

The other issue has to do with the Ensembl gff3 formatting (from the file you pointed me to earlier). Here, instead of 'mRNA', the term 'transcript' was being used. Only 'mRNA' was being parsed by the parser in PASA. I've updated the corresponding module that you can drop in as a replacement in the PASA/PerlLib folder, and it'll now treat 'transcript' as equivalent to 'mRNA', and it'll then parse the file properly if you choose to use that. Just be sure the genome fasta accessions match up with the chromosome fields in the gff3.

Hopefully something works here. ;)

Note - PASA is/has-been entirely focused on coding region annotations, and so won't be parsing and analyzing any ncRNA features in the original genome annotation file. This is just a legacy issue and PASA needs some updates to better handle ncRNAs (future work).

best,

~brian

GFF3_utils.pm

Nick Schurch

unread,

Apr 23, 2015, 4:52:17 AM4/23/15

to Brian Haas, pasapipeline-users

Thanks for updating the gff utilities Brian, I'll give them a go. Do I need to rerun pasa completely, or just the annotation step?

As I said, originally I used the ensembl genome and gff and these don't have the 'Chr' on the start of the chromosome ids - that's why I adjusted the gff file from NCBI to reflect this. I'll give it a try again with the ensembl genome and gff and we'll see if that works.

On the ncRNA issue, surely the first stage of PASA - the merging of compatable transcript models - will work just fine for any and all the transcript models from the transcriptome assembly? I can see how that wouldn't be the case for the annotation update stage though. Any idea on when/if PASA will be updated to increase support for ncRNAs?

The other thing I'd like to get an opinion on is how you thing ribo-minus data, as opposed to polyA-pulldown data will effect its performance. I have the former because we wanted to see ncRNAs - but it also means we see a considerable minority of gene expression as pre-mRNAs with incomplete splicing. As such, we get quite a few transcript models that cover annotated introns. Do you anticipate that these will confuse PASA at either stage? How wary should I be of the merging process on these?

Sorry for the quick-fire raft of questions, just trying to understand ;)

Nick

--

Cheers,

Nick Schurch

Nick Schurch

unread,

Aug 17, 2015, 6:59:51 AM8/17/15

to pasapipeline-users, bh...@broadinstitute.org

just poking this back up to the top. It'd be great to get peoples thoughts on these questions...

Brian Haas

unread,

Aug 17, 2015, 7:28:19 AM8/17/15

to Nick Schurch, pasapipeline-users

Hi Nick

Responses below

-Brian

(by iPhone)

On Aug 17, 2015, at 6:59 AM, Nick Schurch <nicks...@gmail.com> wrote:

just poking this back up to the top. It'd be great to get peoples thoughts on these questions...

On Thursday, 23 April 2015 09:52:17 UTC+1, Nick Schurch wrote:
Thanks for updating the gff utilities Brian, I'll give them a go. Do I need to rerun pasa completely, or just the annotation step?

Just the annotation step, I think. The key here is that the annotation file contig identifiers must match the genome fasta file exactly.

As I said, originally I used the ensembl genome and gff and these don't have the 'Chr' on the start of the chromosome ids - that's why I adjusted the gff file from NCBI to reflect this. I'll give it a try again with the ensembl genome and gff and we'll see if that works.

On the ncRNA issue, surely the first stage of PASA - the merging of compatable transcript models - will work just fine for any and all the transcript models from the transcriptome assembly? I can see how that wouldn't be the case for the annotation update stage though. Any idea on when/if PASA will be updated to increase support for ncRNAs?

That's right. The initial pasa assembly doesn't care what kind of transcript it is. It's just the annotation section that's coding centric right now.

I'm not sure when pasa will be updated to better handle ncrna annotation. I've got a few other priorities right now and no one else is working on it. Probably next year sometime.

The other thing I'd like to get an opinion on is how you thing ribo-minus data, as opposed to polyA-pulldown data will effect its performance. I have the former because we wanted to see ncRNAs - but it also means we see a considerable minority of gene expression as pre-mRNAs with incomplete splicing. As such, we get quite a few transcript models that cover annotated introns. Do you anticipate that these will confuse PASA at either stage? How wary should I be of the merging process on these?

Garbage in, garbage out - so to speak. Use poly A for coding annotations. Use total RNA (rubominus) for ncrna exploration, but I wouldn't run it through the annotation updater. You'll enrich for all sorts of unprocessed introns, etc.

Nick Schurch

unread,

Aug 17, 2015, 8:15:08 AM8/17/15

to Brian Haas, pasapipeline-users

Hi Brian,

Thanks for getting back to me on these point - that's great. I was certainly getting some pretty odd results from pasa from the ribo-minus data, so I'll probably skip the annotation update step at least.

Feel free to let me know if you need letters of support or anything for grants, BTW - we've used PASA quite a bit over the last year in the group here in Dundee and we think its a really good, useful, tool and I'd be happy to lend my voice in its support if it'll help.

Nick

--

Cheers,

Nick Schurch

Brian Haas

unread,

Aug 17, 2015, 9:11:27 AM8/17/15

to Nick Schurch, pasapipeline-users

Thanks! great to know. We should look into getting funding support for PASA, as I'm clearly not doing a fantastic job at keeping it maintained in my spare time. ;)