Some basic statistics about transcriptome


Krešimir Križanović

Jan 31, 2023, 7:49:14 AM
to pasapipeline-users
Hello,

I've used the PASA Docker image (SQLite) to generate a transcriptome database (using Trinity for the initial transcriptome).

Is there a way to automatically obtain some basic statistics about the transcriptome, such as:
- number of genes
- median gene length
- median number of exons 
...

Or do I have to calculate it from the GFF output myself?

Krešimir Križanović

Brian Haas

Jan 31, 2023, 10:28:51 AM
to Krešimir Križanović, pasapipeline-users
Hi,

I don't think there's anything in the current release.  If you don't find another tool that does this, send me your wish list and I'll see what I can do.  I probably have code lying around that does most of it somewhere and I can aim to integrate it.

best,

~b

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Krešimir Križanović

Feb 9, 2023, 9:44:58 AM
to pasapipeline-users
Sorry for taking this long to respond.

First, thanks for the answer.

Second, there's no need for you to implement anything; I'll write a Python script to calculate what I need.

Which file should I look at? Are these the right ones: "sample_mydb_pasa.sqlite.pasa_assemblies" (bed, gff and gtf should all have the same data in different formats)?

One more question: can I use PASA to perform homology-based prediction, i.e. download transcriptomes for similar species and then use them to predict genes on my genome?

Brian Haas

Feb 9, 2023, 9:58:15 AM
to Krešimir Križanović, pasapipeline-users

Sorry - got buried in my inbox!  Please find my responses below


> Which file should I look at? Are these the right ones: "sample_mydb_pasa.sqlite.pasa_assemblies" (bed, gff and gtf should all have the same data in different formats)?


Yes, the various extensions should correspond to identical data, just in different formats.
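
Since the statistics have to be computed from these files by hand, here is a minimal Python sketch of one way to do it from the GFF version. It is not part of PASA. It assumes that each row is one aligned segment (exon) and that the attribute column carries a Target=<assembly id> ... entry naming the assembly the segment belongs to; check that against your own file and change group_key if your file differs. The function names are made up for this example. The numbers are per assembly (transcript structure), not per gene, so a gene count would need an extra step that groups assemblies into loci.

```python
import statistics
from collections import defaultdict


def parse_attributes(field):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for part in field.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def assembly_stats(lines, group_key="Target"):
    """Summarise genomic span and segment count per assembly.

    Assumption: every row is one segment, and the attribute named by
    group_key identifies its assembly (first whitespace-separated token).
    """
    segments = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF row
        attrs = parse_attributes(cols[8])
        if group_key not in attrs:
            continue
        asm_id = attrs[group_key].split()[0]
        # Key on (sequence, assembly id) so IDs are never merged across contigs.
        segments[(cols[0], asm_id)].append((int(cols[3]), int(cols[4])))

    if not segments:
        return {"assemblies": 0, "median_span": None, "median_segments": None}

    # Span = first segment start to last segment end, introns included.
    spans = [
        max(end for _, end in segs) - min(start for start, _ in segs) + 1
        for segs in segments.values()
    ]
    counts = [len(segs) for segs in segments.values()]
    return {
        "assemblies": len(segments),
        "median_span": statistics.median(spans),
        "median_segments": statistics.median(counts),
    }
```

To use it, pass an open file handle, e.g. with open(path) as fh: print(assembly_stats(fh)), where path points at the GFF file of the pasa_assemblies set.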
 
> One more question: can I use PASA to perform homology-based prediction, i.e. download transcriptomes for similar species and then use them to predict genes on my genome?

For homology-based genome annotation, PASA can tie into EVM:  https://github.com/EVidenceModeler/EVidenceModeler/wiki

PASA only uses same-species alignments (they need to be near perfect), but EVM can accept a variety of alignment types including from other species.

best,

~b
 

Krešimir Križanović

Feb 12, 2023, 11:23:35 AM
to pasapipeline-users
Thanks for the advice, I've been trying to follow it for the last few days :).

I'm having trouble mapping protein sequences from an appropriate clade (Asparagales) to my genome assembly, to produce a GFF file with alignments.

I found papers using GeneWise and TBLASTN for mapping, but neither tool works for me. With TBLASTN I get a segmentation fault. The GeneWise web server does not accept files as large as mine, and the Wise2 package also returns some sort of memory allocation error. This should not happen, since my server has 1 TB of RAM. I've split my assembly into chromosomes and am now running it on the smallest chromosome. I'll see how that goes.

In the meantime, could you suggest some other tools for this?

K.K.
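
The per-chromosome split mentioned above can be sketched in a few lines of Python. This is only an illustration, not the script actually used in this thread. It assumes a plain (uncompressed) FASTA file whose sequence IDs are safe to use as file names; the function names and the .fa suffix are arbitrary choices.

```python
import os


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(path, out_dir, width=60):
    """Write each record to out_dir/<sequence id>.fa and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(path) as handle:
        for header, seq in read_fasta(handle):
            words = header.split()
            # Assumption: the first word of the header is a usable file name.
            name = words[0] if words else "unnamed"
            out_path = os.path.join(out_dir, name + ".fa")
            with open(out_path, "w") as out:
                out.write(">" + header + "\n")
                # Re-wrap the sequence at a fixed line width.
                for i in range(0, len(seq), width):
                    out.write(seq[i:i + width] + "\n")
            written.append(out_path)
    return written
```

Each output file can then be passed to the aligner on its own, starting with the smallest one.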

Brian Haas

Feb 12, 2023, 12:24:15 PM
to Krešimir Križanović, pasapipeline-users
I find a lot of people using GeMoMa these days. I don't have experience with it myself (yet), but it could be promising:
http://www.jstacs.de/index.php/GeMoMa

I haven't done much of this work in a while now. We used to rely on GeneWise 10-15 yrs ago.

Back in the earlier days (15-20 yrs ago), we used this:
https://github.com/brianjohnhaas/AAT_XHuang_aligner
which still works incredibly well, but is very rigorous and comparatively slow.

You can get along pretty far with just RNA-seq data, assuming you've got expression from a diverse enough set of conditions that you'll get evidence of most genes being expressed. Still, it definitely helps to have the protein alignments, not only for visualization but also for exploring predicted genes that don't seem to be expressed based on the RNA-seq data you have.

hope this helps,

~b

Krešimir Križanović

Feb 13, 2023, 7:08:36 AM
to pasapipeline-users
Thanks for the info.

I just found this tool, published this year: https://academic.oup.com/bioinformatics/article/39/1/btad014/6989621.

Going by the authors' track record, it should work really well.

I'll try it out and report the results.

K.K.

Brian Haas

Feb 13, 2023, 7:56:02 AM
to Krešimir Križanović, pasapipeline-users
Yes, that looks *very* promising!


