Multiple Trinity 'genes' have the same Blast annotation

Neeraja Balasubrahmaniam

Jul 31, 2023, 7:17:21 AM
to trinityrnaseq-users
Hi Brian,

I am performing downstream analysis after de novo assembly and DE using Trinity's pipeline on mixed-community samples under different environmental conditions.
I ran the DE analysis at both the isoform and gene levels, as well as GO enrichment on the DEGs, and everything worked great.

My issue concerns repeated Blast annotations across different Trinity 'genes'. I believe the Trinity isoform level is collapsed to the Trinity 'gene' level using tximport.

My 2 questions are:
1. Why do I see repeated Blast annotations at the 'gene' level? The same happens for KEGG, but I am not focussing on KEGG as much. Could it be due to multiple genes coding for the same protein, or is it something pipeline-related? For example:

Trinity gene            BlastX      KEGG                  KO
TRINITY_DN17019_c1_g1   1433_CANAL  cal:CAALFM_C103220CA  K06630
TRINITY_DN10270_c0_g1   2AAA_SCHPO  spo:SPAP8A3.09c       K03456
TRINITY_DN16382_c0_g1   2AAA_SCHPO  spo:SPAP8A3.09c       K03456
TRINITY_DN38232_c0_g1   2AAA_SCHPO  spo:SPAP8A3.09c       K03456
TRINITY_DN12873_c0_g1   2ABA_SCHPO  spo:SPAC227.07c       K04354
TRINITY_DN1626_c0_g1    2ABA_SCHPO  spo:SPAC227.07c       K04354
TRINITY_DN31029_c0_g1   2ABA_SCHPO  spo:SPAC227.07c       K04354
TRINITY_DN49340_c0_g1   2NPD_NEUCR  ncr:NCU03949          K00459
TRINITY_DN11466_c0_g1   6PGD_CANAX  .                     .
TRINITY_DN112784_c0_g2  6PGD_EMENI  .                     .
TRINITY_DN1860_c0_g1    6PGD_EMENI  .                     .
TRINITY_DN1860_c0_g2    6PGD_EMENI  .                     .
TRINITY_DN57_c0_g3      6PGD_EMENI  .                     .
TRINITY_DN57_c3_g1      6PGD_EMENI  .                     .

2. I would like to combine the 'gene'-level expression of Trinity genes that share the same Blast annotation into one value. Is it valid to do this? If so, how do I do it, and what could the drawbacks be?
For example, my idea was to use the same method that Trinity uses to collapse the isoform level to the gene level (tximport), and apply it again to go from the 'gene' level to a protein-annotated gene-cluster level. My end goal is to find gene targets for nucleotide and protein assay design, and I would like them to be biologically interpretable using database protein/gene annotations.

Any help/pointers would be greatly appreciated!

Thanks,
Neeraja


Brian Haas

Aug 2, 2023, 8:45:22 PM
to Neeraja Balasubrahmaniam, trinityrnaseq-users
Hi,

There are several reasons why you might have multiple Trinity 'genes'
having the same top blast matches. The first is biological: they
represent paralogs. Other reasons are more technical: the Trinity
'genes' may be partial and represent different non-overlapping parts
of the same gene that ended up in a fragmented assembly due to
insufficient read coverage or algorithmic complications.
Looking at the regions of sequence homology along the target best
match could give some clues here.

If it turns out that they're paralogs, you might want to keep them
separate instead of collapsing. If they're 'parts' of the same gene,
then collapsing could be better justified.
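One way to follow up on the "regions of homology along the target" hint is to compare where each Trinity gene aligns on the shared best-match protein. The sketch below is only an illustration: the hit coordinates are hypothetical (they are not from the thread), and `classify_hits` and the `min_overlap` threshold are made-up names and values. Non-overlapping hit ranges are consistent with fragments of one gene; overlapping ranges would need a sequence alignment to tell paralogs from redundant contigs.

```python
from itertools import combinations


def overlap_length(a, b):
    """Number of positions shared by two inclusive (start, end) ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def classify_hits(hit_ranges, min_overlap=10):
    """Label each pair of Trinity genes hitting the same target protein.

    hit_ranges: dict of Trinity gene id -> (hit_start, hit_end) on the target.
    min_overlap: hypothetical threshold (in residues) for calling an overlap.
    """
    calls = {}
    for (gene_a, range_a), (gene_b, range_b) in combinations(sorted(hit_ranges.items()), 2):
        shared = overlap_length(range_a, range_b)
        calls[(gene_a, gene_b)] = "overlapping" if shared >= min_overlap else "non-overlapping"
    return calls


# Gene ids are the three 2AAA_SCHPO genes from the table above;
# the coordinates are invented purely for illustration.
example = {
    "TRINITY_DN10270_c0_g1": (1, 150),
    "TRINITY_DN16382_c0_g1": (160, 400),
    "TRINITY_DN38232_c0_g1": (120, 300),
}

for pair, label in classify_hits(example).items():
    print(pair, label)
```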

hope this helps,

~b
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

Neeraja Balasubrahmaniam

Aug 5, 2023, 7:31:38 AM
to trinityrnaseq-users
Hi Brian,

This was really helpful, thank you! How would one look at the regions of sequence homology along the best match? Are there other tools for this? By 'target best match', I assume you mean the top Blast hit. I had thought that, since I used Trinotate, which annotates with Blast+ against Swissprot, the annotation already relies on a homology-based search. I'm not sure if that's correct, though.

Best,
Neeraja

Brian Haas

Aug 5, 2023, 7:39:49 AM
to Neeraja Balasubrahmaniam, trinityrnaseq-users
Hi,

You can probably get what you want from the Trinotate report. The
blast result is provided like so:

FMP45_YEAST^FMP45_YEAST^Q:3-197,H:28-92^64.6%ID^E:2.46e-26^RecName:
Full=SUR7 family protein FMP45;^Eukaryota; Fungi; Dikarya; Ascomycota;
Saccharomycotina; Saccharomycetes; Saccharomycetales;
Saccharomycetaceae; Saccharomyces

and the Q:3-197,H:28-92 part indicates the sequence range of the match
for the query Q (your sequence) and the hit H (the database match).
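The coordinate field can be pulled out with a short parser. This is a minimal sketch that assumes a single hit entry in the layout shown above; `parse_blast_hit` is a hypothetical helper, not part of Trinotate. Note that in the example the query range spans 195 positions and the hit range 65, which is what you would expect if the query is in nucleotides and the hit in amino acids (195 / 3 = 65), as for a BlastX search.

```python
import re

# Matches the "Q:<start>-<end>,H:<start>-<end>" part of the hit field.
_RANGE_RE = re.compile(r"Q:(\d+)-(\d+),H:(\d+)-(\d+)")
# Matches the percent identity, e.g. "64.6%ID".
_IDENT_RE = re.compile(r"([\d.]+)%ID")


def parse_blast_hit(field):
    """Extract hit name, coordinate ranges and percent identity from one hit entry."""
    ranges = _RANGE_RE.search(field)
    if ranges is None:
        raise ValueError("no Q:/H: coordinate block found")
    q_start, q_end, h_start, h_end = (int(x) for x in ranges.groups())
    ident = _IDENT_RE.search(field)
    return {
        "hit": field.split("^")[0],
        "query_range": (q_start, q_end),
        "hit_range": (h_start, h_end),
        "percent_identity": float(ident.group(1)) if ident else None,
    }


# The example entry from the message above (description shortened).
entry = "FMP45_YEAST^FMP45_YEAST^Q:3-197,H:28-92^64.6%ID^E:2.46e-26^RecName: Full=SUR7 family protein FMP45;"
parsed = parse_blast_hit(entry)
print(parsed)
```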

best,

~b


Neeraja Balasubrahmaniam

Aug 6, 2023, 9:05:17 AM
to trinityrnaseq-users
Hi Brian,

So, from your example, the database match is part of the query sequence, if I am right? Since it shows sequence overlap. I know paralogs sometimes have different functions even with sequence similarity, is that why it doesn't make sense to combine them?

Thanks for these pointers!! And apologies if I am picking your brain for this, haha

Best,
Neeraja

Brian Haas

Aug 6, 2023, 9:09:08 AM
to Neeraja Balasubrahmaniam, trinityrnaseq-users
If you want more convincing evidence about them being paralogs, just
pull out the protein sequences for each and align them. If the
proteins show some level of sequence divergence, that would be good
evidence for them to be from distinct / separate genes. If they're
near identical, then pull out the transcript sequences for them and
align them together. It would be curious for Trinity to assign
separate genes to transcripts that should have been grouped into the
same read cluster.
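To make the "align them and look at divergence" step concrete, here is a toy global alignment (Needleman-Wunsch) that reports percent identity between two sequences. It is a sketch under stated assumptions: the scoring values are arbitrary, the sequences are invented, and for real data a dedicated aligner with a proper substitution matrix would be the better choice.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a simple linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical columns as a percentage of the alignment length."""
    aln_a, aln_b = global_align(a, b)
    same = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return 100.0 * same / len(aln_a)


# Invented peptide fragments, for illustration only.
print(percent_identity("MKTAYIAK", "MKTAYLAK"))
```

Near-100% identity between the proteins would point to the transcript-level comparison described above; clear divergence would support treating them as separate genes.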


Neeraja Balasubrahmaniam

Aug 6, 2023, 7:41:01 PM
to trinityrnaseq-users
Hi Brian,

Makes sense, thanks so much! My last question about this would be: instead of collapsing the multiple Trinity genes with the same BlastX annotation, is there a better way to report such Trinity genes in a summarized way, using the log2FC and p-value attached to each DE gene? Say there are 10 Trinity genes with the same Blast annotation, each with its own log2FC and p-value: how could I use a summarized value representing all of them?
For the downstream application, it would be helpful for our study to be able to say what those upregulated Trinity genes at that condition represent in terms of protein function, which is why I had the previous questions too.
I don't think it makes sense to report an average log2FC or similar.


Thanks so much again!
Neeraja

Brian Haas

Aug 6, 2023, 7:45:00 PM
to Neeraja Balasubrahmaniam, trinityrnaseq-users
Hi,

I like to annotate the gene identifiers along with their Trinotate
results - e.g. see bottom of page here:
https://github.com/trinityrnaseq/trinityrnaseq/wiki/Functional-Annotation-of-Transcripts

In your case, you might want to just simply tack on the blast hit
symbol, but there's much more you could include as shown in the above.
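As a rough illustration of "tacking on" the blast hit symbol, the sketch below appends the hit name to each Trinity gene id in a DE results table and then lists the genes that share a hit, each keeping its own log2FC rather than being averaged. The `^` separator, the function names, the log2FC values, and the `TRINITY_DN99999_c0_g1` id are all assumptions made for the example; only the three 2ABA_SCHPO gene ids and their hit come from the thread.

```python
from collections import defaultdict


def annotate_gene_ids(de_rows, blast_hits, sep="^"):
    """Return copies of the DE rows with '<gene><sep><hit>' as the gene label.

    Genes with no hit (missing, or '.' as in the table above) keep their plain id.
    """
    annotated = []
    for row in de_rows:
        hit = blast_hits.get(row["gene"])
        label = row["gene"] if hit in (None, ".") else f'{row["gene"]}{sep}{hit}'
        annotated.append({**row, "gene": label})
    return annotated


def group_by_hit(de_rows, blast_hits):
    """Map each blast hit to the (gene, log2FC) pairs that carry it."""
    groups = defaultdict(list)
    for row in de_rows:
        hit = blast_hits.get(row["gene"])
        if hit not in (None, "."):
            groups[hit].append((row["gene"], row["log2FC"]))
    return dict(groups)


blast_hits = {
    "TRINITY_DN12873_c0_g1": "2ABA_SCHPO",
    "TRINITY_DN1626_c0_g1": "2ABA_SCHPO",
    "TRINITY_DN31029_c0_g1": "2ABA_SCHPO",
}

# log2FC values are invented; the last gene id is hypothetical and has no hit.
de_rows = [
    {"gene": "TRINITY_DN12873_c0_g1", "log2FC": 2.1},
    {"gene": "TRINITY_DN1626_c0_g1", "log2FC": 1.4},
    {"gene": "TRINITY_DN31029_c0_g1", "log2FC": -0.8},
    {"gene": "TRINITY_DN99999_c0_g1", "log2FC": 3.0},
]

annotated = annotate_gene_ids(de_rows, blast_hits)
groups = group_by_hit(de_rows, blast_hits)
for row in annotated:
    print(row["gene"], row["log2FC"])
```

Keeping one row per Trinity gene means genes with the same annotation but opposite directions of change stay visible instead of cancelling out.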

best,

~b


Neeraja Balasubrahmaniam

Aug 7, 2023, 10:34:03 AM
to trinityrnaseq-users
This is probably what would work best, thanks so much for all the help and for being patient!

Best,
Neeraja
