Collapse genes based upon trinnotate annotation

34 views
Skip to first unread message

David Bradshaw

unread,
Nov 17, 2023, 8:03:21 AM11/17/23
to trinityrnaseq-users
Dear Trinity/Trinotate creators,

Thank you very much for such great tools!

I am running into an issue that was partially addressed in a similar question (https://groups.google.com/g/trinityrnaseq-users/c/DJ3LfDCW0g4/m/ZkVpiGc7BQAJ). 

I similarly had genes annotated with the same blastx gene identity which is complicating interpretation of downstream DESeq2 analysis since either the same gene name shows up multiple times or on opposite sides of a comparison. 

I had two main questions representing two ways of addressing this. 

1) If I were to continue with this analysis as is, I need to explain why those genes/transcripts were not clustered together. Was there just not enough evidence during assembly (I used de novo Trinity assembly on pooled samples since I have a non-model organism) to combine them? I am leaning towards this since I trust the tools, but unsure how to best explain to others (i.e. reviewers) honesty. 

2) If I decide to move forward with combining genes by annotation, is there a script in trinity/trinotate that could help with that or another tool? In a previous question Dr. Haas put forth PASA as a potential solution to a similar issue (https://groups.google.com/g/trinityrnaseq-users/c/IXFYk9qc5aw/m/WThVjncNAAAJ) but I am unsure if that is the best route to explore. Per the previous questions, I understand that paralogs complicate matters. 

Please let me know if you need any other information.

Thank you very much for your time and help. 

Sincerely, 

David


Brian Haas

unread,
Nov 17, 2023, 8:26:38 AM11/17/23
to David Bradshaw, trinityrnaseq-users
Hi David,

To see if they're likely part of the same gene and just split contigs, try aligning the two transcripts together. I they mostly align and show a lot of sequence divergence (more than a couple percent), then they're probably paralogs.  If they don't align, then they could be part of the same gene (or not.... but you won't know unless you have other info to tie them together like a genome to align to or paired-ends that connect them).   If the Trinity 'gene' cluster info (from the Trinity 'gene id' portion of the identifier) is the same, then they were at least grouped together by Trinity into the same 'gene' cluster based on sequence info (like read pairing). 

Hope this helps,

B

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/trinityrnaseq-users/803da11f-1ef0-456d-a381-adac05097d48n%40googlegroups.com.


--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

David Bradshaw

unread,
Nov 20, 2023, 9:52:27 AM11/20/23
to trinityrnaseq-users
Dear Dr. Haas, 

Thank you very much. I will apply your suggestion on a small set of DEGs annotated the same and report back. I greatly appreciate the help. 

Sincerely, 

David

David Bradshaw

unread,
Dec 6, 2023, 1:27:14 PM12/6/23
to trinityrnaseq-users
Dear Dr. Hass, 

Sorry for the delay, this is a dataset I am working on for my old position. I focused on two DEGs identified as COX1 by blastx (one has a single transcript while the other has 8 transcripts clustered together)  that were on opposite sides (one upregulated and one downregulated) of the DEG analysis and determined via BLAST that there was no significant overlap (evalue < 10[default]) between them although blasting them against the same uniprot_sprot protein lead to all of them having significant overlap to that gene. Blast outfmt tables are attached. Those results seem to support the assertion that they are different parts of the same gene instead of paralogs correct? 

I am unsure how to proceed in explaining why what appear to be different parts of the same gene are differentially expressed on opposite sides of the analysis. Any help would be appreciated, I have not seen this yet in any manuscripts I have looked at, but may not be looking in the correct places. 

On another note, and I apologize if this is too off-topic I was doing it to attempt to address the issue, per your FAQ to reduce genes (thank you for such wonderful documentation btw) I also tried using Corset and cd-hit to see if they would cluster with different algorithms to no avail. Interestingly, cd-hit leads to more clusters/genes than trinity. Corset also had more clusters/genes when compared to a filtered trinity assembly i.e. removed the transcripts that corset filters by default if a transcript had <10 sequences of support. Is this a legitimate filter to apply to a Trinity assembly? Removing the reads that Corset identified as having potentially too little sequential support after Trinity has already clustered transcripts into genes? It improved my N50 and E90N50 values with little change in the bowtie2 overall alignment and busco scores. 

Thank you very much for your time and help and once again sorry for the delay. Please let me know if you need more information to help.

Sincerely, 

David

COX1_DEG_transcripts_vs_themselves_table.txt
COX1_DEG_transcripts_vs_COX1_PANBU_table.txt

Brian Haas

unread,
Dec 6, 2023, 1:41:00 PM12/6/23
to David Bradshaw, trinityrnaseq-users
Probably paralogs, as the c3 transcript isn't aligning to any of the c0 ones, and they all match the same range of the target COX1 protein from what I gather.

For reducing the number of transcripts based on read support, I'd use the expression value-based filtering combined with any functional annotation info.  For example, if you have a transcript with little read support and it has no high quality matches to known proteins, you could consider excluding them.  There's no perfect way to do this. I do find this analysis useful as a guide:  https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Transcript-Quantification#counting-numbers-of-expressed-transcripts-or-genes

hope this helps,

Brian

David Bradshaw

unread,
Dec 8, 2023, 7:55:49 AM12/8/23
to trinityrnaseq-users
Dear Dr. Hass,

Thank you for your help. I was unsure if there was enough overlap on the protein/subject for them to be paralogs, thanks for the clarification. So because they overlap in the same region of a protein but not with each other, they are likely paralogs because that shows the sequence divergence you referred to earlier? 

Honestly, I would prefer to do what you suggest in the part of the guide you referenced and keep the entire set of assembled transcripts instead of filtering. So if I can explain the difference in differential expression (one up and one downregulated) by saying they are likely paralogs, then that makes me feel less of a need to filter any lowly expressed sequences since paralogs can respond differently to the same stimulus from my understanding. 

My instinct to filter or recluster was initially driven by the concern that I would have to check all of the gene-level annotations and then reanalysis those that have the same blastx top hit to define them as paralogs or different parts of the same genes. Was honestly hoping reclustering with a different tool or filtering would address the issue by removing one of the genes/transcripts. Have you found that it is generally accepted to only do that for genes that pop out as significant in downstream analysis like these COX1 paralogs did in DEG analysis instead of checking on the entire gene-level transcriptome? I've read some manuscripts that did not even annotate a gene unless it's a DEG, but have not come across one with genes with the same annotation having opposite responses in DEG analysis. I also understand if you are not comfortable answering such a question though. 

Thanks for your time and help. I greatly appreciate you allowing me to pick your brain on these issues.

Sincerely, 

David

Brian Haas

unread,
Dec 8, 2023, 8:09:56 AM12/8/23
to David Bradshaw, trinityrnaseq-users
Hi David,

responses below

On Fri, Dec 8, 2023 at 7:55 AM 'David Bradshaw' via trinityrnaseq-users <trinityrn...@googlegroups.com> wrote:
Dear Dr. Hass,

Thank you for your help. I was unsure if there was enough overlap on the protein/subject for them to be paralogs, thanks for the clarification. So because they overlap in the same region of a protein but not with each other, they are likely paralogs because that shows the sequence divergence you referred to earlier? 

That's right.  It could also just be the case that they share domains in common and aren't exactly paralogs.  If you have full-length protein matches and complete orfs from the transcript data, then it'll be easier to figure out the relationships based on sequence alignments.
 

Honestly, I would prefer to do what you suggest in the part of the guide you referenced and keep the entire set of assembled transcripts instead of filtering. So if I can explain the difference in differential expression (one up and one downregulated) by saying they are likely paralogs, then that makes me feel less of a need to filter any lowly expressed sequences since paralogs can respond differently to the same stimulus from my understanding. 


The differential expression analysis automatically does some filtering on the data anyway based on the read support, as those transcripts with few reads won't have enough power for detecting them as differentially expressed.

Yes, paralogs can respond very differently to stimuli, so would be a sensible explanation here if this is the case.

 
My instinct to filter or recluster was initially driven by the concern that I would have to check all of the gene-level annotations and then reanalysis those that have the same blastx top hit to define them as paralogs or different parts of the same genes. Was honestly hoping reclustering with a different tool or filtering would address the issue by removing one of the genes/transcripts. Have you found that it is generally accepted to only do that for genes that pop out as significant in downstream analysis like these COX1 paralogs did in DEG analysis instead of checking on the entire gene-level transcriptome? I've read some manuscripts that did not even annotate a gene unless it's a DEG, but have not come across one with genes with the same annotation having opposite responses in DEG analysis. I also understand if you are not comfortable answering such a question though. 


I'd only spend time doing deep dives into those transcripts that are found as biologically interesting or part of a story that you're putting together.

best of luck!

Brian


 

David Bradshaw

unread,
Dec 15, 2023, 7:36:08 AM12/15/23
to trinityrnaseq-users
Hi Dr. Hass,

Thank you very much for all of your answers, I think I have a path forward then for analysis. Looking into more detail regarding the DEG-related transcripts that are annotated the same to determine if they are paralogs or parts of the same gene by comparing them to the UniProt reference and to each other. With the caveat that they could share similar domains but not be paralogs. 

It is good to know that some filtering for read support happens in differential expression analysis as part of their normal algorithm. 

Thank you very much for your time and help. I greatly appreciate your thoughts and going back and forth with me on these topics. I would consider this line of questioning resolved. 

Sincerely, 

David

Brian Haas

unread,
Dec 17, 2023, 9:23:31 AM12/17/23
to David Bradshaw, trinityrnaseq-users
Sure thing.  Best of luck!

Reply all
Reply to author
Forward
0 new messages