Duplicate genes with same gene symbol annotation


Jason Toy

May 24, 2021, 1:09:48 AM
to trinityrnaseq-users
Hi Brian,

I ran into a bit of a theoretical hiccup in my analysis and am hoping you can guide me on the best practice here.

I am currently trying to run enrichment analysis with GOseq, and I am realizing that my annotated gene-level expression matrix has a lot of duplicate genes (I annotated it through Trinotate using blastx against SwissProt). Most of the time these duplicates are sequential, too. E.g., gene IDs XLOC_000044 through XLOC_000049 all annotate as gene symbol ATRN1 (Attractin Like 1). The transcriptome assembly was done with a reference through TopHat/Cufflinks.

This presents an immediate problem because GOseq won't allow duplicate gene names, but it also poses a bigger question: "should these all be grouped together since this is a gene-level analysis?" Looking at a number of duplicate cases, the duplicate genes tend to show similar DE patterns. In other words, should I merge the rows for each of these into one row per gene symbol? And if so, at what point during the analysis (on the raw counts, or after TMM normalization)?

Thanks in advance!
Jason

--
Jason A. Toy
PhD Candidate
Dept. of Ecology and Evolutionary Biology
University of California, Santa Cruz
    


Brian Haas

May 24, 2021, 9:16:26 AM
to Jason Toy, trinityrnaseq-users
Hi Jason,

If all these 'genes' are actually aligning at the same genomic locus, then they were probably just not annotated properly as isoforms of the same gene, and you'd need to regroup them.  This could easily be a cufflinks issue.

best,

~b

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Jason Toy

May 24, 2021, 2:57:20 PM
to Brian Haas, trinityrnaseq-users
Hm, okay. How would I go about confirming if these 'genes' are actually aligning at the same genomic locus?

And if I need to regroup them, should I do this by summing their rows in the raw counts matrix (RSEM.gene.counts.matrix) or after running edgeR's calcNormFactors()? Or does it not matter?

Thanks again,
Jason

Brian Haas

May 24, 2021, 3:05:01 PM
to Jason Toy, trinityrnaseq-users
Yeah, if it's Cufflinks, then you'll have genome locations for everything.
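[Editor's sketch] One way to check whether a set of XLOC genes sits at the same locus is to collapse the GTF records to one span per gene and look at the gaps between neighbours. This is a minimal sketch, not a method described in the thread: it assumes a standard nine-column GTF with a `gene_id "..."` attribute, and the function names `gene_spans` and `adjacent_gaps` are hypothetical.

```python
import re

def gene_spans(gtf_lines):
    """Collapse GTF records to {gene_id: (chrom, strand, start, end)}."""
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if not match:
            continue
        gene = match.group(1)
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        if gene in spans:
            # Widen the span to cover every record seen for this gene.
            _, _, old_start, old_end = spans[gene]
            start, end = min(old_start, start), max(old_end, end)
        spans[gene] = (chrom, strand, start, end)
    return spans

def adjacent_gaps(spans, gene_ids):
    """List (gene_a, gene_b, gap) for neighbours on the same chrom and strand.

    A negative gap means the two spans overlap.
    """
    rows = sorted(spans[g] + (g,) for g in gene_ids if g in spans)
    gaps = []
    for prev, nxt in zip(rows, rows[1:]):
        if prev[0] == nxt[0] and prev[1] == nxt[1]:
            gaps.append((prev[4], nxt[4], nxt[2] - prev[3] - 1))
    return gaps
```

Genes that share a symbol and come back with small or negative gaps on the same strand are the likeliest candidates for regrouping. What counts as "small" depends on intron sizes in the organism, so no threshold is built in.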

You'd sum them up to the same gene if you're working with RSEM's estimated counts (not raw counts).  Then, you could redo your edgeR analysis.
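[Editor's sketch] Summing rows that belong to the same gene could look roughly like the following. It assumes the estimated counts have already been read into a plain dict of per-sample lists; `sum_rows` and its arguments are hypothetical names, and nothing here is part of Trinity, RSEM, or edgeR.

```python
def sum_rows(counts, row_to_group):
    """Sum per-sample count vectors for all rows assigned to the same group.

    counts:       {row_id: [count_sample1, count_sample2, ...]}
    row_to_group: {row_id: group_id}; rows missing from it keep their own ID.
    """
    merged = {}
    for row_id, values in counts.items():
        group = row_to_group.get(row_id, row_id)
        if group in merged:
            merged[group] = [a + b for a, b in zip(merged[group], values)]
        else:
            merged[group] = list(values)
    return merged

# Toy example: two fragments of one gene plus an unrelated gene.
example = sum_rows(
    {"XLOC_000044": [10.0, 5.0], "XLOC_000045": [2.5, 0.0], "XLOC_000100": [7.0, 1.0]},
    {"XLOC_000044": "ATRN1", "XLOC_000045": "ATRN1"},
)
```

The merged table would then go back through edgeR from the start, so that normalization factors are computed on the regrouped rows.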

Jason Toy

May 24, 2021, 3:47:30 PM
to Brian Haas, trinityrnaseq-users
Okay, gotcha. Are the RSEM estimated counts the ones in RSEM.gene.counts.matrix? Or in a different file?

Thanks so much!

Brian Haas

May 24, 2021, 4:19:17 PM
to Jason Toy, trinityrnaseq-users
Given the redefined trans/gene mapping file, you'd rerun the abundance estimation matrix building step, and that'll give you the refined gene matrix.

The isoform matrix has the estimated counts.  The gene matrix involves some additional math that involves some weights - details on the wiki.
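[Editor's sketch] Redefining the gene/transcript mapping could be done by rewriting the gene column of the existing mapping file. This sketch assumes a two-column, tab-separated file with the gene ID first and the transcript ID second (the layout of Trinity's gene_trans_map); the helper name `remap_gene_trans_map` and the merged ID used in the example are hypothetical.

```python
def remap_gene_trans_map(lines, gene_to_merged):
    """Rewrite the gene column of a gene<TAB>transcript mapping.

    gene_to_merged: {old_gene_id: merged_gene_id}; other genes are left as-is.
    Returns the remapped lines without trailing newlines.
    """
    remapped = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        gene, transcript = line.split("\t")[:2]
        remapped.append(f"{gene_to_merged.get(gene, gene)}\t{transcript}")
    return remapped

# Toy example: two fragment 'genes' collapsed into one merged ID.
new_map = remap_gene_trans_map(
    ["XLOC_000044\tTCONS_00000101\n", "XLOC_000045\tTCONS_00000102\n",
     "XLOC_000100\tTCONS_00000300\n"],
    {"XLOC_000044": "MERGED_ATRN1", "XLOC_000045": "MERGED_ATRN1"},
)
```

The rewritten file would then be supplied to the matrix-building step in place of the original mapping, so the gene-level weighting is recomputed for the merged genes.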

best,

~b

Jason Toy

May 25, 2021, 8:58:53 PM
to Brian Haas, trinityrnaseq-users
Hi Brian,

After manually investigating my annotated transcriptome, it seems that most, if not all, of these cases of duplicate annotations are due to split genes. Each of these putative genes maps to a different, non-overlapping portion of the reference gene used for annotation. So these appear to be cases where many putative genes are actually one single gene. Is there a good tool to automatically detect and merge split genes across my entire transcriptome?

Thanks!
Jason

Brian Haas

May 26, 2021, 7:39:02 AM
to Jason Toy, trinityrnaseq-users
Hi Jason,

PASA will merge annotations if there's transcriptome evidence for them to be merged, given a reference gene annotation and the annotation update functionality.  This may not be what you're after, though, and could be overly complicated to address this, plus I'm not sure how comprehensive it would be - solving some but not all cases.

You might need to roll your own solution here.
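[Editor's sketch] A home-grown starting point, following the pattern Jason describes (several putative genes hitting different, non-overlapping portions of the same reference protein), might look like this. It is only a heuristic under stated assumptions: the input is one best hit per gene as `(gene_id, subject_id, subject_start, subject_end)`, which you would have to extract from your own blastx or Trinotate output, and `split_gene_candidates` is a hypothetical name.

```python
from collections import defaultdict

def split_gene_candidates(hits, max_overlap=0):
    """Group genes whose best hits tile one subject without overlapping.

    hits: iterable of (gene_id, subject_id, subject_start, subject_end).
    Returns {subject_id: [gene_id, ...]}, ordered along the subject, for
    subjects hit by two or more genes whose intervals overlap by at most
    max_overlap residues.
    """
    by_subject = defaultdict(list)
    for gene, subject, start, end in hits:
        by_subject[subject].append((min(start, end), max(start, end), gene))

    candidates = {}
    for subject, intervals in by_subject.items():
        if len(intervals) < 2:
            continue
        intervals.sort()
        max_end = intervals[0][1]
        tiled = True
        for start, end, _ in intervals[1:]:
            # Compare against the furthest end seen so far, not just the neighbour.
            if max_end - start + 1 > max_overlap:
                tiled = False
                break
            max_end = max(max_end, end)
        if tiled:
            candidates[subject] = [gene for _, _, gene in intervals]
    return candidates
```

Candidates from this check still need a look at their genomic coordinates before merging. Distinct genes can legitimately hit different domains of the same protein, and paralogs will usually be excluded here only because their hits overlap.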

Jason Toy

May 26, 2021, 5:14:15 PM
to Brian Haas, trinityrnaseq-users
Alright, yeah, PASA might not be what I'm looking for on this. Thanks for your insight, Brian!