Hi Brian,
I ran into a bit of a theoretical hiccup in my analysis and am hoping you can guide me on the best practice here.
I am trying to run enrichment analysis with GOseq and have realized that my annotated gene-level expression matrix contains many duplicate gene symbols (I annotated it with Trinotate, using BLASTX against Swiss-Prot). The duplicates usually have consecutive gene IDs as well, e.g. gene IDs XLOC_000044 through XLOC_000049 all annotate as gene symbol ATRN1 (Attractin Like 1). The transcriptome was assembled against a reference with TopHat/Cufflinks.
This presents an immediate problem because GOseq won't allow duplicate gene names, but it also raises a bigger question: should these all be grouped together, since this is a gene-level analysis? In the duplicate cases I have looked at, the genes sharing a symbol tend to show similar DE patterns. So should I merge the rows that share a gene symbol into a single row per symbol? And if so, at what point in the analysis (on the raw counts, or after TMM normalization)?
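To make the question concrete, here is a minimal Python sketch of what I mean by merging at the raw-count stage: summing the per-sample counts over all gene IDs that share a symbol. The counts and the gene_to_symbol mapping are made up for illustration, and the function name is hypothetical. I am not sure this is the right approach, which is really what I am asking.

```python
def merge_counts_by_symbol(raw_counts, gene_to_symbol):
    """Sum raw per-sample counts over all gene IDs sharing a gene symbol.

    raw_counts: dict of gene ID -> list of raw counts, one per sample.
    gene_to_symbol: dict of gene ID -> gene symbol (hypothetical annotation map).
    Gene IDs with no symbol are kept as their own row, keyed by gene ID.
    """
    merged = {}
    for gene_id, counts in raw_counts.items():
        key = gene_to_symbol.get(gene_id) or gene_id
        if key in merged:
            # Add this gene's counts to the running total, sample by sample.
            merged[key] = [a + b for a, b in zip(merged[key], counts)]
        else:
            merged[key] = list(counts)
    return merged


# Made-up example: two XLOC IDs annotated as ATRN1, one unannotated gene.
raw_counts = {
    "XLOC_000044": [10, 12, 30, 28],
    "XLOC_000045": [5, 7, 14, 16],
    "XLOC_000050": [100, 90, 95, 105],
}
gene_to_symbol = {"XLOC_000044": "ATRN1", "XLOC_000045": "ATRN1"}

merged = merge_counts_by_symbol(raw_counts, gene_to_symbol)
print(merged)
```

The merged table would then go through normalization and DE testing as usual. The alternative I am asking about would be to normalize first and combine the rows afterwards.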
Thanks in advance!
Jason
--
Jason A. Toy | PhD Candidate
Dept. of Ecology and Evolutionary Biology
University of California, Santa Cruz