How does Omics Playground handle duplicates in a counts file?


BigOmics Analytics

May 19, 2021, 3:51:17 AM

to Omics Playground

[email Thorben S. 19.05.2021]

We observed that there are some duplicates in one of the counts files we uploaded for analysis. We noticed that the Playground automatically reduces the number of rows and excludes the duplicates from the analysis. However, we were wondering which of the duplicates is kept and which ones are excluded. Or is some kind of merging process occurring?

The duplicates originated from a protein grouping algorithm in which a protein ID (or gene ID, for gene groups) may be present in multiple groups, depending on the identified peptides. Since we have to reduce the protein/gene groups to single identifiers, we typically just keep the first identifier in each group and delete the rest. Is that approach okay? I don't know how to handle protein groups for differential analysis or gene set enrichment analysis, where single identifiers are needed. Could you give me some guidance on how to handle this kind of data?

BigOmics Analytics

May 19, 2021, 4:05:01 AM

to Omics Playground

Hi Thorben,

Duplicated row identifiers (genes/proteins with the same name) are handled by summing up their linear intensities/counts. If the data are log-transformed, this is detected automatically (in most cases...) and the values are exponentiated before summing. The rationale for summing the counts (or linear intensities in proteomics) is that we don't differentiate between possible gene/protein isoforms, so we add them up as a group.
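As a rough illustration of the summing described above, here is a minimal Python sketch. It is not the actual Omics Playground code: the function name `sum_duplicate_rows` and the `is_log2` flag are hypothetical, the log base is assumed to be 2, and the automatic log detection is replaced by an explicit flag because its rule is not described here.

```python
def sum_duplicate_rows(rows, is_log2=False):
    """Collapse rows sharing an identifier by summing on the linear scale.

    rows    : list of (identifier, [value per sample]) tuples
    is_log2 : assumption for this sketch; set True if values are log2-transformed
    Returns a dict mapping each identifier to its summed linear values.
    """
    totals = {}
    for name, values in rows:
        # Bring log2 values back to the linear scale before adding them up.
        linear = [2.0 ** v for v in values] if is_log2 else list(values)
        if name in totals:
            totals[name] = [a + b for a, b in zip(totals[name], linear)]
        else:
            totals[name] = linear
    return totals


# Two rows named TP53 are merged into one row holding the per-sample sums.
counts = [("TP53", [10, 20]), ("ACTB", [100, 200]), ("TP53", [5, 15])]
merged = sum_duplicate_rows(counts)

# Log2 input: each value is exponentiated first, then summed.
merged_from_log = sum_duplicate_rows([("A", [3.0, 4.0]), ("A", [3.0, 4.0])], is_log2=True)
```

In this sketch no duplicate is "chosen" over another: every row with the same name contributes to the total, so keeping only the first identifier of a protein group before upload means groups that share that identifier end up summed together.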

If you want to retain the isoforms, you can keep the names as GENE.1 and GENE.2, but you must then turn off any gene filter. However, such gene/protein variants are currently not recognized in the gene sets, so this will result in wrong enrichment tests.
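If you do want to go the GENE.1 / GENE.2 route, the renaming could be done before upload along these lines. This is only a sketch: the helper name `suffix_duplicates` is hypothetical, and it assumes that none of the original names already ends in a numeric suffix such as ".1".

```python
from collections import Counter


def suffix_duplicates(names):
    """Append .1, .2, ... to names that occur more than once; leave unique names as-is."""
    occurrences = Counter(names)  # how often each name appears in total
    seen = Counter()              # how often each name has appeared so far
    renamed = []
    for name in names:
        if occurrences[name] > 1:
            seen[name] += 1
            renamed.append(f"{name}.{seen[name]}")
        else:
            renamed.append(name)
    return renamed


labels = suffix_duplicates(["TP53", "ACTB", "TP53"])
```

The suffixed names are then unique, so nothing is summed, but as noted above they will not match the gene set definitions.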