Collapse or No Collapse


Giorgia Silvestrini

Feb 17, 2022, 4:17:05 PM
to gsea-help
Hi Anthony, 
I tried redoing the analysis as you suggested the day before yesterday. I ran GSEA PreRanked with the gene list given as ENSEMBL IDs and the Collapse option selected for the chip platform, and the system tells me "Success with warnings". The warning is: "Scoring produced infinite or NaNs values which may have prevented plotting for certain gene sets. See the log for more details." What is the problem this time?
Just one more question. When I use a list of genes for which I have the gene symbol and then select the Human_Gene_Symbol_with_Remapping chip, using the Collapse function I go from 54547 genes to 37219 genes. Instead, using the No Collapse function all genes are considered. What is the difference between the two options, because in the first case it excludes some genes? 
Thanks,
Giorgia

Anthony Castanza

Feb 22, 2022, 4:48:41 PM
to gsea-help
Hi Giorgia,

It looks like my reply here might not have gone through; we've been having some issues with email service here at UCSD. Did you get my answer on this question?
In case it didn't come through:
The most likely cause of this "Success with warnings" response is the following. When GSEA evaluated your ranked list, it scored some gene sets highly, but when it computed the permutation matrix for those sets, it never sampled a single set in the null distribution with the same sign as the enrichment result. That effectively creates a divide-by-zero error when calculating the NES and p-value. This can sometimes be remedied by increasing the number of permutations to something like 10,000 (which should be OK to do in gene set permutation mode/GSEA Preranked).
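
To make the failure mode concrete, here is a rough Python sketch of that normalization step. It is a simplified illustration, not GSEA's actual code: the function name `normalize_enrichment_score` and the toy numbers are made up for the example, and it assumes the NES is the observed ES divided by the mean magnitude of the same-signed null scores.

```python
import math


def normalize_enrichment_score(observed_es, null_es):
    """Return (NES, nominal p-value) for one gene set.

    Hypothetical helper for illustration only. Only null scores with
    the same sign as the observed ES are used for the normalization.
    """
    positive = observed_es >= 0
    same_sign = [x for x in null_es if (x >= 0) == positive]

    # No same-signed null scores: there is nothing to divide by,
    # so the NES and p-value are undefined (reported here as NaN).
    if not same_sign:
        return math.nan, math.nan

    mean_magnitude = sum(abs(x) for x in same_sign) / len(same_sign)
    if mean_magnitude == 0:
        return math.nan, math.nan

    nes = observed_es / mean_magnitude
    # Fraction of same-signed null scores at least as extreme as observed.
    extreme = sum(1 for x in same_sign if abs(x) >= abs(observed_es))
    return nes, extreme / len(same_sign)


# Toy null distributions (made-up values).
ok_nes, ok_p = normalize_enrichment_score(0.8, [0.2, 0.4, -0.3])
bad_nes, bad_p = normalize_enrichment_score(0.8, [-0.1, -0.2, -0.3])
```

In the second call every null score is negative while the observed ES is positive, so both results come back as NaN. With more permutations, at least one same-signed null score is more likely to be sampled.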

With regard to your second question:

"When I use a list of genes for which I have the gene symbol and then select the Human_Gene_Symbol_with_Remapping chip, using the Collapse function I go from 54547 genes to 37219 genes. Instead, using the No Collapse function all genes are considered. What is the difference between the two options, because in the first case it excludes some genes?"
When using the Collapse function, by default genes that aren't mapped to a valid gene symbol are removed. This can clean up the data somewhat by keeping only genes that have been annotated in our dataset. If using No_Collapse, GSEA won't match any genes that have been renamed to their current gene symbol versions. So while it will "keep" all the genes in the dataset, it will actually perform worse, because there may be genes that are in the gene set under one name but not recognized because they're in the dataset under a completely different name. We don't recommend this latter option. If a gene doesn't appear in the chip file, then it also could never appear in a gene set, since we use the same database to build both sets of gene annotations. However, if for whatever reason you want both the benefit of remapping your gene symbols using the Collapse function and to keep the gene symbols that we didn't recognize (and that would otherwise be thrown away), there is an additional option you can set. Under "Advanced fields" you'll want to change the "Omit features with no symbol match" parameter from the default 'true' to 'false'.
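
As a rough picture of the difference, here is a small Python sketch of collapsing a ranked list through a chip mapping. It is only an illustration under stated assumptions, not GSEA's implementation: the gene names and the `collapse_ranked_list` function are hypothetical, the chip is modeled as a plain dictionary, and duplicates are resolved by keeping the score with the largest absolute value, which is just one possible collapse rule.

```python
def collapse_ranked_list(ranked, chip, omit_unmatched=True):
    """Remap features to current symbols and merge duplicates.

    ranked: list of (feature, score) pairs.
    chip: dict mapping a feature name to its current gene symbol
          (a toy stand-in for a chip annotation file).
    omit_unmatched: mimics "Omit features with no symbol match".
    """
    collapsed = {}
    for feature, score in ranked:
        symbol = chip.get(feature)
        if symbol is None:
            if omit_unmatched:
                continue  # feature is not in the chip, so drop it
            symbol = feature  # keep the unrecognized name as-is
        # Assumed rule: keep the score with the largest magnitude.
        if symbol not in collapsed or abs(score) > abs(collapsed[symbol]):
            collapsed[symbol] = score
    return collapsed


# Hypothetical names: OLD_A is an outdated alias of GENE_A.
chip = {"OLD_A": "GENE_A", "GENE_A": "GENE_A", "GENE_B": "GENE_B"}
ranked = [("OLD_A", 2.0), ("GENE_A", -3.0), ("GENE_B", 1.0), ("UNKNOWN_X", 0.5)]

default_result = collapse_ranked_list(ranked, chip)
keep_all_result = collapse_ranked_list(ranked, chip, omit_unmatched=False)
```

With the default setting the four input rows shrink to two symbols, which is the same kind of drop as the 54547 to 37219 change described in the question. Setting `omit_unmatched=False` keeps `UNKNOWN_X` while still merging `OLD_A` into `GENE_A`.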

Let me know if you have more questions. Hopefully my future replies won't vanish into the ether!

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego