Hi Anthony,
Thank you for the response!
The study has about 7300 aptamers, so it is smaller than the size you say is recommended. Removing duplicates brings it down to about 6300 unique genes. I averaged the duplicates at first, not knowing that averaging is not ideal for FC. I'm not sure what you mean by "sum whatever countable gene entities you have together to produce a binned representation of the gene."
Here is what appears to be the relevant passage describing the data provided in the supplementary files:
"Serum samples were analyzed using the SomaScan platform (SomaLogic
Operating Co., Inc.), yielding a dataset for 7326 specific
aptamer-detected protein targets. Outlier detection removed four samples
based on Mahalanobis distances of log10-transformed values, followed by
PCA and a chi-square test (p < 0.1). High-leverage aptamers were
identified using Z-scores (±1.8, corresponding to the 2.5th and 97.5th
percentiles) and assigned NA, after which missing values were imputed
using the variable’s minimum and maximum post-outlier removal. Given
these adjustments, log10-transformed data were back-transformed using
base-10 antilogarithms, followed by a final log2 transformation."
The GSEA User Guide recommends using natural values instead of log-transformed values, so I reversed the final log2 transformation. I had already run GSEA on a few gene set collections with the "averaged" method, and I have now also run it with the "max" method.
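In case it helps to pin down what I mean, here is a minimal sketch of the two steps: undoing the final log2, then collapsing duplicate aptamers to one row per gene by mean or by max. The function names, gene labels, and values are all made up for illustration. This is not GSEA's own code, and the per-sample max is only my understanding of what a "max" collapse does.

```python
from collections import defaultdict
from statistics import mean


def undo_log2(value):
    """Back-transform a log2 value to the natural scale."""
    return 2.0 ** value


def collapse_to_genes(rows, how="max"):
    """Collapse aptamer-level rows to one row per gene.

    rows: list of (gene_symbol, [value per sample]) tuples on the natural scale.
    how:  "max" keeps the per-sample maximum across a gene's aptamers,
          "mean" takes the per-sample average.
    """
    combine = {"max": max, "mean": mean}[how]
    grouped = defaultdict(list)
    for gene, values in rows:
        grouped[gene].append(values)
    # zip(*vals) walks the samples column by column across a gene's aptamers.
    return {
        gene: [float(combine(column)) for column in zip(*vals)]
        for gene, vals in grouped.items()
    }


# Made-up log2 values: two aptamers for GENE_A, one for GENE_B, two samples.
log2_rows = [
    ("GENE_A", [3.0, 4.0]),
    ("GENE_A", [5.0, 2.0]),
    ("GENE_B", [1.0, 1.0]),
]

natural_rows = [(gene, [undo_log2(v) for v in vals]) for gene, vals in log2_rows]
by_max = collapse_to_genes(natural_rows, how="max")
by_mean = collapse_to_genes(natural_rows, how="mean")
```

With these toy numbers the two methods give different GENE_A rows, which is why I reran everything after switching from averaging to max.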
The results seem quite strange in both cases, in that so many gene sets are significant. With the human phenotype collection, out of 2237 gene sets, "819 gene sets are significantly enriched at FDR < 25%". I also tried the ImmuneSigDB collection: out of 4872 gene sets, 3569 have FDR < 25%. I can't figure out what would cause this.
The User Guide says in this case it's possible "you might be seeing significant differences between the phenotypes due to technical artifacts, such as samples being run in different labs, by different operators, or against different arrays", but I don't see how that would cause lots of gene sets to be enriched. I can imagine all genes might be shifted up or down if different labs did things differently, but I don't think that should affect gene set enrichment. Why would specifically the genes in most of these gene sets be shifted up or down?
Thanks for the information about the ranking metric sign.
Best,
Bill