Using genes from aptamer study

Billy

unread,

Jun 1, 2025, 3:34:45 PMJun 1

to gsea-help

I'm reading an aptamer-based proteomics study and I'm wondering if I could use GSEA to gain insights about the upregulated proteins. The study has supplementary data in the form of log2 transformed values for each aptamer from the SomaScan platform.

The issue is that some genes show up multiple times because they are targeted by different aptamers. So I wondered if it would be acceptable to do something like average the values for a given participant when there are multiple aptamers for a single gene, then use the resulting value in standard GSEA.

If so, any tips about what ranking metric to use in GSEA?

----

I have a separate question about ranking metrics. Why would one not prefer an absolute metric (all positive values) in most cases?

The way I see it, if a gene pathway is affected in a disease, some of the genes in the pathway may be upregulated and some downregulated, such as through feedback loops where an increase of one protein leads to a decrease in another. Only up or down-regulated genes will contribute to the final enrichment score, depending on which one contributes more. So half of them could be kind of ignored.

So as a simple example, I could maybe imagine that if 6 genes are upregulated in one gene set but none are downregulated, that might score higher than a gene set where 5 are upregulated and 5 are downregulated. But I would be more interested in a gene set where 10 genes are affected than 6.

If instead the p-value or absolute value of fold change was used, it would allow for incorporating all differentially expressed genes into the score.

Anthony Castanza

unread,

Jun 3, 2025, 6:43:23 PMJun 3

to gsea-help

Hi Billy,

The suitability of this data for GSEA would, at least in part, depend on the proteome-wide coverage of the study. GSEA is generally intended for studies with "complete" (more or less) coverage, so, tens of thousands of genes.

Multiple representations of a gene in the data is a longstanding problem across multiple data types. GSEA, when computing enrichment scores, needs a unique representation of a given gene. We have "Collapse" functionality baked into the software to do this for supported data types but not for this kind of proteomics. Additionally doing this in the fold change space isn't really recommended because it wouldn't be aware of the underlying relative differences in a abundance. In collapse for GSEA Preranked we, by default, take the max absolute fold change for the entities that map to a single gene symbol. This is also definitely not an ideal solution but it's at least an explainable one.

Without knowing exactly what kind of data you have access to here it's hard to give recommendations. The best option, if the data is available, would be to sum whatever countable gene entities you have together to produce a binned representation of the gene.

GSEA supports computations in both a signed and absolute space, they give you different pieces of information. Operating in the absolute space does exactly what you say it does allowing you to essentially score sets on their overall perturbation. Operating in the signed space allows you to compute up or down collective differential expression which is useful for response or marker gene sets. They're just different use cases.

Sorry I couldn't be of more help here, feel free to reach back out with any additional questions though!

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/gsea-help/ecd96a7d-a855-48e0-b179-2b5a0e3d15e5n%40googlegroups.com.

Bill

unread,

Jun 3, 2025, 9:23:39 PMJun 3

to gsea-help

Hi Anthony,

Thank you for the response!

The study has about 7300 aptamers, so smaller than you're suggesting is recommended. After getting rid of duplicates, it brings it down to about 6300 unique genes. I did averaging at first, not knowing it's not ideal for FC. I'm not sure what you mean by "sum whatever countable gene entities you have together to produce a binned representation of the gene."

Here's the study if you want to look: https://www.medrxiv.org/content/10.1101/2025.05.28.25328245v1

And here's what appears to be the relevant portion for describing the data provided in the supplementary files:

"Serum samples were analyzed using the SomaScan platform (SomaLogic Operating Co., Inc.), yielding a dataset for 7326 specific aptamer-detected protein targets. Outlier detection removed four samples based on Mahalanobis distances of log10-transformed values, followed by PCA and a chi-square test (p < 0.1). High-leverage aptamers were identified using Z-scores (±1.8, corresponding to the 2.5th and 97.5th percentiles) and assigned NA, after which missing values were imputed using the variable’s minimum and maximum post-outlier removal. Given these adjustments, log10-transformed data were back-transformed using base-10 antilogarithms, followed by a final log2 transformation."

The GSEA User Guide recommends using natural values instead of log transformed values, so I reversed the final log2. I had already run GSEA on a few gene set collections with the "averaged" method, and I have now done it with the "max" method.

The results seem quite strange in both cases in that so many gene sets are significant. With the human phenotype collection, out of 2237 gene sets, "819 gene sets are significantly enriched at FDR < 25%". I also tried ImmuneSigDB collection. Out of 4872 gene sets, 3569 are FDR < 25%. I can't figure out what would cause this.

The User Guide says in this case it's possible "you might be seeing significant differences between the phenotypes due to technical artifacts, such as samples being run in different labs, by different operators, or against different arrays", but I don't see how that would cause lots of gene sets to be enriched. I can imagine all genes might be shifted up or down if different labs did things differently, but I don't think that should affect gene set enrichment. Why would specifically the genes in most of these gene sets be shifted up or down?

Thanks for the information about the ranking metric sign.

Best,

Bill

unread,

Jun 3, 2025, 9:33:18 PMJun 3

to gsea-help

PS. To clarify, I'm doing standard GSEA that uses the expression data for each participant.

Anthony Castanza

unread,

Jun 4, 2025, 6:21:31 PMJun 4

to gsea...@googlegroups.com

Hey Bill,

I see. ~7000 "genes" isn't ideal, but is definitely not impossible to work with.

Referring to "countable entities", I'm not sure how the aptamer quantification works, I assumed it's counting instances of aptamers that have bound their target, in which case ideally you'd want to count all the aptamers bound to all the proteins from a given gene. However, I don't think that this data is available in the supplementals, so you're kind of stuck with the log data. On that subject, I think that this data type might actually be a case for the "cosine" ranking metric:

● Cosine is a variant of Pearson’s that is defined by Eisen et al. (1998). It is appropriate only when the expression profiles are based on log-expression ratios relative to a control. (https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html)

Log ratios relative to a control sounds like the data you've described (and what I saw in Table S1)

That is more enriched gene sets than I would expect to see.
I would try this again with Cosine, Phenotype Permutation (I think you have enough samples), and filtering to max.
If you're still seeing abnormally large numbers of enriched sets it might be helpful to send me some of the plots from one of them, it could help in assessing if it looks like there are distributional issues.
Another thing might be to look at the output file for:

List of gene sets used and their sizes (restricted to features in the specified dataset)

You can look at how well the sets are being represented in the analysis, if they have really low ratios of "After restricting" size compared to "Original size" the genes in the sets are probably not particularly well suited to the data. Although considering the size of the input data here it's hard to say what a valid relative size would be, maybe 60%? but that's not a value based on much.

I'd note that a lot of the user guide is based on experience from old microarray platforms where data could be split across multiple plates and the different plates could have specific gene subsets. It's in need of an update for more modern concerns.

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

To view this discussion visit https://groups.google.com/d/msgid/gsea-help/02a48914-d61f-4e45-a675-ed18bc757f10n%40googlegroups.com.

Bill

unread,

Jun 5, 2025, 2:14:18 PMJun 5

to gsea-help

Hi Anthony,

Thanks for that information. I think you were right. I tried using a continuous phenotype with cosine on the human phenotypes collection and now there are 0 FDR < 25% with a few gene sets at p < 1%.

I also tried doing preranked with the summary stats the study provides. They aren't data about quite the same thing since I think they controlled for demographic covariates, but with that I also get a more normal result with 0 significant.

The original test that gave too many results was with Signal2Noise (~800). I tried with the other categorical metrics. tTest, Diff_of_Classes, and log2_Ratio_of_Classes which resulted in even more significant gene sets (~900). With Ratio_of_Classes, there are none significant.

I looked at the gene set sizes like you suggested. Of those that were included in the final analysis (2237), on average 41% of the genes in the gene sets were used.