U95 Affymetrix microarray data redundant probes


Yahya Yozbatıran

Aug 27, 2025, 3:39:39 AM
to gsea-help

Dear GSEA Support Team,

In microarray data, multiple probes may target the same gene but sometimes show different expression patterns. I would like to ask how GSEA handles such cases, and what would be the recommended approach on my side.

For example, if one probe for a gene is significantly upregulated while another probe for the same gene is not significant, which probe should I consider for GSEA input? Should I select one based on certain criteria (e.g., highest variance, average expression, most significant p-value), or does GSEA have a preferred or automatic way of resolving such redundancy?

Thank you very much for your guidance.

Best regards,
Yahya Yozbatiran

Anthony Castanza

Aug 27, 2025, 4:55:09 PM
to gsea...@googlegroups.com
Hello,

GSEA operates at the gene level, so when it is given microarray probe-level data the software "collapses" multiple probes that represent the same gene into a single value. By default, GSEA uses the "max_probe" method, which selects the probe with the highest intensity to represent that gene. The software also offers several other collapse options, described in the "Advanced fields" section of the GSEA parameters.
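
To make the collapse step concrete, here is a minimal sketch of a max-style collapse in plain Python. It assumes that "highest intensity" is applied per sample, i.e. each gene gets the maximum value across its probes within each sample; that reading, the function name `collapse_max`, and the probe and gene identifiers are illustrative assumptions, not GSEA's actual code.

```python
from collections import defaultdict


def collapse_max(probe_rows, probe_to_gene):
    """Collapse probe-level rows to gene-level rows.

    Assumption: for each sample, the gene value is the maximum across
    all probes mapped to that gene. Probes without a mapping are dropped.
    """
    grouped = defaultdict(list)
    for probe_id, values in probe_rows.items():
        gene = probe_to_gene.get(probe_id)
        if gene is None:
            continue  # skip probes that have no gene annotation
        grouped[gene].append(values)
    # zip(*rows) walks the samples column by column
    return {gene: [max(col) for col in zip(*rows)] for gene, rows in grouped.items()}


# Hypothetical probe IDs, gene symbols, and values (three samples)
probe_rows = {
    "probe_a": [5.0, 9.0, 6.0],
    "probe_b": [7.0, 8.0, 4.0],
    "probe_c": [3.0, 3.5, 2.5],
}
probe_to_gene = {"probe_a": "GENE1", "probe_b": "GENE1", "probe_c": "GENE2"}

gene_rows = collapse_max(probe_rows, probe_to_gene)
print(gene_rows)
```

Swapping `max` for a mean or median over each column would give an averaging-style collapse under the same assumptions.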

[Attachment: image.png]

If these options don't meet your needs, you can instead map the probe-level data to gene level in external software and supply GSEA with the pre-mapped data. In that case, we'd likely still recommend running a second round of collapse in GSEA using the symbol remapping chip files, just to ensure that the symbols in your data match those used in the MSigDB gene sets.
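
If you collapse externally, the gene-level matrix still has to be written in a format GSEA can load. The sketch below writes a small matrix as tab-delimited GCT text, assuming the commonly documented layout (a `#1.2` version line, a row/column count line, then a `NAME` / `Description` header followed by sample columns). The helper name `to_gct`, the `na` description placeholder, and the sample and gene names are illustrative assumptions; please check the output against the GSEA data format documentation before relying on it.

```python
def to_gct(gene_rows, sample_names):
    """Render a gene-level matrix as GCT-style text (assumed v1.2 layout)."""
    lines = ["#1.2", f"{len(gene_rows)}\t{len(sample_names)}"]
    lines.append("\t".join(["NAME", "Description"] + list(sample_names)))
    for gene, values in gene_rows.items():
        if len(values) != len(sample_names):
            raise ValueError(f"{gene}: expected {len(sample_names)} values")
        # "na" is used here as a placeholder description
        lines.append("\t".join([gene, "na"] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


# Hypothetical gene-level values, e.g. produced by an external collapse step
gene_matrix = {"GENE1": [7.0, 9.0, 6.0], "GENE2": [3.0, 3.5, 2.5]}
samples = ["sample_1", "sample_2", "sample_3"]

gct_text = to_gct(gene_matrix, samples)
print(gct_text)
```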

Let me know if you have any additional questions.

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego


Yahya Yozbatıran

Aug 28, 2025, 2:57:05 AM
to gsea...@googlegroups.com

Dear Anthony,

Thank you very much for your clear explanation.

As a follow-up, I would like to ask your recommendation on the preprocessing step. Which strategy would you suggest is generally the most appropriate or commonly used? For example, some approaches include selecting the probe with the highest variance across samples, averaging all probes for a gene, or selecting the most significant probe after differential expression testing.

In your experience, would you recommend relying on the default GSEA “max_probe” method, or do you find that preprocessing at the gene level (e.g., variance- or average-based) provides a more robust input for GSEA?

Thank you again for your guidance.

Best regards,
Yahya Yozbatiran


Anthony Castanza

Aug 28, 2025, 8:14:42 PM
to gsea...@googlegroups.com
My inclination is that, while the GSEA max_probe method is generally accepted, preprocessing with a dedicated microarray analysis method and then supplying the gene-level matrix to GSEA would be more robust than the simple max_probe collapse offered in the GSEA software.
I would encourage you to try both the variance-based selection and the GSEA max_probe method on your dataset and compare the results. Note that for a direct comparison you would need to set a fixed random seed value in the GSEA advanced fields.
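
As a rough illustration of that comparison, the sketch below selects, for each gene, the probe with the highest sample variance, and then lists the genes whose values differ from a per-sample maximum collapse. All names and values are hypothetical, and the per-sample maximum is only an assumed stand-in for what max_probe does inside GSEA.

```python
from collections import defaultdict
from statistics import variance


def select_highest_variance(probe_rows, probe_to_gene):
    """For each gene, keep the single probe with the largest sample variance."""
    best = {}
    for probe_id, values in probe_rows.items():
        gene = probe_to_gene.get(probe_id)
        if gene is None:
            continue  # skip probes that have no gene annotation
        var = variance(values)  # sample variance; needs at least 2 samples
        if gene not in best or var > best[gene][0]:
            best[gene] = (var, list(values))
    return {gene: values for gene, (_, values) in best.items()}


def per_sample_max(probe_rows, probe_to_gene):
    """Assumed stand-in for a max-style collapse: max across probes per sample."""
    grouped = defaultdict(list)
    for probe_id, values in probe_rows.items():
        gene = probe_to_gene.get(probe_id)
        if gene is not None:
            grouped[gene].append(values)
    return {gene: [max(col) for col in zip(*rows)] for gene, rows in grouped.items()}


# Hypothetical data: GENE1 has two probes with different behaviour
probe_rows = {
    "probe_a": [5.0, 9.0, 6.0],   # variable probe
    "probe_b": [7.0, 7.5, 7.2],   # brighter on average, but nearly flat
    "probe_c": [3.0, 3.5, 2.5],
}
probe_to_gene = {"probe_a": "GENE1", "probe_b": "GENE1", "probe_c": "GENE2"}

by_variance = select_highest_variance(probe_rows, probe_to_gene)
by_max = per_sample_max(probe_rows, probe_to_gene)

# Genes whose collapsed values depend on the strategy chosen
differing = sorted(g for g in by_variance if by_variance[g] != by_max[g])
print(differing)
```

Genes with a single probe come out identical under both strategies, so any difference in the downstream GSEA results would come from multi-probe genes such as `GENE1` in this toy example.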

-Anthony
