U95 Affymetrix microarray data redundant probes

Yahya Yozbatıran

Aug 27, 2025, 3:39:39 AM

to gsea-help

Dear GSEA Support Team,

In microarray data, multiple probes may target the same gene but sometimes show different expression patterns. I would like to ask how GSEA handles such cases, and what would be the recommended approach on my side.

For example, if one probe for a gene is significantly upregulated while another probe for the same gene is not significant, which probe should I consider for GSEA input? Should I select one based on certain criteria (e.g., highest variance, average expression, most significant p-value), or does GSEA have a preferred or automatic way of resolving such redundancy?

Thank you very much for your guidance.

Best regards,
Yahya Yozbatiran

Anthony Castanza

Aug 27, 2025, 4:55:09 PM

to gsea...@googlegroups.com

Hello,

Since GSEA operates at the gene level, the software "collapses" multiple microarray probes that represent the same gene into a single gene-level value. By default, GSEA uses the "max_probe" method, which selects the probe with the highest intensity to represent that gene. The software also offers several other collapse options, described in the "Advanced fields" section of the GSEA parameters.

If these options don't meet your needs, you can instead map the probe-level data to gene level in external software and supply GSEA with the pre-mapped data. In that case, we'd still likely recommend a second round of collapse in the GSEA software using the symbol remapping chip files, just to ensure that the symbols used in your data match those used in the MSigDB gene sets.
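For readers who want to do the probe-to-gene mapping outside GSEA, the following is a minimal sketch in plain Python. It is not GSEA's own code. It assumes a max-style collapse means taking, for each sample, the largest value among a gene's probes, and that may not match GSEA's max_probe exactly. The function name `collapse_probes`, the probe IDs, the gene symbols, and the expression values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes(expr, probe_to_gene, how="max"):
    """Collapse a probe-level matrix to gene level.

    expr: dict mapping probe ID -> list of per-sample values
    probe_to_gene: dict mapping probe ID -> gene symbol
    how: "max" or "mean", applied per sample across a gene's probes
    """
    reducers = {"max": max, "mean": mean}
    reduce_fn = reducers[how]
    rows_by_gene = defaultdict(list)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # drop probes with no gene annotation
            rows_by_gene[gene].append(values)
    # zip(*rows) walks the samples column by column
    return {
        gene: [reduce_fn(column) for column in zip(*rows)]
        for gene, rows in rows_by_gene.items()
    }


# Hypothetical probe IDs and values, for illustration only
expr = {
    "probe_a": [100.0, 250.0, 80.0],
    "probe_b": [120.0, 90.0, 300.0],
    "probe_c": [50.0, 60.0, 70.0],
}
probe_to_gene = {"probe_a": "GENE1", "probe_b": "GENE1", "probe_c": "GENE2"}

gene_max = collapse_probes(expr, probe_to_gene, how="max")
gene_mean = collapse_probes(expr, probe_to_gene, how="mean")
```

The resulting gene-level table would still need to be written out in a format GSEA accepts. As noted above, a second collapse in GSEA with a symbol remapping chip file is still advisable so that the symbols match MSigDB.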

Let me know if you have any additional questions.

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

Yahya Yozbatıran

Aug 28, 2025, 2:57:05 AM

to gsea...@googlegroups.com

Dear Anthony,

Thank you very much for your clear explanation.

As a follow-up, I would like to ask your recommendation on the preprocessing step. Which strategy would you suggest is generally the most appropriate or commonly used? For example, some approaches include selecting the probe with the highest variance across samples, averaging all probes for a gene, or selecting the most significant probe after differential expression testing.

In your experience, would you recommend relying on the default GSEA “max_probe” method, or do you find that preprocessing at the gene level (e.g., variance- or average-based) provides a more robust input for GSEA?

Thank you again for your guidance.

Best regards,
Yahya Yozbatiran

Anthony Castanza

Aug 28, 2025, 8:14:42 PM

to gsea...@googlegroups.com

My inclination would be that while the GSEA max_probe method is generally accepted, preprocessing through a dedicated microarray analysis method before supplying the gene-level matrix to GSEA would be a more robust approach than the simple max_probe collapse offered through the GSEA software.

I would encourage you to try both the variance-based selection and the GSEA max_probe method with your dataset and compare the results. Note that for a direct comparison you would need to set a fixed random seed value in the GSEA advanced fields.
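The variance-based selection mentioned above can be sketched as follows. This is only an illustration in plain Python, not a recommended or official implementation. It assumes "highest variance" means the sample variance of each probe across all samples. The function name `pick_most_variable_probe` and the example data are hypothetical. It covers only the probe selection step; the fixed random seed is set separately in GSEA.

```python
from statistics import variance


def pick_most_variable_probe(expr, probe_to_gene):
    """For each gene, keep the single probe with the highest sample variance.

    expr: dict mapping probe ID -> list of per-sample values (two or more samples)
    probe_to_gene: dict mapping probe ID -> gene symbol
    Returns (gene -> chosen probe ID, gene -> that probe's values).
    """
    best = {}  # gene -> (variance, probe ID)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # skip probes without a gene annotation
        var = variance(values)  # sample variance across samples
        if gene not in best or var > best[gene][0]:
            best[gene] = (var, probe)
    chosen = {gene: probe for gene, (_, probe) in best.items()}
    matrix = {gene: expr[probe] for gene, probe in chosen.items()}
    return chosen, matrix


# Hypothetical probe IDs and values, for illustration only
expr = {
    "probe_a": [100.0, 250.0, 80.0],
    "probe_b": [120.0, 90.0, 300.0],
    "probe_c": [50.0, 60.0, 70.0],
}
probe_to_gene = {"probe_a": "GENE1", "probe_b": "GENE1", "probe_c": "GENE2"}

chosen, matrix = pick_most_variable_probe(expr, probe_to_gene)
```

Unlike a per-sample maximum, this keeps one whole probe per gene, so every value for a gene comes from the same probe.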

-Anthony

Yahya Yozbatıran

Sep 12, 2025, 4:58:47 PM

to gsea...@googlegroups.com

Dear Dr. Anthony Castanza,

Thank you very much for your helpful explanation; it helped me a lot.

I would like to kindly ask a follow-up question:
When preparing input for GSEA from microarray datasets, is it acceptable to provide log2-transformed expression values, or should the input strictly consist of raw signal intensities summarized at the gene level?

Thank you again for your valuable guidance. Have healthy days.

Best regards
Yahya Yozbatiran

Anthony Castanza

Sep 12, 2025, 5:07:56 PM

to gsea-help

Hello,

My apologies for the extremely delayed reply. Your message got caught in the Google Groups spam filter over a holiday and was overlooked. Our official recommendation per the GSEA FAQ is to use the natural scale data, not the log-transformed data. A log transformation changes the nature of the ranking calculations that GSEA performs, and we cannot vouch for their performance in such a case.
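If the gene-level values are only available on a log2 scale, one way to return to the natural scale is to invert the transform. The sketch below assumes the values were produced by a plain log2, optionally with a pseudocount added beforehand (for example log2(x + 1)). The function name `log2_to_natural` and its `pseudocount` parameter are hypothetical, and whether a pseudocount was used depends on the upstream pipeline.

```python
def log2_to_natural(values, pseudocount=0.0):
    """Invert a log2 transform: x = 2**y - pseudocount.

    values: list of log2-scale expression values for one gene or probe
    pseudocount: constant that was added before taking log2 (0.0 if none)
    """
    return [2.0 ** v - pseudocount for v in values]


# Hypothetical log2-scale values, for illustration only
natural = log2_to_natural([0.0, 1.0, 10.0])
natural_with_offset = log2_to_natural([1.0, 3.0], pseudocount=1.0)
```

The same function would be applied row by row to the whole matrix before the data are supplied to GSEA.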

Let me know if you have any additional questions,

-Anthony

Yahya Yozbatıran

Sep 12, 2025, 5:11:16 PM

to gsea...@googlegroups.com

Hello,

Thank you very much for your kind response and the information. Have healthy days.

Best regards

Yahya Yozbatıran
