After pruning, none of the gene sets passed size thresholds


Elizabeth Bartom

Jun 6, 2018, 11:33:20 PM
to gsea-help
Hello,

I am trying to run GSEA-preranked, with a set of about 19,000 human genes, ranked according to a score calculated based on sequence motifs.
As far as I can tell, the gene names are official HUGO gene symbols, although they were defined as "gene name" in the Ensembl Biomart download tool, rather than as HUGO symbols.  I have formatted them in a .rnk file, with the format: gene name<tab>score<newline>.  I am happy to compare them to GO or mSigDB or any other gene sets, and I also have some gene sets that I have defined from the literature (one ~400 genes, the other ~1800 genes).  I put the literature defined sets each into a separate .grp file.  

All of the files get parsed into GSEA correctly, but when I run GSEA-preranked, I always get error 1001.
Is there a definitive way to check that my gene symbols are HUGO symbols?  All 19k in one batch?
Is it possible that the problem is in the .rnk file, even though the error seems to indicate a problem with the gene sets?  
Is there a sample .rnk file that I could test?  It seems that the .rnk file should include all of the transcriptome, or all of the transcriptome for which there is data.  True?
I have tried different gene sets, to no avail, so it seems unlikely that the problem is in the gene set definition.
Could the problem be due to the scores, somehow? I have ties among my scores; could that possibly lead to this kind of error?

I'm running GSEA v3.0 and Java 8 update 171 on a MacBook Pro.

thanks,
Elizabeth



<Error Details>

---- Full Error Message ----
After pruning, none of the gene sets passed size thresholds.

---- Stack Trace ----
# of exceptions: 1
------After pruning, none of the gene sets passed size thresholds.------
xtools.api.param.BadParamException: After pruning, none of the gene sets passed size thresholds.
at xtools.api.param.ParamFactory.checkAndBarfIfZeroSets(ParamFactory.java:88)
at xtools.gsea.GseaPreranked.execute(GseaPreranked.java:95)
at edu.mit.broad.xbench.tui.TaskManager$ToolRunnable.run(TaskManager.java:436)
at java.lang.Thread.run(Thread.java:748)


David Eby

Jun 8, 2018, 9:12:03 PM
to gsea-help
Hi Elizabeth,

By far the most common cause of the pruning error is a symbol name space mismatch.  The key thing is consistency between the symbols in the dataset and those in the gene sets.  The easiest way to check the symbols in your dataset would be to run an analysis with them against one or more of the MSigDB gene sets we provide online.

It's hard to be absolutely certain that this will check all of the 19K symbols at once, but trying one of the larger collections like C2, C5, or C7 (or all three) will increase your chances.  We usually recommend that users start analyses with the Hallmarks collection to get more focused results, but here you're trying to cast a wide net to verify the dataset.  You can turn down the number of permutations to, say, 10 so this runs more quickly, since you're not running a real analysis (don't do this for normal runs).

If those seem to run correctly, I would be more suspicious of the custom gene sets.  It's harder to check those, but if your dataset works with MSigDB then it's a logical conclusion.  You may have already tried this - it wasn't clear to me if that was the case.
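
The same consistency check can be approximated outside GSEA. The following is a minimal Python sketch, not part of GSEA: the function names are hypothetical, and it assumes a tab-delimited RNK file (symbol, then score) and a GMT file (set name, description, then members). It uses exact string matching and reports, for each gene set, how many members are found in the ranked list.

```python
def read_rnk_symbols(lines):
    """Collect the first tab-separated field of each non-empty RNK line."""
    symbols = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if line:
            symbols.add(line.split("\t")[0])
    return symbols


def read_gmt(lines):
    """Parse GMT lines (set name, description, members...) into {name: set(members)}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 3:
            gene_sets[fields[0]] = {gene for gene in fields[2:] if gene}
    return gene_sets


def overlap_report(rnk_symbols, gene_sets):
    """Return {set name: (members found in the ranked list, total members)}."""
    return {
        name: (len(members & rnk_symbols), len(members))
        for name, members in gene_sets.items()
    }
```

Both readers accept any iterable of lines, so an open file handle works as well as a list. A set whose matched count is close to zero across the board points to a name space mismatch rather than a problem with one particular gene set.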

While your description of the formatting sounds correct, another possibility would be hidden characters in either the RNK or GRP files, particularly spaces.  A notoriously hard-to-spot but all too common case is to have trailing spaces after the symbols in one or the other of the files (i.e. gene name<space><tab>score<newline> in your RNK).  That will cause a mismatch because GSEA tries to match exactly.  Yes, we could be smarter here...
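
One way to rule out stray whitespace is to normalize the file before loading it. This is a small Python sketch under the assumption that the RNK file is tab-delimited with one gene per line; `clean_rnk_lines` is a hypothetical helper, not a GSEA function. It trims spaces around every field, drops blank lines and line endings, and counts how many lines had stray spaces.

```python
def clean_rnk_lines(lines):
    """Strip whitespace around each tab-separated field.

    Returns (cleaned lines without line endings, number of lines whose
    fields contained stray leading or trailing whitespace).
    """
    cleaned = []
    changed = 0
    for raw in lines:
        body = raw.rstrip("\r\n")  # also removes Windows-style CR/LF endings
        if not body.strip():
            continue  # skip blank lines
        fixed = "\t".join(field.strip() for field in body.split("\t"))
        if fixed != body:
            changed += 1
        cleaned.append(fixed)
    return cleaned, changed
```

A non-zero count from this helper suggests that hidden spaces were present; writing the cleaned lines back out with a single newline each gives a file that can be retried in GSEA.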

To follow up on your other questions:
  • Yes, you are correct that you should operate with all 19K symbols; using filtered lists will blunt GSEA's statistical power.
  • While we do recommend that your ranking scores be unique, it's generally OK for there to be a few ties scattered among them.  Too many ties at one particular point (numerous zero-valued items, for example) might skew the calculations.  In no case, however, should that lead to this type of error.
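
To see how many ties a ranked list contains, something like the following Python sketch could be used. `tie_summary` is a hypothetical helper, and what counts as "too many" ties is not defined in this thread, so the function only reports counts.

```python
from collections import Counter


def tie_summary(scores):
    """Return (number of scores that share their value with at least one
    other score, the most frequent value, how often that value occurs)."""
    counts = Counter(scores)
    if not counts:
        return 0, None, 0
    tied = sum(count for count in counts.values() if count > 1)
    value, count = counts.most_common(1)[0]
    return tied, value, count
```

A large count concentrated on a single value, such as zero, is the pattern described above as potentially skewing the calculations.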
If the above suggestions don't help, feel free to send us copies of your files and we can take a closer look on this end.  If it's data that you need to keep private, see the Contact page of our website for our private email address.  You could also send, say, the first 20 lines of each file as an example if you're unable to share the full contents.

Regards,
David

kentonch...@gmail.com

Feb 24, 2020, 7:06:42 AM
to gsea-help
Hi, David,
I have a similar issue to the one Elizabeth had. I tried to run GSEA Preranked with 12,000 mouse genes, ranked by the p-value between treated and control samples. I tried to fix error 1001 with the instructions you provided, but I could not get past it. Would it be possible to send you a copy of the RNK file so you can have a look at what the problem is?

Best,
Kenton

David Eby

Feb 24, 2020, 5:43:02 PM
to gsea...@googlegroups.com
Hi Kenton,

Sure, that's fine.  You can send it to us at gsea...@broadinstitute.org if you'd like to keep the data private.

Prit Benny

Apr 6, 2021, 8:57:59 PM
to gsea-help
Hello all,

I recently made a gene set of 790 genes (based on the pathways I am interested in). I want to run GSEA on RNA-seq data, but I am getting errors related to pruning. I have made sure the gene IDs in both files (.gct and .rnk) are similar. Can anyone help me with this?

Thanks in advance.

Prit

Anthony Castanza

Apr 6, 2021, 9:01:08 PM
to gsea...@googlegroups.com

Hi Prit,

 

790 genes is over the default GSEA maximum gene set size threshold.

 

You’ll need to expand the “Basic Fields” section and change the “Max size: exclude larger sets” parameter from the default 500 to something above your gene set size.

 

-Anthony

 

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

http://gsea-msigdb.org/

Prit Benny

Apr 9, 2021, 8:56:47 PM
to gsea-help
Thanks, Anthony, for the suggestion. Luckily it worked, and I got the result I had hypothesized. However, I couldn't get any NES, p-value, or FWER p-value, while the FDR came out to be 1.

Any suggestions on this?

Thanks
Prit

Deep verma

Aug 31, 2021, 10:17:32 AM
to gsea-help
Hi Prit 
If you start with a larger gene set, the genes may not all move in the same direction, and the method for combining them into a single signature involves coercing them to do so, so some genes will not be in the dominant direction. This can be a reason for insignificant results. As Anthony said, you should be more selective about the genes you include in the set.

Best
Deepak Verma

Nanette Bishopric

Sep 10, 2021, 10:50:05 PM
to gsea-help

GeneSets should have unique names. The lookup is case INsensitive. Found duplicate name: SIG_BCR_SIGNALING_PATHWAY
GeneSets should have unique names. The lookup is case INsensitive. Found duplicate name: SA_G2_AND_M_PHASES
GeneSets should have unique names. The lookup is case INsensitive. Found duplicate name: SA_MMP_CYTOKINE_CONNECTION
GeneSets should have unique names. The lookup is case INsensitive. Found duplicate name: SA_PROGRAMMED_CELL_DEATH
GeneSets should have unique names. The lookup is case INsensitive. Found duplicate name: SA_PTEN_PATHWAY
GeneSets should have unique names. The lookup is case INsensitive. Found duplicate name: SA_REG_CASCADE_OF_CYCLIN_EXPR
GeneSets should have unique names. The lookup is case INsensitive. Found duplicate name: SA_TRKA_RECEPTOR

at org.gsea_msigdb.gsea/edu.mit.broad.genome.Errors.barfIfNotEmptyRuntime(Errors.java:112)
at org.gsea_msigdb.gsea/edu.mit.broad.genome.Errors.barfIfNotEmptyRuntime(Errors.java:98)
at org.gsea_msigdb.gsea/edu.mit.broad.genome.objects.AbstractGeneSetMatrix.initMatrix(AbstractGeneSetMatrix.java:60)
at org.gsea_msigdb.gsea/edu.mit.broad.genome.objects.DefaultGeneSetMatrix.<init>(DefaultGeneSetMatrix.java:44)
at org.gsea_msigdb.gsea/xtools.api.param.GeneSetMatrixMultiChooserParam$GeneSetsStruc.toGm(GeneSetMatrixMultiChooserParam.java:148)
at org.gsea_msigdb.gsea/xtools.api.param.GeneSetMatrixMultiChooserParam.getGeneSetMatrixCombo(GeneSetMatrixMultiChooserParam.java:61)
at org.gsea_msigdb.gsea/xtools.gsea.Gsea.execute(Gsea.java:156)
at org.gsea_msigdb.gsea/edu.mit.broad.xbench.tui.TaskManager$ToolRunnable.run(TaskManager.java:435)
at java.base/java.lang.Thread.run(Unknown Source)

Anthony Castanza

Sep 11, 2021, 12:06:18 AM
to gsea...@googlegroups.com

Hello,

 

The error message you’ve included is unrelated to the previous issues in this thread.

In your case, it would appear that you’ve selected some combination of Gene Set Database files that includes both a subcollection and at least one of its parent collections. The MSigDB gene set files are hierarchical: for example, you can run C5:GO:BP and C5:GO:MF together, but you couldn’t run both C5:GO:BP and the higher-level C5:GO in the same run, because the combination will contain repeats of the sets that appear in both and give this error. In this case, I believe that the duplicates are in one of the C2 levels. If you share what you used in the gene sets database input box, I can tell you more specifically where they’re coming from.

 

That’s assuming you’re using MSigDB gene sets. If you’re using your own custom file, you’d need to manually check it for duplicates.
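
For a custom file, the manual check for duplicates could look like this Python sketch. `duplicate_set_names` is a hypothetical helper; it compares names case-insensitively because the error message states that the lookup is case-insensitive, and it assumes the set names have already been extracted (for a GMT file, the first tab-separated field of each line).

```python
from collections import defaultdict


def duplicate_set_names(names):
    """Group gene set names case-insensitively and return only the groups
    that contain more than one entry, keyed by the upper-cased name."""
    seen = defaultdict(list)
    for name in names:
        seen[name.upper()].append(name)
    return {key: originals for key, originals in seen.items() if len(originals) > 1}
```

An empty result means no two sets share a name; any key that is returned identifies a name that would need to be renamed or removed from one of the files.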

 

-Anthony

 


 

Nanette Bishopric

Sep 14, 2021, 11:18:22 AM
to gsea-help
Hi Anthony, I apologize for my clumsy query above, and I am sorry if this is the wrong thread.  I am trying to use the desktop version of GSEA 4.1.0 on a MacBook Pro (macOS 10.15.7) to analyze a set of 641 genes mapped from RNA-seq data. By pasting the gene symbols in the dataset, I have done overlap queries on the https://www.gsea-msigdb.org/gsea/msigdb/annotate.jsp page and have come up with 30+ highly significant hits in the C3, C2 and C7 sets. I assume that means my symbols are mapping correctly onto the gene sets.  When I try to run the expression dataset that contains those symbols, picking one of the same sets, I get error 3001. If I try to run more than one, even if they are subsets of unrelated groups, I get error message #31096.  What am I getting wrong? Would it help to send the expression dataset I'm using?

Anthony Castanza

Sep 14, 2021, 12:57:08 PM
to gsea...@googlegroups.com

When using the annotate page to perform an overlap statistic test, GSEA internally uses the Human_Gene_Symbol_with_Remapping and Mouse_Gene_Symbol_with_Remapping_to_Human_orthologs CHIP files to ensure that all symbols are harmonized into the MSigDB namespace (a few other CHIPs are also used to pick up other commonly used namespaces).

 

Running GSEA in the desktop app is different than running an overlap statistic test. When running an overlap statistic test, you need to select just significant genes or genes of interest and usually these are run as separate lists of up and downregulated genes (but not always). When running GSEA you actually need the *entire* list of all expressed genes both significant and non-significant along with either the full expression information (in GSEA's regular mode) or with a precomputed ranking metric (in GSEA Preranked mode). Including the non-significant genes allows GSEA to compute the full ranking distribution when testing for overrepresentation. If you don’t have the full expression information, you're likely to get the "none of the gene sets passed size thresholds" error as GSEA strips genes from gene sets if they aren't in the underlying expression dataset.
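
The pruning step described here can be imitated in a few lines to see why a filtered dataset triggers the error. This Python sketch is an illustration only, not GSEA's actual code: `prune_gene_sets` is a hypothetical function, the maximum of 500 is the default mentioned earlier in this thread, and the minimum of 15 is an assumption about the default minimum size.

```python
def prune_gene_sets(gene_sets, dataset_genes, min_size=15, max_size=500):
    """Restrict each gene set to genes present in the dataset, then apply
    the size thresholds to what remains.

    Returns (kept, dropped): kept maps set name -> surviving members,
    dropped maps set name -> size after restriction.
    """
    dataset_genes = set(dataset_genes)
    kept, dropped = {}, {}
    for name, members in gene_sets.items():
        present = set(members) & dataset_genes
        if min_size <= len(present) <= max_size:
            kept[name] = present
        else:
            dropped[name] = len(present)
    return kept, dropped
```

If `kept` comes back empty, every set was either too small after restriction (typical for a short list of significant genes, or for mismatched identifiers) or too large (typical for a big custom set run with the default maximum).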

 

If your dataset already contains the full expression information for all genes and you are still experiencing these errors, then yes, sending your dataset (this can be done confidentially to gsea...@broadinstitute.org if you want to keep it off the public help forum), as well as any other input files you used for GSEA, will help us in debugging the error messages here.

 

-Anthony

 


 

Eunjoo Kim

Nov 16, 2021, 5:45:15 PM
to gsea-help
I keep getting this error:
After pruning, none of the gene sets passed size thresholds.
I don't know why...


Anthony Castanza

Nov 16, 2021, 5:50:16 PM
to gsea-help
Hello,

We need some more information to help debug this issue. How many genes are in your input dataset? GSEA expects information for all the expressed genes (on the order of 10,000+ genes). Not supplying all the expressed genes can result in this error, because GSEA filters the input gene sets to remove genes not in the dataset.

Secondly, what gene identifiers are being used in the input dataset? Please send a screenshot. If the dataset is not using Human Gene Symbols, it is necessary to use one of the provided CHIP files (one matched specifically to the gene identifier type in the dataset) with the Collapse Dataset functionality.

-Anthony



Eunjoo Kim

Nov 16, 2021, 6:21:15 PM
to gsea-help

The number of genes is 21898, and it is a mouse sample.
gsea.gif

exp file.gif


Anthony Castanza

Nov 16, 2021, 6:36:47 PM
to gsea...@googlegroups.com

Hello,

 

So, what you're going to want to do here is copy the Ensembl gene IDs from the Description column to the Name column, but remove their version suffixes (so ENSMUSG00000045515.5 would become ENSMUSG00000045515, and similarly for all the genes). Then you can use the Mouse_ENSEMBL_Gene_ID_Human_Orthologs_MSigDB.v7.4.chip file selected from the dropdown for the Chip platform field.
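
For a file with many thousands of rows, this edit is easier to script than to do by hand. The Python sketch below is an illustration under stated assumptions: each row is a list whose first two entries are the Name and Description columns, and both function names are hypothetical.

```python
import re

# A trailing ".<digits>" is treated as the Ensembl version suffix.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(ensembl_id):
    """ENSMUSG00000045515.5 -> ENSMUSG00000045515; unversioned IDs are unchanged."""
    return _VERSION_SUFFIX.sub("", ensembl_id.strip())


def use_description_as_name(row):
    """Given [name, description, *values], return a new row whose first
    column is the version-stripped ID taken from the description column."""
    return [strip_version(row[1])] + list(row[1:])
```

Applying `use_description_as_name` to every data row, while leaving the header row (including the NAME column title) untouched, produces the layout described above.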

 

-Anthony

 


 


 

Eunjoo Kim

Nov 17, 2021, 12:06:04 PM
to gsea-help
Now I have a different error...

<Error Details>

---- Full Error Message ----
The collapsed dataset was empty when used with chip:ftp.broadinstitute.org://pub ...

---- Stack Trace ----
# of exceptions: 1
------The collapsed dataset was empty when used with chip:ftp.broadinstitute.org://pub/gsea/annotations_versioned/Mouse_ENSEMBL_Gene_ID_Human_Orthologs_MSigDB.v7.4.chip------
xtools.api.param.BadParamException: The collapsed dataset was empty when used with chip:ftp.broadinstitute.org://pub/gsea/annotations_versioned/Mouse_ENSEMBL_Gene_ID_Human_Orthologs_MSigDB.v7.4.chip
at org.gsea_msigdb.gsea/xtools.gsea.Gsea.getDataset(Gsea.java:100)
at org.gsea_msigdb.gsea/xtools.gsea.Gsea.execute(Gsea.java:146)
at org.gsea_msigdb.gsea/edu.mit.broad.xbench.tui.TaskManager$ToolRunnable.run(TaskManager.java:435)
at java.base/java.lang.Thread.run(Unknown Source)




Anthony Castanza

Nov 17, 2021, 2:53:49 PM
to gsea...@googlegroups.com

Hello,

 

It looks like GSEA is still not able to match the IDs. Did you ensure both that the IDs were moved to the first column (the column would still need to be called NAME) and that the decimal version suffixes were stripped from all genes?
If you could provide a screenshot of the modified file opened in a plain text editor I can give suggestions on what might've gone wrong.


Eunjoo Kim

Nov 17, 2021, 3:24:59 PM
to gsea-help
Picture1.jpg


Anthony Castanza

Nov 17, 2021, 3:31:05 PM
to gsea...@googlegroups.com

Okay, I'm not actually seeing anything wrong with that data here. I suppose it's possible there could be some hidden spaces or something that the parser is choking on though.

Would you possibly be willing to send this dataset (confidentially) to us so I can take a closer look and debug the issue? We have a private email address gsea...@broadinstitute.org that can be used to send confidential data.

 

-Anthony

 


 


Shraddha Ranganathan

Feb 4, 2022, 8:42:09 AM
to gsea-help
Hi all, 
I'm dealing with the same issue in GSEA at the moment, but with non-standard datasets.
My .gmt file is for Arabidopsis thaliana, and is from here. My input gene list (237 genes) comes out of DESeq2, but with the gene names replaced by their A. thaliana orthologs. Also possibly relevant: I had to convert my input from CSV to TXT, which has caused CR/LF errors when I switch to Linux.
I created the phenotype labels on the fly, and am trying to run GSEA with the no-collapse option. I recognise that several things here are non-standard, which may be contributing to the problem too.

I keep getting the 'After pruning, none of the gene sets passed size thresholds' error. Do you have any insight into why it won't work for me?

Thanks and kind regards

Anthony Castanza

Feb 7, 2022, 5:43:01 PM
to gsea-help
Hi Shraddha,

In the future, we ask that a new issue be opened to prevent unwanted replies to the original posters.

Since we don't provide datasets for Arabidopsis thaliana, I'm going to assume here that you're providing your own gene set database file in addition to your input expression/ranking data.
Assuming that you've matched the gene identifiers in your gene list with the gene identifiers in the gene sets, the likely issue here is the number of genes in your input dataset. GSEA expects that all of the expressed genes are provided in the input list, i.e. the data is not filtered by log2FC or p-value thresholds. 237 genes is unlikely to result in gene sets that pass the minimum size thresholds for GSEA after the gene sets are restricted to just the genes for which there is available expression information. A quick literature search indicated that the Arabidopsis thaliana genome encodes somewhere on the order of 25,500 genes. GSEA would expect to have information on the majority of them.

-Anthony


William Chao

Oct 11, 2022, 7:24:02 AM
to gsea-help
Hi Anthony,
I got the same error when running GSEA, and my dataset is attached below. I also checked the details on the website, but that did not solve the problem. My input file contains 12752 mouse genes, and I still don't know how to solve it. Thanks a lot.
擷取.PNG

Anthony Castanza

Oct 11, 2022, 9:10:25 AM
to gsea-help
Hi William

In the future, we ask that a new issue be opened to prevent unwanted replies to the original posters.

These appear to be mouse genes in the Riken gene ID format. Are all the IDs Riken IDs? Or is this a mix of gene symbols and Riken IDs? I don't think we fully support the Riken database IDs, just those that are accepted as interim gene symbols.

If these are gene symbols and not just Riken IDs then are you using GSEA's collapse option with the Mouse gene symbols chip?

-Anthony


Owaisa Haider

Feb 26, 2023, 12:02:00 PM
to gsea-help
I have 16 genes in my dataset. I am using GSEA for the first time, and it gives the size threshold error.

Castanza, Anthony

Feb 27, 2023, 12:39:13 PM
to gsea...@googlegroups.com

Hi Owaisa,

In the future please create a new thread to discuss your specific error message.

That said, GSEA expects ranking information for all expressed genes, not highly filtered subsets (i.e. just “significant” genes) like you appear to have provided. GSEA needs the additional information that these non-differentially expressed genes provide in order to properly compute the enrichment scores.

 

Please try again with the full dataset and let me know if you continue to encounter errors.

 

-Anthony

 


 
