Confusion about phenotype file and CLS format

Mel Thalhammer

Jun 12, 2021, 11:32:02 AM

to gsea-help

Hey everyone,

Thank you for providing this great software! I am trying to correlate gene expression data from the Allen Institute with alterations in cortical thickness after premature birth (cortical thickness was obtained from MRI scans with FreeSurfer 7.1.1). I know that this might not be a typical use case, but GSEA has been used this way before (e.g., by Grothe et al., 2018, doi: 10.1093/brain/awy189).

I have loaded the gene expression data already. My gene expression file contains microarray data of post-mortem human brains, mapped into 34 cortical regions, i.e., it encompasses ~20,000 genes expressed in 34 cortical regions (~20,000 x 34 matrix).

I am now struggling with my phenotype file. So, the gene expression data stem from different subjects than the cortical thickness data. Thus, it makes no sense to create a categorical label file, which would assign a label to every gene in my gene expression file.

I want to know which spatial gene expression patterns correlate with the spatial cortical thickness pattern of preterm-born adults (94 subjects x 34 cortical regions matrix). I don't know how to pack this information into the software. I tried selecting one cortical region and putting its cortical thickness values for the preterm (group 1) and term (group 2) subjects into the .cls file, but that does not fully address my question. Furthermore, I could not load this file into the software; the following error occurs:

<Error Details>

---- Full Error Message ----
There were errors: ERROR(S) #:1
Parsing trouble
edu.mit.broad.genome.parsers.Par ...

---- Stack Trace ----
# of exceptions: 1
------Bad format - expect ncols: 1 but found: 91 on line: 2.749562047    2.809810894    2.866150727    2.962548811    2.723956322    2.750069286    2.82168507    2.814755101    2.759250852    2.897585537    2.877386235    2.702128659    2.762923479    2.803125165    2.700904401    2.836375431    2.78659849    2.71437065    2.879222549    2.73334584    2.696816635    2.928802572    2.870653038    2.627452449    2.761699222    2.814952018    2.886370885    2.786795407    2.652745406    1.585854809    2.904930644    2.981246396    2.792719486    2.860247359    2.77558061    2.800261511    3.007275509    2.753342278    2.87508468    2.708947496    2.737134    2.673829077    2.927583762    2.500938524    2.499587915    3.243827755    2.585854806    2.849421171    2.727857637    1.998297241    2.735783391    2.753342278    2.84671963    2.86022669    2.64006159    2.928934372    2.844018412    2.730380309    2.808721447    2.931635912    2.60764439    2.753520824    2.483201091    2.915605859    2.799445084    2.87508468    2.918307077    2.928934372    3.270663328    2.815653362    2.839966262    2.622502381    2.495535765    2.739835219    2.358936016    3.003223359    2.756043819    2.794042002    2.920830072    2.758745359    2.900747868    2.59413733    2.340025875    2.855996316    2.746588909    3.359809984    2.983141155    3.143875258    2.600925952    2.88583039    2.926438127                                                            ------
edu.mit.broad.genome.parsers.ParserException: Bad format - expect ncols: 1 but found: 91 on line: 2.749562047    2.809810894    2.866150727    2.962548811    2.723956322    2.750069286    2.82168507    2.814755101    2.759250852    2.897585537    2.877386235    2.702128659    2.762923479    2.803125165    2.700904401    2.836375431    2.78659849    2.71437065    2.879222549    2.73334584    2.696816635    2.928802572    2.870653038    2.627452449    2.761699222    2.814952018    2.886370885    2.786795407    2.652745406    1.585854809    2.904930644    2.981246396    2.792719486    2.860247359    2.77558061    2.800261511    3.007275509    2.753342278    2.87508468    2.708947496    2.737134    2.673829077    2.927583762    2.500938524    2.499587915    3.243827755    2.585854806    2.849421171    2.727857637    1.998297241    2.735783391    2.753342278    2.84671963    2.86022669    2.64006159    2.928934372    2.844018412    2.730380309    2.808721447    2.931635912    2.60764439    2.753520824    2.483201091    2.915605859    2.799445084    2.87508468    2.918307077    2.928934372    3.270663328    2.815653362    2.839966262    2.622502381    2.495535765    2.739835219    2.358936016    3.003223359    2.756043819    2.794042002    2.920830072    2.758745359    2.900747868    2.59413733    2.340025875    2.855996316    2.746588909    3.359809984    2.983141155    3.143875258    2.600925952    2.88583039    2.926438127
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.parsers.StringDataframeParser._parse(StringDataframeParser.java:164)
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.parsers.StringDataframeParser.parseSdf(StringDataframeParser.java:144)
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.parsers.ClsParser._parse_new_style(ClsParser.java:273)
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.parsers.ClsParser.parse(ClsParser.java:228)
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.parsers.ParserFactory._readTemplates(ParserFactory.java:341)
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.parsers.ParserFactory.readTemplate(ParserFactory.java:292)
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.parsers.ParserFactory.read(ParserFactory.java:752)
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.parsers.ParserFactory.read(ParserFactory.java:725)
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.parsers.ParserWorker.doInBackground(ParserWorker.java:51)
   at java.desktop/javax.swing.SwingWorker$1.call(Unknown Source)
   at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
   at java.desktop/javax.swing.SwingWorker.run(Unknown Source)
   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
   at java.base/java.lang.Thread.run(Unknown Source)

Do you have any idea how I could address my question?

Best, Melissa

lh_CoTh_GSEA-excel.xlsx

lh_CoTh_GSEA-excel.cls.txt

Anthony Castanza

Jun 13, 2021, 3:13:14 PM

to gsea...@googlegroups.com

Hi Melissa,

There were just two small issues with the CLS file. First, it had a .txt extension after the .cls, which causes GSEA to parse it incorrectly. Second, it was missing the initial #numeric line, which tells GSEA to parse it as a numeric CLS rather than a categorical one.

I've made the applicable corrections in the attached file.
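For reference, the two corrections above can be sketched in Python. This is only a minimal illustration, not the attached file: the function name `write_numeric_cls`, the phenotype name `CorticalThickness`, and the example values are hypothetical. It assumes the continuous CLS layout of a `#numeric` line, a `#<phenotype name>` line, and one line of values (one per sample, in dataset column order).

```python
from pathlib import Path


def write_numeric_cls(path, phenotype_name, values):
    """Write a continuous-phenotype CLS file and return its lines.

    Layout assumed here: '#numeric', then '#<name>', then one line of
    space-separated values, one value per sample in dataset column order.
    """
    # Guard against the '.cls.txt' problem: the file name must end in '.cls'.
    if not str(path).endswith(".cls"):
        raise ValueError("file name must end in .cls (not .cls.txt)")
    lines = [
        "#numeric",                                # marks the file as continuous
        f"#{phenotype_name}",                      # phenotype label
        " ".join(str(float(v)) for v in values),   # one value per sample
    ]
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    # Hypothetical values for three samples.
    write_numeric_cls("thickness.cls", "CorticalThickness", [2.5, 3.0, 1.75])
```

Some editors append .txt silently when saving, so it is worth checking the final file name on disk after writing.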

Also note that for continuous CLS files you'll need to change the ranking metric to Pearson correlation, as the default metric is for categorical experiments.
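To illustrate what a Pearson ranking means for a continuous phenotype, here is a small sketch that correlates each gene's expression profile with the phenotype values and sorts genes by the coefficient. It is the textbook Pearson formula in plain Python, not GSEA's internal code; the function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Textbook Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either vector is constant
    return sxy / math.sqrt(sxx * syy)


def rank_genes(expression, phenotype):
    """Return (gene, r) pairs sorted from most positive to most negative r.

    expression: dict mapping gene id -> list of values, one per sample,
    in the same sample order as the phenotype list.
    Genes with an undefined correlation are skipped.
    """
    scored = [(gene, pearson(values, phenotype)) for gene, values in expression.items()]
    scored = [(gene, r) for gene, r in scored if not math.isnan(r)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data: two genes measured in three samples.
    toy = {"geneA": [3.0, 2.0, 1.0], "geneB": [1.0, 2.0, 3.0]}
    for gene, r in rank_genes(toy, [1.0, 2.0, 3.0]):
        print(gene, r)
```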

With regard to how best to model your experiment, GSEA is rather limited in options here, really only supporting these two modes, binary or continuous, for the built-in functions. Examining each region separately, then combining the results for each of the multiple regions in something like EnrichmentMap might be the best approach to take here.

--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/bcbe2568-8625-4fca-91fe-306cb60730ben%40googlegroups.com.

--

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

lh_CoTh_GSEA-excel.cls

Melissa Thalhammer

Jun 14, 2021, 6:20:22 AM

to gsea-help

Thank you!

Unfortunately, I receive another error when I try to run the GSEA.

I have used Entrez IDs in the NAME column of my expression dataset and downloaded c5.all.v7.4.entrez.gmt from your website. I selected phenotype labels for generating the null distribution and No_collapse. I did not select a chip platform, which should be optional with No_collapse. I get the following error:

---- Full Error Message ----

col:35 > matrix's fColCnt:35

---- Stack Trace ----
# of exceptions: 1

------col:35 > matrix's fColCnt:35------
java.lang.ArrayIndexOutOfBoundsException: col:35 > matrix's fColCnt:35
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.math.Matrix.getColumnV(Matrix.java:261)
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.objects.DefaultDataset.getColumn(DefaultDataset.java:291)
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.objects.TemplateFactory.extract(TemplateFactory.java:95)
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.alg.DatasetGenerators.extract(DatasetGenerators.java:300)
   at org.gsea_msigdb.gsea/edu.mit.broad.genome.alg.DatasetGenerators.extract(DatasetGenerators.java:294)
   at org.gsea_msigdb.gsea/xtools.gsea.AbstractGsea2Tool.execute_one(AbstractGsea2Tool.java:86)
   at org.gsea_msigdb.gsea/xtools.gsea.AbstractGsea2Tool.execute_one_with_reporting(AbstractGsea2Tool.java:112)
   at org.gsea_msigdb.gsea/xtools.gsea.Gsea.execute(Gsea.java:165)
   at org.gsea_msigdb.gsea/edu.mit.broad.xbench.tui.TaskManager$ToolRunnable.run(TaskManager.java:435)
   at java.base/java.lang.Thread.run(Unknown Source)

In another trial, I selected Human_NCBI_Gene_ID_MSigDB.v7.4.chip as the chip file, but the same error occurs. Is the error related to the naming of my files, or have I missed something else?

Best, Melissa

Anthony Castanza

Jun 14, 2021, 1:04:08 PM

to gsea...@googlegroups.com

Hi Melissa,

This error normally occurs when the number of samples in the dataset file is less than the number of samples defined in the CLS file. Looking back at the CLS file you sent earlier, I noticed that it had a different number of samples for each of the two conditions. GSEA expects each of these to have the same number of values as the other and as the dataset being used. If you need to run the analysis on different subsets of samples, then you’d need to split the two groups into separate CLS files with separate matching datasets. Also note that the sample-to-phenotype assignment is ordered: when computing the correlation, the first value in the CLS is assigned to the first sample in the GCT, and so on across the dataset.
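A quick way to catch this kind of mismatch before loading the files is to count the values per phenotype in the CLS and compare them with the number of sample columns in the dataset. The sketch below is a hypothetical helper, not part of GSEA. It assumes a `#numeric` CLS (each `#<name>` line followed by one line of values) and a GCT v1.2 dataset whose third line is the header `NAME`, `Description`, then one column per sample.

```python
def parse_numeric_cls(text):
    """Return {phenotype_name: [values]} from the text of a '#numeric' CLS file."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0].lower() != "#numeric":
        raise ValueError("first line must be #numeric")
    phenotypes = {}
    name = None
    for ln in lines[1:]:
        if ln.startswith("#"):
            name = ln[1:].strip()          # start of a new phenotype
        elif name is None:
            raise ValueError("values found before a '#<name>' line")
        else:
            phenotypes[name] = [float(tok) for tok in ln.split()]
            name = None
    return phenotypes


def gct_sample_names(text):
    """Return the sample column names from GCT v1.2 text (header is line 3)."""
    lines = text.splitlines()
    if len(lines) < 3 or not lines[0].startswith("#1.2"):
        raise ValueError("expected a GCT v1.2 file")
    header = lines[2].split("\t")
    return header[2:]                       # skip NAME and Description


def check_alignment(cls_text, gct_text):
    """Return a list of mismatch messages; an empty list means the counts agree."""
    samples = gct_sample_names(gct_text)
    problems = []
    for name, values in parse_numeric_cls(cls_text).items():
        if len(values) != len(samples):
            problems.append(f"{name}: {len(values)} values vs {len(samples)} samples")
    return problems
```

Matching counts do not confirm that the order is right: values are paired with samples by position, so the column order of the dataset and the value order of the CLS still need to be checked by hand.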

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

http://gsea-msigdb.org/


