Can't get the file format right


Adam Summerfield

Oct 30, 2017, 7:37:51 AM
to gsea-help
Hi everyone,

So when I follow the instructions to make an acceptable file to load into GSEA 3.0: "To create a tab-delimited text file: select File>Save As, enter the file name in quotes to preserve the file extension (for example, "p53.gct"), and select "Text(Tab delimited)(*.txt)" as the file type. Excel displays a message warning you that your file may contain features that are not compatible with this format and asks if you want to keep the workbook in this format. Click Yes to keep this format. Your file has now been saved. Exit from Excel. When Excel asks if you want to save your changes to this file, select No (you have already saved the file)."
it doesn't work. Instead I get this error message:

---- Full Error Message ---
There were errors: ERROR(S) #:1
Parsing trouble
java.lang.IllegalArgumentExcepti ...

---- Stack Trace ----
# of exceptions: 1
------Unknown file format: Z:\Adam\RNA-Seq\CD34negWT no known Parser for ext: ------
java.lang.IllegalArgumentException: Unknown file format: Z:\Adam\RNA-Seq\CD34negWT no known Parser for ext:
    at edu.mit.broad.genome.parsers.ParserFactory.read(ParserFactory.java:769)
    at edu.mit.broad.genome.parsers.ParserFactory.read(ParserFactory.java:726)
    at edu.mit.broad.genome.parsers.ParserWorker.doInBackground(ParserWorker.java:52)
    at javax.swing.SwingWorker$1.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at javax.swing.SwingWorker.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

The .txt file I make here comes from an Excel (.xls) file. I save it as tab-delimited text and it still does not work.

Many thanks for your help,
Adam

Helga Thorvaldsdottir

Oct 31, 2017, 10:24:13 AM
to gsea-help
Hi Adam,

If this is a GCT file you are creating, make sure the filename extension is .gct, for example, CD34negWT.gct.

Helga

Arthur Liberzon

Oct 31, 2017, 11:42:55 AM
to gsea-help
GSEA infers file format from extension of the file name. Thus, if your file name ends with .txt, then GSEA decides that it is in the GSEA TXT format. Make sure that whatever input file you are using, its format is compatible with its purpose and our specifications. For example, check file format specs for expression data here.
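
As a rough illustration of the extension rule described above, here is a minimal Python sketch that checks a file name before it is loaded into GSEA. The function name `check_extension` and the set `ASSUMED_EXTENSIONS` are assumptions made for this example and are not part of GSEA; the GSEA data formats documentation remains the authoritative list.

```python
from pathlib import PureWindowsPath

# Assumed set of extensions, for illustration only. Check the GSEA
# data formats documentation for the authoritative list.
ASSUMED_EXTENSIONS = {".gct", ".txt", ".cls", ".gmt", ".gmx", ".grp", ".rnk", ".chip"}


def check_extension(path: str) -> str:
    """Return the lower-cased extension, or raise if it is missing or unexpected."""
    # PureWindowsPath understands both backslash and forward-slash separators.
    ext = PureWindowsPath(path).suffix.lower()
    if not ext:
        raise ValueError(f"{path!r} has no extension, so the format cannot be inferred")
    if ext not in ASSUMED_EXTENSIONS:
        raise ValueError(f"{path!r} has unexpected extension {ext!r}")
    return ext
```

The path in the original error message, Z:\Adam\RNA-Seq\CD34negWT, has no extension at all, which is consistent with the empty "ext:" at the end of that message.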

Adam Summerfield

Nov 2, 2017, 9:09:39 AM
to gsea-help
Hi Helga,

It's a tab-delimited file (.txt); I was following the instructions on the Broad Institute website. Should I convert it to .gct instead?

Arthur Liberzon

Nov 2, 2017, 10:14:44 AM
to gsea...@googlegroups.com
Hi Adam,

Please attach the error message you get as well as a snippet with the first 100 lines of the file you are trying to load into GSEA.


--
You received this message because you are subscribed to a topic in the Google Groups "gsea-help" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gsea-help/WFrjlljI1ms/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gsea-help+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/9a8c2e1c-8b95-4ca4-a097-ba1c7e6f2160%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--

___________________________________________

Arthur Liberzon, Ph.D.

Molecular Signatures Database (MSigDB) curator and

CMAP Bioinformatics Scientist I

Cancer Program


The Broad Institute of MIT and Harvard

415 Main Street

Cambridge MA 02142

Phone: (617) 714 7582

E-mail: libe...@broadinstitute.org

Adam Summerfield

Nov 2, 2017, 11:11:28 AM
to gsea-help
I'm not sure my file is in the right format to begin with, and I'm unsure how to change it into GCT.
It's an Excel file I got straight from the sequencing lab. It has columns A to P:

A: no name (gene numbers)
B: EnsemblGenID
C: GeneName
D: description
E: mean RPM
F: mean RPM (condition 1)
G: ST. Dev. (condition 1)
H: mean RPM (condition 2)
I: ST. Dev (condition 2)
J: FoldChange
K: log2 FoldChange
L: adj. p-value DESeq
M: adj. p-value edgeR
N: mean p-value
O: edgeR & DESeq < 0.01
P: edgeR & DESeq < 0.05

Then there are 48,500 rows of gene expression data.

I'm not sure how to get all this into the GCT format.


The full error message is here:

Arthur Liberzon

Nov 2, 2017, 11:53:38 AM
to gsea-help
Now I see.
Briefly, this is not the kind of data GSEA would normally accept as input.

Instead, organize your data as a table (you can do it in Excel if that works for you) in which the rows are genes, identified by Ensembl Gene IDs without the version suffixes. The first column should contain the Ensembl Gene IDs and the second column the gene names. The remaining columns should hold one expression measure per sample (e.g., RPKM, though TPM would be better if you can compute it). Do not include summary or test statistics such as FoldChange, mean, standard deviation, or p-values. Save the file as plain text from Excel and make sure that the file name ends with '.txt'. For more details, please consult these resources:

(1)
(2)
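
For anyone who prefers a script to Excel, the layout described above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: the function names `strip_version` and `write_expression_table` are hypothetical, the header labels NAME and DESCRIPTION are an assumption based on the column headers quoted later in this thread, and the values in the demo are made up.

```python
import csv
import io


def strip_version(ensembl_id: str) -> str:
    """Drop a trailing version suffix, e.g. 'ENSMUSG00000000001.4' -> 'ENSMUSG00000000001'."""
    return ensembl_id.split(".", 1)[0]


def write_expression_table(rows, sample_names, out):
    """Write one gene per row: Ensembl ID, gene name, then one value per sample.

    `rows` is an iterable of (ensembl_id, gene_name, values) tuples.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Assumed header labels; see the lead-in above.
    writer.writerow(["NAME", "DESCRIPTION", *sample_names])
    for ensembl_id, gene_name, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(
                f"{ensembl_id}: expected {len(sample_names)} values, got {len(values)}"
            )
        writer.writerow([strip_version(ensembl_id), gene_name, *values])


# Made-up values, for illustration only.
buffer = io.StringIO()
write_expression_table(
    [("ENSMUSG00000000001.4", "GeneA", [1.5, 2.0])],
    ["sample1", "sample2"],
    buffer,
)
print(buffer.getvalue())
```

The length check guards against the kind of row-length mismatch that GSEA reports when a header row and a data row have different numbers of cells.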

Adam Summerfield

Nov 3, 2017, 4:50:20 AM
to gsea-help
Thanks, I'll try that. Should I also delete the "description" column? Is mean RPM okay for a column? Or should I change it to RPKM?
Many thanks,
Adam

Arthur Liberzon

Nov 6, 2017, 8:39:12 AM
to gsea-help
You should use RPKM values for individual samples, not mean aggregates.

Adam Summerfield

Nov 20, 2017, 9:58:46 AM
to gsea-help
Hi again,
So I made an Excel file of the RPKMs, with column A being the Ensembl ID (e.g. ENSMUSG00000000001), column B the GeneName, and columns C to R the sample RPKMs.
I also added a hash in cell A1 because it was empty and I got an error about row 2 being 18 cells long and row 1 being 17 long, or something to that effect.
Then I saved as text (tab delimited, not MS-DOS or Macintosh) with the .txt extension.

When I load into GSEA it gives the following error.


---- Full Error Message ----

There were errors: ERROR(S) #:1
Parsing trouble
java.lang.NumberFormatException: ...


---- Stack Trace ----
# of exceptions: 1
------For input string: "Pbsn"------
java.lang.NumberFormatException: For input string: "Pbsn"
    at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
    at sun.misc.FloatingDecimal.parseFloat(Unknown Source)
    at java.lang.Float.parseFloat(Unknown Source)
    at edu.mit.broad.genome.parsers.TxtDatasetParser._parseNoDesc(TxtDatasetParser.java:169)
    at edu.mit.broad.genome.parsers.TxtDatasetParser.parse(TxtDatasetParser.java:131)
    at edu.mit.broad.genome.parsers.TxtDatasetParser.parse(TxtDatasetParser.java:87)
    at edu.mit.broad.genome.parsers.ParserFactory.readDatasetTXT(ParserFactory.java:202)
    at edu.mit.broad.genome.parsers.ParserFactory.read(ParserFactory.java:749)
    at edu.mit.broad.genome.parsers.ParserFactory.read(ParserFactory.java:726)
    at edu.mit.broad.genome.parsers.ParserWorker.doInBackground(ParserWorker.java:52)
    at javax.swing.SwingWorker$1.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at javax.swing.SwingWorker.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Many thanks for your help.

Arthur Liberzon

Nov 20, 2017, 12:05:59 PM
to gsea...@googlegroups.com

Make sure that:

(1)
(2) the file is tab-delimited plain text
(3) the extension of the file name matches its format, e.g., if it's GCT, then the file name should end with .gct

Adam Summerfield

Nov 22, 2017, 6:31:25 AM
to gsea-help
Okay, this is still not working. I'm attaching a screenshot of my file in Excel. It's saved as a tab delimited file (.txt).

I get the following error:

<Error Details>


---- Full Error Message ----
There were errors: ERROR(S) #:1
Parsing trouble
java.lang.NumberFormatException: ...

---- Stack Trace ----
# of exceptions: 1
------For input string: "35,39912817"------
java.lang.NumberFormatException: For input string: "35,39912817"
    at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
    at sun.misc.FloatingDecimal.parseFloat(Unknown Source)
    at java.lang.Float.parseFloat(Unknown Source)
    at edu.mit.broad.genome.parsers.TxtDatasetParser._parseHasDesc(TxtDatasetParser.java:229)
    at edu.mit.broad.genome.parsers.TxtDatasetParser.parse(TxtDatasetParser.java:129)
    at edu.mit.broad.genome.parsers.TxtDatasetParser.parse(TxtDatasetParser.java:87)
    at edu.mit.broad.genome.parsers.ParserFactory.readDatasetTXT(ParserFactory.java:202)
    at edu.mit.broad.genome.parsers.ParserFactory.read(ParserFactory.java:749)
    at edu.mit.broad.genome.parsers.ParserFactory.read(ParserFactory.java:726)
    at edu.mit.broad.genome.parsers.ParserWorker.doInBackground(ParserWorker.java:52)
    at javax.swing.SwingWorker$1.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at javax.swing.SwingWorker.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)


gseapicture.jpg

Adam Summerfield

Nov 22, 2017, 6:46:08 AM
to gsea-help
Huh, I just got it to load successfully. All I did was re-open my Excel file on a computer with English as the default language and re-save it as tab-delimited text; it then loaded in GSEA without errors. I will let you know if I encounter more errors.
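
The value "35,39912817" in the error above is a number written with a decimal comma, which is what Excel writes under some regional settings, and Java's Float.parseFloat only accepts a decimal point. If re-saving on an English-locale machine is not an option, a Python sketch along these lines could rewrite the file. It assumes that the first two columns are text, that every later column is numeric, that the numbers contain no thousands separators, and that the file is UTF-8 (Excel may use a different encoding on your system). The function names are hypothetical.

```python
def normalize_decimal_commas(line: str, text_columns: int = 2) -> str:
    """Replace decimal commas with dots in the numeric cells of one tab-delimited row."""
    cells = line.rstrip("\r\n").split("\t")
    # Leave the leading text columns (ID, gene name) untouched.
    fixed = cells[:text_columns] + [c.replace(",", ".") for c in cells[text_columns:]]
    return "\t".join(fixed)


def convert_file(src_path: str, dst_path: str, text_columns: int = 2) -> None:
    """Copy src to dst, keeping the header row as-is and fixing every data row."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        header = next(src, None)
        if header is not None:
            dst.write(header.rstrip("\r\n") + "\n")
        for line in src:
            dst.write(normalize_decimal_commas(line, text_columns) + "\n")
```

The header row is copied unchanged so that sample names containing commas are not altered.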

Adam Summerfield

Mar 7, 2018, 9:09:38 AM
to gsea-help
I'm having difficulty creating an on-the-fly phenotype label. Does it not accept + and - symbols in the class A and class B names?
Thanks, Adam

Adam Summerfield

Mar 7, 2018, 10:48:16 AM
to gsea-help
I'm trying to create an on-the-fly phenotype like this:

Class A =
X1CD34-Miz1
X2CD34-Miz1
X3CD34-Miz1
X4CD34-Miz1

Class B =
X1CD34-WT
X2CD34-WT
X3CD34-WT
X4CD34-WT

These titles are exactly the column headers in the Excel file (saved as a tab-delimited .txt file). I don't understand the problem: it says there is no such name in the "ds".
It also says there is no "X1CD34-MIZ1", but there is; it's just "Miz1" rather than "MIZ1", and I didn't tell it to make the label all upper case like that...


<Error Details>

---- Full Error Message ----
No such sample in ds: X1CD34-MIZ1
1CD34-Miz1   
1CD34-WT    1CD34+Miz1    1CD34+WT    2CD3 ...


---- Stack Trace ----
# of exceptions: 1
------No such sample in ds: X1CD34-MIZ1
1CD34-Miz1   
1CD34-WT    1CD34+Miz1    1CD34+WT    2CD34-Miz1    2CD34-WT    2CD34+Miz1    2CD34+WT    3CD34-Miz1    3CD34-WT    3CD34+Miz1    3CD34+WT    4CD34-Miz1    4CD34-WT    4CD34+Miz1    4CD34+WT   
Number of elements: 16
------
java.lang.IllegalArgumentException: No such sample in ds: X1CD34-MIZ1
1CD34-Miz1   
1CD34-WT    1CD34+Miz1    1CD34+WT    2CD34-Miz1    2CD34-WT    2CD34+Miz1    2CD34+WT    3CD34-Miz1    3CD34-WT    3CD34+Miz1    3CD34+WT    4CD34-Miz1    4CD34-WT    4CD34+Miz1    4CD34+WT   
Number of elements: 16

Arthur Liberzon

Mar 7, 2018, 11:04:29 AM
to gsea...@googlegroups.com
Don't use the "on-the-fly" option for this. It is prone to errors (as you have already noticed) and is impossible to reproduce.
Instead, I encourage you to spend some time and define phenotype classes in a dedicated file in the CLS format.


Adam Summerfield

Apr 5, 2018, 3:35:48 AM
to gsea-help
Thanks Arthur. I'm using Notepad, but there's no option to save it as a CLS file and I don't know how to convert the file to CLS format.

Arthur Liberzon

Apr 5, 2018, 3:29:25 PM
to gsea...@googlegroups.com
Save it as plain text, then change the extension to .cls




Adam Summerfield

May 27, 2018, 11:47:35 AM
to gsea-help
I still can't get any .cls file to load into GSEA.
These are the columns of my dataset file:
Name    Description    1CD34-Miz1    1CD34-WT    1CD34+Miz1    1CD34+WT    2CD34-Miz1    2CD34-WT    2CD34+Miz1    2CD34+WT    3CD34-Miz1    3CD34-WT    3CD34+Miz1    3CD34+WT    4CD34-Miz1    4CD34-WT    4CD34+Miz1    4CD34+WT

And this is my .cls file:

16 4 1
# CD34-Miz1 CD34-WT CD34+Miz1 CD34+WT

1CD34-Miz1 1CD34-WT 1CD34+Miz1 1CD34+WT 2CD34-Miz1 2CD34-WT 2CD34+Miz1 2CD34+WT 3CD34-Miz1 3CD34-WT 3CD34+Miz1 3CD34+WT 4CD34-Miz1 4CD34-WT 4CD34+Miz1 4CD34+WT

It's telling me the items are mismatched with the templates, but I can't work out how to match them.

David Eby

May 28, 2018, 11:58:32 PM
to gsea-help
Hi Adam,

Yes, I agree that despite being a "simple" text file format, the CLS format often confuses users.  Matching the "class names" of line 2 to the "class labels" of line 3 seems to be the sticking point.  It's a common issue and we sometimes discuss adopting an alternative format.

You may find it easier to work with symbolic class labels like simple numbers on line 3 (so, 0-3 in your case with a 4-class dataset).  The key point is that as line 3 is processed left-to-right, it will take the first label it finds *no matter what it is* and map it to the first class name from line 2 (also left-to-right).  Any other instances of that label then map to that same name.

Next, the second distinct label found on line 3 is mapped to the second name on line 2, as are any other instances of that label. The same applies to the third and fourth labels in your case. If line 3 contains more unique labels than line 2 has names, you'll get an error, as in your example.
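
To make the left-to-right rule concrete, here is a small Python sketch of the mapping as described in the paragraphs above. It illustrates the rule only and is not GSEA's actual code; the function name `map_labels_to_names` is hypothetical.

```python
def map_labels_to_names(class_names, labels):
    """Assign class names (line 2) to labels (line 3) in order of first appearance."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            # A new distinct label takes the next unused class name.
            if len(mapping) >= len(class_names):
                raise ValueError(
                    f"label {label!r} is unique label number {len(mapping) + 1}, "
                    f"but only {len(class_names)} class names are declared"
                )
            mapping[label] = class_names[len(mapping)]
    return mapping


names = ["CD34-Miz1", "CD34-WT", "CD34+Miz1", "CD34+WT"]
print(map_labels_to_names(names, "0 1 2 3 0 1 2 3".split()))
```

With the .cls file posted above, line 3 holds 16 distinct sample names while line 2 declares only 4 class names, so under this rule the fifth distinct label, 2CD34-Miz1, already has no name left to map to.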

Since line 3 represents your samples column-wise as they appear in the expression dataset, you need to arrange the class names on line 2 in the order in which they're first encountered among your samples.  If you're also using numbers for labels, then you should encounter [0, 1, 2, 3] in order on line 3 when reading left-to-right.
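
One way to avoid ordering mistakes is to generate the CLS text from the per-sample class assignments, so that the names on line 2 come out in first-appearance order automatically. The Python sketch below does that using numeric labels. The function name `build_cls` is hypothetical, and deriving the class by dropping the leading replicate digit from each sample name is an assumption that happens to fit the sample names in this thread. Whether GSEA accepts "+" and "-" in class names is not settled in this thread.

```python
def build_cls(sample_classes):
    """Return CLS text for samples given in dataset column order.

    `sample_classes` holds one class name per sample. Line 2 lists the
    class names in order of first appearance, and line 3 uses the
    numeric labels 0, 1, 2, ... assigned in that same order.
    """
    index_of = {}
    for name in sample_classes:
        if any(ch.isspace() for ch in name):
            # Fields are written space-separated, so whitespace would be ambiguous.
            raise ValueError(f"class name {name!r} contains whitespace")
        index_of.setdefault(name, len(index_of))
    header = f"{len(sample_classes)} {len(index_of)} 1"
    names_line = "# " + " ".join(index_of)
    labels_line = " ".join(str(index_of[name]) for name in sample_classes)
    return "\n".join([header, names_line, labels_line]) + "\n"


samples = [
    "1CD34-Miz1", "1CD34-WT", "1CD34+Miz1", "1CD34+WT",
    "2CD34-Miz1", "2CD34-WT", "2CD34+Miz1", "2CD34+WT",
    "3CD34-Miz1", "3CD34-WT", "3CD34+Miz1", "3CD34+WT",
    "4CD34-Miz1", "4CD34-WT", "4CD34+Miz1", "4CD34+WT",
]
# Assumption: the class is the sample name minus its leading replicate digit.
print(build_cls([s[1:] for s in samples]))
```

For these 16 samples the sketch produces the header "16 4 1", the names line "# CD34-Miz1 CD34-WT CD34+Miz1 CD34+WT", and a third line that repeats "0 1 2 3" four times.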

I hope this helps.

jur...@gmail.com

Oct 22, 2019, 3:57:41 PM
to gsea-help

This subtlety has been a major issue for me. Can this information, specifically the fact that the labels in line 2 must reflect the order of occurrence of phenotype in line 3, be added to the "Data Formats" documentation here: http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats ?

Jonathan Urbach

---------------------------------------------------------
Jonathan M. Urbach, Ph.D.
Ragon Institute

David Eby

Oct 23, 2019, 10:49:33 PM
to gsea...@googlegroups.com
Hi Jonathan,

Yes, we periodically get user help requests on this subject.  We talked about it a little in our team meeting today but didn't come up with any clear and obvious alternatives.  At a minimum we'll go back and review the Data Formats page to see if we can explain it better.  There is a note to this effect on that page, but maybe that can be improved.

It is unfortunately one of those topics that never quite rises to the level of urgency where it takes precedence over all the other tasks we need to address.  The next time I have the opportunity to take a look at the code involved I'll see if there's anything better that might be done.

Thanks,
David


jur...@broadinstitute.org

Oct 24, 2019, 11:29:17 AM
to gsea-help
Hi David,

I see that now... My apologies for not getting it when I read it before. I think I got a little lost in the Note part.

What I think tripped me up was the initial expectation that the mapping of class names to phenotype labels would be by name, rather than by position. Clearly in the example, the names and the phenotype labels are not identical, but I didn't notice that on my first cursory reading.

I think it would have been clearer to me if the description of the second line added that the labels on the 2nd line need not be the same as those on the 3rd, but are mapped to the corresponding phenotype labels by position, specifically in the order in which they first appear on the 3rd line. You could also add "See below for explanation."

Thanks for your response.

Jonathan

David Eby

Oct 24, 2019, 10:05:10 PM
to gsea...@googlegroups.com
No apologies necessary, Jonathan.  I actually just yesterday added an expanded description based on your feedback, which is probably why you didn't see it before.

These are good suggestions; I'll see what I can do to improve the description.  We may try to improve it in code / format as well, if we get the opportunity.