java.lang.NumberFormatException Error


Hamed Khedmatgozar

Feb 22, 2022, 2:15:06 PM
to gsea...@googlegroups.com
Dear Sir/Madam, 

Hi, 
Could you please advise me on the error I get when I try to load my data? I searched for it and found that it might be caused by NA values in the data, but I checked and that is not the case.

<Error Details>

---- Full Error Message ----
There were errors: ERROR(S) #:1
Parsing trouble
java.lang.NumberFormatException: ...

---- Stack Trace ----
# of exceptions: 1
------For input string: "PK
GSEA.txt

David Eby

Feb 22, 2022, 4:18:17 PM
to gsea...@googlegroups.com
Hi Hamed,

GSEA uses the file extension to determine how to parse the data, and this file has a TXT extension instead of GCT.  It handles TXT files as another kind of matrix format, highly similar to GCT but slightly different (see the Data Formats page of our Wiki for details).  Your file is actually in the correct format for GCT, you just need to change the extension from ".txt" to ".gct" and it will load fine.  

I would suggest changing the column headers to be simple labels only, however.  That is, from "SRR9298757 - linear total RPKM" to just "SRR9298757" for example.  GSEA may have issues with special characters like spaces or dashes in names, so we generally recommend sticking with alphanumeric-only, though underscores can also be used in place of any special characters.  You are welcome to try these more complex labels but if there are issues this might be the cause.
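
As an illustration of the label advice above, here is a minimal Python sketch with two ways of simplifying a column header. The function names `simplify_label` and `short_label` are hypothetical helpers written for this example and are not part of GSEA.

```python
import re


def simplify_label(label: str) -> str:
    """Replace each run of non-alphanumeric characters with one underscore."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", label)
    return cleaned.strip("_")


def short_label(label: str) -> str:
    """Keep only the leading alphanumeric/underscore token of the label."""
    match = re.match(r"[A-Za-z0-9_]+", label.strip())
    return match.group(0) if match else simplify_label(label)


header = "SRR9298757 - linear total RPKM"
print(simplify_label(header))  # SRR9298757_linear_total_RPKM
print(short_label(header))     # SRR9298757
```

Whichever variant is used, the sample labels should stay unique after simplification.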

Regards,


Hamed Khedmatgozar

Feb 22, 2022, 4:33:42 PM
to gsea...@googlegroups.com
Dear David, 

Thank you so much for your prompt reply. 
I used the .gct file and made the names based on your suggestions, but it still gives me errors. 
Here I have the error and also the .gct file I used, alongside the .cls file. 

Thank you, 

Hamed 

<Error Details>

---- Full Error Message ----
There were errors: ERROR(S) #:1
Parsing trouble
java.lang.NumberFormatException: ...

---- Stack Trace ----
# of exceptions: 1
------For input string: "PK
Dr. Matusik.cls
Matusik.gct

Anthony Castanza

Feb 22, 2022, 5:49:13 PM
to gsea...@googlegroups.com

Hi Hamed,

 

The GCT file you've provided here is not actually a GCT file. It appears to be an Excel workbook (xlsx), which is internally a ZIP archive of XML files; the "PK" in the error message is the start of the ZIP signature. The file needs to be saved out of Excel as tab-delimited text, and then the .txt file extension changed to .gct. Renaming it directly from xls/xlsx will not work, as we can't parse Microsoft's proprietary file formats. The CLS appears fine.
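
A quick way to check for this situation before loading a file is to look at its first bytes. The following is a minimal Python sketch, assuming only that a renamed Excel workbook still begins with the ZIP local-file signature `PK\x03\x04`; the function name `is_zip_container` is hypothetical and not part of GSEA.

```python
ZIP_MAGIC = b"PK\x03\x04"  # signature at the start of a non-empty ZIP archive (xlsx is one)


def is_zip_container(path) -> bool:
    """Return True if the file starts with the ZIP signature.

    A True result for a file named .gct or .txt suggests it is really an
    Excel workbook that was renamed rather than exported as text.
    """
    with open(path, "rb") as handle:
        return handle.read(len(ZIP_MAGIC)) == ZIP_MAGIC
```

For example, `is_zip_container("Matusik.gct")` returning True would indicate the file should be re-exported from Excel as tab-delimited text.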

 

-Anthony

 

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

Hamed Khedmatgozar

Feb 22, 2022, 7:36:12 PM
to gsea...@googlegroups.com
Dear Anthony, 

Thank you so much. Yes, it is true, I had it saved in .xlsx before. I saved it in .txt and it worked. Appreciate your great assistance. 

Best, 

Hamed

Hamed Khedmatgozar

Jun 11, 2024, 4:21:19 PM
to gsea...@googlegroups.com
Hello, 

I am running GSEA and it doesn't show anything for the control group. Would you please advise me on this? I have attached a screenshot of the results page. 


Thank you, 
Hamed 

image.png

Anthony Castanza

Jun 11, 2024, 5:52:35 PM
to gsea-help
According to the gene markers section of the report, 86% of the genes in your dataset were upregulated and only 14% were downregulated. With a dataset this highly skewed, I wouldn't necessarily expect there to be enough signal in the downregulated genes to find anything.

That said, this is potentially indicative of a data processing or quality control error, or batch effects that weren't removed. But without knowing more about what kind of pipeline you used to generate it I can't really give more specifics, sorry!

In general, I'd recommend a DESeq2 normalization for RNA-seq datasets, and providing the complete normalized gene-by-sample matrix to GSEA.

If the large skew is a real biological result, that's fine, but most RNA-seq processing pipelines aren't really optimized for that scenario.
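
To get a rough feel for this kind of skew outside of GSEA, one can count how many genes have a higher mean in one group than in the other. The Python sketch below is only an approximation of what the report shows: it uses the sign of the difference in class means, the function name `direction_shares` is hypothetical, and the numbers in the example are made-up toy values.

```python
from statistics import mean


def direction_shares(rows, treated_idx, control_idx):
    """Return (share_up, share_down) among genes whose class means differ.

    rows maps gene name -> list of expression values, one per sample.
    treated_idx and control_idx are the column positions of each group.
    """
    up = down = 0
    for values in rows.values():
        diff = mean(values[i] for i in treated_idx) - mean(values[i] for i in control_idx)
        if diff > 0:
            up += 1
        elif diff < 0:
            down += 1
    total = up + down
    if total == 0:
        return (0.0, 0.0)
    return (up / total, down / total)


# Toy data: three control columns followed by three treated columns.
toy = {
    "geneA": [1, 1, 1, 5, 5, 5],
    "geneB": [2, 2, 2, 4, 4, 4],
    "geneC": [3, 3, 3, 9, 9, 9],
    "geneD": [8, 8, 8, 2, 2, 2],
}
print(direction_shares(toy, treated_idx=[3, 4, 5], control_idx=[0, 1, 2]))  # (0.75, 0.25)
```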

-Anthony


Hamed Khedmatgozar

Jun 12, 2024, 7:21:19 AM
to gsea...@googlegroups.com
Hi,

I didn’t normalize the counts and used raw counts as input. 
I removed genes with 0 value and selected protein coding genes. 
Please let me know if I should have done it in a different way. 

Thank you, 
Hamed 


Anthony Castanza

Jun 12, 2024, 12:59:42 PM
to gsea...@googlegroups.com
Hi Hamed,

You definitely should normalize the counts. The method we recommend is to dump the normalized counts table from DESeq2's internal data. We offer a DESeq2 module on cloud.genepattern.org that will do this by default if you provide it the raw counts. Alternatively, you can do this directly in R with some variation of the following commands:
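library(DESeq2)
# dds is assumed to be an existing DESeqDataSet built from the raw counts
dds <- estimateSizeFactors(dds)
normalized_counts <- counts(dds, normalized=TRUE)
# col.names=NA writes a blank header cell above the row names (gene IDs)
write.table(normalized_counts, file="data/normalized_counts.txt", sep="\t", quote=FALSE, col.names=NA)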

then converting that file to a GCT.
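
The conversion to GCT can be done by hand in a spreadsheet or text editor, or scripted. Below is a minimal Python sketch, assuming the input is the tab-delimited table written by the commands above (blank first header cell, gene IDs in the first column) and that the target is the GCT 1.2 layout: a `#1.2` line, a line with the row and column counts, then a header starting with `NAME` and `Description`. The function name `tsv_to_gct` and the `na` placeholder description are choices made for this example.

```python
import csv


def tsv_to_gct(tsv_path, gct_path, description="na"):
    """Convert a gene-by-sample tab-delimited table into a GCT 1.2 file."""
    with open(tsv_path, newline="") as src:
        reader = csv.reader(src, delimiter="\t")
        header = next(reader)
        samples = header[1:]  # first header cell belongs to the gene ID column
        rows = [row for row in reader if row]  # skip empty lines

    with open(gct_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(["#1.2"])
        writer.writerow([len(rows), len(samples)])
        writer.writerow(["NAME", "Description"] + samples)
        for row in rows:
            # Insert the placeholder description between gene ID and values.
            writer.writerow([row[0], description] + row[1:])
```

A call such as `tsv_to_gct("data/normalized_counts.txt", "data/normalized_counts.gct")` would then produce a file that can be loaded with the .gct extension.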
Removing genes with zero expression is definitely recommended; you might even go so far as to use a minimum threshold of 5 counts across all samples.
As for analyzing only protein-coding genes, that is something that is done sometimes, but we don't necessarily recommend it. There are valid arguments for both analyzing all expressed genes, and just protein coding genes.
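
The low-count filter mentioned above could be sketched in Python as follows. This assumes one particular reading of the threshold, namely that a gene is kept when its counts summed over all samples reach the minimum; a stricter per-sample reading is also possible. The function name `filter_low_counts` is hypothetical.

```python
def filter_low_counts(counts, min_total=5):
    """Keep genes whose counts summed across all samples are >= min_total.

    counts maps gene name -> list of raw counts, one per sample.
    """
    return {gene: values for gene, values in counts.items() if sum(values) >= min_total}


# Toy example with made-up counts for six samples.
raw = {
    "geneA": [0, 0, 0, 0, 0, 0],   # dropped: no expression
    "geneB": [1, 0, 2, 0, 1, 0],   # dropped: total of 4
    "geneC": [3, 1, 0, 0, 1, 0],   # kept: total of 5
    "geneD": [50, 42, 61, 10, 12, 9],
}
print(sorted(filter_low_counts(raw)))  # ['geneC', 'geneD']
```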



-Anthony


Hamed Khedmatgozar

Jun 12, 2024, 4:13:23 PM
to gsea...@googlegroups.com
Hi Anthony, 

I have 6 samples (3 controls, 3 experiments). What do you think of edgeR? 
Which one is better? 

Thank you, 
Hamed 

Anthony Castanza

Jun 12, 2024, 4:34:53 PM
to gsea...@googlegroups.com
Hi Hamed,

Generally, both DESeq2 and edgeR are accepted tools for differential expression analysis; however, I can't really speak to the specifics of which to choose for any given dataset, sorry. As long as the normalization follows similar principles and its output is intended for between-sample comparisons, it should be fine.
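
To make "between-sample normalization" a little more concrete, here is a simplified Python sketch of the median-of-ratios idea that DESeq2's size-factor estimation is based on. It is an illustration only, not DESeq2's actual code, and it omits options and edge cases the real package handles; the function names `size_factors` and `normalize` are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one scaling factor per sample by the median-of-ratios idea.

    counts maps gene name -> list of raw counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    # Log geometric mean per gene, using only genes with no zero counts.
    log_geo_means = {
        gene: sum(math.log(v) for v in values) / n_samples
        for gene, values in counts.items()
        if all(v > 0 for v in values)
    }
    if not log_geo_means:
        raise ValueError("no gene has nonzero counts in every sample")
    factors = []
    for j in range(n_samples):
        # Ratio of each gene's count in sample j to that gene's geometric mean.
        log_ratios = [math.log(counts[g][j]) - lgm for g, lgm in log_geo_means.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide every count by the size factor of its sample."""
    factors = size_factors(counts)
    return {g: [v / f for v, f in zip(values, factors)] for g, values in counts.items()}
```

In this sketch, a sample sequenced twice as deeply as another gets a size factor twice as large, so the normalized values become comparable across samples.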


-Anthony


Hamed Khedmatgozar

Jun 13, 2024, 1:51:42 PM
to gsea...@googlegroups.com
Hi Anthony, 

Thanks for your help. 
I have used the GSEA software to run my analysis. Could you please advise me whether I can do it in R instead, and whether there is any code available for that?

Thank you, 
Hamed 
 

Hamed Khedmatgozar

Jun 13, 2024, 6:14:09 PM
to gsea...@googlegroups.com
Also, 

Is it better to use normalized counts or Log-normalized counts? 

Thank you, 
Hamed 

Anthony Castanza

Jun 13, 2024, 6:31:52 PM
to gsea...@googlegroups.com
Hi Hamed,

We do have an R version in our GitHub repository, but it's not maintained and is generally intended for algorithmic experimentation, not for general use.
There might be other versions of GSEA available on Bioconductor, but nothing that we maintain, sorry!

You should generally use non-log normalized counts.


-Anthony
