Uploading GENIE; problems with MySQL lock table


Peter Saffrey

Nov 16, 2022, 3:13:20 AM
to cbiop...@googlegroups.com

Hi there,

 

I am trying to upload GENIE data into my local (docker-based) cBioPortal install and have encountered a number of error messages. I notice there is a previous post about this:

 

https://groups.google.com/g/cbioportal/c/W33cDc833Uc/m/ddnRPORfCgAJ

 

But I’ve taken this a bit further and wanted to add my experiences for others as well as ask about where I am currently stuck.

 

What I did:

 

 

The validation found the following errors:

 

  • ERROR: data_clinical_sample.txt: lines [3974, 4633, 6748, (30408 more)]: column 3: Value of numeric attribute is not a real number; value encountered: 'Unknown'
    • Substituted an age of 0 with: `sed 's/\tUnknown\t/\t0\t/' data_clinical_sample.txt > data_clinical_sample_filtered.txt`
    • Possibly I could have used NA instead
  • ERROR: data_gene_matrix.txt: lines [11, 22, 26, (44392 more)]: column 3: Blank cell found in column; value encountered: ''' (in column 'cna')'
    • Filled in NAs instead of blanks with ` sed 's/\t$/\tNA/' data_gene_matrix.txt > data_gene_matrix_filtered.txt`
  • ERROR: data_gene_matrix.txt: lines [11, 22, 26, (44441 more)]: Gene panel ID is not in database. Please import this gene panel before loading study data.; values encountered: ['', 'YALE-OCP-V2']
    • Fixing this was tricky because uploading the panel YALE-OCP-V2 gave an error about a non-existent gene without reporting which one. In the end I had to remove genes from the panel one by one until I found the bad one, which was MYO18A. This gene does appear in the list you can get from the cBioPortal API, but the panel would only upload once I had removed it.
  • ERROR: data_clinical_patient.txt: lines [6, 10, 13, (94384 more)]: columns [7, 10, 6, (1 more)]: Value of numeric attribute is not a real number; values encountered: ['Not Applicable', 'Unknown', 'Not Collected', '(1 more)']
    • I ran a series of `sed` commands to replace these values with NA:
    • `cp data_clinical_patient.txt data_clinical_patient_filtered.txt`
    • `sed -i 's/\tNot Applicable/\tNA/g' data_clinical_patient_filtered.txt`
    • `sed -i 's/\tUnknown/\tNA/g' data_clinical_patient_filtered.txt`
    • `sed -i 's/\tNot Collected/\tNA/g' data_clinical_patient_filtered.txt`
    • `sed -i 's/\tNot Released/\tNA/g' data_clinical_patient_filtered.txt`
  • ERROR: data_mutations_extended.txt: lines [1150228, 1259549]: No Entrez gene id or gene symbol provided for gene.
    • Since this was only two entries, I removed these: `grep -v  "^Unknown" data_mutations_extended.txt > data_mutations_extended_filtered.txt`
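
A note on the `sed` substitutions above: they match the placeholder text wherever it follows a tab, so they can also rewrite other columns or values that merely begin with the placeholder, and `\t` in a pattern is not understood by every `sed`. The following is a minimal sketch of a column-aware alternative using `awk`; it is not taken from the cBioPortal tooling. The function names `list_non_numeric` and `replace_in_column` are made up for illustration, and the script assumes that metadata lines start with `#` and that the first other line is the header row. Listing the non-numeric values first should also reveal values that the validator message truncates as '(1 more)'.

```shell
#!/bin/sh
# Sketch only: column-aware clean-up of tab-separated clinical data files.
# Assumptions: metadata lines start with '#', and the first line that does
# not start with '#' is the header row. Function names are hypothetical.

# list_non_numeric FILE COL
# Print "value<TAB>count" for each distinct value in column COL (1-based)
# that is neither a plain decimal number nor NA.
list_non_numeric() {
  awk -F'\t' -v col="$2" '
    /^#/         { next }                    # skip metadata lines
    !seen_header { seen_header = 1; next }   # skip the header row
    $col !~ /^-?[0-9]+(\.[0-9]+)?$/ && $col != "NA" { count[$col]++ }
    END { for (v in count) printf "%s\t%d\n", v, count[v] }
  ' "$1" | LC_ALL=C sort
}

# replace_in_column FILE COL VALUE...
# Copy FILE to stdout, replacing exact matches of any VALUE in column COL
# with NA. Other columns, metadata lines and the header are left untouched.
# The body runs in a subshell so its variables stay local.
replace_in_column() (
  file=$1
  col=$2
  shift 2
  awk -F'\t' -v OFS='\t' -v col="$col" '
    BEGIN {
      # ARGV[1] is the input file; the remaining arguments are the values
      # to replace, so record them and remove them from the file list.
      for (i = 2; i < ARGC; i++) { bad[ARGV[i]] = 1; delete ARGV[i] }
    }
    /^#/         { print; next }
    !seen_header { seen_header = 1; print; next }
    ($col in bad) { $col = "NA" }
    { print }
  ' "$file" "$@"
)
```

For example, `list_non_numeric data_clinical_patient.txt 7` would show what needs replacing in column 7, and `replace_in_column data_clinical_patient.txt 7 'Not Applicable' 'Unknown' 'Not Collected' 'Not Released' > data_clinical_patient_filtered.txt` would then replace only those exact values in that column. Each affected column needs its own pass.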

 

After all this work, the data did validate properly but still did not upload. First, I had an error about Java memory. Then after a bit more tinkering, eventually I got this error:

 

(snip)

java.lang.RuntimeException: org.mskcc.cbio.portal.dao.DaoException: The total number of locks exceeds the lock table size

(snip)

 

I have tried increasing the memory available to the MySQL container using the instructions here:

 

https://stackoverflow.com/a/10912622

 

(in fact, you have to modify the docker-compose file to inject a new `innodb_buffer_pool_size`) but that doesn’t seem to have helped.

 

Any suggestions?

 

Thanks,

 

Peter

 

 

 


Peter Saffrey

Nov 18, 2022, 9:15:41 AM
to cBioPortal for Cancer Genomics Discussion Group
I fixed this problem. I had to increase the innodb_buffer_pool_size to 1G in the docker-compose file:

cbioportal-database:
  (snip)
  command: --innodb-buffer-pool-size=1G
  (snip)

Then Java ran out of heap space. The heap size is computed automatically from the RAM available on the host, so I fixed this by increasing the RAM on the VM that was running Docker (to 8 GB).

I'm still not sure I've done this exactly right because the GENIE study in the UI has 134,000 samples and there are quite a few operations on these samples that cause the UI to crash. However, the data is now there. I welcome any more suggestions on how to do this properly!

Thanks,

Peter

debr...@mskcc.org

Nov 18, 2022, 5:41:44 PM
to psaf...@gmail.com, cbiop...@googlegroups.com, p...@thehyve.nl

Great! Thanks for sharing your solution, Peter!

 

For our MySQL configuration we use something like this in production:

 

https://github.com/knowledgesystems/knowledgesystems-k8s-deployment/blob/master/cbioportal/cbioportal_mysql_db_values.yml#L13-L25

 

The docker compose based installation doesn’t have Redis caching set up by default. Enabling it would make the app much more performant. If you are familiar with Kubernetes, you can see how we set Redis up there:

 

https://github.com/knowledgesystems/knowledgesystems-k8s-deployment/tree/master/cbioportal#redis

 

We would definitely welcome a pull request adding Redis to the docker compose repo, and I’m happy to guide you if this is something you’re interested in. CC’ing Pim as well, who might have set this up before.

 

Note also that GENIE is a really big dataset so it requires additional resources. Our production node sizes are listed here:

 

https://docs.cbioportal.org/hardware-requirements/

 

Hope that helps!

 

Best wishes,

Ino


 


Peter Saffrey

Nov 21, 2022, 6:47:23 AM
to cBioPortal for Cancer Genomics Discussion Group
Hi Ino,

I'm happy to have a go at adding Redis to the docker-compose configuration, which does sound useful. I have used Kubernetes before but I'm by no means an expert.

I don't think it would be difficult to add a Redis container to the docker-compose setup, based on the settings you linked. The harder part would be adjusting the cBioPortal configuration to make use of it. I can see that the documentation you linked references a `config_map.yaml` file, which I presume contains the necessary changes to `portal.properties`. I can see a whole Redis section in `portal.properties`, but guidance on what I should put in there would be helpful :)

I also see this caution about changing the `cache_type`: "caution 2: this configuration needs to be set both at compile time and run time." The docker-compose version, however, is based on an existing build of cBioPortal, which I presume was compiled with `no-cache`. Can we work around this?

Thanks,

Peter
