Import New Reference Genome

Ji Hen Lau

Jan 28, 2025, 7:17:06 AM
to cBioPortal for Cancer Genomics Discussion Group
Hi!

I'm trying to import the hs37d5 reference genome into cBioPortal by following this documentation. However, I deployed cBioPortal using Docker, and the documentation seems to assume a non-Docker deployment (please correct me if I'm wrong).

My Attempts:
1. Using importReferenceGenome.pl 
I attempted to imitate the study validation/import process with the following command:  
docker compose exec cbioportal importReferenceGenome.pl --ref-genome ./refGenome.txt
But I encountered this error:
OCI runtime exec failed: exec failed: unable to start container process: exec: "importReferenceGenome.pl": executable file not found in $PATH: unknown

2. Direct MySQL Insertion  
As the previous attempt did not work, I tried another suggestion from a Google Groups discussion: inserting the record directly into the MySQL database. I ran:
INSERT INTO `reference_genome` VALUES (4, 'human', 'hs37d5', 'GRCh37_hs37d5', 2900338458, 'https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz', '2017-03-16');
The insertion succeeded, but when I tried to validate my data with validateData.py after adding 'reference_genome: hs37d5' to meta_study.txt, I got:
ERROR: meta_study.txt: Unknown reference genome defined. Should be one of ['hg19', 'hg38', 'mm10']; value encountered: 'hs37d5'

To resolve this, I tried restarting the cBioPortal container with docker compose restart and with docker compose down/up, but neither resolved the issue. The error message also seems to say that no other reference genome is accepted. Is this a recent restriction in cBioPortal?

3. Running importReferenceGenome.pl in Bash
Since my second attempt bypassed the importReferenceGenome.pl script, and the discussion I followed is five years old, I wondered whether that solution still applies, especially in a Docker setting. So I went back to the original documentation.

I tried running importReferenceGenome.pl from a bash shell inside the container:
docker compose cp ./refGenome/refGenome.txt cbioportal:/core/scripts/
docker compose exec cbioportal bash
cd /core/scripts
export PORTAL_HOME=/cbioportal
./importReferenceGenome.pl --ref-genome refGenome.txt

But got this error:
PORTAL_DATA_HOME Environment Variable is not set.  Please set, and try again.

So I tried:
export PORTAL_DATA_HOME=/cbioportal
./importReferenceGenome.pl --ref-genome refGenome.txt
This time the script ran, but it appears to have had connection issues. I received:
Reading reference genome from:  /core/scripts/refGenome.txt
 --> total number of lines:  1

Standard Commons Logging discovery in action with spring-jcl: please remove commons-logging.jar from classpath in order to avoid potential conflicts
09:54:21.523 [main] INFO org.mskcc.cbio.portal.util.GlobalProperties -- Attempting to read properties file: /cbioportal/application.properties
09:54:21.525 [main] INFO org.mskcc.cbio.portal.util.GlobalProperties -- Failed to read properties file: /cbioportal/application.properties
09:54:21.525 [main] INFO org.mskcc.cbio.portal.util.GlobalProperties -- Attempting to read properties file from classpath
09:54:21.526 [main] INFO org.mskcc.cbio.portal.util.GlobalProperties -- Successfully read properties file
09:54:21.527 [main] INFO org.mskcc.cbio.portal.util.GlobalProperties -- Attempting to read properties file: /cbioportal/maven.properties
09:54:21.527 [main] INFO org.mskcc.cbio.portal.util.GlobalProperties -- Failed to read properties file: /cbioportal/maven.properties
09:54:21.527 [main] INFO org.mskcc.cbio.portal.util.GlobalProperties -- Attempting to read properties file from classpath
09:54:21.527 [main] INFO org.mskcc.cbio.portal.util.GlobalProperties -- Successfully read properties file
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
java.sql.SQLException: Cannot create PoolableConnectionFactory (Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.)
        at org.apache.commons.dbcp2.BasicDataSource.createPoolableConnectionFactory(BasicDataSource.java:633)
        at org.apache.commons.dbcp2.BasicDataSource.createDataSource(BasicDataSource.java:535)
        at org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:711)
        at org.springframework.jdbc.datasource.DataSourceUtils.fetchConnection(DataSourceUtils.java:159)
        at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:117)
        at org.springframework.jdbc.datasource.TransactionAwareDataSourceProxy$TransactionAwareInvocationHandler.invoke(TransactionAwareDataSourceProxy.java:223)
        at jdk.proxy1/jdk.proxy1.$Proxy0.prepareStatement(Unknown Source)
        at org.mskcc.cbio.portal.dao.DaoReferenceGenome.reCache(DaoReferenceGenome.java:68)
        at org.mskcc.cbio.portal.dao.DaoReferenceGenome.<clinit>(DaoReferenceGenome.java:43)
        at org.mskcc.cbio.portal.scripts.ImportReferenceGenome.addReferenceGenomesToDB(ImportReferenceGenome.java:101)
        at org.mskcc.cbio.portal.scripts.ImportReferenceGenome.importData(ImportReferenceGenome.java:89)
        at org.mskcc.cbio.portal.scripts.ImportReferenceGenome.run(ImportReferenceGenome.java:150)
        at org.mskcc.cbio.portal.scripts.ConsoleRunnable.runInConsole(ConsoleRunnable.java:145)
        at org.mskcc.cbio.portal.scripts.ImportReferenceGenome.main(ImportReferenceGenome.java:179)
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
There were several more similar blocks reporting "Communications link failure".

In addition, although the output included a message stating "reference genome added to the database," no new entry appeared in the database.

Question  

How can I properly import a new reference genome into a Docker-deployed cBioPortal? I’m sorry for the lengthy explanation and beginner-level question, but I’d appreciate your guidance!

Thanks in advance!



Ruslan Forostianov

Jan 28, 2025, 9:29:37 AM
to Ji Hen Lau, cBioPortal for Cancer Genomics Discussion Group
Hi Ji Hen Lau,

It looks like the script failed to connect to the database, most likely because it did not find the correct application.properties file and so used incorrect database credentials.
The script looks for the application.properties file in the PORTAL_HOME folder.
Does /cbioportal/application.properties exist? Does it have the correct database URL (connection string)?
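
For anyone who wants to run this check from inside the container, here is a minimal Python sketch. It is not part of cBioPortal. It only assumes that application.properties uses plain key=value lines. The key patterns it filters on ("db." and "datasource") are guesses about how the database settings are named, so adjust them to whatever your file actually contains.

```python
import os
from pathlib import Path


def read_properties(path):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        # Split on the first '=' only, so JDBC URLs with '=' in them stay intact.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def database_settings(props):
    """Return entries whose key looks database-related (assumed name patterns)."""
    markers = ("db.", "datasource")  # assumption, not taken from cBioPortal docs
    return {k: v for k, v in props.items() if any(m in k for m in markers)}


if __name__ == "__main__":
    portal_home = os.environ.get("PORTAL_HOME", "/cbioportal")
    candidate = Path(portal_home) / "application.properties"
    if candidate.is_file():
        for key, value in sorted(database_settings(read_properties(candidate)).items()):
            # Mask anything that looks like a password before printing.
            shown = "***" if "password" in key.lower() else value
            print(f"{key} = {shown}")
    else:
        print(f"{candidate} does not exist")
```

If the file is missing under PORTAL_HOME, the script output quoted earlier suggests the importer falls back to a properties file on the classpath, which may not point at your database.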

Regards,
Ruslan


Ji Hen Lau

Feb 3, 2025, 9:34:05 AM
to cBioPortal for Cancer Genomics Discussion Group
Hi Ruslan,

Thank you for your prompt response!

Regarding the application.properties file, I found it in two other locations rather than in /cbioportal/:

  • One is at /cbioportal-webapp/, which I can only access from a shell inside the container, for example via docker compose exec cbioportal bash
  • The other is at /cbioportal-docker-compose/config/ on the host machine (Ubuntu). For context, I'm deploying cBioPortal on AWS EC2 Ubuntu instances.

Is this expected for a Docker deployment, or is there something wrong with my deployment?

Based on the information in /cbioportal-docker-compose/config/application.properties and /cbioportal-docker-compose/docker-compose.yml, I set the following environment variables:

docker compose exec cbioportal bash
cd /core/scripts
export PORTAL_HOME=/cbioportal-webapp
export PORTAL_DATA_HOME=/cbioportal-webapp
./importReferenceGenome.pl --ref-genome refGenome.txt


This time, the reference genome was successfully inserted into the MySQL database:
Reading reference genome from:  /core/scripts/refGenome.txt
 --> total number of lines:  1

Standard Commons Logging discovery in action with spring-jcl: please remove commons-logging.jar from classpath in order to avoid potential conflicts
04:06:41.496 [main] INFO org.mskcc.cbio.portal.util.GlobalProperties -- Attempting to read properties file: /cbioportal-webapp/application.properties
04:06:41.498 [main] INFO org.mskcc.cbio.portal.util.GlobalProperties -- Successfully read properties file
04:06:41.500 [main] INFO org.mskcc.cbio.portal.util.GlobalProperties -- Attempting to read properties file: /cbioportal-webapp/maven.properties
04:06:41.500 [main] INFO org.mskcc.cbio.portal.util.GlobalProperties -- Successfully read properties file

Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
04:06:42.004 [main] WARN org.mskcc.cbio.portal.util.ProgressMonitor -- New reference genome added
Done. Restart tomcat to make sure the cache is replaced with the new data.

Warnings / Errors:
-------------------
0.  New reference genome added; 1x
Done.
Total time:  840 ms

However, I still have a couple of questions:

1. $PORTAL_HOME and $PORTAL_DATA_HOME

While running the importReferenceGenome.pl script, I encountered an error stating that $PORTAL_DATA_HOME is not set. After setting $PORTAL_HOME to /cbioportal-webapp, I also set $PORTAL_DATA_HOME to /cbioportal-webapp, but I am unsure if this is the correct practice. Could you confirm?

2. Reference Genome Not Updated and Caused Study Import Failure
The reference genome was successfully inserted into the database, but when I try to validate my data with validateData.py (after adding 'reference_genome: hs37d5' to meta_study.txt), I still receive the following error:

ERROR: meta_study.txt: Unknown reference genome defined. Should be one of ['hg19', 'hg38', 'mm10']; value encountered: 'hs37d5'

The error persists even if I:
  • restart all Docker containers
  • run docker compose down/up
  • re-initialize the container with /cbioportal-docker-compose/init.sh
I also attempted to restart Tomcat as suggested, but I received a message stating "service not found." I found some cache files related to Tomcat, so it seems that Tomcat might be managed by Docker, but I am unsure how to restart it. Could you guide me on the proper steps for restarting Tomcat in this case?

Looking forward to your guidance.

Best regards,

Ji Hen

Ji Hen Lau

Feb 4, 2025, 8:27:15 AM
to cBioPortal for Cancer Genomics Discussion Group
Hi,

I've been trying to understand how new reference genomes are imported into cBioPortal and have some further questions:

1. Reference Genome Validation Script
While cBioPortal provides importReferenceGenome.pl to import reference genomes, the validateData.py script seems to accept only three reference genomes. Looking into validateData.py, I noticed code such as:

reference_genome_map = {
    'hg19': ('human', 'GRCh37', 'hg19'),
    'hg38': ('human', 'GRCh38', 'hg38'),
    'mm10': ('mouse', 'GRCm38', 'mm10')
}

# if reference_genome is specified in the meta file, override the defaults in portal properties
if 'reference_genome' in meta_dictionary:
  if meta_dictionary['reference_genome'] not in reference_genome_map:
    logger.error('Unknown reference genome defined. Should be one of %s' %
                 list(reference_genome_map.keys()),
                 extra={
                 'filename_': filename,
                 'cause': meta_dictionary['reference_genome'].strip()
                })
  else:
    genome_info = reference_genome_map[meta_dictionary['reference_genome']]
    logger.info('Setting reference genome to %s (%s, %s)' % genome_info,
                extra={'filename_': filename})
    portal_instance.species, portal_instance.ncbi_build, portal_instance.reference_genome = genome_info
else:
  logger.info('No reference genome specified -- using default (hg19)',
              extra={'filename_': filename})           

Is validateData.py expected to pick up the imported reference genome automatically after importReferenceGenome.pl runs, or do we still have to modify it manually?
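
To make the behaviour in the quoted snippet easier to see, here is a small standalone Python sketch of the same lookup. It is a simplified re-creation for illustration, not the actual validateData.py code. The function name resolve_reference_genome and the patched hs37d5 entry are hypothetical. As far as the quoted snippet shows, the check consults only the in-script dictionary, so a row added to the database would not change its result.

```python
import logging

logger = logging.getLogger("validation_sketch")

# Same three entries as the hard-coded map quoted above.
REFERENCE_GENOME_MAP = {
    "hg19": ("human", "GRCh37", "hg19"),
    "hg38": ("human", "GRCh38", "hg38"),
    "mm10": ("mouse", "GRCm38", "mm10"),
}


def resolve_reference_genome(meta_dictionary, genome_map=REFERENCE_GENOME_MAP):
    """Return (species, ncbi_build, reference_genome), or None if the name is unknown."""
    name = meta_dictionary.get("reference_genome")
    if name is None:
        # Mirrors the 'using default (hg19)' branch of the quoted code.
        return genome_map["hg19"]
    if name not in genome_map:
        logger.error(
            "Unknown reference genome defined. Should be one of %s; value encountered: %r",
            list(genome_map.keys()),
            name,
        )
        return None
    return genome_map[name]


# Hypothetical manual patch: an extra entry added by hand, not by the import script.
patched_map = dict(REFERENCE_GENOME_MAP, hs37d5=("human", "GRCh37", "hs37d5"))

print(resolve_reference_genome({"reference_genome": "hs37d5"}))               # unknown name
print(resolve_reference_genome({"reference_genome": "hs37d5"}, patched_map))  # patched map
```

Whether patching the map is enough for the rest of the import pipeline is a separate question that this sketch does not answer.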

2. Database Integration after Importing Reference Genome
According to this documentation, importing a reference genome is related to the mutation and segment plots. However, correct me if I'm wrong, importReferenceGenome.pl seems to only add metadata to the reference_genome table in the MySQL database. When I looked into other relevant tables, such as reference_genome_gene, I couldn't find any entries related to the 'hs37d5' genome that I imported. (I also noticed that the reference_genome_gene table seems to contain only 'hg19' genes. Is that because I did not run init_hg38.sh while deploying cBioPortal, and does it mean I need a separate script to initialize the database for the imported 'hs37d5'?)

Questions in summary:
  • Does importReferenceGenome.pl update the relevant utility scripts automatically, or do we have to patch the system ourselves (for example, by building another seed database)?
  • Would it be feasible to extend the architecture to better support custom reference genomes?
  • For now, would mapping to hg19 be the recommended workaround, since hs37d5 is based on hg19?
Any insights or suggestions would be greatly appreciated.

Best regards,
Ji Hen

Prasanna Jagannathan

Feb 4, 2025, 9:17:09 AM
to Ji Hen Lau, cBioPortal for Cancer Genomics Discussion Group
Hi Ji Hen,

Supporting a new reference genome is an advanced task in cBioPortal.

For example, the three links below focus on supporting the mouse genome.

The complexity comes from having to set up a seed DB and then a
Genome Nexus instance for the custom genome.

https://www.thehyve.nl/articles/cbioportal-for-mouse-data
https://github.com/cBioPortal/datahub/blob/master/seedDB_mouse/README.md
https://github.com/genome-nexus/genome-nexus-importer/blob/master/docs/setup-genome-nexus-mouse.md

Based on this, if it is feasible then mapping to hg19 would be the
recommended workaround as hs37d5 is based on hg19.
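
As a rough illustration of that workaround, the Python sketch below rewrites the reference_genome field in the text of a meta_study.txt file so that it says hg19. The helper name set_reference_genome is hypothetical, and the other fields in the example are only illustrative. It assumes the "key: value" line format seen in 'reference_genome: hs37d5' earlier in this thread.

```python
def set_reference_genome(meta_text, genome="hg19"):
    """Return meta file text with its reference_genome field set to `genome`."""
    out = []
    replaced = False
    for line in meta_text.splitlines():
        key, sep, _ = line.partition(":")
        if sep and key.strip() == "reference_genome":
            out.append(f"reference_genome: {genome}")
            replaced = True
        else:
            out.append(line)
    if not replaced:
        # No field present: add one explicitly instead of relying on a default.
        out.append(f"reference_genome: {genome}")
    return "\n".join(out) + "\n"


example = (
    "type_of_cancer: brca\n"
    "cancer_study_identifier: demo_study\n"
    "reference_genome: hs37d5\n"
)
print(set_reference_genome(example), end="")
```

According to the validateData.py snippet quoted earlier in the thread, leaving the field out makes the validator fall back to hg19, so deleting the line may work as well. Whether the study's coordinates are actually compatible with hg19 still has to be checked for your data.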

Please reply only to the cbioportal google group.

regards
Jag

Ji Hen Lau

Feb 4, 2025, 7:43:19 PM
to cBioPortal for Cancer Genomics Discussion Group
Hi Jag, 

Thank you for the detailed response!

Just to clarify - does this mean that importing a new reference genome would require preparing and importing all the necessary genomic information that cBioPortal needs for its visualizations (like gene positions, cytoband, etc.)? And would this data need to be properly formatted and integrated into both the seedDB and genome-nexus?

Thanks again for your guidance.

Best regards,
Ji Hen

Prasanna Jagannathan

Feb 4, 2025, 7:56:05 PM
to Ji Hen Lau, cBioPortal for Cancer Genomics Discussion Group
Hi Ji Hen,

You can see the GenomeNexus website for all the genes and gene data.

https://www.genomenexus.org/ - Genome build: GRCh37

If this instance is suitable for hs37d5, there is no need to set up
a new custom Genome Nexus instance for hs37d5. Similarly, if the gene
data is identical between hg19 and hs37d5, there won't be a need for a
new seed DB.

However, if there are any differences, both will be needed. It
depends on how much of the hg19 data can be used for hs37d5. You will
have to determine that for the hs37d5 genome.

Please reply only to the cbioportal google group.

thanks
Jag

Ji Hen Lau

Feb 5, 2025, 9:37:37 AM
to cBioPortal for Cancer Genomics Discussion Group
Hi Jag,

Thanks for the clarification!

Regards,
Ji Hen
