fetchChromSizes: database not hosted at UCSC


Vaneet Lotay

Feb 25, 2015, 3:18:23 PM
to gen...@soe.ucsc.edu

Hello,

 

I’ve been trying to use many of the handy UCSC tools/scripts for summarizing info and converting file formats; the one I need right now is bedToBigBed.  Many of your tools, including this one, require a chromosome sizes file, and the documentation always states that the user should not derive chromosome sizes independently but rather use fetchChromSizes.  The problem is that fetchChromSizes requires a database name and the one I need is not hosted at UCSC.  I’m looking for Xenopus Tropicalis V. 7.1 and V. 8.0; the latest version at UCSC is 4.2 (2009), and I assume the chromosome sizes and other details have changed since then?  Because of this I can’t use many of the UCSC tools, since they always need a chromosome sizes file. What do you suggest I do?

 

I need to perform this bed to bigBed conversion because I’m trying to add gene model files to a track hub and only 4 formats are supported: bigWig, bigBed, bam or vcfTabix.  I assume GFF3 is not supported, right?  Let me know if plain BED format is accepted, or if for some other reason I don’t need to perform this conversion for the track hub.

 

Thanks,

 

Vaneet

Brian Lee

Mar 4, 2015, 2:18:23 PM
to Vaneet Lotay, gen...@soe.ucsc.edu

Dear Vaneet,

Thank you for using the UCSC Genome Browser and your question about fetchChromSizes and chrom.sizes for assemblies.

When you are using an assembly that isn't available at UCSC you can create the chrom.sizes from the underlying new assembly by turning the fasta file into a 2bit file (useful for creating an assembly hub) with faToTwoBit and then running twoBitInfo on the resulting file.

What follows are steps to create and annotate an assembly that isn't in the UCSC Genome Browser, Xenopus Tropicalis V. 8.0. Before jumping into that, please know a version of the v7 assembly is available on our genome-preview site: http://genome-preview.ucsc.edu/cgi-bin/hgGateway?db=xenTro7

Also, please note you will need a publicly accessible directory to host track hub files so the UCSC Browser can access the data over the internet. Another option is doing everything locally on a virtual machine downloaded and installed on your computer, using Genome Browser in a Box: http://genome-preview.ucsc.edu/goldenPath/help/gbib.html

Also, the utilities used below, like faToTwoBit, are available precompiled for different environments at the following link (run uname -a to see yours): http://hgdownload.soe.ucsc.edu/admin/exe/

1. Acquire the assembly of interest, in this case Xenopus Tropicalis V. 8.0.

$ wget ftp://ftp.xenbase.org/pub/Genomics/JGI/Xentr8.0/Xentr_8.0_Scaffolds_NOT_on_xenbase_Gbrowse.tgz

2. Unzip and extract the files, then run the faToTwoBit command:

$ faToTwoBit Xenopus_tropicalis.main_genome.scaffolds.repeatMasked.fa xentr8.2bit

3. Create the chrom.sizes with the twoBitInfo command (see http://genome.ucsc.edu/goldenPath/help/twoBit.html):

$ twoBitInfo xentr8.2bit stdout | sort -k2rn > xentr8.chrom.sizes

4. You now have the chrom.sizes you need to build bigBeds. For example, you can create an annotation bed file with lines like "Chr01 100 10000" and convert it with this command:

$ bedToBigBed Chr01.bed xentr8.chrom.sizes Chr01.bigBed
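
As a cross-check on step 3, sequence lengths can also be counted directly from the FASTA file and compared against the twoBitInfo output. The following is only a minimal sketch using plain awk, not a UCSC utility: the function name fa_sizes and the demo file are made up for illustration, and it assumes the sequence name is the first word of each header line. The twoBitInfo output remains the file to pass to bedToBigBed.

```shell
#!/bin/sh
# Hypothetical helper (not a UCSC tool): print "name<TAB>length" for each
# FASTA record, largest sequence first, in the two-column chrom.sizes layout.
fa_sizes() {
  awk '
    /^>/ {
      if (name != "") print name "\t" len   # flush the previous record
      name = substr($1, 2)                  # first word of the header, minus ">"
      len = 0
      next
    }
    {
      gsub(/[ \t\r]/, "")                   # ignore stray whitespace and CRs
      len += length($0)
    }
    END { if (name != "") print name "\t" len }
  ' "$1" | sort -k2,2nr
}

# Tiny made-up FASTA to show the output shape.
demo_dir=$(mktemp -d)
printf '>Chr01 demo\nACGTACGT\nACGT\n>scaffold_2\nACGTN\n' > "$demo_dir/demo.fa"
fa_sizes "$demo_dir/demo.fa"
```

If the names or lengths printed for the real FASTA differ from xentr8.chrom.sizes, that would suggest the 2bit file was built from a different or incomplete input.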

But first you need an assembly hub. Grab the example assembly hub described at http://genome.ucsc.edu/goldenPath/help/hubQuickStartAssembly.html with the wget command shown there, then recursively copy the directory to make your own xentr8 directory. Edit the hub.txt, genomes.txt and trackDb.txt files to refer to your own hub. For example, you could change genomes.txt like this and move the created 2bit file to the matching location:

genome xentr8
trackDb xentr8/trackDb.txt
groups groups.txt
description xentr8
twoBitPath xentr8/xentr8.2bit
organism frog
defaultPos Chr01:2000-100000

Here is an example to look at: http://hgwdev.cse.ucsc.edu/~brianlee/hubTestingAssembly/xentr8/xentr8Hub/

You can load this hub in the browser by going to the Track Hub page, http://genome.ucsc.edu/cgi-bin/hgHubConnect, clicking the "My Hubs" tab, and pasting in the URL to the hub.txt:

http://hgwdev.cse.ucsc.edu/~brianlee/hubTestingAssembly/xentr8/xentr8Hub/hub.txt

In this hub there is also a new flavor of bigBed known as bigGenePred. For example, if you load this assembly hub and navigate to Chr01:1,035,273-1,035,333 by pasting in those coordinates, you can click into the example bigGenePred track, click "Go to bigGenePred new bigBed track controls", and set "Color track by codons" to "genomic codons" to see a display of codon translations. Please note these gene predictions were artificially lifted from the human hg38 database. Documentation is pending for this new bigGenePred format.

Please refer to the Track Hub User Guide for more help:
http://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html

Also please search our mailing list for previously answered questions about Track Hubs:
https://groups.google.com/a/soe.ucsc.edu/forum/?hl=en&fromgroups#!search/track$20hubs

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genome Bioinformatics Group




Vaneet Lotay

Mar 4, 2015, 6:15:44 PM
to Brian Lee, gen...@soe.ucsc.edu

Thanks Brian.  I was running bedToBigBed with the newly created chromosome sizes file and got this:

 

pass1 - making usageList (1915 chroms): 21 millis

Trailing characters parsing signed integer in column 0 of array field 11 line 1 of xtev_v3-4_JGI7-1_all-v2.bed, got "1892

 

I’m not sure whether this is an error, and whether I can use the output bigBed file that was created or still need to fix something.  The bed file I’m working with has 12 fields, so I passed this option as an argument: -type=bed12

 

Thanks,

 

Vaneet

 

Vaneet Lotay

Bioinformatician

724 ICT Building - University of Calgary

2500 University Drive NW

Calgary AB T2N 1N4

CANADA

Brian Lee

Mar 11, 2015, 1:31:51 PM
to Vaneet Lotay, gen...@soe.ucsc.edu
Dear Vaneet,

I apologize for the delay in responding.  Can you please send the referenced xtev_v3-4_JGI7-1_all-v2.bed file, or the first few lines from the file?

It is hard to troubleshoot your error without knowing the file's contents, but it sounds like the first line may contain characters that bedToBigBed cannot parse.  Bed files sometimes include track annotation lines. Perhaps this is the case with your file; if you remove those first few non-BED format lines, then bedToBigBed will work.
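
The cleanup described above can be sketched in shell as follows. This is a hedged example, not an official UCSC recipe: the function name clean_bed and the demo file are hypothetical, and it assumes the only non-BED lines are "track", "browser", comment ("#") or blank lines. Sorting under LC_ALL=C is an assumption made here to keep chromosome ordering byte-wise and reproducible across locales.

```shell
#!/bin/sh
# Hypothetical helper: drop track/browser/comment lines and blank lines,
# then sort by chromosome name and start position.
clean_bed() {
  awk '$1 == "track" || $1 == "browser" || /^#/ || NF == 0 { next } { print }' "$1" |
    LC_ALL=C sort -k1,1 -k2,2n
}

# Made-up three-line input with a leading track line.
demo_dir=$(mktemp -d)
printf 'track name=demo\nscaffold_2 500 900 geneB\nscaffold_1 100 400 geneA\n' > "$demo_dir/raw.bed"
clean_bed "$demo_dir/raw.bed" > "$demo_dir/sorted.bed"
cat "$demo_dir/sorted.bed"
```

The cleaned file, rather than the original, would then be the input given to bedToBigBed.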



All the best,

Brian Lee
UCSC Genome Bioinformatics Group

Vaneet Lotay

Mar 20, 2015, 3:13:57 PM
to Brian Lee, gen...@soe.ucsc.edu

Thanks Brian.  I no longer have this issue but have a different one now.  The bedToBigBed script requires the first blockStart in column 12 to always be zero.  Unfortunately, in my file I’ve detected 720 genes in which the first block, whether it be a UTR or CDS, does not start at the chromStart position.

 

I always prefer not to manipulate the source data, but I’m not sure what to do.  Should I correct the chromStart in the original BED file to match the start position of the first block?  Do you think these segments, where the first block start doesn’t match the start position of the gene, shouldn’t normally exist?

 

Thanks,

 

Vaneet

Brian Lee

Mar 24, 2015, 1:45:32 AM
to Vaneet Lotay, gen...@soe.ucsc.edu

Dear Vaneet,

Please see the previous email about how it is hard to troubleshoot your issues without knowing your file's contents. It is not very easy to decipher what you are experiencing without providing some context for your situation.

You ignored the previous request to share an example of what you were experiencing, but based on your error message I googled the file referenced and found an assembly hub created by the Veenstra lab: http://genome.ucsc.edu/cgi-bin/hgHubConnect?hgHub_do_redirect=on&hgHubConnect.remakeTrackHub=on&hgHub_do_firstDb=1&hubUrl=http://trackhub.science.ru.nl/hubs/xenopus/hub.txt

If you are trying to recreate the Veenstra lab's assembly hub, it is likely better to contact them with questions: g.geo...@science.ru.nl

In order to try to help you I obtained the file you referenced, xtev_v3-4_JGI7-1_all.bed.gz, and created a bigBed from it. It turns out the first line is a track line and must be removed, which explains the error you saw in the previous email. The file must also be sorted before creating a bigBed. In summary, here are the steps I took, using the above assembly hub's 2bit file to create xt7.chrom.sizes and its bb file to create original.as.file, which gives names to the columns in the bed file.

wget http://trackhub.science.ru.nl/hubs/xenopus/xt7_1/xt7_1.2bit
twoBitInfo xt7_1.2bit stdout | sort -k2rn > xt7.chrom.sizes

wget http://veenstra.ncmls.nl/FTP/tracks_xt_7.1/xtev_v3-4_JGI7-1_all.bed.gz
gunzip xtev_v3-4_JGI7-1_all.bed.gz
grep -v "^track" xtev_v3-4_JGI7-1_all.bed | sort -k1,1 -k2,2n > xtev_v3-4_JGI7-1_all.bed.sorted
bedToBigBed xtev_v3-4_JGI7-1_all.bed.sorted xt7.chrom.sizes -as=original.as.file -type=bed6+ xtev_v3-4_JGI7-1_all.wget.bb

This resulting bigBed from the original file you referenced worked fine. Please respond with further explanation of what you are seeking. I have an idea of what you are suggesting, but your vague details leave one guessing and having to go very far out of the way to approximate what you wish. Try looking at our file formats page, http://genome.ucsc.edu/FAQ/FAQformat.html, specifically the GenePred table format.

Before any future email, please spend some time investigating our archived mailing list of previously answered questions for more information: https://groups.google.com/a/soe.ucsc.edu/forum/?hl=en&fromgroups#!search/GenePred

And please include more information in any future correspondence; it will be greatly appreciated and will likely make it easier to solve your issues.


All the best,

Brian Lee
UCSC Genome Bioinformatics Group

Vaneet Lotay

Mar 24, 2015, 2:07:23 PM
to Brian Lee, gen...@soe.ucsc.edu

Hey Brian,

 

The previous issue was resolved, which is why I didn’t provide further information for it.  I’m not trying to recreate the Veenstra lab hub; we’re hosting a track hub based on the data they gave us, similar to theirs but available to the users of our site.  In any case, this is a different BED file I’m working with now, which has 12 columns, not 6.

 

This error is not due to a track header line or to sorting: the file meets those requirements, and either problem would produce a different error, which I’ve seen before.  As I said, the error arises because the script requires that the first blockStart in column 12 be zero, meaning the first block starts exactly at the beginning of the gene segment.  This is also stated on the UCSC format page (https://genome.ucsc.edu/FAQ/FAQformat.html) with this line: 

 

In BED files with block definitions, the first blockStart value must be 0, so that the first block begins at chromStart. Similarly, the final blockStart position plus the final blockSize value must equal chromEnd. Blocks may not overlap.

 

My problem is I was able to detect 720 genes in which the first block does not start at zero.  I can think of a few ways to get around it, but they all involve manipulating the data. So I was wondering: in general, is a first blockStart other than zero something that is not supposed to happen?  Are these erroneous genes that should be removed from my file so that the script proceeds?

 

If you’d like to recreate my error I’ve attached my BED file and chromSizes file…. here’s the resulting error:

 

bedToBigBed -type=bed12 Xenbase_XT_7.1_Gene_sorted_12col.bed xt7_1.chrom.sizes test7_1.bb

pass1 - making usageList (746 chroms): 10 millis

Error line 4 of Xenbase_XT_7.1_Gene_sorted_12col.bed: BED blocks must span chromStart to chromEnd.  BED chromStarts[0] = 299, must be 0 so that (chromStart + chromStarts[0]) equals chromStart.
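
The boundary rules behind this error can be checked for every line before running bedToBigBed. The following is a minimal sketch, assuming whitespace-separated BED12 with no spaces inside the name field; the function name find_bad_block_bounds and the two demo lines are made up for illustration and are not taken from the attached file.

```shell
#!/bin/sh
# Hypothetical helper: list BED12 lines that break the boundary rules
# quoted from the FAQ:
#   1. the first blockStart must be 0
#   2. last blockStart + last blockSize must equal chromEnd - chromStart
find_bad_block_bounds() {
  awk '
    $1 == "track" || $1 == "browser" || /^#/ { next }   # skip non-BED lines
    NF >= 12 {
      count = $10
      split($11, sizes, ",")     # a trailing comma only adds an unused empty element
      split($12, starts, ",")
      span = $3 - $2
      last_end = starts[count] + sizes[count]
      if (starts[1] + 0 != 0 || last_end != span)
        printf "%d\t%s\tfirstStart=%d\tlastEnd=%d\tspan=%d\n", NR, $4, starts[1], last_end, span
    }
  ' "$1"
}

# Made-up input: geneA has a first blockStart of 299, geneB is valid.
demo_dir=$(mktemp -d)
printf '%s\n' \
  'scaffold_9 1000 2000 geneA 0 + 1000 2000 0 2 100,201, 299,799,' \
  'scaffold_9 3000 3500 geneB 0 - 3000 3500 0 2 100,100, 0,400,' > "$demo_dir/demo.bed"
find_bad_block_bounds "$demo_dir/demo.bed"
```

Piping the output through wc -l gives a count comparable to the 720 genes mentioned above, and the first column gives the line numbers to inspect.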

 

Please let me know what I can do to get around this or fix the file.

 

Thanks,

 

Vaneet

 

Vaneet Lotay

Xenbase Bioinformatician

Xenbase_XT_7.1_Gene_sorted_12col.bed
xt7_1.chrom.sizes

Brian Lee

Mar 26, 2015, 7:02:30 PM
to Vaneet Lotay, gen...@soe.ucsc.edu

Dear Vaneet,

Thank you for sharing your files, explaining the context and working to use the UCSC Genome Browser to display this data.

I appreciate your efforts and patience. It does look like a concerning issue that there are items where the first block does not start at zero; it suggests there is a problem in the process that generated the file. One of our engineers shares that if you are getting the original file and it has blockStarts that don't start with 0, then you should alert the lab that generated the file that some tweaks are in order for the output to be valid BED.

The lab will know the most about how the file was generated, so they should be able to tell the best way to fix the file. It could be that the block coords need tweaking, or it could be that chromStart needs tweaking -- in which case the block coords should most likely be reevaluated.

I took a quick look at the first file referenced with this command:

cat xtev_v3-4_JGI7-1_all.bed.sorted | awk '{print $12}' | grep -v "^0"

and didn't see any entries whose blockStarts do not start with zero. But I did find some overlapping blocks, which shouldn't happen. An example is this line, where the first block starts at 0 with size 83, so it overlaps by one base with the start of the second block, 82:

scaffold_2 94120502 94121152 Xtev_2.1236.3_VX| 600 + 94120502 94121152 0 2 83,568, 0,82,
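
A check for overlapping blocks along these lines could be sketched as follows. This is an illustrative awk snippet rather than a UCSC tool: the function name find_overlapping_blocks is hypothetical, it assumes whitespace-separated BED12 with blocks listed in ascending order, and it reports only the first overlap per line. The first demo line is the example quoted above; the second is made up.

```shell
#!/bin/sh
# Hypothetical helper: report BED12 lines where a block starts before the
# previous block has ended (blockStart[i] < blockStart[i-1] + blockSize[i-1]).
find_overlapping_blocks() {
  awk '
    $1 == "track" || $1 == "browser" || /^#/ { next }   # skip non-BED lines
    NF >= 12 {
      count = $10
      split($11, sizes, ",")
      split($12, starts, ",")
      for (i = 2; i <= count; i++) {
        prev_end = starts[i - 1] + sizes[i - 1]
        if (starts[i] + 0 < prev_end) {
          printf "%d\t%s\tblock %d starts at %d, previous block ends at %d\n", NR, $4, i, starts[i], prev_end
          break
        }
      }
    }
  ' "$1"
}

demo_dir=$(mktemp -d)
printf '%s\n' \
  'scaffold_2 94120502 94121152 Xtev_2.1236.3_VX| 600 + 94120502 94121152 0 2 83,568, 0,82,' \
  'scaffold_2 100 130 madeUpGene 0 + 100 130 0 2 10,10, 0,20,' > "$demo_dir/demo.bed"
find_overlapping_blocks "$demo_dir/demo.bed"
```

For the quoted line, the first block covers relative positions 0 through 82 and therefore ends at 83, one base past the second block's start of 82.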

Here is a graphical link and the text of some bed 12 examples, http://hgwdev.cse.ucsc.edu/~brianlee/customTracks/examples.bed12, illustrating the different columns and the overlap:
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr22%3A1-6002&hgt.customText=http://hgwdev.cse.ucsc.edu/~brianlee/customTracks/examples.bed12

When looking for blockStarts that are not zero in your file I did see about 700, one being the gene "tp53i3", shown in the session below, where the scaffold9 location is transposed to KB021661 for the xenTro7 equivalent on our genome-preview site. I'm not recommending doing this, but to get your tp53i3 line to display, I upped the block count from 8 to 9 and put 0 for both the size and the blockStart of the first exon, essentially making an empty exon.

http://genome-preview.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=brianlee&hgS_otherUserSessionName=xenTro7.CustomTracks

To answer your question, it does seem these are errors that you should contact your source about to discover why the process is creating blockStarts that are not zero.

I wanted to share too that we have a new type of bigBed that will soon be offered called bigGenePred. You can see it in the above example. The documentation is not available yet, but this new format allows one to take a bed12 and add 8 additional columns of gene prediction annotation, and then use bedToBigBed with a specific -as file to name the columns and the results can allow the amino acids to be displayed in the gene. If you have the contact information for your source data, I would be happy to contact them to share this new bigGenePred type as they might change their pipeline to generate it while troubleshooting the issue you discovered.


All the best,

Brian Lee
UCSC Genome Bioinformatics Group

Brian Lee

Apr 29, 2015, 7:34:59 PM
to Vaneet Lotay, gen...@soe.ucsc.edu

Dear Vaneet,

Thank you for taking the time to look through the mailing list archives to try to find previous answers to your questions.

Beyond reviewing our archived mailing list for answers, you should also spend time reviewing all documentation inside the source when you are trying to make changes at the level you are interested in exploring. Please see this documentation, and also all other associated documentation files in the source tree: http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/hg/makeDb/doc/quickAssembly.txt

From your message, it looks like you may be referencing https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/oJS2z1xvxCA/5n55DSNITggJ. You will note there the recommendation that I've given in this thread several times about creating an assembly hub for your new genome. Also the response shares how the GBiB browserbox source code is specifically for GBiB and is not appropriate for other mirror sites. The engineer behind our GBiB shares that the GBiB was never meant to be used in a way for people to build their own genome DBs, rather that is what assembly hubs were created to fulfill. In essence, please use an assembly hub instead.


Brian Lee
UCSC Genome Bioinformatics Group


On Tue, Apr 21, 2015 at 3:39 PM, Vaneet Lotay <vaneet...@ucalgary.ca> wrote:

Hey Brian,

 

Scratch that last issue, I found the script that can create an AGP file when none exists (hgFakeAgp).  I have a new issue that I encountered when trying to add a new genome database to our UCSC mirror.  I’ve gotten to the step where you load the chromInfo table and gold and gap tables:

 

$ hgLoadSqlTab abcDef1 chromInfo $HOME/kent/src/hg/lib/chromInfo.sql \
             bed/chromInfo/chromInfo.tab

 

hgGoldGapGl abcDef1 abcDef1.agp

 

Unfortunately I encountered an SQL error when trying both these commands.  I looked in this forum/mailing list at a previous user who had this issue and you told him to delete a line which contained “showTableCache=xxxx” from the hg.conf file.  I did that and it only got rid of some of the error referencing the cache but not all of it:

 

SQL_CONNECT 36209 localhost xenTro8_0 localhost root
SQL_TIME 36209 localhost xenTro8_0 0.004s
SQL_QUERY 36209 localhost xenTro8_0 NOSQLINJ SELECT 1 FROM chromInfo LIMIT 0
SQL_FAILOVER 36209 localhost xenTro8_0 db -> slow-db | SELECT 1 FROM chromInfo LIMIT 0
Couldn't connect to database xenTro8_0 on genome-mysql.cse.ucsc.edu as genomep.
Unknown database 'xenTro8_0'
SQL_TOTAL_TIME 0.105s
SQL_TOTAL_QUERIES 1
SQL_DISCONNECT 36209 xenTro8_0
hgLoadSqlTab: jksql.c:395: monitorEnter: Assertion `monitorEnterTime == 0' failed.

 

Can you help me? I know for a fact I created the ‘xenTro8_0’ database as I verified when logging into mysql and checking the databases.  Also when you say edit the hg.conf file, which is the main one being used?  I have 2 copies that were already on my machine and one that came with the Kent source tree:

 

~/kent/src/browserbox/usr/local/apache/cgi-bin/hg.conf

./home/browser/git/usr/local/apache/cgi-bin/hg.conf

./usr/local/apache/cgi-bin/hg.conf

 

I assumed it was the 3rd one but please let me know which one is used by all of the UCSC loader tools.

 

Thanks,

 

Vaneet

 

Vaneet Lotay

Xenbase Bioinformatician

724 ICT Building - University of Calgary

2500 University Drive NW

Calgary AB T2N 1N4

CANADA

 



From: Vaneet Lotay
Sent: Monday, April 20, 2015 3:41 PM
To: 'Brian Lee'
Cc: gen...@soe.ucsc.edu
Subject: RE: [genome] fetchChromSizes: database not hosted at UCSC

 

Hey Brian,

 

We’ve set up our own UCSC Genome Browser mirror.  I’m looking to add a new genome database, one that is not among the default genomes that come with UCSC.  I was going through the instructions and it looks like it requires a FASTA file of your sequences as well as an AGP file.  I don’t have an AGP file; is there any way I can create the AGP file from the FASTA file using a script of some sort?

 
